Policy Routing: Rule-Based Routing

Why policy routing

The standard [[routing-table|routing table]] selects the next hop by destination IP using longest-prefix-match. That is not enough when you have:

Two uplink providers -- traffic from source A goes to Telecom, source B goes to Cogent
Source-based routing -- "traffic from 10.0.1.0/24 exits via provider X"
Per-tenant routing in a multitenant system ([[cni-plugins|CNI]], OpenStack)
NAT split -- a packet with fwmark=0x100 goes through a NAT machine, others go directly
Transit traffic through VPN for one namespace, local path for everything else
VRF-like isolation on a router

The solution is multiple routing tables plus selection rules. That is policy routing.

RPDB: Routing Policy Database

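Linux maintains a set of rules (ip rule list) that are evaluated in ascending priority order, lowest number first. Each rule says "if the packet matches, look up this routing table." The first matching rule whose table yields a route decides the result; if that table has no route for the destination, evaluation continues with the next rule.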

Default state:

$ ip rule list

0:      from all lookup local

32766:  from all lookup main

32767:  from all lookup default

Three predefined tables:

local (255) -- host addresses and broadcasts. Populated by the kernel automatically
main (254) -- everything you add with the normal ip route add
default (253) -- rarely used, intended as a fallback

You can add your own tables. Table names and numbers live in /etc/iproute2/rt_tables:

# echo "100 isp_a" >> /etc/iproute2/rt_tables

# echo "200 isp_b" >> /etc/iproute2/rt_tables

Names are a convenience layer; the kernel works with numeric table IDs. The traditional range is 1-255, but current kernels accept 32-bit table IDs, so larger numbers also work.
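Appending with echo adds a duplicate line every time it is run. A minimal sketch of an idempotent variant follows; the function name add_rt_table and its optional file argument are illustrative assumptions, not part of iproute2:

```shell
# Register a table name unless the number or the name is already listed.
# usage: add_rt_table <number> <name> [file]
add_rt_table() {
    rt_num=$1
    rt_name=$2
    rt_file=${3:-/etc/iproute2/rt_tables}

    # Ignore comment lines; compare the first two whitespace-separated fields.
    if [ -f "$rt_file" ] && awk -v n="$rt_num" -v t="$rt_name" \
        '!/^[[:space:]]*#/ && ($1 == n || $2 == t) { found = 1 }
         END { exit !found }' "$rt_file"; then
        return 0    # already registered, nothing to do
    fi
    printf '%s\t%s\n' "$rt_num" "$rt_name" >> "$rt_file"
}

# Example (as root): add_rt_table 100 isp_a && add_rt_table 200 isp_b
```

As a simplification, the sketch treats a number that is already taken by a different name as "present" and does not report the conflict.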

Multiple tables: two-uplink example

Two external links: eth0 (10.0.1.1 to ISP A) and eth1 (10.0.2.1 to ISP B). Goal: traffic from 192.168.10.0/24 exits via A, traffic from 192.168.20.0/24 exits via B.

Step 1: populate the tables:

ip route add default via 10.0.1.254 dev eth0 table isp_a

ip route add 10.0.1.0/24 dev eth0 src 10.0.1.1 table isp_a

ip route add default via 10.0.2.254 dev eth1 table isp_b

ip route add 10.0.2.0/24 dev eth1 src 10.0.2.1 table isp_b

Step 2: add rules:

ip rule add from 192.168.10.0/24 table isp_a priority 1000

ip rule add from 192.168.20.0/24 table isp_b priority 1001

Now, when forwarding a packet with src 192.168.10.5, the kernel hits the rule at priority 1000 and uses isp_a: default via 10.0.1.254 on eth0.
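Both uplinks follow the same three-command pattern, so the steps above can be wrapped in one function. This is a sketch under assumptions: the names run and setup_uplink and the DRY_RUN variable are hypothetical helpers introduced here, and the arguments are the values from the example.

```shell
# run: execute a command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: setup_uplink <table> <gateway> <dev> <connected-net> <src-addr> <lan-prefix> <priority>
setup_uplink() {
    # Default route and connected network go into the per-uplink table.
    run ip route add default via "$2" dev "$3" table "$1"
    run ip route add "$4" dev "$3" src "$5" table "$1"
    # Traffic sourced from the LAN prefix is looked up in that table.
    run ip rule add from "$6" table "$1" priority "$7"
}

# Example (as root, after registering the table names):
#   setup_uplink isp_a 10.0.1.254 eth0 10.0.1.0/24 10.0.1.1 192.168.10.0/24 1000
#   setup_uplink isp_b 10.0.2.254 eth1 10.0.2.0/24 10.0.2.1 192.168.20.0/24 1001
```

Setting DRY_RUN=1 first prints the commands instead of executing them, which is a cheap way to review the result before touching a live router.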

Rule selectors

ip rule add accepts many selectors:

Selector	What it matches
`from <prefix>`	source IP falls inside the prefix
`to <prefix>`	destination IP falls inside the prefix
`iif <name>`	packet arrived on this interface
`oif <name>`	outgoing interface; only available for locally generated packets from sockets bound to that device
`tos <value>`	DSCP/TOS byte
`fwmark <mark>`	netfilter mark on the packet
`uidrange <a-b>`	UID of the process (for locally generated traffic)
`l3mdev`	L3 master device (for VRF)
`ipproto <proto>`	IP protocol (TCP/UDP/...)

Actions:

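lookup <table> -- look up the route in the given table (table <table> is an accepted synonym)
goto <priority> -- jump to another rule
nop -- do nothing, continue to next rule
blackhole, unreachable, prohibit -- drop the packet: silently (blackhole), with an ICMP "network unreachable" error (unreachable), or with an ICMP "communication administratively prohibited" error (prohibit)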

fwmark and policy routing: a common pattern

Suppose you want web traffic (HTTP and HTTPS) generated on this host to go through a VPN and everything else to go directly. Combine iptables/[[nat|netfilter]] marking with policy routing:

# mark HTTP and HTTPS traffic

iptables -t mangle -A OUTPUT -p tcp --dport 80 -j MARK --set-mark 0x100

iptables -t mangle -A OUTPUT -p tcp --dport 443 -j MARK --set-mark 0x100

# routing for marked traffic

ip route add default dev wg0 table 100

ip rule add fwmark 0x100 table 100 priority 500

# marked HTTP/HTTPS now goes into wg0 (WireGuard tunnel)

This pattern appears in:

Split-tunnel VPN (only specific traffic goes through the tunnel)
Transparent proxy (mark a packet, route it to a local proxy)
DDoS scrubbing (mark suspicious traffic, send it down a separate path)
Container networking ([[cni-plugins|CNI]]: mark per namespace, separate table per namespace)
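The marking commands above generalize to any list of TCP ports. The following is a sketch only: run, mark_and_route, and DRY_RUN are hypothetical helper names, and the mark, table, and device are the values from the example.

```shell
# run: execute a command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: mark_and_route <mark> <table> <dev> <priority> <tcp-port>...
mark_and_route() {
    mr_mark=$1 mr_table=$2 mr_dev=$3 mr_prio=$4
    shift 4
    # One mangle/OUTPUT rule per destination port sets the mark.
    for mr_port in "$@"; do
        run iptables -t mangle -A OUTPUT -p tcp --dport "$mr_port" \
            -j MARK --set-mark "$mr_mark"
    done
    # Marked packets are looked up in a table that only knows the tunnel.
    run ip route add default dev "$mr_dev" table "$mr_table"
    run ip rule add fwmark "$mr_mark" table "$mr_table" priority "$mr_prio"
}

# Example (as root): mark_and_route 0x100 100 wg0 500 80 443
```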

Reverse path filter and policy routing

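Policy routing often produces asymmetric routing: a packet arrives on eth0 but the reply leaves on eth1. With rp_filter=1 (strict mode, which many distributions enable by default; the kernel's own default is 0), the kernel checks whether the best route back to the packet's source address points out of the interface the packet arrived on. When it does not, the packet is dropped.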

The fix is loose mode:

sysctl -w net.ipv4.conf.all.rp_filter=2

sysctl -w net.ipv4.conf.eth0.rp_filter=2

sysctl -w net.ipv4.conf.eth1.rp_filter=2

In loose mode the kernel accepts a packet if its source address is reachable through any interface, not only the one it arrived on. If strict mode stays on, packets on asymmetric paths are silently dropped.
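According to the kernel's ip-sysctl documentation, the value used for source validation on an interface is the maximum of conf/all/rp_filter and conf/<interface>/rp_filter, which is why setting only one of them can be misleading. A small sketch for checking the effective value follows; the function names are illustrative assumptions.

```shell
# usage: effective_rp_filter <all-value> <interface-value>
# Prints the larger of the two, which is the value the kernel applies.
effective_rp_filter() {
    if [ "$1" -gt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# usage: rp_filter_for <interface>   (reads the live values from /proc)
rp_filter_for() {
    effective_rp_filter \
        "$(cat /proc/sys/net/ipv4/conf/all/rp_filter)" \
        "$(cat "/proc/sys/net/ipv4/conf/$1/rp_filter")"
}

# Example: rp_filter_for eth0
```

A result of 1 means strict mode is still in force on that interface even if its own setting is 0.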

VRF: Virtual Routing and Forwarding

Linux 4.3 and later provide VRF-lite: virtual routing instances. Each VRF has its own set of interfaces and its own routing table, isolated from the others.

# create VRF device "vrf-a" (tenant A) bound to table 100

ip link add vrf-a type vrf table 100

ip link set vrf-a up

# assign an interface to the VRF

ip link set eth1 master vrf-a

# routes are added into table 100

ip route add default via 10.0.1.1 dev eth1 table 100

A process can be bound to a VRF with ip vrf exec vrf-a curl .... VRF is used in:

Multitenant routing on a single Linux router
Management plane separation (management traffic through a mgmt-vrf)
Cumulus Linux, SONiC, and FRR-based routers
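When several tenants each need a VRF, the creation steps shown earlier can be wrapped in a function. This is a sketch with hypothetical helper names (run, setup_vrf, DRY_RUN); the device, table, and interface values are the ones used in the example.

```shell
# run: execute a command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: setup_vrf <vrf-device> <table> <interface>...
setup_vrf() {
    vrf_name=$1 vrf_table=$2
    shift 2
    # Create the VRF device, bind it to its routing table, bring it up.
    run ip link add "$vrf_name" type vrf table "$vrf_table"
    run ip link set "$vrf_name" up
    # Enslave each member interface to the VRF device.
    for vrf_if in "$@"; do
        run ip link set "$vrf_if" master "$vrf_name"
    done
}

# Example (as root): setup_vrf vrf-a 100 eth1
```

Afterwards, ip vrf show lists the VRF devices and ip route show vrf vrf-a prints the routes in the associated table.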

Source address for outgoing traffic

When a host has multiple IP addresses, the kernel selects the source address from the chosen route: the route's src hint if one is set, otherwise an address on the outgoing interface. To force a specific source:

ip route add 8.8.8.8 via 10.0.1.254 src 10.0.1.99

Or specify it explicitly at the application level:

curl --interface 10.0.1.99 https://example.com

ping -I 10.0.1.99 8.8.8.8

This matters for multi-IP hosts where different services bind to different addresses (mail on one address, web on another).

Packet processing order (simplified)

For an outgoing packet:

Process socket; initial routing decision: ip rule list selects a table, then ip route show table N selects the route
OUTPUT (mangle, nat, filter chains); if mangle changes the mark, the routing decision is repeated
POSTROUTING (mangle, nat)
Transmission on the link

For a forwarded packet:

PREROUTING (mangle, nat, conntrack)
Routing decision (forwarding)
FORWARD chain
POSTROUTING

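Setting fwmark in mangle PREROUTING works with policy routing (a rule matching fwmark) because the mark is set before the forwarding routing decision. For locally generated traffic, a mark set in mangle OUTPUT makes the kernel repeat the route lookup, so fwmark rules apply there as well.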

Troubleshooting

Rule added but not working -- check the priority. Lower numbers are evaluated first. If from all lookup main (32766) finds a route before your rule is reached, give your rule a priority number lower than 32766.
Route added to a table but traffic bypasses it -- confirm your rule points to that table: ip rule list | grep <table>.
Asymmetric routing causes drops -- rp_filter=1 is the likely culprit. Set it to 2 (loose mode).
ip route get <dst> shows unexpected results -- use ip route get <dst> from <src> mark <mark> to simulate your specific case through RPDB.
Rules disappear after reboot -- they are not saved automatically. Persist them via a NetworkManager dispatcher script, systemd-networkd, or a custom script in /etc/network/if-up.d/.
VRF and services -- a process does not see the VRF until it is started with ip vrf exec <name> or uses the SO_BINDTODEVICE socket option.
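For the persistence item above, one option is to turn the current rule list back into ip rule add commands and store them in a boot-time script. The function below is a rough sketch (rules_to_commands is a hypothetical name); it assumes simple rules whose printed form is also valid ip rule add syntax, and it skips the three default priorities.

```shell
# Reads `ip rule list` output on stdin, prints `ip rule add` commands.
rules_to_commands() {
    awk '
        NF == 0 { next }                      # ignore blank lines
        {
            prio = $1
            sub(/:$/, "", prio)               # "1000:" -> "1000"
            # Skip the rules the kernel installs by default.
            if (prio == "0" || prio == "32766" || prio == "32767") next
            # Drop the leading "PRIO:" field, keep the rest verbatim.
            sub(/^[0-9]+:[ \t]+/, "")
            print "ip rule add priority " prio " " $0
        }'
}

# Example: ip rule list | rules_to_commands > /root/restore-rules.sh
```

The output path in the example is arbitrary. iproute2 also provides ip rule save and ip rule restore, which store the rules in a binary format and avoid text parsing altogether.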

Useful commands

ip route flush cache -- flush the routing cache (a legacy practice; the cache was removed in Linux 3.6, but the command still exists)
ip rule list table <name> -- show which rules point to a given table
ip route show table all -- show all routes in all tables
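suppress_prefixlength N -- a rule option that rejects a lookup result whose prefix length is N or shorter, so evaluation falls through to the next rule; with N=0 it ignores a table's default route while still honoring its more-specific routes. Used in L3 VPNs to keep a default route from overriding more-specific routes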
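A typical use of suppress_prefixlength is a full-tunnel VPN that must still respect specific routes in main. The sketch below is similar in spirit to what wg-quick sets up; the helper names (run, tunnel_rules, DRY_RUN) and the table, mark, and priority values in the example are assumptions for illustration.

```shell
# run: execute a command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: tunnel_rules <tunnel-table> <tunnel-fwmark> <priority>
tunnel_rules() {
    # 1. Consult main first, but reject results with prefix length 0,
    #    i.e. ignore the default route in main and keep its specific routes.
    run ip rule add table main suppress_prefixlength 0 priority "$3"
    # 2. Everything still unresolved goes to the tunnel table, except the
    #    tunnel's own marked packets, which must not loop back into it.
    run ip rule add not fwmark "$2" table "$1" priority "$(($3 + 1))"
}

# Example (as root): tunnel_rules 51820 51820 100
```

The tunnel table is expected to contain a default route through the tunnel device, and the tunnel software is expected to mark its encapsulated packets with the given fwmark.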
