Why policy routing
The standard [[routing-table|routing table]] selects the next hop by destination IP using longest-prefix-match. That is not enough when you have:
- Two uplink providers -- traffic from source A goes to Telecom, source B goes to Cogent
- Source-based routing -- "traffic from 10.0.1.0/24 exits via provider X"
- Per-tenant routing in a multitenant system ([[cni-plugins|CNI]], OpenStack)
- NAT split -- a packet with fwmark=0x100 goes through a NAT machine, others go directly
- Transit traffic through VPN for one namespace, local path for everything else
- VRF-like isolation on a router
The solution is multiple routing tables plus selection rules. That is policy routing.
RPDB: Routing Policy Database
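Linux maintains an ordered set of rules (`ip rule list`) that are evaluated
in ascending priority order, so a lower number is checked first. Each rule says
"if the packet matches, look it up in this routing table." The first rule whose
table actually yields a route wins: if a selector matches but the table has no
route for the destination, evaluation continues with the next rule.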
Default state:
$ ip rule list
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
Three predefined tables:
- local (255) -- host addresses and broadcasts. Populated by the kernel automatically
- main (254) -- everything you add with the normal `ip route add`
- default (253) -- rarely used, intended as a fallback
You can add your own tables. Table names and numbers live in
/etc/iproute2/rt_tables:
# echo "100 isp_a" >> /etc/iproute2/rt_tables
# echo "200 isp_b" >> /etc/iproute2/rt_tables
Names are a convenience layer; the kernel works with numeric table IDs. The classic range is 0-255, but current kernels accept 32-bit IDs, so values above 255 also work.
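Appending with `echo >>` adds a duplicate line every time it is re-run. A minimal sketch of an idempotent alternative follows; the function name `add_rt_table` and its optional third argument (a file path, used here so the function can be tried on a scratch file) are illustrative assumptions, not part of iproute2.

```shell
# Register a routing table name only if it is not registered yet.
# Usage: add_rt_table <id> <name> [file]
# (hypothetical helper; variables are global because POSIX sh has no "local")
add_rt_table() {
    id=$1
    name=$2
    file=${3:-/etc/iproute2/rt_tables}

    # Exit codes from awk: 0 = exact pair present, 3 = id or name already
    # used in a different mapping, 1 = neither is present.
    awk -v id="$id" -v name="$name" '
        /^[[:space:]]*#/ { next }                  # skip comment lines
        $1 == id && $2 == name { exact = 1 }
        $1 == id || $2 == name { clash = 1 }
        END { if (exact) exit 0; if (clash) exit 3; exit 1 }
    ' "$file" 2>/dev/null
    case $? in
        0) return 0 ;;                             # nothing to do
        3) echo "rt_tables: $id/$name conflicts with an existing entry" >&2
           return 1 ;;
    esac

    # Not present (or the file does not exist yet): append the mapping.
    printf '%s\t%s\n' "$id" "$name" >> "$file"
}
```

Running `add_rt_table 100 isp_a` as root twice leaves a single `100 isp_a` line, and a later attempt to reuse ID 100 under another name is refused instead of silently shadowing the first entry.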
Multiple tables: two-uplink example
Two external links: eth0 (10.0.1.1 to ISP A) and eth1 (10.0.2.1 to ISP B).
Goal: traffic from 192.168.10.0/24 exits via A, traffic from 192.168.20.0/24
exits via B.
Step 1: populate the tables:
ip route add default via 10.0.1.254 dev eth0 table isp_a
ip route add 10.0.1.0/24 dev eth0 src 10.0.1.1 table isp_a
ip route add default via 10.0.2.254 dev eth1 table isp_b
ip route add 10.0.2.0/24 dev eth1 src 10.0.2.1 table isp_b
Step 2: add rules:
ip rule add from 192.168.10.0/24 table isp_a priority 1000
ip rule add from 192.168.20.0/24 table isp_b priority 1001
Now, when forwarding a packet with src 192.168.10.5, the kernel hits the
rule at priority 1000 and uses isp_a: default via 10.0.1.254 on eth0.
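The two steps above repeat the same pattern per uplink, so they can be generated from parameters. The sketch below only prints the commands; the function name `uplink_cmds` and its argument order are assumptions made for illustration. It uses `ip route replace` rather than `ip route add` so that re-running does not fail on routes that already exist.

```shell
# Print (do not run) the commands that steer one source subnet via one uplink.
# Usage: uplink_cmds <table> <prio> <src_subnet> <dev> <gateway> <link_net> <link_addr>
# (hypothetical helper)
uplink_cmds() {
    table=$1 prio=$2 src_net=$3 dev=$4 gw=$5 link_net=$6 link_addr=$7

    # Connected route for the uplink network inside the per-uplink table.
    echo "ip route replace $link_net dev $dev src $link_addr table $table"
    # Default route through this uplink's gateway.
    echo "ip route replace default via $gw dev $dev table $table"
    # Rule that selects the table for the chosen source subnet.
    echo "ip rule add from $src_net table $table priority $prio"
}

# Values taken from the example in the text.
uplink_cmds isp_a 1000 192.168.10.0/24 eth0 10.0.1.254 10.0.1.0/24 10.0.1.1
uplink_cmds isp_b 1001 192.168.20.0/24 eth1 10.0.2.254 10.0.2.0/24 10.0.2.1
```

Review the printed commands first, then pipe them to `sh` as root to apply them.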
Rule selectors
ip rule add accepts many selectors:
| Selector | What it matches |
|---|---|
| `from <prefix>` | source IP falls inside the prefix |
| `to <prefix>` | destination IP falls inside the prefix |
| `iif <name>` | packet arrived on this interface (`iif lo` matches locally generated traffic) |
| `oif <name>` | outgoing interface; only available for packets from local sockets bound to a device |
| `tos <value>` | TOS/DSCP field value |
| `fwmark <mark>` | netfilter mark on the packet |
| `uidrange <a-b>` | UID of the process (for locally generated traffic) |
| `l3mdev` | L3 master device (for VRF) |
| `ipproto <proto>` | IP protocol (TCP/UDP/...) |
Actions:
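- `lookup <table>` -- use the given routing table
- `goto <priority>` -- jump to the rule with that priority
- `nop` -- do nothing, continue to the next rule
- `blackhole`, `unreachable`, `prohibit` -- drop the packet (silently, with an ICMP "network unreachable" error, or with an ICMP "administratively prohibited" error, respectively)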
fwmark and policy routing: a common pattern
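Suppose you want web traffic (HTTP and HTTPS) to go through a VPN and everything else to go directly. Combine iptables/[[nat|netfilter]] marking with policy routing. The marks below are set in the OUTPUT chain, so they apply to locally generated traffic; forwarded traffic would be marked in PREROUTING instead: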
# mark HTTP traffic
iptables -t mangle -A OUTPUT -p tcp --dport 80 -j MARK --set-mark 0x100
iptables -t mangle -A OUTPUT -p tcp --dport 443 -j MARK --set-mark 0x100
# routing for marked traffic
ip route add default dev wg0 table 100
ip rule add fwmark 0x100 table 100 priority 500
# marked HTTP/HTTPS now goes into wg0 (WireGuard tunnel)
This pattern appears in:
- Split-tunnel VPN (only specific traffic goes through the tunnel)
- Transparent proxy (mark a packet, route it to a local proxy)
- DDoS scrubbing (mark suspicious traffic, send it down a separate path)
- Container networking ([[cni-plugins|CNI]]: mark per namespace, separate table per namespace)
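When several of these users share the 32-bit mark, a rule can match only some of its bits with the `fwmark <value>/<mask>` form. The sketch below models that comparison, on the assumption that the kernel compares only the masked bits and treats a missing mask as all ones; the function name `fwmark_matches` is hypothetical.

```shell
# Succeed when a packet mark satisfies a rule's "fwmark VALUE/MASK" selector.
# Usage: fwmark_matches <packet_mark> <rule_value> [rule_mask]
# (hypothetical helper; accepts decimal or 0x-prefixed hex numbers)
fwmark_matches() {
    pkt=$1 value=$2 mask=${3:-0xffffffff}
    # Bits outside the mask are ignored; every bit inside it must be equal.
    [ $(( ($pkt ^ $value) & $mask )) -eq 0 ]
}

# Bits 8-15 select the routing policy, bits 0-7 are left to another subsystem.
fwmark_matches 0x1ff 0x100 0xff00 && echo "0x1ff matches fwmark 0x100/0xff00"
fwmark_matches 0x1ff 0x100 || echo "0x1ff does not match plain fwmark 0x100"
```

The matching rule for the first case would be `ip rule add fwmark 0x100/0xff00 table 100 priority 500`, which leaves the low byte free for other uses.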
Reverse path filter and policy routing
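Policy routing often produces asymmetric routing: a packet arrives on
eth0 but the reply leaves on eth1. With rp_filter=1 (strict mode, which many
distributions enable by default; the kernel's own default is 0), the kernel
looks up the route back to the packet's source address and drops the packet
unless that route leaves through the interface the packet arrived on.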
The fix is loose mode:
sysctl -w net.ipv4.conf.all.rp_filter=2
sysctl -w net.ipv4.conf.eth0.rp_filter=2
sysctl -w net.ipv4.conf.eth1.rp_filter=2
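In loose mode the kernel accepts a packet if its source address is reachable through any interface, not only the one it arrived on. The kernel applies the higher of the `all` and per-interface values, so setting `all` to 2 is already sufficient; the per-interface lines make the intent explicit. With strict mode left on, packets on the asymmetric path are dropped. If your rules match on fwmark, also consider setting net.ipv4.conf.<iface>.src_valid_mark=1 so that the reverse-path lookup takes the mark into account.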
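Because the kernel uses the maximum of `conf/all/rp_filter` and `conf/<iface>/rp_filter`, reading only one of them can mislead. A small sketch that reports the value actually in effect; the function name `effective_rp_filter` and the optional sysctl root argument (useful for trying it against a scratch directory) are assumptions for illustration.

```shell
# Print the rp_filter mode the kernel applies to an interface:
# the higher of conf/all and conf/<iface>.
# Usage: effective_rp_filter <iface> [sysctl_root]
# (hypothetical helper)
effective_rp_filter() {
    iface=$1
    root=${2:-/proc/sys}
    all=$(cat "$root/net/ipv4/conf/all/rp_filter") || return 1
    dev=$(cat "$root/net/ipv4/conf/$iface/rp_filter") || return 1
    if [ "$all" -gt "$dev" ]; then
        echo "$all"
    else
        echo "$dev"
    fi
}
```

For example, `effective_rp_filter eth0` printing 1 while `sysctl net.ipv4.conf.eth0.rp_filter` shows 0 means that the `all` setting is forcing strict mode on that interface.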
VRF: Virtual Routing and Forwarding
Linux 4.3 and later provide VRF-lite: virtual routing instances. Each VRF has its own set of interfaces and its own routing table, isolated from the others.
# create VRF device "vrf-a" bound to table 100
ip link add vrf-a type vrf table 100
ip link set vrf-a up
# assign an interface to the VRF
ip link set eth1 master vrf-a
# routes are added into table 100
ip route add default via 10.0.1.1 dev eth1 table 100
A process can be run inside the VRF by naming the VRF device: `ip vrf exec vrf-a curl ...`.
VRF is used in:
- Multitenant routing on a single Linux router
- Management plane separation (management traffic through a mgmt-vrf)
- Cumulus Linux, SONiC, and FRR-based routers
Source address for outgoing traffic
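When a host has multiple IP addresses, the kernel picks the source address from the selected route: the route's src hint if it has one, otherwise typically the primary address of the outgoing interface. To force a specific source for a destination: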
ip route add 8.8.8.8 via 10.0.1.254 src 10.0.1.99
Or specify it explicitly at the application level:
curl --interface 10.0.1.99 https://example.com
ping -I 10.0.1.99 8.8.8.8
This matters for multi-IP hosts where different services bind to different addresses (mail on one address, web on another).
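To check which source address the kernel would pick, `ip route get` prints the selected route including a `src` field. The sketch below extracts that field from the command's output; the function name `route_src` is hypothetical, and the sample line in the comment is only illustrative of the usual output shape.

```shell
# Extract the chosen source address from `ip route get` output on stdin.
# Illustrative input line:
#   8.8.8.8 via 10.0.1.254 dev eth0 src 10.0.1.99 uid 0
# (hypothetical helper)
route_src() {
    awk '{
        # Print the word that follows the first "src" keyword, then stop.
        for (i = 1; i < NF; i++)
            if ($i == "src") { print $(i + 1); exit }
    }'
}
```

Typical use is `ip route get 8.8.8.8 | route_src`, run once before and once after adding the route with the src hint, to confirm the change took effect.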
Packet processing order (simplified)
For an outgoing (locally generated) packet:
- Process writes to a socket
- Initial routing decision: `ip rule list` -> table -> `ip route show table N`
- OUTPUT (mangle, nat, filter chains)
- Re-routing, if mangle or nat OUTPUT changed the mark or the addresses
- POSTROUTING (mangle, nat)
- Transmission on the link
For a forwarded packet:
- PREROUTING (mangle, nat, conntrack)
- Routing decision (forwarding)
- FORWARD chain
- POSTROUTING
Setting fwmark in PREROUTING or OUTPUT works with policy routing (a rule
matching fwmark): in PREROUTING the mark is set before the routing decision,
and a mark set in mangle OUTPUT triggers a fresh routing lookup.
Troubleshooting
- Rule added but not working -- check the priority. A lower number means higher precedence. If `from all lookup main` (32766) matches before your rule, give your rule a priority numerically below 32766.
- Route added to a table but traffic bypasses it -- confirm that a rule points to that table: `ip rule list | grep <table>`.
- Asymmetric routing causes drops -- strict reverse-path filtering (rp_filter=1) is the usual culprit. Set it to 2 (loose mode).
- `ip route get <dst>` shows unexpected results -- use `ip route get <dst> from <src> mark <mark>` to simulate your specific case through the RPDB. For forwarded traffic, where `<src>` is not a local address, also pass `iif <dev>`.
- Rules disappear after reboot -- they are not saved automatically. Persist them via a NetworkManager dispatcher script, systemd-networkd, or a custom script in /etc/network/if-up.d/.
- VRF and services -- a process's sockets are not placed in the VRF unless it is started with `ip vrf exec <name>` or uses the `SO_BINDTODEVICE` socket option.
Useful commands
- `ip route flush cache` -- flush the routing cache (a legacy practice; the IPv4 route cache was removed in Linux 3.6, but the command still exists)
- `ip rule list table <name>` -- show which rules point to a given table
- `ip route show table all` -- show all routes in all tables
- `suppress_prefixlength N` -- a rule option that rejects a lookup result whose prefix length is N or less, so evaluation continues with the next rule. With N=0 on a `lookup main` rule, the specific routes in main are honoured but its default route is ignored; VPN setups use this so that a tunnel default does not override more-specific routes
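The `suppress_prefixlength` option is easiest to follow as a pair of rules. The sketch below prints such a pair, modelled on the pattern that full-tunnel VPN tools such as wg-quick commonly install; the function name `tunnel_rules`, the table number, the mark, and the priorities are illustrative assumptions. Explicit priorities are used because the order of the two rules matters.

```shell
# Print (do not run) a rule pair that sends traffic into a tunnel table
# while still honouring specific routes from main.
# Usage: tunnel_rules <table> <mark> <first_priority>
# (hypothetical helper)
tunnel_rules() {
    table=$1 mark=$2 prio=$3

    # Checked first: consult main, but reject a result with prefix length 0
    # (the default route), so the lookup falls through to the next rule.
    echo "ip rule add priority $prio table main suppress_prefixlength 0"

    # Checked second: packets without the tunnel's own mark use the tunnel
    # table; the tunnel's encapsulated packets carry the mark and skip it.
    echo "ip rule add priority $((prio + 1)) not fwmark $mark table $table"
}

tunnel_rules 51820 0xca6c 100
```

With these two rules, LAN and other connected destinations still resolve through main, while everything that would have used main's default route goes to the tunnel table instead.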