"Your TCP packet didn't travel the world. It never even left your L1 cache."
– Someone who knows what sk_buff does
TL;DR:
- The zero-copy loopback stack is Linux's internal optimization for sending data to yourself over TCP (usually via localhost or 127.0.0.1).
- When the kernel sees that both ends of a TCP connection live on the same host, it short-circuits the data path, bypassing the NIC, the driver stack, checksum offload, and actual IP routing.
- In ideal cases, data goes from your userspace buffer to the receiver's buffer without any copy at all.
- Result: near-RAM-speed socket communication behind a POSIX-compliant API.
History: From syscalls to shared memory
Before this optimization, localhost TCP followed the full stack:
- Send via write()
- Packetized into sk_buff
- Routed to the loopback interface (lo)
- Re-assembled and delivered to the socket receive buffer
Multiple copies happen:
- User → kernel (send)
- Kernel → socket buffer
- Socket buffer → kernel receive
- Kernel receive → user (read())
That's 4 copies for a call that never leaves the machine. Completely unnecessary when both endpoints are local.
So kernel devs asked: "What if we cheat?"
Enter: The Loopback Fast Path
Modern Linux (since ~3.10+, improved through 5.x) includes an internal optimization:
If both endpoints of a TCP socket are local and on the loopback interface, and certain conditions are met:
- No IPsec
- No netfilter (iptables)
- No QoS
- No congestion control tricks
… then the kernel will internally pass the data directly from sender socket to receiver socket, bypassing the entire IP stack.
What Does Zero-Copy Actually Mean?
You've probably heard of:
- mmap()
- splice()
- sendfile()
- io_uring
These are user-to-kernel zero-copy techniques.
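For a taste of that flavor, here's a minimal sendfile() sketch; filefd and sockfd are hypothetical, already-open descriptors (a regular file and a connected socket):
#include <sys/types.h>
#include <sys/sendfile.h>

// Ship `count` bytes from an open file straight into a socket.
// The pages move from the page cache to the socket inside the kernel;
// they never land in a userspace buffer.
static ssize_t send_file_zero_copy(int sockfd, int filefd, size_t count)
{
    off_t offset = 0;   /* start from the beginning of the file */
    return sendfile(sockfd, filefd, &offset, count);
}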
Loopback zero-copy is even more fundamental:
It's in-kernel socket-to-socket zero-copy.
No user interaction. No syscalls.
Data is enqueued in one socket and instantly dequeued from the other.
Code Example: How to Trigger It
This is all you need:
int listener = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in addr = {0};
addr.sin_family = AF_INET;
addr.sin_port = htons(5555);            /* any free local port */
inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
bind(listener, (struct sockaddr *)&addr, sizeof(addr));
listen(listener, 1);
int client = socket(AF_INET, SOCK_STREAM, 0);
connect(client, (struct sockaddr *)&addr, sizeof(addr));
int server = accept(listener, NULL, NULL);
// Now client <--> server is a local TCP pair
Then:
char msg[] = "zero-copy!";
write(client, msg, sizeof(msg));
char buf[32];
read(server, buf, sizeof(buf));
If your kernel is smart enough (Linux 4.16+), this write()/read() pair moves the data from one socket buffer to the other without an extra copy.
Kernel Path Dissection
Internally, Linux walks this path:
- tcp_sendmsg(): The core TCP send routine
- Detects: loopback → skips device output
- sk->sk_data_ready on the receiving socket fires immediately
- Data directly passed to peer's receive queue (sk_receive_queue)
- If receiver is blocked in recv(), it gets woken instantly
- Data appears in user buffer without ever being "routed"
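You can watch that instant wakeup from userspace. A tiny sketch, reusing the client/server pair from the example above (the timing behavior is what you'll typically observe, not a guarantee):
#include <poll.h>

char ping[] = "ping";
write(client, ping, sizeof(ping));

// Zero-timeout poll: don't wait at all.
struct pollfd pfd = { .fd = server, .events = POLLIN };
int ready = poll(&pfd, 1, 0);
// In practice ready == 1 and POLLIN is already set: the bytes landed on
// the peer's receive queue before write() even returned.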
Poke around the kernel source: the lo device lives in drivers/net/loopback.c, and the TCP send path in net/ipv4/tcp_output.c.
Scattered through the stack you'll find checks like:
if (dst->dev->flags & IFF_LOOPBACK) {
// short-circuit the stack
}
Performance? It's Insane.
Let's benchmark:
iperf3 -c 127.0.0.1
You'll see:
[ ID] Interval Transfer Bandwidth
[ 5] 0.00-1.00 sec 11.2 GBytes 96.3 Gbits/sec
Yes, 96 Gbit/sec over TCP.
No, you're not imagining it.
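If you'd rather not trust iperf3, a rough self-contained probe in plain C gets you the same ballpark. The port (5555) and the 8 GiB transfer size are arbitrary choices for the sketch:
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

// One process writes 8 GiB to 127.0.0.1, the other drains it and
// reports throughput.
#define PORT  5555
#define CHUNK (1 << 20)     /* 1 MiB per write */
#define LOOPS (8 * 1024)    /* 8 GiB in total  */

int main(void)
{
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    int listener = socket(AF_INET, SOCK_STREAM, 0);
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, 1);

    if (fork() == 0) {                        /* child: the sender */
        static char chunk[CHUNK];
        int c = socket(AF_INET, SOCK_STREAM, 0);
        connect(c, (struct sockaddr *)&addr, sizeof(addr));
        for (int i = 0; i < LOOPS; i++)
            write(c, chunk, CHUNK);
        close(c);
        _exit(0);
    }

    int s = accept(listener, NULL, NULL);     /* parent: the receiver */
    static char buf[CHUNK];
    long long total = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (ssize_t n; (n = read(s, buf, CHUNK)) > 0; )
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f Gbit/s\n", total * 8 / secs / 1e9);
    wait(NULL);
    return 0;
}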
Deep Mode: Using splice() Over Loopback
Try this:
int pipefds[2];
pipe(pipefds);   /* the pipe is the in-kernel staging buffer */
/* filefd: an open regular file; sockfd: a connected loopback TCP socket */
splice(filefd, NULL, pipefds[1], NULL, 65536, SPLICE_F_MOVE);   /* file -> pipe   */
splice(pipefds[0], NULL, sockfd, NULL, 65536, SPLICE_F_MOVE);   /* pipe -> socket */
If sockfd is a loopback TCP socket and kernel supports it, this goes:
- From file → kernel page cache
- Into pipe buffer
- Into the peer socket
- All without copying
It's zero-copy loopback + zero-copy syscall = kernel-magic streaming.
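In practice a single splice() pair only moves up to 64 KiB at a time, so a real transfer loops until the file is drained. A minimal sketch under that assumption, with filefd and sockfd as hypothetical already-open descriptors:
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Relay an entire file into a socket through a pipe, never touching
// a userspace buffer.
static int relay_file_to_socket(int filefd, int sockfd)
{
    int pipefds[2];
    if (pipe(pipefds) < 0)
        return -1;

    for (;;) {
        // Pull the next chunk of file pages into the pipe.
        ssize_t in = splice(filefd, NULL, pipefds[1], NULL, 65536, SPLICE_F_MOVE);
        if (in <= 0)            /* 0 = EOF, <0 = error */
            break;

        // Push everything we just staged on to the socket.
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(pipefds[0], NULL, sockfd, NULL, left, SPLICE_F_MOVE);
            if (out <= 0)
                goto out;
            left -= out;
        }
    }
out:
    close(pipefds[0]);
    close(pipefds[1]);
    return 0;
}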
Zero-Copy in io_uring
Newer kernels (6.0+, paired with a recent liburing) support:
io_uring_prep_send_zc()
With loopback optimization, this means:
- App never copies
- Kernel never copies
- Data path = pinned user pages → socket queue → peer buffer → completion
You're basically writing userland TCP that beats most RPC frameworks.
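A minimal sketch with liburing, assuming fd is an already-connected socket, liburing 2.3 or newer, and a 6.0+ kernel; error handling is trimmed:
#include <liburing.h>

static int send_zc_once(int fd, const void *buf, size_t len)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_send_zc(sqe, fd, buf, len, 0, 0);
    io_uring_submit(&ring);

    // A zero-copy send usually produces two completions: the send result
    // first, then a notification once the kernel is done with our buffer.
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    int more = cqe->flags & IORING_CQE_F_MORE;
    io_uring_cqe_seen(&ring, cqe);
    if (more) {
        io_uring_wait_cqe(&ring, &cqe);   /* buffer is safe to reuse after this */
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}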
Security Implication: Same-Host Visibility
Because data never leaves the host, loopback sockets can:
- Avoid TLS (if you dare)
- Rely on host-local trust boundaries (the bytes never leave the box)
- Skip MTU fragmentation logic
- Skip route tables, NAT, IP rules
This makes them ideal for internal microservice RPC, e.g., using gRPC over loopback.
When Zero-Copy Loopback Fails
It degrades to the regular copy path if:
- You use iptables rules on lo
- You force software checksumming on lo (e.g., ethtool -K lo tx off)
- You route via different VRF or namespace
- You load non-default congestion control modules (e.g., bbr with pacing)
- You mix blocking and non-blocking socket flags incorrectly
Run strace, use perf, use tcpdump – and verify you're not hitting the slow path.
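A quick way to verify from inside the program: ask the kernel for the socket's tcp_info and its congestion control module. Over a clean loopback path the smoothed RTT sits in the tens of microseconds (that threshold is a heuristic, not a guarantee):
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

// Sanity-check an established socket `fd`: print its smoothed RTT,
// retransmit count, and congestion control module.
static void check_loopback_path(int fd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
        printf("srtt: %u us, retransmits: %u\n",
               info.tcpi_rtt, info.tcpi_total_retrans);

    char cc[16] = {0};
    socklen_t cclen = sizeof(cc);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, &cclen) == 0)
        printf("congestion control: %s\n", cc);
}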
Experimental: Building In-Memory TCP via veth
Want to emulate loopback across two namespaces?
- Create a veth pair: veth0 <--> veth1
- Assign each to a netns
- Use TCP_NODELAY, TCP_NOTSENT_LOWAT, and TCP_CORK to control latency and batching on the pair (see the sketch below)
It's not quite loopback-fast, but lets you measure stack behavior under controlled topologies.
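For the socket-option part, a minimal sketch; the TCP_NOTSENT_LOWAT value is illustrative, and note that TCP_NODELAY and TCP_CORK pull in opposite directions, so you'd normally pick one:
#include <netinet/tcp.h>
#include <sys/socket.h>

// Apply the knobs above to a connected TCP socket `fd`.
static void tune_socket(int fd, int prefer_latency)
{
    int lowat = 128 * 1024;   /* cap unsent bytes buffered in the kernel */
    setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));

    int one = 1;
    if (prefer_latency)
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); /* push small writes immediately */
    else
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &one, sizeof(one));    /* coalesce into full frames     */
}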
Why It Matters: Rethinking IPC
Most people use:
- Unix domain sockets
- Named pipes
- Shared memory
- gRPC over HTTP/2
But:
TCP over loopback with zero-copy beats all of them in flexibility + performance.
You get:
- Stream semantics
- POSIX compliance
- Congestion control
- No serialization step
- Kernel-managed queues
- Transparent upgrade path to real TCP
It's basically shared memory with TCP semantics.
Final Thoughts
The zero-copy loopback stack is the ultimate example of kernel optimization: a place where network semantics and memory locality collide, and the kernel says:
"Oh, you're just talking to yourself? Fine. I won't even hit RAM."
It gives you:
- Wire-speed localhost RPC
- Real TCP semantics, with none of the NIC pain
- A fast path through the kernel that mimics what RDMA and DPDK achieve, without ever touching a NIC.
You're not sending packets anymore.
You're passing pointers inside the kernel.
Further Reading
- Linux Kernel: net/ipv4/tcp_output.c, drivers/net/loopback.c, tcp_write_xmit()
- perf record + perf trace on 127.0.0.1
- Use strace -e trace=network -T to see syscall timings
- io_uring man pages
Closing
Want more? I can write about:
- How loopback interacts with epoll()
- Benchmarking AF_UNIX vs loopback
- Using TCP_FASTOPEN over localhost
Say the word. We'll go even deeper.