"Your TCP packet didn't travel the world. It never even left your L1 cache."
– Someone who knows what sk_buff does

TL;DR

  • The zero-copy loopback stack is Linux's internal optimization for sending data to yourself over TCP (usually via localhost, i.e. 127.0.0.1).
  • When the kernel sees both ends of a TCP connection on the same host, it short-circuits the data path, bypassing the NIC, the driver's hardware queues, checksum verification, and any physical transmission.
  • In the ideal case, data goes from your userspace buffer to the receiver's buffer with as little copying as the kernel can manage.
  • Result: near-RAM-speed socket communication behind a standard POSIX API.

History: From syscalls to shared memory

Before this optimization, localhost TCP followed the full stack:

  • Send via write()
  • Packetized into sk_buff
  • Routed to the loopback interface (lo)
  • Re-assembled and delivered to the socket receive buffer

Multiple copies and handoffs happen:

  • User buffer → kernel sk_buff (the write())
  • sk_buff queued through the loopback device
  • sk_buff delivered to the peer's receive queue
  • Kernel sk_buff → user buffer (the read())

That's two full memory copies plus per-packet protocol overhead – for a local call! Completely unnecessary work when both sockets live on the same machine.

So kernel devs asked: "What if we cheat?"

Enter: The Loopback Fast Path

Modern Linux (refined steadily across the 3.x through 6.x series) includes an internal optimization:

If both endpoints of a TCP socket are local and the connection runs over the loopback interface, and certain conditions are met:

  • No IPsec transforms on the path
  • No netfilter (iptables/nftables) rules matching lo
  • No QoS / traffic shaping (qdiscs) attached to lo
  • No packet capture or other hooks inspecting lo

… then the loopback driver's transmit routine hands each sk_buff straight back to the receive path – no DMA, no device queueing, no checksum verification – collapsing "transmit" and "receive" into a single in-kernel handoff.

What Does Zero-Copy Actually Mean?

You probably heard of:

  • mmap()
  • splice()
  • sendfile()
  • io_uring

These are techniques for avoiding copies across the user/kernel boundary.

Loopback zero-copy is even more fundamental:

It's in-kernel, socket-to-socket copy avoidance.
No special APIs – just the write() and read() you already use.
Data enqueued on one socket is, in effect, immediately visible on the other.

Code Example: How to Trigger It

This is all you need:

struct sockaddr_in addr = {0};
addr.sin_family = AF_INET;
addr.sin_port = htons(9000);                    /* any free port */
addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* 127.0.0.1 */

int listener = socket(AF_INET, SOCK_STREAM, 0);
bind(listener, (struct sockaddr *)&addr, sizeof(addr));
listen(listener, 1);

int client = socket(AF_INET, SOCK_STREAM, 0);
connect(client, (struct sockaddr *)&addr, sizeof(addr));

int server = accept(listener, NULL, NULL);

// Now client <--> server is a local TCP pair

Then:

char msg[] = "zero-copy!";
write(client, msg, sizeof(msg));

char buf[32];
read(server, buf, sizeof(buf));

On a modern kernel, this write()/read() pair takes the loopback fast path: no device, no DMA, no checksum verification – just a short hop between two socket buffers.

Kernel Path Dissection

Internally, Linux checks this path:

  • tcp_sendmsg(): the core TCP send routine, copies your data into sk_buffs
  • ip_queue_xmit() routes them – the local routing table points at lo
  • The loopback device's transmit hook, instead of touching hardware, feeds each sk_buff straight back into the receive path via netif_rx()
  • Data lands on the peer's receive queue (sk_receive_queue)
  • sk->sk_data_ready fires on the receiving socket; if the receiver is blocked in recv(), it wakes instantly
  • read() copies the data into the user buffer – nothing was ever "on the wire"

Check this file in the kernel:

drivers/net/loopback.c

Its transmit routine is, in essence (lightly abridged):

static netdev_tx_t loopback_xmit(struct sk_buff *skb,
                                 struct net_device *dev)
{
    skb_orphan(skb);   /* detach from the sender's socket accounting */
    ...
    netif_rx(skb);     /* hand the buffer straight back to the RX path */
}

Elsewhere, routing and socket code special-cases the interface with checks like if (dev->flags & IFF_LOOPBACK).

Performance? It's Insane.

Let's benchmark:

iperf3 -c 127.0.0.1

On a modern machine you'll see something like:

[ ID] Interval       Transfer     Bandwidth
[  5] 0.00-1.00 sec  11.2 GBytes  96.3 Gbits/sec

Yes, 96 Gbit/sec over TCP.

No, you're not imagining it.

Deep Mode: Using splice() Over Loopback

Try this:

int pipefds[2];
pipe(pipefds);

// filefd: an open file; sockfd: a connected loopback TCP socket
splice(filefd, NULL, pipefds[1], NULL, 65536, SPLICE_F_MOVE);
splice(pipefds[0], NULL, sockfd, NULL, 65536, SPLICE_F_MOVE);

If sockfd is a loopback TCP socket and the kernel supports it, this goes:

  • From file → kernel page cache
  • Into the pipe by page reference, not byte copy
  • Into the peer socket
  • With copying avoided wherever the kernel can manage it

It's zero-copy loopback + zero-copy syscall = kernel-magic streaming.

Zero-Copy in io_uring

Newer Linux supports genuine zero-copy sends: MSG_ZEROCOPY on plain sockets since 4.14, and in io_uring since kernel 6.0 (liburing 2.3+):

io_uring_prep_send_zc()

With a zero-copy send over loopback, in the best case:

  • The app's pages are pinned rather than copied
  • The kernel skips its send-side copy
  • Data path = user pages → socket queue → peer buffer → completion

(Caveat: over loopback the kernel may quietly fall back to an internal copy; the completion notification's SO_EE_CODE_ZEROCOPY_COPIED flag tells you which path you actually got.)

You're basically writing userland TCP plumbing that beats most RPC frameworks.

Security Implication: Same-Host Visibility

Because data never leaves the host, loopback sockets can:

  • Avoid TLS (if you dare – anything privileged on the same host can still observe lo)
  • Rely on host-local trust boundaries instead of network ACLs
  • Skip MTU fragmentation logic
  • Skip route tables, NAT, IP rules

This makes them ideal for internal microservice RPC, e.g., using gRPC over loopback.

When Zero-Copy Loopback Fails

It degrades toward the regular, slower path if:

  • You add iptables/nftables rules that match traffic on lo
  • You attach qdiscs (traffic shaping) to lo
  • You route via a different VRF or across network namespaces
  • You capture packets on lo, forcing extra per-packet work
  • You mix blocking and non-blocking socket flags incorrectly
Run strace, use perf, use tcpdump – and verify you're not hitting the slow path.

Experimental: Building In-Memory TCP via veth

Want to emulate loopback across two namespaces?

  1. Create veth0 <--> veth1
  2. Assign each to a netns
  3. Tune with TCP_NODELAY, TCP_NOTSENT_LOWAT, and TCP_CORK to control latency and batching

It's not quite loopback-fast, but lets you measure stack behavior under controlled topologies.

Why It Matters: Rethinking IPC

Most people use:

  • Unix domain sockets
  • Named pipes
  • Shared memory
  • gRPC over HTTP/2

But:

TCP over loopback is competitive with all of them on raw performance – and beats them on flexibility.

You get:

  • Stream semantics
  • POSIX compliance
  • Congestion control
  • No serialization step
  • Kernel-managed queues
  • Transparent upgrade path to real TCP

It's basically shared memory with TCP semantics.

Final Thoughts

The zero-copy loopback stack is the ultimate example of kernel optimization: a place where network semantics and memory locality collide, and the kernel says:

"Oh, you're just talking to yourself? Fine. I won't even hit RAM."

It gives you:

  • Wire-speed localhost RPC
  • Real TCP semantics, with none of the NIC pain
  • A fast path through the kernel that approximates what RDMA and DPDK chase, without special hardware or a custom userspace stack.

You're not sending packets anymore.

You're passing pointers inside the kernel.

Further Reading

  • Linux Kernel: net/ipv4/tcp_output.c, drivers/net/loopback.c, tcp_write_xmit()
  • perf record + perf trace on 127.0.0.1
  • Use strace -e trace=network -T to see syscall timings
  • io_uring man pages

Closing

Want more? I can write about:

  • How loopback interacts with epoll()
  • Benchmarking AF_UNIX vs loopback
  • Using TCP_FASTOPEN over localhost

Say the word. We'll go even deeper.