"You thought send() writes to the socket immediately? Hah. Welcome to the world of TCP_NOTSENT_LOWAT, where even writing is subject to backpressure math."
TL;DR
- TCP_NOTSENT_LOWAT is a TCP-level socket option on Linux (it is not part of POSIX).
- It sets a low-watermark for unsent data in the TCP send buffer.
- If the amount of unsent data ≥ this watermark, poll()/epoll stop reporting the socket as writable, and write() will block (or return EAGAIN if non-blocking).
- It allows fine-grained control over pacing, avoiding excess buffer bloat.
- It is useful for hand-rolled sendfile-style loops, high-throughput streaming, and latency-sensitive senders.
Historical Background: The Problem With write()
Let's rewind.
Traditionally, write() to a TCP socket just copies data into the kernel send buffer and returns immediately. The kernel handles:
- Segmenting
- Congestion control
- Actual packet transmission
But that's the problem:
You can queue megabytes into a socket before any of it is sent.
If you're using sendfile() or streaming large files, you might be dumping MBs into the kernel's buffer – creating:
- High memory usage
- High queueing latency
- Poor pacing control
The Motivation: App-Level Backpressure
Enter: TCP_NOTSENT_LOWAT
A way to say: "Only let me write more if the kernel has actually sent my previous data."
Added in Linux 3.12 (2013): not new, but still sharp and beautiful.
Before this, your only controls were:
- SO_SNDBUF: Total send buffer size
- select()/poll(): Writable if any space available
But that's crude.
With TCP_NOTSENT_LOWAT, you say:
"Don't accept more data from write(), and don't report me writable, until unsent data in kernel < X bytes."
Now your app is aware of transmission, not just buffering.
Code Example: Streaming Data with TCP_NOTSENT_LOWAT
#include <netinet/tcp.h>
int lowat = 65536; // 64 KB
setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));
Now, write() to that socket will:
- Buffer up to 64KB unsent data
- Then block (or return EAGAIN) until the kernel transmits some
This is backpressure, TCP-style.
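The two-line snippet above skips error handling. Here is a minimal sketch that sets the option and reads it back to confirm the kernel accepted it. The helper name set_notsent_lowat is made up for illustration; only the setsockopt()/getsockopt() calls come from the real API.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: set TCP_NOTSENT_LOWAT on a TCP socket, then read the
 * value back. Returns the value the kernel reports, or -1 on error (errno is
 * left as set by the failing call). */
int set_notsent_lowat(int sockfd, int bytes)
{
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                   &bytes, sizeof(bytes)) < 0)
        return -1;

    int current = 0;
    socklen_t len = sizeof(current);
    if (getsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                   &current, &len) < 0)
        return -1;
    return current;
}
```

The option can be set on a fresh socket before connect(), so you can configure it once right after socket().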
Real Use Case: Userspace Sendfile Replacement
You're building a custom sendfile() loop:
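while (offset < filesize) {
    ssize_t n = read(filefd, buf, BUFSZ);
    if (n <= 0) break;                 // EOF or read error
    ssize_t sent = 0;
    while (sent < n) {                 // write() may accept only part of buf
        ssize_t written = write(sockfd, buf + sent, n - sent);
        if (written < 0 && errno == EAGAIN) {
            poll(...); // Wait until TCP_NOTSENT_LOWAT allows more writes
            continue;
        }
        if (written < 0) return -1;    // real error (assumes an int-returning function)
        sent += written;
    }
    offset += n;
}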
Now your buffer never overflows, pacing matches transmission, and memory usage stays flat.
- No more flooding the send buffer.
- No more megabytes of stale data queued behind a slow link.
- Just smooth, flow-controlled streaming.
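A real loop also has to cope with partial writes and interrupted system calls. Here is a minimal sketch of the write side, with the poll(...) placeholder filled in. The name send_all is an assumption for illustration, and the sketch waits forever (timeout -1) where a production server would likely want a deadline.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes to fd. On a non-blocking
 * socket it sleeps in poll() whenever write() reports EAGAIN, which with
 * TCP_NOTSENT_LOWAT set means "too much unsent data is queued".
 * Returns 0 on success, -1 on error. */
int send_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n > 0) {
            done += (size_t)n;          /* partial writes are normal */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted by a signal: retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;              /* poll itself failed */
            continue;                   /* writable again (or EINTR): retry */
        }
        return -1;                      /* hard error, or write() returned 0 */
    }
    return 0;
}
```

Call it once per chunk you read from the file. The socket's watermark, not the size of your chunk, decides how often the loop ends up sleeping in poll().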
How the Kernel Tracks It
The kernel tracks "unsent bytes" separately from the total send buffer.
Unsent bytes = bytes written via send() but not yet handed to the network layer (i.e., not yet turned into packets).
Internally:
- The unsent count shrinks as each segment is handed down for transmission.
- Once unsent data < TCP_NOTSENT_LOWAT, your socket becomes writable again.
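You can confirm with tcp_info (the tcpi_notsent_bytes field, Linux 4.6+, declared in <linux/tcp.h>):
struct tcp_info info;
socklen_t len = sizeof(info);
getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &info, &len);
printf("Unsent: %u bytes\n", info.tcpi_notsent_bytes);
But note: unsent ≠ unacked. tcpi_unacked counts segments that were already sent and are waiting for an ACK. tcpi_notsent_bytes counts bytes still sitting in the write queue, and that is what the watermark is compared against.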
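If you want the unsent count itself, tcp_info carries it in the tcpi_notsent_bytes field. Here is a rough sketch of a helper that reads it. The helper name notsent_bytes is invented here, and the sketch assumes kernel headers recent enough to declare the field (glibc's <netinet/tcp.h> may not have it, hence <linux/tcp.h>).

```c
#include <linux/tcp.h>      /* struct tcp_info with tcpi_notsent_bytes, TCP_INFO */
#include <netinet/in.h>     /* IPPROTO_TCP */
#include <stddef.h>         /* offsetof */
#include <sys/socket.h>

/* Hypothetical helper: bytes written to the socket but not yet transmitted.
 * Returns -1 if getsockopt() fails or the kernel's tcp_info is too short to
 * include the field. */
long notsent_bytes(int sockfd)
{
    struct tcp_info info = {0};
    socklen_t len = sizeof(info);

    if (getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
        return -1;

    /* The kernel copies at most the struct size it knows about, so check
     * that the returned length actually covers the field. */
    if (len < offsetof(struct tcp_info, tcpi_notsent_bytes)
              + sizeof(info.tcpi_notsent_bytes))
        return -1;

    return (long)info.tcpi_notsent_bytes;
}
```

Logging this value next to your write() calls is a cheap way to see whether the watermark is doing anything on a given connection.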
Important Details
- The per-socket value defaults to 0, meaning "use the net.ipv4.tcp_notsent_lowat sysctl". That sysctl defaults to UINT_MAX – effectively disabled.
- Works with send(), write(), writev(), sendmsg()
- sendfile() feeds the same TCP write queue, so don't assume it bypasses the limit; check the behavior on your kernel.
- Only affects blocking behavior – doesn't change send buffer limits.
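The system-wide default lives in the net.ipv4.tcp_notsent_lowat sysctl, which is exposed under /proc. Here is a small sketch for reading it from a program. Both function names are made up for this example, and the procfs path may not be readable inside restricted containers.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the decimal text of the sysctl file.
 * Returns 0 on success, -1 if no number could be parsed. */
int parse_lowat(const char *text, unsigned long *out)
{
    char *end = NULL;
    errno = 0;
    unsigned long v = strtoul(text, &end, 10);
    if (errno != 0 || end == text)
        return -1;
    *out = v;
    return 0;
}

/* Hypothetical helper: read the system-wide default from procfs.
 * Returns 0 on success, -1 if the file is missing or unreadable. */
int read_default_notsent_lowat(unsigned long *out)
{
    char line[64];
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_notsent_lowat", "r");
    if (!f)
        return -1;
    char *ok = fgets(line, sizeof(line), f);
    fclose(f);
    return ok ? parse_lowat(line, out) : -1;
}
```

A value of 4294967295 means nobody has tuned it and sockets that don't set the option themselves are unlimited.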
Why This Is Powerful
1. App-Level Rate Limiting
You can control how much not-yet-sent data you allow to queue up (this is separate from in-flight data, which has been sent but not ACKed):
// Only allow 32KB of untransmitted data at a time
int lowat = 32768;
setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));
This keeps the queue of unsent data short. Ideal for:
- Satellite links
- Congested mobile connections
- Low-latency IoT devices
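How do you pick the number? One rule of thumb (my assumption, not something the kernel prescribes): data queued behind the watermark delays a freshly written byte by roughly queue size divided by link rate, so work backwards from the delay you can tolerate. A tiny sketch, with the function name invented for this example:

```c
#include <stdint.h>

/* Hypothetical sizing rule: how many bytes drain in budget_ms milliseconds
 * at rate_bps bits per second. Use the result as a starting point for
 * TCP_NOTSENT_LOWAT and tune from there. */
uint64_t lowat_for_budget(uint64_t rate_bps, uint64_t budget_ms)
{
    /* bits/s * ms = bits * 1000; divide by 8 (bits -> bytes) and by 1000. */
    return rate_bps * budget_ms / 8000;
}
```

For example, an 8 Mbit/s link with a 50 ms queueing budget gives 50000 bytes, and a 1 Mbit/s link with a 100 ms budget gives 12500 bytes. Going below a few MSS is likely to cost throughput, so treat small results with suspicion.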
2. Precise Non-Blocking Streaming
Most streamers do:
while (write(fd, buf, len) > 0);
That'll fill the entire send buffer.
Better:
setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));
Now you back off when data hasn't been sent. Congestion-aware in userland.
3. Zero-Copy Threshold Tuning
You're building a mmap()-based packet generator?
Want to avoid copying more data into the kernel until the previous data was transmitted?
TCP_NOTSENT_LOWAT = user-space congestion control primitive.
You can now build a full transport-layer-aware, custom-paced sender without touching kernel code.
Comparison with Other Socket Options
Socket Option      Controls                   Purpose
-------------      --------                   -------
SO_SNDBUF          Total buffer size          Max memory for socket
SO_SNDLOWAT        Writable threshold         Not changeable on Linux (setsockopt() fails with ENOPROTOOPT)
TCP_CORK           Send coalescing            Delay packetization
TCP_NODELAY        Disables Nagle             Immediate send
TCP_NOTSENT_LOWAT  Max unsent data in buffer  Pacing and streaming control
Pro tip: Combine TCP_NOTSENT_LOWAT + TCP_CORK = pipelined, batched transmission with send-level flow control.
Deep Experimental Test
Try this:
- Set TCP_NOTSENT_LOWAT = 16KB
- Run a server with non-blocking writes
- Start writing a 10MB file via writev()
- Observe that poll() reports POLLOUT only once unsent data has drained below the 16KB mark
You now control exactly when you're allowed to enqueue more – and you don't need to busy-wait or guess buffer states.
This is socket-as-flow-regulator mode.
Advanced Use: Simulated Sliding Window Protocol in Userland
Use TCP_NOTSENT_LOWAT to implement your own flow window logic.
int window_size = 8 * 1024; // 8KB "send window"
setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &window_size, sizeof(window_size));
struct pollfd pfd = { .fd = sockfd, .events = POLLOUT };
while (sending) {
    // only write if we have permission; sleep in poll() instead of spinning
    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLOUT)) {
        write(sockfd, next_chunk, chunk_size);
    }
}
It's like writing a miniature version of TCP on top of TCP.
Why? Because pacing is power.
Final Thoughts
TCP_NOTSENT_LOWAT is what happens when Linux gives you a lever into the TCP transmission pipeline itself. It's not about buffering. It's about when data is considered "sent" and when your app is allowed to write more.
In a world of zero-copy pipes, congestion-controlled QUIC stacks, and kernel bypass, this is the Linux socket API's answer to backpressure elegance.
It lets you:
- Layer your own userland pacing on top of the kernel's congestion control
- Control memory footprint per stream
- Pseudo-packetize data via pacing
- Get lower, more predictable latency for streaming workloads
This isn't just a socket flag. It's an instrument of precision.
Further Reading
- Linux Kernel Patch: Add TCP_NOTSENT_LOWAT
- Kernel source: net/ipv4/tcp.c, tcp_sendmsg_locked()
- QUIC backpressure design notes (for contrast)
- Wireshark filters: tcp.analysis.bytes_in_flight
Closing
TCP_NOTSENT_LOWAT is the fix for the lie that write() tells you.
It doesn't mean "sent". It means "copied somewhere and forgotten". This option is how you force the kernel to admit when data is actually leaving the box.
If you stream large responses, care about pacing, or don't want to shovel megabytes into a black hole, this is the lever you reach for.
It's not buffering. It's backpressure with teeth.
Next up
- SO_RCVLOWAT – receive-side backpressure and why nobody uses it correctly
- TCP_INQ – finding out how much data is really waiting in the socket
- MSG_ZEROCOPY – sending without copying and paying for it later
We can keep digging all the way into the syscall abyss. See you all later!
