"You thought send() writes to the socket immediately? Hah. Welcome to the world of TCP_NOTSENT_LOWAT, where even writing is subject to backpressure math."
TL;DR
- TCP_NOTSENT_LOWAT is a Linux-specific socket option.
- It sets a low-watermark for unsent data in the TCP send buffer.
- If the amount of unsent data ≥ this watermark, poll() stops reporting the socket as writable and write() will block (or return EAGAIN if non-blocking).
- It allows fine-grained control over pacing, avoiding excess buffer bloat.
- It is most useful for userspace sendfile-style loops, high-throughput streaming, and zero-copy TCP.
Historical Background: The Problem With write()
Let's rewind.
Traditionally, write() to a TCP socket just copies data into the kernel send buffer and returns immediately. The kernel handles:
- Segmenting
- Congestion control
- Actual packet transmission
But that's the problem:
You can queue megabytes into a socket (up to whatever the send buffer autotunes to) before most of it is sent.
If you're using sendfile() or streaming large files, you might be dumping MBs into the kernel's buffer – creating:
- High memory usage
- High, unpredictable latency
- Poor pacing control
The Motivation: App-Level Backpressure
Enter: TCP_NOTSENT_LOWAT
A way to say: "Only let me write more if the kernel has actually sent my previous data."
Added in Linux 3.12 (2013): sharp, beautiful, and older than you'd think.
Before this, your only controls were:
- SO_SNDBUF: Total send buffer size
- select()/poll(): Writable if any space available
But that's crude.
With TCP_NOTSENT_LOWAT, you say:
"Don't accept more of my data, and don't tell me the socket is writable, until unsent data in the kernel < X bytes."
Now your app is aware of transmission, not just buffering.
Code Example: Streaming Data with TCP_NOTSENT_LOWAT
#include <netinet/tcp.h>
int lowat = 65536; // 64 KB
setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));
Now, write() to that socket will:
- Queue roughly 64 KB of unsent data (a single large write can overshoot a little)
- Then block (or return EAGAIN) until the kernel transmits some
This is backpressure, TCP-style.
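The setsockopt() call above ignores its return value. A minimal sketch of a checked version follows; the helper names set_notsent_lowat and get_notsent_lowat are made up for illustration, and the fallback #define assumes the Linux value of the option for older libc headers.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25   /* Linux value, for older libc headers */
#endif

/* Hypothetical helper: set the watermark; returns 0 on success, -1 on error. */
int set_notsent_lowat(int fd, int bytes)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &bytes, sizeof(bytes));
}

/* Hypothetical helper: read the watermark back; returns it, or -1 on error. */
int get_notsent_lowat(int fd)
{
    int bytes = 0;
    socklen_t len = sizeof(bytes);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &bytes, &len) < 0)
        return -1;
    return bytes;
}
```

Reading the value back is a cheap sanity check that the kernel you are running on actually knows the option.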
Real Use Case: Userspace Sendfile Replacement
You're building a custom sendfile() loop:
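while (offset < filesize) {
    ssize_t n = pread(filefd, buf, BUFSZ, offset);   // always read from the current offset
    if (n <= 0) break;                               // EOF or read error
    ssize_t written = write(sockfd, buf, n);
    if (written < 0) {
        if (errno != EAGAIN && errno != EINTR) break;  // real error
        struct pollfd p = { .fd = sockfd, .events = POLLOUT };  // needs <poll.h>
        poll(&p, 1, -1);  // Wait until TCP_NOTSENT_LOWAT allows more writes
        continue;         // retry the same offset, nothing was consumed
    }
    offset += written;    // a short write is simply re-read from the new offset
}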
Now the kernel queue stays shallow, pacing matches transmission, and memory usage stays flat.
- No more flooding the send buffer.
- No more megabytes of stale data parked behind the congestion window.
- Just smooth, flow-controlled streaming.
How the Kernel Tracks It
The kernel tracks "unsent bytes" separately from the total send buffer.
Unsent bytes = bytes written via send() but not yet handed to the network layer (i.e., not yet turned into packets).
Internally:
- The kernel subtracts each outgoing segment.
- Once unsent data < TCP_NOTSENT_LOWAT, your socket becomes writable again.
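The bookkeeping above can be sketched as a toy model. This is a simplification for illustration, not kernel code: the function names are invented, and write_seq / snd_nxt are assumed to stand for "next sequence number the app will write" and "next sequence number TCP will transmit".

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: bytes the app has queued that TCP has not transmitted yet.
 * Unsigned subtraction keeps the result correct across sequence wraparound. */
uint32_t model_unsent(uint32_t write_seq, uint32_t snd_nxt)
{
    return (uint32_t)(write_seq - snd_nxt);
}

/* Toy model: the socket counts as writable only while unsent < watermark. */
bool model_writable(uint32_t write_seq, uint32_t snd_nxt, uint32_t lowat)
{
    return model_unsent(write_seq, snd_nxt) < lowat;
}
```

With a 64 KB watermark, 60000 unsent bytes still leaves the socket writable, while 70000 does not. The real kernel also requires free space in the send buffer, which this sketch leaves out.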
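You can confirm with tcp_info (tcpi_notsent_bytes needs Linux 4.6+ and the struct from <linux/tcp.h>):
struct tcp_info info;
socklen_t len = sizeof(info);
getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &info, &len);
printf("Unsent: %u bytes, unacked: %u segments\n", info.tcpi_notsent_bytes, info.tcpi_unacked);
But note: unsent ≠ unacked. tcpi_unacked counts segments already on the wire and waiting for an ACK; the watermark is compared against bytes that have not been transmitted at all.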
Important Details
- The per-socket default is 0, which defers to the sysctl net.ipv4.tcp_notsent_lowat. That sysctl defaults to UINT_MAX, so the feature is effectively disabled.
- Works with send(), write(), writev(), sendmsg()
- Don't assume sendfile() is exempt: a poll()-driven sendfile() loop sees the same POLLOUT gating, so measure on your kernel.
- Only affects blocking behavior – doesn't change send buffer limits.
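To see what an untouched socket inherits, you can read the system-wide sysctl. A minimal sketch, assuming /proc is mounted; parse_lowat and read_system_lowat are illustrative names, not an existing API.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the sysctl text; returns 0 on success, -1 otherwise. */
int parse_lowat(const char *text, unsigned long long *out)
{
    char *end;
    unsigned long long v = strtoull(text, &end, 10);

    if (end == text)
        return -1;              /* no digits found */
    *out = v;
    return 0;
}

/* Hypothetical helper: read net.ipv4.tcp_notsent_lowat through /proc. */
int read_system_lowat(unsigned long long *out)
{
    char line[64];
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_notsent_lowat", "r");

    if (!f)
        return -1;              /* no /proc, or a kernel without the option */
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_lowat(line, out);
}
```

On a stock system this would be expected to report 4294967295 (UINT_MAX), meaning no limit.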
Why This Is Powerful
1. App-Level Rate Limiting
You can cap how much unsent data sits in the kernel (data already in flight is still governed by the congestion window):
// Only allow 32KB of untransmitted data at a time
int lowat = 32768;
setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));
This keeps the unsent queue short. Ideal for:
- Satellite links
- Congested mobile connections
- Low-latency IoT devices
2. Precise Non-Blocking Streaming
Most streamers do:
while (write(fd, buf, len) > 0);
That'll fill the entire send buffer.
Better:
setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));
Now you back off when data hasn't been sent. Congestion-aware in userland.
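Backing off only works if the loop handles EAGAIN and short writes. Here is a minimal sketch of such a loop; send_all is a made-up helper name, and it assumes the socket is already non-blocking with the watermark set.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: push a whole buffer through a non-blocking socket.
 * Returns 0 once everything is queued, -1 on a real error. */
int send_all(int fd, const char *buf, size_t len)
{
    size_t off = 0;

    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, MSG_NOSIGNAL);

        if (n > 0) {
            off += (size_t)n;   /* short writes just advance the cursor */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Sleep until the kernel reports the socket writable again. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };

            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;              /* hard error, or an unexpected zero-length send */
    }
    return 0;
}
```

The poll() call is where the watermark does its work: the process sleeps instead of spinning, and wakes once the kernel has drained enough unsent data.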
3. Zero-Copy Threshold Tuning
You're building a mmap()-based packet generator?
Want to avoid copying more data into the kernel until the previous data was transmitted?
TCP_NOTSENT_LOWAT = user-space pacing primitive (the kernel still runs the actual congestion control).
You can now build a full transport-layer-aware, custom-paced sender without touching kernel code.
Comparison with Other Socket Options
Socket Option       Controls                    Purpose
-----------------   -------------------------   --------------------------------------------------
SO_SNDBUF           Total buffer size           Max memory for socket
SO_SNDLOWAT         Writable threshold          Fixed on Linux (setsockopt fails with ENOPROTOOPT)
TCP_CORK            Send coalescing             Delay packetization
TCP_NODELAY         Disables Nagle              Immediate send
TCP_NOTSENT_LOWAT   Max unsent data in buffer   Pacing and streaming control
Pro tip: Combine TCP_NOTSENT_LOWAT + TCP_CORK = pipelined, batched transmission with send-level flow control.
Deep Experimental Test
Try this:
- Set TCP_NOTSENT_LOWAT = 16KB
- Run a server with non-blocking writes
- Start writing a 10MB file via writev()
- Observe that poll() reports POLLOUT only once unsent data has dropped below 16KB
You now control exactly when you're allowed to enqueue more – and you don't need to busy-wait or guess buffer states.
This is socket-as-flow-regulator mode.
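A rough way to try this in a single process: connect a TCP pair over loopback, never read on the receiving side, and count how many bytes a non-blocking sender accepts before EAGAIN. This is a sketch under assumptions (loopback available, helper names loopback_pair and fill_until_eagain invented here). It uses write() rather than writev() to stay short, and the exact byte counts depend on kernel buffer tuning.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25   /* Linux value, for older libc headers */
#endif

/* Hypothetical helper: out[0] = sender, out[1] = receiver, both on 127.0.0.1. */
int loopback_pair(int out[2])
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    if (lfd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                        /* let the kernel pick a port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0) {
        close(lfd);
        return -1;
    }
    out[0] = socket(AF_INET, SOCK_STREAM, 0);
    if (out[0] < 0) {
        close(lfd);
        return -1;
    }
    if (connect(out[0], (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(out[0]);
        close(lfd);
        return -1;
    }
    out[1] = accept(lfd, NULL, NULL);
    close(lfd);
    if (out[1] < 0) {
        close(out[0]);
        return -1;
    }
    return 0;
}

/* Write 4 KB chunks until EAGAIN while the peer never reads.
 * lowat <= 0 leaves the option untouched. Returns bytes accepted, or -1. */
long fill_until_eagain(int lowat)
{
    int fds[2];
    char chunk[4096];
    long total = 0;

    if (loopback_pair(fds) < 0)
        return -1;
    memset(chunk, 'x', sizeof(chunk));
    if (lowat > 0 && setsockopt(fds[0], IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                                &lowat, sizeof(lowat)) < 0)
        goto fail;
    if (fcntl(fds[0], F_SETFL, fcntl(fds[0], F_GETFL, 0) | O_NONBLOCK) < 0)
        goto fail;
    for (;;) {
        ssize_t n = write(fds[0], chunk, sizeof(chunk));

        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                            /* the kernel said "enough" */
        if (n < 0 && errno == EINTR)
            continue;
        goto fail;
    }
    close(fds[0]);
    close(fds[1]);
    return total;
fail:
    close(fds[0]);
    close(fds[1]);
    return -1;
}
```

Comparing fill_until_eagain(16 * 1024) with fill_until_eagain(0) should show the sender being cut off much earlier with the watermark set, though the absolute numbers will vary from machine to machine.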
Advanced Use: Simulated Sliding Window Protocol in Userland
Use TCP_NOTSENT_LOWAT to implement your own flow window logic.
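int window_size = 8 * 1024; // 8KB "send window"
setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &window_size, sizeof(window_size));
struct pollfd pfd = { .fd = sockfd, .events = POLLOUT };  // needs <poll.h>
while (sending) {
    // only write if we have permission; a -1 timeout sleeps instead of spinning
    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLOUT)) {
        write(sockfd, next_chunk, chunk_size);
    }
}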
It's like writing a miniature version of TCP on top of TCP.
Why? Because pacing is power.
Final Thoughts
TCP_NOTSENT_LOWAT is what happens when Linux gives you a lever into the TCP transmission pipeline itself. It's not about buffering. It's about when data is considered "sent" and when your app is allowed to write more.
In a world of zero-copy pipes, congestion-controlled QUIC stacks, and kernel bypass, this is the Linux socket API's answer to backpressure elegance.
It lets you:
- Build your own userland pacing on top of the kernel's congestion control
- Control memory footprint per stream
- Pseudo-packetize data via pacing
- Get more predictable latency for streaming workloads
This isn't just a socket flag. It's an instrument of precision.
Further Reading
- Linux kernel patch: "tcp: TCP_NOTSENT_LOWAT socket option" (Linux 3.12)
- Kernel source: net/ipv4/tcp.c, tcp_sendmsg_locked()
- QUIC backpressure design notes (for contrast)
- Wireshark filters: tcp.analysis.bytes_in_flight
Closing
Need a guide to SO_RCVLOWAT, TCP_INQ, or Linux MSG_ZEROCOPY?
Just say the word.
We can keep digging all the way into the syscall abyss.