"You thought send() writes to the socket immediately? Hah. Welcome to the world of TCP_NOTSENT_LOWAT, where even writing is subject to backpressure math."

TL;DR

  • TCP_NOTSENT_LOWAT is a TCP socket option on Linux (3.12 and later); Apple platforms have an option of the same name.
  • It sets a low-watermark for unsent data in the TCP send buffer.
  • While unsent data ≥ this watermark, poll()/select()/epoll stop reporting the socket as writable, and write() will block (or return EAGAIN if non-blocking) instead of queuing more.
  • It allows fine-grained control over pacing, avoiding excess buffer bloat.
  • It is most useful for sendfile-style loops, high-throughput streaming, and zero-copy TCP.

Historical Background: The Problem With write()

Let's rewind.

Traditionally, write() to a TCP socket just copies data into the kernel send buffer and returns immediately. The kernel handles:

  • Segmenting
  • Congestion control
  • Actual packet transmission

But that's the problem:

You can write megabytes into a socket (up to the autotuned send buffer limit) before any of it reaches the wire.

If you're using sendfile() or streaming large files, you might be dumping MBs into the kernel's buffer – creating:

  • High memory usage
  • High, hard-to-predict queuing latency
  • Poor pacing control

The Motivation: App-Level Backpressure

Enter: TCP_NOTSENT_LOWAT

A way to say: "Only let me write more if the kernel has actually sent my previous data."

Added in Linux 3.12 (2013): not new, but still sharp and beautiful.

Before this, your only controls were:

  • SO_SNDBUF: Total send buffer size
  • select()/poll(): Writable once enough buffer space is free, no matter how much is still unsent

But that's crude.

With TCP_NOTSENT_LOWAT, you say:

"Don't let me queue more, and don't tell me the socket is writable, until unsent data in the kernel drops below X bytes."

Now your app is aware of transmission, not just buffering.

Code Example: Streaming Data with TCP_NOTSENT_LOWAT

#include <netinet/tcp.h>

int lowat = 65536; // 64 KB
setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));

Now, write() to that socket will:

  • Queue up to roughly 64KB of unsent data (the check is made as buffers are added, so expect a little overshoot)
  • Then block (or return EAGAIN) until the kernel transmits some

This is backpressure, TCP-style.
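
The two-liner above ignores the return value. Here is a slightly more defensive sketch, assuming a Linux TCP socket. set_notsent_lowat is a hypothetical helper name, and the fallback #define uses the value from Linux's headers (25), which only matters if your libc headers are old.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25   /* Linux value; assumption for old libc headers */
#endif

/* Hypothetical helper: set the watermark, then read it back.
 * Returns the value the kernel reports, or -1 on error (errno is set). */
int set_notsent_lowat(int sockfd, int bytes)
{
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                   &bytes, sizeof(bytes)) < 0)
        return -1;

    int current = 0;
    socklen_t len = sizeof(current);
    if (getsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                   &current, &len) < 0)
        return -1;

    return current;
}
```

Call it once right after socket() or accept(), e.g. set_notsent_lowat(sockfd, 65536), and treat -1 as "this kernel or socket type doesn't support it".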

Real Use Case: Userspace Sendfile Replacement

You're building a custom sendfile() loop:

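while (offset < filesize) {
    ssize_t n = read(filefd, buf, BUFSZ);
    if (n <= 0) break;                  // EOF or read error
    ssize_t sent = 0;
    while (sent < n) {                  // write() may accept only part of buf
        ssize_t written = write(sockfd, buf + sent, n - sent);
        if (written < 0 && errno == EAGAIN) {
            poll(...); // Wait until TCP_NOTSENT_LOWAT allows more writes
            continue;
        }
        if (written < 0) return -1;     // real error (assumes an int-returning function)
        sent += written;
    }
    offset += sent;
}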

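Now the unsent backlog stays bounded, your write rate tracks the transmission rate, and memory usage stays flat.

  • No more flooding the send buffer.
  • No more megabytes of stale data queued ahead of whatever you write next (the congestion-window sawtooth itself is still TCP's business).
  • Just smooth, flow-controlled streaming.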
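
The poll(...) in the loop is left elided. One way to fill in the "write everything, sleep when the kernel says no" part is sketched below. write_all is a hypothetical helper name, and the sketch assumes fd is a non-blocking descriptor (it also works on blocking ones, where poll() is simply never reached).

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Write exactly len bytes to fd, sleeping in poll() whenever the kernel
 * refuses more data. Returns 0 on success, -1 on error (errno is set). */
int write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n > 0) {
            done += (size_t)n;              /* partial writes are normal */
        } else if (n < 0 && errno == EINTR) {
            continue;                       /* interrupted, just retry */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Backlog is at the watermark (or the buffer is full):
             * block until the socket is reported writable again. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
        } else {
            return -1;                      /* real error, or write() returned 0 */
        }
    }
    return 0;
}
```

With TCP_NOTSENT_LOWAT set on the socket, the poll() call is where the backpressure shows up: it returns only once the unsent backlog has drained far enough.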

How the Kernel Tracks It

The kernel tracks "unsent bytes" separately from the total send buffer.

Unsent bytes = bytes written via send() but not yet handed to the network layer (i.e., not yet turned into packets).

Internally:

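  • The kernel computes unsent bytes as the gap between the last byte you queued and the next byte it will transmit (write_seq minus snd_nxt), so the number shrinks with each outgoing segment.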
  • Once unsent data < TCP_NOTSENT_LOWAT, your socket becomes writable again.
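
The check is small enough to model in userland. What follows is a simplified sketch of my reading of the kernel logic, not kernel code. The function name and parameters are made up for illustration, and the wake parameter (wakeups requiring the backlog to fall below half the watermark) is an assumption you should verify against your kernel's source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the "may the app queue more?" test.
 * write_seq: sequence number just past the last byte the app queued.
 * snd_nxt:   sequence number of the next byte TCP will transmit.
 * lowat:     the TCP_NOTSENT_LOWAT value in bytes.
 * wake:      1 when deciding whether to wake a sleeping writer, else 0
 *            (assumed: wakeups wait until the backlog is under lowat/2). */
bool below_notsent_lowat(uint32_t write_seq, uint32_t snd_nxt,
                         uint32_t lowat, int wake)
{
    /* Unsigned subtraction wraps correctly when sequence numbers roll over. */
    uint32_t unsent = write_seq - snd_nxt;

    return ((uint64_t)unsent << wake) < lowat;
}
```

For example, with 100 bytes unsent and a 150-byte watermark, the plain check passes (100 < 150) but the wakeup check does not (200 is not < 150).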

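You can confirm with tcp_info:

struct tcp_info info; // use the struct from <linux/tcp.h>; libc's copy may lack the newer fields
socklen_t len = sizeof(info);
getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &info, &len);
printf("Unacked segments: %u\n", info.tcpi_unacked);
printf("Unsent bytes: %u\n", info.tcpi_notsent_bytes);

But note: unsent ≠ unacked. tcpi_unacked counts segments that were sent and still await an ACK; tcpi_notsent_bytes (present on newer kernels) counts bytes still sitting in the queue, which is what the watermark is compared against.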
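
If you only want the byte counts, Linux also exposes them through two ioctls. A minimal sketch, assuming a Linux TCP socket: query_sendq_depth and struct sendq_depth are hypothetical names, while SIOCOUTQ and SIOCOUTQNSD come from <linux/sockios.h>.

```c
#include <linux/sockios.h>   /* SIOCOUTQ, SIOCOUTQNSD (Linux-specific) */
#include <sys/ioctl.h>

struct sendq_depth {
    int queued;   /* SIOCOUTQ: bytes queued and not yet acknowledged */
    int unsent;   /* SIOCOUTQNSD: bytes queued and not yet sent */
};

/* Fills *out with the current send-queue depth.
 * Returns 0 on success, -1 on error (errno is set). */
int query_sendq_depth(int sockfd, struct sendq_depth *out)
{
    if (ioctl(sockfd, SIOCOUTQ, &out->queued) < 0)
        return -1;
    if (ioctl(sockfd, SIOCOUTQNSD, &out->unsent) < 0)
        return -1;
    return 0;
}
```

The difference queued - unsent approximates the bytes in flight; unsent is the number TCP_NOTSENT_LOWAT cares about.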

Important Details

  • A socket that never sets the option falls back to the net.ipv4.tcp_notsent_lowat sysctl, whose default is UINT_MAX (4294967295), effectively disabled.
  • Works with send(), write(), writev(), sendmsg()
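  • sendfile() feeds the same TCP write queue, so expect the watermark to gate it as well; test on your kernel before relying on either behavior.
  • Only affects when you may queue more data and when the socket reports writable; it doesn't change send buffer limits.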
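
To see what sockets get when they don't set the option themselves, read the system-wide sysctl. A small sketch, assuming the usual procfs location /proc/sys/net/ipv4/tcp_notsent_lowat; read_notsent_lowat_sysctl is a hypothetical helper name.

```c
#include <stdio.h>

/* Reads an unsigned integer from a sysctl-style file.
 * Returns 0 and stores the value in *out, or -1 on failure. */
int read_notsent_lowat_sysctl(const char *path, unsigned long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    int ok = (fscanf(f, "%lu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Usage: pass "/proc/sys/net/ipv4/tcp_notsent_lowat" as path. A very large value means the watermark is off unless a socket opts in.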

Why This Is Powerful

1. App-Level Rate Limiting

You can control how much unsent data sits queued in the kernel (in-flight data is still governed by TCP's congestion window):

// Only allow 32KB of untransmitted data at a time
int lowat = 32768;
setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));

This keeps the kernel-side backlog short. Ideal for:

  • Satellite links
  • Congested mobile connections
  • Low-latency IoT devices

2. Precise Non-Blocking Streaming

Most streamers do:

while (write(fd, buf, len) > 0);

That'll fill the entire send buffer.

Better:

setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));

Now write() returns EAGAIN once the unsent backlog reaches lowat, so the loop backs off when data hasn't been sent. Transmission-aware in userland.
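
The real payoff of a short backlog is that you can decide what to send as late as possible. Below is a sketch of that pattern: wait for POLLOUT first, and only then ask a producer for data. Everything named here (produce_fn, struct countdown, produce_countdown, stream_just_in_time) is hypothetical, and the countdown producer is a toy standing in for "pick the freshest frame".

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Fills buf with up to cap bytes; returns the count, or 0 when done. */
typedef size_t (*produce_fn)(char *buf, size_t cap, void *ctx);

/* Toy producer: emits chunks_left chunks of 'x', chunk_size bytes each. */
struct countdown { int chunks_left; size_t chunk_size; };

size_t produce_countdown(char *buf, size_t cap, void *ctx)
{
    struct countdown *c = ctx;
    if (c->chunks_left <= 0)
        return 0;
    c->chunks_left--;
    size_t n = c->chunk_size < cap ? c->chunk_size : cap;
    memset(buf, 'x', n);
    return n;
}

/* Blocks until fd is writable. Returns 0, or -1 on error. */
static int wait_writable(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    for (;;) {
        int rc = poll(&pfd, 1, -1);
        if (rc < 0 && errno == EINTR)
            continue;
        if (rc < 0 || (pfd.revents & (POLLERR | POLLNVAL)))
            return -1;
        return 0;
    }
}

/* Produce data only when the kernel is ready to take it. */
int stream_just_in_time(int fd, produce_fn produce, void *ctx)
{
    char buf[16 * 1024];

    for (;;) {
        if (wait_writable(fd) < 0)
            return -1;

        size_t len = produce(buf, sizeof(buf), ctx);  /* chosen at the last moment */
        if (len == 0)
            return 0;

        size_t off = 0;
        while (off < len) {
            ssize_t n = write(fd, buf + off, len - off);
            if (n > 0) {
                off += (size_t)n;
            } else if (n < 0 && errno == EINTR) {
                continue;
            } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                if (wait_writable(fd) < 0)
                    return -1;
            } else {
                return -1;
            }
        }
    }
}
```

Without a low watermark, the producer would be asked for data that then waits behind a deep send buffer. With one, whatever it returns is close to the front of the line.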

3. Zero-Copy Threshold Tuning

You're building a mmap()-based packet generator?

Want to avoid copying more data into the kernel until the previous data was transmitted?

TCP_NOTSENT_LOWAT = user-space backpressure primitive (the kernel still runs the actual congestion control).

You can now build a full transport-layer-aware, custom-paced sender without touching kernel code.

Comparison with Other Socket Options

Socket Option      Controls                   Purpose
-------------      --------                   -------
SO_SNDBUF          Total buffer size          Max memory for socket
SO_SNDLOWAT        Writable threshold         Fixed at 1 byte on Linux (not settable)
TCP_CORK           Send coalescing            Delay packetization
TCP_NODELAY        Disables Nagle             Immediate send
TCP_NOTSENT_LOWAT  Max unsent data in buffer  Pacing and streaming control

Pro tip: Combine TCP_NOTSENT_LOWAT + TCP_CORK = pipelined, batched transmission with send-level flow control.
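
A rough sketch of that combination for a header-plus-body message, assuming a blocking Linux TCP socket that already has TCP_NOTSENT_LOWAT set. set_tcp_cork and send_two_part are hypothetical helper names, and short writes are simply treated as errors to keep the sketch small.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Turns TCP_CORK on (1) or off (0). Returns 0, or -1 on error. */
int set_tcp_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Sends header and body as one corked batch. Returns 0, or -1 on error. */
int send_two_part(int fd, const void *hdr, size_t hdr_len,
                  const void *body, size_t body_len)
{
    if (set_tcp_cork(fd, 1) < 0)
        return -1;

    int rc = 0;
    if (write(fd, hdr, hdr_len) != (ssize_t)hdr_len ||
        write(fd, body, body_len) != (ssize_t)body_len)
        rc = -1;

    /* Always uncork so anything queued is flushed, but keep the
     * original errno if a write failed. */
    int saved = errno;
    if (set_tcp_cork(fd, 0) < 0 && rc == 0)
        return -1;
    errno = saved;
    return rc;
}
```

The cork decides how the two writes are packed into segments; the watermark decides whether write() lets you queue them in the first place.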

Deep Experimental Test

Try this:

  • Set TCP_NOTSENT_LOWAT = 16KB
  • Run a server with non-blocking writes
  • Start writing a 10MB file via writev()
  • Observe that poll() reports POLLOUT only once unsent data has dropped below the 16KB watermark

You now control exactly when you're allowed to enqueue more – and you don't need to busy-wait or guess buffer states.

This is socket-as-flow-regulator mode.
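
Here is a scaled-down version of that experiment you can run on one machine. It is a sketch under assumptions: it uses a loopback TCP connection whose receiver never reads, and counts how many bytes the kernel accepts before the first EAGAIN. bytes_accepted_before_eagain is a hypothetical name, and the numbers depend on buffer autotuning, so compare runs rather than trusting absolute values.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25   /* Linux value; assumption for old libc headers */
#endif

/* Returns bytes accepted before the first EAGAIN, or -1 on setup failure.
 * lowat <= 0 leaves the socket at the system default. */
long bytes_accepted_before_eagain(int lowat)
{
    static char chunk[4096];
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    int lfd = -1, cfd = -1, afd = -1, flags;
    long total = -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a port */

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) goto out;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    if (listen(lfd, 1) < 0) goto out;
    if (getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0) goto out;

    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0) goto out;
    if (lowat > 0 && setsockopt(cfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                                &lowat, sizeof(lowat)) < 0) goto out;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    afd = accept(lfd, NULL, NULL);           /* accepted, but never read from */
    if (afd < 0) goto out;

    flags = fcntl(cfd, F_GETFL, 0);
    if (flags < 0 || fcntl(cfd, F_SETFL, flags | O_NONBLOCK) < 0) goto out;

    total = 0;
    for (;;) {
        ssize_t n = write(cfd, chunk, sizeof(chunk));
        if (n > 0) { total += n; continue; }
        if (n < 0 && errno == EINTR) continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) break;
        total = -1;                          /* unexpected error */
        break;
    }

out:
    if (afd >= 0) close(afd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return total;
}
```

Call it once with 0 and once with 16 * 1024. I would expect the second figure to be noticeably smaller, since the writer gets cut off once the unsent backlog hits the watermark instead of when the whole send buffer is full.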

Advanced Use: Simulated Sliding Window Protocol in Userland

Use TCP_NOTSENT_LOWAT to implement your own flow window logic.

int window_size = 8 * 1024; // 8KB "send window"

setsockopt(sockfd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &window_size, sizeof(window_size));

struct pollfd pfd = { .fd = sockfd, .events = POLLOUT }; // needs <poll.h>

while (sending) {
    // only write if we have permission; sleep in poll() instead of spinning
    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLOUT)) {
        write(sockfd, next_chunk, chunk_size);
    }
}

It's like writing a miniature version of TCP on top of TCP.

Why? Because pacing is power.

Final Thoughts

TCP_NOTSENT_LOWAT is what happens when Linux gives you a lever into the TCP transmission pipeline itself. It's not about buffering. It's about when data is considered "sent" and when your app is allowed to write more.

In a world of zero-copy pipes, congestion-controlled QUIC stacks, and kernel bypass, this is the Linux socket API's answer to backpressure elegance.

It lets you:

  • Build your own userland pacing on top of the kernel's congestion control
  • Control memory footprint per stream
  • Pseudo-packetize data via pacing
  • Get lower, more predictable queuing latency for streaming workloads

This isn't just a socket flag. It's an instrument of precision.

Further Reading

  • Linux Kernel Patch: Add TCP_NOTSENT_LOWAT
  • Kernel source: net/ipv4/tcp.c, tcp_sendmsg_locked()
  • QUIC backpressure design notes (for contrast)
  • Wireshark filters: tcp.analysis.bytes_in_flight

Closing

Need a guide to SO_RCVLOWAT, TCP_INQ, or Linux MSG_ZEROCOPY?

Just say the word.

We can keep digging all the way into the syscall abyss.