Over the past few years, I have built several high-throughput, low-latency web services with Tokio. Along the way I've learned how to get Tokio to scale vertically as far as possible. This post is the first in a series I expect to write about lessons learned on the path to fitting over a million RPCs per second into each server. Many of these lessons apply outside of Tokio, but for now I'll focus on how to make code written with Tokio go fast.
How many threads should I use with Tokio?
How many threads should you configure in a multi-threaded server?
I often see this tuned via trial and error, and I have been guilty of doing that too. For web services running on a server (your own hardware, EC2, GCP Compute Engine, etc.), there is a quick approximation that is hard to beat. On a typical Linux aarch64 or x86 server, the ideal Tokio thread count is n-2, where n is the number of cores available on the machine.
An "ideal" thread count is the minimum number of threads required to reach the maximum throughput the machine can deliver.
But I want to use more threads.
Why shouldn't you just throw lots of threads at the problem and let the operating system work it out? After all, n is the default thread count in Tokio!
First, recall that Tokio is a general purpose async/await executor. It is broadly used across market domains. When you know what your workload looks like, you can customize certain aspects to suit your needs.
Second, recall that a CPU is a collection of cores, each of which can run only one thread at a time.
Third, recall that you have an operating system.
At n threads or greater, you incur unavoidable task-switching costs and duplicate scheduler overhead, both of which are unproductive waste. Task switching is the easier of the two to think about, and it's enough to build a useful mental model. The more threads you add, the more the operating system has to shift state in and out of each core to pretend that all the threads are being worked on at the same time. This switching does not produce results for your API server; it only adds to the work the CPU must do to produce them.
Sure, task switching on modern hardware is pretty cheap, especially as a one-off. But as you add threads, the cost adds up quickly and your server becomes sluggish. Throughput may stay about the same, but tail latency will climb higher and higher as you add threads.
Your operating system has work of its own to do: flushing dirty pages, servicing network interrupts, and so on. Your service probably has background work too: flushing log batches, sending metrics and traces, running health checks.
There's no exact science as to why n-2 is the best number, but in each of my services it is. In both Rust and Kotlin, n-1 does not increase throughput but it does raise tail latency, and n-3 reduces throughput.
You might find n-1 or n-3 is better in your situation, depending on the details of your service and server.
If you are using more than n threads with Tokio and you find that switching to n-2 threads slows your server down, it's time to go hunting for a synchronization bug! It might be in your code or in a library, but something is keeping your server from moving forward at its natural pace.
Illustration
Here is an example benchmark showing how thread count affects async web service latency, for a given load.
This benchmark shows the average latency for a task on a server, from 1 thread to 24 threads. The server's threads are driven to saturation at 200 total concurrency: 200 / 24 ≈ 8.3 concurrent tasks per thread at the far right of the graph, and 200 / 1 = 200 concurrent tasks per thread at the left.
From this picture, you can see that I have around 8-10 cores in my CPU. In fact, I have 10, so I can do at most 10 things in parallel.
The server gets faster as you add threads, to a point. Past 8 threads there is little extra throughput to be gained. Past 10 threads, high percentile latency will suffer whenever the server has something else to do, due to task switching and low on-die cache hit rates. Restricting the service to 8 threads leaves 2 cores available for background work, and reduces the frequency of task switches for the service threads.
This benchmark models a well-behaved async web service. In reality, there are often synchronization bottlenecks that throw a wrench into this ideal. Sometimes these bottlenecks are hidden in libraries, and often adding threads past n-2 only slows you down further.
Tuning thread count is a squishy black art, but n-2 threads, pinned or unpinned, has repeatedly proven the best overall configuration for the services I write. When writing cooperative multitasking systems, add threads with care!
2025-01-29