Why the hyper coroutine version is slower than the future version

In my last post, Boosting Hyper with MAY, you can see that the coroutine version is slower than the future version with a single worker thread. I was curious why. A few questions came to mind: maybe the context-switching cost is too high? Or maybe the logic of the thread version is not as optimized as the future version's? After all, the thread version of hyper is not actively developed compared with the master branch.

So I decided to profile the server to see what actually happens.

Install the profiling tools

I use cargo-profiler; it is not actively developed anymore, but still usable.

I used the following commands to install it on my Ubuntu VM:

$ sudo apt-get install valgrind
$ cargo install cargo-profiler

Modify the echo server

We need to let the server exit normally so the profiling tool can collect its results. Just insert this code into the echo server example:

fn main() {
    ......
    // insert this code to let the server exit after 10s
    std::thread::spawn(|| {
        std::thread::sleep(std::time::Duration::from_secs(10));
        std::process::exit(0);
    });
    println!("Listening on http://127.0.0.1:3000");
}
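Calling std::process::exit works, but it tears the process down from an arbitrary thread. If you prefer a graceful shutdown, a channel-based watchdog is a stdlib-only alternative. The sketch below is my own illustration, not code from the hyper example; the tick loop stands in for the server's accept loop:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Spawn a watchdog thread that signals shutdown once `window` has elapsed.
fn shutdown_after(window: Duration) -> mpsc::Receiver<()> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(window);
        // Ignore the error if the receiver was already dropped.
        let _ = tx.send(());
    });
    rx
}

fn main() {
    let shutdown = shutdown_after(Duration::from_millis(100));
    let mut ticks = 0u32;
    // Stand-in for the server's accept loop: check the signal between
    // units of work instead of killing the process mid-request.
    while shutdown.try_recv().is_err() {
        ticks += 1;
        thread::sleep(Duration::from_millis(10));
    }
    println!("served {} ticks, exiting cleanly", ticks);
}
```

Either style works for callgrind, as long as the process reaches a normal exit so the profiling data gets flushed.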

Running the profile

We profile the release build to see what actually happens inside hyper:

$ cd hyper
$ cargo build --example=hello --release
$ cargo profiler callgrind --bin target/release/examples/hello

And run the client in another terminal at the same time (10-second duration, 2 threads, 20 connections):

$ wrk http://127.0.0.1:3000 -d 10 -t 2 -c 20

After 10 seconds you will see the result printed out by cargo-profiler.

The may version result

70,276,542 (8.9%) ???:_..std..io..Write..write_fmt..Adaptor....a$C$..T....as..core..fmt..Write..::write_str 
-----------------------------------------------------------------------
64,212,448 (8.2%) memcpy-sse2-unaligned.S:__memcpy_sse2_unaligned
-----------------------------------------------------------------------
61,684,172 (7.8%) ???:_..hyper..header..NewlineReplacer....a$C$....b....as..core..fmt..Write..::write_str
-----------------------------------------------------------------------
55,352,952 (7.0%) ???:_..std..io..buffered..BufWriter..W....as..std..io..Write..::write
-----------------------------------------------------------------------
47,665,042 (6.1%) ???:_..hyper..http..h1..HttpWriter..W....as..std..io..Write..::write
-----------------------------------------------------------------------

The future version result

76,417,403 (12.0%) memcpy-sse2-unaligned.S:__memcpy_sse2_unaligned
-----------------------------------------------------------------------
71,438,100 (11.2%) memset.S:__GI_memset
-----------------------------------------------------------------------
32,188,192 (5.0%) ???:_..futures..future..map_err..MapErr..A$C$..F....as..futures..future..Future..::poll
-----------------------------------------------------------------------
27,514,102 (4.3%) ???:hyper::proto::h1::parse::_..impl..hyper..proto..Http1Transaction..for..hyper..proto..ServerTransaction..::parse
-----------------------------------------------------------------------
22,228,235 (3.5%) ???:tokio_core::reactor::Core::poll
-----------------------------------------------------------------------
19,713,445 (3.1%) ???:_..hyper..proto..dispatch..Dispatcher..D$C$..Bs$C$..I$C$..B$C$..T$C$..K....::poll_read

Conclusion

Apparently the hyper thread version is not fully optimized: it spends most of its time formatting strings. The future version spends most of its time copying memory, which makes sense given that the server just echoes strings back.
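The difference is easy to see at the API level. A minimal micro-example (my own sketch, not hyper's actual code): writing a header through the write! macro exercises the core::fmt write_str machinery that dominates the thread version's profile, while writing pre-serialized bytes with write_all is essentially just a memory copy:

```rust
use std::io::Write;

fn main() {
    let mut buf: Vec<u8> = Vec::with_capacity(128);
    // Formatting path: interprets the format string at runtime and goes
    // through the core::fmt::Write machinery (the write_str frames above).
    write!(buf, "Content-Length: {}\r\n", 11).unwrap();
    // Byte-copy path: no formatting machinery, just a plain memory copy.
    buf.write_all(b"Connection: keep-alive\r\n").unwrap();
    assert!(buf.starts_with(b"Content-Length: 11\r\n"));
    println!("{}", String::from_utf8(buf).unwrap());
}
```

Serializing headers once and reusing the bytes keeps a hot path on the memcpy side instead of the formatting side.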

We also notice that the future-based version spends a noticeable amount of time in the framework's poll machinery:

(3.5%) ???:tokio_core::reactor::Core::poll

The may version's context-switch runtime is not that heavy; I found this in the profile:

(1.8%) ???:_..F..as..generator..gen_impl..FnBox..::call_box

P.S.

For a fair comparison on a simple HTTP echo server, I think the most appropriate candidates are tokio_minihttp and the http example in the may project. They don't involve much framework machinery and just do the echoing.

Below are the results on my Ubuntu VM.

tokio_minihttp (1 thread)

$ wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20
Running 10s test @ http://127.0.0.1:8080
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   262.43us  818.35us  17.52ms   97.63%
    Req/Sec    61.71k     9.21k   79.79k    84.50%
  1228552 requests in 10.01s, 124.19MB read
Requests/sec: 122691.03
Transfer/sec:     12.40MB
wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20  1.80s user 7.25s system 90% cpu 10.017 total

may http example (1 thread)

$ wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20
Running 10s test @ http://127.0.0.1:8080
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   502.23us    2.12ms  28.01ms   95.76%
    Req/Sec    87.78k     8.60k  113.63k    79.50%
  1748461 requests in 10.02s, 155.07MB read
Requests/sec: 174571.22
Transfer/sec:     15.48MB
wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20  1.80s user 9.59s system 113% cpu 10.019 total

may http example (2 threads)

$ wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20
Running 10s test @ http://127.0.0.1:8080
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   119.37us  673.66us  15.79ms   98.46%
    Req/Sec   180.75k    18.13k  197.66k    87.00%
  3599209 requests in 10.01s, 319.22MB read
Requests/sec: 359663.27
Transfer/sec:     31.90MB
wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20  3.68s user 15.86s system 195% cpu 10.010 total