7-10us for what is a hashtable set/get is really, really bad
I can get a packet out through a switch to another machine and back in 1-2us
gtirloni 1 days ago [-]
Do you mean 1-2ms?
eqvinox 1 days ago [-]
No, 1-2us is correct for that — in a datacenter, with cut-through switching.
gtirloni 1 days ago [-]
That's really impressive. I need to update myself on this topic. Thanks.
mickg10 1 days ago [-]
In reality - with decent switches at 25G - and no FEC - node to node is reliably under 300ns (0.3 us)
znyboy 1 days ago [-]
Considering that 300 light-nanoseconds is about 90m, getting a response (or even just one-way) in that time is essentially running right at the limits of physics/causality.
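(As a rough check on that: 3.0×10^8 m/s × 300×10^-9 s = 90 m of vacuum path. Signals in fibre or copper propagate at roughly two thirds of that, so 300 ns one-way corresponds to more like 60 m of cable, before counting any serialisation or switching time.)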
davekeck 1 days ago [-]
Out of curiosity, how is that measured across machines?
(The first thing that comes to my mind would be to use an oscilloscope with two probes, one to each machine, but I’m guessing that’s not it.)
toast0 1 days ago [-]
Measure the round trip and divide by two for the approximate one-way time. It'd be really neat to measure the time it takes for a packet to travel in one direction, but it's somewhere between hard and impossible[1]; a very short path has less room to be asymmetric though.
[1] If the clocks are synchronized, you can measure send time on one end, and receive time on the other. But synchronizing clocks involves estimating the time it takes for signals to pass in each direction, typically assuming each direction takes half the round trip.
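(A minimal sketch of the round-trip-and-halve approach, assuming a plain UDP echo responder on the far machine; the address and port below are placeholders. Real sub-microsecond measurements would use kernel bypass and/or NIC hardware timestamps, but the arithmetic is the same.)

    /* Hypothetical ping-pong probe: time one UDP round trip with
       CLOCK_MONOTONIC, then halve it to approximate the one-way latency. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port = htons(7);                           /* assumed UDP echo service */
        inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);   /* placeholder address */
        connect(fd, (struct sockaddr *)&peer, sizeof peer);

        char buf[64] = "ping";
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        send(fd, buf, sizeof buf, 0);
        recv(fd, buf, sizeof buf, 0);                       /* blocks until the echo returns */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double rtt_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("rtt = %.0f ns, one-way ~= %.0f ns\n", rtt_ns, rtt_ns / 2);
        close(fd);
        return 0;
    }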
pkhuong 19 hours ago [-]
You can use something like White Rabbit (https://en.wikipedia.org/wiki/White_Rabbit_Project) to keep clocks in sync. That still involves estimates, but a dedicated time sync network can do things like make sure all the cables are the same length.
jiggawatts 1 days ago [-]
Meanwhile the best network I’ve ever benchmarked was AWS and measured about 55µs for a round trip!
What on earth are you using that gets you down to single digits!?
Galanwe 20 hours ago [-]
> the best network I’ve ever benchmarked was AWS and measured about 55µs for a round trip
What is "a network" here?
Few infrastructures are optimised for latency, most are geared toward providing high throughput instead.
In fact, apart from HFT, I don't think most businesses are all that latency sensitive. Most infrastructure providers will give you SLAs of high single or low double digit microseconds from Mahwah/Carteret to NY4, but these are private/dedicated links. There's little point in optimising latency when your network ends up on the internet, where the smallest hops are milliseconds away.
jiggawatts 12 hours ago [-]
> There's little point to optimising latency when your network ends up on internet where the smallest hops are milliseconds away.
That's just plain wrong. Lower latency always improves everything. Not just responsiveness, but also bandwidth! Because of TCP slow-start and congestion control algorithms, lower latency directly results in higher throughputs.
Not to mention that these latencies add up, which is especially important with chatty microservices applications. Don't forget that typical TCP+HTTPS connections require something like 5 round trips, and that's assuming that the DNS record is already cached! Add in firewalls, load balancers, proxies, side-cars, ingress, and who knows what else, suddenly you're staring down the barrel of 15 millisecond latencies before the data can exit the data centre.
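(Rough, purely illustrative numbers: a TCP handshake costs 1 RTT, a full TLS 1.2 handshake roughly 2 more, plus at least 1 for the HTTP request/response itself, so ~4-5 round trips per fresh connection even with DNS cached. At ~50µs per round trip that's on the order of 0.25ms; at 1-3ms per round trip through a chain of proxies and load balancers you're quickly into the 5-15ms range.)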
The threshold for "instant" response is 16.7 ms end-to-end, including refreshing the HTML DOM and painting pixels to the screen.
Google and AWS know this, which is why their data centre networking has ~50µs latencies, some of the best in the industry.
Everyone else: "Nah, don't bother!"
Galanwe 3 hours ago [-]
I think you're getting pissed off at a strawman. Everyone obviously _cares_ about latency. All things equal, better latency always makes things better; there is no arguing with that.
Yet, that doesn't mean latency sits at the same spot on everyone's priority list. If you're using TCP on the internet, you have already put latency far down in your concerns. That doesn't make you _not want_ better latency, but it does make it a _nice to have_.
There's no obvious shortcut to latency that doesn't involve either losing reliability (not requiring ordered messages, not re-requesting dropped messages), losing throughput (not assembling small messages into bigger ones), or limiting yourself to private links.
If you give up none of those (as in TCP over the internet), then you've sacrificed nothing for latency at the expense of throughput or resiliency, which to me makes latency a nice to have, but certainly not a primary concern.
dahfizz 17 hours ago [-]
The key is that blibble is talking about switches. Modern switches can process packets at line rate.
If you're working in AWS, you are almost certainly hitting a router, which is comparatively slower. Not to mention you are dealing with virtualized hardware, and you are probably sharing all the switches & routers along your path (if someone else's packet is ahead of yours in the queue, you have to wait).
crest 24 hours ago [-]
I assume 1-3 hops of modern switches without congestion. Given 100Gb/s lanes these numbers are possible if you get all the bottlenecks out of the way. The moment you hit a deep queue the latency explodes.
jiggawatts 23 hours ago [-]
So, are you talking about theoretical latencies here based on bandwidths and cable lengths, or actual measured latencies end-to-end between hosts?
I know that "in principle" the physics of the cabling allows single digit microseconds, but I've never seen it anywhere near that low even with cross-over cables with zero switches in-path!
eqvinox 22 hours ago [-]
You need high bandwidth links (time to get the entire packet across starts to matter), run on bare metal (or have very well working HW virtualisation support), and tune NIC parameters and OS processing appropriately. But it's practically achievable.
Switches in these scenarios (e.g. 25GE DC targeted) are pretty predictable and add <1μs (unless misconfigured)
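(As one concrete example of the "tune OS processing" part, not from the parent comment but a sketch of a knob that exists on Linux: per-socket busy polling via SO_BUSY_POLL, which burns CPU spinning on the receive queue instead of sleeping on an interrupt. Values are illustrative.)

    /* Opt a socket into Linux busy polling: recv() will spin for up to
       busy_poll_us microseconds before falling back to interrupt-driven
       wakeup. May require CAP_NET_ADMIN depending on kernel version. */
    #include <stdio.h>
    #include <sys/socket.h>

    int enable_busy_poll(int fd) {
        int busy_poll_us = 50;   /* illustrative value */
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                       &busy_poll_us, sizeof busy_poll_us) < 0) {
            perror("setsockopt(SO_BUSY_POLL)");
            return -1;
        }
        return 0;
    }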
jiggawatts 12 hours ago [-]
> But it's practically achievable.
I've never seen this in practice. Maaaaybe with Infiniband and custom-written apps that use a proprietary SDK.
I'd love to see references to actual benchmarks.
blibble 21 hours ago [-]
that's because cloud networks are complete shit
this is xilinx/mellanox cards with kernel bypass and cut-through switches with busy-waiting
in reality, in a prod system
joeblubaugh 1 days ago [-]
It’s really frustrating that the HotOS paper itself has no details about the benchmarking, and the blog post just says “redis benchmark”. What was the system setup? Persistence options? What was ported to demikernel? The client writing, the server reading from the NIC? Based on the problem specified in the paper, I assume it’s the reading from the NIC that was implemented in DemiOS.
FridgeSeal 1 days ago [-]
This is a super cool idea, and it’s something that sounds fun to play with/try out.
Therefore, I eagerly await the inevitable influx of:
- “you don’t need it”
- “you’re not FAANG enough to justify it”,
- “seems overly complicated, my Python-on-Ubuntu-is-good-enough and who needs more”
Style comments telling us why we shouldn’t have fun things like this.
Anyone got any more comments to add to the bingo-card?
wmf 1 days ago [-]
Preemptive cynicism is even worse than regular cynicism.
dijksterhuis 1 days ago [-]
if you personally want to play with it, go ahead.
i think my personal feeling is that those sorts of comments you listed come out of the woodwork more when the comments section starts turning into an "oh man, this should be the standard for everyone" kind of discussion, which is never the case and is usually the point of those kinds of replies.
at least they are when i reply with those kinds of comments anyway
kd913 21 hours ago [-]
What is being asked for already exists? It is called Onload.
https://github.com/Xilinx-CNS/onload
it is my understanding that io_uring is the generalized open source implementation of this, although i do not think it bypasses the kernel fib trie like openonload does...
gpderetta 17 hours ago [-]
Aside from onload being open source, not really. AF_XDP is the generalized, hardware-agnostic version of kernel bypass.
In addition to bypass, onload also provides a full TCP/IP user-space stack and non-intrusive support for existing binaries using the standard BSD socket interface (incidentally, onload also supports XDP now).
io_uring is really for asynchronous communication with the kernel.
crest 24 hours ago [-]
For such an interface to be feasible to support in common open source infrastructure, it needs a pure software implementation for testing and development purposes. Even better would be something along the lines of coz to model performance by throttling down everything else proportionally.
r00tbeer 1 days ago [-]
See https://irenezhang.net/papers/demikernel-sosp21.pdf for a more thorough paper on the Demikernel from 2021. There are some great ideas for improving the kernel interface while still allowing efficient DPDK-style pipelines.
Gollapalli 1 days ago [-]
This is great! I think that there are a lot of latency-sensitive applications which really do need to be spared the kernel latency.
secondcoming 19 hours ago [-]
I looked at using DPDK on some of our GCP instances, but it requires setting up a second VPC, which was one hurdle too many.
I’m hoping that io_uring makes all of this unnecessary anyway.
I recall reading a paper where someone noticed that for every packet the Linux kernel receives it has to check if any application has opened a raw socket. Raw sockets are initially needed to allow DHCP to work, so once your machine has been assigned an IP address you can (probably) turn this service off and so give the kernel less work to do. (My memory of the exact details may be sketchy).
Polizeiposaune 16 hours ago [-]
DHCP issues address leases, not permanent assignments; leases have an expiration time (and earlier suggested renewal/rebind times). So the DHCP client must periodically renew -- if the tenant doesn't renew (perhaps because the DHCP client has been disabled), the DHCP service may lease the address to another tenant.
If the DHCP server hasn't moved to a new address this renewal can be done over unicast using the leased address - however, if the client doesn't receive a response from the server the client state machine will eventually discard the leased address and fall back to broadcast with an all-zeros source address (which is presumably what requires a raw socket).
The DHCP client implementation in question likely keeps the raw socket open for potential future use in this case. A client might be able to close the raw socket and reopen it later (but security folks might also want it to drop the privilege required to reopen the raw socket, and it might be hard to have an ironclad guarantee that the raw socket can be reopened later on a machine that's short on free kernel memory..).
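(For illustration only, not taken from any particular DHCP client: the "raw socket" in question is typically an AF_PACKET socket along these lines, usable before the interface has any IP address; while one is open, the kernel's receive path has an extra packet-socket delivery check to perform per packet.)

    /* Sketch of the kind of packet socket a DHCP client keeps open for the
       broadcast/fallback path described above. Requires CAP_NET_RAW. */
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/socket.h>

    int open_dhcp_packet_socket(const char *ifname) {
        int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
        if (fd < 0)
            return -1;

        struct sockaddr_ll addr;
        memset(&addr, 0, sizeof addr);
        addr.sll_family   = AF_PACKET;
        addr.sll_protocol = htons(ETH_P_IP);
        addr.sll_ifindex  = if_nametoindex(ifname);   /* e.g. "eth0" */
        if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0)
            return -1;
        return fd;
    }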
secondcoming 16 hours ago [-]
Not on GCP's GCE at least
Matthias247 17 hours ago [-]
io_uring reduces the overhead of system calls - but it doesn't do anything to reduce the overhead of the actual networking stack.
If your send/receive calls spend most of their CPU time going through the routing/fragmentation/filter/BPF/etc. path in the networking stack, then uring (or other APIs which just reduce the system call overhead, like sendmmsg/recvmmsg for UDP) might only make a small difference. Source: lots of profiling while implementing QUIC libraries.
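(For reference, a hedged sketch of the sendmmsg() batching mentioned above, assuming an already-connected UDP socket and made-up payloads: the syscall overhead is amortised over the batch, but each datagram still traverses the full stack.)

    /* Hand several UDP datagrams to the kernel in one system call. */
    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    int send_batch(int fd) {
        static const char *payloads[] = { "pkt0", "pkt1", "pkt2", "pkt3" };
        enum { N = 4 };
        struct iovec iov[N];
        struct mmsghdr msgs[N];
        memset(msgs, 0, sizeof msgs);

        for (int i = 0; i < N; i++) {
            iov[i].iov_base = (void *)payloads[i];
            iov[i].iov_len  = strlen(payloads[i]);
            msgs[i].msg_hdr.msg_iov    = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }
        /* Returns the number of datagrams actually sent. */
        return sendmmsg(fd, msgs, N, 0);
    }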
An alternative to DPDK that allows bypassing the kernel networking stack would be AF_XDP.