New high-quality hash measures 71GB/s on M4 (github.com)
jandrewrogers 5 hours ago [-]
Without getting too deep in the technical weeds, I will assert that rapidhash is an excellent small-key hash function and approximately the state-of-the-art for that purpose. There is nothing else better that I am aware of in this scope.

For bulk hashing there are better functions, but XXH3, GXHash, etc. are not among them, being both slower and weaker than the state of the art. Hash functions that can't pass quality tests are not serious contenders.

kragen 4 hours ago [-]
I feel like I'm missing an enormous amount of context here, and I can't be the only one.

How much better is it than old standbys like hashpjw, FNV1A, or djb's favorite¹, and how? It's apparently 350 lines of code instead of 1–10 lines of code. That seems like an important way that it's worse. And it comes with a copyright license that requires you to add it to your documentation, so forgetting to do that creates a risk of copyright infringement. So, how big is the benefit?

How much faster is it hashing a small key like the string "__umulh"? Is it actually slower? (For those needing an introduction to hash tables, covering why that question makes sense, Chris Wellons is always excellent: https://nullprogram.com/blog/2025/01/19/)

Why does it have an array commented "default secret parameters" in the public GitHub repo? Are we supposed to change those? Do they provide some kind of security, and if so, what kind, against what threat model?

The readme and https://scholar.google.com/scholar?hl=es&as_sdt=0%2C5&q=rapi... are no help here. https://github.com/rurban/smhasher?tab=readme-ov-file#summar... explains some of the questions at greater length but of course not rapidhash's particular answers.

______

¹ while (s[i]) h = h * 33 + s[i++]; about 11 machine-code instructions in its full-fledged form: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
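For reference, the full-fledged form is roughly this (a sketch of djb2 with its usual 5381 seed, modulo signedness choices):

    /* djb2: Bernstein's classic string hash, h = h*33 + c */
    unsigned long djb2(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }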

ajross 13 minutes ago [-]
> How much better

That's the real rub. Hashed data structures were a revelation in the early '90s when we all started using Perl. None of this work really changes that calculus. It's not like we're all going to be chatting at the water cooler about "Remember the days when all we had was djb2?! How the world has changed!"

Still interesting, but unless you're directly involved in a heavily benchmarked runtime, it's probably not worth trying to integrate.

cb321 3 hours ago [-]
I'm not sure I can answer your other questions like copyrights & secrets, but the "how is it faster?" is actually pretty easy and worth knowing/saying for anyone unaware - the hash you present is "1 byte at a time". All fast modern functions competing in this space try to do at least 8 bytes at a time, perhaps a whole 64B cache line. They do tend to have high "startup/finalize" overheads, though, which can matter (can easily be dominant) for use in hash tables with modest key lengths instead of hashing large files.
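
To make that concrete, here is a toy sketch of the wide-word pattern (made-up mixing constants, not rapidhash's or anyone else's actual scheme):

    #include <stdint.h>
    #include <string.h>

    /* Sketch: consume 8 bytes per step instead of 1; the multiply
       constants are arbitrary, chosen only for illustration. */
    uint64_t wide_hash(const uint8_t *p, size_t n)
    {
        uint64_t h = 0;
        while (n >= 8) {              /* bulk loop: one 64-bit load per step */
            uint64_t w;
            memcpy(&w, p, 8);         /* safe unaligned load */
            h = (h ^ w) * 0x9E3779B97F4A7C15ull;
            p += 8; n -= 8;
        }
        while (n--)                   /* tail: finish byte by byte */
            h = (h ^ *p++) * 0x100000001B3ull;
        return h;
    }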

The rest of this comment is kind of ELI5 (I hope!) not because I think @kragen needs that, but because maybe other readers will benefit and I know he leans towards long form text, and @jandrewrogers basically alludes to this stuff in his comment opening this sub-thread. EDIT: as do a few other posters here like at least @curiouscoding https://news.ycombinator.com/item?id=44012740

______

As for benefit - that always depends on "Compared to what??", your use cases, what CPU you expect things to run on, and so much else. As just a few examples - last I knew, the SMHasher performance tests blocked inlining, which already makes them unrealistic comparisons, especially for really tiny code-footprint functions like your FNV1a example. Your higher-level table structure might also be caching the hash code for table resize/as a comparison prefix for lookups. So, how much you have to hash at all may depend on doing that, predictability of table sizes, etc. How effective the comparison-prefix idea even is can depend upon how common long prefixes are in your key sets, which may be literally unknown - could be pathnames from `find` with tons of prefix similarity or none at all.
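
As one illustration of the caching idea (hypothetical names, not any particular library):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical table entry: the cached full hash is reused on resize
       and doubles as a comparison prefix, so most non-matching probes
       never touch the key bytes at all. */
    struct entry {
        uint64_t hash;
        const char *key;
        void *value;
    };

    int entry_matches(const struct entry *e, uint64_t h, const char *key)
    {
        return e->hash == h && strcmp(e->key, key) == 0;
    }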

Furthermore, the entire culture of presentation of performance in this space is sadly "rate" - the reciprocal of what is easi(er) to reason about, namely "time". Why? "Time is additive". Even the headline here is "GB/s". Cycles/byte or ns/byte is just easier. You want to present a time equation, perHashCost + perByteCost*nBytes, where both unknown coefficients, if inferred from real times on a modern time-sharing OS, probably have significant statistical variation. If there is a highly non-random distribution of kicks/competition/interference, it will usually be easier to see in the time domain; the reciprocal transformation distorts everything. Anyway, a time cost equation (like Knuth v1-3 of days of yore!) also lets you easily compute the break-even string length where the rate starts to dominate fixed overheads, namely nBytes_breakEven = perHashCost/perByteCost.

It also lets you estimate an answer to your question of "when does this matter": if your strings are typically under 32B (I believe an early-days limit on identifier length for many C compilers, but someone can correct me if that's wrong) and the perByteCost is 0.1 ns/byte (10 GB/s - just a round number), then break-even (when per-hash starts to matter more) is 0.1*32 = 3.2 ns, or 16..24 CPU cycles on most modern deployments. Said the other way, if your perHash overhead is 16..24 cycles and your hash throughput is 10 GB/s, then you had better hope your strings are longer than 32 bytes on average, or you've probably made a bad choice.
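
In code form, with those same assumed round numbers (not measurements):

    #include <stdio.h>

    /* time(n) = perHash + perByte*n, so fixed overhead dominates below
       n = perHash/perByte; values are the assumed round numbers above. */
    int main(void)
    {
        double perHash = 3.2;  /* ns per call: ~16 cycles at 5 GHz */
        double perByte = 0.1;  /* ns per byte: 10 GB/s */
        printf("break-even length: %.0f bytes\n", perHash / perByte); /* 32 */
        return 0;
    }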

One way in which an equation fails is that it does oversimplify the performance profile, even for this simple problem. That profile often has plateaus/cliffs at cache-line-size and integer-size boundaries, but rates fare no better there. Even so, if you are presenting a bunch of hashes, a table sortable by perHash and perByte costs would be the best starting point for most evaluations, with the slope as "just an average"/effective value. A plot is best, though.

kragen 1 hour ago [-]
I wasn't asking how it could be faster; I was asking how much faster it was (or otherwise better), for small keys like "__umulh", where the startup/finalize overheads are worst. Clearly GB/s is a good measure if you're hashing files or Flash FTL blocks, but jandrewrogers specifically said it's state of the art for the purpose of small-key hashing, which to me means things like dictionary words and variable or method names. Maybe that's not what they meant?

Even on a fairly wimpy microcontroller the hash I linked above only has about four instructions of startup/finalize overhead, even without inlining: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename... (I modified it slightly because GCC was puzzlingly not reducing my array indexing to pointer arithmetic...)

cb321 24 minutes ago [-]
Fair enough. Sorry! You were pretty clear.

I continue to think that the world would be less confusing if people didn't go around citing GB/s for hash-table hashes, but rather (perHash ± err, perByte ± err) on CPU xyz. That cuts right to the heart of the matter, and then folks like jandrewrogers don't have to be vague with "small" to inspire questions like yours. (Of course, CPUs really vary a lot, too.)

I also think when people measure these fast string hashes on strings bigger than the L1 CPU cache, they're mostly measuring a melange of the IO/memory system, not the hash computation - maybe sometimes unwittingly. L1 CPU caches can deliver 100s of GB/s, after all: one 64B cache line per cycle is 64B/0.2ns = 320 GB/s, and Intel's Skylake could do that a decade ago. The memory wall can be a real bear.

spiffytech 3 hours ago [-]
Can you share more about the quality tests XXH3 fails? I don't know this space deeply, so I'm having trouble reconciling why XXH3 is popular if it fails tests.
aidenn0 10 hours ago [-]
At some point, hashes are fast enough. I'm not going to be switching from xxhash to shave 40 picoseconds off of hashing small keys, and for large keys it's already a few orders of magnitude faster than my NVMe drive.
jandrewrogers 5 hours ago [-]
Quality matters. XXH3 fails something like 15% of the tests in SMHasher3. There are faster hash functions with much better quality profiles.
leumassuehtam 22 minutes ago [-]
Since you've mentioned it, which hash functions are better while being faster than XXH3?
neonsunset 10 hours ago [-]
If whatever you are doing is heavy on hashmap lookups (and you are ok with not rewriting into something more bespoke+complicated), then the faster the hash function and the cheaper the baseline cost of calling it, the better (i.e. XXH3 can have disadvantages here, with its popular implementation dispatching to the necessary routine).

This looks like an interesting potential alternative to GxHash. GxHash is nice, but sadly I found AES intrinsics to have somewhat high latency on AMD's Zen 2/3 cores, making it a loss on short strings (but until I measured it on our server hardware, the M4 Max sure had me fooled; it has way better latency despite retiring more AES operations!).

xkcd1963 8 hours ago [-]
Question: what do you do if there is a collision? I saw the GitHub table also mentions collisions.
lights0123 8 hours ago [-]
They're extremely common in hashtables. You follow standard open addressing or separate chaining procedures: https://en.m.wikipedia.org/wiki/Hash_collision
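
For example, a minimal open-addressing (linear-probing) lookup might look like this (a sketch, not any particular implementation):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch: 'cap' is a power of two, empty slots have key == NULL,
       and the load factor stays below 1 so the loop terminates. */
    struct slot { const char *key; void *val; };

    void *lookup(struct slot *slots, size_t cap, uint64_t h, const char *key)
    {
        for (size_t i = h & (cap - 1); slots[i].key; i = (i + 1) & (cap - 1))
            if (strcmp(slots[i].key, key) == 0)
                return slots[i].val;  /* hit */
        return NULL;                  /* reached an empty slot: miss */
    }
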
LoganDark 6 hours ago [-]
Xorakios 50 minutes ago [-]
Do people still use mobile devices? I thought that was like the original Star Trek?
meindnoch 4 hours ago [-]
Open any data structures textbook and look for "hash map" in the table of contents.
saagarjha 8 hours ago [-]
Fall back on secondary hashes or do probing
steeve 3 hours ago [-]
No, just because you’re hashing every ms doesn’t mean we all do.
realo 34 minutes ago [-]
At a few ns per hash, the hash is done before photons from your screen reach your eyes.

Impressive!

CyberDildonics 17 minutes ago [-]
There is a big difference between throughput and latency. If you do 8 at once with SIMD it doesn't mean you can get a single one done in 1/8th the time.
curiouscoding 6 hours ago [-]
I'd love to see some benchmarks/comparison on variable length strings. For strings with random length between 10 and 30, gxhash was significantly faster than xxhash for me. I would assume because it processes chunks of up to 32 chars at a time, avoiding branch misses.

Generally my feeling is that at these speeds, designing for branch-predictability for short strings might be more important than absolute throughput.

rurban 7 hours ago [-]
It's not the fastest on smhasher anymore. umash is better and faster, and gxhash is twice as fast. It is, however, very hashmap-friendly because of its small size. It's a wyhash variant.
nicoshev11 7 hours ago [-]
Hey Reini, the rapidhash version on the SMHasher repo is outdated. The latest version was published last Tuesday, I'll submit it to SMHasher shortly. This new version is much faster than the previous one, and still passes all tests. It should beat umash on your benchmarks.
Sesse__ 6 hours ago [-]
Am I right in that RAPIDHASH_UNROLLED is now mandatory and the function is thus much larger now?
wpollock 11 hours ago [-]
Is rapidhash cache-friendly?
wmf 10 hours ago [-]
Hash functions don't have any data reuse so none of them are cache-friendly.
benwills 8 hours ago [-]
I think they may be asking about the CPU cache.
Dylan16807 8 hours ago [-]
You have to go out of your way to make a hash that doesn't fit into L1, so again they're all basically the same.

You'll probably end up fitting entirely inside the reorder buffer plus a sequential stream from memory, with the actual caches almost irrelevant.

benwills 8 hours ago [-]
Sure. Any worthwhile hash function will fit in the instruction cache. But there are ways to make more or less efficient use of the data cache.
Sesse__ 5 hours ago [-]
_Fitting_ in the instruction cache isn't hard, but you'd also ideally let there be room for some other code as well :-) For a hash map lookup, where the hashing is frequently inlined a couple of times, code size matters.
Dylan16807 8 hours ago [-]
> Any worthwhile hash function will fit in the instruction cache.

Yes, I'm talking about the data cache.

> But there are ways to make more or less efficient use of the data cache.

How?

You need to touch every byte of input, and nothing should be faster than going right through from start to end.

benwills 8 hours ago [-]
I don't know your experience with hash functions, so you may already know what I'm about to say.

This is a minor example, but since you asked...

https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L6432

That's an example of a fair number of accumulators that are stored as XXHash goes through its input buffer.

Many modern hash functions store more state/accumulators than they used to. Previous generations of hash functions would often just have one or two accumulators and run through the data. Many modern hash functions might even store multiple wider SIMD variables for better mixing.

And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.
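
The pattern looks roughly like this (a sketch with arbitrary odd constants; not XXH3's actual stripe logic):

    #include <stdint.h>
    #include <string.h>

    /* Sketch: four independent accumulator lanes hide multiply latency;
       real functions fold the tail and mix the lanes much more carefully. */
    uint64_t lanes_hash(const uint8_t *p, size_t n)
    {
        uint64_t a = 1, b = 2, c = 3, d = 4;   /* seeds assumed for brevity */
        while (n >= 32) {
            uint64_t w[4];
            memcpy(w, p, 32);                  /* one 32-byte stripe */
            a = (a ^ w[0]) * 0xC2B2AE3D27D4EB4Full;
            b = (b ^ w[1]) * 0x165667B19E3779F9ull;
            c = (c ^ w[2]) * 0x27D4EB2F165667C5ull;
            d = (d ^ w[3]) * 0x9E3779B97F4A7C15ull;
            p += 32; n -= 32;
        }
        return a ^ b ^ c ^ d;  /* tail bytes ignored in this sketch */
    }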

Dylan16807 8 hours ago [-]
> And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.

And there are 150+ registers in the actual chip.

But my argument is more that there isn't really an efficient or inefficient way to use L1. So unless you have an enormous amount of state, the question is moot. And if you have so much state you're spilling to L2, that's not when you worry about good or bad cache use, that's a weird bloat problem.

bandrami 6 hours ago [-]
I saw the first four words and then was disappointed by the direction it took...
coolcase 3 hours ago [-]
Or first 2 words