Cdb: Add support for cdb64 (cdb.cr.yp.to)
wolfgang42 29 days ago [-]
CDB is an interesting format, optimized for read-heavy write-rarely[1] random lookups on slow media. This isn’t a very common requirement these days, but it’s convenient for very specific use cases.

[1] You “update” by overwriting the entire file. This is remarkably fast and means that there’s no overhead/tracking for empty space, but it does mean you probably want this to be a fairly rare operation.

I rolled my own cdb reader library for a project a few years ago, and wrote up my notes on the format and its internals here: https://search.feep.dev/blog/post/2022-12-03-cdb-file-format

stevefan1999 29 days ago [-]
Can't this be implemented as a PHF: https://en.wikipedia.org/wiki/Perfect_hash_function
a-dub 29 days ago [-]
GALACTIC SCALE QMAIL that can run efficiently on a 486 AND survive a supernova!
tptacek 29 days ago [-]
Haven't there been 64-bit ports of CDB for ages?
wolfgang42 29 days ago [-]
Yes, the modifications you need to support it are trivially obvious (literally just replace “4 bytes” with “8 bytes” everywhere in the spec) and have been implemented by a number of authors, some of which this page links to. I guess it’s nice that they’ve been “officially” acknowledged, though.
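To make the "4 bytes → 8 bytes" point concrete, here is a small Python sketch of how the field widths differ between classic cdb and a 64-bit variant. The struct formats illustrate the widening of every on-disk integer; the constants are from the published 32-bit layout, and the 64-bit sizes follow arithmetically:

```python
import struct

# Classic cdb: every on-disk number is an unsigned 32-bit little-endian value.
# The header is 256 (table position, slot count) pairs -> 256 * 8 = 2048 bytes.
cdb32_pair = struct.Struct("<LL")

# A 64-bit variant widens every field to u64, so the same header
# becomes 256 * 16 = 4096 bytes and record offsets can exceed 4 GiB.
cdb64_pair = struct.Struct("<QQ")

assert cdb32_pair.size == 8
assert cdb64_pair.size == 16
assert 256 * cdb32_pair.size == 2048
assert 256 * cdb64_pair.size == 4096
```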
eesmith 29 days ago [-]
And update the hash algorithm, yes?
eesmith 29 days ago [-]
In answer to my own question, no, except for the trivial expansion to 64-bits. https://cdb.cr.yp.to/cdb-20251021/cdb_hash.c.html with constants at https://cdb.cr.yp.to/cdb-20251021/cdb.h.html .
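For reference, the hash in question is djb's classic cdb hash: start at 5381, then for each byte do h = ((h << 5) + h) ^ c, truncated to 32 bits. A minimal Python transcription (of the algorithm, not the exact C at the links above):

```python
def cdb_hash(key: bytes) -> int:
    """cdb's hash: seed 5381, then h = ((h << 5) + h) ^ c per byte, mod 2**32."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

# The empty key hashes to the seed itself.
assert cdb_hash(b"") == 5381
```

The low 8 bits of the result select one of cdb's 256 tables, and the remaining bits pick the starting slot within that table; per the comment above, cdb64 keeps the function and only widens the field the result is stored in.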
justin66 28 days ago [-]
What's Bernstein working on that would have demanded this?
Bolwin 29 days ago [-]
Interesting, never heard of this before. I'm assuming the use case is when your data is too large to conveniently fit into memory?
tptacek 29 days ago [-]
It's a database for strictly exact-match lookups for very read-intensive workloads; think systems where the database only changes when the configuration changes, like email alias or domain lookups. It's very simple (a first-level hash table chaining to a second-level open-addressed hash table) and easy to get your head around, but also very limiting; an otherwise strict K-V system that uses b-trees instead of hash tables can do range queries, which you can build a lot of other stuff out of.

Most people would use Redis or SQLite today for what CDB was intended for; CDB will be faster, but for a lot of applications that speed improvement will be sub-threshold for users.
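The two-level structure described above is small enough to sketch end to end. Below is a hypothetical in-memory Python model of the classic 32-bit layout (a 2048-byte header of 256 (position, slot-count) pairs, then the records, then the open-addressed slot tables). It follows the published format but is illustrative, not a drop-in cdb implementation:

```python
import struct
from io import BytesIO

def cdb_hash(key: bytes) -> int:
    # cdb's hash: seed 5381, h = ((h << 5) + h) ^ c per byte, mod 2**32.
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def cdb_build(items) -> bytes:
    """Build a 32-bit cdb image in memory from (key, value) byte pairs."""
    out = BytesIO()
    out.write(b"\x00" * 2048)            # placeholder for the 256-entry header
    buckets = [[] for _ in range(256)]   # (hash, record offset) per bucket
    for key, value in items:
        h = cdb_hash(key)
        buckets[h & 0xFF].append((h, out.tell()))
        out.write(struct.pack("<LL", len(key), len(value)))
        out.write(key)
        out.write(value)
    header = b""
    for bucket in buckets:               # second level: open-addressed tables
        nslots = 2 * len(bucket)         # tables are kept at most half full
        pos = out.tell()
        header += struct.pack("<LL", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, off in bucket:            # linear probing on insert
            i = (h >> 8) % nslots
            while slots[i][1] != 0:      # offset 0 marks an empty slot
                i = (i + 1) % nslots
            slots[i] = (h, off)
        for h, off in slots:
            out.write(struct.pack("<LL", h, off))
    return header + out.getvalue()[2048:]

def cdb_get(img: bytes, key: bytes):
    """Exact-match lookup: header -> table -> probe -> verify key."""
    h = cdb_hash(key)
    pos, nslots = struct.unpack_from("<LL", img, (h & 0xFF) * 8)
    if nslots == 0:
        return None
    i = (h >> 8) % nslots
    for _ in range(nslots):
        sh, off = struct.unpack_from("<LL", img, pos + 8 * i)
        if off == 0:
            return None                  # empty slot: key is absent
        if sh == h:
            klen, dlen = struct.unpack_from("<LL", img, off)
            if img[off + 8:off + 8 + klen] == key:
                return img[off + 8 + klen:off + 8 + klen + dlen]
        i = (i + 1) % nslots
    return None
```

Note how the writer matches the "rebuild everything on update" model: the builder streams records and then appends the tables, so there is no way to modify an entry in place.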

kimos 29 days ago [-]
Great reply.

What comes to mind from my experience is storing full shipping rate tables for multiple shipping providers. Those change extremely rarely but are a high throughput exact lookup in a critical path (a checkout).

But we just implemented them in SQLite and deployed that file with the application. Simple, clean, effective, and fast. Maybe shipping rate data is smaller than this is intended for, but I doubt using this instead would see a consequential perf increase. Seems niche, like the domain name lookup example.

paws 29 days ago [-]
For me this answer was helpful and succinct, thank you.
dsr_ 29 days ago [-]
It is a database for when you read a lot and don't write too often; when a write might be pretty big but not frequent; when you don't want to write a database engine yourself (i.e. figure out what to write and when). And, especially, when corrupting the data would be a big problem.

And it is especially good on copy-on-write filesystems, because it is CoW itself.

bloppe 29 days ago [-]
So it's not constant?
tptacek 29 days ago [-]
The lookups are ~O(1).
renewiltord 29 days ago [-]
Nothing is truly constant lookup in number of elements in nature because we can’t pack it tighter than a sphere.
tombert 29 days ago [-]
I'm kind of surprised I hadn't heard of this, I could see this being something useful for a few projects. Historically for things in this space I've used RocksDB but RocksDB has given me headaches with unpredictable memory usage for large data sets.
tveita 29 days ago [-]
In the low-level DB space there is also https://dbmx.net/tkrzw/, of the Tokyo Cabinet / Kyoto Cabinet lineage.
waynesonfire 29 days ago [-]
cdb is a fun format to implement! highly recommend it.
binary132 29 days ago [-]
Now I’m curious about working around the writer limitations….
tptacek 29 days ago [-]
It's designed to rebuild the whole database with every write, and the format reflects that.
binary132 28 days ago [-]
yes — I have a few thoughts in mind for how to build a usable database around static chunks of that sort but I don’t suppose I’ll ever get around to it nor do I feel the need to explain myself beyond this. :)
gnabgib 29 days ago [-]
Title: cdb: Intro (please use the original title) https://news.ycombinator.com/newsguidelines.html
gchamonlive 29 days ago [-]
It'd never have crawled out of the new page with that title.
trvz 29 days ago [-]
No, the title is much better as it is.
gjvc 29 days ago [-]
weak