Amazon S3 Adds Put-If-Match (Compare-and-Swap) (aws.amazon.com)

524 points by Sirupsen 578 days ago | 160 comments

torginus 578 days ago [-]

Ah so it's not only me that uses AWS primitives for hackily implementing all sorts of synchronization primitives.

My other favorite pattern is implementing a pool of workers by querying ec2 instances with a certain tag in a stopped state and starting them. Starting the instance can succeed only once - that means I managed to snatch the machine. If it fails, I try again, grabbing another one.

This is one of those things that I never advertised out of professional shame, but it works, it's bulletproof and dead simple and does not require additional infra to work.

belter 578 days ago [-]

If you use hourly billed machines... Sounds like the world's most expensive semaphore :-)

messe 578 days ago [-]

EC2 bills by the second.

belter 578 days ago [-]

Some...

"Your Amazon EC2 usage is calculated by either the hour or the second based on the size of the instance, operating system, and the AWS Region where the instances are launched" - https://repost.aws/knowledge-center/ec2-instance-hour-billin...

https://aws.amazon.com/ec2/pricing/on-demand/

QuinnyPig 578 days ago [-]

MacOS instances appear to be the sole remaining exception since RHEL got on board.

redeux 578 days ago [-]

Thanks Corey. Always nice to get the TL;DR from an authority on the subject.

JoshTriplett 577 days ago [-]

With a one-minute minimum, unfortunately.

torginus 578 days ago [-]

except we are actually using them :)

belter 578 days ago [-]

Just don't call them before the hour and start a different one again. Because otherwise, within the hour, you will be billed for hundreds of hours... if they are of the type billed by the hour.

_zoltan_ 578 days ago [-]

this actually sounds interesting. do you precreate the workers beforehand and then just keep them in a stopped state?

torginus 578 days ago [-]

yeah. one of the goals was startup time, so it made sense to precreate them. In practice we never ran out of free machines (and if we did, I have a cdk script to make more), and infinite scaling is a pain in the butt anyways due to having to manage subnets etc.

Cost-wise we're only paying for the EBS volumes for the stopped instances which are like 4GB each, so they cost practically nothing, we spend less than a dollar per month for the whole bunch.

zild3d 578 days ago [-]

Warm pools are a supported feature in AWS on auto scaling groups. Works as you're describing (have a pool of instances in stopped state ready to use, only pay for EBS volume if relevant) https://aws.amazon.com/blogs/compute/scaling-your-applicatio...

zerd 567 days ago [-]

If you want even faster startups, restarting stopped instances is apparently faster: https://depot.dev/blog/faster-ec2-boot-time

rfoo 578 days ago [-]

> we spend less than a dollar per month for the whole bunch

This does not change the point, I'm just being pedantic, but:

4GB of gp3 EBS takes $0.32 per month, assuming a 50% discount (not unusual), less than a dollar gives only... 6 instances.
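
The arithmetic above can be checked with a few lines. This is only a sketch: it assumes a gp3 list price of $0.08 per GB-month (the EBS figure quoted elsewhere in this thread; real prices vary by region) and works in integer cents to avoid float rounding.

```python
# Worked version of the estimate above, in integer cents.
# Assumption: gp3 costs 8 cents per GB-month; the 50% discount is the
# commenter's own hypothetical, not a published rate.
GP3_CENTS_PER_GB_MONTH = 8
VOLUME_GB = 4
DISCOUNT_PERCENT = 50
BUDGET_CENTS = 100  # "less than a dollar"

list_cents = GP3_CENTS_PER_GB_MONTH * VOLUME_GB                   # per volume, before discount
discounted_cents = list_cents * (100 - DISCOUNT_PERCENT) // 100   # per volume, after discount
max_instances = BUDGET_CENTS // discounted_cents                  # volumes that fit in the budget

print(f"${list_cents / 100:.2f} list, ${discounted_cents / 100:.2f} discounted, "
      f"{max_instances} instances per dollar")
```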

merb 578 days ago [-]

I always thought that stopped instances will cost money as well?!

torginus 578 days ago [-]

You're only paying for the hard drive (and the VPC stuff, if you want to be pedantic). The downside is that if you try to start your instance, they might not start if AWS doesn't have the capacity (rare but have seen it happen, particularly with larger, more exotic instances.)

williamdclt 578 days ago [-]

What would you say would be the "clean" way to implement a pool of workers (using EC2 instances too)?

Cthulhu_ 578 days ago [-]

Autoscaling and task queue based workloads, if my cloud theory is still relevant.

twodave 578 days ago [-]

Agreed. Scaling based on the length of the queue, up to some maximum.

giovannibonetti 578 days ago [-]

Even better, based on queue latency instead of length

jcrites 578 days ago [-]

The single best metric I've found for scaling things like this is the percent of concurrent capacity that's in use. I wrote about this in a previous HN comment: https://news.ycombinator.com/item?id=41277046

Scaling on things like the length of the queue doesn't work very well at all in practice. A queue length of 100 might be horribly long in some workloads and insignificant in others, so scaling on queue length requires a lot of tuning that must be adjusted over time as the workload changes. Scaling based on percent of concurrent capacity can work for most workloads, and tends to remain stable over time even as workloads change.
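
One way to read "scale on percent of concurrent capacity in use" is as a target-utilization calculation. The sketch below is an interpretation, not the commenter's implementation: `desired_workers`, its parameters, and the 70% target are all hypothetical names and values chosen for illustration.

```python
def desired_workers(busy_slots: int, slots_per_worker: int, target_percent: int,
                    min_workers: int = 1, max_workers: int = 100) -> int:
    """Return the worker count that keeps utilization at or below target_percent.

    busy_slots       -- concurrent jobs currently running across the fleet
    slots_per_worker -- how many jobs one worker can run at once
    target_percent   -- desired share of total slots in use (1..100)
    """
    if not 1 <= target_percent <= 100:
        raise ValueError("target_percent must be between 1 and 100")
    if slots_per_worker <= 0:
        raise ValueError("slots_per_worker must be positive")
    # Integer ceiling of busy_slots / (slots_per_worker * target_percent / 100).
    needed = -(-(busy_slots * 100) // (slots_per_worker * target_percent))
    return max(min_workers, min(max_workers, needed))


# 70 busy slots, 10 slots per worker, aim for 70% utilization.
print(desired_workers(70, 10, 70))
```

Unlike a queue-length threshold, the only workload-specific number here is `slots_per_worker`; the target percentage itself does not need retuning when job sizes change.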

torginus 577 days ago [-]

Yeah this is why I hate AWS - I did a similar task runner thing and what I ended up doing is just firing up a small controller instance which manually creates and destroys instances based on demand, and schedules work on them by ssh-ing into the running instances, and piping the logs to a db.

I did read up on the 'proper' solution and it made my head spin.

You're supposed to use AWS Batch, create instances with autoscaling groups, pipe the logs to CloudWatch, and serve them from there on the frontend, etc.

The number of new concepts I'd have to master, I have no control over if they went wrong, except to chase after internet erudites and spending weeks talking to AWS support is staggering.

And there's the little things, like CloudWatch logs costing like $0.5/GB, while an EBS block volume costs like $0.08, with S3 being even cheaper than that.

If I go full AWS word salad, I'm pretty sure even the most wizened AWS sages would have no idea what my bills would look like.

Yeah, my solution is shit and I'm a filthy subhuman, but at least I know how every part of my code works, and the amount of code I've had to write is not more than double what it would be if I used AWS solutions, but I probably saved a lot of time debugging proprietary infra.

ndjdjddjsjj 577 days ago [-]

It is a shame that comment is not a blog post!

Lanzaa 577 days ago [-]

You will like the Strange Loop 2017 talk about this subject:

"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore https://www.youtube.com/watch?v=m64SWl9bfvk

Concurrent capacity might not be the best metric.

ndjdjddjsjj 577 days ago [-]

Chef's kiss!

stolsvik 574 days ago [-]

Duty Time, straight out.

torginus 578 days ago [-]

not sure, probably either an eks cluster with a job scheduler pod that creates jobs via the batch api. The scheduler pod might be replaced by a lambda. Another possibility is something cooked up with a lambda creating ec2 instances via cdk and the whole thing is kept track by a dynamodb table.

the first one is probably cleaner (though I don't like it, it means that I need the instance to be a kubernetes node, and that comes with a bunch of baggage).

ndjdjddjsjj 578 days ago [-]

etcd?

JoshTriplett 578 days ago [-]

It's also possible to enforce the use of conditional writes: https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3...

My biggest wishlist item for S3 is the ability to enforce that an object is named with a name that matches its hash. (With a modern hash considered secure, not MD5 or SHA1, though it isn't supported for those either.) That would make it much easier to build content-addressable storage.

josnyder 578 days ago [-]

While it can't be done server-side, this can be done straightforwardly in a signer service, and the signer doesn't need to interact with the payloads being uploaded. In other words, a tiny signer can act as a control plane for massive quantities of uploaded data.

The client sends the request headers (including the x-amz-content-sha256 header) to the signer, and the signer responds with a valid S3 PUT request (minus body). The client takes the signer's response, appends its chosen request payload, and uploads it to S3. With such a system, you can implement a signer in a lambda function, and the lambda function enforces the content-addressed invariant.

Unfortunately it doesn't work natively with multipart: while SigV4+S3 enables you to enforce the SHA256 of each individual part, you can't enforce the SHA256 of the entire object. If you really want, you can invent your own tree hashing format atop SHA256, and enforce content-addressability on that.
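
The invariant such a signer would enforce can be sketched on its own. This is a minimal illustration that deliberately omits the SigV4 signing step; `may_sign_put` is a hypothetical helper, and it assumes the object key is exactly the lowercase hex SHA-256 the client declares.

```python
import hashlib
import re
from typing import Dict

# 64 lowercase hex characters, i.e. a SHA-256 digest.
HEX_SHA256 = re.compile(r"[0-9a-f]{64}")


def may_sign_put(object_key: str, headers: Dict[str, str]) -> bool:
    """Return True only if the key equals the payload hash the client commits to.

    The signer never sees the payload. It relies on S3 rejecting an upload whose
    body does not match the signed x-amz-content-sha256 header.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    declared = normalized.get("x-amz-content-sha256", "")
    # Values such as "UNSIGNED-PAYLOAD" fail the hex check and are refused.
    return HEX_SHA256.fullmatch(declared) is not None and object_key == declared


payload = b"example payload"
digest = hashlib.sha256(payload).hexdigest()
print(may_sign_put(digest, {"x-amz-content-sha256": digest}))        # content-addressed
print(may_sign_put("arbitrary-name", {"x-amz-content-sha256": digest}))  # name mismatch
```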

I have a blog post [1] that goes into more depth on signers in general.

[1] https://josnyder.com/blog/2024/patterns_in_s3_data_access.ht...

JoshTriplett 578 days ago [-]

That's incredibly interesting, thank you! That's a really creative approach, and it looks like it might work for me.

UltraSane 578 days ago [-]

S3 has supported SHA-256 as a checksum algo since 2022. You can calculate the hash locally and then specify that hash in the PutObject call. S3 will calculate the hash and compare it with the hash in the PutObject call and reject the Put if they differ. The hash and algo are then stored in the object's metadata. You simply also use the SHA-256 hash as the key for the object.
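
As a rough sketch of the client side: compute the digest once, use the hex form as the key and the base64 form as the checksum. The function name `content_address` is made up for this example, and the upload call is only described in a comment so the snippet runs without AWS credentials.

```python
import base64
import hashlib
from typing import Tuple


def content_address(data: bytes) -> Tuple[str, str]:
    """Return (hex digest to use as the object key, base64 digest for the checksum)."""
    digest = hashlib.sha256(data).digest()
    return digest.hex(), base64.b64encode(digest).decode("ascii")


key, checksum_b64 = content_address(b"hello")
print(key)
print(checksum_b64)

# Assumption: with boto3 the upload would look roughly like
#   s3.put_object(Bucket=..., Key=key, Body=data, ChecksumSHA256=checksum_b64)
# so that S3 rejects the request if the body does not hash to checksum_b64.
```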

https://aws.amazon.com/blogs/aws/new-additional-checksum-alg...

thayne 578 days ago [-]

Unfortunately, for a multi-part upload it isn't a hash of the total object, it is a hash of the hashes for each part, which is a lot less useful. Especially if you don't know how the file was partitioned during upload.

And even if it was for the whole file, it isn't used for the ETag, so it can't be used for conditional PUTs.

I had a use case where this looked really promising, then I ran into the multipart upload limitations, and ended up using my own custom metadata for the sha256sum.

infogulch 578 days ago [-]

If parts are aligned on a 1024-byte boundary and you know each part's start offset, it should be possible to use the internals of a BLAKE3 tree to get the final hash of all the parts together even as they're uploaded separately. https://github.com/C2SP/C2SP/blob/main/BLAKE3.md#13-tree-has...

Edit: This is actually already implemented in the Bao project, which exploits the structure of the BLAKE3 Merkle tree to offer cool features like streaming verification and verifying slices of a file as I described above: https://github.com/oconnor663/bao#verifying-slices

UltraSane 577 days ago [-]

That is very neat! I love clever uses of data structures like this.

vdm 578 days ago [-]

Ways to control etag/Additional Checksums without configuring clients:

CopyObject writes a single part object and can read from a multipart object, as long as the parts total less than the 5 gibibyte limit for a single part.

For future writes, s3:ObjectCreated:CompleteMultipartUpload event can trigger CopyObject, else defrag to policy size parts. Boto copy() with multipart_chunksize configured is the most convenient implementation, other SDKs lack an equivalent.

For past writes, existing multipart objects can be selected from inventory filtering ETag column length greater than 32 characters. Dividing object size by part size might hint if part size is policy.
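
The inventory filter described above can be sketched as follows. It assumes the commonly observed multipart ETag shape of 32 hex characters, a hyphen, and the part count; `parse_etag` and `matches_policy` are hypothetical helper names, and the 8 MiB policy size is only an example.

```python
from typing import Optional, Tuple


def parse_etag(etag: str) -> Tuple[str, Optional[int]]:
    """Split an ETag into (digest, part_count); part_count is None for single-part objects."""
    value = etag.strip('"')
    digest, separator, parts = value.partition("-")
    return digest, (int(parts) if separator else None)


def matches_policy(object_size: int, part_count: int, policy_part_size: int) -> bool:
    """Hint whether an object was uploaded with the policy part size.

    Every part except the last has the policy size, so the expected number of
    parts is the ceiling of object_size / policy_part_size.
    """
    expected_parts = -(-object_size // policy_part_size)
    return expected_parts == part_count


digest, part_count = parse_etag('"9b2cf535f27731c974343645a3985328-3"')
print(digest, part_count)
print(matches_policy(20 * 1024 * 1024, part_count, 8 * 1024 * 1024))
```

This is only a hint: different part sizes can produce the same part count, so a match does not prove the policy size was used.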

vdm 578 days ago [-]

> Dividing object size by part size

Correction: and also part quantity (parsed from etag) for comparison

vdm 578 days ago [-]

Don't the SDKs take care of computing the multi-part checksum during upload?

> To create a trailing checksum when using an AWS SDK, populate the ChecksumAlgorithm parameter with your preferred algorithm. The SDK uses that algorithm to calculate the checksum for your object (or object parts) and automatically appends it to the end of your upload request. This behavior saves you time because Amazon S3 performs both the verification and upload of your data in a single pass. https://docs.aws.amazon.com/AmazonS3/latest/userguide/checki...

tedk-42 578 days ago [-]

It does and has a good default. An issue I've come across though is you have the file locally and you want to check the e-tag value - you'll have to do this locally first and then compare the value to the S3 stored object.

vdm 578 days ago [-]

https://github.com/peak/s3hash

It would be nice if this got updated for Additional Checksums.

texthompson 578 days ago [-]

That's interesting. Would you want it to be something like a bucket setting, like "any time an object is uploaded, don't let an object write complete unless S3 verifies that a pre-defined hash function (like SHA256) is called to verify that the object's name matches the object's contents?"

BikiniPrince 578 days ago [-]

You can already put with a sha256 hash. If it fails it just returns an error.

jiggawatts 578 days ago [-]

That will probably never happen because of the fundamental nature of blob storage.

Individual objects are split into multiple blocks, each of which can be stored independently on different underlying servers. Each can see its own block, but not any other block.

Calculating a hash like SHA256 would require a sequential scan through all blocks. This could be done with a minimum of network traffic if instead of streaming the bytes to a central server to hash, the hash state is forwarded from block server to block server in sequence. Still though, it would be a very slow serial operation that could be fairly chatty too if there are many tiny blocks.

What could work would be to use a Merkle tree hash construction where some of subdivision boundaries match the block sizes.
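
A toy version of that idea, to make it concrete: each block is hashed independently (possibly on a different server, in any order), and only the 32-byte leaf hashes need to travel to whoever combines them. This is an illustrative construction with made-up domain-separation prefixes, not BLAKE3 and not anything S3 implements.

```python
import hashlib
from typing import List


def leaf_hash(block: bytes) -> bytes:
    """Hash one block; the 0x00 prefix keeps leaves distinct from interior nodes."""
    return hashlib.sha256(b"\x00" + block).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash two child hashes into their parent."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Combine leaf hashes pairwise until one root remains."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = list(leaves)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(node_hash(level[i], level[i + 1]))
            else:
                next_level.append(level[i])  # odd node is promoted unchanged
        level = next_level
    return level[0]


blocks = [b"part-0", b"part-1", b"part-2"]
# Leaves can be computed in any order, e.g. as parallel uploads complete.
hashes = {index: leaf_hash(blocks[index]) for index in (2, 0, 1)}
root = merkle_root([hashes[index] for index in range(len(blocks))])
print(root.hex())
```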

texthompson 578 days ago [-]

Why would you PUT an object, then download it again to a central server in the first place? If a service is accepting an upload of the bytes, it is already doing a pass over all the bytes anyway. It doesn't seem like a ton of overhead to calculate SHA256 in the 4096-byte chunks as the upload progresses. I suspect that sort of calculation would happen anyways.

willglynn 578 days ago [-]

You're right, and in fact S3 does this with the `ETag:` header… in the simple case.

S3 also supports more complicated cases where the entire object may not be visible to any single component while it is being written, and in those cases, `ETag:` works differently.

> * Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.

> * Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.

> * If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption. If an object is larger than 16 MB, the AWS Management Console will upload or copy that object as a Multipart Upload, and therefore the ETag will not be an MD5 digest.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.h...

danielheath 578 days ago [-]

S3 supports multipart uploads which don’t necessarily send all the parts to the same server.

texthompson 578 days ago [-]

Why does it matter where the bytes are stored at rest? Isn't everything you need for SHA-256 just the results of the SHA-256 algorithm on every 4096-byte block? I think you could just calculate that as the data is streamed in.

jiggawatts 578 days ago [-]

The data is not necessarily "streamed" in! That's a significant design feature to allow parallel uploads of a single object using many parts ("blocks"). See: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMu...

Dylan16807 578 days ago [-]

> Isn't everything you need for SHA-256 just the results of the SHA-256 algorithm on every 4096-byte block?

No, you need the hash of the previous block before you can start processing the next block.
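
A quick way to see the ordering dependency with `hashlib`: feeding the chunks in order reproduces the whole-object digest, while feeding the same chunks in a different order does not. The 4096-byte chunk size just mirrors the question; SHA-256's internal block size is smaller.

```python
import hashlib

# 16 KiB of data in which every 4096-byte chunk is different.
data = b"".join(i.to_bytes(4, "big") for i in range(4096))
chunks = [data[i:i + 4096] for i in range(0, len(data), 4096)]

whole = hashlib.sha256(data).hexdigest()

# Streaming works, but only because each update() continues from the
# internal state left behind by the previous chunk.
in_order = hashlib.sha256()
for chunk in chunks:
    in_order.update(chunk)

# The same chunks arriving in another order (as parallel parts might)
# produce a different digest.
out_of_order = hashlib.sha256()
for chunk in reversed(chunks):
    out_of_order.update(chunk)

print(in_order.hexdigest() == whole)
print(out_of_order.hexdigest() == whole)
```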

flakes 578 days ago [-]

You have just re-invented IPFS! https://en.m.wikipedia.org/wiki/InterPlanetary_File_System

losteric 578 days ago [-]

Why does the architecture of blob storage matter? The hash can be calculated as data streams in for the first write, before data gets dispersed into multiple physically stored blocks.

willglynn 578 days ago [-]

It is common to use multipart uploads for large objects, since this both increases throughput and decreases latency. Individual part uploads can happen in parallel and complete in any sequence. There's no architectural requirement that an entire object pass through a single system on either S3's side or on the client's side.

Salgat 578 days ago [-]

Isn't that the point of the metadata? Calculate the hash ahead of time and store it in the metadata as part of the atomic commit for the blob (at least for S3).

cmeacham98 578 days ago [-]

Is there any reason you can't enforce that restriction on your side? Or are you saying you want S3 to automatically set the name for you based on the hash?

JoshTriplett 578 days ago [-]

> Is there any reason you can't enforce that restriction on your side?

I'd like to set IAM permissions for a role, so that that role can add objects to the content-addressable store, but only if their name matches the hash of their content.

> Or are you saying you want S3 to automatically set the name for you based on the hash?

I'm happy to name the files myself, if I can get S3 to enforce that. But sure, if it were easier, I'd be thrilled to have S3 name the files by hash, and/or support retrieving files by hash.

mdavidn 578 days ago [-]

I think you can presign PutObject calls that validate a particular SHA-256 checksum. An API endpoint, e.g. in a Lambda, can effectively enforce this rule. It unfortunately won’t work on multipart uploads except on individual parts.

UltraSane 578 days ago [-]

The hash of multipart uploads is simply the hash of all the part hashes. I've been able to replicate it.
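
A sketch of that "hash of hashes" computation, for reference. The exact encoding here (SHA-256 over the concatenated raw part digests, base64-encoded, with a `-N` part-count suffix) is my understanding of the composite checksum format and should be treated as an assumption to verify against a real multipart object.

```python
import base64
import hashlib
from typing import List


def composite_sha256(parts: List[bytes]) -> str:
    """Hash each part, then hash the concatenation of the raw part digests."""
    part_digests = [hashlib.sha256(part).digest() for part in parts]
    combined = hashlib.sha256(b"".join(part_digests)).digest()
    return f"{base64.b64encode(combined).decode('ascii')}-{len(parts)}"


parts = [b"a" * 10, b"b" * 10, b"c" * 5]
composite = composite_sha256(parts)
whole_object = base64.b64encode(hashlib.sha256(b"".join(parts)).digest()).decode("ascii")

print(composite)
print(whole_object)  # differs: the composite value depends on where the part boundaries fall
```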

thayne 578 days ago [-]

But in order to do that you need to already know the contents of the file.

I suppose you could have some API to request a signed url for a certain hash, but that starts getting complicated, especially if you need support for multi-part uploads, which you probably do.

JoshTriplett 578 days ago [-]

Unfortunately, last I checked, the list of headers you're allowed to enforce for pre-signing does not include the hash.

anotheraccount9 578 days ago [-]

Could you use a meta field from the object and save the hash in it, running a compare from it?

Sirupsen 578 days ago [-]

To avoid any dependencies other than object storage, we've been making use of this in our database (turbopuffer.com) for consensus and concurrency control since day one. Been waiting for this since the day we launched on Google Cloud Storage ~1 year ago. Our bet that S3 would get it in a reasonable time-frame worked out!

https://turbopuffer.com/blog/turbopuffer

amazingamazing 578 days ago [-]

Interesting that what’s basically an ad is the top comment - it’s not like this is open source or anything - can’t even use it immediately (you have to apply for access). Totally proprietary. At least Elasticsearch is AGPL, to say nothing of OpenSearch, which also supports use of S3.

viraptor 578 days ago [-]

Someone made an informed technical bet that worked out. Sounds like HN material to me. (Also, is it really a useful ad if you can't easily use the product?)

amazingamazing 578 days ago [-]

Worked out how? There’s no implementation. It’s just conjecture.

viraptor 578 days ago [-]

It's right there:

> Our bet that S3 would get it in a reasonable time-frame worked out!

amazingamazing 578 days ago [-]

How? This is a technical forum. Unless you’re saying any consumer of S3 can now spam links to their product on this thread with impunity. (Hey maybe they’re using cas).

richardlblair 578 days ago [-]

Oh look, someone is mad on the internet about something silly.

hedora 578 days ago [-]

Pretty much all other S3 implementations (including open source ones) support this or equivalent primitives, so this is great for interoperability with existing implementations.

ramraj07 578 days ago [-]

No one owes anyone open source. If they can make the business case work or if it works in their favor, sure.

jrochkind1 578 days ago [-]

I don't mind hearing another developer's use case for this feature, even if it's commercial proprietary software.

It's no longer top comment, which is fine.

jauntywundrkind 578 days ago [-]

https://github.com/slatedb/slatedb will, I expect, use this at some point. Object backed DB, which is open source.

benesch 578 days ago [-]

Yes! I’m actively working on it, in fact. We’re waiting on the next release of the Rust `object_store` crate, which will bring support for S3’s native conditional puts.

If you want to follow along: https://github.com/slatedb/slatedb/issues/164

deanCommie 577 days ago [-]

I mean isn't the news story itself essentially an ad?

CobrastanJorji 578 days ago [-]

I'm glad that bet worked out for you, but what made you think one year ago that S3 would introduce it soon that was untrue for the previous 15 years?

1a527dd5 578 days ago [-]

Be still my beating heart. I have lived to see this day.

Genuinely, we've wanted this for ages and we got half way there with strong consistency.

ncruces 578 days ago [-]

Might finally be possible to do this on S3: https://pkg.go.dev/github.com/ncruces/go-gcp/gmutex

phrotoma 578 days ago [-]

Huh. Does this mean that the AWS terraform provider could implement state locking without the need for a DDB table the way the GCP provider does?

arianvanp 578 days ago [-]

Correct

phrotoma 577 days ago [-]

Holy crap that is fantastic!

rekwah 568 days ago [-]

I started looking into this but DeleteObject doesn't support these conditional headers on general purpose buckets; only directory buckets (S3 Express One Zone).

paulddraper 578 days ago [-]

So....given CAP, which one did they give up

moralestapia 578 days ago [-]

A tiny bit of availability, unnoticeable at web scale.

nimih 578 days ago [-]

Based on my general experience with S3, they jettisoned A years ago (or maybe never had it).

the_arun 578 days ago [-]

I thought they have implemented Optimistic locking now to coordinate concurrent writes. How does it change anything in CAP?

paulddraper 578 days ago [-]

The C stands for Consistency.

johnrob 578 days ago [-]

I’d wager that the algorithm is slightly eager to throw a consistency error if it’s unable to verify across partitions. Since the caller is naturally ready for this error, it’s likely not a problem. So in short it’s the P :)

alanyilunli 578 days ago [-]

Shouldn't that be the A then? Since the network partition is still there but availability is non-guaranteed.

johnrob 578 days ago [-]

Yes, definitely. Good point (I was knee jerk assuming the A is always chosen and the real “choice” is between C and P).

btown 578 days ago [-]

https://tqdev.com/2024-the-p-in-cap-is-for-performance is a really interesting take on this as a response to https://blog.dtornow.com/the-cap-theorem.-the-bad-the-bad-th... - essentially, the only way to get CA is if you're willing to say that every request will succeed eventually, but it might take an unbounded amount of time for partitions to heal, and you have to be willing to wait indefinitely for that to happen. Which can indeed make sense for asynchronous messaging, but not for real-time applications as we think about them in the modern day. In practice, if you're talking about CAP for high-performance systems, you're choosing either CP or AP.

rhaen 578 days ago [-]

Well, P isn't really much of a choice, I don't think you can opt out of acts of god.

fwip 578 days ago [-]

You can design to minimize P, though. For instance, if you have all the services running on the same physical box, and make people enter the room to use it instead of over the Internet, "partition" becomes much less likely. (This example is a bit silly.)

But you're right, if you take a broad view of P, the choice is really between consistency and availability.

paulddraper 576 days ago [-]

Yes. P is kinda fundamental to your setup.

For example, running S3 locally or not.

CubsFan1060 578 days ago [-]

I feel dumb for asking this, but can someone explain why this is such a big deal? I’m not quite sure I am grokking it yet.

lxgr 578 days ago [-]

If my memory of parallel algorithms class serves me right, you can build any synchronization algorithm on top of compare-and-swap as an atomic primitive.

As a (horribly inefficient, in case of non-trivial write contention) toy example, you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this:

- Download the current database copy

- Perform your write locally

- Upload it back using "Put-If-Match" and the pre-edit copy as the matched object.

- If you get success, consider the transaction successful.

- If you get failure, go back to step 1 and try again.
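
The steps above can be sketched as a retry loop. To keep it runnable without AWS, this uses a hypothetical in-memory `FakeBucket` whose "ETag" is just a version counter; the class, its methods, and `cas_update` are illustration only and not the S3 API.

```python
from typing import Callable, Dict, Optional, Tuple


class PreconditionFailed(Exception):
    """Raised when the stored object no longer matches the expected ETag."""


class FakeBucket:
    """In-memory stand-in for a bucket with put-if-match semantics."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[bytes, int]] = {}

    def get(self, key: str) -> Tuple[bytes, str]:
        body, version = self._objects[key]
        return body, str(version)

    def put(self, key: str, body: bytes, if_match: Optional[str] = None) -> str:
        current = self._objects.get(key)
        current_etag = None if current is None else str(current[1])
        if if_match is not None and if_match != current_etag:
            raise PreconditionFailed(key)
        version = 1 if current is None else current[1] + 1
        self._objects[key] = (body, version)
        return str(version)


def cas_update(bucket: FakeBucket, key: str,
               mutate: Callable[[bytes], bytes], max_attempts: int = 10) -> int:
    """Download, modify locally, upload if unchanged; retry on conflict.

    Returns the number of attempts it took.
    """
    for attempt in range(1, max_attempts + 1):
        body, etag = bucket.get(key)                  # download the current copy
        new_body = mutate(body)                       # perform the write locally
        try:
            bucket.put(key, new_body, if_match=etag)  # upload only if nobody else wrote
            return attempt
        except PreconditionFailed:
            continue                                  # someone else won; start over
    raise RuntimeError("gave up after too many conflicting writes")


bucket = FakeBucket()
bucket.put("db", b"v0")
calls = {"count": 0}


def append_row(body: bytes) -> bytes:
    calls["count"] += 1
    if calls["count"] == 1:
        # Simulate a competing writer landing between our read and our write.
        bucket.put("db", body + b"|other")
    return body + b"|mine"


attempts = cas_update(bucket, "db", append_row)
print(attempts, bucket.get("db")[0])
```

The first attempt is rejected because the competing write changed the version; the second attempt re-reads, so neither update is lost.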

CobrastanJorji 578 days ago [-]

It is often very important to know, when you write an object, what the previous state was. Say you sold plushies and you had 100 plushies in a warehouse. You create a file "remainingPlushies.txt" that stores "100". If somebody buys a plushie, you read the file, and if it's bigger than 0, you subtract 1, write the new version of the file, and okay the sale.

Without conditional writes, two instances of your application might both read "100", both subtract 1, and both write "99". If they checked the file afterward, both would think everything was fine. But things aren't fine because you've actually sold two.
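
Here is the plushie race written out as a deterministic interleaving. The `store` dict, the version counter standing in for an ETag, and the `read`/`write` helpers are all hypothetical; the point is only to show the lost update and how a conditional write catches it.

```python
class PreconditionFailed(Exception):
    """The object changed since it was read."""


KEY = "remainingPlushies.txt"
store = {KEY: ("100", 1)}  # key -> (body, version)


def read(key):
    return store[key]


def write(key, body, expected_version=None):
    _, version = store[key]
    if expected_version is not None and expected_version != version:
        raise PreconditionFailed(key)
    store[key] = (body, version + 1)


# Unconditional writes: both instances read "100" before either writes.
a_body, a_version = read(KEY)
b_body, b_version = read(KEY)
write(KEY, str(int(a_body) - 1))
write(KEY, str(int(b_body) - 1))
unconditional_result = read(KEY)[0]  # two sales, but the count dropped by one

# Conditional writes: same interleaving, but each write names the version it read.
store[KEY] = ("100", 1)
a_body, a_version = read(KEY)
b_body, b_version = read(KEY)
write(KEY, str(int(a_body) - 1), expected_version=a_version)
try:
    write(KEY, str(int(b_body) - 1), expected_version=b_version)
    second_sale_accepted = True
except PreconditionFailed:
    second_sale_accepted = False     # instance B must re-read and retry
conditional_result = read(KEY)[0]

print(unconditional_result, conditional_result, second_sale_accepted)
```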

The other cloud storage providers have had these sorts of conditional write features since basically forever, and it's always been really weird that S3 has lacked them.

Sirupsen 578 days ago [-]

The short of it is that building a database on top of object storage has generally required a complicated, distributed system for consensus/metadata. CAS makes it possible to build these big data systems without any other dependencies. This is a win for simplicity and reliability.

CubsFan1060 578 days ago [-]

Thanks! Do they mention when the comparison is done? Is it before, after, or during an upload? (For instance, if I have a 4tb file in a multi part upload, would I only know it would fail as soon as the whole file is uploaded?)

timmg 578 days ago [-]

(I assume) it will fail if the ETag doesn't match, the instant it gets the header.

The main point of it is: I have an object that I want to mutate. I think I have the latest version in memory. So I update in memory and upload it to S3 with the eTag of the version I have and tell it to only commit if that is the latest version. If it "fails", I re-download the object, re-apply the mutation, and try again.

poincaredisk 578 days ago [-]

I imagine, for it to make sense, that the comparison is done at the last possible moment, before atomically swapping the file contents.

lxgr 578 days ago [-]

Practically, they could do both: Do an early reject of a given POST in case the ETag does not match, but re-validate this just before swapping out the objects (and committing to considering the given request as the successful one globally).

That said, I'm not sure if common HTTP libraries look at response headers before they're done posting a request body, or if that's even allowed/possible in HTTP? It seems feasible at a first glance with chunked encoding, at least.

Edit: Upon looking a bit, it seems that informational response codes, e.g. 100 (Continue) in combination with Expect 100-continue in the requests, could enable just that and avoid an extra GET with If-Match.

Nevermark 578 days ago [-]

I can imagine it might be useful to make this a choice for databases with high frequency small swaps and occasional large ones.

1) default, load-compare-&-swap for small fast load/swaps.

2) optional, compare-load-&-swap to allow a large load to pass its compare, and cut in front of all the fast small swaps that would otherwise create an un-hittable moving target during its long loads for its own compare.

3) If the load itself was stable relative to the compare, then it could be pre-loaded and swapped into a holding location, followed by as many fast compare-&-swaps as needed to get it into the right location.

jayd16 578 days ago [-]

When you upload a change you can know you're not clobbering changes you never saw.

ramraj07 578 days ago [-]

Brilliant single line that is better than every other description above. Kudos.

papichulo2023 578 days ago [-]

I think it's called write after write (WAW) if I remember correctly.

maglite77 578 days ago [-]

Noting that Azure Blob storage supports e-tag / optimistic controls as well (via If-Match conditions)[1], how does this differ? Or is it the same feature?

[1]: https://learn.microsoft.com/en-us/azure/storage/blobs/concur...

simonw 578 days ago [-]

It's the same feature. Google Cloud Storage has it too: https://cloud.google.com/storage/docs/request-preconditions#...

koolba 578 days ago [-]

This combined with the read-after-write consistency guarantee is a perfect building block (pun intended) for incremental append only storage atop an object store. It solves the biggest problem with coordinating multiple writers to a WAL.
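
One possible shape for multi-writer appends, as a sketch only. Note the swapped-in technique: this uses create-if-absent (If-None-Match style) on numbered segment objects rather than If-Match on a single object, and the `FakeBucket` class, the `wal/` key layout, and `append_record` are all hypothetical.

```python
from typing import Dict


class AlreadyExists(Exception):
    """The segment key was claimed by another writer."""


class FakeBucket:
    """In-memory stand-in offering only a create-if-absent write."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put_if_absent(self, key: str, body: bytes) -> None:
        if key in self.objects:
            raise AlreadyExists(key)
        self.objects[key] = body


def append_record(bucket: FakeBucket, next_seq: int, record: bytes) -> int:
    """Claim the first free sequence number at or after next_seq; return it."""
    seq = next_seq
    while True:
        key = f"wal/{seq:08d}"
        try:
            bucket.put_if_absent(key, record)
            return seq
        except AlreadyExists:
            seq += 1  # another writer took this slot; try the next one


bucket = FakeBucket()
# Two writers both believe the log is empty and race for slot 0.
seq_a = append_record(bucket, 0, b"record from A")
seq_b = append_record(bucket, 0, b"record from B")
print(seq_a, seq_b, sorted(bucket.objects))
```

Each record lands in exactly one segment and no writer overwrites another; readers recover the log order by listing keys in sequence.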

IgorPartola 578 days ago [-]

Rename for objects and “directories” also. Atomic.

ncruces 578 days ago [-]

Both this and read-after-write consistency are single-object.

So coordinating writes to multiple objects still requires… creativity.

offmycloud 578 days ago [-]

If the default ETag algorithm for non-encrypted, non-multipart uploads in AWS is a plain MD5 hash, is this subject to failure for object data with MD5 collisions?

I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.

revnode 578 days ago [-]

MD5 hash collisions are unlikely to happen at random. The defect was that you can make it happen purposefully, making it useless for security.

aphantastic 578 days ago [-]

Sure, but theoretically you could have a system where a distributed log of user generated content is built via this CAS//MD5 primitive. A malicious actor could craft the data such that entries are dropped.

revnode 577 days ago [-]

My understanding of the feature, and correct me if I'm wrong, is that you are not granted write access based on a hash. You already have write access. You can use the hash to avoid overwriting someone else's data that was appended to the file in between you checking the file and writing to it. If you already have write access, the hash is irrelevant. As a bad actor, you can corrupt the data without it.

MD5 should not be used for anything security related. Granting write access based on an MD5 hash would be a huge no-no.

aphantastic 577 days ago [-]

Right, the issue comes when a trusted writer is logging data that is sourced from an untrusted party.

Imagine a transaction log being a blob per-customer with many lines corresponding to price, sku, etc, that additionally have some “memo” field provided by the customer. A trusted distributed worker process is responsible for taking incoming requests by the user, pulling their blob down, appending the line based on the request, and CAS’ing it back in (retrying on failure). With enough effort, a particularly devious user could issue many requests with ‘memo’s engineered to not alter the MD5 of their log. This would cause some lines to be lost. An audit of their account transaction log would be unable to accurately reflect the requests they made to the service, and the failure would be invisible.

This is obviously a bit contrived – I’ll be the first to admit. But if the incentives were to exist for this to be worth someone’s time for some system, I think it would be likely to see it come up eventually.

UltraSane 578 days ago [-]

The default ETag is used to detect bit errors, and MD5 is fine for that. S3 does support using SHA256 instead.

CobrastanJorji 578 days ago [-]

With Google Cloud Storage, you can solve this by conditionally writing based on the "generation number" of the object, which always increases with each new write, so you can know whether the object has been overwritten regardless of its contents. I think Azure also has an equivalent.

ipython 578 days ago [-]

I can't wait to see what abomination Corey Quinn can come up with now given this new primitive! (see previous work abusing Route53 as a database: https://www.lastweekinaws.com/blog/route-53-amazons-premier-...)

amazingamazing 578 days ago [-]

Ironically with this and lambda you could make a serverless sqlite by mapping pages to objects, using http range reads to read the db and lambda to translate queries to the writes in the appropriate pages via cas. Prior to this it would require a server to handle concurrent writers, making the whole thing a nonstarter for “serverless”.

Too bad performance would be terrible without a caching layer (ebs).

captn3m0 578 days ago [-]

For read heavy workloads, you could cache the results at cloudfront. Maybe we will someday see Wordpress-on-Lambda-to-Sqlite-over-S3.

sillysaurusx 578 days ago [-]

Finally. GCP has had this for a long time. Years ago I was surprised S3 didn’t.

ncruces 578 days ago [-]

GCS is just missing x-amz-copy-source-range in my book.

Can we have this Google?

…

Please?

seansmccullough 577 days ago [-]

Azure Storage has also had this for years - https://learn.microsoft.com/en-us/rest/api/storageservices/s...

mannyv 578 days ago [-]

GCP still didn't have triggers out of beta last time I checked (which was a while ago).

BrandonY 578 days ago [-]

We do have Cloud Run Functions that trigger on Cloud Storage events, as well as Cloud Pub/Sub notifications for the same. Is there a specific bit of functionality you're looking for?

fragmede 578 days ago [-]

Gmail was in beta for five years, I don't think that label really means anything.

UltraSane 578 days ago [-]

It means that Google doesn't want to offer an SLA

sitkack 578 days ago [-]

Not that it matters. It just changes the volume and timing of "I believe I did bob"

m_d_ 578 days ago [-]

s3fs's https://github.com/fsspec/s3fs/pull/917 was in response to the IfNoneMatch feature from the summer. How would people imagine this new feature being surfaced in a filesystem abstraction?

spprashant 578 days ago [-]

I had no idea people rely on S3 beyond dumb storage. It almost feels like people are trying to build out a distributed OLAP database in the reverse direction.

amne 578 days ago [-]

1. SELECT ... INTO OUTFILE S3

2. Glue jobs to partition by the columns that reporting uses

3. query with athena

4. ???

5. profit (celebrate reduced cost)

This thing costs a couple of dollars a month for ~500 GB of data. Snowflake wanted crazy amounts of money for the same thing.

vytautask 578 days ago [-]

MinIO, an open-source implementation of Amazon S3, has had it for almost two years (relevant post: https://blog.min.io/leading-the-way-minios-conditional-write...). Strangely, Amazon is catching up just now.

topspin 578 days ago [-]

That's not "strange" to me. Object storage has been a long time coming, and it's still being figured out: the entirely typical process of discovering useful and feasible primitives that expand applicability to more sophisticated problems. This is obviously going to occur first in smaller and/or younger, more agile implementations, whereas AWS has the problem of implementing this at pretty much the largest conceivable scale with zero risk. The lag is, therefore, entirely unsurprising.

aseipp 578 days ago [-]

It's not surprising at all. The scale of AWS, in particular S3, is nearly unfathomable, and the kind of solutions they need for "simple" things are totally different at that size. S3 was doing 1.1 million requests a second back in 2013.[1]

I wouldn't be surprised if they saw over 100 million requests a second globally by now. That's 100 million requests a second that need strong read-your-write consistency and atomicity at global scale. The number of pieces they had to move into place for this to happen is probably quite the engineering tale.

[1] https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-obje...

tonymet 578 days ago [-]

good example of how a simple feature on the surface (a header comparison) requires tremendous complexity and capacity on the backend.

akira2501 578 days ago [-]

S3 is rated as "durable" as opposed to "best effort." It has lots of interesting guarantees as a result.

tonymet 578 days ago [-]

Also they are faithful to their consistency commitments

wanderingmind 578 days ago [-]

Does this mean, in theory we will be able to manage multiple concurrent writes/updates to s3 without having to use new solutions like Regatta[1] that was recently launched?

[1] https://news.ycombinator.com/item?id=42174204

huntaub 578 days ago [-]

Here's how I would think about this. Regatta isn't the best way to add synchronization primitives to S3, if you're already using the S3 API and able to change your code. Regatta is most useful when you need a local disk, or a higher performance version of S3. In this case, the addition of these new primitives actually just makes Regatta work better for our customers -- because we get to achieve even stronger consistency.

gravitronic 578 days ago [-]

First thing I thought when I saw the headline was "oh! I should tell Sirupsen"

lttlrck 578 days ago [-]

Isn't this compare-and-set rather than compare-and-swap?

rrr_oh_man 578 days ago [-]

Could anybody explain for the uninitiated?

msoad 578 days ago [-]

It ensures that when you try to upload (or “put”) a new version of a file, the operation only succeeds if the file on the server still has the exact version (ETag) you specify. If someone else has updated the file in the meantime, your upload is blocked to prevent overwriting their changes.

This is especially useful in scenarios where multiple users or processes are working on the same data, as it helps maintain consistency and avoids accidental overwrites.

This uses the same mechanism as HTTP's `If-Match` header (the counterpart of `If-None-Match`), so it's easier to implement and learn.
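To make that concrete, here is a small simulation of the semantics, assuming a plain dict as the bucket. `conditional_put` and `etag_of` are hypothetical helpers, not S3 or SDK calls; the 412 is the standard HTTP "Precondition Failed" status for a failed `If-Match`.

```python
import hashlib


def etag_of(body: bytes) -> str:
    """Content-derived version tag, standing in for the object's ETag."""
    return hashlib.md5(body).hexdigest()


def conditional_put(bucket: dict, key: str, body: bytes, if_match: str) -> int:
    """Store `body` and return 200 if the current ETag equals `if_match`, else return 412."""
    if etag_of(bucket[key]) != if_match:
        return 412  # Precondition Failed: someone else wrote first
    bucket[key] = body
    return 200


bucket = {"config.json": b'{"replicas": 1}'}

# Alice and Bob both read the same version of the file.
alice_etag = etag_of(bucket["config.json"])
bob_etag = etag_of(bucket["config.json"])

# Alice writes first, so her ETag still matches.
alice_status = conditional_put(bucket, "config.json", b'{"replicas": 2}', alice_etag)
# Bob's ETag is now stale, so his upload is refused instead of clobbering Alice's.
bob_status = conditional_put(bucket, "config.json", b'{"replicas": 5}', bob_etag)
```

Bob's next move would be to re-read the object, reapply his change on top of Alice's, and try again with the fresh ETag.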

rrr_oh_man 578 days ago [-]

Thank you! That was extremely helpful (and written in a way that is easy to understand)!

stevefan1999 578 days ago [-]

So...are we closer to getting to use S3 as a...you guessed it...a database? With CAS, we are probably able to get a basic level of atomicity, and S3 itself is pretty durable, now we have to deal with consistency and isolation...although S3 branded itself as "eventually consistent"...

User23 578 days ago [-]

There was a great deal of interest in gossip protocols, eventual consistency, and such at Amazon in the mid oughts. So much so that they hired a certain Cornell professor along with the better part of his grad students to build out those technologies.

mr_toad 578 days ago [-]

People who want all those features use something like Delta Lake on top of object storage.

gynther 578 days ago [-]

S3 is strongly consistent since 4 years ago. https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...

vlovich123 578 days ago [-]

I implemented that extension in R2 at launch IIRC. Thanks for catching up & helping move distributed storage applications a meaningful step forward. Intended sincerely. I'm sure adding this was non-trivial for a complex legacy codebase like that.

anonymousDan 578 days ago [-]

Would be interesting to understand how they've implemented it and whether there is any perf impact on other API calls.

dvektor 578 days ago [-]

[rejected] error: failed to push some refs to remote repository

Finally we can have this with s3 :)

mdaniel 578 days ago [-]

Relevant: https://github.com/awslabs/git-remote-s3#readme https://news.ycombinator.com/item?id=41887004

paulsutter 578 days ago [-]

What’s amazing is that it took them so long to add these functions

thayne 578 days ago [-]

Now if only you had more control over the ETag, so you could use a sha256 of the total file (even for multi-part uploads), or a version counter, or a global counter from an external system, or a logical hash of the content as opposed to a hash of the bytes.

londons_explore 578 days ago [-]

So we can now implement S3-as-RAM for a worldwide million-core linux VM?

juggli 578 days ago [-]

finally

throwaway314155 578 days ago [-]

[flagged]

earth2mars 578 days ago [-]

What is stopping you from doing it now? I know Q is not good (hallucinates, slow, requires sign-in), but it's wiser to explain what your gripe is than to state something you can always do anyway.

throwaway314155 578 days ago [-]

My gripe was with the Explainer modal that covers the entire article upon visiting the site.

ramon156 578 days ago [-]

Honestly if it was fast and uninvasive, I wouldn't mind it at all

grahamj 578 days ago [-]

bender_neat.gif

serbrech 578 days ago [-]

Why is standard etag support making the frontpage?
