I have a little app that uploads to S3 via Soto. I just tried to upload a 5 GB file and it failed, because down inside ByteBuffer, it casts an Int capacity value to _Capacity, which is typealiased to UInt32. This seems pretty bad. A related type, _Index, is also UInt32.
S3 itself is limited to 5 GB for a single PUT operation, but objects can be 5 TB in size.
I can switch to multipart upload for this, but this still seems like a problem in NIO, no?
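Roughly the shape of the failure, as a minimal sketch (assuming a 64-bit machine with plenty of RAM; per the above, the problem is the Int-to-UInt32 capacity conversion inside ByteBuffer, not running out of memory):

```swift
import NIOCore

let allocator = ByteBufferAllocator()
let fiveGB = 5 * 1024 * 1024 * 1024              // 5_368_709_120, which is > UInt32.max
let buffer = allocator.buffer(capacity: fiveGB)  // per the report, the Int -> _Capacity (UInt32) conversion in here is what fails
_ = buffer
```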
Wouldn't this only be a limitation if you wanted to buffer more than UInt32.max bytes in memory? It shouldn't, in principle, limit you from uploading files of arbitrary size.
I'm not familiar with Soto or the specific Soto APIs you're using, but maybe the problem is in there, or in how you are calling it?
I think they're loading the whole file into RAM via ByteBuffer, and Soto/S3 recommends using multipart upload for files larger than 100 MB, so I found a workaround.
But ByteBuffer is at best inconsistent, taking Int in some places (or UnsafeRawBufferPointer, whose count is an Int) while taking _Capacity in most others.
I'm just not sure why it uses UInt32 instead of UInt or Int.
I think the link I posted above actually suggests using streaming APIs, without the need to move to multipart.
> Streaming Payloads
>
> When uploading or downloading large blocks of data it is preferable if you can stream this data instead of holding it all in memory. Soto supplies methods for streaming raw data payloads of both requests to AWS and responses from AWS. The most common use of this would be when uploading or downloading large objects to S3.
>
> — source: Soto - Streaming Payloads
I can't speak to the history of the type, but I don't think it's the limiting factor here.
Ignore my specific use case. I'm suggesting it's an inconsistency in ByteBuffer. There may well be times when one needs to allocate a buffer whose capacity needs more than 31 bits, and as-is it crashes if you go over 2 GB.
ByteBuffer should be limited to 4 GB, i.e. 32 bits, on all platforms, regardless of whether you use 32- or 64-bit machines. SwiftNIO is obviously still able to send and receive arbitrary amounts of data.
When uploading a file, you should be streaming it; otherwise you may suffer OOM kills anyway.
Any library that allows you to upload files should be doing this automatically. So if you're saying that Soto requires you to provide a single ByteBuffer holding all bytes, then I think you should file a Soto bug requesting streaming uploads.
If you need to hold more than 4 GB in memory at once, then ByteBuffer is not the type you want. And in general, if you're holding buffers of such enormous size you need to be a bit careful: any CoW copy would be disastrous (so maybe you want a specialised, immutable type without CoW?), and it won't work on all platforms. It's also probably an orange flag for the design of a piece of software if it requires single arbitrarily-sized memory buffers.
ByteBuffer also doesn't expose _Capacity; all public APIs take Int (which is an issue on 32-bit platforms, as Int can't represent 4 GB...).
The "(almost) always use Int" mantra is inherited from Swift such that Swift users don't constantly need to convert between the various integer types.
It's the same with Array. If you pass Array.reserveCapacity(-1) then it'll also crash. And Array.reserveCapacity(.max) will (likely) crash too because you won't have the memory (and most allocators don't want to overcommit by that much).
FWIW, ByteBuffer is designed with network safety in mind. In general you shouldn't just accept, for example, lengths straight off the network, but buffer.readSlice(length: Int.max) or length: -1 or so will not crash; it yields nil instead, just in case the user forgets to properly sanitise the values. All the read* APIs share this behaviour.
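A minimal sketch of that fail-soft read behaviour (nothing Soto-specific, just NIOCore):

```swift
import NIOCore

var buffer = ByteBufferAllocator().buffer(capacity: 64)
buffer.writeString("hello")

// Reads that can't be satisfied return nil instead of trapping.
let tooBig = buffer.readSlice(length: Int.max)  // nil, buffer untouched
let greeting = buffer.readSlice(length: 5)      // the 5 bytes of "hello"
```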
The write side is different; NIO can't protect you there. If you accept arbitrary amounts of data and load them into memory, then your application is trivially attackable and there's not much that can be done automatically. (If you use ByteToMessageDecoder, then that has built-in buffering limits that you can configure.)
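A rough sketch of opting into that limit; the decoder here is a stand-in, and the maximumBufferSize parameter name is from memory, so check your NIO version's docs:

```swift
import NIOCore

// Stand-in decoder just to show where the limit is configured.
struct ExampleDecoder: ByteToMessageDecoder {
    typealias InboundOut = ByteBuffer

    mutating func decode(context: ChannelHandlerContext,
                         buffer: inout ByteBuffer) throws -> DecodingState {
        // Real framing logic would go here.
        return .needMoreData
    }
}

// Cap in-memory buffering so a hostile peer can't grow the buffer forever.
let handler = ByteToMessageHandler(ExampleDecoder(), maximumBufferSize: 1 << 20)
```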
I too don't like that values that can never be negative are carried by signed integer variables, but I guess it simplifies some things.
But Array.reserveCapacity(2^33) will likely succeed (on a real machine, anyway, if not an iOS device).
It’s one thing to crash because you've exhausted memory. It's another to crash because the underlying code assumes it will be called with values less than 2^32.
I can understand your point of view, but Swift made a different decision there: everything is Int-based. SwiftNIO merely followed that.
The key reason is that constantly forcing users into converting integer types is very awkward and error-prone. Also subtracting unsigned integer types is extra dangerous because if they go under 0 it'll crash (underflow). But I think this isn't the right post to discuss Swift's API choices.
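A tiny illustration of that underflow hazard (hypothetical helpers, not NIO code):

```swift
// With unsigned indices, a bug that lets the reader overtake the writer
// traps immediately on the subtraction (underflow).
func readableBytes(readerIndex: UInt32, writerIndex: UInt32) -> UInt32 {
    writerIndex - readerIndex   // traps if readerIndex > writerIndex
}

// With Int, the same bug yields a negative number you can detect and handle.
func readableBytesSigned(readerIndex: Int, writerIndex: Int) -> Int {
    writerIndex - readerIndex
}
```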
Oh I see, that is internal though (as in, it could be changed without API changes), right? The reason it uses 32 bits internally is to save bits.
It's quite important for a Swift type as prominent as ByteBuffer to fit inside an existential box (which is 3 words, i.e. 24 bytes on 64-bit, which I'll assume from here on). ByteBuffer is (IIRC) 23 bytes wide, which leaves a few spare bits such that it can be an enum payload and the composite would still fit into 3 words. But ByteBuffer holds a pointer (1 word), a reader index, a writer index, a slice begin and a slice end. Naively this would be 5 words (40 bytes), but ByteBuffer packs it all into under 3 words. This can be a very significant performance win because it potentially saves an allocation each time a ByteBuffer goes into an existential box.
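You can check the size claim on your own machine (the 23-byte figure above is from memory):

```swift
import NIOCore

// ByteBuffer packs its pointer, indices and slice bounds tightly enough
// that a value still fits a 3-word (24-byte on 64-bit) existential box.
print(MemoryLayout<ByteBuffer>.size)    // 23 on 64-bit, per the post above
print(MemoryLayout<ByteBuffer>.stride)
```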
But of course, it's a tradeoff. I think it's a good one. If we were to show that it's a bad tradeoff, this could be changed at any point in time without breaking the API at all.
And just to reiterate: if you have a program that relies on storing many gigabytes in a single allocation, then that's an orange, if not red flag. And even if you do need to store that much, it's quite easy to build a composite type that internally holds [ByteBuffer], or you just use a specialised type for that (probably without CoW, likely ~Copyable).
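For illustration, a hypothetical composite along those lines (not an existing NIO type):

```swift
import NIOCore

// A "big buffer" built from chunks, sidestepping any single-allocation limit.
struct ChunkedBuffer {
    private var chunks: [ByteBuffer] = []

    var readableBytes: Int {
        chunks.reduce(0) { $0 + $1.readableBytes }
    }

    mutating func append(_ buffer: ByteBuffer) {
        chunks.append(buffer)
    }

    /// Hands back the next chunk, e.g. to feed a streaming upload.
    mutating func readChunk() -> ByteBuffer? {
        chunks.isEmpty ? nil : chunks.removeFirst()
    }
}
```

Each chunk stays well under the 32-bit limit, and a streaming upload can consume chunks one at a time without ever materialising the whole object in one allocation.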
The reasons why the internals work this way make sense; I think part of the complaint here is that the limit was unexpected, undocumented, and resulted in a crash.
This should be documented by saying what the limit is today, and in the future it should never decrease (but can increase).
Yes. I did look around for a comment explaining why it was limited to 32 bits, and didn't see one. Documenting that (and briefly explaining why, e.g. "for performance reasons," although a bit more might be nice) would help a lot.
And maybe some debug asserts to indicate what went wrong.
Yes, please send a request for documentation (or a PR adding docs) over.
The old bug you reference was different: if you had a 2 GB ByteBuffer on a 32-bit platform and added a byte, then ByteBuffer would internally double its capacity to 4 GB (which still fits into UInt32 but not Int on 32-bit platforms) and then cast that size to an Int. That would then crash in ByteBuffer-internal code. This has since been fixed.