Fixed sized Data

Terje · December 1, 2021, 7:41pm

I want to read a file in parts or pages of e.g. 4k or 16k (or whatever the optimal may be). So FileHandle can be used to read chucks of the file that is returned as Data.

Being able to use FileHandle and Data is nice because they do all the heavy lifting.

However these filepages are fixed in size. I am always overwriting bytes in various places, never adding or removing. The thing is that Data does the optimal thing and allocates more memory than necessary in anticipation of ever more data (i.e. appending). For example, a quick test shows that 16K data has a capacity of 20480 bytes. Which is 25% more. I rather use that for keeping more pages in memory.

Do I have any recourse other than reinventing a (square) wheel that is Data and FileHandle?
If it makes any difference: I am using a wrapper around Data to use it as a reference type. Though multiple copies will exist in case of (nested) transactions.

A general answer or guidelines is preferred over explicit code.

tera · December 2, 2021, 12:33am

consider using mmap - for you that'd be like "memory" and the heavy lifting of the actual reading / writing would be done inside.

Terje · December 2, 2021, 4:53am

Yes, you suggested that before. I’m not entirely sure I’m on the right path though.

My FilePager would thus have these:
mmap(of a FileHandle) - [page] being read - [page] being modified in transaction 1 - … - [page] being modified in nested transaction n

Thus I still need a page object which is a fixed sized bag of bytes, no?
A new transaction implies copying the bytes upon write (cow).

Instead of wrapping Data, I cow the bytes in a unsafeMutableRawBuffer or unsafeMutableBuffer<UInt8> of fixed size, then?

Using mmap directly instead of through FileHandle is that the mapping is reused. Or, come to think of it, am I misinterpreting everything and I should be using a mmap per page? 1 file = 1 map I would think.

Undo or revert would then just drop everything up to transaction m.
There is also the issue of (todo:) a WAL file. Though that will probably sort itself out since that is “just” writing to file b before copying parts of file b into file a.

eskimo · December 2, 2021, 8:47am

Keep in mind that memory mapping is only safe if you can guarantee that the file is on a volume that can’t go away (more accurately, where it going away would cause catastrophic failure in your app anyway). It’s not safe, for example, to memory map a user document because those can exist on volumes that go away [1]. If the underlying volume goes away then any access turns into a memory access exception [2], which are very challenging to deal with.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

[1] Historically this issue only cropped up on macOS, but file providers mean that iOS apps now have to deal with this as well.

[2] In the absence of MAP_RESILIENT_MEDIA, but that has its own issues.

Terje · December 2, 2021, 9:53am

Ah yes, right, and also services like iCloud and Dropbox which can yank the file from underneath you. It’s the reason CoreData based documents and iCloud don’t mix and match.

For a database package I can only set the appropriate file flag to “don’t use iCloud”. (would have to look up which flag. See photos database file for example. ) And advice my users (if any) to keep this mind like they would have to do with SQLite and similar databases.

So I should give FilePager.init a parameter to choose between read and mmap (which is more efficient but not always safe)? Ah, and for very large files relative to the available memory it would result in paging.

It does mean using c-functions instead of staying in pure Swift. I don’t see a way around it though. Yes, I know, FileHandle uses them too so the distinction is paper thin.

mmap vs read
3.1 Disk drives that do not honor sync requests

tera · December 2, 2021, 11:49am

Undo / transactions add a whole new dimension to the complexity here of course. One strategy would be to work on a file copy (if the copy on the same AFS volume it is cheap and instant) and when you want to commit a transaction - atomically exchange the two file contents (exchangedata). On AFS you may store multiple copies, just bear in mind even if the two copies are different in single byte the whole file allocation block size will be used for the change (4K or so), so I won't use this for undo, which I would handle on a higher level.

BTW, you may get a memory exception even without files on removable volumes...

if let p = malloc(big size) {
    accessing bytes of p // can cause a fatal memory access exception
}

eskimo · December 3, 2021, 9:29am

choose between read and mmap (which is more efficient but not
always safe)

I’m concerned about this whole “more efficient” thing. My experience is that lots of folks assume that memory mapping is more efficient because of the way that Unix-y kernels were traditionally designed. However, the Darwin kernel supports no-copy transfers [1], so you don’t have to use memory mapping to avoid copies.

Consider this quote from the post you referenced:

If you are planning on copying out all 200mb of the [file] then mmap
is likely a faster approach.

This is almost certainly bad advice on Darwin (and I suspect on other modern Unix platforms). If you’re streaming through a large file from start to end, no-copy file I/O is by far your best option because the I/O is just as fast, there’s no copies, and it doesn’t put nearly as much pressure on the VM subsystem.

So, if you’re building a general-purpose library, you can and should provide an efficient file I/O version. And if you have that then do you need the memory mapping version at all?

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

[1] To be guaranteed you hit this path:

Set F_NOCACHE.
Use a page-aligned buffer.
Transfer a multiple of the page size.
To or from a page-aligned offset in the file.

tera · December 3, 2021, 10:50am

Consider this example:

you have a graph where each node takes 4K (for simplicity).
each node describes its contents as well as the links to other nodes.
you started at some node and display it's contents to the user.
from there user may navigate either downwards to some link or upwards to the parent node.
you want to minimise the number of I/O operations.
you don't change nodes (or do so infrequently)

In this task memory mapping is ideal (IMHO). For example when user flips back and forth between the node and its parent VM will load the relevant nodes once and there won't be subsequent I/O after that. With explicit read: either I/O will happen more than once, or I need to use cached I/O (and thus extra memory will be taken up for the cache and extra memory copy will happen on every I/O) or I'll have to, basically, reimplement VM...

"it doesn’t put nearly as much pressure on the VM subsystem." - that's the thing, depending upon the task in question you are not just lifting the pressure from one component - you are moving it onto another component. It's an engineering tradeoff where you want that pressure to be.

Terje · December 3, 2021, 11:30am

So a B-Tree basically. Which afaik is what APFS uses and most databases, me included.

SQLite3 does that, but that's probably more to support as many (exotic) platforms as possible.

But don't I have to do so anyway if I want to support transactions?
To avoid ambiguity: DataSheet is my simplified version of Swift.Data / bag of bytes. It's a wrapper around UnsafeMutableRawBufferPointer,

Transactions implies having multiple versions of each sheet in memory. Is there much difference between copying bytes between one sheet and another sheet (i.e. a new transaction) and making that first "copy" by copying bytes from disk into DataSheet? I have to manage memory anyway so I might as well do I/O myself, no?

note: I mentioned undo in a comment above. But that's more in the context of undoing an entire SQL statement. Not in the context of e.g. undoing a keystroke.

Avi · December 3, 2021, 11:32am

You are misunderstanding both I/O and memory mapping. The latter does not guarantee that pages are kept in memory, and the former does not imply that data is not cached.

Memory mapping, on the other hand, forces user-level page table mappings to be made, and that can cause problems for the application. All modern operating systems restrict the amount of virtual memory that can be mapped by an application. Using mmap'd files eats into this total. Relying on the operating system's filesystem cache does not.

tera · December 3, 2021, 11:36am

this can work for transactions:

- copy file to a temp location on the same volume (fast and cheap on AFS)
- mmap the copy
- change some bytes
- munmap the copy
- to commit: exchangedata of main file and the copy, delete old file
- to rollback: delete the copy

Terje · December 3, 2021, 12:55pm

To be honest, I am not convinced on mmap. I appreciate that this approach would take care of certain things for me, but I don't mind reinventing wheels. I mean, we are talking about writing a database from scratch.

If I understand everything correctly:

Arguments against mmap are:

it eats into the allowed virtual memory of an app
it pages anyway if the file exceeds available memory
if the volume disappears, all hell breaks loose.

Arguments in favour of mmap are:

Caching is done transparently for me (with the caveat that not all filepages are always cached)
less complexity / code for me

However if large files and other platforms require direct I/O I don't see how special casing mmap helps reduce complexity / code. And in any case, facing that complexity head-on usually brings new insights.
I am more concerned about correctness without sacrificing capability. mmap I think would impose constraints upon file size.

Now the idea of a fast and cheap file copy is enticing as a replacement for a WAL file. However it would again only be the case for Apple platforms. So special casing AFPS would not reduce my code base.

I appreciate all of your answers. I'll go with the I/O route for now. If I do a proper job then this component should easily be replaceable/extendable/special-casing later on anyway.

Terje · December 3, 2021, 1:00pm

You used latter twice. I thinks this is correct?

tera · December 3, 2021, 1:10pm

In this case you may try to do without your explicit memory management of pages and use cached I/O (first read slow - next read quick provided the data is still in cache). As for transactions - some journalism mechanism will be needed.

[And if you don't want to put the pressure on the file system cache for some reason - then you'll invest into your full blown memory management / caching infrastructure and put the pressure up on it.]

Avi · December 3, 2021, 1:13pm

Corrected. Thanks.

Terje · December 3, 2021, 1:20pm

keeping only the dirty pages in memory until the end of a transaction and otherwise rely on the system cache. Yes, that could probably work as a first approach. Thanks.

tera · December 3, 2021, 1:32pm

I'd say you'll write dirty pages, then when I/O flushed and the data hit the disk you do another atomic write, say, that has a new B-Tree root, after that succeeded - that's your transaction done. Any point before that - the system might crash, etc, the transaction is still either in flight or rolled back. Also worth considering an unfortunate situation when I/O operation is done partially. Quote a journey for you, Apple did it right only in a journaled HFS+, some 25 years after their original file system implementation.

Terje · December 16, 2021, 5:46pm

I'm writing down my thought process during development of this database. When it's ready someday I'll probably put it online somewhere. In case anybody is interested, here a few extracts relevant to this post.

This first extract is from a post about DataSheet. That is basically a fixed size Swift.Data.

Per the requirements we support returning a copy in the form of Swift.Data. In the implementation you see we used Data(buffer[range]). You might be asking yourself, like I did, if the Data's capacity is larger than necessary. That was after all the reason we choose to manage a buffer ourselves.

We could play around with it and see what the debugger tells us. Or we could have a peek at the source file of Swift.Data. This file can be found on github in the Foundation folder of Apple's swift-corelibs-foundation project. It's a relatively large file but if you follow the indentations correctly, you'll notice that Data boils down to internal enum with an associated payload. This way it can be optimized for various sizes.

Next locate the constructors of Data accepting UnsafeMutableBufferPointer types which are accessible to us as users. For example:
init<SourceType>(buffer: UnsafeMutableBufferPointer<SourceType>) 
{
    _representation = _Representation(UnsafeRawBufferPointer(buffer))
}
_Representation is the enum I mentioned but no need to look at that in detail. Already here we can see that our buffer will be copied into another buffer which is then passed on the enum. To my knowledge these buffers do not allocate extra capacity. Indeed that would be a bit weird since their purpose is to have the user in complete control. In other words our copy here doesn't waste any memory. What happens after is out of our hands.

This second extract is from a post about file paging.

Seeking

Our file handle is optional and our page numbers will be Int even though only UInt32 values are valid. The first thing that always happens is seeking the correct position within our file. The seek function of FileHandle expects UInt64 to boot. Let's just make our own seek function, shall we.
extension FilePager
{
    private func seek(number: Number) throws -> FileHandle
    {
        guard number < UInt32.max else { throw Error.outOfBounds }
        guard let assumedHandle = handle else { throw Error.inAccessible }
        try assumedHandle.seek(toOffset: UInt64(number * size) )
        return assumedHandle
    }
}

In case somebody is wondering about the Int and UInt32. Mixing integer types is sometimes a pain in the ass. Those page numbers are only used at few select places as UInt32 anyway. Also page numbers are considered to be user input since they are stored within the file to locate data. It's nicer to throw an error than to let Swift trap on this.

Writing

Although we rely on the system's cache to speed things up, we still need to copy the data to and from a datasheet. We might edit the sheet but then dispose of the edits without saving them to disk. Likewise after we wrote data to disk we might have additional edits that end up being disposed. Thus we need isolated copies.

FileHandle has functions for reading and writing Data. Let's investigate those functions like we did with Data in the previous post. The write uses _writeBytes(buf: UnsafeRawPointer, length: Int) which uses system functions like Darwin.write. Pretty straightforward and since we saw last time that our Data copy is pretty efficient as well, we can use the following snippet:
func write(sheet: DataSheet, at number: Number) throws
{
    guard sheet.isDirty else { return }
    guard number <= count else { throw Error.outOfBounds }
    try seek(number: number).write(contentsOf: sheet.data)
    if number == count { count += 1 }
}
Reading

To read the file we can use read(upToCount count: Int). Internally that calls _readDataOfLength. Depending on the options parameter we see that memory mapping is used. Something we want to avoid. However the options are empty by default and the read function doesn't specify any options. I can only conclude that mapping is not used, or at least not in this case.

Instead of memory mapping system functions like Darwin.read are used. Here it gets interesting though. Our FileHandle.read function first determines a suitable buffer size and then allocates an unsafe buffer of that size. Then via Darwin.read it reads bytes into the buffer. If more bytes are expected it resizes the buffer, each time by the same pre-determined amount. Finally once the expected amount of bytes are read, the buffer goes into Swift.Data.

Well actually NSDataReadResult but that gets converted into Data without any additional copies. Figuring that out requires peeking in the NSData source file.

To come back to that buffer size. As far as I can tell the buffer size is stat().st_blksize , i.e. the file system buffer size. If that can't be determined or returns 0 it defaults to 8 KB. You probably know or heard the recommendation to use multiples of the system's page size? Here we saw it in action. Note that for "sockets, character special files, FIFOs ..." the buffer size is always 8 KB.

If I print out the pagesize (print(stat().st_blksize)) I get 0 though. In any case the page size will probably depend on the platform. Our B-Tree page size is adaptable but still fixed for a given file regardless of platform. What we should take away from this is that our default page size should probably be a multiple of 8 KB. Unless we want to add the complication of reading multiple pages of smaller size at once.

To conclude this segment, 'FileHandle' reads bytes into Swift.Data without reserving additional memory capacity. It does so in blocks of 8 KB unless the page size can be determined. Thus we shouldn't have any hangups about using FileHandle but wisely choose a B-Tree page size.

Thus we implement our read function as follows:
func read(sheet number: Number) throws -> Element
{
    guard number < self.count else { throw Error.outOfBounds }
    guard let data = try seek(number: number).read(upToCount: size), data.count == size
      else { throw Error.outOfBounds }
    return .init(from: data)
}

tera · December 17, 2021, 5:26pm

Thus we implement our read function as follows:

func read(sheet number: Number) throws -> Element
{
    guard number < self.count else { throw Error.outOfBounds }
    guard let data = try seek(number: number).read(upToCount: size), data.count == size
      else { throw Error.outOfBounds }
    return .init(from: data)
}

Do you like synchronous I/O?

I noted you mentioned non apple platforms and disappearing volumes before. Perhaps you don't exclude network volumes either? Those could be slow, so any sync I/O operation is unbound time (strictly speaking any I/O is unbound time, it's just very obvious in case of network). Depending upon how you structure the rest of your app (threads? queues?) it might be a problem. "Modern" way to do I/O is by doing it asynchronously (having said that I remember Apple had PBRead+async even in 1984 if not earlier!) Quickly glancing through headers I see some ancient readToEndOfFileInBackgroundAndNotify but those are not what I'd recommend to do. Until it is done properly in the file manager with new API's and some nice async/await wrappers on top, the thing I'd look at would be DispatchIO (e.g. this). I don't see the new async/await facelift in there but would assume they (or some better equivalents in the FileManager itself) are forthcoming.

Edit: the future has come. For modern async I/O I'd look no further than this:

@available ( macOS 12.0, iOS 15.0, tvOS 15.0, watchOS 8.0, *)
public var resourceBytes: URL.AsyncBytes { get }

Edit 2: but you need to read a number of bytes from a certain offset... maybe the future is not here yet.

Terje · December 17, 2021, 9:01pm

Of course not. Huge fan of libdispatch. Made my crude version once to better understand how that works.

The future is actors, no. The idea is to hide this behind an actor
Userland/main actor/background context/… => actor Database => [actor Transaction] => actor Storage => [actor Pager]

So

read bytes from main file and/or WAL file | 2 files per entire database or 2 files per table => each file is accessed through an actor.
interpret bytes as specific BTree page & file coordinator => 1 actor per database
Transaction: Query BTree & value <=> bytes transformations => 1 actor per transaction
Handle transaction / transaction results = database => 1 actor
Future work: 1 App = 1 DB connection or maybe multiple databases or maybe 1 connection per network connection or maybe 1 for local + 1 for remote DB or …(ah, who knows at this point)

The boundaries are still somewhat in flux. This week I read about ACID / Isolation / multi-version concurrency control. So I’m incorporating that new knowledge. Points 1 and 2 were the same actor and coded as such in a first version.

With the current refactor pass I can I/0 individual sheets in a file and write arbitrary data (byte collections/ints/floats/date/strings/enum) to a sheet. Decimal too once that is added to Swift. Not Float80 (yet) because there is no Int80 or larger. Int128 is coming I think to support Decimal.

Now I am working on WAL which lead to Isolation levels and transaction concurrency. This requires a new set of boundaries. Currently planned as described above. My gut feeling says that actor in point 2 might be a bottle neck. Also multiple connections ( != concurrent transactions I think) has not yet been looked into. So I’m not yet sure where it will end up exactly.

I’m aiming for a lock free DB along the lines of multiversion concurrency control (MVCC).

Also I came across ARIES, but that is more on row level not page level. Sounds useful for maybe DB replication over the network. Don’t know. A lot of this stuff is new for me. I’ll figure it out as I go along.

PS. My background is physics not CS. Got burned twice on slow code (15-20 years ago). Once because of the network, once because of inner/outer loops and file I/O. Last one was easily fixed, first one not. So yes, I’m very sensitive to performance. I mean, this whole thread is about saving a few bytes.

PS.2 I thank you for your comment and your concern, Tera.

PS.3 I had already planned to present my actor boundaries and ask for feedback once more of it was implemented. Now it’s still too much in flux to bother everyone with it, I think. But don’t let that hold you back