I'm writing down my thought process during development of this database. When it's ready someday I'll probably put it online somewhere. In case anybody is interested, here are a few extracts relevant to this post.
This first extract is from a post about DataSheet. That is basically a fixed-size Swift.Data.
Per the requirements we support returning a copy in the form of Swift.Data. In the implementation you see we used Data(buffer[range]). You might be asking yourself, like I did, whether the Data's capacity is larger than necessary. That was after all the reason we chose to manage a buffer ourselves.
We could play around with it and see what the debugger tells us. Or we could have a peek at the source file of Swift.Data. This file can be found on GitHub in the Foundation folder of Apple's swift-corelibs-foundation project. It's a relatively large file, but if you follow the indentation correctly you'll notice that Data boils down to an internal enum with an associated payload. This way it can be optimized for various sizes.
Next locate the constructors of Data accepting UnsafeMutableBufferPointer types which are accessible to us as users. For example:
init<SourceType>(buffer: UnsafeMutableBufferPointer<SourceType>)
{
    _representation = _Representation(UnsafeRawBufferPointer(buffer))
}
_Representation is the enum I mentioned, but there is no need to look at that in detail. Already here we can see that our buffer will be copied into another buffer which is then passed on to the enum. To my knowledge these buffers do not allocate extra capacity. Indeed that would be a bit weird since their purpose is to give the user complete control. In other words our copy here doesn't waste any memory. What happens after is out of our hands.
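The copy behaviour can also be checked from the outside with a small experiment. This is only a sketch: the 16-byte buffer and the function name copiedBytes are made up for illustration and are not part of DataSheet. It confirms that the resulting Data holds exactly the bytes of the range and is independent of the source buffer. It cannot show the allocated capacity, since Data does not expose that publicly.

```swift
import Foundation

/// Fills a hypothetical 16-byte buffer with 0...15, copies `range` into a
/// Data value, then overwrites the buffer to show the copy is independent.
func copiedBytes(in range: Range<Int>) -> Data {
    let buffer = UnsafeMutableBufferPointer<UInt8>.allocate(capacity: 16)
    defer { buffer.deallocate() }
    buffer.initialize(repeating: 0)
    for index in buffer.indices { buffer[index] = UInt8(index) }

    // Same call shape as in the post: a slice of the buffer goes into Data.
    let copy = Data(buffer[range])

    // Scribble over the source; the copy must keep the old values.
    for index in buffer.indices { buffer[index] = 0xFF }
    return copy
}

let copy = copiedBytes(in: 4..<12)
print(copy.count, Array(copy))
```

The count of the copy matches the length of the range, and the values survive the later writes to the buffer, which is what we want from an isolated copy.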
This second extract is from a post about file paging.
Seeking
Our file handle is optional and our page numbers will be Int even though only UInt32 values are valid. The first thing that always happens is seeking the correct position within our file. The seek function of FileHandle expects UInt64 to boot. Let's just make our own seek function, shall we.
extension FilePager
{
    private func seek(number: Number) throws -> FileHandle
    {
        // Reject negative and oversized page numbers here, so that
        // UInt64(_:) below cannot trap on a negative value.
        guard number >= 0, number < UInt32.max else { throw Error.outOfBounds }
        guard let assumedHandle = handle else { throw Error.inAccessible }
        try assumedHandle.seek(toOffset: UInt64(number * size))
        return assumedHandle
    }
}
In case somebody is wondering about the Int and UInt32: mixing integer types is sometimes a pain in the ass. Those page numbers are only used as UInt32 in a few select places anyway. Also, page numbers are considered to be user input since they are stored within the file to locate data. It's nicer to throw an error than to let Swift trap on this.
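To illustrate the "throw instead of trap" idea in isolation, here is a minimal standalone sketch. PageError and byteOffset(ofPage:pageSize:) are hypothetical names, not part of FilePager. It uses the failable exact conversions and overflow-reporting multiplication from the standard library, so no input can reach a trapping conversion. Note that UInt32(exactly:) accepts UInt32.max itself, which is slightly more permissive than the strict comparison in the seek function.

```swift
enum PageError: Error { case outOfBounds }

/// Converts a page number (treated as untrusted input) into a byte offset,
/// throwing instead of trapping on negative or oversized values.
func byteOffset(ofPage number: Int, pageSize: Int) throws -> UInt64 {
    // The exact conversions return nil for negative values and for
    // values that do not fit the target type.
    guard let page = UInt32(exactly: number),
          let size = UInt64(exactly: pageSize)
    else { throw PageError.outOfBounds }

    // Report overflow as an error rather than letting `*` trap.
    let (offset, overflow) = UInt64(page).multipliedReportingOverflow(by: size)
    guard !overflow else { throw PageError.outOfBounds }
    return offset
}
```

Calling byteOffset(ofPage: -1, pageSize: 4096) throws PageError.outOfBounds, whereas UInt64(-1 * 4096) would crash the process.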
Writing
Although we rely on the system's cache to speed things up, we still need to copy the data to and from a datasheet. We might edit the sheet but then discard the edits without saving them to disk. Likewise, after we have written data to disk we might make additional edits that end up being discarded. Thus we need isolated copies.
FileHandle has functions for reading and writing Data. Let's investigate those functions like we did with Data in the previous post. The write function uses _writeBytes(buf: UnsafeRawPointer, length: Int) which in turn uses system functions like Darwin.write. Pretty straightforward, and since we saw last time that our Data copy is pretty efficient as well, we can use the following snippet:
func write(sheet: DataSheet, at number: Number) throws
{
    // Nothing to do if the sheet has no unsaved edits.
    guard sheet.isDirty else { return }
    // number == count is allowed: that appends a new page.
    guard number <= count else { throw Error.outOfBounds }
    try seek(number: number).write(contentsOf: sheet.data)
    if number == count { count += 1 }
}
Reading
To read the file we can use read(upToCount count: Int). Internally that calls _readDataOfLength. Depending on the options parameter, memory mapping may be used, which is something we want to avoid. However, the options are empty by default and the read function doesn't specify any. I can only conclude that mapping is not used, or at least not in this case.
Instead of memory mapping, system functions like Darwin.read are used. Here it gets interesting though. Our FileHandle.read function first determines a suitable buffer size and then allocates an unsafe buffer of that size. Then via Darwin.read it reads bytes into the buffer. If more bytes are expected it resizes the buffer, each time by the same predetermined amount. Finally, once the expected number of bytes has been read, the buffer goes into Swift.Data.
Well, actually into NSDataReadResult, but that gets converted into Data without any additional copies. Figuring that out requires peeking in the NSData source file.
To come back to that buffer size. As far as I can tell the buffer size is st_blksize as reported by stat, i.e. the file system's preferred block size for I/O. If that can't be determined or returns 0 it defaults to 8 KB. You probably know or have heard the recommendation to use multiples of the system's page size? Here we saw it in action. Note that for "sockets, character special files, FIFOs ..." the buffer size is always 8 KB.
If I print out that size (print(stat().st_blksize)) I get 0 though. That is probably because stat() on its own only creates a zero-initialized struct in Swift; the field is only filled in once the struct is passed to the stat or fstat system call for an actual file. In any case the block size will probably depend on the platform. Our B-Tree page size is adaptable but still fixed for a given file regardless of platform. What we should take away from this is that our default page size should probably be a multiple of 8 KB. Unless we want to add the complication of reading multiple pages of smaller size at once.
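For anyone who wants to see the value for a real file, here is a small sketch. The helper name preferredBlockSize(atPath:) is made up for this example. It passes the struct to the stat system call, which fills in st_blksize for the given path. The value you get depends on the platform and file system, so no particular number is assumed here.

```swift
import Foundation

/// Returns the preferred I/O block size the file system reports for `path`,
/// or nil if the stat system call fails (e.g. the path does not exist).
func preferredBlockSize(atPath path: String) -> Int? {
    var info = stat()                                 // zero-initialized struct, no system call yet
    guard stat(path, &info) == 0 else { return nil }  // the actual system call fills the struct
    return Int(info.st_blksize)
}

if let size = preferredBlockSize(atPath: NSTemporaryDirectory()) {
    print("st_blksize:", size)
}
```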
To conclude this segment, FileHandle reads bytes into Swift.Data without reserving additional memory capacity. It does so in blocks of 8 KB unless the file system's block size can be determined. Thus we shouldn't have any hang-ups about using FileHandle, but we should choose our B-Tree page size wisely.
Thus we implement our read function as follows:
func read(sheet number: Number) throws -> Element
{
    guard number < self.count else { throw Error.outOfBounds }
    // A short read means the page is not fully present in the file.
    guard let data = try seek(number: number).read(upToCount: size),
          data.count == size
    else { throw Error.outOfBounds }
    return .init(from: data)
}
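To try the seek, write and read flow without the rest of the pager, here is a self-contained round trip against a temporary file. It is a sketch under assumptions: roundTrip and SketchError are hypothetical names, pages are plain Data values instead of DataSheet, and the 8 KB page size is only an example. It uses the throwing FileHandle calls seek(toOffset:), write(contentsOf:) and read(upToCount:), the same ones as in the snippets above.

```swift
import Foundation

enum SketchError: Error { case createFailed, shortRead }

/// Writes `pages` back to back into a fresh temporary file, then reads page
/// `index` back via seek + read(upToCount:), mirroring the pager's flow.
func roundTrip(pages: [Data], pageSize: Int, readBack index: Int) throws -> Data {
    let url = FileManager.default.temporaryDirectory
        .appendingPathComponent("pager-sketch-\(UUID().uuidString).bin")
    guard FileManager.default.createFile(atPath: url.path, contents: nil)
    else { throw SketchError.createFailed }
    defer { try? FileManager.default.removeItem(at: url) }

    let handle = try FileHandle(forUpdating: url)
    defer { try? handle.close() }

    // Each page lands at offset number * pageSize.
    for (number, page) in pages.enumerated() {
        try handle.seek(toOffset: UInt64(number * pageSize))
        try handle.write(contentsOf: page)
    }

    // Seek to the requested page and insist on a full page of bytes.
    try handle.seek(toOffset: UInt64(index * pageSize))
    guard let data = try handle.read(upToCount: pageSize), data.count == pageSize
    else { throw SketchError.shortRead }
    return data
}

let pageSize = 8 * 1024
let pages = [Data(repeating: 0xAA, count: pageSize),
             Data(repeating: 0xBB, count: pageSize)]
let second = try roundTrip(pages: pages, pageSize: pageSize, readBack: 1)
print(second == pages[1])
```

Asking for a page beyond the end of the file ends in SketchError.shortRead, which corresponds to the outOfBounds case in the read function above.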