Reading large files fast and memory efficient

I'm trying to read large files as quickly and memory-efficiently as I can.
Right now I am using FileHandle.availableData, which reads in 4 KB chunks, which is great:

func scan(advance: Int = 1) -> UInt8 {
        if data.isEmpty {
                return 0
        }
        if index >= data.count {
                return 0
        }
        let c = data[index]
        index += advance
        if index == data.count {
                index = 0
                data = file.availableData
        }
        return c
}

But performance is not optimal: it takes 1 s to read 50 MB. Running perf, I get:

  29,00%  bla      libswiftCore.so    [.] swift_beginAccess
  26,66%  bla      bla                [.] $s3bla4scan7advances5UInt8VSi_tF
  18,94%  bla      libFoundation.so   [.] $s10Foundation4DataV15_RepresentationOys5UInt8VSicig
   5,36%  bla      ld-2.31.so         [.] __tls_get_addr

Is there a way to get rid of "swift_beginAccess"? I'm already compiling with

  swiftc -whole-module-optimization -Ounchecked bla.swift -enforce-exclusivity=none

There are two parts to this:

  • Reading the file

  • Parsing the data

They are somewhat related, in that the API you use to read the file will affect the operations required to parse it.

Before we go further, I have some quick questions:

  • What platform are you targeting?

  • Just how big can these files get? Is your 50 MB file a worst case? An expected case?

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

First: I found InputStream which seems more appropriate.

Platform: Linux
File size: 1GB or more
File type: ASCII text

Now I am using InputStream.read(), which seems fast enough, and interpreting the resulting UInt8 values myself, since everything else seems too slow (especially String and Character).

Reading a 1 GB file with an InputStream buffer size of 16 KB takes 2 seconds without further parsing.
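
For future readers, here is a minimal sketch of what such a chunked read loop might look like. It is not the poster's actual code: the function name countNewlines, the newline-counting "parse" step, and the default chunk size are assumptions made for illustration. It relies only on InputStream(fileAtPath:) and read(_:maxLength:) from Foundation.

```swift
import Foundation

/// Hypothetical helper: counts newline bytes (0x0A) in the file at `path`,
/// reading `chunkSize` bytes at a time. Returns nil on a read error.
func countNewlines(atPath path: String, chunkSize: Int = 16 * 1024) -> Int? {
    guard let stream = InputStream(fileAtPath: path) else { return nil }
    stream.open()
    defer { stream.close() }

    // One reusable buffer, so memory use stays at roughly `chunkSize` bytes.
    var buffer = [UInt8](repeating: 0, count: chunkSize)
    var newlines = 0

    while true {
        let bytesRead = stream.read(&buffer, maxLength: chunkSize)
        if bytesRead < 0 { return nil }   // read error
        if bytesRead == 0 { break }       // end of file

        // Work on the raw bytes directly instead of going through
        // String or Character.
        for i in 0..<bytesRead where buffer[i] == 0x0A {
            newlines += 1
        }
    }
    return newlines
}
```

The inner loop touches only local variables, so the per-byte work stays cheap; the chunk size is worth measuring rather than guessing.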

If anyone is still interested in the part of the question about swift_beginAccess: it seems that swift_beginAccess gets called even when exclusivity enforcement is set to none, judging by the comments in the definition of swift_beginAccess in swift/Exclusivity.cpp (apple/swift on GitHub, at commit e37eb35c7c6e9faeabe2bda7de59dc92a718d779). I may be interpreting that incorrectly, but your evidence suggests it is the case (unless the compiler option is just broken).

It looks to me like your scan function is part of a class, and swift_beginAccess is called because you are accessing members of the class. Making the class final or refactoring it into a struct would most likely get rid of the swift_beginAccess calls (not so sure about final, but pretty sure about struct). I'm surprised that the calls had such a high performance impact, though, given that they short-circuit pretty quickly when exclusivity is not being enforced. Reading more bytes at once could also minimise the performance impact of swift_beginAccess.
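
A rough sketch of the struct variant, for illustration only. The type ByteScanner, its members, and the 64 KB default chunk size are hypothetical names and values, not taken from the original code. Whether this actually removes the swift_beginAccess calls depends on how the scanner is stored and used (a local variable is the most favourable case), so it is worth confirming with perf.

```swift
import Foundation

/// Hypothetical value-type scanner. When held in a local variable, access to
/// its stored properties can usually be checked at compile time.
struct ByteScanner {
    private let stream: InputStream
    private let capacity: Int
    private var buffer: [UInt8]
    private var count = 0   // number of valid bytes in `buffer`
    private var index = 0   // position of the next byte to return

    init?(path: String, chunkSize: Int = 64 * 1024) {
        guard let s = InputStream(fileAtPath: path) else { return nil }
        s.open()
        stream = s
        capacity = chunkSize
        buffer = [UInt8](repeating: 0, count: chunkSize)
    }

    /// Returns the next byte, or nil at end of file or on a read error.
    mutating func next() -> UInt8? {
        if index == count {
            // Refill the whole buffer at once; a negative result (error)
            // is treated like end of file in this sketch.
            count = max(0, stream.read(&buffer, maxLength: capacity))
            index = 0
            if count == 0 { return nil }
        }
        defer { index += 1 }
        return buffer[index]
    }

    /// Closes the underlying stream; structs have no deinit, so call this
    /// explicitly when done.
    func close() {
        stream.close()
    }
}
```

Usage would look something like `var scanner = ByteScanner(path: "input.txt")!` followed by `while let byte = scanner.next() { ... }`, keeping the scanner in a local variable rather than in a class property.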

I know that the question is quite old and you already found an alternative solution, but I was working on a similar problem and some future reader of the thread might find it useful.