Difficulties With Efficient Large File Parsing

ezfe · April 25, 2019, 5:06am

Hi all,

I'm reading in a 422 MB text file with ~28 million lines. Each line is 10-20 characters with a single space in the middle (pairs of ids).

Reading this into memory with String(contentsOfFile:) is very quick and takes just a second to run. However, just splitting it into an array of substrings takes a very long time. I've tried using a few options:

.split(separator: "\n")
.components(separatedBy:)
.enumerateLines(:)

Compared to other languages (Java, others), the speeds I'm observing with Swift are extremely slow (2-4x slower). Does anyone have recommendations for the fastest way to process a file line-by-line?

David_Smith · April 25, 2019, 6:53am

String(contentsOfFile:) unfortunately can hit some obscure slow paths in NSString. Any chance you could post an Instruments 'Time Profile' of this running?

Torust · April 25, 2019, 7:51am

An option here, if you know the sort of input you have, is to manually write a parser. Given your problem description, here's a basic proof of concept (written in about five minutes so excuse any bugs):

let string = """
Some string
made of
many lines
"""

class Parser {
    let string : String
    var index : String.Index
    
    init(string: String) {
        self.string = string
        self.index = string.startIndex
    }
    
    func nextId() -> Substring? {
        if self.index == self.string.endIndex {
            return nil
        }
        
        let endIndex = self.string[self.index...].firstIndex(where: { $0 == "\n" || $0 == " " }) ?? self.string.endIndex
        let returnValue = self.string[self.index..<endIndex]
        
        self.index = self.string[endIndex...].firstIndex(where: { $0 != "\n" && $0 != " " }) ?? self.string.endIndex
        return returnValue
    }
}

let parser = Parser(string: string)
while let id = parser.nextId() {
    print(id)
}

In practice, you may need to handle other newline characters (e.g. \r\n), and it may be better to perform the parsing on the string's utf8 or utf16 views for performance and compare against the characters in that encoding. This should be a good starting point, though, and avoids the overhead of e.g. allocating separate arrays and strings that split or components entail.
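A rough sketch of the UTF-8 view variant suggested above, assuming the separators are only the ASCII bytes for space, "\n", and "\r" (so "\r\n" line endings are swallowed as well). The `UTF8Tokenizer` name and its `isSeparator` helper are made up for illustration; this is not a tuned implementation.

```swift
// Tokenizer that scans String.UTF8View bytes rather than Characters.
struct UTF8Tokenizer {
    private let utf8: String.UTF8View
    private var index: String.UTF8View.Index

    init(_ string: String) {
        self.utf8 = string.utf8
        self.index = string.utf8.startIndex
    }

    // Assumption: only these three ASCII bytes separate ids.
    private static func isSeparator(_ byte: UInt8) -> Bool {
        return byte == UInt8(ascii: " ")
            || byte == UInt8(ascii: "\n")
            || byte == UInt8(ascii: "\r")
    }

    mutating func nextId() -> Substring? {
        // Skip leading separators (this also skips blank lines).
        index = utf8[index...].firstIndex(where: { !UTF8Tokenizer.isSeparator($0) }) ?? utf8.endIndex
        guard index != utf8.endIndex else { return nil }

        // The token runs up to the next separator byte or the end of input.
        let end = utf8[index...].firstIndex(where: { UTF8Tokenizer.isSeparator($0) }) ?? utf8.endIndex
        let token = Substring(utf8[index..<end])
        index = end
        return token
    }
}

var tokenizer = UTF8Tokenizer("12 34\r\n56 78\n")
var ids: [String] = []
while let id = tokenizer.nextId() {
    ids.append(String(id))
}
```

Because the separators are all ASCII, cutting at those bytes never lands in the middle of a multi-byte UTF-8 scalar.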

DeFrenZ · April 25, 2019, 9:03am

If you have details about the contents of the file (e.g. ASCII-only), String might not be your best option, since being Unicode-correct is costly performance-wise.
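As an illustration of working on raw bytes instead of String, here is a minimal sketch that assumes the file is ASCII and that every line holds two ids separated by a single space, as described in the original question. `splitPair` is a hypothetical helper, and the ids are only decoded to String at the very end.

```swift
// Hypothetical helper: splits one line of bytes at its first space byte.
// Returns nil if the line contains no space.
func splitPair(_ line: ArraySlice<UInt8>) -> (ArraySlice<UInt8>, ArraySlice<UInt8>)? {
    guard let space = line.firstIndex(of: UInt8(ascii: " ")) else { return nil }
    return (line[..<space], line[line.index(after: space)...])
}

// Stand-in for the file contents; a real program would load the bytes from disk.
let bytes = Array("123 4567\n89 10\n".utf8)

let pairs = bytes
    .split(separator: UInt8(ascii: "\n"))
    .compactMap { splitPair($0) }

// Decode only when a String is actually needed.
let decoded = pairs.map {
    (String(decoding: $0.0, as: UTF8.self), String(decoding: $0.1, as: UTF8.self))
}
```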

TellowKrinkle · April 25, 2019, 9:48am

Some results from my testing (on a ~500MB text file with lines averaging 10 characters long):

switch CommandLine.arguments.dropFirst().first {
case "ss": // >72 seconds (I cancelled before finishing)
	let str = try String(contentsOf: url)
	_ = str.split(separator: "\n")
case "sc": // 32.51 seconds
	let str = try String(contentsOf: url)
	_ = str.components(separatedBy: "\n")
case "ds": // 14.56 seconds
	let data = try Data(contentsOf: url)
	_ = data.split(separator: UInt8(ascii: "\n"))
case "das": // 7.04 seconds
	let data = Array(try Data(contentsOf: url))
	_ = data.split(separator: UInt8(ascii: "\n"))
case "dss": // 25.96 seconds
	let data = try Data(contentsOf: url)
	let str = data.withUnsafeBytes { String(decoding: $0, as: UTF8.self) }
	_ = str.split(separator: "\n")
case "dus": // 4.82 seconds
	let data = try Data(contentsOf: url)
	data.withUnsafeBytes {
		_ = $0.split(separator: UInt8(ascii: "\n"))
	}
case "duss": // 9.48 seconds
	let data = try Data(contentsOf: url)
	_ = data.withUnsafeBytes {
		return $0.split(separator: UInt8(ascii: "\n")).map { String(decoding: UnsafeRawBufferPointer(rebasing: $0), as: UTF8.self) }
	}
default:
	print("Unknown argument")
}

As previously mentioned, the first case (using String(contentsOf:)) is really slow due to the bridging that happens, since String(contentsOf:) is a Cocoa method and returns an NSString.

Using contentsOf and then components(separatedBy:) stays in Cocoa land and avoids the heavy bridging, but Cocoa string processing isn't the most efficient.

It appears Data still has some speed issues, so throwing an Array constructor in still speeds it up heavily. If you don't need Unicode correctness, this is definitely the way to go.

The String(decoding:as:) constructor is a Swift one and therefore creates native Swift strings, which process much faster. A lot of stuff still seems to be going on, though.

If you can do all of your processing at once inside a withUnsafeBytes block, that will be the fastest since it avoids spamming retain on the main object. Note that 2.3 seconds of that was spent copying the array of split objects to a new buffer, so using a lazy split might improve this. It also might let the compiler avoid spamming retain / release if you did this with an Array.
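One way to avoid materializing the array of slices at all is to walk the buffer and hand each line to a closure. The sketch below assumes "\n" line endings and skips empty lines, matching the default behaviour of split. `forEachLine` is a hypothetical helper, not something from Foundation, and it has not been benchmarked against the cases above.

```swift
import Foundation

// Hypothetical helper: calls `body` once per non-empty line, without
// building an intermediate array of slices.
func forEachLine(in data: Data, _ body: (UnsafeRawBufferPointer) -> Void) {
    data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) in
        let newline = UInt8(ascii: "\n")
        var lineStart = 0
        for i in 0..<buffer.count where buffer[i] == newline {
            if i > lineStart {
                body(UnsafeRawBufferPointer(rebasing: buffer[lineStart..<i]))
            }
            lineStart = i + 1
        }
        // Final line when the input does not end with a newline.
        if lineStart < buffer.count {
            body(UnsafeRawBufferPointer(rebasing: buffer[lineStart..<buffer.count]))
        }
    }
}

// Stand-in for Data(contentsOf: url).
let sample = Data("1 2\n3 4\n\n5 6".utf8)
var lines: [String] = []
forEachLine(in: sample) { line in
    // The pointer is only valid inside the closure, so copy what you need.
    lines.append(String(decoding: line, as: UTF8.self))
}
```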

Finally, thanks to the new short string optimizations, splitting and then initializing strings from the (now short) pieces worked really well. Note that for some reason keeping the pointer as a slice (not using init(rebasing:)) made it super slow. Maybe someone forgot an @inlinable somewhere in Slice?

alextud · April 25, 2019, 10:08am

I also ran a test on the data provided by Flight-School/CodablePerformance (performance benchmarks for Codable and JSONSerialization using Swift and ObjC):

func testObjC() {
    self.measure {
        let string = NSString(data: data, encoding: String.Encoding.utf8.rawValue)!
        let count = string.components(separatedBy: "\n").count
        XCTAssertEqual(176468, count)
    }
}

func testSwift() {
    self.measure {
        let string = String(data: data, encoding: .utf8)!
        let count = string.components(separatedBy: "\n").count
        XCTAssertEqual(176468, count)
    }
}

And I got these results:
Test Case '-[Performance_Tests.PerformanceTests testObjC]' passed (3.538 seconds).
Test Case '-[Performance_Tests.PerformanceTests testSwift]' passed (5.702 seconds).

ezfe · April 25, 2019, 1:01pm

Thanks for all the tips and info. I’ll see where this goes and report back.

ezfe · April 25, 2019, 4:43pm

I implemented data.withUnsafeBytes to split by UInt8(ascii: "\n"), which seems to have helped. Splitting by newlines is no longer a substantial part of the entire program, and execution time (of the entire program, not just the splitting) has gone from 3.5 minutes to ~2 minutes, which is great.

I'll keep reviewing better ways to improve my specific code, but I'm satisfied with the \n splitting.

Michael_Ilseman · April 25, 2019, 8:01pm

If you have Swift 5.1 available, you can use SE-0247's String.withUTF8, which will force the contents into an efficient, contiguous form and process on that.
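A minimal sketch of what using String.withUTF8 could look like, assuming Swift 5.1 or later. The method is mutating (it may first rewrite the string into contiguous UTF-8 storage), so the string has to be a `var`. Counting newline bytes here is just a stand-in for real per-line processing.

```swift
// Stand-in for the file contents.
var text = "1 2\n3 4\n5 6\n"

// withUTF8 exposes the string's contiguous UTF-8 bytes as a buffer pointer.
let newlineCount = text.withUTF8 { (utf8: UnsafeBufferPointer<UInt8>) -> Int in
    var count = 0
    for byte in utf8 where byte == UInt8(ascii: "\n") {
        count += 1
    }
    return count
}
```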

Could you also try to force an eager bridge from NSString into native Swift String by doing something like:

var str = try String(contentsOfFile: ...)
str += "" // Force bridge

And see if .split is faster?

ezfe · April 25, 2019, 8:21pm

Using data.withUnsafeBytes and String(decoding:as:) took ~3.7 seconds to split the lines into an array of Strings.

Using String(contentsOfFile:) and .split without force-bridging took a whopping 241 seconds.

Using String(contentsOfFile:) and .split with force-bridging took 21 seconds, so it's clear that was a large part of the problem.

Lastly, I used String.makeContiguousUTF8() (from Swift 5.1) on the string instead, which also took 21 seconds.

Michael_Ilseman · April 25, 2019, 8:33pm

What is this approach (the data.withUnsafeBytes and String(decoding:as:) one that took ~3.7 seconds)?

makeContiguousUTF8() is basically the force-bridge (except for contiguous-nul-terminated-ASCII bridged strings, where we just have a slightly higher initial constant cost).

String.split() does not have the same semantics as what you described earlier, because it will first segment graphemes such that "\n\u{301}" is one Character and not equal to "\n". You probably don't want these semantics, nor do you want to spend the time doing the more complex grapheme analysis. You can try String.UTF8View.split() to cut out much of this overhead, but since it's the generic Collection version (and doesn't understand contiguity, at least not yet), it won't be as fast as your hand-written loop.
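A small illustration of the difference between Character-level and byte-level splitting. It uses "\r\n" line endings rather than the "\n\u{301}" case above, because Swift treats "\r\n" as a single Character, which makes the same effect easy to observe. The sample text is made up.

```swift
// Two lines with Windows-style line endings.
let text = "1 2\r\n3 4\r\n"

// Character-level split: "\r\n" is one Character and is not equal to "\n",
// so no separator is found and the whole string comes back as one piece.
let byCharacter = text.split(separator: "\n")

// Byte-level split on the UTF-8 view: every 0x0A byte is a separator,
// regardless of how the bytes group into Characters.
let byByte = text.utf8.split(separator: UInt8(ascii: "\n"))
```

Note that the byte-level pieces still end in "\r" here, so real code would need to trim it.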

TellowKrinkle · April 25, 2019, 8:36pm

I'm guessing it was this one:

let data = try Data(contentsOf: url)
_ = data.withUnsafeBytes {
	return $0.split(separator: UInt8(ascii: "\n")).map { String(decoding: UnsafeRawBufferPointer(rebasing: $0), as: UTF8.self) }
}

If all the split strings are <15 characters they get to be stack allocated, and I'm sure that saves a lot of allocation and reference counting time.

David_Smith · April 25, 2019, 8:42pm

For anyone who's curious, the reason bridging is so expensive here is that +stringWithContentsOfFile: doesn't null terminate its contents, which causes CFStringGetCStringPtr to return NULL.

Michael_Ilseman · April 25, 2019, 8:43pm

I wasn't sure if he was still calling a split() somewhere or if he did the logic himself.

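On "<15 characters": the limit is <= 15 UTF-8 code units.

On "stack allocated": they're inlined directly into the struct's bits, i.e. Array.Element. They're "struct-allocated".

On saving allocation and reference counting time: correct, no allocations or reference counting.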

TellowKrinkle · April 25, 2019, 8:44pm

Hmm, I tried the += "" method to force a String(contentsOf:) String to native representation and it seems really slow.

David_Smith · April 25, 2019, 8:47pm

Hm. That looks extremely fixable! That should be using CFStringGetBytes().

TellowKrinkle · April 25, 2019, 8:48pm

Anyways, it appears string.utf8.split is fast as well:

case "ds8s": // 7.89 seconds
	let str = try Data(contentsOf: url).withUnsafeBytes { String(decoding: $0, as: UTF8.self) }
	_ = str.utf8.split(separator: UInt8(ascii: "\n"))

Is there a method to take those UTF8View subsequences and turn them back into Substrings?

Michael_Ilseman · April 25, 2019, 9:01pm

Just init it. Substring(utf8SubSequence).

It's a bad idea to do so if your UTF8View subsequence is splitting a scalar, and slightly dubious (but probably ok) if it's splitting a grapheme cluster.

Also, it's often better to just make an eager copy if small, as Substring participates in memory management of the larger String.
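A short sketch of both options mentioned above: wrapping each UTF8View slice in a Substring, and making an eager String copy so the small pieces do not keep the large original string alive. The sample text is made up, and splitting on an ASCII newline byte never cuts through a scalar.

```swift
let text = "1 2\n3 4\n5 6"

// Each element is a Substring.UTF8View that still refers to `text`'s storage.
let slices = text.utf8.split(separator: UInt8(ascii: "\n"))

// Option 1: view each slice as a Substring (shares storage with `text`).
let substrings = slices.map { Substring($0) }

// Option 2: eagerly copy each piece into its own String.
let copies = slices.map { String(Substring($0)) }
```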

David_Smith · April 25, 2019, 9:11pm

Untested PR up here: apple/swift pull request #24289, "SR-10555 foreignCopyUTF8 should do bulk access".

We'll see if any of the existing benchmarks hit this. If not I'll have to add one that does what you're doing.

ezfe · April 25, 2019, 9:18pm

I’m away from my computer but I can confirm that that was the technique and there was no more splitting in the code for the timings I mentioned.