My getline() code will put them together in one line; it will not "split" them. It's not fgets' fault.
Sorry, I was wrong about this. After the two parts are concatenated, the result does not look right in my test.
You can trim the whitespace after you read a line of text, or you can set up a separate pre-processing pass that validates line delimiters before reading the file, or you can combine the two. General-purpose programming languages serve general purposes; programmers write code for their own specific needs.
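For example, a tiny Swift sketch of trimming after the read (the sample line is made up):

import Foundation

let rawLine = "  GET /index.html HTTP/1.1 \r"   // a made-up line with stray whitespace
let trimmed = rawLine.trimmingCharacters(in: .whitespacesAndNewlines)
// trimmed == "GET /index.html HTTP/1.1"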
Memory-mapped files are awfully convenient, but they will not always
be faster than buffered reads (or writes), especially if you are just
streaming the file. Your file may occupy physical memory and the data
cache in excess of what your app needs, and your app's virtual-memory
footprint will be way larger, so you may incur TLB misses in excess of a
buffered implementation. All of that can drag down the performance of
your app as well as the total system.
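To make that concrete, here's a minimal Swift sketch of the same byte-counting job done both ways (the function names and the 64 KB buffer size are my own choices, not anything from this thread):

import Foundation

// Memory-mapped: the whole file is mapped, and touching every byte pages
// it all into physical memory and the data cache.
func countNewlinesMapped(at url: URL) throws -> Int {
    let data = try Data(contentsOf: url, options: .alwaysMapped)
    return data.reduce(0) { $1 == UInt8(ascii: "\n") ? $0 + 1 : $0 }
}

// Buffered streaming: only a fixed 64 KB buffer is resident at any moment,
// no matter how large the file is.
func countNewlinesBuffered(at url: URL) throws -> Int {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }
    var count = 0
    while let chunk = try handle.read(upToCount: 64 * 1024), !chunk.isEmpty {
        count += chunk.reduce(0) { $1 == UInt8(ascii: "\n") ? $0 + 1 : $0 }
    }
    return count
}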
Dave
I’m learning a lot about memory mapping — thanks all!
I do worry we’re wandering a bit far afield — the approaches to reading files obviously vary with how many files will be open at once / how large the files could potentially be / how important it is to be fast / how important it is to never ever fail, so we’re in a situation where all of us can be right at the same time given our different assumptions of the input conditions.
Also we’re getting into a discussion that’s language-agnostic — most of the approaches suggested for large files are the same in pretty much any language. (Although I do appreciate the pointers to Swift-specific solutions.)
Being me, I always advocate for the simplest solution first, then adding code & complexity if it proves insufficient in testing. There are those who argue one should always code against a worst-case future, and I can respect that point of view as well, but it’s not one I advocate unless you’re writing a multi-use framework.
-Wil
I’m not clear on how memory mapping is worse than reading a line at a time from a file handle in the case where a file disappears? Seems like both would fail.
-W
If an open file is suddenly removed, usually you get an error indicator that you can trap or test against. If a memory-mapped file goes away, you get a memory access crash, and you can rarely trap those or even identify it. It's like a disk crash.
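A minimal sketch of that difference, assuming a file on a removable volume (the path and buffer size are placeholders):

import Foundation

// Placeholder path for a file on an external volume.
let url = URL(fileURLWithPath: "/Volumes/External/big.log")

// Buffered read: if the volume goes away mid-read, the call throws an
// error that the catch block can handle.
do {
    let handle = try FileHandle(forReadingFrom: url)
    while let chunk = try handle.read(upToCount: 1 << 20), !chunk.isEmpty {
        // process chunk
    }
} catch {
    print("read failed, recovering: \(error)")
}

// Mapped read: the real I/O happens when a page is first touched. If the
// backing volume has vanished by then, the process takes a memory access
// crash (SIGBUS) rather than throwing anything catchable.
if let mapped = try? Data(contentsOf: url, options: .alwaysMapped) {
    _ = mapped.last   // touching the bytes is where the crash would occur
}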
Are you sure that a Swift program on macOS or Linux will behave that way
when you remove a file? The baked-in BSD UNIX and Linux behavior is
that a file's blocks hang around until no process has the file open or
mapped. Are the unlink semantics on iOS different than that?
Dave
Most of the time, no, the file can't be removed until it's closed or unmapped. However, there are edge cases that sometimes cause it to happen (I can't remember them off-hand, and they may have been fixed; it's been a while). The point, though, is that the reaction to an anomalous error differs between open files and mapped files.
The other difference is that mapped files count against your virtual memory usage, and against the system's virtual memory usage; open files, not so much. That may not be a problem for smaller files, but once you get to tens of gigabytes it starts to affect performance. Those are the types of files the OP is concerned with.
The standard case was identified by @eskimo above: when the file you mapped is not on the root volume. In this case the volume can become inaccessible, and so naturally I/O on the file will fail. This cannot be prevented by the OS in many cases: while the OS can refuse to unmount a drive, it cannot prevent that drive being yanked out of the bus to which it is attached!
This is literally what I needed but wondered if there was a “hidden” swift way of doing it. Now that I know there is not, I will use your solution. I have always found your answers to be best (and I’ve been following you for some time)!
Thanks to async sequences in Swift 5.5 and above, this is now possible:
for try await line in url.lines {
    // do something
}
It's also really fast.
Is that based on comparisons / benchmarks / profiling, or more anecdotal?
I ask because I (and others) have observed it being actually very slow, due to tremendous overheads in Swift Task management etc. But I'm still wondering why and if there's something I can do (or not do) that will fix that.
I'm extrapolating a little as I've only benchmarked AsyncBytes, which is what's beneath url.lines. When building a binary-delimited protobuf stream parser on top of AsyncBytes (https://github.com/apple/swift-protobuf/pull/1434) I initially had the impression AsyncBytes was very slow, which is true in debug, but the release build was over 16x faster. I managed to get an AsyncBytes-based protobuf parser to be 25% faster than my best attempts with aggressive read-ahead buffering, and 5x faster than the default InputStream-based parser. I wonder if you've also tested debug vs release speeds?
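For reference, a rough sketch of iterating the raw bytes directly via url.resourceBytes, the URL.AsyncBytes sequence that url.lines wraps (this toy function is mine, not something from the linked PR):

import Foundation

@available(macOS 12.0, *)
func countNewlines(at url: URL) async throws -> Int {
    var count = 0
    for try await byte in url.resourceBytes {
        if byte == UInt8(ascii: "\n") { count += 1 }
    }
    return count
}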
Of course; I only worry about release builds when it comes to performance.
Though, I have not yet dug into the profiles or Swift stdlib source… I'm a bit reluctant to given that the profiles were pretty obtuse (lots of symptomatic noise, like unnecessary retain/release traffic, but no hint of a signal i.e. root cause).
I think I did actually download the Swift sources with the intent to build a symbolicated version, in order to investigate further, but then I ran into issues getting Swift to build, so I put that particular yak aside.
I vaguely recall someone saying (here in the forums) that it's kind of a known thing that AsyncSequences (and Tasks more generally) can have non-trivial unnecessary overhead right now - there's apparently a bunch of optimisations anticipated but not yet performed. I think it's an area - like Regexes or some of the String methods - that would really welcome anyone willing to dive in a bit and optimise the code.
Interesting. I just knocked something up quickly using your readline approach as I couldn't get your example to compile. I'm getting around 26s with URL.lines and 20s with readLine to count lines in the synthetic log file your code created (around 420MB). I'm on an M2 Max.
Given URL.lines is cross-platform, baked into Swift, and trivial to use, whereas the readLine approach is non-obvious (at least to me), I think URL.lines isn't doing badly.
import Foundation

if #available(macOS 13, *) {
    // Assumes a large log file at ~/Downloads/access.log.
    let url = FileManager.default.homeDirectoryForCurrentUser.appending(path: "Downloads/access.log")
    let date = Date()
    // try await asyncLinesPerf(url: url)
    readLinesPerf(url: url)
    print(Date().timeIntervalSince(date))
}

@available(macOS 12.0, *)
func asyncLinesPerf(url: URL) async throws {
    var count = 0
    let filtered = url.lines
        .filter { $0.contains("26/Apr") }
        .filter { $0.contains("\"GET ") }
    for try await _ in filtered {
        count += 1
    }
    print("URL.lines: \(count)")
}

func readLinesPerf(url: URL) {
    // Redirect stdin to the file so readLine() reads from it.
    guard freopen(url.path, "r", stdin) != nil else {
        exit(EXIT_FAILURE)
    }
    let matchedLines = sequence(state: 0) { _ in readLine() }
        .filter { $0.contains("26/Apr") }
        .filter { $0.contains("\"GET ") }
    var count = 0
    for _ in matchedLines {
        count += 1
    }
    print("Readlines: \(count)")
}