Text streaming in the standard library


#1

I would like text streaming functionality in the Swift standard library. By text streaming I mean, for example, parsing a CSV file record by record while reading it little by little, without loading the whole contents into memory.

I actually write this kind of program often for my job. I use it to play back things like time-series coordinate data, constructed from video image processing, according to its time values. The data for each time point is large and the recording time is long, so I don't read everything into memory; I process it little by little.

Processing like this requires the following functional units. To keep things simple, I consider only UTF-8.

(1) Reading the file little by little, in byte units.
(2) Decoding Unicode code points from the byte stream as UTF-8, little by little. This turns the byte stream into a stream of Unicode code points.
(3) Decoding grapheme clusters from the code point stream, little by little. These correspond to the Character type in Swift.

Also, depending on the text format, peeking and seeking back may be needed for parsing. So each of the steps above needs to expose stream position information and a seeking function.
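As a sketch, the layers plus seeking might share an interface like the following. The protocol and type names here are purely illustrative, not an existing or proposed standard-library API.

```swift
// Hypothetical shape of one layer of the stream stack described above.
// All names are illustrative only.
protocol SeekableStream {
    associatedtype Element
    var position: Int { get }            // position in this layer's units
    mutating func next() -> Element?     // read one element; nil at end
    mutating func seek(to position: Int)
}

// A trivial in-memory byte stream standing in for layer (1).
struct ByteArrayStream: SeekableStream {
    let bytes: [UInt8]
    private(set) var position = 0

    init(bytes: [UInt8]) { self.bytes = bytes }

    mutating func next() -> UInt8? {
        guard position < bytes.count else { return nil }
        defer { position += 1 }
        return bytes[position]
    }

    mutating func seek(to position: Int) { self.position = position }
}
```

A layer-(2) stream would wrap a byte stream like this and expose Unicode.Scalar elements, and layer (3) would wrap that again to expose Character.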

Implementing these is hard with the current standard library. Let me explain one by one.

(1) The closest things here are in Foundation: NSInputStream and NSFileHandle.

The InputStream (Swift) / NSInputStream class has needlessly complicated functionality; it is hard to understand and not very useful.
https://developer.apple.com/documentation/foundation/inputstream

Without a seeking function, it is useless for processing text that requires peeking to parse.

Its functionality for working together with RunLoop is built in, but I want to manage that kind of driving control myself with DispatchQueue.

FileHandle (Swift) / NSFileHandle uses Objective-C exceptions for error handling. They cannot be caught from Swift, so it is effectively unusable.

https://developer.apple.com/documentation/foundation/filehandle/1413916-readdata

(2) The closest thing here is in the standard library: Unicode.UTF8.decode.

https://developer.apple.com/documentation/swift/unicode/utf8/2907346-decode

But it does not return positional information. Without the correct position, I cannot know where I am in byte units, so I cannot implement seeking at the Unicode code point level in later processing stages.
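For illustration, since decode itself reports no positions, about the best one can do for well-formed input is reconstruct byte offsets by summing each decoded scalar's UTF-8 width. A sketch of that workaround, which only holds while decoding succeeds:

```swift
// UTF-8 encoded width of a scalar, derived from its value.
func utf8Width(_ s: Unicode.Scalar) -> Int {
    switch s.value {
    case ..<0x80:    return 1
    case ..<0x800:   return 2
    case ..<0x10000: return 3
    default:         return 4
    }
}

var iterator = Array("héllo".utf8).makeIterator()
var codec = Unicode.UTF8()
var offset = 0
var scalars: [(Unicode.Scalar, Int)] = []  // (scalar, starting byte offset)
decoding: while true {
    switch codec.decode(&iterator) {
    case .scalarValue(let s):
        scalars.append((s, offset))
        offset += utf8Width(s)
    case .emptyInput, .error:
        break decoding
    }
}
// "h" starts at byte 0, "é" at byte 1, the first "l" at byte 3, …
```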

(3) This can be implemented with String, with considerable effort: repeatedly append each UnicodeScalar to a String, and whenever its character view contains two or more Characters, split off the first one; when the end of the stream arrives, read out all the remaining characters. It can be made to work this way, but the implementation is complex and the idea is not easy to come up with. And although it works after a fashion, I suspect it is inefficient because of the internal processing it triggers in String.
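The approach described above might be sketched like this (an illustrative type, not anything from the standard library):

```swift
// Assemble Characters by pushing scalars into a String and emitting the
// first grapheme cluster as soon as a second one appears.
struct GraphemeAssembler {
    private var buffer = ""

    /// Push one scalar; returns a Character once one is known to be complete.
    mutating func push(_ scalar: Unicode.Scalar) -> Character? {
        buffer.unicodeScalars.append(scalar)
        guard buffer.count > 1 else { return nil }
        return buffer.removeFirst()
    }

    /// At end of stream, drain the remaining Characters one by one.
    mutating func finish() -> Character? {
        buffer.isEmpty ? nil : buffer.removeFirst()
    }
}

var assembler = GraphemeAssembler()
var out: [Character] = []
for s in "e\u{301}x".unicodeScalars {   // "e" + combining acute + "x"
    if let c = assembler.push(s) { out.append(c) }
}
while let c = assembler.finish() { out.append(c) }
// out is ["é", "x"] — the combining accent was merged into one Character
```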

Even in this situation I wanted to implement correct text processing, so I built these requirements myself.

For (1), I wrapped the C fopen family.

For (2), I wrote a UTF-8 decoder.

For (3), I implemented the process described above.
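A minimal sketch of such an fopen wrapper, assuming byte-wise reads with ftell/fseek for position and seeking (illustrative names, not the actual implementation):

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Minimal byte-oriented file wrapper over the C fopen family.
final class ByteFile {
    private let file: UnsafeMutablePointer<FILE>

    init?(path: String) {
        guard let f = fopen(path, "rb") else { return nil }
        file = f
    }

    deinit { fclose(file) }

    /// Current byte offset in the file.
    var offset: Int { ftell(file) }

    /// Seek to an absolute byte offset.
    func seek(to offset: Int) {
        fseek(file, offset, SEEK_SET)
    }

    /// Read one byte; nil at end of file.
    func readByte() -> UInt8? {
        let c = fgetc(file)
        return c == EOF ? nil : UInt8(truncatingIfNeeded: c)
    }
}
```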

Because I wrote it myself, it is fine for my job for the moment. But I think this kind of text stream processing is common, and it would be good to have it in the standard library.

What do you think?

The Japanese version (actually the original) of this post is here


(Robert Muckle-Jones) #2

I was recently conducting a data processing task in Swift and wanted to be able to iterate over the lines of text in a file. I found this to be a hassle in Swift compared to Python. I ended up creating my own LineReader class to be used as follows.

let lineReader = try LineReader(file: f) //f is a file handle
for line in lineReader {
    //...
}

I had to search on StackOverflow to work out how to do it and I can certainly imagine a beginner struggling. One of the most beginner friendly features of Python is how easy it is to iterate over the contents of a file line by line.

# Python
with open("demofile.txt", 'r') as f:
    for line in f:
        # ...

I think it would be great if Swift could be almost as beginner friendly as Python for this particular task. The LineReader class I ended up using can be found here https://github.com/RMJay/LineReader.


(Alexis Gallagher) #3

I had the exact same experience recently. The awkwardness of just iterating through lines of text is remarkable.


(Happy Human Pointer) #4

I agree. I even had to resort to this Stack Overflow answer, and I have absolutely no idea how it works. And all I wanted to do was create my own bytecode.

Maybe, instead of creating separate classes for UTF-8, Character, etc., there could be functions on InputStream (or whatever it ends up being):

// data is "Hello"
stream.readByte()      // "H" as a UInt8 or CChar
stream.readScalar()    // "e" as a Unicode.Scalar
stream.readCharacter() // "l" as a Character

(Michael Ilseman) #5

Huge +1 to the general effort. My view is that the lack of file handles and text streaming support in the standard library is a glaring omission that makes using Swift for processing tasks needlessly obnoxious. This is definitely something that we should remedy as soon as we can.

We should provide a byte stream as well as equivalents of all of String's views on top of these streams (performing encoding validation), so that you can read a stream of validated UTF-8 bytes, or Unicode scalar values, or graphemes, or transcoded UTF-16 code units, etc. In the future, when we add normalized views as well, the stream should be able to support that too.

For example, when parsing CSV, you wouldn't want to operate on a stream of grapheme clusters, as grapheme segmentation is irrelevant to CSV and you wouldn't want to have to handle degenerate graphemes. You'd instead want to operate at or below the level the specification itself operates at, i.e. Unicode.Scalar or UTF8.CodeUnit.
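A small example of the degenerate-grapheme hazard: a CSV field consisting of a lone combining accent (illustrative data, not from the thread):

```swift
let row = "e,\u{301},x"  // middle field is a bare combining acute accent
// At the Character layer the accent combines with the preceding comma,
// so that separator is swallowed and the field count comes out wrong.
let byCharacter = row.split(separator: ",", omittingEmptySubsequences: false)
// At the Unicode.Scalar layer every comma is visible.
let byScalar = row.unicodeScalars.split(separator: ",", omittingEmptySubsequences: false)
print(byCharacter.count)  // 2 — one separator missed
print(byScalar.count)     // 3 — the expected three fields
```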


#6

A CSV parser may provide a way to customize the separator.
If I implement the parser at the Unicode.Scalar layer, the user cannot use 🇯🇵 as a separator.
But to a human, 🇯🇵 is naturally one character.
Even if there is no technical benefit, a Character or String interface is more natural for most programmers.

Also, at the Character layer, CR + LF is combined into a single newline Character.
That is useful for implementing something like readLine without two-character peeking.
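For example, CR + LF forms a single grapheme cluster in Swift:

```swift
let crlf = "\r\n"
print(crlf.count)                 // 1 — one Character at the grapheme layer
print(crlf.unicodeScalars.count)  // 2 — still two code points underneath
```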

Anyway, I agree that we need streams for every layer of text composition.
It would be best if we could select the appropriate stream for the task at hand.


(Jeremy David Giesbrecht) #7

Even with the standard , (comma) as the field separator, the user’s fields may start with a combining character. (It is even extremely common in certain contexts, such as in a CSV representing a keyboard layout.) The CSV source would then have occurrences of things like ,́,, where the field represents an acute accent. By working at the cluster (Character) level, your parser would miss the preceding commas, resulting in irregular, less‐than‐expected field counts and CSV source guts spilled into field values.

If you want to support multi‐scalar separators, then you have to do just that: support multi‐scalar separators.


#8

There are two acute accent code points.

U+0301 is the combining acute accent; U+00B4 is the (non-combining) acute accent.

let str1 = ",\u{301}"
print(str1.count) // 1

let str2 = ",\u{B4}"
print(str2.count) // 2

the user’s fields may start with a combining character

Why is that? People should use U+00B4 in such situations.


(Jeremy David Giesbrecht) #9

Because when the results are concatenated, they would be beside each other, not combined as intended. If I press a key labelled “e” followed by a key labelled “ ́”, I expect both fields to be looked up in the keyboard layout CSV file and the result concatenated giving me “x” + “́” = “x́”. But if the CSV contained U+B4, the result would be “x´”, which is categorically not what the user wanted.


#10

Thanks, I understand.
The data in a keyboard layout CSV directly means the code point that is produced when a keyboard built from that CSV is typed.
So in such a CSV, the parser must split columns by Unicode code point and keep an isolated combining character code point as is.


(Michael Ilseman) #11

Right, the choice of String being a collection of Character is a compromise favoring ease of use ("natural" as you say) over technical precision. However, if you want to implement a rigid technical specification, you'll certainly want to use one of the more technical views.

+1. Note that Character equality also follows canonical equivalence, again favoring a compromise towards ease of use / "natural". However, for your processing needs, you probably want technical precision by matching specific byte values.

For example, the Greek question mark is canonically equivalent to semi-colon, but is a different scalar:

";" == "\u{037e}" // true
(";" as Unicode.Scalar) == ("\u{037e}" as Unicode.Scalar) // false

There's a design question of whether such streams should be shared (class) or unique (move-only struct). Probably the latter, but we may want an intermediary solution while we await move-only structs.