I would like text streaming functionality in the Swift standard library. By text streaming I mean, for example, parsing a CSV file one record at a time while reading the file little by little, without loading its whole contents into memory.
I often write programs like this at work. I use them to play back time-series coordinate data, produced by video image processing, according to its time values. The data for each time point is large and the recordings are long, so I don't read everything into memory; I process it little by little.
This kind of processing needs the following functional units. To keep things simple, I consider only UTF-8.
(1) Reading a file in byte units, little by little.
(2) Decoding Unicode code points from the byte stream as UTF-8, little by little. This turns the byte stream into a stream of Unicode code points.
(3) Decoding grapheme clusters from the Unicode code point stream, little by little. A grapheme cluster is the Character type in Swift.
Depending on the text format, parsing also needs peeking and seeking back. So each of the steps above needs stream position information and a seek function.
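To make the position and seek requirement concrete, here is a minimal sketch. The names `SeekableStream`, `ByteStream`, `read`, `seek(to:)`, and `peek` are hypothetical and exist in no library; the byte source is an in-memory array standing in for a file, which is an assumption made only to keep the example short.

```swift
/// Hypothetical protocol: a stream that reports its position and can seek.
protocol SeekableStream {
    associatedtype Element
    /// Offset of the next element to be read.
    var position: Int { get }
    /// Reads one element, or returns nil at the end of the stream.
    mutating func read() -> Element?
    /// Moves the read position.
    mutating func seek(to position: Int)
}

extension SeekableStream {
    /// Peeking is just "read one element, then seek back".
    mutating func peek() -> Element? {
        let saved = position
        let element = read()
        seek(to: saved)
        return element
    }
}

/// In-memory stand-in for a file-backed byte stream (assumption).
struct ByteStream: SeekableStream {
    private let bytes: [UInt8]
    private(set) var position = 0

    init(_ bytes: [UInt8]) { self.bytes = bytes }

    mutating func read() -> UInt8? {
        guard position < bytes.count else { return nil }
        defer { position += 1 }
        return bytes[position]
    }

    mutating func seek(to position: Int) {
        // Clamp so that a bad offset cannot leave the buffer.
        self.position = min(max(position, 0), bytes.count)
    }
}
```

Once a stream exposes `position` and `seek(to:)`, peeking comes for free, which is why those two are needed at every stage rather than only at the byte level.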
Implementing these is hard with the current standard library. I will explain them one by one.
(1) For this, the closest things are in Foundation.
The NSInputStream class has wastefully complicated functionality. It is not useful and is hard to understand.
It has no seek function, so it is useless for processing text that requires peeking to parse.
Functionality for working together with RunLoop is built in, but I want to manage that kind of driving control myself.
NSFileHandle uses Objective-C exceptions for error handling. These cannot be handled from Swift, so it is completely unusable.
(2) For this, the closest thing is in the standard library.
But it does not return positional information. Without that, I can't know where each code point sits in byte units, so I can't implement seeking by Unicode code point in the later stages.
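As an illustration of what "returning positional information" could look like, here is a minimal sketch of a hand-written UTF-8 decoder that reports the byte offset of every scalar. `PositionedScalar` and `PositionedUTF8Decoder` are hypothetical names, the input is an in-memory array rather than a file, and the validation is deliberately incomplete: it replaces bad lead or continuation bytes with U+FFFD but does not reject every overlong form.

```swift
/// A decoded scalar together with where it started in the byte stream.
struct PositionedScalar {
    let scalar: Unicode.Scalar
    let byteOffset: Int
    let byteCount: Int
}

/// Hypothetical decoder over an in-memory buffer (assumption).
struct PositionedUTF8Decoder {
    let bytes: [UInt8]
    private(set) var byteOffset = 0

    init(_ bytes: [UInt8]) { self.bytes = bytes }

    /// Seeks to a byte offset previously reported in a PositionedScalar.
    mutating func seek(toByteOffset offset: Int) { byteOffset = offset }

    mutating func next() -> PositionedScalar? {
        guard byteOffset < bytes.count else { return nil }
        let start = byteOffset
        let lead = bytes[start]
        let count: Int
        var value: UInt32
        // The lead byte decides the sequence length and its payload bits.
        switch lead {
        case 0x00...0x7F:
            count = 1; value = UInt32(lead)
        case 0xC2...0xDF:
            count = 2; value = UInt32(lead & 0x1F)
        case 0xE0...0xEF:
            count = 3; value = UInt32(lead & 0x0F)
        case 0xF0...0xF4:
            count = 4; value = UInt32(lead & 0x07)
        default:
            return replacement(at: start)
        }
        // Each continuation byte must look like 10xxxxxx.
        for i in 1..<count {
            let index = start + i
            guard index < bytes.count, bytes[index] & 0xC0 == 0x80 else {
                return replacement(at: start)
            }
            value = (value << 6) | UInt32(bytes[index] & 0x3F)
        }
        // Rejects surrogates and values above U+10FFFF.
        guard let scalar = Unicode.Scalar(value) else {
            return replacement(at: start)
        }
        byteOffset = start + count
        return PositionedScalar(scalar: scalar, byteOffset: start, byteCount: count)
    }

    /// Emits U+FFFD for one bad byte and resumes at the next byte.
    private mutating func replacement(at start: Int) -> PositionedScalar {
        byteOffset = start + 1
        return PositionedScalar(scalar: "\u{FFFD}", byteOffset: start, byteCount: 1)
    }
}
```

Because every result carries its own `byteOffset`, a later stage can remember the offset of any code point and seek back to it in the byte stream.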
(3) This can be implemented with String, with much effort. Keep feeding code points in, and when the .characters property returns two or more characters, split off the first one. When the end of the stream comes, read out all the remaining .characters. It can be done this way, but I think the implementation is complex and the idea is not easy to come up with. And although it works after a fashion, I think it performs poorly because of the internal processing involved.
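The String-based approach described above can be sketched roughly as follows. This is not the actual implementation, only an illustration: `GraphemeSplitter`, `push`, and `finish` are made-up names, and the sketch assumes a newer Swift in which String is itself a collection of Character, so it uses `count` and `removeFirst()` where the post uses the `.characters` view.

```swift
/// Hypothetical helper: turns a stream of Unicode scalars into Characters.
struct GraphemeSplitter {
    /// Holds the grapheme cluster that may still grow.
    private var buffer = ""

    /// Feeds one scalar. Returns a Character once it is known to be complete,
    /// that is, once the buffer has grown to two characters.
    mutating func push(_ scalar: Unicode.Scalar) -> Character? {
        buffer.unicodeScalars.append(scalar)
        guard buffer.count > 1 else { return nil }
        return buffer.removeFirst()
    }

    /// Call at the end of the stream to read out whatever remains.
    mutating func finish() -> [Character] {
        defer { buffer = "" }
        return Array(buffer)
    }
}
```

A character can only be emitted after the following scalar has arrived, because a combining mark or a second regional indicator may still attach to it. The buffer is re-segmented on every `push`, which is the kind of repeated internal work that makes this approach look inefficient.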
Even in this situation I want to implement correct text processing, so I built what was required myself.
About (1), I wrapped C
About (2), I wrote a UTF-8 decoder.
About (3), I implemented the process described above.
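For the byte-level reader in (1), one possible shape is sketched below. The post does not say which C API was wrapped, so this assumes the C standard I/O functions `fopen`, `fread`, `fseek`, and `ftell`, reached through `import Foundation`. `FileByteStream` and its members are hypothetical names.

```swift
import Foundation

/// Hypothetical wrapper around a C FILE pointer (assumes C stdio).
final class FileByteStream {
    private let file: UnsafeMutablePointer<FILE>

    /// Returns nil when the file cannot be opened.
    init?(path: String) {
        guard let opened = fopen(path, "rb") else { return nil }
        file = opened
    }

    deinit { fclose(file) }

    /// Current byte offset from the start of the file.
    var position: Int { return Int(ftell(file)) }

    /// Moves to an absolute byte offset; returns false on failure.
    func seek(to offset: Int) -> Bool {
        return fseek(file, offset, SEEK_SET) == 0
    }

    /// Reads up to maxCount bytes; a short or empty result means end of file
    /// or a read error.
    func read(maxCount: Int) -> [UInt8] {
        var buffer = [UInt8](repeating: 0, count: maxCount)
        let readCount = fread(&buffer, 1, maxCount, file)
        return Array(buffer[..<readCount])
    }
}
```

Errors come back as ordinary return values (nil, false, or a short read), so nothing here depends on Objective-C exceptions or on RunLoop.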
Because I built these myself, I am fine at work for the moment. But I think text stream processing like this is common, and it would be good to have it in the standard library.
What do you think?
Japanese translation (actually original) of this post is here