I would like text streaming functionality in the Swift standard library. By text streaming I mean, for example, parsing a CSV file one record at a time while reading the file little by little, without loading its whole contents into memory.
I often write programs like this in my job. I use them to play back data such as time-series coordinates, produced by video image processing, according to their time values. The data for each time point is large and the recordings are long, so I do not read everything into memory; I process it little by little.
This kind of processing requires the following functional units. To keep things simple, I consider only UTF-8.
(1) Reading a file little by little, in units of bytes.
(2) Decoding Unicode code points from the byte stream as UTF-8, little by little. This turns the byte stream into a stream of Unicode code points.
(3) Decoding grapheme clusters from the Unicode code point stream, little by little. A grapheme cluster is the Character type in Swift.
Also, depending on the text format, parsing needs peeking and seeking back. So each of the steps above needs stream position information and a seek function.
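To make the peek and seek-back requirement concrete, here is a minimal sketch. The names SeekableStream and ArrayStream are hypothetical and not part of any library; the point is only that peeking falls out for free once a stream exposes its position and can seek.

```swift
/// Hypothetical protocol: a stream that knows its position and can seek.
protocol SeekableStream {
    associatedtype Element
    var position: Int { get }
    mutating func read() -> Element?
    mutating func seek(to position: Int)
}

extension SeekableStream {
    /// Peek is just "read, then seek back to where we were".
    mutating func peek() -> Element? {
        let saved = position
        let element = read()
        seek(to: saved)
        return element
    }
}

/// In-memory stand-in for a file, used only for illustration.
struct ArrayStream<Element>: SeekableStream {
    let storage: [Element]
    private(set) var position = 0

    init(_ storage: [Element]) { self.storage = storage }

    mutating func read() -> Element? {
        guard position < storage.count else { return nil }
        defer { position += 1 }
        return storage[position]
    }

    mutating func seek(to position: Int) {
        // Clamp so that an out-of-range request cannot break later reads.
        self.position = min(max(position, 0), storage.count)
    }
}
```

The same protocol shape could be used at the byte, code point, and Character levels, as long as each level can map its own positions back to positions in the level below.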
Implementing these is hard with the current standard library. I will explain them one by one.
(1) The closest things are in Foundation: NSInputStream and NSFileHandle.
The InputStream (Swift) / NSInputStream class has needlessly complicated functionality. It is not useful and is hard to understand.
https://developer.apple.com/documentation/foundation/inputstream
It has no seek function, so it is useless for text that requires peeking to parse.
It also has built-in functionality for working together with RunLoop. But I want to manage that kind of scheduling myself, with DispatchQueue.
FileHandle (Swift) / NSFileHandle uses Objective-C exceptions for error handling. Swift cannot catch them, so it is of no use at all.
https://developer.apple.com/documentation/foundation/filehandle/1413916-readdata
(2) The closest thing is in the standard library: Unicode.UTF8.decode.
https://developer.apple.com/documentation/swift/unicode/utf8/2907346-decode
But it does not return positional information. Without knowing how many bytes each code point consumed, I cannot know its position in bytes. So I cannot implement seeking in code point units in the later stages.
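As an illustration of what positional information could look like, here is a minimal sketch of a hand-written UTF-8 decoder that reports the byte offset and byte length of every scalar. PositionedScalar and PositionedUTF8Decoder are hypothetical names, not standard library API, and replacing malformed input with U+FFFD is an assumption made for the sketch.

```swift
/// One decoded scalar together with the bytes it came from.
struct PositionedScalar {
    let scalar: Unicode.Scalar
    let byteOffset: Int   // offset of the first byte of this scalar
    let byteLength: Int   // number of bytes consumed for it
}

/// Hypothetical decoder that pulls bytes from any iterator and tracks offsets.
struct PositionedUTF8Decoder<Bytes: IteratorProtocol> where Bytes.Element == UInt8 {
    private var bytes: Bytes
    private var pushedBack: UInt8? = nil
    private(set) var offset = 0

    init(_ bytes: Bytes) { self.bytes = bytes }

    private mutating func readByte() -> UInt8? {
        if let b = pushedBack { pushedBack = nil; offset += 1; return b }
        guard let b = bytes.next() else { return nil }
        offset += 1
        return b
    }

    private mutating func unread(_ b: UInt8) { pushedBack = b; offset -= 1 }

    /// Returns the next scalar, or nil at end of input.
    /// Malformed or truncated sequences are reported as U+FFFD (an assumption).
    mutating func next() -> PositionedScalar? {
        let start = offset
        guard let first = readByte() else { return nil }
        let replacement: Unicode.Scalar = "\u{FFFD}"

        let extra: Int        // number of continuation bytes expected
        var value: UInt32     // bits collected so far
        let minimum: UInt32   // smallest value allowed (rejects overlong forms)
        switch first {
        case 0x00...0x7F: extra = 0; value = UInt32(first); minimum = 0
        case 0xC2...0xDF: extra = 1; value = UInt32(first & 0x1F); minimum = 0x80
        case 0xE0...0xEF: extra = 2; value = UInt32(first & 0x0F); minimum = 0x800
        case 0xF0...0xF4: extra = 3; value = UInt32(first & 0x07); minimum = 0x10000
        default:
            return PositionedScalar(scalar: replacement, byteOffset: start, byteLength: 1)
        }
        for _ in 0..<extra {
            guard let b = readByte() else {
                return PositionedScalar(scalar: replacement, byteOffset: start,
                                        byteLength: offset - start)
            }
            guard b & 0xC0 == 0x80 else {
                unread(b)   // not a continuation byte: leave it for the next call
                return PositionedScalar(scalar: replacement, byteOffset: start,
                                        byteLength: offset - start)
            }
            value = (value << 6) | UInt32(b & 0x3F)
        }
        // Unicode.Scalar(UInt32) returns nil for surrogates and out-of-range values.
        let scalar = value >= minimum ? Unicode.Scalar(value) : nil
        return PositionedScalar(scalar: scalar ?? replacement, byteOffset: start,
                                byteLength: offset - start)
    }
}
```

With the offset of each scalar recorded, a later stage can seek by code point by asking the byte-level reader to seek to the recorded byte offset.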
(3) This can be implemented with String, with a lot of effort. Append UnicodeScalar values to a String one at a time; if its .characters property comes to hold two or more characters, split off the first one. When the end of the stream comes, read out everything remaining in .characters. It can be done this way, but I think the implementation is complex and the idea is not easy to come up with. And although it works after a fashion, I think it is inefficient because of the internal processing in String.
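The approach described above can be sketched roughly as follows. GraphemeAssembler is a hypothetical name. The sketch assumes Swift 4 or later, where String is itself a collection of Character, so it uses the string's count directly instead of .characters.

```swift
/// Hypothetical helper: builds Characters from a stream of Unicode scalars.
struct GraphemeAssembler {
    private var buffer = ""

    /// Feed one scalar. Returns the Characters that are now known to be complete.
    mutating func push(_ scalar: Unicode.Scalar) -> [Character] {
        buffer.unicodeScalars.append(scalar)
        var completed: [Character] = []
        // Once the buffer holds two or more Characters, the first one is complete,
        // because the newly appended scalar has started a new grapheme cluster.
        while buffer.count >= 2 {
            completed.append(buffer.removeFirst())
        }
        return completed
    }

    /// Call at end of stream to read out whatever remains in the buffer.
    mutating func finish() -> [Character] {
        let rest = Array(buffer)
        buffer = ""
        return rest
    }
}
```

Every push re-counts the Characters in the buffer, which is the kind of repeated internal work in String that makes this approach look inefficient, even though the buffer stays small.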
Even in this situation I wanted to implement correct text processing, so I built what I needed myself.
For (1), I wrapped the C fopen family.
For (2), I wrote a UTF-8 decoder.
For (3), I implemented the process described above.
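For reference, a wrapper around the C fopen family for step (1) might look roughly like the sketch below. ByteFileReader is a hypothetical name and this is not the actual wrapper; it only shows that fread, ftell, and fseek are enough to read in small pieces, report the position, and seek.

```swift
import Foundation

/// Hypothetical minimal wrapper around the C stdio calls.
final class ByteFileReader {
    private let file: UnsafeMutablePointer<FILE>

    /// Fails (returns nil) if the file cannot be opened for reading.
    init?(path: String) {
        guard let f = fopen(path, "rb") else { return nil }
        file = f
    }

    deinit { fclose(file) }

    /// Reads up to `count` bytes. An empty result means end of file or an error.
    func read(upTo count: Int) -> [UInt8] {
        var buffer = [UInt8](repeating: 0, count: count)
        let n = fread(&buffer, 1, count, file)
        buffer.removeLast(count - n)   // drop the part that was not filled
        return buffer
    }

    /// Current byte offset from the start of the file.
    var position: Int { Int(ftell(file)) }

    /// Moves to an absolute byte offset. Returns false if fseek fails.
    @discardableResult
    func seek(to offset: Int) -> Bool {
        fseek(file, offset, SEEK_SET) == 0
    }
}
```

A real wrapper would likely want to distinguish end of file from read errors (for example with feof and ferror) and report them as Swift errors; the sketch leaves that out to stay short.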
Since I built these myself, things are fine in my job for the moment. But I think this kind of text stream processing is common. It would be good to have it in the standard library.
What do you think?
Japanese translation (actually original) of this post is here