Newlines in Text Streams


(Ben Rimmington) #1

The readLine and print APIs currently use LF ("\n") as the newline character. But other platforms use different newlines, e.g. CRLF ("\r\n") on Windows and NEL on EBCDIC-based platforms such as z/OS.

Abbrev  Alias                Code Point
LF      LINE FEED            U+000A
VT      VERTICAL TABULATION  U+000B
FF      FORM FEED            U+000C
CR      CARRIAGE RETURN      U+000D
NEL     NEXT LINE            U+0085
LS      LINE SEPARATOR       U+2028
PS      PARAGRAPH SEPARATOR  U+2029
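
As a quick check, each code point in the table above, plus the CRLF pair, can be tested against the standard library's Character.isNewline. This is only a small sketch; the `candidates` array is a name made up for illustration.

```swift
// Each entry pairs an abbreviation from the table with its character.
// "\r\n" is a single Character in Swift because CR+LF forms one grapheme cluster.
let candidates: [(name: String, character: Character)] = [
    ("LF", "\u{000A}"),
    ("VT", "\u{000B}"),
    ("FF", "\u{000C}"),
    ("CR", "\u{000D}"),
    ("CRLF", "\r\n"),
    ("NEL", "\u{0085}"),
    ("LS", "\u{2028}"),
    ("PS", "\u{2029}"),
]

// Print whether the standard library treats each candidate as a newline.
for (name, character) in candidates {
    print(name, character.isNewline)
}
```

According to the documentation of Character.isNewline, all of these should report true, which is a wider set than the LF and CRLF that readLine recognizes today.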

I've created apple/swift#21586 to recognize Unicode newlines in readLine. This follows the recommendations in §5.8 Newline Guidelines of the Unicode Standard:

The acronym NLF (newline function) stands for the generic control function for indication of a new line break. It may be represented by different characters, depending on the platform... [CR, LF, CRLF, or NEL].

Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpretation. Only on output is it necessary to distinguish between them.

R4 A readline function should stop at NLF, LS, FF, or PS. In the typical implementation, it does not include the NLF, LS, PS, or FF that caused it to stop.
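
Requirement R4 can be sketched on an in-memory string. The names `r4Terminators` and `splitLinesR4` are hypothetical and are not part of the standard library or of the pull request; the sketch only illustrates stopping at NLF (CR, LF, CRLF, NEL), LS, FF, or PS. It leaves VT alone because R4 does not list it.

```swift
// Terminators named by R4: NLF (CR, LF, CRLF, NEL), LS, FF, and PS.
// VT (U+000B) is intentionally absent because R4 does not mention it.
let r4Terminators: Set<Character> = [
    "\n", "\r", "\r\n", "\u{0085}", "\u{2028}", "\u{000C}", "\u{2029}",
]

// Hypothetical helper: splits text at each R4 terminator and drops the terminator.
func splitLinesR4(_ text: String) -> [String] {
    var lines = text
        .split(omittingEmptySubsequences: false) { r4Terminators.contains($0) }
        .map { String($0) }
    // A terminator at the very end closes the last line; it does not start a new, empty one.
    if lines.last == "" { lines.removeLast() }
    return lines
}

print(splitLinesR4("one\r\ntwo\u{2028}three\u{0085}four\n"))
```

Because Swift treats "\r\n" as one Character, the CRLF pair is consumed as a single terminator without any lookahead logic. A byte-oriented implementation would have to handle that case explicitly.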

Questions:

  • Do we want readLine to recognize Unicode newlines? (cf. Character.isNewline). Or should we close SR-1280 as "Won't Do"?

  • Should the default terminator of the print API change from LF to NLF? Or can we depend on a translating text mode for Windows and other platforms?

  • Would a TextInputStream protocol be useful? And a readLine with an additional from: parameter?
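
To make the last question concrete, here is one possible shape for such a protocol. Everything in it is hypothetical: TextInputStream, its read() requirement, StringInputStream, and the readLine(from:strippingNewline:) overload are invented for this sketch by analogy with TextOutputStream and print(_:to:). None of them exist in the standard library.

```swift
// Hypothetical protocol, mirroring TextOutputStream on the input side.
protocol TextInputStream {
    // Returns the next character, or nil at end of input.
    mutating func read() -> Character?
}

// Hypothetical in-memory stream, handy for testing.
struct StringInputStream: TextInputStream {
    private var remaining: Substring
    init(_ text: String) { remaining = Substring(text) }
    mutating func read() -> Character? { remaining.popFirst() }
}

// Hypothetical overload with a from: parameter. Stops at any Character.isNewline.
func readLine<Source: TextInputStream>(
    from source: inout Source,
    strippingNewline: Bool = true
) -> String? {
    var line = ""
    while let character = source.read() {
        if character.isNewline {
            if !strippingNewline { line.append(character) }
            return line
        }
        line.append(character)
    }
    // End of input: return what was read, or nil if there was nothing left.
    return line.isEmpty ? nil : line
}

var input = StringInputStream("first\r\nsecond\u{2028}third")
while let line = readLine(from: &input) {
    print(line)
}
```

The in-memory stream hands over whole Character values, so CRLF arrives as one unit. A stream backed by a file descriptor would need its own decoding and buffering, which this sketch does not attempt.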


(Jeremy David Giesbrecht) #2

Normally I would say yes, but...

...when there is more than one possible newline, it can be necessary to know which one was encountered. Is this just another line (LS) or the last one of the paragraph (PS)? If the encountered newline character is not easily discoverable, then such a function is not particularly useful in a Unicode setting anyway. Recognizing only one newline would at least allow you to reason about what was stripped from between the lines.

readLine is rarely useful for natural language text anyway. When trying to parse or scan through source code or data (I’m thinking of CSV or YAML), I can imagine the caller would need it to match the behaviour of the particular language specification. A readLine that supports LS called on a Swift file would give numbers out of sync with #line, since LS can appear in comments and string literals, but is not a newline as far as Swift is concerned. Maybe the best solution is a new parameter (with a default value) which would allow the caller to specify which one(s) they want to watch for.


(Ben Rimmington) #3

You can use readLine(strippingNewline: false) so that any newline is preserved in the result.
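
With the terminator preserved, the caller can peel it off and inspect it. In this minimal sketch `splitTerminator` is a hypothetical helper, and the sample strings stand in for values that readLine(strippingNewline: false) might return if it recognized Unicode newlines.

```swift
// Hypothetical helper: separates a line from the single newline character ending it.
// The terminator is nil when the line does not end in a newline (e.g. at end of input).
func splitTerminator(_ line: String) -> (content: Substring, terminator: Character?) {
    guard let last = line.last, last.isNewline else {
        return (Substring(line), nil)
    }
    return (line.dropLast(), last)
}

let samples = ["same paragraph\u{2028}", "end of paragraph\u{2029}", "windows\r\n", "no newline"]
for sample in samples {
    let (content, terminator) = splitTerminator(sample)
    if let terminator = terminator {
        // Show the terminator as hexadecimal scalar values, since it is not printable.
        let scalars = terminator.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
        let endsParagraph = terminator == "\u{2029}"
        print(content, "ended with", scalars, "paragraph end:", endsParagraph)
    } else {
        print(content, "had no terminator")
    }
}
```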


(Jeremy David Giesbrecht) #4

You’re right. What a dumb thing to forget...


(Ben Rimmington) #5

At this point, it would need to be an overload of readLine with an extra required parameter. Or perhaps a different API which can stop on any given set of characters (i.e. not restricted to newlines). For example, see getdelim and getline.
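
The getdelim idea generalizes to a set of stop characters. Here is a rough sketch over any iterator of Character values. `readUntil` and its parameter labels are hypothetical, and a real API would read from a file handle instead of an in-memory iterator.

```swift
// Hypothetical getdelim-like helper: consumes characters up to and including the
// first one found in `delimiters`, and reports which delimiter (if any) ended the read.
func readUntil<Source: IteratorProtocol>(
    anyOf delimiters: Set<Character>,
    from source: inout Source
) -> (text: String, delimiter: Character?)? where Source.Element == Character {
    var text = ""
    while let character = source.next() {
        if delimiters.contains(character) {
            return (text, character)
        }
        text.append(character)
    }
    // Input ran out: nil if nothing was read, otherwise the trailing text without a delimiter.
    return text.isEmpty ? nil : (text, nil)
}

var fields = "name,age;Ada,36\n".makeIterator()
while let (text, delimiter) = readUntil(anyOf: [",", ";", "\n"], from: &fields) {
    print(text, delimiter.map { String(reflecting: $0) } ?? "end of input")
}
```

Returning the delimiter alongside the text plays the same role as strippingNewline: false: the caller can always tell what ended the read.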

Swift on z/OS has extra APIs for codepage transcoding:

  • zOSSwift.readLine(strippingNewline:encodingFrom:) (but I don't know if it also converts NEL to LF).

  • various stdout and stderr wrappers conforming to TextOutputStream (for the to: parameter of the standard print API).
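
On the output side, a custom TextOutputStream can translate the LF that print emits into a platform's NLF. In this minimal sketch `NewlineTranslatingStream` is a hypothetical type (it is not one of the z/OS wrappers), and it collects output in a string instead of writing to stdout so the result is easy to inspect.

```swift
// Hypothetical stream that rewrites each LF written through it into a chosen NLF.
struct NewlineTranslatingStream: TextOutputStream {
    let newline: String
    private(set) var output = ""

    init(newline: String) { self.newline = newline }

    mutating func write(_ string: String) {
        for character in string {
            // "\r\n" is a single Character distinct from "\n",
            // so CRLF pairs already present in the input pass through unchanged.
            if character == "\n" {
                output += newline
            } else {
                output.append(character)
            }
        }
    }
}

var stream = NewlineTranslatingStream(newline: "\r\n")
print("first", to: &stream)
print("second", to: &stream)
// stream.output now holds "first\r\nsecond\r\n"
```

This keeps the default terminator of print as LF and moves the platform decision into the stream, which is roughly what a translating text mode would do.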


(Jeremy David Giesbrecht) #6

Yeah, that’s what I figured. The two strategies would be indistinguishable from the call site anyway, so it doesn’t really matter which way. Implementing it can easily wait and be done later (or never if the topic never comes up again).

Regarding the original questions:

Given the presence of strippingNewline, following the Unicode standard is probably the most reasonable default.

The “best” thing to do would be to make the default terminator LS, since that is why Unicode introduced it. It would confuse the whole world though, so the lofty ideal crashes and burns in the backwards‐compatibility department.