Newlines in Text Streams

benrimmington · January 8, 2019, 6:53pm

The readLine and print APIs currently use LF ("\n") as the newline character. But there are other possible newlines on different platforms, e.g. CRLF ("\r\n") on Windows; NEL on EBCDIC-based platforms such as z/OS.

Abbrev	Alias	Code Point
LF	LINE FEED	U+000A
VT	VERTICAL TABULATION	U+000B
FF	FORM FEED	U+000C
CR	CARRIAGE RETURN	U+000D
NEL	NEXT LINE	U+0085
LS	LINE SEPARATOR	U+2028
PS	PARAGRAPH SEPARATOR	U+2029

I've created apple/swift#21586 to recognize Unicode newlines in readLine. This follows the recommendations in §5.8 Newline Guidelines of the Unicode Standard:

The acronym NLF (newline function) stands for the generic control function for indication of a new line break. It may be represented by different characters, depending on the platform... [CR, LF, CRLF, or NEL].

Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpretation. Only on output is it necessary to distinguish between them.

R4 A readline function should stop at NLF, LS, FF, or PS. In the typical implementation, it does not include the NLF, LS, PS, or FF that caused it to stop.
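A minimal sketch of what guideline R4 could look like over an in-memory string. The names `r4Terminators` and `unicodeLines` are hypothetical and are not taken from the pull request; the sketch treats CRLF as one terminator and leaves VT alone, since R4 does not list it.

```swift
// Hypothetical helper, not a standard library API.
// Terminators from R4: NLF (CR, LF, CRLF, NEL), FF, LS, and PS.
let r4Terminators: Set<Unicode.Scalar> = [
    "\r", "\n", "\u{0C}", "\u{85}", "\u{2028}", "\u{2029}",
]

func unicodeLines(_ text: String) -> [String] {
    var lines: [String] = []
    var current = String.UnicodeScalarView()
    var previousWasCR = false
    for scalar in text.unicodeScalars {
        if scalar == "\n" && previousWasCR {
            // Second half of CRLF: the CR already ended the line.
            previousWasCR = false
        } else if r4Terminators.contains(scalar) {
            lines.append(String(current))
            current.removeAll()
            previousWasCR = (scalar == "\r")
        } else {
            current.append(scalar)
            previousWasCR = false
        }
    }
    // Keep a final line that has no terminator.
    if !current.isEmpty { lines.append(String(current)) }
    return lines
}

print(unicodeLines("a\r\nb\u{2028}c\u{85}d\n")) // ["a", "b", "c", "d"]
```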

Questions:

1. Do we want readLine to recognize Unicode newlines? (cf. Character.isNewline). Or should we close SR-1280 as "Won't Do"?
2. Should the default terminator of the print API change from LF to NLF? Or can we depend on a translating text mode for Windows and other platforms?
3. Would a TextInputStream protocol be useful? And a readLine with an additional from: parameter?

SDGGiesbrecht · January 8, 2019, 8:00pm

Normally I would say yes, but...

...when there is more than one it could be, it can be necessary to know which one it is. Is this just another line (LS) or the last one of the paragraph (PS)? If the encountered newline character is not easily discoverable, then such a function is not particularly useful in a Unicode setting anyway. Recognizing only one newline would at least allow you to reason about what was stripped from between the lines.

readLine is rarely useful for natural language text anyway. When trying to parse or scan through source code or data (I’m thinking of CSV or YAML), I can imagine the caller would need it to match the behaviour to that of the particular language specification. A readLine that supports LS called on a Swift file would give numbers out of sync with #line, since LS can appear in comments and string literals, but is not a newline as far as Swift is concerned. Maybe the best solution is a new parameter (with a default value) which would allow the caller to specify which one(s) they want to watch for.

benrimmington · January 8, 2019, 8:20pm

You can use readLine(strippingNewline: false) so that any newline is preserved in the result.
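As a rough illustration, a caller could recover the terminator from an unstripped line like this. `splitTerminator` is a hypothetical helper, and it only looks for the LF and CRLF endings that the current readLine stops at.

```swift
// Hypothetical helper: separates a line returned by
// readLine(strippingNewline: false) into its content and its terminator.
func splitTerminator(_ line: String) -> (content: String, terminator: String) {
    var scalars = line.unicodeScalars
    // No trailing LF: this was the last line, ended by end-of-file.
    guard scalars.last == "\n" else { return (line, "") }
    scalars.removeLast()
    if scalars.last == "\r" {
        scalars.removeLast()
        return (String(scalars), "\r\n")
    }
    return (String(scalars), "\n")
}

let parts = splitTerminator("field\r\n")
print(parts.content)                    // field
print(parts.terminator == "\r\n")       // true
```

In a real program the argument would come from a loop such as `while let line = readLine(strippingNewline: false)`.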

SDGGiesbrecht · January 8, 2019, 8:36pm

You’re right. What a dumb thing to forget...

benrimmington · January 9, 2019, 2:49pm

At this point, it would need to be an overload of readLine with an extra required parameter. Or perhaps a different API which can stop on any given set of characters (i.e. not restricted to newlines). For example, see getdelim and getline.
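A sketch of the "stop on any given set" idea, loosely modeled on getdelim but working on an in-memory byte iterator so that it stays self-contained. `readBytes(from:stoppingAt:)` is a hypothetical name, not a proposed API.

```swift
// Hypothetical API sketch: consumes bytes up to and including the first
// delimiter byte. Returns nil only when the input is already exhausted.
func readBytes<I: IteratorProtocol>(
    from input: inout I,
    stoppingAt delimiters: Set<UInt8>
) -> [UInt8]? where I.Element == UInt8 {
    var result: [UInt8] = []
    while let byte = input.next() {
        result.append(byte)
        if delimiters.contains(byte) { return result }
    }
    return result.isEmpty ? nil : result
}

var input = Array("a,b;c".utf8).makeIterator()
let delimiters: Set<UInt8> = [UInt8(ascii: ","), UInt8(ascii: ";")]
while let chunk = readBytes(from: &input, stoppingAt: delimiters) {
    print(String(decoding: chunk, as: UTF8.self))
}
```

Like getdelim, the sketch keeps the delimiter in the returned chunk, so the caller can tell which one ended it.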

Swift on z/OS has extra APIs for codepage transcoding:

zOSSwift.readLine(strippingNewline:encodingFrom:) (but I don't know if it also converts NEL to LF).
various stdout and stderr wrappers conforming to TextOutputStream (for the to: parameter of the standard print API).

SDGGiesbrecht · January 9, 2019, 7:01pm

Yeah, that’s what I figured. The two strategies would be indistinguishable from the call site anyway, so it doesn’t really matter which way. Implementing it can easily wait and be done later (or never if the topic never comes up again).

Regarding the original questions:

Given the presence of strippingNewline, following the Unicode standard is probably the most reasonable default.

The “best” thing to do would be to make the default terminator LS, since that is why Unicode introduced it. It would confuse the whole world though, so the lofty ideal crashes and burns in the backwards‐compatibility department.

Michael_Ilseman · February 5, 2019, 1:32am

Changing readLine and print would change observable behavior, so it would need to go through SE.

There are a few options:

1. Keep existing behavior

Currently, we stop at a \n. If strippingNewline is true, then we drop any trailing \r\n or \n. This follows pretty standard intuition from using C, stdin and piping, and textual formats (e.g. CSV files). We're sort of doing a text-mode-lite on behalf of the user.

2. Use NLF, that is the platform's newline

This could regress functionality (and be weird). E.g., a CSV file is piped to stdin such that field\r\n becomes field\r after stripping.

3. Follow Unicode's recommendations for how a readLine function should operate

It could seem weird that certain byte patterns in the input, such as E2 80 A9, would count as a terminator in contrast to common intuition surrounding piping to stdin.

4. Follow Character's semantics

This would include full-grapheme-breaking including degenerate graphemes. I feel this would be an absurd direction to take and violate most users' intuition.
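To make the CSV concern concrete, here is a sketch of just the stripping rules for the first two options. Neither function is the standard library's implementation, and the second assumes a platform whose NLF is a lone LF.

```swift
// Existing behavior: drop a trailing "\r\n" or "\n".
func stripExisting(_ line: String) -> String {
    var scalars = line.unicodeScalars
    if scalars.last == "\n" {
        scalars.removeLast()
        if scalars.last == "\r" { scalars.removeLast() }
    }
    return String(scalars)
}

// Platform NLF only (assumed here to be LF): drop a trailing "\n".
func stripPlatformLF(_ line: String) -> String {
    var scalars = line.unicodeScalars
    if scalars.last == "\n" { scalars.removeLast() }
    return String(scalars)
}

print(stripExisting("field\r\n").unicodeScalars.count)   // 5
print(stripPlatformLF("field\r\n").unicodeScalars.count) // 6, the "\r" remains
```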

I'm weakly in favor of #1 for readLine and print. I think we should provide something to address more needs, such as a TextInputStream protocol, and that could include options for specifying a delimiter or even a (Character)->Bool closure. At that point, it might make sense to rename readLine to something that isn't spelled exactly like a function that Unicode has opinions about.
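One possible shape for such a protocol, purely as a sketch. `TextInputStream`, `read()`, `readLine(stoppingWhere:)`, and `StringInputStream` are all hypothetical; nothing like this exists in the standard library.

```swift
// Hypothetical protocol, named after the idea floated in this thread.
protocol TextInputStream {
    /// Returns the next Character, or nil at end of input.
    mutating func read() -> Character?
}

extension TextInputStream {
    /// Reads up to, and not including, the first Character for which
    /// `isDelimiter` returns true. Returns nil at end of input.
    mutating func readLine(stoppingWhere isDelimiter: (Character) -> Bool) -> String? {
        var line = ""
        var readAnything = false
        while let character = read() {
            readAnything = true
            if isDelimiter(character) { return line }
            line.append(character)
        }
        return readAnything ? line : nil
    }
}

// An in-memory stream, for demonstration only.
struct StringInputStream: TextInputStream {
    private var remaining: Substring
    init(_ text: String) { remaining = Substring(text) }
    mutating func read() -> Character? { return remaining.popFirst() }
}

var stream = StringInputStream("a,b\r\nc")
while let field = stream.readLine(stoppingWhere: { $0 == "," || $0.isNewline }) {
    print(field)
}
```

The closure form subsumes a fixed delimiter, at the cost of pulling Character (and therefore grapheme breaking) into the API.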

A change in semantics would break the input into lines in more places. It would be pretty annoying if developers now have to always check their newlines against a desired set.

SDGGiesbrecht · February 5, 2019, 1:57am

Given the following quotation from its documentation comment, I imagine a lot of other surprises would be encountered first.

/// Standard input is interpreted as `UTF-8`. Invalid bytes are replaced by
/// Unicode [replacement characters][rc].
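The same kind of repair can be observed with `String(decoding:as:)`. This is only an illustration of the documented behavior, not how readLine is implemented.

```swift
// 0xFF can never appear in well-formed UTF-8.
let bytes: [UInt8] = [0x66, 0xFF, 0x6F]
let repaired = String(decoding: bytes, as: UTF8.self)
print(repaired == "f\u{FFFD}o") // true
```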

benrimmington · February 5, 2019, 3:38am

I think it's fine to keep the existing behavior. I'll close the pull request, but can SR-1280 also be closed?
If strippingNewline is true, "field\r\n" would become "field" on all platforms. The recommendations in §5.8 Newline Guidelines are:

Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpretation. Only on output is it necessary to distinguish between them.
I don't know how useful or harmful it might be to recognize E2 80 A9 (U+2029: PARAGRAPH SEPARATOR) as a newline.
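For reference, a quick check that this byte sequence decodes to U+2029 and that Character already classifies it as a newline:

```swift
let bytes: [UInt8] = [0xE2, 0x80, 0xA9]
let decoded = String(decoding: bytes, as: UTF8.self)
print(decoded.unicodeScalars.first?.value == 0x2029) // true
print(decoded.first?.isNewline == true)              // true
```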

The existing readLine(strippingNewline:) has to return nil, regardless of the end-of-file or error indicators (feof versus ferror).
It's unclear to me if getline(_:_:_:) can fail with EINTR (interrupted system call).

Michael_Ilseman · February 5, 2019, 7:05pm

SDGGiesbrecht:

Given the following quotation from its documentation comment, I imagine a lot of other surprises would be encountered first.
/// Standard input is interpreted as `UTF-8`. Invalid bytes are replaced by
/// Unicode [replacement characters][rc].

Hah, good point!

Probably doesn't matter for PS. But, if a user has VT in their input, they might expect that to be preserved. I don't think we have to worry too much about these corner cases, just trying to figure out at what level we want our semantics to operate.

Good point

As I said, it was a weakly held opinion and you both brought up good points :-). I was thinking of readLine as being a dual of print, doing a light "text-mode". I could see us changing it to be a basic Unicode rich version. I wouldn't want to throw Character into the mix right now, grapheme breaking is weird and always changing (e.g. there have been drafts proposing all contiguous whitespace be treated as a single grapheme).
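A small illustration of how Character-level semantics differ from scalar-level ones: grapheme breaking already folds CRLF into a single Character, and Character.isNewline also accepts VT.

```swift
let crlf = "\r\n"
print(crlf.count)                             // 1
print(crlf.unicodeScalars.count)              // 2
print(crlf.first?.isNewline == true)          // true
print(Character("\u{0B}").isNewline)          // true (VT)
```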

If you want to push for this small behavior change, it should be discussed in some capacity on SE. It might classify as a minor adjustment or behavior-changing bug fix, so maybe not a full proposal. Closing is also fine.

WDYT?

SDGGiesbrecht · February 5, 2019, 7:33pm

I have no strong opinions either. I was only considering the implications in the first place because @benrimmington asked for feedback. It is too low on my priority list for me to carve out the time to guide it through SE.

Maybe point Han Sangjin to this thread? (He’s the one who opened SR‐1280.) Ask him how important he thinks it is and if he wants to take it to SE. If he does not, close SR‐1280 saying some discussion has already taken place, link here, and say that it will require going through Swift Evolution and has been deferred for now. That way if it comes up again, it will be easy for someone to find this and pick it up where we left off.

benrimmington · February 5, 2019, 10:55pm

@SDGGiesbrecht Thanks for your feedback. I've closed apple/swift#21586, but SR-1280 and the original issue (rdar://problem/20013999) can remain open.

benrimmington · February 5, 2019, 11:22pm

@Michael_Ilseman Recognizing only LF and CRLF newlines is no worse than the UTF-8 requirement. Text files from other platforms (e.g. classic Mac OS) will need to be converted in any case. Swift on z/OS already has its own readLine and print APIs for dealing with EBCDIC codepages (and presumably NEL newlines).