Implementing String(contentsOf:usedEncoding:)

Hello,

I am trying to contribute to open source software for the first time and
saw that in NSString.swift, a convenience initializer was not implemented
so I decided to work on it.

The initializer is:
public convenience init(contentsOf url: URL, usedEncoding enc: UnsafeMutablePointer<UInt>?) throws

I don't understand why we need the usedEncoding parameter. I understand
that it's a pointer, but how do we decide what encoding to use? Do we
default to NSUTF8StringEncoding?

According to documentation:
Upon return, if url is read successfully, contains the encoding used to
interpret the data. For possible values, see NSStringEncoding.

My understanding is that the contents of the URL would not be readable
without knowing which encoding to apply?
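
For reference, the parameter is an out-parameter: the caller does not pass an encoding in, it receives the one the initializer settled on. A minimal sketch of the calling side, assuming the String overlay of this initializer (which takes an inout String.Encoding rather than a raw pointer); the wrapper name readTextGuessingEncoding is hypothetical:

```swift
import Foundation

/// Hypothetical wrapper, for illustration only.
/// Reads a text file and reports which encoding Foundation decided to use.
func readTextGuessingEncoding(from url: URL) throws -> (text: String, encoding: String.Encoding) {
    // The initial value is only a placeholder; on success the initializer
    // overwrites it with the encoding it used to interpret the data.
    var detected = String.Encoding.utf8
    let text = try String(contentsOf: url, usedEncoding: &detected)
    return (text, detected)
}
```

The caller can then inspect the returned encoding, for example to write the file back out in the same encoding it was read in.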

What do you guys think?

-Mohit

The original implementation in Foundation uses some heuristics to try to guess the encoding, since there are unfortunately billions of plain text files out there that don’t explicitly state their encoding. It’s not open source, so we can’t know for sure [except for the people who work at Apple], but I’m sure it includes things like:

- Look for a Unicode BOM at the start, in which case it’s probably UTF-16 (or maybe UTF-32? I don’t know the details.)
- If not, see whether all bytes are 0x00-0x7F ⟶ in that case use ASCII
- If not, does it contain any byte sequences that are illegal in UTF-8? ⟶ If not, use UTF-8
- Otherwise, does it contain any bytes in the range 0x80-0xBF?
  ⟶ If not, ISO-8859-1 (aka ISO-Latin-1) is a good guess
  ⟶ If so, CP-1252 (aka WinLatin1) is a good guess; it’s a nonstandard but very common superset of ISO-8859-1 with extra characters in that byte range
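
The rules listed above can be sketched roughly as follows. This is only an illustration, not Foundation's actual algorithm; the function name guessEncoding is hypothetical. Two assumptions go beyond the list: the sketch also recognizes the UTF-8 BOM (EF BB BF), and the last rule tests the range 0x80-0x9F, which is where CP-1252 defines printable characters and ISO-8859-1 has only C1 control codes.

```swift
import Foundation

/// Hypothetical helper: guesses a text encoding from raw bytes.
/// A sketch of the heuristics in the list, not Foundation's real logic.
func guessEncoding(of bytes: [UInt8]) -> String.Encoding {
    // 1. Byte-order marks. UTF-32 must be tested before UTF-16, because the
    //    UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    //    (FF FE 00 00 could also be UTF-16 LE text starting with U+0000;
    //    this sketch assumes UTF-32 in that case.)
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) || bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) {
        return .utf32
    }
    if bytes.starts(with: [0xFE, 0xFF]) || bytes.starts(with: [0xFF, 0xFE]) {
        return .utf16
    }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) {
        return .utf8
    }

    // 2. Only 7-bit bytes: plain ASCII.
    if bytes.allSatisfy({ $0 < 0x80 }) {
        return .ascii
    }

    // 3. No illegal UTF-8 sequences: String(bytes:encoding:) returns nil
    //    when the bytes cannot be decoded in the requested encoding.
    if String(bytes: bytes, encoding: .utf8) != nil {
        return .utf8
    }

    // 4. Bytes in 0x80-0x9F suggest CP-1252; otherwise guess ISO-8859-1.
    if bytes.contains(where: { $0 >= 0x80 && $0 <= 0x9F }) {
        return .windowsCP1252
    }
    return .isoLatin1
}
```

A caller would load the file's bytes first, call guessEncoding(of:), and then decode with String(bytes:encoding:) using the result. The guess can still be wrong, since many legacy encodings share the same byte ranges.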

There are likely other heuristics too. It used to be important to detect the old MacRoman encoding used in pre-OS X apps, but it’s been long enough that there shouldn’t be many docs like that in the wild anymore. There are multibyte non-Unicode encodings that used to be very common in non-Roman languages, like Shift-JIS, but I have no idea how to detect them or if they’re even still relevant.

It could also be useful to check whether the start of the file looks like XML or HTML, and if so, parse it enough to find where it specifies its encoding. (Are there other text formats that include encodings? I’ve seen special markings at the top of source files used for emacs or vi, specifying tab widths and such, but I don’t know if those can specify encodings too.)
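
As a rough illustration of the XML case, the sketch below pulls the encoding name out of an XML declaration such as <?xml version="1.0" encoding="ISO-8859-1"?>. The function name declaredXMLEncodingName is hypothetical, and the sketch makes simplifying assumptions: the file is in an ASCII-compatible encoding, it has no BOM in front of the declaration, and there is no whitespace around the "=" sign (which XML permits).

```swift
import Foundation

/// Hypothetical helper: returns the encoding name declared in a leading
/// XML declaration, or nil if none is found. A sketch, not a full parser.
func declaredXMLEncodingName(in bytes: [UInt8]) -> String? {
    // Only the start of the file matters. Decoding as UTF-8 with repair is
    // enough here, because the declaration itself is plain ASCII.
    let head = String(decoding: bytes.prefix(200), as: UTF8.self)

    // The declaration must be the very first thing in the document.
    guard head.hasPrefix("<?xml"), let declEnd = head.range(of: "?>") else {
        return nil
    }
    let declaration = head[..<declEnd.lowerBound]

    // Find encoding="..." or encoding='...' inside the declaration.
    guard let key = declaration.range(of: "encoding=") else { return nil }
    let rest = declaration[key.upperBound...]
    guard let quote = rest.first, quote == "\"" || quote == "'" else { return nil }

    let value = rest.dropFirst()
    guard let closing = value.firstIndex(of: quote) else { return nil }
    return String(value[..<closing])
}
```

The returned name (for example "ISO-8859-1") would still have to be mapped to a String.Encoding value before decoding the rest of the file.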

I’m not involved in Swift development, but IMHO a basic implementation that just uses the rules I sketched above would be pretty useful, and then people with more domain knowledge could enhance that code to add more heuristics later on.

—Jens

···

On Feb 22, 2017, at 6:05 PM, Mohit Athwani via swift-users <swift-users@swift.org> wrote:

I don't understand why we need the usedEncoding parameter? I understand that it's a pointer but how do we decide what encoding to use? Do we default to NSUTF8StringEncoding?

Hey Jens,

Thanks so much! This is really useful! I'm going to get started on this.

Cheers!
Mohit

Oops, that should be “0x80–0x9F”.

—Jens

···

On Feb 22, 2017, at 9:09 PM, Jens Alfke via swift-users <swift-users@swift.org> wrote:

- Otherwise, does it contain any bytes in the range 0x80-0xBF?