String(cString:, encoding:) with non 1-byte encodings should trigger a warning somewhere

Hi, i've just been bitten by a bug in my code trying to understand :

var bs: [UInt8] = [0, 84, 0, 104]
String(bytes: bs, encoding: .utf16BigEndian)
// => return Th
var cs: [Int8] = [0, 84, 0, 104]
String(cString: &cs, encoding: .utf16BigEndian)
// => return empty string

I think there should be a warning (assert or anything else) for using the cString initializer for encodings where "0" doesn't mean "end of string". Such as utf16 and more.

1 Like

Background

This comes from NSString via the Foundation overlay. The standard library has alternatives with generic constraints which ensure the byte-width is appropriate for encoding here.

There is a lot of overlap and redundancy between the initializers added in the Foundation overlay and those directly from the standard library. We should really clean this up for Swift 5 (related thread). CC @lancep, @itaiferber.

@benjamin.g, could you file a bug at bugs.swift.org to help track this?

Potential solutions / alternatives

For the following String init:

init?(cString: UnsafePointer<CChar>, encoding enc: String.Encoding)

Using UnsafePointer<CChar> as the argument type is misleading. At the very least, if this is meant as a lowest-level interface, should it be UnsafeRawPointer? CC @Andrew_Trick for thoughts.

For:

init?<S : Sequence where S.Iterator.Element == UInt8>(bytes: S, encoding: String.Encoding)

The intent is to create a String from raw bytes, i.e. data. Should this initializer be considered redundant with an initializer taking Data and/or with init(decoding:as:)? The latter has a generic constraint that the element type is equivalent to the encoding's code unit type.

edit:

CC @johannesweiss. It might make sense to tackle initializer cleanup as part of a move to an early-validation model. We also might want to consider how we want to fit in performance-flags testing as well.

2 Likes

https://bugs.swift.org/browse/SR-8355

1 Like

@Michael_Ilseman thanks for the pointers. Since you mention cleaning the String constructor, i can only fully agree with that. I've tried to use the "String;decode" functions to decode from a buffer, but couldn't get anything to compile ( i tried various types of pointer combined with the mutable / non mutable variant until i thought i was obviously going the wrong direction).

Also, as a remark, there should be String constructors to interop with what the C-based Core* frameworks return. As an example CGPDFDictionaryApplierFunction returns a UnsafePointer:<:Int8:>: for the key.

Edit : i just realized the reason i couldn't get anything to compile with the "decoding" functions wasn't in my pointer type, but in the encoding parameter. I used String.Encoding.utf8 instead of UTF8.self . This should also probably be addressed in some way.
Edit(2): nope, still no luck. Errors like "Cannot convert value of type 'UnsafePointer:<:Int8:>:' to expected argument type 'UnsafePointer<_>?' don't provide much value... I'm just giving up at this point.
Edit(3): i suppose the problem comes from the CGPDFDictionaryApplierFunction using Int8 pointers instead of UInt8 pointers. So, back to my original remark about working together with the Core* frameworks.

I'm leaving this list of edits as a trace for those who wonders what's the train of thoughts for someone discovering this part of the API.

init?(cString: UnsafePointer<CChar>, encoding enc: String.Encoding)

It would be nice if that initializer could just go away in favor of String(decoding:as:). If you really need to decode null-terminated, UTF-16, then I would use a Sequence that looks for the "null" bytes rather than call it a "CString". If you must call it a CString, then yes, UnsafeRawPointer would be a better choice for the type.

init?<S : Sequence where S.Iterator.Element == UInt8>(bytes: S, encoding: String.Encoding)

This one also seems problematic. I frankly don't know what it means. It expects to be null terminated but "CString" is not in its name? It both has a length and needs to be null terminated? The encoding does not need to match the element type?

Clearly, we want String(decoding:as:) to be used in as many cases as possible. But you may still want a "lowest level" initializer for decoding UTF-16 from a raw byte buffer. Can't that just be a sized buffer?

String(bytes: unsafeRawBufferPointer, encoding: ...)

It seems like a low change of it being accidentally called instead of (decode:as:).

1 Like