Trial balloon: Ensure that String always contains valid Unicode


(Paul Cantrell) #1

I was quite surprised to learn that it’s possible to create Swift strings that do not contain things other than valid Unicode characters. Is it feasible to guarantee that this cannot happen?

String.init(bytes:encoding:) is failable, and does in fact validate that the given bytes are decodable with the given encoding in most circumstances:

    // Returns nil
    String(
        bytes: [0xD8, 0x00] as [UInt8],
        encoding: NSUTF8StringEncoding)

However, that initializer does not reject invalid surrogate characters in UTF-16:

    // Succeeds (wat?!)
    let bogusStr = String(
        bytes: [0xD8, 0x00] as [UInt8],
        encoding: NSUTF16BigEndianStringEncoding)!
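
For comparison, here is a minimal sketch that runs the same two bytes through the standard library's own `UnicodeCodec` decoders instead of Foundation. The helper name `isWellFormed` is hypothetical (it is not an API discussed in this thread), and the sketch assumes a current Swift toolchain. A strict decoder treats the input as ill-formed under both encodings.

```swift
/// Hypothetical helper: true if `units` decode without error under `Codec`.
func isWellFormed<Codec: UnicodeCodec>(
    _ units: [Codec.CodeUnit], as _: Codec.Type
) -> Bool {
    var decoder = Codec()
    var iterator = units.makeIterator()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue: continue      // decoded one scalar, keep going
        case .emptyInput:  return true   // reached the end cleanly
        case .error:       return false  // ill-formed sequence found
        }
    }
}

// The two bytes from the examples above.
let bytes: [UInt8] = [0xD8, 0x00]

// As UTF-8: 0xD8 opens a two-byte sequence, but 0x00 is not a continuation byte.
let validAsUTF8 = isWellFormed(bytes, as: UTF8.self)

// As big-endian UTF-16: the bytes form the single code unit 0xD800,
// a lead surrogate with no trail surrogate after it.
let unit = (UInt16(bytes[0]) << 8) | UInt16(bytes[1])
let validAsUTF16 = isWellFormed([unit], as: UTF16.self)

print(validAsUTF8, validAsUTF16)
```

The check is a single O(n) pass over the code units, which is the cost the second question below is asking about.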

Ever wonder why dataWithJSONObject(…) is declared “throws?” Now you know!

    // Throws an error
    try! NSJSONSerialization.dataWithJSONObject(
        ["foo": bogusStr], options: [])

And why does the URL escaping method in Foundation return an optional even though it escapes the string using UTF-8, which is a complete Unicode encoding? Same reason:

    // Returns nil
    bogusStr.stringByAddingPercentEncodingWithAllowedCharacters(
        NSCharacterSet.alphanumericCharacterSet())

AFAIK, the first method could lose its “throws” modifier and the second method would not need to return an optional if only String itself guaranteed that it would always contain valid Unicode. There are likely other APIs that would see similar benefits.

Are there downsides to making all String initializers guarantee that the Strings always contain valid Unicode? I can think of two possibilities:

  • Is there some circumstance where you actually want a String to contain unpaired UTF-16 surrogate characters? I can’t imagine what that would be, but perhaps someone else can.
  • Is it important to ensure that String.init(…) is O(1) when it uses UTF-16? This seems thin: I assume that the library has to copy the raw bytes regardless, and it’s O(n) for other character encodings, so…?

Cheers,

Paul


(Paul Cantrell) #2

Er, typo in the first sentence! I meant to say:

I was quite surprised to learn that it’s possible to create Swift strings that contain things other than valid Unicode characters.

···

On Dec 18, 2015, at 3:47 PM, Paul Cantrell via swift-evolution <swift-evolution@swift.org> wrote:

I was quite surprised to learn that it’s possible to create Swift strings that do not contain things other than valid Unicode characters. Is it feasible to guarantee that this cannot happen?

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Guillaume Lessard) #3

That would be nice. I’m not sure that CFString or NSString guarantee correctness, though. If they don’t, then this could not be.

Guillaume Lessard

···

On Dec 18, 2015, at 14:47, Paul Cantrell via swift-evolution <swift-evolution@swift.org> wrote:

I was quite surprised to learn that it’s possible to create Swift strings that do not contain things other than valid Unicode characters. Is it feasible to guarantee that this cannot happen?


(Dmitri Gribenko) #4

Adding this would be a useful guarantee, I support this. The current
behavior looks inconsistent to me. OTOH, the current behavior of
String(bytes:encoding:) mirrors the behavior of the NSString method, so
this would create inconsistency. But I think the extra guarantee is worth
it.

Tony, what do you think?

Dmitri

···

On Fri, Dec 18, 2015 at 1:47 PM, Paul Cantrell via swift-evolution < swift-evolution@swift.org> wrote:

I was quite surprised to learn that it’s possible to create Swift strings
that do not contain things other than valid Unicode characters. Is it
feasible to guarantee that this cannot happen?

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com>*/


(Lily Ballard) #5

I agree in principle that it would be great if String could enforce that
it's always valid.

But unfortunately, in practice, there's no way to do that without making
it expensive to bridge from Obj-C. Because, as you've demonstrated, you
can create NSStrings that contain things that aren't actually valid
Unicode sequences, every single bridge from an NSString to a String
would have to be checked for validity. Not only that, but it's not clear
what the behavior would be if an invalid string is found, since these
bridges are unconditional - would Swift panic? Would it silently replace
the invalid sequence with U+FFFD? Or something else entirely? But the
question doesn't really matter, because turning these bridges from O(1)
into O(N) would be an unacceptable performance penalty anyway.
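
To make two of those options concrete, here is a hedged sketch for UTF-16 code units that may contain a lone surrogate. `BridgePolicy` and `bridge(_:policy:)` are hypothetical names, not part of any real bridging API. The only library behavior relied on is that `String(decoding:as:)`, available in later Swift versions, repairs ill-formed input with U+FFFD. Both policies touch every code unit, which is exactly the O(N) cost described above.

```swift
// Hypothetical policies for handling ill-formed UTF-16 at a bridge.
enum BridgePolicy { case repair, fail }

func bridge(_ units: [UInt16], policy: BridgePolicy) -> String? {
    // The standard library replaces each ill-formed sequence with U+FFFD.
    let repaired = String(decoding: units, as: UTF16.self)
    switch policy {
    case .repair:
        return repaired
    case .fail:
        // If repair changed any code unit, the input was ill-formed.
        return repaired.utf16.elementsEqual(units) ? repaired : nil
    }
}

// "A", a lone lead surrogate, "B"
let suspect: [UInt16] = [0x0041, 0xD800, 0x0042]
let repaired = bridge(suspect, policy: .repair)
let rejected = bridge(suspect, policy: .fail)
```

The third option, trapping, would be a `precondition` in place of the `nil` return.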

-Kevin Ballard

···

On Fri, Dec 18, 2015, at 01:47 PM, Paul Cantrell via swift-evolution wrote:

I was quite surprised to learn that it’s possible to create Swift
strings that do not contain things other than valid Unicode
characters. Is it feasible to guarantee that this cannot happen?


(Javier Soto) #6

Right, I totally agree with you, but NSString compatibility would be
problematic then:

    let s: NSString
    let s2 = s as String // this would have to be a failable cast

···

On Sat, Dec 19, 2015 at 9:57 AM Guillaume Lessard via swift-evolution < swift-evolution@swift.org> wrote:

That would be nice. I’m not sure that CFString or NSString guarantee
correctness, though. If they don’t, then this could not be.

--
Javier Soto


(Tony Parker) #7

Adding this would be a useful guarantee, I support this. The current behavior looks inconsistent to me. OTOH, the current behavior of String(bytes:encoding:) mirrors the behavior of the NSString method, so this would create inconsistency. But I think the extra guarantee is worth it.

Tony, what do you think?

NSString deals with this issue more on the ‘get’ side. For example, CFStringGetBytes has a ‘lossByte’ for use in replacement when the requested encoding cannot represent something stored by the receiver string. Also, the abstract NSString interface can be extended to add additional encodings (which is why the string encoding values are not an enumeration).
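
As a rough illustration of handling the problem on the "get" side, the sketch below converts a string to ASCII bytes and substitutes a caller-supplied loss byte for anything ASCII cannot represent. This is only a pure-Swift analogy: `asciiBytes(of:lossByte:)` is a hypothetical helper, and it does not reproduce the signature or full behavior of CFStringGetBytes.

```swift
/// Hypothetical helper: ASCII bytes for `string`, with `lossByte` standing in
/// for every scalar that ASCII cannot represent.
func asciiBytes(of string: String, lossByte: UInt8) -> [UInt8] {
    return string.unicodeScalars.map { scalar in
        scalar.isASCII ? UInt8(scalar.value) : lossByte
    }
}

// U+00EF (i with diaeresis) is not ASCII, so it becomes "?".
let lossy = asciiBytes(of: "na\u{EF}ve", lossByte: UInt8(ascii: "?"))
print(String(decoding: lossy, as: UTF8.self))
```

With this style of API the stored string is never rejected; the caller decides at read time how lossy the result may be.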

- Tony

···

On Dec 19, 2015, at 7:59 PM, Dmitri Gribenko <gribozavr@gmail.com> wrote:
On Fri, Dec 18, 2015 at 1:47 PM, Paul Cantrell via swift-evolution <swift-evolution@swift.org> wrote:


(Dmitri Gribenko) #8

Currently String replaces invalid sequences with U+FFFD lazily during
access, but there are corner cases related to Objective-C bridging that can
still leak invalid Unicode.
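
The idea of repairing lazily during access can be sketched as follows. This is not String's actual implementation; `LazilyRepairedScalars` is a hypothetical type that leaves the stored code units untouched and substitutes U+FFFD only as scalars are read, using the standard library's `UTF16` decoder.

```swift
/// Hypothetical sequence: decodes UTF-16 on demand and yields U+FFFD
/// wherever the decoder reports an ill-formed sequence.
struct LazilyRepairedScalars: Sequence, IteratorProtocol {
    private var decoder = UTF16()
    private var units: IndexingIterator<[UInt16]>

    init(_ units: [UInt16]) {
        self.units = units.makeIterator()
    }

    mutating func next() -> Unicode.Scalar? {
        switch decoder.decode(&units) {
        case .scalarValue(let scalar): return scalar
        case .emptyInput:              return nil
        case .error:                   return "\u{FFFD}" as Unicode.Scalar
        }
    }
}

// Storage keeps the lone surrogate; readers never see it.
let stored: [UInt16] = [0x0041, 0xD800, 0x0042]
let seen = LazilyRepairedScalars(stored).map { $0.value }
```

Any code path that hands out `stored` directly, without going through the decoding view, would still expose the lone surrogate, which is the kind of leak described above.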

Dmitri

···

On Mon, Jan 4, 2016 at 9:37 PM, Kevin Ballard via swift-evolution < swift-evolution@swift.org> wrote:

I agree in principle that it would be great if String could enforce that
it's always valid.

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com>*/


(Kenny Leung) #9

Could we not push NSString to adopt this behavior? I think it should be fair to push for changes in the other direction instead of having Swift be a slave to bridging.

-Kenny

···

On Jan 4, 2016, at 11:37 AM, Kevin Ballard via swift-evolution <swift-evolution@swift.org> wrote:

I agree in principle that it would be great if String could enforce that it's always valid.


(Lily Ballard) #10

That kind of lazy checking of arrays is used pretty rarely (since, as
you say, it only occurs with an `as!` expression). But doing lazy
checking of strings would end up having to check *every* string that
comes from ObjC (which, in a Swift app that uses Cocoa frameworks, is
likely to be most strings the app works with).

-Kevin Ballard

···

On Mon, Jan 4, 2016, at 02:41 PM, Félix Cloutier wrote:

There are precedents for lazily checking for validity after bridging.
Using `array as! [T]` on a NSArray without generics fails lazily if
you access an object that's not a T.

Félix


(Lily Ballard) #11

Every single method you implement that takes a `String` parameter and is
either exposed to Obj-C or is overriding an Obj-C declaration will have
to check the String parameter every single time the function is called.

Every time you call an Obj-C method that returns a String, you'll have
to check that String result.

Basically, any time a String object is backed by an NSString, which is
going to be very common in most apps, that backing NSString will have to
be checked.

-Kevin Ballard

···

On Mon, Jan 4, 2016, at 03:08 PM, Paul Cantrell wrote:

But doing lazy checking of strings would end up having to check
*every* string that comes from ObjC

I don’t think that’s necessarily true. There’s a limited set of places
where invalid Unicode can creep into an NSString, and so the lazy
check could probably bypass quite a few common cases — an ASCII string
for example. Without digging into it, I suspect any NSString created
from UTF-8 data can be safely bridged, since unpaired surrogate chars
can’t make it through UTF-8.


(Lily Ballard) #12

But doing lazy checking of strings would end up having to check
*every* string that comes from ObjC

I don’t think that’s necessarily true. There’s a limited set of
places where invalid Unicode can creep into an NSString, and so the
lazy check could probably bypass quite a few common cases — an ASCII
string for example. Without digging into it, I suspect any NSString
created from UTF-8 data can be safely bridged, since unpaired
surrogate chars can’t make it through UTF-8.

Every single method you implement that takes a `String` parameter and
is either exposed to Obj-C or is overriding an Obj-C declaration will
have to check the String parameter every single time the function is
called.

Every time you call an Obj-C method that returns a String, you'll
have to check that String result.

Not necessarily. While it’s true that an NSString is represented as
UTF-16 internally (right?), there’s a limited set of operations that
can introduce invalid Unicode. In theory, at least, an NSString could
keep a flag that tracks whether it could potentially be invalid.

This is much better than the doomsday scenario you lay out in two
respects:

(1) That flag would start out false in many common situations
    (including NSStrings decoded from UTF-8, Latin-1, and ASCII), and
    could stay false with O(1) effort for substring operations. My
    guess is that this covers the vast majority of strings floating
    around in a typical app.

(2) Once a string is verified, the flag can be flipped true. No need
    to keep revalidating. Yes, there are threading concerns with that,
    but I trust the team that made the dark magic of Swift’s weak work
    may have some bright ideas on this.

The bottom line is that not every NSString → String bridge needs to be
O(n). At least in theory. Someone with more intimate knowledge of
NSString can correct me if I’m wrong.

I thought it was a given that we can't modify NSString. If we can modify
it, all bets are off; heck, if we can modify it, why not just make
NSString reject invalid sequences to begin with?

Besides the fact that NSString is provided by the OS instead of the
Swift stdlib, relying on a modification to NSString also means that the
logic will only work on a new version of the OS that contains the
modified NSString.

Basically, any time a String object is backed by an NSString, which
is going to be very common in most apps, that backing NSString will
have to be checked.

Keep in mind that we’re *already* incurring that O(n) expense right
now for every Swift operation that turns an NSString-backed string
into characters — that plus the API burden of having that check
deferred, which is what originally motivated this thread.

That's true for native Strings as well. The native String storage is
actually a sequence of UTF-16 code units, it's not a sequence of
characters. Any time you iterate over the CharacterView, it has to
calculate the grapheme cluster boundaries. But that's ok, because unless
you call `count` on it, you're typically doing an O(N) operation
_anyway_. But there's plenty of things you can do with strings that
don't require iterating over the CharacterView.
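
A small illustration of the difference between characters and code units, written against current Swift, where `String` is itself the collection of characters that this thread calls the CharacterView. Counting characters requires finding grapheme cluster boundaries, while the code unit views are counted directly.

```swift
let accented = "e\u{301}"            // "e" followed by a combining acute accent
let flag = "\u{1F1E8}\u{1F1E6}"      // two regional indicators forming one flag

// Characters (grapheme clusters), Unicode scalars, UTF-16 code units.
print(accented.count, accented.unicodeScalars.count, accented.utf16.count)
print(flag.count, flag.unicodeScalars.count, flag.utf16.count)
```

Each string is a single character, yet it occupies two scalars, and the flag takes four UTF-16 code units because each of its scalars needs a surrogate pair.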

-Kevin Ballard

···

On Mon, Jan 4, 2016, at 03:22 PM, Paul Cantrell wrote:

On Jan 4, 2016, at 5:11 PM, Kevin Ballard <kevin@sb.org> wrote:
On Mon, Jan 4, 2016, at 03:08 PM, Paul Cantrell wrote:


(Paul Cantrell) #13

But doing lazy checking of strings would end up having to check every string that comes from ObjC

I don’t think that’s necessarily true. There’s a limited set of places where invalid Unicode can creep into an NSString, and so the lazy check could probably bypass quite a few common cases — an ASCII string for example. Without digging into it, I suspect any NSString created from UTF-8 data can be safely bridged, since unpaired surrogate chars can’t make it through UTF-8.

Cheers, P

···

On Jan 4, 2016, at 4:43 PM, Kevin Ballard via swift-evolution <swift-evolution@swift.org> wrote:

That kind of lazy checking of arrays is used pretty rarely (since, as you say, it only occurs with an `as!` expression). But doing lazy checking of strings would end up having to check every string that comes from ObjC (which, in a Swift app that uses Cocoa frameworks, is likely to be most strings the app works with).

-Kevin Ballard


(Paul Cantrell) #14

But doing lazy checking of strings would end up having to check every string that comes from ObjC

I don’t think that’s necessarily true. There’s a limited set of places where invalid Unicode can creep into an NSString, and so the lazy check could probably bypass quite a few common cases — an ASCII string for example. Without digging into it, I suspect any NSString created from UTF-8 data can be safely bridged, since unpaired surrogate chars can’t make it through UTF-8.

Every single method you implement that takes a `String` parameter and is either exposed to Obj-C or is overriding an Obj-C declaration will have to check the String parameter every single time the function is called.

Every time you call an Obj-C method that returns a String, you'll have to check that String result.

Not necessarily. While it’s true that an NSString is represented as UTF-16 internally (right?), there’s a limited set of operations that can introduce invalid Unicode. In theory, at least, an NSString could keep a flag that tracks whether it could potentially be invalid.

This is much better than the doomsday scenario you lay out in two respects:

(1) That flag would start out false in many common situations (including NSStrings decoded from UTF-8, Latin-1, and ASCII), and could stay false with O(1) effort for substring operations. My guess is that this covers the vast majority of strings floating around in a typical app.

(2) Once a string is verified, the flag can be flipped true. No need to keep revalidating. Yes, there are threading concerns with that, but I trust the team that made the dark magic of Swift’s weak work may have some bright ideas on this.

The bottom line is that not every NSString → String bridge needs to be O(n). At least in theory. Someone with more intimate knowledge of NSString can correct me if I’m wrong.
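
A minimal sketch of the flag idea, under the assumption that the string type can be modified at all. `TrackedUTF16` and its members are hypothetical and say nothing about how NSString actually works. Strings built from an encoding that cannot produce surrogates start out trusted, and an untrusted string pays for one O(n) scan whose success is then cached.

```swift
/// Hypothetical UTF-16 storage that remembers whether it is known to be valid.
struct TrackedUTF16 {
    let units: [UInt16]
    private(set) var knownWellFormed: Bool

    /// Latin-1 bytes map to U+0000...U+00FF, so no surrogate can appear.
    init(latin1 bytes: [UInt8]) {
        units = bytes.map { UInt16($0) }
        knownWellFormed = true
    }

    /// Raw UTF-16 code units are untrusted until scanned.
    init(untrusted units: [UInt16]) {
        self.units = units
        knownWellFormed = false
    }

    /// O(1) when the flag is already set; otherwise one O(n) scan
    /// for unpaired surrogates, with a successful result cached.
    mutating func validate() -> Bool {
        if knownWellFormed { return true }
        var i = 0
        while i < units.count {
            if UTF16.isLeadSurrogate(units[i]) {
                guard i + 1 < units.count,
                      UTF16.isTrailSurrogate(units[i + 1]) else { return false }
                i += 2
            } else if UTF16.isTrailSurrogate(units[i]) {
                return false
            } else {
                i += 1
            }
        }
        knownWellFormed = true
        return true
    }
}

var fromLatin1 = TrackedUTF16(latin1: [0x63, 0x61, 0x66, 0xE9])  // "café"
var bogus = TrackedUTF16(untrusted: [0xD800])                     // lone lead surrogate
var emoji = TrackedUTF16(untrusted: [0xD83D, 0xDE00])             // valid surrogate pair
```

The sketch ignores the threading concerns mentioned in (2); a shared, reference-counted string would need the flag update to be safe under concurrent readers.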

Basically, any time a String object is backed by an NSString, which is going to be very common in most apps, that backing NSString will have to be checked.

Keep in mind that we’re already incurring that O(n) expense right now for every Swift operation that turns an NSString-backed string into characters — that plus the API burden of having that check deferred, which is what originally motivated this thread.

Cheers,

Paul

···

On Jan 4, 2016, at 5:11 PM, Kevin Ballard <kevin@sb.org> wrote:
On Mon, Jan 4, 2016, at 03:08 PM, Paul Cantrell wrote:


(Félix Cloutier) #15

There are precedents for lazily checking for validity after bridging. Using `array as! [T]` on a NSArray without generics fails lazily if you access an object that's not a T.

Félix

···

On Jan 4, 2016, at 14:59:47, Dmitri Gribenko via swift-evolution <swift-evolution@swift.org> wrote:

Currently String replaces invalid sequences with U+FFFD lazily during access, but there are corner cases related to Objective-C bridging that can still leak invalid Unicode.


(Paul Cantrell) #16

The bottom line is that not every NSString → String bridge needs to be O(n). At least in theory. Someone with more intimate knowledge of NSString can correct me if I’m wrong.

I thought it was a given that we can't modify NSString. If we can modify it, all bets are off; heck, if we can modify it, why not just make NSString reject invalid sequences to begin with?

Good question. And if we can’t modify NSString, then yes, we’re up against a tough problem.

But should NSString legacy constraints really compromise the design of Swift’s native String type?

Félix and Dmitri’s comments suggest that there are ways to prevent that, and that there’s precedent for placing any distasteful behavior necessary for compatibility in the bridging, not in the core type.

Keep in mind that we’re already incurring that O(n) expense right now for every Swift operation that turns an NSString-backed string into characters — that plus the API burden of having that check deferred, which is what originally motivated this thread.

That's true for native Strings as well. The native String storage is actually a sequence of UTF-16 code units, it's not a sequence of characters. Any time you iterate over the CharacterView, it has to calculate the grapheme cluster boundaries.

Aren’t Swift strings encoded as UTF-8, or at least designed to behave as if they are, however they might be stored under the hood?

https://github.com/apple/swift/blob/master/docs/StringDesign.rst#strings-are-encoded-as-utf-8
https://github.com/apple/swift/blob/master/docs/StringDesign.rst#how-would-you-design-it

Given the warning at the top about this having been a planning document, I see that this may no longer be true. But at least the original design rationale strongly suggests that String’s failable initializers should fail when given invalid Unicode.

But that's ok, because unless you call `count` on it, you're typically doing an O(N) operation _anyway_. But there's plenty of things you can do with strings that don't require iterating over the CharacterView.

Indeed, but per my earlier message, those things could all still be O(1) except in the case when you’re transcoding a string from something other than ASCII or UTF-8 — and those transcoding cases are O(n) already. That certainly seems like a better design for the core lib.

Really hoping a core team member can weigh in on this….

Cheers,

Paul

···

On Jan 4, 2016, at 5:39 PM, Kevin Ballard <kevin@sb.org> wrote:
On Mon, Jan 4, 2016, at 03:22 PM, Paul Cantrell wrote: