Int indexing into UTF16View


(David Hart) #1

Hello,

When working with Strings which are known to be ASCII, I tend to use the UTF16View for the performance of random access. I would also like to have the convenience of indexing with Int:

let barcode = "M1XXXXXXXXX/CLEMENT EELT9QBQGVAAMSEZY1353 244 21D 531 10A1311446838"
let name = barcode.utf16[2..<22]
let pnrCode = barcode.utf16[23..<30]
let seatNo = barcode.utf16[47..<51]
let fromCity = barcode.utf16[30..<33]
let toCity = barcode.utf16[33..<36]
let carrier = barcode.utf16[36..<39]
let flightNumber = barcode.utf16[39..<44]
let day = barcode.utf16[44..<47]

I define my own subscript in an extension to UTF16View but I think this should go in the Standard Library.
Any thoughts?

David.
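
A minimal sketch of the kind of extension described above, not the actual code from the project. The `offsets:` label is an assumption chosen for illustration; it is not a standard library API, and the offsets are assumed to stay within the bounds of the view.

```swift
extension String.UTF16View {
    /// Hypothetical convenience subscript (not in the standard library):
    /// slices the view by integer UTF-16 code-unit offsets.
    /// The offsets must lie within the bounds of the view.
    subscript(offsets offsets: Range<Int>) -> Substring.UTF16View {
        let lower = index(startIndex, offsetBy: offsets.lowerBound)
        let upper = index(lower, offsetBy: offsets.count)
        return self[lower..<upper]
    }
}

let sample = "M1XXXXXXXXX/CLEMENT"
let formatCode = sample.utf16[offsets: 0..<2]   // code units of "M1"
let surname = sample.utf16[offsets: 2..<11]     // code units of "XXXXXXXXX"
```

Whether the two `index(_:offsetBy:)` calls run in constant time depends on how the string is stored, which is the point debated in the replies.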


(Tony Allevato) #2

It is an extremely rare case for a developer to know a priori what literal
numeric indices should be used when indexing into a string, because it only
applies when strings fall into a very specific format and encoding.

It's been discussed before during String-related proposals but AFAIK the
core team has come out against it—it would be an invitation for users who
don't understand the distinction to do very unsafe and wrong things with
strings. IMO, writing your own extension or using index(_:offsetBy:) isn't a
huge penalty here.
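
For reference, a rough sketch of the extension-free route using only standard library index arithmetic. The helper name `slice(_:from:length:)` is made up for this example; the `limitedBy:` variants make it return nil instead of trapping when the string is too short.

```swift
/// Hypothetical helper: returns `length` characters starting at character
/// offset `offset`, or nil if the string is too short.
func slice(_ text: String, from offset: Int, length: Int) -> Substring? {
    guard offset >= 0, length >= 0,
          let lower = text.index(text.startIndex, offsetBy: offset, limitedBy: text.endIndex),
          let upper = text.index(lower, offsetBy: length, limitedBy: text.endIndex)
    else { return nil }
    return text[lower..<upper]
}

let surname = slice("M1XXXXXXXXX/CLEMENT", from: 2, length: 9)
let tooShort = slice("M1", from: 1, length: 5)
```

Note that this walks Characters, so each call is linear in the offset rather than constant time.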

···

On Thu, Jun 8, 2017 at 10:32 AM David Hart via swift-evolution <swift-evolution@swift.org> wrote:

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Vladimir) #3

Hello,

When working with Strings which are known to be ASCII, I tend to use the UTF16View for the performance of random access.

About the performance: do we have a guarantee that 'barcode', declared in code and/or containing only ASCII chars, is internally stored as UTF-16? Otherwise, as I understand it, you'll pay a performance penalty calling utf16 when the internal storage is UTF-8, for example, no?

···

On 08.06.2017 20:32, David Hart via swift-evolution wrote:


(Dave Abrahams) #4

The standard library has been headed in the direction of supporting
underlying String encodings that are not UTF-16 (or UTF-16 subsets, like
ASCII and Latin-1), e.g. UTF-8, which would make it impossible to
support such an API performantly. So, such an addition would require a
change in our long-term strategy for String, which was laid out in
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

That's not to say it's impossible, but it would be a major course change.
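
To illustrate why the encoding matters, here is a small sketch showing that the same text has different lengths in each view once it leaves ASCII. With UTF-8 storage, finding the n-th UTF-16 code unit generally means scanning from the start. The sample strings are arbitrary.

```swift
let ascii = "GVA"
let accented = "caf\u{E9}"      // "café" with a precomposed é (U+00E9)
let emoji = "\u{1F600}"         // a single emoji outside the BMP

// For ASCII, all three views agree, so offsets are interchangeable.
print(ascii.count, ascii.utf16.count, ascii.utf8.count)          // 3 3 3

// é is one Character and one UTF-16 unit, but two UTF-8 bytes.
print(accented.count, accented.utf16.count, accented.utf8.count) // 4 4 5

// The emoji is one Character, a surrogate pair in UTF-16, four UTF-8 bytes.
print(emoji.count, emoji.utf16.count, emoji.utf8.count)          // 1 2 4
```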

···

on Thu Jun 08 2017, David Hart <swift-evolution@swift.org> wrote:

--
-Dave


(Ben Cohen) #5

Hi David,

My view is positional shredding of strings is enough of a use case that we ought to think about handling it better. I don’t think having to convert the string into an Array of Character, or bytes, or use Data, should be necessary, since this implies losing the stringiness of the slices you are creating, which is an inconvenience for many use cases. Strings-as-data is a thing we should support, and support well.

But I also don’t think giving String or its views integer indices is the right way to go either, incorrectly implying as it does random access (or, even if we did end up making utf16 permanently random-access, encouraging people towards using utf16 to support this use case when often they’d be better served sticking with characters).

There’s a few things to note about the example you give:
1) Do you really want utf16 view slices for your variable types? Maybe in this case you do, but I would guess a lot of the time what is desired would be a (sub)string.
2) Do you really want to hard-code integer literals into your code? Maybe for a quick shell script use case, but for anything more this seems like an anti-pattern that integer indices encourage.
3) Are you likely to actually want validation at each step – that the string was long enough, that the string data was valid at that point?
4) This case doesn’t seem to need random access particularly, so much as the (questionable? see above) convenience of integer indexing. Although reordered in the example, it seems like the code could be written to progressively traverse the string from left to right to get the fields. Unless the plan is to repeatedly access some fields over and over. But I’m not sure we’d want to put affordances into the std lib to encourage accessing stringly typed data...

So the question is, what affordances should we put into the standard library, or maybe other libraries, to help with these use cases? This is a big design space to explore, and at least some ideas ought to feed into our ongoing improvements to String for future releases.

For example:

Now that we have Substring, it opens up the ability to efficiently consume from the front (since it’s just adjusting the substring range):

extension Collection where Self == SubSequence {
    // or some better name...
    mutating func removeFirst(_ n: IndexDistance) -> SubSequence? {
        guard let i = index(startIndex, offsetBy: n, limitedBy: endIndex)
            else { return nil }
        
        defer { self = self[i...] }
        return self[..<i]
    }
}

Once you have this, you could use it to write the example code, along with some error checking (or you could use ! if you were completely certain of the integrity of your data)

var s = barcode[...] // make a substring for efficient consumption
_ = s.consume(2) // drop initial prefix
guard let name = s.removeFirst(20)?.trim else { fatalError("Failed to read name") }
guard let pnrCode = s.removeFirst(6).map(PNRCode.init) else { fatalError("Failed to read pnrCode") }
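
A self-contained variant of the idea above that should run as-is, offered as a sketch only. It assumes the name `consume` (the snippets above use both `removeFirst` and `consume`), restricts itself to Substring, and leaves out the undefined `trim` and `PNRCode` helpers.

```swift
extension Substring {
    /// Hypothetical helper: removes and returns the first `n` characters,
    /// or returns nil (leaving self untouched) if fewer than `n` remain.
    mutating func consume(_ n: Int) -> Substring? {
        guard n >= 0,
              let i = index(startIndex, offsetBy: n, limitedBy: endIndex)
        else { return nil }
        defer { self = self[i...] }   // runs after the return value is computed
        return self[..<i]
    }
}

let record = "M1XXXXXXXXX/CLEMENT"
var rest = record[...]             // Substring over the whole string
let format = rest.consume(2)       // "M1"
let surname = rest.consume(9)      // "XXXXXXXXX"
let overrun = rest.consume(100)    // nil, rest is left as "/CLEMENT"
```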

Or you could build on the above to create a function that could shred a string into a dictionary of fields… (probably a bit domain-specific for the std lib at this point)

extension Collection {
    // could choose to handle or fail on gaps, out-of-order ranges, overlapping ranges etc
    func fields<P: Collection>(at positions: P) -> [String:SubSequence]? // or throw an error with a field name
    where P.Element == (key: String, value: CountableRange<IndexDistance>)
}

let barcodeSchema: DictionaryLiteral = [
    "name": 2..<22,
    "pnrCode": 23..<30,
    "fromCity": 30..<33,
    "toCity": 33..<36,
    "carrier": 36..<39,
    "flightNumber": 39..<44,
    "day": 44..<47,
    "seatNo": 47..<51,
]

let fields = barcode.fields(at: barcodeSchema)!
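
The declaration above has no body, so here is one possible implementation, as a sketch under assumptions: it uses `KeyValuePairs` (the later name for `DictionaryLiteral`) and `Range<Int>`, returns nil for any field that falls outside the collection, and does not check for gaps or overlaps.

```swift
extension Collection {
    /// Hypothetical implementation of a `fields(at:)`-style helper.
    /// Returns nil if any range is negative or extends past the end.
    func fields(at positions: KeyValuePairs<String, Range<Int>>) -> [String: SubSequence]? {
        var result: [String: SubSequence] = [:]
        for (name, range) in positions {
            guard range.lowerBound >= 0,
                  let lower = index(startIndex, offsetBy: range.lowerBound, limitedBy: endIndex),
                  let upper = index(lower, offsetBy: range.count, limitedBy: endIndex)
            else { return nil }
            result[name] = self[lower..<upper]
        }
        return result
    }
}

let record = "M1XXXXXXXXX/CLEMENT"
let schema: KeyValuePairs<String, Range<Int>> = ["format": 0..<2, "surname": 2..<11]
let parsed = record.fields(at: schema)
```

Each lookup restarts from `startIndex`, which keeps the code simple at the cost of repeated traversal on a String.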

···

On Jun 8, 2017, at 10:32 AM, David Hart via swift-evolution <swift-evolution@swift.org> wrote:


(David Hart) #6

It is an extremely rare case for a developer to know a priori what literal numeric indices should be used when indexing into a string, because it only applies when strings fall into a very specific format and encoding.

It's been discussed before during String-related proposals but AFAIK the core team has come out against it—it would be an invitation for users who don't understand the distinction to do very unsafe and wrong things with strings. IMO, writing your own extension or using index.offset(by:) isn't a huge penalty here.

Is it really an invitation when it’s hidden inside the UTF16View?

···

On 8 Jun 2017, at 12:35, Tony Allevato <tony.allevato@gmail.com> wrote:

On Thu, Jun 8, 2017 at 10:32 AM David Hart via swift-evolution <swift-evolution@swift.org> wrote:


(David Hart) #7

Hello,
When working with Strings which are known to be ASCII, I tend to use the UTF16View for the performance of random access.

About the performance. Do we have a guarantee that 'barcode' declared in code and/or containing only ASCII chars internally stored as UTF16 ? Otherwise, as I understand, you'll have a performance penalty calling utf16 when internal storage is in UTF8 for example, no?

I'm fairly sure the internal storage of String is always UTF16.

···

On 8 Jun 2017, at 14:15, Vladimir.S <svabox@gmail.com> wrote:

On 08.06.2017 20:32, David Hart via swift-evolution wrote:


(Xiaodi Wu) #8

At some point–although it’s unclear if it’s still the case after the recent
String revisions–it was the case that Foundation extended String to allow
integer subscripting of UTF16View.

The issue with offering it generally is that it doesn’t reliably serve any
use case. You’re using it for performance, which works currently because
the underlying storage is UTF16-encoded, but this is explicitly _not_
guaranteed to remain the case going forward, and then your performance
would be impacted.

If I recall, the core team has said that the only guarantee of the
performance you’re looking for is to store the string yourself as a
sequence of bytes (using Data, for example). The argument is that, in your
barcode example, you’re not really manipulating it as a string at all, but
are relying on its being a particular sequence of bytes. As it happens,
when this topic has come up previously, it has also been the case that the
use case presented treated the string as a known sequence of bytes.
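
A sketch of the bytes-first approach described above, assuming a plain `[UInt8]` rather than Foundation's Data so that the example has no imports. Arrays give constant-time integer indexing regardless of how String is stored.

```swift
let barcode = "M1XXXXXXXXX/CLEMENT"

// Copy the UTF-8 code units out once; [UInt8] is random-access with Int indices.
let bytes = Array(barcode.utf8)

// For pure ASCII input, byte offsets and character offsets coincide
// (CR-LF pairs aside, which this sample does not contain).
let isASCII = bytes.allSatisfy { $0 < 0x80 }

let surnameBytes = bytes[2..<11]                              // ArraySlice<UInt8>
let surname = String(decoding: surnameBytes, as: UTF8.self)   // back to a String
```

The trade-off is the one raised elsewhere in the thread: the slices are no longer strings until they are decoded again.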

···

On Thu, Jun 8, 2017 at 16:26 David Hart via swift-evolution < swift-evolution@swift.org> wrote:

On 8 Jun 2017, at 12:35, Tony Allevato <tony.allevato@gmail.com> wrote:


(Félix Cloutier) #9

It's not. Static strings use UTF-8.

···

On 8 June 2017 at 16:38, David Hart via swift-evolution <swift-evolution@swift.org> wrote:

I'm fairly sure the internal storage of String is always UTF16.


(Charles Srstka) #10

Knowing this is required for accessibility support on macOS, since it’s needed to implement NSAccessibility methods such as accessibilityAttributedString(for:), accessibilityRTF(for:), accessibilityFrame(for:), etc.

Charles

···

On Jun 8, 2017, at 2:35 PM, Tony Allevato via swift-evolution <swift-evolution@swift.org> wrote:

It is an extremely rare case for a developer to know a priori what literal numeric indices should be used when indexing into a string, because it only applies when strings fall into a very specific format and encoding.


(David Hart) #11

Hi David,

My view is positional shredding of strings is enough of a use case that we ought to think about handling it better. I don’t think having to convert the string into an Array of Character, or bytes, or use Data, should be necessary, since this implies losing the stringiness of the slices you are creating, which is an inconvenience for many use cases. Strings-as-data is a thing we should support, and support well.

But I also don’t think giving String or its views integer indices is the right way to go either, incorrectly implying as it does random access (or, even if we did end up making utf16 permanently random-access, encouraging people towards using utf16 to support this use case when often they’d be better served sticking with characters).

There’s a few things to note about the example you give:
1) Do you really want utf16 view slices for your variable types? Maybe in this case you do, but I would guess a lot of the time what is desired would be a (sub)string.

Not really. I would be quite happy with Substring. The variable type is just the consequence of my choice of the UTF16 view.

2) Do you really want to hard-code integer literals into your code? Maybe for a quick shell script use case, but for anything more this seems like an anti-pattern that integer indices encourage.

I agree. I took the code out of a project of mine and simplified it. I have the Int ranges defined as named constants.

3) Are you likely to actually want validation at each step – that the string was long enough, that the string data was valid at that point?

No. I pre-validate the string and its length beforehand.

4) This case doesn’t seem to need random access particularly, so much as the (questionable? see above) convenience of integer indexing. Although reordered in the example, it seems like the code could be written to progressively traverse the string from left to right to get the fields. Unless the plan is to repeatedly access some fields over and over. But I’m not sure we’d want to put affordances into the std lib to encourage accessing stringly typed data...

Indeed. I initially rewrote the code to traverse the string from left to right, advancing indices by deltas. But it (1) made the code less readable, (2) moved it further from the barcode spec, which is defined in terms of offsets, and (3) made it more brittle, since editing one delta forces a chain of modifications:

let nameStart = barcode.index(barcode.startIndex, offsetBy: 2)
let nameEnd = barcode.index(nameStart, offsetBy: 20)
let name = barcode[nameStart..<nameEnd]

let pnrCodeStart = barcode.index(nameEnd, offsetBy: 1)
let pnrCodeEnd = barcode.index(pnrCodeStart, offsetBy: 7)
let pnrCode = barcode[pnrCodeStart..<pnrCodeEnd]

So the question is, what affordances should we put into the standard library, or maybe other libraries, to help with these use cases? This is a big design space to explore, and at least some ideas ought to feed into our ongoing improvements to String for future releases.

For example:

Now that we have Substring, it opens up the ability to efficiently consume from the front (since it’s just adjusting the substring range):

extension Collection where Self == SubSequence {
    // or some better name...
    mutating func removeFirst(_ n: IndexDistance) -> SubSequence? {
        guard let i = index(startIndex, offsetBy: n, limitedBy: endIndex)
            else { return nil }
        
        defer { self = self[i...] }
        return self[..<i]
    }
}

Once you have this, you could use it to write the example code, along with some error checking (or you could use ! if you were completely certain of the integrity of your data)

var s = barcode[...] // make a substring for efficient consumption
_ = s.consume(2) // drop initial prefix
guard let name = s.removeFirst(20)?.trim else { fatalError("Failed to read name") }
guard let pnrCode = s.removeFirst(6).map(PNRCode.init) else { fatalError("Failed to read pnrCode") }

Definitely interesting! But it has some of the same issues as my code above.

Or you could build on the above to create a function that could shred a string into a dictionary of fields… (probably a bit domain-specific for the std lib at this point)

extension Collection {
    // could choose to handle or fail on gaps, out-of-order ranges, overlapping ranges etc
    func fields<P: Collection>(at positions: P) -> [String:SubSequence]? // or throw an error with a field name
    where P.Element == (key: String, value: CountableRange<IndexDistance>)
}

let barcodeSchema: DictionaryLiteral = [
    "name": 2..<22,
    "pnrCode": 23..<30,
    "fromCity": 30..<33,
    "toCity": 33..<36,
    "carrier": 36..<39,
    "flightNumber": 39..<44,
    "day": 44..<47,
    "seatNo": 47..<51,
]

let fields = barcode.fields(at: barcodeSchema)!

This brings us back to more natural (for this algorithm) Int ranges, but it does require writing the extra extension. And if that's the case, it's easier to just write a string Int range subscript extension.

What I was hoping was that the Standard Library would provide an opt-in, slightly less safe but more convenient, pragmatic solution. In a similar way, the language lets us use the unwrapping operator to unwrap an optional we know to be non-nil. It's not as safe as optional binding, but it makes the code more readable for cases where we know more than the type system can provide. I don't know if I'm making sense.

···

On 11 Jun 2017, at 02:49, Ben Cohen <ben_cohen@apple.com> wrote:

On Jun 8, 2017, at 10:32 AM, David Hart via swift-evolution <swift-evolution@swift.org> wrote:


(David Hart) #12

At some point–although it’s unclear if it’s still the case after the recent String revisions–it was the case that Foundation extended String to allow integer subscripting of UTF16View.

The issue with offering it generally is that it doesn’t reliably serve any use case. You’re using it for performance, which works currently because the underlying storage is UTF16-encoded, but this is explicitly _not_ guaranteed to remain the case going forward, and then your performance would be impacted.

If I recall, the core team has said that the only guarantee of the performance you’re looking for is to store the string yourself as a sequence of bytes (using Data, for example). The argument is that, in your barcode example, you’re not really manipulating it as a string at all, but are relying on its being a particular sequence of bytes. As it happens, when this topic has come up previously, it has also been the case that the use case presented treated the string as a known sequence of bytes.

I understand what you are saying. It's just a pity that it's making the code much more complicated than it should be.

···

On 8 Jun 2017, at 14:11, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:


(Michael Ilseman) #13

Now that we have Substring, it opens up the ability to efficiently consume from the front (since it’s just adjusting the substring range):

extension Collection where Self == SubSequence {
    // or some better name...
    mutating func removeFirst(_ n: IndexDistance) -> SubSequence? {
        guard let i = index(startIndex, offsetBy: n, limitedBy: endIndex)
            else { return nil }
        
        defer { self = self[i...] }
        return self[..<i]
    }
}

Once you have this, you could use it to write the example code, along with some error checking (or you could use ! if you were completely certain of the integrity of your data)

var s = barcode[...] // make a substring for efficient consumption
_ = s.consume(2) // drop initial prefix
guard let name = s.removeFirst(20)?.trim else { fatalError("Failed to read name") }
guard let pnrCode = s.removeFirst(6).map(PNRCode.init) else { fatalError("Failed to read pnrCode") }

Definitely interesting! But it has some of the same issues as my code above.

This has one less issue than your code above: because it works on Characters you’ll get proper grapheme breaking. Even for ASCII, the provided code is safer as grapheme breaking is almost-but-not-quite-trivial for ASCII. To have code units be the same as Characters, you’ll need to validate that it is both ASCII *and* does not contain a CR-LF sequence inside, as CR-LF is a single grapheme.

This is not intuitive right away and is a good example of the kinds of traps that can arise when working directly on code units rather than Character, even if all ASCII. This can cause seemingly well tested code to blow up in production.
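
A short sketch of the CR-LF trap described above: the string is pure ASCII, yet Character offsets and code-unit offsets disagree after the line break. The sample string is arbitrary.

```swift
let crlf = "\r\n"
print(crlf.count)                  // 1: CR-LF is a single grapheme cluster
print(crlf.utf16.count)            // 2: but two UTF-16 code units

let line = "AB\r\nCD"
print(line.count)                  // 5 Characters
print(line.utf16.count)            // 6 code units

// Offset 3 means different things in the two views.
let characterAt3 = line[line.index(line.startIndex, offsetBy: 3)]   // "C"
let codeUnitAt3 = Array(line.utf16)[3]                              // 0x0A (LF)
```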

What I was hoping was that the Standard Library would provide an opt-in, slightly less safe but more convenient, pragmatic solution.

Could you elaborate on what you mean by “less safe”?

In a similar way, the language lets us use the unwrapping operator to unwrap an optional we know to be non-nil. It's not as safe as optional binding, but it makes the code more readable for cases where we know more than the type system can provide. I don't know if I'm making sense.

In the case of unwrapping, the safety given up is clear and the semantics are very obvious. For anything involving Unicode, correctness is non-obvious and corner cases can be treacherous.

···

On Jun 11, 2017, at 10:25 PM, David Hart via swift-evolution <swift-evolution@swift.org> wrote:

On 11 Jun 2017, at 02:49, Ben Cohen <ben_cohen@apple.com> wrote:

On Jun 8, 2017, at 10:32 AM, David Hart via swift-evolution <swift-evolution@swift.org> wrote:


(Jordan Rose) #14

I think Tony was getting at knowing the best indices for a string, vs. a format or API that is documented to use UTF-8 or UTF-16 indexes specifically (like, unfortunately, most of Cocoa and Cocoa Touch). It stinks that those may not be random-access if the underlying string buffer turns out to not be UTF-16, but that's true with NSString as well.
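
For the case where an API hands back UTF-16 offsets, a minimal sketch of converting them to string indices with `String.Index(utf16Offset:in:)` (available from Swift 5). The sample text and the offsets 9 and 13 are made up for illustration; the conversion is not guaranteed to be constant time.

```swift
let text = "na\u{EF}ve \u{1F600} text!"   // "naïve 😀 text!"

// Suppose an API documented in UTF-16 terms reports the range 9..<13.
let lower = String.Index(utf16Offset: 9, in: text)
let upper = String.Index(utf16Offset: 13, in: text)
let word = text[lower..<upper]            // "text"

// The emoji is two UTF-16 units but one Character, so the same numbers
// would select a different span if treated as Character offsets.
print(text.utf16.count, text.count)       // 14 13
```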

Jordan

···

On Jun 10, 2017, at 19:01, Charles Srstka via swift-evolution <swift-evolution@swift.org> wrote:

On Jun 8, 2017, at 2:35 PM, Tony Allevato via swift-evolution <swift-evolution@swift.org> wrote:

It is an extremely rare case for a developer to know a priori what literal numeric indices should be used when indexing into a string, because it only applies when strings fall into a very specific format and encoding.

Knowing this is required for accessibility support on macOS, since it’s needed to implement NSAccessibility methods such as accessibilityAttributedString(for:), accessibilityRTF(for:), accessibilityFrame(for:), etc.