Call for Users (and Authors): Offset indexing pitch

lukasa · March 12, 2019, 11:30am

Just to clarify: on what data types would this extension need to be provided to achieve this goal?

I think the use-cases you outline here would all be covered by simply implementing this on RandomAccessCollection. That would cover all the reasonable data types for this kind of work: Array<UInt8>, Data, UnsafeRawBufferPointer, and even NIO's ByteBufferView.

However, this list does reiterate my concern about using this on Collection. Any parser operating on Collection with offset-based indexing will almost certainly have quadratic performance. That would be extremely unfortunate.

bobergj · March 12, 2019, 2:04pm

ArraySlice (and Slice) sharing indices with the base collection is a really great feature!
Rather than adding more API I feel there is just a need of more documentation or tutorials to show what "idiomatic Swift" should look like when it comes to collection indices.

Someone wrote this example above:

func parseToken(at index: Int, in: [Int]) {
  let currentToken = array[index]
  if index + 1 < array.count {
    let nextToken = array[index + 1]
   // handle both tokens
  } else {
    // handle the single token
  }
}

Obviously this is a toy example, but building on that, I would say parsing a collection of tokens in idiomatic Swift looks more like:

let tokens = ... some collection
var tokensSlice = Slice(tokens)
while !tokensSlice.isEmpty {
   parseToken(tail: &tokensSlice)
}

func parseToken<C: Collection>(tail: inout Slice<C>) {
    let startIndex = tail.startIndex
    if let nextIndex = tail.index(startIndex, offsetBy: 1, limitedBy: tail.endIndex), nextIndex != tail.endIndex {
        // handle both tokens
        tail = tail.suffix(from: tail.index(after: nextIndex))
    } else {
        // handle the single token
        tail = tail.suffix(from: tail.index(after: startIndex))
    }
}

Now lets say you encounter an error and want to spit out the range of error in the original collection? Easy, because the slice indices are shared with the collection.

Someone else above mentioned CSV parsing, there is CSV parser among the benchmarks that uses a similar technique as above: https://github.com/apple/swift/blob/master/benchmark/single-source/CSVParsing.swift
(disclaimer: this is not a production csv parser and it uses utf-16 indexing which is no longer the fastest encoding)

nuclearace · March 12, 2019, 3:08pm

Most of the times I've wanted to use offset slicing/indexing isn't in code that's meant to be hardcore performant or in critical code. It's been in code for "scripts" or other throw away pieces of code I'm writing for examples/fun. And this is an area that given Swift's limited stdlib and particularly its String and regex capabilities, is extremely lacking, and does not make Swift feel as easy to use as a scripting language.

And while I understand the performance issues related to indexing on certain collections, or collection-like entities, the lack of expressive syntax/features to avoid having to drop down to calling various index calculating methods is actively harmful to users who know the risks associated with these kinds of operations, and are just seeking to be productive in their day.

This is implying that users who are taught this won't also be taught the dangers/pitfalls of using these APIs. It would be great if we could popup a window the first time a newbie programmer tried to use these without consulting documentation, but given that's an impossible task, the best we can do is accurately document the performance characteristics of these APIs. This should be done both in standard documentation, as well as in the Swift book. I'd go further and say that we should probably add an entire section to the Swift chapter on collections detailing Swift's idea of Indexes, since they're a core part of the collection system. In that section we should first explain the uses of the various indexing methods, and then introduce the short-hand offset based things.

I find the argument of "trivially implemented" falls down when you have features that often repeated or requested. Result is a perfect example of that. And I say that this would create apis that would be extremely useful in solving many of those requests, or open the door to solutions for those.

sveinhal · March 12, 2019, 3:45pm

Where offsets are known in advance?
I hear people say this, but I have yet to see a single example not better served with a scanner or some kind of sequential reader.

For very trivial examples such as a credit card number validator, it’s best to convert it into an array of digits anyways, and for those kinds of operations a simple .map is better.

Do you have a concrete example where you’d use these apis?

beccadax · March 12, 2019, 4:36pm

You’re preaching to the choir. Nonetheless, it is occasionally nice to do something relative.

Ben_Cohen · March 12, 2019, 5:14pm

I don't think confusion is the motivation. But sometimes you want the nth thing, and it's clearer not to have to do a calculation to get it, but to just state that directly. This also natural combines with the check that there are n things in the first place, similar to how first is a convenience that avoids having to check if the collection is empty by combining that with the fetch.

Ponyboy47 · March 12, 2019, 5:35pm

You're entirely right that I'm implying that. Programmers are lazy people (as a whole). Beginners don't usually reach for the right tool, they reach for the easiest tool. Indexes can be a complicated topic to explain in a blog post about how to do X as simply as possible. If I were a beginner coder and saw integer subscripts into collections, that's what I would reach for without ever thinking there were even another way of doing things.

This is something that I don't think is good enough. At what point would you infer that your code is slow because of the way you're subscripting the items of the collection? I don't know about you, but reading Swift's documentation about integer subscripts into collections would not be the first thing I checked...or even the 100th.

There would invariably be some StackOverflow answer that thousands of people would run into (myself included) describing why collection code can be so slow and that explains the correct usage of indexes instead of integer offsets.

You may feel that way but "trivially implementable" has been the reason for rejecting proposals in the past by the core team. I don't see the core team giving in to something that is trivially implementable just because a lot of people want it. Otherwise swift would end up with a bloated standard library littered with all kinds of sugar that adds a ton of convenience but not a whole lot of substance.

If you really want integer offsets into a collection, then it can be done in 5 lines of code:

extension Collection {
    public subscript(_ offset: Int) -> Element {
        return self[index(startIndex, offsetBy: offset)]
    }
}

I find this trivial enough that when I really don't want to use indexes and don't care about performance that I have used this exact code (usually as an extension to String itself though). It's not very often I find myself needing all the safety provided by the pitched code or the ability to slice (but I can't speak for everyone). Even if I did need more? It would also be trivial to do (the whole pitch is less than 100 lines for everything). If you really want integer subscripting for a script you're writing, I don't think it's unreasonable to write 5 extra lines of code to get what you want.

nuclearace · March 12, 2019, 7:02pm

I'll try to keep this response short, since most of my arguments have been brought up by many others previously, so I'll avoid bringing them all back up again. But what I think this boils down to is a language ergonomics issue. Especially compared to languages like Python that have a rich slicing syntax, Swift is much more cumbersome out of the box to do things that other languages can accomplish in a simple expression. And as I stated previously, I understand the reasons Swift has historically kept this verbosity on the user.

So I think one of the things that should be decided when this come to review is: should Swift continue to prioritize making possibly slow code very obvious, and make the index calculations explicit in the user code. Or should Swift also provide more expressive ways to have those calculations done in the stdlib.

And starting to touch on the things pitched here, I'd almost say they're too skeletal for most use cases that I would want. To me, the long term goal for these kinds of features would be what I mentioned earlier, rich slicing. Indeed, one of the most useful operations I would want from these new APIs isn't even covered, that being the ability to express end-based offsets. (Which I will add, feel incredibly odd that these range based subscripts don't behave more like Python slicing where you can specify negative offsets)

So, I don't think this pitch, as it stands, holds much weight unless the longterm goals of rich slicing, end based offsetting, etc would be quickly followed up on, and have a strong chance of passing evolution.

Ponyboy47 · March 12, 2019, 7:11pm

At work I primarily code using python so I am very familiar with a rich slicing syntax and in many ways I do wish Swift had such nice ergonomics. However I am often bitten by the poor performance of python as well (there's really only so much you can do to make python faster without switching to a completely different language).

I see both sides of the coin and my preference just happens to fall on the side of performance over convenience. Convenience can be provided at any point through extensions once you know enough about the proper coding practices, but it's much more work to increase performance after you've already written large swaths of code using bad practices.

I agree with you here. At the end of the day, this pitch/proposal will be decided by the core team and their desired future direction for Swift. I think you and I have probably said enough for now. ;)

Ben_Cohen · March 12, 2019, 7:51pm

I wouldn't go that far. Sometimes the offset is calculated as part of an algorithm that can be proven to never go beyond the end of the collection, so the only cause of failure would be programmer error and trapping on generation or use of the subscript is the right behavior.

You could argue that this is rare enough that, given our time again, that trapping version should be spelled c.index(i:offsetBy: n, limitedBy: c.endIndex)! to avoid the confusion you're describing. Though that could also be a performance pessimization in cases where the offset is guaranteed algorithmically to be within the limit.

Joe_Groff · March 12, 2019, 9:04pm

When talking about manipulating strings, it doesn't hurt to look at Perl, one of the great string-manipulating languages. Although Perl promotes its deeply-integrated regexes as the primary string manipulation mechanism, it also provides a free function to do indexed slicing of strings, substr($x, offset, length). Although it isn't O(1) and doesn't use subscript syntax or otherwise feel "first class", that hasn't been a huge deal. Javascript similarly has substr[ing] methods on its string type. Maybe we could take the same tack in Swift; spelling the offset-based accessor as .substr(x, y) instead of as a subscript would let us offer the easy thing for the times you really need it, while still signaling that, since this isn't a subscript, it isn't intended to be the primary primitive way you decompose strings.

lukasa · March 12, 2019, 9:18pm

I think if backward compatibility weren't a problem, I'd argue that the clarity would be worth the risk of performance pessimization. In practice, the risk here is almost irrelevant: it's just a minor wrinkle in a complex web of protocols. These things happen: c'est la vie.

I think this is an interesting direction to explore for String, if we decided to restrict offset-based indexing to RandomAccessCollection.

Nevin · March 12, 2019, 10:16pm

Presupposing the ultimate goal here is to make Swift a really great language for working with strings, perhaps we ought to consider what an ideal API for walking a string chunk by known-length-chunk would look like.

One possibility is to provide a way to get the substring after a given substring, with a specified length. For example:

let nextSection = text.substring(after: currentSection, length: n)

The implementation is straightforward, though there is a question of how to handle lengths that extend beyond the end of the string (trap, truncate, or optional). Whatever the specifics, an API like this would encourage (through convenience) writing code that traverses the string only once.

And of course there could be similar methods for, eg:

let nextLine = text.substring(after: currentLine, until: "\n")

let nextWord = text.nextWord(after: currentWord)

nick.keets · March 13, 2019, 8:33am

Sure, a concrete example would be git's pkt-line: https://github.com/git/git/blob/master/Documentation/technical/pack-protocol.txt#L66

Here's how I would parse it in Python:

def parse(pkt_line):
    if int(pkt_line[:4], 16) != len(pkt_line):
        raise Exception('bad length')
    tok = pkt_line[4:].split('\0')
    request_command, _, pathname = tok[0].partition(' ')
    if request_command not in ('git-upload-pack', 'git-receive-pack', 'git-upload-archive'):
        raise Exception('bad request-command')
    print('request-command:', request_command)
    print('pathname:', pathname)
    if len(tok) >= 2:
        host_parameter = tok[1]
        if host_parameter[:5] != 'host=':
            raise Exception('bad host-parameter')
        hostname, _, port = host_parameter[5:].partition(':')
        print('hostname:', hostname)
        if port:
            print('port:', port)
    if len(tok) >= 4:
        if tok[2] != '':
            raise Exception('bad extra-parameters')
        extra_parameters = tok[3].split('\0')
        for extra_parameter in extra_parameters:
            print('extra-parameter:', extra_parameter)

Is it quadratic? I honestly don't know, I guess when I'll have to parse a million of those lines I'll investigate.

TellowKrinkle · March 13, 2019, 9:40am

Just want to mention that in Swift, you should really be doing this kind of parsing on Data or [UInt8], not String (which are int-subscriptable, though you have to be careful with Data due to it being able to be a slice as well). Not sure if you hit any in this case, but it's very easy to accidentally accept inputs that you shouldn't have if you use Swift String ==, for example ";" == "\u{37e}" In addition, if the extra parameters contained multibyte UTF-8 characters, they would cause a bad length exception for both Swift and Python strings, since Swift counts graphemes and Python counts unicode scalars (AFAIK)

I do think Swift should get better ways to parse ASCII-ish things, either through an ASCIIString struct or maybe just through String.utf8.

Also your example would have worked great with a Scanner which I do wish Swift had.

lukasa · March 13, 2019, 9:49am

That function is not quadratic in the length of its input in either Python or Swift, because it does not loop. len(pkt_line) is (possibly) linear time in Swift, constant time in Python. Half-open slicing at constant indices (e.g. [4:] or [:5]) is constant time in Python and Swift. Split and partition operations are linear time in Python and Swift. This operation is thus linear time, dominated by repeated string walks to locate tokens.

The accidentally quadratic use-case occurs when you have a loop over the String itself, and then use offset-based indexing within that loop where the offset index is in some way related to the loop iteration or the String length.

I should note, though, that what you have in Python is not a string but a bytestring (i.e. Python 2's str not unicode, which in Swift is [UInt8] or Data and not String). Such a data type in Swift is a RandomAccessCollection and so can do offset-based indexing in constant-time.

nick.keets · March 13, 2019, 10:15am

It is actually using normal (Python 3) strings, if they were byte strings all the splitting and matching would need to be done with b'...' arguments.

nick.keets · March 13, 2019, 10:17am

I'd love to see an equivalent example using a Scanner to get an idea how it would work.

lukasa · March 13, 2019, 10:26am

Heh, I'm somewhat troubled by the presence of a U+0000 in your strings, but that's fine.

nuclearace · March 13, 2019, 1:47pm

I feel like this is probably the best solution given the current contention around subscripts and the amount of helpful work they should do. Perhaps we keep subscripts as only operating on Indexs, and instead provide methods for doing slicing/offsetting based on the idea of the Intth Element in a collection. I think this also has the added benefit of allowing greater discoverability and composability, while keeping the idea that subscript based access should be performant.