Call for Users (and Authors): Offset indexing pitch

I don't think confusion is the motivation. But sometimes you want the nth thing, and it's clearer not to have to do a calculation to get it, but to just state that directly. This also natural combines with the check that there are n things in the first place, similar to how first is a convenience that avoids having to check if the collection is empty by combining that with the fetch.

3 Likes

You're entirely right that I'm implying that. Programmers are lazy people (as a whole). Beginners don't usually reach for the right tool, they reach for the easiest tool. Indexes can be a complicated topic to explain in a blog post about how to do X as simply as possible. If I were a beginner coder and saw integer subscripts into collections, that's what I would reach for without ever thinking there were even another way of doing things.

This is something that I don't think is good enough. At what point would you infer that your code is slow because of the way you're subscripting the items of the collection? I don't know about you, but reading Swift's documentation about integer subscripts into collections would not be the first thing I checked...or even the 100th.

There would invariably be some StackOverflow answer that thousands of people would run into (myself included) describing why collection code can be so slow and that explains the correct usage of indexes instead of integer offsets.

You may feel that way but "trivially implementable" has been the reason for rejecting proposals in the past by the core team. I don't see the core team giving in to something that is trivially implementable just because a lot of people want it. Otherwise swift would end up with a bloated standard library littered with all kinds of sugar that adds a ton of convenience but not a whole lot of substance.


If you really want integer offsets into a collection, then it can be done in 5 lines of code:

extension Collection {
    public subscript(_ offset: Int) -> Element {
        return self[index(startIndex, offsetBy: offset)]
    }
}

I find this trivial enough that when I really don't want to use indexes and don't care about performance that I have used this exact code (usually as an extension to String itself though). It's not very often I find myself needing all the safety provided by the pitched code or the ability to slice (but I can't speak for everyone). Even if I did need more? It would also be trivial to do (the whole pitch is less than 100 lines for everything). If you really want integer subscripting for a script you're writing, I don't think it's unreasonable to write 5 extra lines of code to get what you want.

1 Like

I'll try to keep this response short, since most of my arguments have been brought up by many others previously, so I'll avoid bringing them all back up again. But what I think this boils down to is a language ergonomics issue. Especially compared to languages like Python that have a rich slicing syntax, Swift is much more cumbersome out of the box to do things that other languages can accomplish in a simple expression. And as I stated previously, I understand the reasons Swift has historically kept this verbosity on the user.

So I think one of the things that should be decided when this come to review is: should Swift continue to prioritize making possibly slow code very obvious, and make the index calculations explicit in the user code. Or should Swift also provide more expressive ways to have those calculations done in the stdlib.

And starting to touch on the things pitched here, I'd almost say they're too skeletal for most use cases that I would want. To me, the long term goal for these kinds of features would be what I mentioned earlier, rich slicing. Indeed, one of the most useful operations I would want from these new APIs isn't even covered, that being the ability to express end-based offsets. (Which I will add, feel incredibly odd that these range based subscripts don't behave more like Python slicing where you can specify negative offsets)

So, I don't think this pitch, as it stands, holds much weight unless the longterm goals of rich slicing, end based offsetting, etc would be quickly followed up on, and have a strong chance of passing evolution.

3 Likes

At work I primarily code using python so I am very familiar with a rich slicing syntax and in many ways I do wish Swift had such nice ergonomics. However I am often bitten by the poor performance of python as well (there's really only so much you can do to make python faster without switching to a completely different language).

I see both sides of the coin and my preference just happens to fall on the side of performance over convenience. Convenience can be provided at any point through extensions once you know enough about the proper coding practices, but it's much more work to increase performance after you've already written large swaths of code using bad practices.

I agree with you here. At the end of the day, this pitch/proposal will be decided by the core team and their desired future direction for Swift. I think you and I have probably said enough for now. ;)

2 Likes

I wouldn't go that far. Sometimes the offset is calculated as part of an algorithm that can be proven to never go beyond the end of the collection, so the only cause of failure would be programmer error and trapping on generation or use of the subscript is the right behavior.

You could argue that this is rare enough that, given our time again, that trapping version should be spelled c.index(i:offsetBy: n, limitedBy: c.endIndex)! to avoid the confusion you're describing. Though that could also be a performance pessimization in cases where the offset is guaranteed algorithmically to be within the limit.

1 Like

When talking about manipulating strings, it doesn't hurt to look at Perl, one of the great string-manipulating languages. Although Perl promotes its deeply-integrated regexes as the primary string manipulation mechanism, it also provides a free function to do indexed slicing of strings, substr($x, offset, length). Although it isn't O(1) and doesn't use subscript syntax or otherwise feel "first class", that hasn't been a huge deal. Javascript similarly has substr[ing] methods on its string type. Maybe we could take the same tack in Swift; spelling the offset-based accessor as .substr(x, y) instead of as a subscript would let us offer the easy thing for the times you really need it, while still signaling that, since this isn't a subscript, it isn't intended to be the primary primitive way you decompose strings.

7 Likes

I think if backward compatibility weren't a problem, I'd argue that the clarity would be worth the risk of performance pessimization. In practice, the risk here is almost irrelevant: it's just a minor wrinkle in a complex web of protocols. These things happen: c'est la vie.

I think this is an interesting direction to explore for String, if we decided to restrict offset-based indexing to RandomAccessCollection.

1 Like

Presupposing the ultimate goal here is to make Swift a really great language for working with strings, perhaps we ought to consider what an ideal API for walking a string chunk by known-length-chunk would look like.

One possibility is to provide a way to get the substring after a given substring, with a specified length. For example:

let nextSection = text.substring(after: currentSection, length: n)

The implementation is straightforward, though there is a question of how to handle lengths that extend beyond the end of the string (trap, truncate, or optional). Whatever the specifics, an API like this would encourage (through convenience) writing code that traverses the string only once.

And of course there could be similar methods for, eg:

let nextLine = text.substring(after: currentLine, until: "\n")

let nextWord = text.nextWord(after: currentWord)
1 Like

Sure, a concrete example would be git's pkt-line: https://github.com/git/git/blob/master/Documentation/technical/pack-protocol.txt#L66

Here's how I would parse it in Python:

def parse(pkt_line):
    if int(pkt_line[:4], 16) != len(pkt_line):
        raise Exception('bad length')
    tok = pkt_line[4:].split('\0')
    request_command, _, pathname = tok[0].partition(' ')
    if request_command not in ('git-upload-pack', 'git-receive-pack', 'git-upload-archive'):
        raise Exception('bad request-command')
    print('request-command:', request_command)
    print('pathname:', pathname)
    if len(tok) >= 2:
        host_parameter = tok[1]
        if host_parameter[:5] != 'host=':
            raise Exception('bad host-parameter')
        hostname, _, port = host_parameter[5:].partition(':')
        print('hostname:', hostname)
        if port:
            print('port:', port)
    if len(tok) >= 4:
        if tok[2] != '':
            raise Exception('bad extra-parameters')
        extra_parameters = tok[3].split('\0')
        for extra_parameter in extra_parameters:
            print('extra-parameter:', extra_parameter)    

Is it quadratic? I honestly don't know, I guess when I'll have to parse a million of those lines I'll investigate.

Just want to mention that in Swift, you should really be doing this kind of parsing on Data or [UInt8], not String (which are int-subscriptable, though you have to be careful with Data due to it being able to be a slice as well). Not sure if you hit any in this case, but it's very easy to accidentally accept inputs that you shouldn't have if you use Swift String ==, for example ";" == "\u{37e}" In addition, if the extra parameters contained multibyte UTF-8 characters, they would cause a bad length exception for both Swift and Python strings, since Swift counts graphemes and Python counts unicode scalars (AFAIK)

I do think Swift should get better ways to parse ASCII-ish things, either through an ASCIIString struct or maybe just through String.utf8.

Also your example would have worked great with a Scanner which I do wish Swift had.

4 Likes

That function is not quadratic in the length of its input in either Python or Swift, because it does not loop. len(pkt_line) is (possibly) linear time in Swift, constant time in Python. Half-open slicing at constant indices (e.g. [4:] or [:5]) is constant time in Python and Swift. Split and partition operations are linear time in Python and Swift. This operation is thus linear time, dominated by repeated string walks to locate tokens.

The accidentally quadratic use-case occurs when you have a loop over the String itself, and then use offset-based indexing within that loop where the offset index is in some way related to the loop iteration or the String length.

I should note, though, that what you have in Python is not a string but a bytestring (i.e. Python 2's str not unicode, which in Swift is [UInt8] or Data and not String). Such a data type in Swift is a RandomAccessCollection and so can do offset-based indexing in constant-time.

1 Like

It is actually using normal (Python 3) strings, if they were byte strings all the splitting and matching would need to be done with b'...' arguments.

I'd love to see an equivalent example using a Scanner to get an idea how it would work.

Heh, I'm somewhat troubled by the presence of a U+0000 in your strings, but that's fine.

I feel like this is probably the best solution given the current contention around subscripts and the amount of helpful work they should do. Perhaps we keep subscripts as only operating on Indexs, and instead provide methods for doing slicing/offsetting based on the idea of the Intth Element in a collection. I think this also has the added benefit of allowing greater discoverability and composability, while keeping the idea that subscript based access should be performant.

2 Likes

This is a huge part of the motivation behind Contiguous Strings, which is now in review.

1 Like

Here's what I would do if I were to use an ASCII String version of my string scanner to parse:

func parse(pktLine: ASCIIString) throws {
	var scanner = Scanner(pktLine)
	guard scanner.read(count: 4).flatMap({ Int($0, radix: 16) }) == pktLine.count else {
		throw ParseException("Bad length")
	}
	guard let requestCommand = scanner.read(toNext: " "),
	      ["git-upload-pack", "git-receive-pack", "git-upload-archive"].contains(requestCommand)
	else {
		throw ParseException("Bad request-command")
	}
	print("request-command: \(requestCommand)")
	guard let pathname = scanner.read(toNext: "\0") else {
		throw ParseException("Missing pathname")
	}
	print("pathname: \(pathname)")
	if scanner.isEmpty { return }

	guard scanner.remove(prefix: "host=") else {
		throw ParseException("Bad host-parameter")
	}
	guard var hostnameAndPort = scanner.read(toNext: "\0").map(Scanner.init), !hostnameAndPort.isEmpty else {
		throw ParseException("Missing hostname")
	}
	let hostname = hostnameAndPort.read(toNext: ":")!
	let port = hostnameAndPort.rest
	print("hostname: \(hostname)")
	port.map { print("port: \($0)") }

	guard let extraPrefix = scanner.read(toNext: "\0") else { return }
	guard extraPrefix.isEmpty else {
		throw ParseException("bad extra-parameters")
	}
	while let extraParameter = scanner.read(toNext: "\0") {
		print("extra-parameter: \(extraParameter)")
	}
}
1 Like

Have you considered the potential confusion where array slices will sometimes have completely different behavior for:

A[4] // may actually be the first element of the slice if the startIndex is 4.
A[offset: 4] // Always startIndex+4

? I haven't read all the comments in detail, so I'm sorry if someone already asked this.

-Chris

2 Likes

Yes, that's been discussed in this thread and I briefly alluded to to in the pitch and examples:

This is worth elaborating on with code examples (especially generic functions over RAC) in a new iteration of the pitch.

1 Like

Sorry for bringing this back up again. It seems like the consensus was to wait and try again later. But for anyone interested, I performed some benchmarks using some of the suggested solutions mentioned in the thread. Namely the suggested subscript(offset:), using a RAC and the other approach of a caching* wrapper (referred to as offset view).

*: I'm not exactly sure what Brent meant by cached indices. But in any case the wrapper I used only keeps track of the last index it visited and the offset to that index from the startIndex. So when a new offset is requested, it moves from the cached index by the difference of the cached offset and the new offset.

I used three substring finding algorithms to benchmark. The algorithms all do the same thing in that they each find the first index of some substring. Please ignore the performance of the algorithms relative to each other. Instead focus on the comparison between the four approaches (index, offset view, random access and subscript).

I've included an image that shows the graphs of the benchmarks. The left contains all four approaches, and the right excludes the subscript approach. This addition is so that it's easier to compare the other three.

Source code for the benchmarks can be found on Github

From the results, it's quite clear to see that creating a random access collection is the fastest approach, even against using indices directly. The caveats for the random access approach are the additional storage requirements O(n) and the lack of being able to mutate the underlying collection directly.
The offset view approach works well and is by no means 'slow'; in fact; it can outperform the index model approach in some cases. These cases are where the brunt of the work can be moved away from needing to traverse indices directly.
The subscript directly on the collection is well and truly the least performant. But still is the easiest to use (by little though).


In terms of my personal experience with doing this little project, by far the hardest part was writing the algorithms using the Swift index model. The other three approaches were all about as easy to use as the other.

2 Likes