[Pitch] Make offset index available for String

Hi,

I would like to discuss the String.Index problem within Swift. I know the current situation of String.Index is based on the nature of the underlying data structure of the string.

But could we just make String.Index contain offset information? Or make offset index subscript available for accessing characters in String?

for example:

let a = "01234"
print(a[0]) // 0
print(a[0...4]) // 01234
print(a[...]) // 01234
print(a[..<2]) // 01
print(a[...2]) // 012
print(a[2...]) // 234
print(a[2...3]) // 23
print(a[2...2]) // 2
if let number = a.index(of: "1") {
    print(number) // 1
    print(a[number...]) // 1234
}

0 equals to Collection.Index of collection.index(startIndex, offsetBy: 0)
1 equals to Collection.Index of collection.index(startIndex, offsetBy: 1)
...
we keep the String.Index, but allow another kind of index, which is called "offsetIndex" to access the String.Index and the character in the string.
Any Collection could use the offset index to access their element, regarding the real index of it.

I have made the Collection OffsetIndexable protocol available here, and make it more accessible for StringProtocol considering all API related to the index.

If someone wants to make the offset index/range available for any collection, you just need to extend the collection:

extension String : OffsetIndexableCollection {
}

extension Substring : OffsetIndexableCollection {
}

I hope the Swift core team could consider bring the offset index to string, or make it available to other collection, thus let the developer decide whether their collection could use offset indices as an assistant for the real index of the collection.

Thanks!
Jiannan

People used to the offset index system instead of the String.Index. Use offset indices to name the elements, count the element is normal and natural.

This offset index system has a long history and a real meaning to the collection. The subscript s[i] has a fix meaning “getting the i-th element in this collection”, which is normal and direct. Get the range with offset index, is also directly means the substring is from the i-th character up to the j-th character of the original string.

People used to play subscript, range with offset index, with countable numbers. Use string[string.index(i, offsetBy: 5)] is not as directly and easily as string[i + 5]. Also the Range<String.Index> is not as directly as Range. Developers need to transfer the Range<String.Index> result of string.range(of:) to Range to know the exact range of the substring. Range<String.Index> has a real meaning to the machine and underlying data location for the substring, but Range also has a direct location information for human beings and the abstract concept of the collection.

Offset index system is based on the nature of the collection. Each element of the collection could be located by offset, which is a direct and simple conception to any collection. Right? Even the String with String.Index has some offset index property within it. For example the count of the String, is the offset index of the endIndex.The enumerated() generated a sequence with elements contains the same offset as I provided. And when we apply Array(string), the string divided by each character and make the offset indices available for the new array.

The offset index system is just an assistant for collection, not a replacement to String.Index. We use String.Index to represent the normal underlying of the String. We also could use offset indices to represent the nature of the Collection with its elements. Providing the offset index as a second choice to access all the collection, is not only for the String struct, it is for all collections, since it is the nature of the collection concept, and developers could choose to use it or not.

We don’t make the String.Index O(1), but translate the offset indices to the underlaying Collection.Index. Each time use subscript with offset index, translating them using c.index(startIndex, offsetBy:i), c.distance(from: startIndex, to:i)

We can make the offset indices available through an extension to Collection (as my GitHub repo demo: https://github.com/frogcjn/OffsetIndexableCollection-String-Int-Indexable-).

or we could make it at compile time:
for example:


	c[1...]

compile to

	c[c.index(startIndex, offsetBy:1)...]

	let index: Int = s.index(of: "a")

compile to

	let index: Int = s.distance(from: s.startIndex, to: s.index(of:"a"))

	let index = 1 // if used in s only
	s[index..<index+2]

compile to

	let index = s.index(s.startIndex, offsetBy: 1)
	s[index..<s.index(index, offsetBy: 2)]

	let index = 1 // if used both in s1, s2
	s1[index..<index+2]
	s2[index..<index+2]

compile to

	let index = 1
	let index1 = s1.index(s.startIndex, offsetBy: index)
	let index2 = s2.index(s.startIndex, offsetBy: index)
	s1[index1..<s.index(index1, offsetBy: 2)]
	s2[index2..<s.index(index2, offsetBy: 2)]

I really want the team to consider providing the offset index system as an assistant to the collection.

Thanks!
Jiannan

Isn’t Collection.Index supposed to be O(1)? I believe this proposal violates that.

Isn’t Collection.Index supposed to be O(1)? I believe this proposal violates that.

I don’t think using an ExpressibleByIntegerLiteral type to represent the location meaning it is O(1).

My point is that allowing for integer subscripting gives the (incorrect) impression that such an operation is O(1), when in fact it is O(n).

Here’s the issue with indexing strings with numbers:

let a = "Ḁ̵͉͉͉͎̩̐͂̃ͧ̕͢1234"
if let number = a.index(of: "1") {
	print(number)
	// With numbers: 1
	// Currently Swift.String.Index(_compoundOffset: 56, _cache: Swift.String.Index._Cache.character(1)),
	// which represents the 14th UTF-16 codepoint
	print(a[number...])
	// With numbers, this would now have to figure out what codepoint is the 2nd character, an O(n) lookup.
	// With the current index system, the index already represents the correct codepoint.
}

As you can see, this makes it very easy to accidentally make things O(n) where you thought they were O(1) (like using an index you got from string.index(of:)) and accidentally make O(n^2) where you thought they were O(n) (like if you used for i in 0..<a.count { a[i] })

1 Like

Or we can make the distance information stored in String.Index.

Isn’t that what we do right now?

No. The String.Index contains the U32 offset, instead of the index distance to the beginning.

So you’re saying the index of “1” in the string “<Some character that takes 14 UTF-16 codepoints>1” should be 1? Or 14? Or something else?

No. My implementation keep the String.Index.

If you use your code with my extension, it will be the right O(1). the number is the String.Index instead of the Int.

So you’re saying that "Ḁ̵͉͉͉͎̩̐͂̃ͧ̕͢1234".index(of: "1") should return 14?

No. It could return the String.Index. The same result of current Swift 4 version.

Both code will work with my implementation: https://github.com/frogcjn/OffsetIndexableCollection-String-Int-Indexable-

if let number = a.index(of: "1") as Int? {
if let number = a.index(of: "1") as String.Index? {

So if you want to keep the O(1) representation, then use String.Index.
If you want to use the offset index version, then use Int.

It just make a second choice available, not replacing the String.Index.

Ahh okay. There’s still a few issues with this.

  • At the moment, the Swift compiler prefers Int over other types when choosing the type of something. This means that the default String.index(of:) method would end up being the Int one.
  • The ability to access elements in a String with Ints still makes it much easier to write O(n^2) code when looping over Strings. If someone wanted to loop over the indices into a String, which do you think they would try first, for index in 0..<str.count { str[index...] } or for index in str.indices { str[index...] }?

In fact, the only time I can think of that I would want the ability to access a String with an integer index where doing so would also not be slower than using a String.Index is if I was using a hardcoded literal. Otherwise, pretty much anywhere I would want to use an integer would also be made slower by using the integer.

Could you make this possible?
There are also enum index problems with using String.Index

let index1: String.Index
let index2: String.Index
for i in index1..<index2 { // this is not possible with String.Index, but possible with offset index
}

These two kinds of indexing are the trade-off. Which one is better? I don’t think there is a better one. They both have their own unique value.

The problem with this is that String indexes don’t have a reference to the string they’re indexes for, and therefore don’t know how far forward to move each time they’re advanced. This used to be the case, but was changed in [RFC] New collections model: collections advance indices, so you can look at the reasoning there.

As for using OffsetIndex as a solution to the problem, doing a for i in index1..<index2 would have O(n^2) time complexity, which would (I think) surprise many people. If you do want an index that works in a for i in index1..<index2, it would be better to make an index that wraps the base string and a String.Index and uses that to implement Strideable, since it wouldn’t have the performance issues, but I think the use cases of something like that are too small to implement in the standard library.

Finally, what are the enum index problems you’re talking about? I don’t know of any ways I would want to use a String.Index with an enum…

1 Like

I don’t know what you mean by “U32 offset”, but I believe you are mistaken. String.Index stores the encoded UTF16 offset, alongside small transcoded offsets and the distance to the next grapheme. Could you elaborate on how your approach is different?