Allow String.*View.Index.samePosition(in:) to work on Substring (and maybe StringProtocol)

Generally, I try to define all my string processing code on Substring (or StringProtocol) to avoid forcing constant reallocations of the string buffer. This usually works fine, but on methods like this:

func firstWord(in str: String) -> Substring {
	let index = str.unicodeScalars.index {
		CharacterSet.whitespacesAndNewlines.contains($0)
	} ?? str.unicodeScalars.endIndex

	// The character set should only match the first UnicodeScalar in a Character
	let strIndex = index.samePosition(in: str)!
	return str[..<strIndex]
}

it breaks if I attempt to change the input type to Substring (and breaks even more if I attempt to accept StringProtocol). Substring.unicodeScalars is a String.UnicodeScalarView, which uses String.UnicodeScalarView.Index as its index. The samePosition(in:) method on these indexes has no overload that accepts a Substring, so the call no longer compiles.

Possible Solutions

  • Add a samePosition(in: Substring) method to String.*View.Index. This would (I assume) be the easiest change, but I would guess this would be much harder to implement on StringProtocol if support for that is wanted.
  • Move samePosition to StringProtocol, as an index(samePositionAs: Index) method. This would allow all StringProtocol members to provide their own implementations of the operation for their own *View Index types, but is a much bigger change than the first option.
  • Something else. These are the two solutions I could think of; if you can think of another, please share it.

Thoughts?

@Michael_Ilseman or @lorentey would know better, but from a quick glance a few extra constraints should help:

func firstWord<S>(in str: S) -> S.SubSequence
where S: StringProtocol, S.Index == String.Index,
      S.UnicodeScalarView.Index == S.Index,
      ... {
}

If that is indeed the case, then there is some design space to be explored to make index conversion more generic.

That still doesn't work because samePosition(in:) is only defined for String, String.UnicodeScalarView, String.UTF16View, and String.UTF8View. It's not generic at all.
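
To make the gap concrete, here is a minimal sketch of what the existing overloads do with a String argument. The sample string and variable names are only for illustration; the commented-out line shows the Substring call that has no matching overload.

```swift
// "e" followed by a combining acute accent forms a single Character,
// so the string has two Characters but three Unicode scalars.
let s = "e\u{301}x"
let scalars = s.unicodeScalars

let mark = scalars.index(after: scalars.startIndex) // scalar index of U+0301
let x = scalars.index(after: mark)                  // scalar index of "x"

// samePosition(in:) with a String returns nil when the index
// falls in the middle of a Character.
let insideCharacter = mark.samePosition(in: s)
let onBoundary = x.samePosition(in: s)

// There is no overload taking a Substring, so this does not compile:
// let sub = s[...]
// _ = x.samePosition(in: sub)
```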

The following playground works for me in Xcode 9.2:

import Foundation

func firstWord(in str: String) -> Substring {
    let index = str.unicodeScalars.index {
        CharacterSet.whitespacesAndNewlines.contains($0)
        } ?? str.unicodeScalars.endIndex
    return str[..<index]
}
func firstWord2(in str: Substring) -> Substring {
    let index = str.unicodeScalars.index {
        CharacterSet.whitespacesAndNewlines.contains($0)
        } ?? str.unicodeScalars.endIndex
    return str[..<index]
}
func firstWord3<S: StringProtocol>(in str: S) -> Substring
where S.Index == String.Index, S.UnicodeScalarView.Index == S.Index, S.SubSequence == Substring {
    let index = str.unicodeScalars.index {
        CharacterSet.whitespacesAndNewlines.contains($0)
        } ?? str.unicodeScalars.endIndex
    return str[..<index]
}
func firstWord4<S: StringProtocol>(in str: S) -> S.SubSequence
    where S.Index == String.Index, S.UnicodeScalarView.Index == S.Index {
        let index = str.unicodeScalars.index {
            CharacterSet.whitespacesAndNewlines.contains($0)
            } ?? str.unicodeScalars.endIndex
        return str[..<index]
}
func firstWord5<S: StringProtocol>(in str: S) -> S.SubSequence
    where S.UnicodeScalarView.Index == S.Index {
        let index = str.unicodeScalars.index {
            CharacterSet.whitespacesAndNewlines.contains($0)
            } ?? str.unicodeScalars.endIndex
        return str[..<index]
}



firstWord(in: "abc def ghi") // => abc
firstWord2(in: "abc def ghi"[...]) // => abc

firstWord3(in: "abc def ghi"[...]) // => abc
firstWord3(in: "abc def ghi") // => abc

firstWord4(in: "abc def ghi"[...]) // => abc
firstWord4(in: "abc def ghi") // => abc

firstWord5(in: "abc def ghi"[...]) // => abc
firstWord5(in: "abc def ghi") // => abc

String's index interchangeability doesn't seem to be reflected in StringProtocol currently, so we add those constraints ourselves. When the future of StringProtocol is more certain, we can add these constraints directly on the protocol.

edit: Added firstWord5, which shows that we don't need the S.Index == String.Index constraint

Ahh I see. Either way, this is just a method to do what I did without using samePosition(in:) (which I didn't realize would compile). What if you wanted to do this:

func prefix(of str: String, until condition: (UnicodeScalar) -> Bool) -> Substring {
	var index = str.unicodeScalars.index(where: condition) ?? str.unicodeScalars.endIndex

	while true {
		if let characterIndex = index.samePosition(in: str) {
			return str[..<characterIndex]
		}
		// Step back one Unicode scalar at a time until the index is Character-aligned
		index = str.unicodeScalars.index(before: index)
	}
}

I guess when simplifying my use of the function for my example I accidentally made it into something that didn't actually need to use samePosition(in:), but the fact that samePosition(in:) doesn't work with Substring still seems to be an issue.

Ah, so you want to know when the index is grapheme-aligned, that is, the index is a member of str.indices. Yes, this is a gap in Substring.
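
Until something like that exists, one way to approximate the check is to test membership in indices directly. This is only a sketch: isCharacterAligned is a hypothetical helper, not a standard library API, and it is a linear scan rather than the constant-time check a built-in method could offer.

```swift
// Hypothetical helper (not part of the standard library).
// Returns true if `i` is a Character boundary of `s`.
func isCharacterAligned<S: StringProtocol>(_ i: S.Index, in s: S) -> Bool {
    // endIndex is a valid boundary but is never an element of `indices`.
    return i == s.endIndex || s.indices.contains(i)
}

let s = "e\u{301}x"                                  // Characters: "é", "x"
let scalars = s.unicodeScalars
let mark = scalars.index(after: scalars.startIndex)  // inside "é"
let x = scalars.index(after: mark)                   // start of "x"
let sub = s[...]                                     // Substring over the whole string

let markAlignedInString = isCharacterAligned(mark, in: s)
let xAlignedInSubstring = isCharacterAligned(x, in: sub)
```

Because the helper is generic over StringProtocol, the same call works for both String and Substring, which is the part samePosition(in:) does not cover.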

Off the top of my head, I can't think of a reason why this couldn't be defined for Substring. There might be some more effort involved in having this be present on StringProtocol, as StringProtocol might need further constraints on its associated types.

Looks like a good pitch to me

Sorry to resurrect this thread, but it seems that String.Index.samePosition(in:) still does not offer a way to do index conversion on Substring views, even though Substring uses String.Index. Has this gap been closed in another way?

Swift 5.7 made an important step in the right direction by having Substring.index(before:) and Substring.index(after:) automatically and implicitly round the supplied index down to the nearest character boundary within the substring.

I think it would make sense to provide this rounding-down operation as explicit methods on String and Substring. SE-0180 called this out as important future work; it’s time we followed up on that.
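
As a rough illustration of what such an explicit method might look like, here is a sketch written against the public API only. The name roundedDownToCharacter(_:) is an assumption made for this example, not a proposed or existing API, and the linear scan is for clarity; a standard library implementation would presumably be much cheaper.

```swift
extension StringProtocol {
    // Hypothetical method (not a standard library API).
    // Returns the nearest Character boundary at or before `i`.
    func roundedDownToCharacter(_ i: Index) -> Index {
        precondition(i >= startIndex && i <= endIndex, "index out of bounds")
        if i == endIndex { return endIndex }
        var result = startIndex
        for candidate in indices {
            if candidate > i { break }
            result = candidate
        }
        return result
    }
}

let s = "e\u{301}x"                                  // Characters: "é", "x"
let scalars = s.unicodeScalars
let mark = scalars.index(after: scalars.startIndex)  // inside "é"
let x = scalars.index(after: mark)                   // start of "x"

let roundedInString = s.roundedDownToCharacter(mark)
let roundedInSubstring = s[...].roundedDownToCharacter(x)
```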

Note: Character boundaries within a Substring do not necessarily match character boundaries within its base string — so we cannot use the String-based methods to infer character positions within substrings: we need dedicated methods for rounding down indices within a substring.
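
A small sketch of that mismatch, using a substring that begins in the middle of one of the base string's characters (the sample string and names are only for illustration):

```swift
let s = "e\u{301}x"                                  // Characters in s: "é", "x"
let scalars = s.unicodeScalars
let mark = scalars.index(after: scalars.startIndex)  // scalar index of U+0301

// Build a Substring that starts at the combining mark.
let sub = Substring(scalars[mark...])

// `mark` is in the middle of "é" in the base string, but it is the
// substring's startIndex, so it is a Character boundary there.
let boundaryInBase = s.indices.contains(mark)
let boundaryInSub = sub.indices.contains(mark)
```

So an answer computed against the base string (here, "not a boundary") would be wrong for the substring.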