End goal = O(1) time ← String.count

What are your thoughts on creating a new character type only for ASCII characters, and preventing code points such as "\u{E9}" and "\u{65}\u{301}" (instead requiring the user to type out é)?

Could this help ensure each character only increases .count by 1?

Could this help avoid traversing the entire string, so there is no longer a need to check for code points such as "\u{65}" and the actual value of each one?

I am not sure I follow. U+0065 is the ASCII character “e”. “é” on the other hand is not ASCII no matter how it is entered into the file.

\u{65} is resolved to e at compile time. That notation has nothing to do with why a String’s count is O(n).
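A quick way to confirm this at runtime (these asserts all pass; the escapes are just alternate spellings of the same text):

assert("\u{65}" == "e")
assert("\u{E9}" == "é")
assert("\u{65}".count == 1 && "\u{E9}".count == 1)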

In "café", the UTF‐8 bytes stored in memory are 63|61|66|65 CC 81, where I have marked the character boundaries with |. It is O(1) to figure out that there are 6 bytes, but it requires actually inspecting the contents of those bytes to figure out that they represent only 4 grapheme clusters. Inspecting every last byte like that is unavoidably O(n).

Several types are able to satisfy this, depending on what you plan on doing:

let string = "Hello, world! Καλημέρα, κόσμε!"

let graphemeClusters: [Character] = Array(string)
let unicodeScalars: [Unicode.Scalar] = Array(string.unicodeScalars)
let utf8Bytes: [UInt8] = Array(string.utf8)
let utf16Units: [UInt16] = Array(string.utf16)
let data: Data = Data(string.utf8)

Each traverses the entire string once during conversion to or from String, but is O(1) while doing its own operations.

By redefining what the element of the sequence is, each of the above types ensures that the addition of 1 element increases the count by precisely 1.

String’s definition of a Character cannot satisfy this even by restricting it to ASCII.

assert(("\r".count + "\n".count)       == 2)
assert(("\r"       + "\n"      ).count == 1)
6 Likes

Correction, "\u{E9}", not "\u{65}"

By preventing a user from using code points such as "\u{E9}", "\u{65}", etc. and requiring the user to type out the character it would resolve to instead, maybe this type of string could be treated more like an array of integers. In this type of string, “é” would represent only 1 grapheme cluster instead of 2. Every character would represent only 1 grapheme cluster.

The O(n) cost of converting a string into an array, and the superfluous memory allocated for the array, could possibly be eliminated in scenarios where the only reason for the conversion was to get .count in O(1) time.

Also, suppose an algorithm needs to compare or validate that the lengths of multiple strings match:

let string1 = "é"
let string2 = "e"

We want each string above to return a count of 1.
string1.utf8.count         // returns 2
string2.utf8.count         // returns 1
Data(string1.utf8).count   // returns 2
Data(string2.utf8).count   // returns 1

Suppose we have these strings instead:
let string1 = "é"
let string2 = "e"
let string3 = "😍"

We want each string above to return a count of 1.
string1.utf16.count   // returns 1
string2.utf16.count   // returns 1
string3.utf16.count   // returns 2

string.unicodeScalars.count and string.count return the same value until a character made of multiple code points is introduced, e.g. "\u{65}\u{301}" (which resolves to é). Regarding characters that are created with code points: I'm interested in seeing if it's possible to create a special string or character type that doesn't take time to decipher how many grapheme clusters a character represents.
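For example, the two counts diverge exactly when a single character is built from more than one scalar:

let decomposed = "\u{65}\u{301}"         // é spelled as e + combining acute accent
print(decomposed.unicodeScalars.count)   // 2
print(decomposed.count)                  // 1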

“e” and “◌́” are each one grapheme (an atomic unit of meaning); “é” is one integrated cluster of graphemes (roughly, a group of graphemes that work together in a way that breaks the notion of sequential order). “Grapheme cluster” is the more technical term for what Swift calls Character. Every Character represents exactly 1 grapheme cluster.

[Character] behaves as you seem to mean by “like an array of integers”. Both String and [Character] have elements that are Character instances, but the two types behave in fundamentally different ways.

[Character] keeps its elements separate from one another, effectively storing not just the text, but a set of arbitrary boundaries within it according to how it was constructed. It is capable of answering the question “How many characters did I insert?” in O(1), but it still requires O(n) to figure out “How many characters does the text I represent have now?” The tradeoff is that when loading a text file, it is O(n) to find the boundaries in the first place during initialization.

String on the other hand stores its contents as an uninterrupted stream. That means it is completely unable to answer the question “How many characters did I insert?”. But instead, it is capable of loading a text file without even bothering to figure out how many characters there are. It is O(n) to answer the question “How many characters does the text I represent have now?”, but for many uses of String, you never need to ask for that in the first place.

This illustrates the behavioural difference:

let characters: [Character] = ["ᄒ", "ᅡ", "ᆫ", "ᄀ", "ᅮ", "ᆨ", "ᄋ", "ᅥ"]
let alsoCharacters: [Character] = ["한", "국", "어"]
assert(String(characters) == String(alsoCharacters))

assert(characters.count == 8)
assert(String(characters).count == 3)
assert([Character](String(characters)).count == 3)

These escapes are only present in the source code. By the time they are compiled into a binary, they have already been replaced with exactly the same bytes as their direct forms. Escaped literals have nothing to do with runtime efficiency.

This is impossible. It takes time to decipher either:

  • how many clusters are in the stream (String), or
  • if the cluster delineation is still valid ([Character])

You could create a type that audits and records the delineation after each mutation. It could help in situations where you mutate fewer times than you count. But the initialize‐then‐count process would still be O(n), and the efficiency of all mutating operations would suffer greatly.
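A minimal sketch of such an auditing type (the names here are made up, not a standard library API; note the initializer is still O(n), and the naive re-count on mutation is exactly the cost being described):

// Hypothetical count-caching wrapper; not a standard library type.
struct CountedString {
    private(set) var storage: String
    private(set) var count: Int        // cached grapheme-cluster count

    init(_ string: String) {
        storage = string
        count = string.count           // O(n) audit, paid once up front
    }

    mutating func append(_ other: String) {
        storage += other
        count = storage.count          // naive re-audit; the cluster at the seam may have merged
    }
}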

8 Likes

Not quite. Like measuring any other array, [Character].count is O(1), and that tells you how many characters are represented. It is only O(n) to determine the storage requirements of the represented text, because each character may comprise an arbitrary number of UnicodeScalar values.

Note: if you try @SDGGiesbrecht's example, you might find the results confusing unless you use something that properly displays grapheme clusters composed of multiple unicode scalar values. For example, Terminal.app works but iTerm doesn't.

3 Likes

It is certainly possible to create a new, tree-based String-like type where str.count, str.unicodeScalars.count, str.utf8.count, and str.utf16.count are all constant-time operations.

Perhaps far more importantly, index(_:offsetBy:) would take logarithmic time, and (as long as we assume the grapheme cluster length has a constant upper bound) so would replaceSubrange.

The big trade-off is that I don’t think such a new type would be appropriate for most of String’s use cases (i.e., relatively short and/or mostly immutable pieces of contiguous text).

  1. Text storage would not be contiguous.

  2. I expect the constant factors involved to be somewhat high, making this type quite a bit slower than String for most basic string applications.

  3. There would be some memory overhead spent on maintaining the tree structure.

  4. As @SDGGiesbrecht said, mutations would need to be careful to correctly update character/scalar/utf8/utf16 counts in the affected subtrees. This can be quite costly in the worst case, although the common cases are simple.

That said, this would make an excellent type to serve as the backing store for, say, an editable text document.
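For illustration only, here is a heavily simplified sketch of the kind of node such a tree-based type might use (all names here are made up, not an existing API; a real rope would also rebalance and re-check grapheme boundaries where chunks meet):

// Hypothetical rope node. Each node caches its subtree's counts, so asking
// for them at the root is O(1), and index-by-offset can descend in O(log n).
indirect enum Rope {
    case leaf(chunk: String, characters: Int, utf8: Int)
    case node(left: Rope, right: Rope, characters: Int, utf8: Int)

    var characterCount: Int {
        switch self {
        case .leaf(_, let characters, _), .node(_, _, let characters, _):
            return characters
        }
    }

    var utf8Count: Int {
        switch self {
        case .leaf(_, _, let utf8), .node(_, _, _, let utf8):
            return utf8
        }
    }

    static func joining(_ left: Rope, _ right: Rope) -> Rope {
        // Parent counts are just the sums of the children's cached counts.
        .node(left: left, right: right,
              characters: left.characterCount + right.characterCount,
              utf8: left.utf8Count + right.utf8Count)
    }
}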

2 Likes

I am not sure whether we mean the same thing in different words or not.

To clarify from my end: [Character]’s count tells you how many entries it has, but not how many entries it is supposed to have. If you built it by doing [Character](someString), then you know it is still “in standard form”. But if you have built it by concatenation, the divisions between its elements may no longer make sense in terms of the actual text. Determining its “standard form” to properly answer that question is not O(1). That was the point of the Korean example. Its array has 8 elements, but its text has 3 characters. 8 was arrived at in O(1), but was not the correct answer.
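A smaller example of the same effect, building the array by concatenation rather than from a String:

let prefix: [Character] = ["e"]
let suffix: [Character] = ["\u{301}"]        // a lone combining acute accent
let assembled = prefix + suffix

assert(assembled.count == 2)                 // O(1), but it counts entries, not text
assert(String(assembled).count == 1)         // the represented text is just "é"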


Also, to everyone here, I would like to say that I am not claiming the language’s string utilities are complete. There may well be a use case that would benefit from a new type or feature. My responses thus far have just been trying to clear up apparent confusion about what several terms mean and how the String type and the compiler work at present.

At this point, if you (or anyone really) can give us a complete example of a use case you wish were faster, we would be able to properly analyze its efficiency. It would enable us to look for specific optimization opportunities, whether already existing or worth adding to the language.

5 Likes

Thanks for clearing that up. So, it'd have been more accurate to have said "I'm interested in seeing if it's possible to create a special property, string or character type that does not take time to decipher how many graphemes are in a grapheme cluster."

Deciphering how many graphemes are in each character (aka grapheme cluster) also takes time. It may help to eliminate or modify this process somehow (e.g. by using a constant value of 1 grapheme per grapheme cluster instead of finding out the actual number).

Consider the string below, which is created with the \u{E9} code point:

let str = "\u{E9}"

This new approach (i.e., not counting the number of graphemes in a grapheme cluster) would return 6 for str.customCount; it wouldn't be useful for any strings that are created with code points.
It also may not be useful for every alphabet.

I think there are a few misconceptions going on that I'd like to clarify.

  1. Misconception 1: "\u{65}" and "e" are somehow different.

    They are absolutely, exactly equivalent, just as "\n" is 1 character (U+000A) with no \ or n in it. They're just different notations. Once the compiler has parsed this and baked it into the final binary, it's just a stream of UTF-8 bytes that encode these characters, with no distinction as to how they were expressed in the source code.

  2. Misconception 2: That pasting a character literally into a string like "é" is different from expressing its constituent scalars like "\u{65}\u{301}"

    Again, it's only a difference in notation. And in fact, you can paste the same-looking glyph and have it be composed of different scalars. Consider é ("\u{00E9}") and é ("\u{0065}\u{0301}"). There might be some step on this forum that normalizes the two into the same representation, but they were different when I pasted them; the sketch after this list compares the two.

  3. Misconception 3: Expensive string traversals are only necessary because some characters like é can be expressed in one of multiple ways.

    The expensive string traversals are necessary because many of what end users would call characters (formally, extended grapheme clusters, EGCs) aren't single bytes. On a global scale, most human text is not ASCII: e.g. emoji, all non-Latin scripts (remember that there are more Chinese speakers than English speakers worldwide), and all the other countless math, science, etc. symbols that exist in Unicode.
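To make the comparison in misconception 2 concrete, here is a quick check (the escape spellings pin down exactly which scalars are meant, independent of any normalization the forum applies):

let precomposed = "\u{00E9}"          // é as the single scalar U+00E9
let decomposed  = "\u{0065}\u{0301}"  // é as e followed by a combining acute accent

assert(precomposed == decomposed)     // String comparison uses canonical equivalence
assert(precomposed.count == 1 && decomposed.count == 1)
assert(precomposed.unicodeScalars.count == 1)
assert(decomposed.unicodeScalars.count == 2)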

I should also note:

String has many (at least 3, IIRC) internal representations. All of them are exposed via the same API that makes it look like a Collection of Character (EGCs), but there are optimizations under the hood. There's a bit in a bitfield in the header of a String that indicates when the String is ASCII-only. For this special case, the "1 character = 1 byte" assumption holds true, and various fast paths are implemented for it (for count, for example).

10 Likes

Unfortunately the isASCII flag isn’t sufficient for a lot of the fast paths. We would need a “doesn’t contain a 2-byte newline” flag as well 🙁

I have a PR that I ended up not merging because scanning for \r\n before doing the math eliminated most of the wins.

6 Likes

This is really unfortunate. I would go as far as adding a special unsafe init that asserts the string is ASCII and contains no \r\n, to enable the full fast path on it.
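As a rough sketch of that idea (a hypothetical wrapper, not an existing API; it pays one up-front O(n) validation pass so that "1 byte = 1 Character" is known to hold afterwards):

// Hypothetical wrapper, not part of the standard library.
struct ASCIILineFeedOnlyString {
    let base: String

    init?(validating string: String) {
        // One up-front pass: every Character must be ASCII and must not be
        // the two-byte "\r\n" cluster.
        guard string.allSatisfy({ $0.isASCII && $0 != "\r\n" }) else { return nil }
        base = string
    }

    // After validation, the byte count and the Character count are the same
    // number (and utf8.count is O(1) for natively stored strings).
    var count: Int { base.utf8.count }
}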

1 Like

We do have a few spare flag bits on String that could probably be used for this. Doing the check once during initialization should be a lot cheaper than doing it every time. We'd just need to weigh the tradeoffs to make sure that's the best use we can put that flag bit to, and do the work to validate that we get the magnitude of speedup we're hoping for and that the operations that speed up are important enough.

7 Likes

We've long wanted an isSingleScalarGrapheme bit (CC @Alejandro): State of String: ABI, Performance, Ergonomics, and You! · GitHub, in which grapheme breaking trivially reduces to scalar decoding. We have the isASCII flag (i.e. trivially encoded) and an isNFC flag (i.e. canonical, which is only set for ASCII for perf reasons, but we could set it more aggressively).

3 Likes

No.

> "\r\n".count
Int = 1
> "\r\n".unicodeScalars.count
Int = 2

This sequence, while two unicode scalars, is a single grapheme cluster (in Swift, a single Character.)

What I have done in the past is to operate in the realm of representing fixed-size unicode scalars, and convert to/from that for Swift String types.

I use Unicode scalars, asciiValue, and various character representations often; subtracting 65 here... adding 32 there, etc. for conversions. Is that what you are referring to?

Rather than dealing with a collection of Characters, which are variable-size, I deal with, say, a [Unicode.Scalar] (which I believe wraps UInt32), or with [UInt8] if I know I'm dealing with the ASCII range or with UTF-8 data and can cope with random access landing within a scalar value.
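For example, under that approach (ordinary standard library calls; the costs are the up-front conversion and the loss of grapheme-cluster grouping):

let text = "Hello"
let scalars: [Unicode.Scalar] = Array(text.unicodeScalars)
let bytes: [UInt8] = Array(text.utf8)

// Fixed-width elements: O(1) count and O(1) random access after the O(n) conversion.
print(scalars.count)     // 5
print(scalars[1])        // e
print(bytes[0])          // 72, the UTF-8/ASCII byte for "H"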

Which is really why String isn't randomly accessible or fixed-length: it represents itself as (simplifying) a collection of printable glyphs and whitespace, not of the composing scalars or of UTF-8-formatted bytes.

1 Like

Well, yes, a Unicode scalar's value property is a UInt32, and those other similar properties and methods such as String.UnicodeScalarView.count are great, but they also can't always do what I need.

If you were willing to accept some false negatives, could you not always search for \n and assume all multi-line strings have complex graphemes due to the line break? So it would be more of an isKnownSingleScalarGrapheme bit.

Checking if a buffer contains an exact match of a single byte (\n) is very fast and can be SIMD-ised. I would imagine you could roll this into the same loop that checks ASCII-ness.
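Something like this sketch of the single-pass check (the real thing would live inside the standard library and could be vectorized; this is just the scalar loop to show the idea):

// Returns true only when every byte is ASCII and no line feed appears,
// which conservatively rules out any "\r\n" cluster as well.
func isKnownASCIIWithoutLineFeed(_ string: String) -> Bool {
    for byte in string.utf8 {
        if byte >= 0x80 || byte == 0x0A { return false }
    }
    return true
}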

Yeah, the issue was I was doing it every time rather than once on string creation. Caching the result like that should alleviate the specific issue that blocked my earlier attempt.

1 Like

I can't speak to what you need (e.g. the algorithmic need you have for constant-order-complexity string manipulations/introspections).

I can speak to not being able to work in terms of fixed-size Characters even if you restrict yourself to the ASCII range. Hence recommending working instead in terms of something else, like fixed-width scalars (code points) or sub-scalar values like UTF-8 bytes (code units), with the trade-offs that come from that.

Swift's choice of having String be based on Character is somewhat unique across programming languages, and comes from a place of maximal correctness and expressibility rather than efficiency. In other languages, you would almost certainly be working in terms of code points or code units - and would have both the efficiency gains and potential correctness issues that result from that choice.

Swift's String exposes views of both code points and code units, but does not necessarily make these first-class types. I believe this is because the internal StringObject bridges multiple representations (e.g. ASCII, UTF-8 and UTF-16). As a result, there is no single canonical internal representation that would be cheap to expose, hence exposing view sequences instead.