String O(1) access to Character

Hi, I wonder whether a Character from a string can be retrieved in constant time.

Currently, String does not conform to RandomAccessCollection, which means that accessing a character at an arbitrary position is O(k), where k is the offset of that position from the start of the string.

What if we use ArraySlice (which does conform to RandomAccessCollection) for this problem?

let source = "string"
let slice = ArraySlice(source)

(0..<slice.count).forEach { print(slice[$0]) }
  1. Is the time complexity reduced to O(n) for the whole loop, or does ArraySlice still call index(_:offsetBy:) on the underlying string, turning it into O(n^2)?

  2. Will Array("some string") share a buffer with the string passed in? (That would mean no new memory is allocated for the characters of "some string".)

  3. Are there any other options for constant-time access to the characters of a string?

Unfortunately, I don’t think that is possible. Characters are extended grapheme clusters, which can be composed of multiple Unicode scalars. While you might be able to take a few shortcuts, you will never be able to retrieve a single Character in O(1) time.
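A quick illustration of why that is (using the U.S. flag emoji, which is a single Character built from two Unicode scalars):

let flag = "🇺🇸"
print(flag.count)                 // 1 – one Character (grapheme cluster)
print(flag.unicodeScalars.count)  // 2 – two regional-indicator scalars
print(flag.utf8.count)            // 8 – variable-width storage in memory

// Because a Character can span any number of scalars (and bytes), finding
// the nth Character requires walking over everything that precedes it.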

I’m relatively new to Swift, so I may be wrong. That’s just my understanding based on the documentation.


Unlike Slice (which wraps an underlying Collection and provides a window into it), initializing an ArraySlice like this creates a new buffer. So this use of ArraySlice is effectively the same as creating an Array of the string's characters (i.e. using ArraySlice like this is pretty much the same as writing Array(source)[...]). This is how it is able to give a hard guarantee of O(1) access, unlike Slice, which "passes on" the complexity of the type that it wraps.

This means that the code does provide constant-time access to individual characters, but it does so by allocating a new buffer, not by sharing common storage. This buffer will have Character-sized elements. A Character is 16 bytes, which means almost all characters will use far more storage than they need. Conversely, the emoji that don't fit inline within that 16-byte storage will allocate additional memory (unlike a string, where all the characters are stored contiguously in the string's buffer). So this ends up being a fairly inefficient way of doing this.
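A rough way to see the size trade-off (the exact numbers are implementation details of the current 64-bit Swift runtime):

let source = "string"
let characters = Array(source)            // copies the Characters into a new buffer

print(characters.count)                   // 6
print(MemoryLayout<Character>.size)       // 16 – bytes per element of that buffer
print(MemoryLayout<String.Index>.size)    // 8 – bytes per element of Array(source.indices)
print(source.utf8.count)                  // 6 – the String itself stores these ASCII characters in 1 byte each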

Another option, if you do want to make this space tradeoff to gain random access, would be to create an array of the indices, i.e. Array(source.indices). This is still going to use more memory than you need, but 8 rather than 16 bytes per element, and it won't ever allocate additional storage. You can then access the individual elements at random via source[idxs[4]].
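A minimal sketch of that approach (idxs is just an illustrative name):

let source = "swift strings"
let idxs = Array(source.indices)   // one String.Index per Character, 8 bytes each

print(source[idxs[4]])             // "t" – O(1) once the index array has been built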

But the next question is, why do you need random access to graphemes? There are some algorithms you might want it for, and so are willing to pay the cost of allocating memory to ensure it, but in many cases, there are other ways to achieve what you want without random access, by writing code that moves progressively through the string from start to end, using index(_:offsetBy:) and similar indexing API. These are fiddlier to use, but do mean that what you are trading off is programming model simplicity vs space, rather than time vs space. Hopefully there are further API that can be explored that make common tasks simpler, without compromising on efficiency.
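A rough sketch of that progressive style (illustrative only): track an offset while advancing one index at a time, instead of recomputing positions from the start.

let text = "abracadabra"
var offset = 0
var i = text.startIndex
while i != text.endIndex {
    if text[i] == "a" {
        print("'a' at offset \(offset)")
    }
    i = text.index(after: i)   // advance one Character at a time
    offset += 1
}

// A single index(_:offsetBy:limitedBy:) jump is fine when done occasionally, not per element:
if let j = text.index(text.startIndex, offsetBy: 3, limitedBy: text.endIndex), j != text.endIndex {
    print(text[j])             // "a"
}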


Thank you for the very detailed answer, Ben!

One of the reasons why random access is needed is the implementation of searching algorithms. KMP, the Z-algorithm, and many others involve getting characters at calculated Int indices.

Unfortunately, it’s not easy to implement them by moving progressively through the string from start to end, especially if you have to work with strings during an interview.

So, I guess Array(string.indices) can be a really good trade-off in a time-bound context.

This is still going to be using more memory than you need, but 8 rather than 16 bytes, and won't ever allocate additional storage.

Does "won't ever allocate additional storage" mean that the size of each item will remain a constant 8 bytes?

Well... sometimes. But often you can cut back what the algorithm needs to something narrower than full random access to the entire search space. For example, many only really need to move backwards to a previously visited index, never further back than the length of the pattern (i.e. if you fail matching at the nth element of the pattern, you only need to go back m elements and resume matching from there). You can achieve this with a small stack of indices you push onto as you match, then discard.
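A minimal sketch of that idea (names are illustrative, not a library API): a naive substring search that, on a mismatch, resumes from one position after the start of the current candidate match, so it never needs to look back further than the pattern length.

func firstMatch(of pattern: String, in text: String) -> String.Index? {
    let pat = Array(pattern)             // the pattern is short, so copying it is cheap
    guard !pat.isEmpty else { return text.startIndex }
    var start = text.startIndex          // where the current candidate match began
    var i = start                        // current position in the text
    var k = 0                            // current position in the pattern
    while i != text.endIndex {
        if text[i] == pat[k] {
            i = text.index(after: i)
            k += 1
            if k == pat.count { return start }
        } else {
            // Mismatch: resume one Character after `start`. Remembering `start`
            // (or, for fancier algorithms, a small stack of recently visited
            // indices) is all the backwards movement that is needed.
            start = text.index(after: start)
            i = start
            k = 0
        }
    }
    return nil
}

print(firstMatch(of: "cad", in: "abracadabra") != nil)   // true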
