Indices for chains of Strings?

I am developing a rope data structure[1] in Swift. A rope is a balanced
binary tree representation for a string that makes operations like
"insert," "delete," and "substring" scale to very large strings. Each
internal node represents the concatenation of the strings in its
subnodes. In my implementation, every leaf is a String.

In a rope, integer indices are used to address individual characters
and ranges of characters. Internal nodes are labeled with the length
of their left subtree, and methods like character(at:) or substring(_)
steer their search by comparing the length at each node with the target
index.

I'm trying to come up with Swift-y integer indices for my rope
structure. My first impulse was to use String.IndexDistance as my rope
index. However, to index a character within a leaf, I have to convert
from String.IndexDistance to String.Index using index(_:offsetBy:).
That has O(k) complexity, which is not suitable for my purposes.

To avoid the O(k) complexity, it looks to me like I need to use an index
on a view that conforms to RandomAccessCollection. In Swift 4.2 and
earlier, the sole random-access view is .unicodeScalars. (I am frankly
surprised that it is not .utf16.) I have surmised that the preferred,
random-access view in Swift 5 may be .utf8. Maybe someone can tell me
if I am on the right track here?

Dave

[1] Hans-J. Boehm, Russ Atkinson, and Michael Plass. "Ropes: an
Alternative to Strings." Software—Practice and Experience,
vol. 25, no. 12, December 1995, pp. 1315-1330.

I would drop the idea of integers and create my own index type built on String’s natural index:

extension Rope {
    struct Index {
        let ropeSegment: SomeIndex // Points at a particular string.
        let stringIndex: String.Index // The index in that string.
    }
}

Then fill in the rest of the methods necessary for protocol conformances.

3 Likes

In my experience, integer indices are usually not the right choice for tree-based collections (or, in general, any collection type). Consider using a custom index type that contains some representation of the path inside the tree, along with an String.Index. (Add weak references to the root node and the closest node for validation and performance, respectively, and an integer offset to simplify your Comparable implementation.)

You can still provide integer offset-based access as an additional subscript operation.

If you choose to represent pieces with Swift-native String values, then the UTF-8 view will give you random access. However, you’ll lose direct mutation operations in exchange, which is probably not the right trade-off.

For a rope that follows String’s equality semantics, it will be interesting to balance the complexities of grapheme clusters with performance expectations. I’d probably want to ensure characters are never split between two Strings inside the rope. (This too is a nontrivial task when you need to support arbitrary insertions.)

3 Likes

Thanks, Karoy and Jeremy. I needed an outside perspective to help reset
my thinking.

I will use the rope as storage for an NSTextStorage, so random access to
integer ranges will be necessary. An NSTextStorage is indexed in terms
of UTF-16 code units, so that would be the integer index to start with.

I have doubts about constructing the NSTextStorage's string property
from the rope's leaves, since TextKit takes pains to avoid a copy and
expects for access to be O(1).

I am leaning toward using the usual idiom for NSTextStorage implementation,

class MyTextStorage : NSTextStorage {
	private var storage = NSTextStorage()

	override var string: String {
		return storage.string
	}
	/* et cetera */
}

and applying changes both to storage and to my rope. The leaves in
the rope need not store actual characters, the lengths of the leaves
will suffice.

(You may ask, what was the rope for if it doesn't need to store the
characters? I am augmenting my rope to record the presence of nested
text containers and the presence of one or more insertion points between
adjacent characters.)

Dave