String character offset <-> byte offset

I'm currently using this code to calculate the "char to byte offset" table, but it's not great because it iterates over the String's Characters:

Not-so-great attempt:

struct CharToByteOffsetTable {  // hypothetical wrapper type; only the members were shown
    let offsets: [Int]          // offsets[i] = UTF-8 byte offset of character i
    let byteLen: Int
    let charLen: Int

    init(_ string: String) {
        var offset = 0
        var offsets: [Int] = [0]
        for c in string {       // walking Characters (grapheme clusters) is the slow part
            offset += c.utf8.count
            offsets.append(offset)
        }
        self.offsets = offsets
        self.byteLen = offset
        self.charLen = string.count
    }
}

To be compatible with NSAttributedString I have to drop down to the older NSRange coordinates, so somehow I need to go through (pseudocode):

String -> NSString (or utf16?) -> iterate through those chars -> get the utf8 subsequence for each utf16 character -> use those utf8 subsequence counts to calculate the "char" to byte table. Not sure I am on the right track here.

The higher-level overview is to have a round-trip conversion of NSRange-expressed coordinates (applicable to NSAttributedString) to byte offsets and back, to be able to apply those to an NSAttributedString copy on the receiver side.

NSAttributedString ->
serialise attributes to a custom structure ->
send those structs (*) to the receiving side along with the plain string ->
recreate NSAttributedString on the recipient side, should match the original

(*) the additional restriction on the custom struct representation is that offsets are expressed as byte (UTF8) offsets.

That sounds like a recipe for trouble, unless details you did not provide constrain it to safe territory. See this post.

Instead, I suggest you dismantle the NSAttributedString using attributes(_:at:effectiveRange:) or one of its siblings, paying special attention to the effectiveRange, and rearranging it into an array such as [(String, [NSAttributedString.Key: Any])] or a simplification thereof. That array can be safely coded, stored or transmitted. Since each slice of the string has become a separate instance, their lengths are irrelevant, and their contents cannot jump across the boundaries.

To assemble the NSAttributedString again on the other side, turn each slice into its own attributed string and then concatenate them.
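
A minimal sketch of that round trip, assuming nothing about the concrete attribute set (the dismantle/assemble names are mine; enumerateAttributes(in:options:using:) is one of the siblings mentioned above):

import Foundation

// Dismantle: each maximal attribute run becomes its own (text, attributes) pair.
func dismantle(_ source: NSAttributedString) -> [(String, [NSAttributedString.Key: Any])] {
    var slices: [(String, [NSAttributedString.Key: Any])] = []
    let fullRange = NSRange(location: 0, length: source.length)
    source.enumerateAttributes(in: fullRange, options: []) { attributes, effectiveRange, _ in
        let text = (source.string as NSString).substring(with: effectiveRange)
        slices.append((text, attributes))
    }
    return slices
}

// Reassemble: each slice becomes its own attributed string; then concatenate.
func assemble(_ slices: [(String, [NSAttributedString.Key: Any])]) -> NSAttributedString {
    let result = NSMutableAttributedString()
    for (text, attributes) in slices {
        result.append(NSAttributedString(string: text, attributes: attributes))
    }
    return result
}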


My Windows and Android colleagues would not be happy with that representation.

We settled on the UTF8 representation for the string itself. As for the offsets/ranges of style information, we settled on "offset in bytes" into that UTF8 representation; anything else was deemed too fragile to work across platforms (even "đŸ‘Ș".count and ("đŸ‘Ș" as NSString).length disagree, and that's effectively two different platforms already).
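
To make the disagreement concrete (a quick playground check; đŸ‘Ș is a single scalar, U+1F46A):

import Foundation

let family = "đŸ‘Ș"
family.count               // 1  Character (grapheme cluster)
(family as NSString).length // 2  UTF-16 code units (a surrogate pair)
family.utf8.count          // 4  UTF-8 bytes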

Then you can store the string as Data(string.utf8). Make sure you do not encode it as a String, or it could change during storage or transmission (which is a problem on all platforms). If it is stored in JSON, for example, it had better end up as something like Base64 and not just a string literal. As soon as a string leaves your program’s memory, you have lost control of its byte and scalar representations; all you can reasonably expect is that it will remain semantically equivalent (NFKD/NFKC).
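
A sketch of what that could look like, with a hypothetical payload type (JSONEncoder already encodes Data as Base64 by default):

import Foundation

// Hypothetical wire format: exact UTF-8 bytes plus byte-offset style ranges.
struct StyledTextPayload: Codable {
    var text: Data               // Data is Base64-encoded by JSONEncoder by default
    var boldByteRanges: [[Int]]  // style ranges expressed in UTF-8 byte offsets
}

let payload = StyledTextPayload(text: Data("café".utf8), boldByteRanges: [[0, 5]])
let json = try! JSONEncoder().encode(payload)  // the bytes survive transport untouched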

You can convert from an NSRange Int to a UTF‐8 byte offset by getting the index string.utf16.index(string.utf16.startIndex, offsetBy: intFromNSRange) and then taking string.utf8.distance(from: string.utf8.startIndex, to: utf16Index).
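
In code, that suggestion amounts to something like this (the helper name is mine):

// Converts one bound of an NSRange (a UTF-16 offset) to a UTF-8 byte offset.
// String.Index is shared across views, so the UTF-16 index works in the UTF-8 view.
func utf8Offset(ofUTF16Offset utf16Offset: Int, in string: String) -> Int {
    let utf16Index = string.utf16.index(string.utf16.startIndex, offsetBy: utf16Offset)
    return string.utf8.distance(from: string.utf8.startIndex, to: utf16Index)
}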


Hmm, these strings travel "as strings" to the server, get stored there, and are returned back. They can be used as strings on the server side (e.g. for searching APIs). Between client and server they travel in XML and/or JSON form. The strings are typical user content (names, messages, links, etc.) as in a typical messaging app, so things like emoji or other funny Unicode characters will be there.

Do you mean that the server (or a library on another client's side) can change from, say:

"LATIN CAPITAL LETTER A WITH RING ABOVE"
U+00C5, UTF8: C385

to

"LATIN CAPITAL LETTER A" + "COMBINING RING ABOVE"
U+0041 + U+030A, UTF8: 41CC8A

or change from "0D0A" to "0A" or vice versa?

I'll verify how those cases behave. Do you have other cases in mind?

Yes, that is exactly what I mean.

For someone just using it as text, the particular representation does not matter, and that is why anyone might change the representation at will. It is only because you want to externally describe positions in the string that suddenly the particular representation becomes vitally important. But no one in the middle knows that. And since you often cannot be certain that no one in the middle is using a newer version of Unicode than you are, you cannot be sure that your own normalization on the receiving end will undo everything the middlemen may have changed.

“Encrypting” it so that all middlemen only see it as unintelligible binary data instead of as text prevents them from altering the representation.

I see what you mean. Yet, in our system we can consider that to be a bug. I.e. any component on the way from the client app -> server -> another client app that changes the UTF8 representation "at will" is either considered to have erroneous behavior, or it does so only for its own purposes that do not require styling information (e.g. to implement plain-text searching on the server side), or, if that's what the component wants to do, it must also make the corresponding change to the external styling information, so the two representations always agree.

I haven't seen such a bug in our system yet during the minimal testing I've done so far, but I do not preclude the possibility of such bugs happening in the future, e.g. after server or client components upgrade to another version of Unicode.

That's a no-go if the server needs to know the plain-text representation (e.g. to support search).

I do not consider your software that is running on the server to be a middleman, because I presume it is under your control and can be taught to understand it just fine. By middlemen, I only mean things outside your control that do the passing on or persisting for you.

For example, let’s say you have a JSON file that reads ["à", ["0–2", "bold"]] and is marked with the MIME‐type text, where the 0–2 is a scalar range describing that the whole string (a and ◌̀) should be bold. Your application saves it to the disk, planning to read it back at some later date. But before that day comes, the user opens it in a text editor by accident. The editor normalizes what it perceives to be a text file to NFC upon opening and saves it that way when the user closes it. Now when your application loads the file and tries to apply the 0–2 range, it crashes because the 2 reaches past the end of the string, which is now only 1 scalar in length. The point is that the middleman (in this case the text editor) did not know that it must adjust the external styling information—and it could not know, because it did not know what that information even meant. If instead you mark the file as the MIME‐type octet stream, then the text editor will refuse to open it and nothing else is going to shuffle the bytes either (except malicious actors, but that is a separate issue).

If you really are sure that your components are a closed system and nothing can pull the rug out from underneath, then go ahead and do it that way if it is easiest. It just should not be done blindly without careful consideration. I cannot make that judgement for you because I cannot see what it is you are working on.

The general advice for anyone else who might be reading and unsure about what to do, is simply that the whole problem can be avoided by storing the information in‐line. That way the markers automatically adjust no matter how the representation changes. With description formats like ["[à(bold)]..."] or [["à", ["bold"]], ["...", []]] you can safely not bother to even think about the representation.
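
For instance, a hypothetical Codable version of that second format:

// Each run carries its own text, so there are no offsets to go stale
// when the underlying representation changes.
struct Run: Codable {
    var text: String
    var styles: [String]  // e.g. ["bold"]
}

let message: [Run] = [
    Run(text: "à", styles: ["bold"]),
    Run(text: "...", styles: []),
]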


HTML with UTF-8 text is what I've used before. It's cross-platform, and everybody knows about it. I think RTF (Rich Text Format) has also been used for this, although I don't think it's popular anymore.

I'll admit that in my use cases the server generated the HTML and sent it to the clients for display. Users couldn't generate styled text like that on the client, but they could generate UTF-8 text with emoji and whatever else they wanted, which was sent to the server for storage.

Understood.

It can crash, but that would be a bug in my book; I'd prefer the code not to perform the operation in question rather than crash.

By the same token, the user can delete the closing style tag by accident... I'd say we do not support that operation; ideally we do not crash either when the user does that, but we won't make any guarantees.

That's possible, but it can also interfere with the search functionality of our server, which won't be able to find exact matches for the string "the bold style" if it is stored as "the ["bold", ["bold"]] style". Unless we teach that server component to strip those tags when searching. Or unless we send/store two representations to the server and back, one with plain text and another with styled text.

This is the current iteration of my code that builds a helper table that converts between byte offsets and character indices.
import Foundation

struct StringByteOffsetTable {

    let string: String
    var indices: [String.Index] = []            // one String.Index per UTF-8 byte offset
    var byteOffsets: [String.Index: Int] = [:]  // UTF-8 byte offset per character index

    init(_ string: String) {
        self.string = string

        var byteOffset = 0

        for index in string.indices {
            assert(index != string.endIndex)
            let char = string[index]
            let byteCount = char.utf8.count
            for _ in 0 ..< byteCount {  // every byte inside the character maps back to its start
                indices.append(index)
            }
            byteOffsets[index] = byteOffset
            byteOffset += byteCount
        }
        indices.append(string.endIndex)
        byteOffsets[string.endIndex] = byteOffset
    }

    func stringIndex(forByteOffset byteOffset: Int) -> String.Index {
        indices[byteOffset]
    }
    func range(forByteRange range: NSRange) -> Range<String.Index> {
        stringIndex(forByteOffset: range.lowerBound) ..< stringIndex(forByteOffset: range.upperBound)
    }

    func nsRange(forByteRange range: NSRange) -> NSRange {
        NSRange(self.range(forByteRange: range), in: string)
    }

    func byteOffset(for index: String.Index) -> Int {
        byteOffsets[index]!
    }
    func byteRange(for range: Range<String.Index>) -> NSRange {
        let lower = byteOffset(for: range.lowerBound)
        let upper = byteOffset(for: range.upperBound)
        return NSRange(location: lower, length: upper - lower)
    }
}

Not particularly fast (O(string length)), and it also requires a lot of memory, O(UTF8 count) + O(character count): one string index per byte offset, and one byte offset per character index.
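
Usage looks something like this (byte offsets in, UTF-16-based NSRange out):

let table = StringByteOffsetTable("ABCđŸ‘ȘZ")
let emojiBytes = NSRange(location: 3, length: 4)  // đŸ‘Ș occupies UTF-8 bytes 3..<7
table.nsRange(forByteRange: emojiBytes)           // {3, 2}: one surrogate pair in UTF-16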

While working on it, I found something unexpected:
import UIKit

// Round-trips the string through a UITextView; the copy comes back from
// Objective-C (NSString-backed) even though it compares equal to the original.
func make_a_funny_copy(_ string1: String) -> String {
    let textView = UITextView()
    textView.text = string1
    let string2 = textView.text!
    assert(string2 == string1)
    return string2
}

func test(_ string1: String, len: Int?) -> Bool {
    
    let len = len ?? (string1 as NSString).length
    
    let string2 = make_a_funny_copy(string1)
    assert(string2 == string1)

    let nsRange1 = NSRange(location: 0, length: len)
    let nsRange2 = NSRange(location: 0, length: len)
    assert(nsRange1 == nsRange2)
    
    let range1 = Range(nsRange1, in: string1)!
    let range2 = Range(nsRange2, in: string2)!
    
    let nsRange11 = NSRange(range1, in: string1)
    let nsRange22 = NSRange(range2, in: string2)
    assert(nsRange11 == nsRange22)
    
    return range1 == range2
}

func foo() {
    let s1 = "Hello, World!"
    assert(test(s1, len: 0)) // ok
    assert(test(s1, len: 1)) // ok
    assert(test(s1, len: nil)) // ok
    
    let s2 = "ABCđŸ‘ȘZ"
    assert(test(s2, len: 0)) // ok
    assert(test(s2, len: 1)) // ok
    assert(test(s2, len: nil)) // CRASHES
    
    print("done")
}

So we have two "equal" strings to begin with; we take one NSRange and make two Range<String.Index> values in those two strings, and the resulting ranges are sometimes equal and sometimes not, depending on the string. Not sure if this is a bug or expected behaviour, and whether it's just luck that the ranges are sometimes equal - in which case it would make sense for such comparisons of ranges made from "different" strings (even if the strings are == each other) to always fail, to make the behaviour more consistent and reliable.

If you really do need the entire table (otherwise it might be faster just to convert the few individual indices you need with the methods I mentioned earlier), then I suspect iterating the .utf8 would be faster than iterating the Characters.

init(_ string: String) {
  var fastString = string
  fastString.makeContiguousUTF8()  // Might speed up the rest.
  self.string = fastString

  var byteOffset = 0
  var characterStart = fastString.startIndex
  func process(byteBoundary: String.UTF8View.Index) {
    if let characterBoundary = byteBoundary.samePosition(in: fastString) {
      byteOffsets[characterBoundary] = byteOffset
      characterStart = characterBoundary
    }
    indices.append(characterStart)
  }
  for byteBoundary in fastString.utf8.indices {
    defer { byteOffset += 1 }
    process(byteBoundary: byteBoundary)
  }
  process(byteBoundary: fastString.utf8.endIndex)
}

In case it matters (in the current context probably not), your NSRange conversions ignore the possibility of NSNotFound.

It is never valid to use a String.Index on any string instance besides the one you got it from. Calling == across indices from two different string instances violates that. In this case, the string you created from source (Swift) has its memory in UTF‐8 under the hood, but the string you got from UIKit (Objective C) has its memory in UTF‐16 under the hood. Thus when you call ==, under the hood you are comparing apples to oranges, and you will not get a meaningful answer. (The String.Index type does not store the string it came from—otherwise the index would use more memory than the string itself—so the two indices have no way to work out what each would mean in the other’s format even if they wanted to.)

That would certainly be nice. I do not know if it is actually feasible or not at the engineering level. You can report it as a bug if you would like and someone else will be able to decide.

Thank you; based on that code sample, here's my latest version.
import Foundation

extension String {
    /// Converts a UTF-8 byte range (carried in an NSRange) to a UTF-16-based NSRange.
    func nsRange(fromByteRange byteRange: NSRange) -> NSRange {
        let utf8 = self.utf8
        let utf16 = self.utf16

        let a = utf8.index(utf8.startIndex, offsetBy: byteRange.location)
        let b = utf8.index(a, offsetBy: byteRange.length)
        let loc = utf16.distance(from: utf16.startIndex, to: a)
        let len = utf16.distance(from: a, to: b)
        return NSRange(location: loc, length: len)
    }

    /// Converts a UTF-16-based NSRange to a UTF-8 byte range (carried in an NSRange).
    func byteRange(fromNSRange nsRange: NSRange) -> NSRange {
        let utf8 = self.utf8
        let utf16 = self.utf16

        let a = utf16.index(utf16.startIndex, offsetBy: nsRange.location)
        let b = utf16.index(a, offsetBy: nsRange.length)
        let loc = utf8.distance(from: utf8.startIndex, to: a)
        let len = utf8.distance(from: a, to: b)
        return NSRange(location: loc, length: len)
    }
}
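
A quick round-trip sanity check on a string whose UTF-8 and UTF-16 lengths differ:

let s = "ABCđŸ‘ȘZ"
let nsRange = NSRange(location: 3, length: 2)      // đŸ‘Ș in UTF-16 units
let byteRange = s.byteRange(fromNSRange: nsRange)  // {3, 4} in UTF-8 bytes
s.nsRange(fromByteRange: byteRange) == nsRange     // true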

Do you know if string.utf8 / string.utf16 cache their results, so that subsequent calls are fast (provided the string is not changed)?

I am pretty sure both are lazy. let utf8 = self.utf8 does nothing but create a shell pointing at the string’s memory, which then treats the elements and indexing differently when you interact with it.

Some caching occurs both inside the string instance and inside the index instance to make repeated use faster, but its precise behaviour is undocumented and subject to change.

What you have now looks perfect to me (as long as you are certain the function will never be fed an invalid range).
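
If that certainty ever wavers, a bounds-checked variant is possible with index(_:offsetBy:limitedBy:), which returns nil instead of trapping. A sketch (my own addition, also covering the NSNotFound case mentioned earlier):

import Foundation

extension String {
    // Defensive variant: returns nil for NSNotFound or out-of-bounds input.
    func byteRangeChecked(fromNSRange nsRange: NSRange) -> NSRange? {
        guard nsRange.location != NSNotFound else { return nil }
        let utf16 = self.utf16
        let utf8 = self.utf8
        guard let a = utf16.index(utf16.startIndex, offsetBy: nsRange.location,
                                  limitedBy: utf16.endIndex),
              let b = utf16.index(a, offsetBy: nsRange.length,
                                  limitedBy: utf16.endIndex)
        else { return nil }
        return NSRange(location: utf8.distance(from: utf8.startIndex, to: a),
                       length: utf8.distance(from: a, to: b))
    }
}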

I have a PR in progress that vectorizes looking up UTF16 offsets in UTF8 bytes for a nice perf win. No promises of course (still have some correctness issues to figure out), but it could help here if/when it happens 🙂

Very, very tangential, but what if you sent text styled with ANSI color codes? Is there any way for normalization to break those?

You mean like "\u{1B}[31mThis is red./Î‘Ï…Ï„ÎżÌ ΔÎč́ΜαÎč ÎșÎżÌÎșÎșÎčÎœÎż.\u{1B}[0m"? That is another example of an in‐line format. If the ÎżÌ expands or contracts, the second escape might be pushed or pulled to a different numerical distance from the beginning of the string. But since it will still be right after the ., it will not matter. The intent will be preserved no matter what happens.


Landed a first pass at this, and have a followup that's worth another 20% or so. Exact performance depends on the contents of the string, but benchmark results suggest it speeds up finding UTF16 offsets in Swift Strings by 2x or more.


I have a helper structure for working with string indices:

struct StringIndex {
    let string: String
    let index: String.Index
}

What's the best way to assert that two arbitrary indices belong to "the same" string? I presume "a.string == b.string" is not the right check, as two different strings might compare equal, e.g.:

U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE)  ==
U+0041 + U+030A (LATIN CAPITAL LETTER A + COMBINING RING ABOVE)
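
For example, these two compare equal even though their bytes differ:

let precomposed = "\u{00C5}"          // Å as a single scalar
let decomposed = "\u{0041}\u{030A}"   // A + combining ring above
precomposed == decomposed             // true: canonical equivalence
Array(precomposed.utf8) == Array(decomposed.utf8)  // false: different bytes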

Would this check on the utf8 representations be correct for the assert?

a.string.utf8CString == b.string.utf8CString

Note: in my code all indices are created with "string.startIndex", "string.index(after:)", etc., and I am not mixing and matching indices created via different views... (it would also be cool if I could assert that fact as well, if that's possible / necessary).

I do not know if that is possible. @Michael_Ilseman @lorentey