Detect extended grapheme cluster boundary

Sammcb · April 9, 2021, 3:45am

Hi everyone,

I am currently working on a Swift program that will stream data from a large text file. I was wondering if Swift has a publicly accessible function for detecting if a character is an extended grapheme cluster boundary. I found this StringGraphemeBreaking.swift file in the stdlib that appears to be able to detect boundary characters. Is this publicly accessible? I can't seem to find any references to it in the developer documentation.

Thanks in advance for any help!

SDGGiesbrecht · April 9, 2021, 3:53am

A Character is an extended grapheme cluster, so please clarify what you mean. Are you trying to find out whether an index is at the boundary between two grapheme clusters? Or do you want to know whether a scalar has some particular property with regard to grapheme determination?

Sammcb · April 9, 2021, 4:01am

Sorry, I meant a 1-byte chunk. So for example, if a file being read only contains a string consisting of the , then if the stream only read the first 8 bytes, it would believe the file contained a . I was thinking of using the grapheme boundary detection to determine whether my algorithm should keep reading bytes until it hit a grapheme boundary and recognized the string as the .

SDGGiesbrecht · April 9, 2021, 12:17pm

It is not possible to determine whether a scalar is the end of a grapheme cluster without knowledge of the following scalar. For example, if the known portion of the string is “fac”, I cannot know whether it will continue with “ts” to make “facts”, in which case there is a boundary between the third and fourth scalars, or whether it will continue with “◌̧ade” to make “façade”, in which case the third and fourth scalars form a cluster.
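A minimal sketch of that example (the variable names are chosen for illustration only): the same three known scalars end up in a different number of clusters depending on what follows them.

```swift
// The portion of the text that has been read so far.
let known = "fac"

// Continuation 1: plain letters, so "c" remains a cluster on its own.
let facts = known + "ts"

// Continuation 2: U+0327 COMBINING CEDILLA attaches to the preceding "c".
let facade = known + "\u{327}ade"

print(facts.count, facts.unicodeScalars.count)   // 5 5
print(facade.count, facade.unicodeScalars.count) // 6 7

// The third cluster is "c" plus the cedilla, canonically equal to U+00E7 (ç).
let thirdCluster = Array(facade)[2]
print(thirdCluster == "\u{E7}") // true
```

So the cluster that ends `known` can only be emitted safely once the next scalar has been seen.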

If you want to ensure you do not cut clusters in half, you will have to buffer the bytes into a string, and then pull all but the last cluster, which might be incomplete. Once you hit the end of the file, you can safely consume whatever remains, knowing the final cluster does not have a pending continuation.

For example:

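import Foundation // Needed for String(bytes:encoding:).

// `stream` and `nextEightBytes()` below stand in for whatever byte source is actually used.
var byteBuffer: [UInt8] = []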
var clusterBuffer: String = ""

// Pull the next chunk from the stream.
while let chunk: [UInt8] = stream.nextEightBytes() {
  
  // Add the chunk to the buffer.
  byteBuffer.append(contentsOf: chunk)
  
  if let stringChunk = String(bytes: byteBuffer, encoding: .utf8) {
    // This “if” will have failed if we were in the middle of a scalar.
    // In that case the buffer will grow, and with it the next attempted string chunk.
    
    // Clear the buffer of successfully extracted scalars.
    byteBuffer = []
    
    // Move the decoded string portion into the cluster buffer.
    clusterBuffer.append(contentsOf: stringChunk)
    
    if !clusterBuffer.isEmpty {
      // Extract all but the last (possibly incomplete) cluster.
      let beforeLast = clusterBuffer.index(before: clusterBuffer.endIndex)
      let completeClustersOnly = String(clusterBuffer[..<beforeLast])

      // Clear the buffer of the successfully extracted clusters.
      clusterBuffer.removeSubrange(..<beforeLast)
      
      // Do something with our result.
      print("Chunk adjusted to cluster boundaries: “\(completeClustersOnly)”")
    }
  }
}

// This is the final cluster left over in the buffer.
print("Chunk adjusted to cluster boundaries: “\(clusterBuffer)”")
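For experimenting without a real stream, the same buffering idea can be wrapped in a function that takes the chunks as an array. This is only a sketch under assumptions: `clusterAlignedPieces(from:)` is a hypothetical name, the 3-byte chunk size is arbitrary, and the input is assumed to be valid UTF-8 overall (bytes that never decode are silently left in the byte buffer).

```swift
import Foundation

/// Hypothetical helper: regroups arbitrary byte chunks so that every
/// returned piece ends on a grapheme cluster boundary.
func clusterAlignedPieces(from chunks: [[UInt8]]) -> [String] {
    var byteBuffer: [UInt8] = []
    var clusterBuffer = ""
    var pieces: [String] = []

    for chunk in chunks {
        byteBuffer.append(contentsOf: chunk)

        // Decoding fails while the buffer ends in the middle of a scalar;
        // in that case keep accumulating bytes.
        guard let decoded = String(bytes: byteBuffer, encoding: .utf8) else { continue }
        byteBuffer.removeAll()
        clusterBuffer.append(decoded)
        guard !clusterBuffer.isEmpty else { continue }

        // Hold back the last cluster, which may still be extended.
        let beforeLast = clusterBuffer.index(before: clusterBuffer.endIndex)
        let complete = String(clusterBuffer[..<beforeLast])
        clusterBuffer.removeSubrange(..<beforeLast)
        if !complete.isEmpty { pieces.append(complete) }
    }

    // End of input: the held-back cluster can no longer be continued.
    if !clusterBuffer.isEmpty { pieces.append(clusterBuffer) }
    return pieces
}

// Feed a test string through in 3-byte chunks, which cut both scalars and clusters.
let text = "e\u{301}\u{1F1ED}\u{1F1FA} ok" // é (decomposed), a flag, then " ok"
let bytes = Array(text.utf8)
let chunks = stride(from: 0, to: bytes.count, by: 3).map {
    Array(bytes[$0..<min($0 + 3, bytes.count)])
}
let pieces = clusterAlignedPieces(from: chunks)
print(pieces)
```

Joining the pieces gives back the original text, and the cluster counts of the pieces add up to the cluster count of the whole string, which is the property the buffering is meant to guarantee.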

Sammcb · April 9, 2021, 4:32pm

If I do know the following scalar, does Swift provide a way to quickly check whether the original scalar is a grapheme cluster boundary?

I know there are Unicode.Scalar.Properties I can access to determine if a scalar could be extended and such. I guess I was wondering what the differences between isGraphemeExtend, isExtender, and isEmojiModifier are and if Swift provides a way to use the Unicode.Scalar.Properties to determine if a scalar is a grapheme cluster boundary knowing the next scalar?

SDGGiesbrecht · April 9, 2021, 4:50pm

A scalar is not itself a boundary, but a piece of a cluster. It may be the last piece before a boundary or the first piece after. To determine if there is a boundary between two scalars, put them together and check whether the result has one cluster or two:†

let possibleEnd: Unicode.Scalar = "e"
let possibleStart: Unicode.Scalar = "\u{301}" // ◌́

let boundaryBetween = "\(possibleEnd)\(possibleStart)".count > 1
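If the check is needed in several places, it could be wrapped in a small function. The sketch below uses a hypothetical name, `hasBoundary(between:and:)`, and is subject to the same limitation as the snippet above: it only gives a meaningful answer when a boundary is already known to precede the first scalar.

```swift
/// Hypothetical helper wrapping the two-scalar check.
/// Assumes a cluster boundary is already known to precede `first`.
func hasBoundary(between first: Unicode.Scalar, and second: Unicode.Scalar) -> Bool {
    var scalars = String.UnicodeScalarView()
    scalars.append(first)
    scalars.append(second)
    // Two clusters means there is a boundary between the two scalars.
    return String(scalars).count > 1
}

print(hasBoundary(between: "e", and: "\u{301}")) // false: the accent joins the "e"
print(hasBoundary(between: "a", and: "b"))       // true
print(hasBoundary(between: "\r", and: "\n"))     // false: CR LF forms a single cluster
```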

On the other hand, if you have a String.Index that came from a String and not a String.UnicodeScalarView, String.UTF8View or String.UTF16View, then you already know the index is at a cluster boundary by definition.

† Edit: This assumes you already know there is a boundary before the first character. See the next post.
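The point about String.Index can be checked directly. In this sketch (names are illustrative only), indices taken from the String itself are compared with indices taken from its unicodeScalars view:

```swift
let word = "fac\u{327}ade" // "façade" with a decomposed ç

// Indices produced by the String itself always sit on cluster boundaries.
let characterIndices = Array(word.indices)

// Indices produced by the scalar view may fall inside a cluster.
let scalars = word.unicodeScalars
let beforeC = scalars.index(scalars.startIndex, offsetBy: 2)       // between "a" and "c"
let beforeCedilla = scalars.index(scalars.startIndex, offsetBy: 3) // between "c" and U+0327

print(characterIndices.contains(beforeC))       // true: a cluster starts here
print(characterIndices.contains(beforeCedilla)) // false: this is inside the "ç" cluster
```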

SDGGiesbrecht · April 9, 2021, 5:16pm

Responding to your latest edit:

These will not help on their own. Finding cluster boundaries is complicated and requires knowledge from farther back in the string:

var flags = "🇭🇺🇸🇦" // [H][U]|[S][A]
print(flags.count) // 2

flags.unicodeScalars.removeFirst()
flags.unicodeScalars.removeLast()
print(flags) // 🇺🇸 [U][S]
print(flags.count) // 1

Notice how the boundaries completely changed. If we were to only look into the middle two pieces of the longer string, we would get it entirely wrong.

The parsing example I gave earlier is only reliable because it began at the beginning of the string.
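To see why the starting point matters, here is a sketch that pretends reading began one scalar late (a hypothetical scenario, using the same four regional indicators written as escapes):

```swift
// H, U, S, A as regional indicator scalars: two flags when read from the start.
let flags = "\u{1F1ED}\u{1F1FA}\u{1F1F8}\u{1F1E6}"
let allScalars = Array(flags.unicodeScalars)

// Pretend the first scalar was missed and parsing started at U.
var lateView = String.UnicodeScalarView()
lateView.append(contentsOf: allScalars.dropFirst())
let lateStart = String(lateView)

// Number of scalars in each cluster, for both starting points.
let fromStart = flags.map { $0.unicodeScalars.count }
let fromMiddle = lateStart.map { $0.unicodeScalars.count }
print(fromStart)  // [2, 2]
print(fromMiddle) // [2, 1]

// The first cluster found from the late start pairs U with S,
// a cluster that does not exist in the full string.
let lateFirst = Array(lateStart.first!.unicodeScalars)
print(lateFirst == [allScalars[1], allScalars[2]]) // true
```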

For an in‐depth explanation of those scalar properties you mentioned, and how they combine to determine the positions of grapheme breaks, see the relevant section of Unicode Standard Annex #29 (UAX #29): Unicode Text Segmentation.
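To get a feel for the three properties asked about, they can be printed for a few sample scalars. This is only an exploratory sketch; the choice of samples is mine. As far as I understand the Unicode data, the three are separate properties (Grapheme_Extend, Extender, and Emoji_Modifier), and none of them decides a boundary on its own.

```swift
// A few sample scalars to inspect.
let samples: [Unicode.Scalar] = [
    "a",         // LATIN SMALL LETTER A
    "\u{301}",   // COMBINING ACUTE ACCENT
    "\u{30FC}",  // KATAKANA-HIRAGANA PROLONGED SOUND MARK
    "\u{1F3FD}", // EMOJI MODIFIER FITZPATRICK TYPE-4
]

// Print each code point followed by the three properties in question.
for scalar in samples {
    let properties = scalar.properties
    print(
        "U+" + String(scalar.value, radix: 16, uppercase: true),
        "isGraphemeExtend:", properties.isGraphemeExtend,
        "isExtender:", properties.isExtender,
        "isEmojiModifier:", properties.isEmojiModifier
    )
}
```

In this sample, the combining accent reports isGraphemeExtend, the prolonged sound mark reports isExtender, and the skin tone modifier reports isEmojiModifier, while the plain letter reports none of them.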