[Solved] How can I convert a `String` containing a Big5 code literal to the corresponding Big5 character?

ShikiSuen · February 7, 2023, 1:19pm

The following is what I have worked on so far.
I wonder why the generated CFString's length is 0.

import Foundation

// Ref1: https://developer.apple.com/documentation/corefoundation/cfstringencodings
// Ref2: https://developer.apple.com/documentation/corefoundation/cfstringbuiltinencodings

let combinedCode = "A2D0"  // Fullwidth Alphabet "Ｂ" in Big5.
var charBytes = combinedCode.compactMap { CChar($0.hexDigitValue ?? 0) }
print(charBytes)
let string = CFStringCreateWithCString(nil, &charBytes, CFStringEncoding(CFStringEncodings.big5.rawValue))
if let string = string, (string as NSString).length > 0 {
  print("Successful: \(string as NSString)")
} else {
  print("Failed.")
}

tera · February 7, 2023, 1:42pm

It doesn't look like you are building charBytes correctly. Try this one:

let charData = Data([0xA2, 0xD0])
let charBytes = (charData as NSData).bytes // TODO: remove NSData dependency
let string = CFStringCreateWithBytes(kCFAllocatorDefault, charBytes, charData.count, CFStringEncoding(CFStringEncodings.big5.rawValue), false)

CFStringCreateWithCString would also work, but in that case your bytes must be zero-terminated.

ShikiSuen · February 7, 2023, 1:43pm

Thanks for your response.
Sorry for my English, but could you please tell me what "0 terminated" means?

tera · February 7, 2023, 1:46pm

C strings are "zero terminated" (contain zero at the end), e.g. this will do in the above example if you want to switch from CFStringCreateWithBytes to CFStringCreateWithCString:

Data([0xA2, 0xD0, 0])
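To illustrate what the terminator does, here is a minimal sketch. It uses plain ASCII bytes and String(cString:) instead of Big5 and CoreFoundation (an assumption made only so that it runs anywhere), but the scanning rule is the same: a C-string reader stops at the first zero byte.

```swift
// A C-style buffer: 'H', 'i', NUL, '!'. The '!' sits after the terminator.
let buffer: [UInt8] = [0x48, 0x69, 0x00, 0x21]

// String(cString:) scans forward from the pointer until it meets a zero byte.
let decoded = buffer.withUnsafeBufferPointer { pointer in
    String(cString: pointer.baseAddress!)
}

print(decoded)        // Hi
print(decoded.count)  // 2
```

Without the zero byte, the reader would keep going past the end of the buffer, which is why a buffer handed to CFStringCreateWithCString needs the trailing 0.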

ShikiSuen · February 7, 2023, 2:02pm

Thanks. I have changed my code to the following:

import Foundation

// Ref1: https://developer.apple.com/documentation/corefoundation/cfstringencodings
// Ref2: https://developer.apple.com/documentation/corefoundation/cfstringbuiltinencodings

let combinedCode = "A2D0"  // Fullwidth Alphabet "Ｂ" in Big5.
var charBytesRAW: [Int] = combinedCode.compactMap(\.hexDigitValue)
var charBytes = [Int]()
var buffer: Int = 0
charBytesRAW.forEach { neta in
  if buffer == 0 {
    buffer += neta
  } else {
    buffer = Int(buffer) * 16
    charBytes.append(buffer + neta)
    buffer = 0
  }
}
charBytes.append(0)
print(charBytes)
let string = CFStringCreateWithCString(nil, &charBytes, CFStringEncoding(CFStringEncodings.big5.rawValue))
if let string = string, (string as NSString).length > 0 {
  print("Successful: \(string as NSString)")
} else {
  print("Failed.")
}

The charBytes array becomes [162, 208, 0] but it still fails.

tera · February 7, 2023, 2:24pm

You've got two errors here: the first is using Int instead of UInt8, and the second is how you are passing the resulting array into CFStringCreateWithXXX.

Once you have a correct UInt8 array:

print(charBytes)
let data = Data(charBytes)
precondition(data == Data([0xA2, 0xD0, 0])) // TODO: remove afterwards
let bytes = (data as NSData).bytes // TODO: remove NSData dependency
let string = CFStringCreateWithCString(nil, bytes, CFStringEncoding(CFStringEncodings.big5.rawValue))

As for the NSData conversion (as in my example): that's a quick & dirty way of getting bytes out of Data. To do it in a modern way, you'd want to use withUnsafeBytes on Data.

Edit: I wouldn't recommend using an intermediate Array here; you can construct the Data directly without making an Array first.
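A small sketch of why the element type matters. It only inspects the raw memory of the two arrays and does not call CoreFoundation. The exact byte order and Int width are platform-dependent, so the layout described in the comments is an assumption about a typical 64-bit little-endian machine.

```swift
let asInts: [Int] = [162, 208, 0]
let asBytes: [UInt8] = [0xA2, 0xD0, 0x00]

// Copy out the raw bytes that a C function would see behind each array.
let intMemory = asInts.withUnsafeBytes { Array($0) }
let byteMemory = asBytes.withUnsafeBytes { Array($0) }

print(MemoryLayout<Int>.size)  // bytes per Int (8 on 64-bit platforms)
print(intMemory.count)         // 3 * MemoryLayout<Int>.size
print(byteMemory)              // [162, 208, 0]

// On a 64-bit little-endian machine, intMemory starts with
// 162, 0, 0, 0, 0, 0, 0, 0, 208, ... so a C-string reader would see
// 0xA2 followed immediately by a zero byte and stop there.
```

Only the [UInt8] (or [Int8]) array has the contiguous A2 D0 00 layout that a C-string API expects.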

tera · February 7, 2023, 2:43pm

This is how to go directly from Data to String:

precondition(data == Data([0xA2, 0xD0, 0]))
let cfEncoding = CFStringEncodings.big5
let nsEncoding = CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(cfEncoding.rawValue))
let stringEncoding = String.Encoding(rawValue: nsEncoding)
let string = String(data: data, encoding: stringEncoding)

I'd clean this further to have something like this:

String(data: "A2D0".hexToData, encoding: .big5)

where big5 would be an extension on String.Encoding:

extension String.Encoding {
	static var big5 = ...
}

ShikiSuen · February 7, 2023, 2:51pm

Thanks for your help. I finally made it:

import Foundation

let combinedCode = "A2D0"  // Fullwidth Alphabet "Ｂ" in Big5.
var charBytesRAW: [Int] = combinedCode.compactMap(\.hexDigitValue)
var charBytes = [UInt8]()
var buffer: Int = 0
charBytesRAW.forEach { neta in
  if buffer == 0 {
    buffer += neta
  } else {
    buffer = Int(buffer) * 16
    charBytes.append(UInt8(buffer + neta))
    buffer = 0
  }
}
charBytes.append(0)  // CFStringCreateWithCString expects a zero-terminated buffer.
let data: NSData = Data(charBytes) as NSData
let string = CFStringCreateWithCString(nil, data.bytes, CFStringEncoding(CFStringEncodings.big5.rawValue))
if let string = string, (string as NSString).length > 0 {
  print("Successful: \(string as NSString)")
} else {
  print("Failed.")
}

ShikiSuen · February 7, 2023, 2:52pm

This is what I would prefer, too.
Maybe this gist is useful in this case:
Convert Hexadecimal String to Array or Data with Swift3 style. (github.com)

Martin · February 7, 2023, 2:56pm

Just note that String(data:encoding:) does not need or expect NULL-terminated input, it will convert the 0x00 byte to a U+0000 character.
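A quick way to see this, sketched with UTF-8 as a stand-in encoding (an assumption made so the snippet does not depend on Big5 support):

```swift
import Foundation

// "AB" followed by a trailing zero byte, as a C string would be stored.
let withTerminator = Data([0x41, 0x42, 0x00])

// The zero byte is decoded as a real character (U+0000), not treated as an end marker.
let kept = String(data: withTerminator, encoding: .utf8)!
print(kept.count)                       // 3
print(kept.unicodeScalars.last!.value)  // 0

// Dropping the terminator before decoding gives the expected two characters.
let trimmed = String(data: withTerminator.dropLast(), encoding: .utf8)!
print(trimmed)        // AB
print(trimmed.count)  // 2
```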

ShikiSuen · February 7, 2023, 2:58pm

Thanks for your advice. We can drop the last byte if it is 0x00.

tera · February 7, 2023, 9:30pm

Yep, there are many ways to convert a hex string to data.

Full example:

import Foundation

let combinedCode = "A2D0"  // Fullwidth Alphabet "Ｂ" in Big5.
let string = String(data: combinedCode.hexData!, encoding: .big5)
print(string)

extension String {
    var hexData: Data? {
        var firstDigit: UInt8?
        var data = Data()
        
        for char in self {
            guard let hex = char.hexDigitValue else { return nil } // not a hex string
            let digit = UInt8(hex)
            if let first = firstDigit {
                data.append(first * 0x10 + digit)
                firstDigit = nil
            } else {
                firstDigit = digit
            }
        }
        if firstDigit != nil { return nil } // odd hex string
        return data
    }
}

extension String.Encoding {
    static let big5: String.Encoding = {
        let cfEncoding = CFStringEncodings.big5
        let nsEncoding = CFStringConvertEncodingToNSStringEncoding(CFStringEncoding(cfEncoding.rawValue))
        let stringEncoding = String.Encoding(rawValue: nsEncoding)
        return stringEncoding
    }()
}

Edit: The name "hexData" is not great: it can easily be confused with "make a Data containing the hex representation of a given string" (e.g. going from "ABC" to the hex string 414243 stored in Data), while here you are doing the opposite. Data(hexString: "A2D0") looks better.
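One possible shape for that initializer, as a sketch only. Data(hexString:) is not a Foundation API; the name comes from the suggestion above and the body is an assumption that mirrors the hexData logic (nil for an odd length or a non-hex character).

```swift
import Foundation

extension Data {
    /// Hypothetical failable initializer: parses pairs of hex digits into bytes.
    init?(hexString: String) {
        let digits = Array(hexString)
        guard digits.count % 2 == 0 else { return nil }  // odd number of digits

        var bytes = [UInt8]()
        bytes.reserveCapacity(digits.count / 2)
        for index in stride(from: 0, to: digits.count, by: 2) {
            guard let high = digits[index].hexDigitValue,
                  let low = digits[index + 1].hexDigitValue
            else { return nil }  // not a hex string
            bytes.append(UInt8(high << 4 | low))
        }
        self.init(bytes)
    }
}

print(Data(hexString: "A2D0").map { Array($0) } as Any)  // Optional([162, 208])
print(Data(hexString: "A2D") == nil)                     // true
```

The call site from the full example would then read String(data: Data(hexString: combinedCode)!, encoding: .big5).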

ShikiSuen · February 8, 2023, 9:57am

Thanks. Yours always looks simpler than mine.

I also made mine more universal (i.e. handling any codepage supported by CoreFoundation):


public extension String {
  func parsedAsHexLiteral(encoding: CFStringEncodings? = nil) -> String? {
    guard count % 2 == 0 else { return nil }
    guard range(of: "^[a-fA-F0-9]+$", options: .regularExpression) != nil else { return nil }
    let encodingRaw: UInt32 = {
      if let encoding = encoding {
        return UInt32(encoding.rawValue)
      } else {
        return CFStringBuiltInEncodings.UTF8.rawValue
      }
    }()
    let charBytesRAW: [Int] = compactMap(\.hexDigitValue)
    var charBytes = [UInt8]()
    var buffer = 0
    charBytesRAW.forEach { neta in
      if buffer == 0 {
        buffer += neta
      } else {
        buffer = Int(buffer) * 16
        charBytes.append(UInt8(buffer + neta))
        buffer = 0
      }
    }
    let data = Data(charBytes)
    let dataBytes = data.withUnsafeBytes {
      [Int8](UnsafeBufferPointer(start: $0, count: data.count))
    }
    let string = CFStringCreateWithCString(nil, dataBytes, CFStringEncoding(encodingRaw))
    if let string = string {
      return string as String
    }
    return nil
  }
}

Update: I managed to remove NSData dependency. However, it looks like the usage of withUnsafeBytes needs upgrade. At this moment I still can't figure out how to do it.
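One thing worth checking in the pairing loop above: it uses buffer == 0 to mean "no high digit stored yet", so a byte whose high hex digit is 0 gets mis-paired. Big5 lead and trail bytes never start with a 0 digit, which is probably why "A2D0" works, but other encodings can hit it. The sketch below extracts the loop into two hypothetical helper functions (the names are made up for illustration) to compare the sentinel approach with an Optional.

```swift
// The pairing logic from the post, using 0 as the "empty" marker.
func pairWithZeroSentinel(_ hex: String) -> [UInt8] {
    var bytes = [UInt8]()
    var buffer = 0
    for nibble in hex.compactMap(\.hexDigitValue) {
        if buffer == 0 {
            buffer += nibble
        } else {
            bytes.append(UInt8(buffer * 16 + nibble))
            buffer = 0
        }
    }
    return bytes
}

// Same idea, but an Optional records whether a high digit is pending.
func pairWithOptional(_ hex: String) -> [UInt8] {
    var bytes = [UInt8]()
    var high: Int?
    for nibble in hex.compactMap(\.hexDigitValue) {
        if let pending = high {
            bytes.append(UInt8(pending * 16 + nibble))
            high = nil
        } else {
            high = nibble
        }
    }
    return bytes
}

print(pairWithZeroSentinel("A2D0"))  // [162, 208]
print(pairWithZeroSentinel("0A41"))  // [164], the leading 0 digit is lost
print(pairWithOptional("0A41"))      // [10, 65]
```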

tera · February 8, 2023, 11:33am

Good call. NSData's bytes is quite problematic in Swift:

NSData's bytes pointer is only guaranteed to be valid until the NSData object is deallocated, and current Swift is quite liberal about when that can happen. For example, it can happen right after the last use of the "data" variable ("data.bytes" in this example), which could immediately leave the bytes memory invalid or reused for something else, unless of course you keep the reference to the NSData object alive long enough:

let string = CFStringCreateWithCString(nil, data.bytes, CFStringEncoding(CFStringEncodings.big5.rawValue))
// use NSData object somehow to make sure it is valid till this point
// make sure this "usage" is not optimised away (release builds, etc)

Or just switch to a safer API like below (or even safer API – String(data:encoding:) – as suggested before.)

Indeed, like so:

let cfString = data.withUnsafeBytes { p in
    CFStringCreateWithBytes(nil, p.baseAddress!, data.count, CFStringEncoding(encodingRaw), false)
}

(or a similar CFStringCreateWithCString usage in which case the data has to be zero terminated as was discussed previously).

bbrk24 · February 8, 2023, 1:24pm

Isn’t that the exact purpose of withExtendedLifetime?
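For reference, a minimal sketch of what that could look like. As far as I understand the standard library documentation, withExtendedLifetime keeps its first argument alive until the closure returns. The snippet only copies the bytes out rather than calling CoreFoundation, which is an assumption made to keep it small.

```swift
import Foundation

let source: [UInt8] = [0xA2, 0xD0, 0x00]
let data = NSData(bytes: source, length: source.count)

// The NSData object is kept alive for the whole closure, so data.bytes
// stays valid while it is being read.
let copied: [UInt8] = withExtendedLifetime(data) {
    Array(UnsafeRawBufferPointer(start: data.bytes, count: data.length))
}

print(copied)  // [162, 208, 0]
```

A call such as CFStringCreateWithCString(nil, data.bytes, ...) would go inside the closure in the same way.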

ShikiSuen · February 8, 2023, 2:00pm

This is the final, acceptable refactor of my code, done by Isaac Xen:

public extension String {
  func parsedAsHexLiteral(encoding: CFStringEncodings? = nil) -> String? {
    guard !isEmpty else { return nil }
    var charBytes = [Int8]()
    var buffer: Int?
    compactMap(\.hexDigitValue).forEach { neta in
      if let validBuffer = buffer {
        charBytes.append(.init(bitPattern: UInt8(validBuffer << 4 + neta)))
        buffer = nil
      } else {
        buffer = neta
      }
    }
    charBytes.append(0)  // CFStringCreateWithCString reads until the first zero byte.
    let encodingRaw = encoding.map { UInt32($0.rawValue) } ?? CFStringBuiltInEncodings.UTF16BE.rawValue
    let result = CFStringCreateWithCString(nil, &charBytes, encodingRaw) as String?
    return result?.isEmpty ?? true ? nil : result
  }
}

tera · February 8, 2023, 3:29pm

Glad you solved that.

I wonder about something else:

Where are you getting that hex string from, or, in other words, why do you use this form to begin with? It's quite unusual to store text in such a form. Is this due to ASCII compatibility?

ShikiSuen · February 8, 2023, 4:40pm

It is because I want to add a new feature to my vChewing Input Method. This feature is:
Implements a Big5 code input mode by zonble · Pull Request #355 · openvanilla/McBopomofo (github.com)

Zonble: I know some traditional Bopomofo users who are familiar with Big5 code input method (內碼輸入法) since the DOS era. They use traditional Bopomofo to input Hanzi but Big5 code for punctuations and symbols. Using big5 code has already become a muscle memory. The feature is for such users.

Here's my implementation (need to set the video quality to highest to see the codes):
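vChewing 威注音輸入法 on Twitter: "After much deliberation, vChewing has decided to add a Big5 code input mode after all. vChewing's implementation of this feature is completely different from McBopomofo's, and it can also type Hanzi by Big5 code; a GB2312 code input mode is planned as well. These features are collectively called the "code point input mode" (區位輸入模式). https://t.co/TNkXQKGgFO" / Twitter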

My implementation is obviously different from Zonble's work. I don't even need to change the finite state machine.

ShikiSuen · February 8, 2023, 4:44pm

By the way, here are the character tables for reference:

Big5: Big5 (Traditional Chinese) character code table (ash.jp)
GB2312: GB2312 (Simplified Chinese) character code table (ash.jp)

These two are the code pages used in DOS and Win9x for handling Traditional and Simplified Chinese.