Lossy decoding from Data to String

I am trying to implement these two methods to read and write strings with the specified encoding from and to memory.

func read(from address: Int, maxLength: Int, encoding: String.Encoding) -> String?
func write(to address: Int, string: String, encoding: String.Encoding) -> Bool

For the write primitive, I implemented string to data encoding using:

string.data(using: encoding, allowLossyConversion: true)

And it works correctly, replacing any unsupported character with ?.
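
The full write path is then roughly this (just a sketch; write(to:buffer:) stands in for the hypothetical raw-memory primitive that copies bytes to the given address):

import Foundation

func write(to address: Int, string: String, encoding: String.Encoding) -> Bool {
    // Lossy encoding: unsupported characters get substituted (with "?" in my tests).
    guard let data = string.data(using: encoding, allowLossyConversion: true) else {
        return false
    }
    // Hypothetical primitive that writes raw bytes at the given address.
    return data.withUnsafeBytes { buffer in
        write(to: address, buffer: buffer)
    }
}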
For the read primitive my idea was the following:

  1. Calculate the maximum number of bytes needed to represent a string of maxLength characters using encoding:
" ".maximumLengthOfBytes(using: encoding) * maxLength
  2. Read the data from memory into an allocated buffer of the calculated size
  3. Lossily decode this data into a string using the specified encoding

For this last part, I tried using:

String(data: data, encoding: encoding)

But it does not work reliably, because it just returns nil when an invalid character is found. Since data is most likely oversized compared to the actual string, it will contain some junk at the end, which causes the decoding to fail.

So then I tried with this:

String(decoding: data, as: Unicode.UTF8.self)

Which works, but instead of using ?s it keeps the junk bytes and substitutes invalid ones with the U+FFFD replacement character (e.g. Hello World\0\0\0\0�\u{10})

Furthermore, the API is fairly different between the two constructors, and I cannot figure out whether there is a standard way to convert an encoding (String.Encoding) into something I can pass as the as: argument of String(decoding:as:).
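
Right now the only thing I can think of is writing the mapping by hand for the couple of encodings that have stdlib counterparts, something like this hypothetical helper (just a sketch):

import Foundation

func lossyString(from data: Data, encoding: String.Encoding) -> String? {
    switch encoding {
    case .utf8:  return String(decoding: data, as: Unicode.UTF8.self)
    case .ascii: return String(decoding: data, as: Unicode.ASCII.self)
    default:     return nil // no obvious stdlib counterpart, only String(data:encoding:)
    }
}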

I'd recommend storing the data size along with the data itself when writing, e.g.:

<dataSize> <string data itself>

and using that length during reading. dataSize could be, say, 4 or 8 bytes.

For "ABCDE" string and 4 bytes used for the length prefix it would become:

00 00 00 05    41 42 43 44 45

This would be analogous to how "Pascal strings" worked in the past – those stored a single prefix byte as the string length.
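
A rough sketch of both directions, with a hypothetical 4-byte big-endian prefix:

import Foundation

// Prepend a 4-byte big-endian length to the encoded string bytes.
func encodeWithLengthPrefix(_ string: String, encoding: String.Encoding) -> Data? {
    guard let payload = string.data(using: encoding, allowLossyConversion: true) else { return nil }
    let prefix = withUnsafeBytes(of: UInt32(payload.count).bigEndian) { Data($0) }
    return prefix + payload
}

// Read the prefix first, then exactly that many bytes.
func decodeWithLengthPrefix(_ data: Data, encoding: String.Encoding) -> String? {
    guard data.count >= 4 else { return nil }
    let length = data.prefix(4).reduce(0) { $0 << 8 | Int($1) } // big-endian length
    guard data.count >= 4 + length else { return nil }
    return String(data: data.dropFirst(4).prefix(length), encoding: encoding)
}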


That's a good idea, but unfortunately I may not be the one who originally allocated the string.

To provide a bit of context, I am building a memory introspection library in Swift (similar in some parts to the Frida toolkit), which can operate on any external native process.

It's probably going to work unreliably in many cases (depending on your needs, of course). E.g. the external app uses one encoding and your app guesses the encoding wrong. Or the external app uses "Hello\0World" as a single string but your app deciphers it as two separate strings. Or the external app uses a sequence of several strings and your app treats that as a single string:

let array = ["String1-grizzly", "String2-grizzly", "String3-grizzly"]
array.withUnsafeBytes { buffer in
    for i in 0 ..< buffer.count {
        print(String(format: "0x%02X", buffer[i]), terminator: ",")
    }
    print()
}
// 0x53,0x74,0x72,0x69,0x6E,0x67,0x31,0x2D,0x67,0x72,0x69,0x7A,0x7A,0x6C,0x79,0xEF,
// 0x53,0x74,0x72,0x69,0x6E,0x67,0x32,0x2D,0x67,0x72,0x69,0x7A,0x7A,0x6C,0x79,0xEF,
// 0x53,0x74,0x72,0x69,0x6E,0x67,0x33,0x2D,0x67,0x72,0x69,0x7A,0x7A,0x6C,0x79,0xEF

let data = Data([
    0x53,0x74,0x72,0x69,0x6E,0x67,0x31,0x2D,0x67,0x72,0x69,0x7A,0x7A,0x6C,0x79,0xEF,
    0x53,0x74,0x72,0x69,0x6E,0x67,0x32,0x2D,0x67,0x72,0x69,0x7A,0x7A,0x6C,0x79,0xEF,
    0x53,0x74,0x72,0x69,0x6E,0x67,0x33,0x2D,0x67,0x72,0x69,0x7A,0x7A,0x6C,0x79,0xEF
])
let string = String(data: data, encoding: .ascii)
print(string ?? "nil")
// String1-grizzlyïString2-grizzlyïString3-grizzlyï

You'd need to apply some heuristics to make it work in the most typical scenarios.

Wrong encoding would not be an issue, it's the library user's responsibility to pick the correct encoding when reading and writing the data.

This looks very similar to what I did here:

var data = Data(count: maxBytes)

guard
  data.withUnsafeMutableBytes({ buffer in
    read(from: address, buffer: buffer)
  }) == true
else { return nil }

guard let string = String(data: data, encoding: encoding) else { return nil }

But the string initializer returns nil the moment it finds a byte it cannot decode using the specified encoding, and because the read buffer is most likely bigger than the actual string (unless it's using ASCII or another fixed-length encoding), it always fails.
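
For instance, with some made-up bytes standing in for what ends up in the oversized buffer:

// "Hi", a NUL, then junk left over in the oversized buffer.
let oversized = Data([0x48, 0x69, 0x00, 0xFF, 0xFE])
String(data: oversized, encoding: .utf8)            // nil: 0xFF / 0xFE are not valid UTF-8
String(decoding: oversized, as: Unicode.UTF8.self)  // "Hi\0\u{FFFD}\u{FFFD}": the junk is repaired instead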

One fix could probably be to implement my own strlen function that considers the characters and the encoding, but I am trying to avoid that :confused:
I just wish Swift had an allowLossyConversion parameter for the string constructor as well...
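
The helper I'm trying to avoid would look roughly like this (a sketch only; it assumes fixed-width code units, which isn't even true for every encoding):

import Foundation

// Encoding-aware "strlen": find the offset of the first all-zero code unit.
func terminatedByteCount(of data: Data, encoding: String.Encoding) -> Int {
    let width: Int
    switch encoding {
    case .utf16, .utf16LittleEndian, .utf16BigEndian: width = 2
    case .utf32, .utf32LittleEndian, .utf32BigEndian: width = 4
    default: width = 1
    }
    let bytes = [UInt8](data)
    var offset = 0
    while offset + width <= bytes.count {
        if bytes[offset ..< offset + width].allSatisfy({ $0 == 0 }) { return offset }
        offset += width
    }
    return bytes.count
}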

If I understand your task right, you "want to see the string but not the gibberish at the end" (even if the gibberish at the end happened to be an intentional part of the string... let's consider that a rare use case you don't support). You could probably use String(decoding:as:) and filter the result (perhaps by trimming the gibberish from the end).

I don't mind the gibberish at the end.
After I obtain the full string (actual string data + trailing gibberish), I would just create a substring from the start index to either start + maxLength or the first occurrence of "\0" if the user specified that the string is zero-terminated.
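
Something along these lines (a sketch; nullTerminated is a hypothetical flag the user would pass):

let decoded = String(decoding: data, as: Unicode.UTF8.self)
let trimmed: Substring
if nullTerminated, let nul = decoded.firstIndex(of: "\0") {
    trimmed = decoded[..<nul]            // cut at the first NUL
} else {
    trimmed = decoded.prefix(maxLength)  // otherwise cap at maxLength characters
}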

The issue is that String(decoding:as:) behaves differently from its string.data(using:allowLossyConversion:) counterpart.
Its as: argument is not of type String.Encoding like using: is, which would leave the read and write primitives with confusingly different ways of specifying the encoding.

Furthermore, while string.data(using:allowLossyConversion:) falls back to ?s for invalid characters, String(decoding:as:) seems to substitute the U+FFFD replacement character instead.
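
A small comparison of the two behaviours (the exact substitution character on the encoding side is up to Foundation; "?" is what I observe):

// Encoding side: lossy conversion substitutes the unsupported "→" character.
"A→B".data(using: .ascii, allowLossyConversion: true)  // presumably Data([0x41, 0x3F, 0x42]), i.e. "A?B"

// Decoding side: the invalid byte is repaired with U+FFFD instead.
String(decoding: Data([0x41, 0xFF, 0x42]), as: Unicode.UTF8.self)  // "A\u{FFFD}B"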

The awkwardness of the API aside, do you fear that Unicode.UTF8.self / Unicode.ASCII.self would behave differently compared to String.Encoding.utf8 / String.Encoding.ascii?

You could probably use this pair, which has a consistently typed encoding parameter:

    static func decodeCString<Encoding: _UnicodeEncoding>(
        _ cString: UnsafePointer<Encoding.CodeUnit>?,
        as encoding: Encoding.Type, 
        repairingInvalidCodeUnits: Bool = true
    ) -> (result: String, repairsMade: Bool)?

    init<Encoding: _UnicodeEncoding>(
        decodingCString: UnsafePointer<Encoding.CodeUnit>, 
        as sourceEncoding: Encoding.Type
    )
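
For example, a quick sketch of the repairing variant on a buffer that contains an invalid byte and a NUL terminator:

// "Hello", an invalid UTF-8 byte, a NUL terminator, then leftover junk.
let bytes: [UInt8] = [0x48, 0x65, 0x6C, 0x6C, 0x6F, 0xFF, 0x00, 0xDE, 0xAD]
let decoded = bytes.withUnsafeBufferPointer {
    String.decodeCString($0.baseAddress, as: Unicode.UTF8.self, repairingInvalidCodeUnits: true)
}
// decoded?.result == "Hello\u{FFFD}", decoded?.repairsMade == true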

I still don't completely understand why you are talking about using string.data(...) at all, as you previously indicated that you are not the one writing the data and that it happens in some external / third-party app.

String.Encoding is just a thin wrapper around a raw value (a struct, not an enum). I'm not sure how the conversion methods are implemented under the hood, but my guess is that string.data(using:allowLossyConversion:) comes from Cocoa's NSString, while String(decoding:as:) is implemented in the Swift stdlib (stdlib/public/core/String.swift in the Swift repo). That could explain the difference between the two methods.

I will try those methods, though I think that would limit the decoding to null-terminated strings only (and I don't think it would cope well with encodings that can contain \0 bytes that do not signify string termination).

Regarding why I need string.data(...), I may not be the one that's writing the data, but that can happen. The user may want to allocate a memory region and then write the string in it, or instead find an already allocated string and overwrite it.

This is correct, though over the years we've slowly been moving more and more of the Foundation ones to be wrappers over the stdlib ones, at least for common encodings like utf8 and ascii.

That makes sense! Is the long-term plan to extend the public API with new functions in the style of String(decoding:as:), using Unicode.Encoding instead of String.Encoding, or just to migrate the already existing Foundation methods onto the stdlib implementations?