How to check efficiently if String is ASCII encoded?

I need to do quite expensive parsing on a given string; however, I know in advance that the string is only allowed to contain ASCII code units.

So I wanted to avoid even starting the parse if the string is not an ASCII string.

How can I do this check most efficiently?

If I'm correctly deciphering the standard library code of String, it has this information somewhere in _StringGuts; it's just not publicly exposed. Is there some way to get this flag, or do I have to iterate through the string and check each character, e.g. using string.allSatisfy(\.isASCII)?

I can't answer how to extract already-computed values from String's guts, but it may be faster to check whether all of your string's bytes are 7-bit?

But maybe not. Iterating over a string's characters should be equivalent to iterating over its bytes when the string is ASCII, and if some of them aren't ASCII, .allSatisfy should return early.
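A minimal sketch of that byte-level check (the helper name is made up, just for illustration):

```swift
// Inspect the UTF-8 view directly: every ASCII code unit has its top bit
// clear, so any byte >= 0x80 means the string is not pure ASCII.
// allSatisfy bails out at the first offending byte.
func isAllASCII(_ string: String) -> Bool {
    string.utf8.allSatisfy { $0 < 0x80 }
}

isAllASCII("12345.678")  // true
isAllASCII("12³45")      // false ("³" encodes as two non-ASCII bytes)
```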

¯\_(ツ)_/¯

Can you give us an idea of how big the typical string you are dealing with is, how long the expensive processing takes, and how long string.allSatisfy(\.isASCII) takes (assuming an ASCII string)?

I was thinking about some quick O(1) way of getting the "string size in bytes" and seeing whether it is equal to the string's length, but that doesn't necessarily mean the string is ASCII, e.g. it may still contain bytes > 0x7F.

This. The precalculated flag is currently not exposed.

String, and its default Character-based collection conformance, are designed for Unicode text processing. If you're processing an ASCII string, it will be more efficient for your parser to work in terms of UTF-8 code-units anyway.

The fastest way to calculate this information yourself would be to check that the top bit of every byte is 0. You can load 8 bytes into a single UInt64, then bitmask to check them all at once in a SIMD-like fashion. This is what the standard library does.
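A rough sketch of that word-at-a-time technique (an illustration of the idea, not the standard library's actual implementation):

```swift
// Tests the top bit of eight bytes at once by copying them into a UInt64
// and masking with 0x80 in every byte lane; the mask is byte-order
// agnostic because every lane uses the same bit.
func isASCII(_ string: String) -> Bool {
    var string = string
    // withUTF8 is mutating: it may first copy the string into contiguous
    // UTF-8 storage.
    return string.withUTF8 { bytes -> Bool in
        var i = 0
        while i + 8 <= bytes.count {
            var word: UInt64 = 0
            withUnsafeMutableBytes(of: &word) { raw in
                raw.copyBytes(from: bytes[i ..< i + 8])
            }
            // A non-zero result means some byte has its high bit set.
            if word & 0x8080_8080_8080_8080 != 0 { return false }
            i += 8
        }
        // Check the remaining 0–7 bytes individually.
        while i < bytes.count {
            if bytes[i] & 0x80 != 0 { return false }
            i += 1
        }
        return true
    }
}
```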

@Michael_Ilseman or @Alejandro should confirm, but my recollection is that the "isASCII" bit is really "is known to be ASCII"; i.e. it's possible for an ASCII string to not have that bit set. If that's right, then simply exposing the bit doesn't solve your problem; you can check the bit, but if it's not set, you next have to check each byte.

Yeah, for small strings we can always detect ASCII-ness, even if the original string was not ASCII. For large strings, unless the string was ASCII to begin with, we don't track every operation to see if the string will eventually become ASCII.

Is there a way to determine a String's "underlying representation" byte count in O(1) time, to subsequently do the comparison "string.byteCount == string.count"? Strictly speaking that is not an ASCII check, but it is close.
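For what it's worth, a sketch of that comparison (with the caveat that utf8.count should only be O(1) for natively stored, UTF-8 backed strings, while count itself is O(n)):

```swift
let value = "12345.6789"  // hypothetical input

// UTF-8 code unit count: O(1) for a natively stored String,
// but potentially O(n) for a lazily bridged NSString.
let byteCount = value.utf8.count

// Character (grapheme cluster) count: O(n) in general.
let characterCount = value.count

// Equal counts mean every Character occupies exactly one UTF-8 byte.
let closeEnoughToASCII = byteCount == characterCount
```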

I might have (slightly) overstated the actual expensiveness of the parsing.
However, I am parsing big decimals (basically floating-point numbers of almost arbitrarily high precision) from a String. This String can obviously be quite long for numbers with high precision, and I potentially want to parse a lot of these decimals, since a String is basically the only feasible data type to encode them in, say, JSON or similar.

While implementing the parser function, I wondered whether there are any fast paths for failing, and thought that checking for ASCII should be pretty easy. However, I then only found the string.allSatisfy(\.isASCII) way of doing this and was a bit disappointed that there is no O(1) way, considering that Swift already does all the Unicode checking.

I will live without an ASCII check for now, since my code is fast enough without it for the moment, but I'm sure there are use cases where such a flag on String would be pretty valuable.

Why is that? AFAIK, Swift does quite extensive Unicode checking for Strings. Shouldn't it be quite easy to maintain an isASCII flag that is set to false should a non-ASCII character be encountered? As I understand it, every character in a String must be looked at for the Unicode checking either way, no?

Yes, we do this for the negative case, but we don't ever set the flag back to true (for large strings).

Ah yeah, so the flag in _StringGuts is not actually an isASCII flag but rather an isKnownToBeASCII flag, and that's probably why it's not exposed. Thanks for the clarification!

I think it is unlikely to be worth adding a specific check for a non-ASCII string, as opposed to validating that all bytes are ASCII digits/periods. In fact, I think it is unlikely to be worth an up-front validation check at all - just consume the bytes that you expect, and fail if you see an invalid byte.

These sorts of use-cases are precisely why UTF-8 was developed, and why it has become the most popular Unicode encoding. For lots of formats, you really just want to look for particular ASCII characters, and any characters outside of that (whether they are ASCII or non-ASCII) are treated uniformly; either by rejecting them, or treating them as opaque.
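For illustration, a minimal sketch of that approach over the UTF-8 view; the accepted grammar here (optional leading sign, decimal digits, at most one period) is an assumption for the example, not a full big-decimal syntax:

```swift
// Scan the UTF-8 bytes directly and fail on the first byte that is not
// part of the expected decimal literal, whether it is ASCII or not.
func isDecimalLiteral(_ string: String) -> Bool {
    var sawDigit = false
    var sawPeriod = false
    for (offset, byte) in string.utf8.enumerated() {
        switch byte {
        case UInt8(ascii: "0") ... UInt8(ascii: "9"):
            sawDigit = true
        case UInt8(ascii: ".") where !sawPeriod:
            sawPeriod = true
        case UInt8(ascii: "+"), UInt8(ascii: "-"):
            // A sign is only allowed as the very first byte.
            if offset != 0 { return false }
        default:
            return false  // unexpected byte: reject without further checks
        }
    }
    return sawDigit
}

isDecimalLiteral("-12345.6789")  // true
isDecimalLiteral("12.34.56")     // false (second period)
isDecimalLiteral("１２３")        // false (fullwidth digits are not ASCII)
```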

Optimising the failing path is rarely worth doing (that's why in so-called "zero-cost exception handling" only the success path is optimised, while failures are deemed exceptionally rare and the failing path can be far from zero cost). An example of one of those rare cases: you are parsing 100K strings, 90K of which are bogus; without the optimisation processing time is 1 second, with it 1/10 of a second.

Why are you decoding them from Strings? It sounds like you control the encoding schema, and JSON does not specify any binary format for numerical values, so you could just encode them directly as numbers.

I'm writing a BigDecimal library that I can then use in the ORM I want to write for SurrealDB. SurrealDB can store big decimals and gives them to me as strings, so I have to decode them from strings.