Formatting %s with RTL and system default encoding

(Maybe this should be in the Swift Format category... I'm not too familiar with this forum. I will move this topic if this is not the right place.)

says that

"Because the %s specifier causes the characters to be interpreted in the system default encoding, the results can be variable, especially with right-to-left languages. For example, with RTL, %s inserts direction markers when the characters are not strongly directional. For this reason, it’s best to avoid %s and specify encodings explicitly."

Is it true that my UTF-8 characters are interpreted in the system default encoding if I do

"my string".withCString { String(format: "%-10s", $0) }

?

While I'm at it, how do I find out what my system default encoding is in Swift? Isn't it always UTF-8?

If one must specify encodings explicitly, how is it done when using this kind of formatting?

I'm not sure I understand the relationship between "interpreted in the system default encoding" and the "For example with RTL inserts direction markers when the characters are not strongly directional". Does this mean that the default encoding is used to establish the default directionality (are we by default in Hebrew/Arabic or rather in Latin as this machine is a Latin machine)?

I was going to ask whether the latest Unicode Bidi algorithm is implemented by %s? (E.g. are Directional isolates and Bracket Pairs taken into consideration?), but it seems that %s garbles the Arabic text:

let rtl = "The Car is السيارة in arabic."

print(rtl)

print(rtl.withCString { String(format: "%s", $0) })

The Car is السيارة in arabic.

The Car is السيارة in arabic.

Lastly, is there something more native in Swift (which does not require using .withCString) to format strings (minimum length, padding, etc.)

That category is about a tool that formats Swift source files by applying standard indentation and line breaks. Your question rightly belongs here in “Using Swift”.

Those are an ancient holdover from C, which you don’t want to touch unless something forces you to. Just use direct string interpolation:

let rtlIsolate = "\u{2067}"
let ltrIsolate = "\u{2066}"
let firstStrongIsolate = "\u{2068}"
let popDirectionalIsolate = "\u{2069}"

let arabic = "السيارة"
let sentence
  = "“the car” is “\(rtlIsolate)\(arabic)\(popDirectionalIsolate)” in Arabic."

I chose the RTL isolate because it was known to be Arabic, and in that case explicitly using RTL will force even a string with no directionally strong characters like ...! to properly appear as “!...”.

If you don’t know what is being interpolated, you can use the “first strong isolate”, so that the rendering engine will pick the direction based on the first character with unambiguous direction in whatever string it is given. It won’t be able to tell what to do with strings like “...!”, but it will still get 99% of strings right, and the origin of the interpolated string can correct it by wrapping the string with directional specifiers before passing it to you.
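As a minimal sketch of the “first strong isolate” approach, assuming a hypothetical helper named `isolated` (my own name, not part of any library), the wrapping could look like this:

```swift
// Hypothetical helper: wraps text of unknown direction in
// FIRST STRONG ISOLATE (U+2068) ... POP DIRECTIONAL ISOLATE (U+2069),
// so the renderer picks the direction from the first strong character.
func isolated(_ text: String) -> String {
    let firstStrongIsolate = "\u{2068}"
    let popDirectionalIsolate = "\u{2069}"
    return firstStrongIsolate + text + popDirectionalIsolate
}

let unknownWord = "السيارة"
let isolatedSentence = "“the car” is “\(isolated(unknownWord))” in Arabic."
print(isolatedSentence)
```

The isolates are invisible formatting characters, so how the result is drawn still depends on whatever displays the string.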

These methods are just forwarding to C. I assume so, but I don’t know any more than what the documentation says.

The system default has nothing to do with Swift. I am pretty sure this is referring to what you get when you run locale in a terminal. If so, then it is entirely configurable by the user. And since locale settings are controlled by the environment, they can also be modified at will by other pieces of the program.

You never have to specify an encoding when interpolating Swift strings. All Swift strings are in Unicode. Encoding only matters when converting to and from binary data, such as when loading a file.

Ultimately it means the system attempts to choose the right special characters above for you. But frankly it isn’t something you should ever trust a machine to do.

Swift strings are just a sequence of characters in memory, and always use logical order. Whatever rendering engine you use to display the string, such as SwiftUI or AppKit, may or may not support bidirectional text. But none of them are part of the open source Swift project, so you would have to go ask at their particular sites. (Though I assume they would all answer “yes” in 2020.)
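A small sketch of what “logical order” means in practice. The word from the question is spelled out with escapes here so that the storage order is unambiguous regardless of how an editor displays it:

```swift
// "السيارة" written scalar by scalar, in reading order:
// alef, lam, seen, yeh, alef, reh, teh marbuta.
let arabicWord = "\u{0627}\u{0644}\u{0633}\u{064A}\u{0627}\u{0631}\u{0629}"
let scalars = Array(arabicWord.unicodeScalars)

// The first scalar in memory is the first letter read (alef),
// even though a bidi-aware renderer draws it at the right edge.
let firstScalar = scalars.first!
let lastScalar = scalars.last!
print(String(firstScalar.value, radix: 16), String(lastScalar.value, radix: 16))
// prints "627 629"
```

Nothing in the string itself is reordered; any reordering for display happens in whatever draws the text.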

It looks like UTF‐8 (from the Swift string) was reinterpreted as though it were Latin‐1 (the system encoding). Again, that won’t happen if you stick to the Swift APIs and stay away from the ones intended only for interoperating with C.
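To illustrate the kind of reinterpretation described above, here is a sketch using only the standard library. It decodes the same UTF-8 bytes once correctly and once as if every byte were a Latin-1 character (Latin-1 maps each byte directly to the code point of the same value). Which encoding %s actually applies on a given machine is an assumption here, not something this code detects.

```swift
let original = "\u{00E7}"              // ç
let bytes = Array(original.utf8)       // [0xC3, 0xA7]

// Correct: decode the bytes as UTF-8.
let roundTrip = String(decoding: bytes, as: UTF8.self)

// Wrong: treat each byte as one Latin-1 character.
var misread = String.UnicodeScalarView()
for byte in bytes {
    misread.append(Unicode.Scalar(byte))
}
let garbled = String(misread)

print(roundTrip, garbled)              // prints "ç Ã§"
```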

You really don’t want to do such things anyway:


Use %@ instead of %s. %@ takes an object, and a Swift String is bridged to NSString when you pass it.

	let sentence = "“the car” is “\(rtlIsolate)\(arabic)\(popDirectionalIsolate)” in Arabic."
	print(sentence)

	let sentence2 = String(format: "“the car” is “%@%@%@” in Arabic.", rtlIsolate, arabic, popDirectionalIsolate)
	print(sentence2)

Yes, thank you.

Unfortunately, I believe you cannot get the string to be space-filled and aligned with %@, whereas you can with %s.

"%-20@" does not left-align and pad to 20 characters when the string is shorter than 20 characters (graphemes).


Thank you for your detailed answer.

Is print() [on the console/standard output] part of the open source Swift project? It does seem to print Arabic, I wonder if it does support the latest Unicode Bidi Algorithm.

For what it's worth, I think the system then (with %s) treats the UTF-8 as Mac OS Roman, since "ç" (c with cedilla, UTF-8 bytes C3 A7) came out as "√ß" in an additional test I did.

For GUI applications, I agree with your point.

But I wanted to left align and pad strings when outputting to the console. Admittedly, this is not frequent. Do I have to write my own code to do what %-10s would do (left align, padded to at least 10 characters) but with interpolated strings?

Yes. print() is part of Swift, but all it does is send bytes to standard output. Whether or not it is displayed correctly is up to your terminal application (and it may just defer those sorts of decisions to the platform’s text layout engine).

Yes, you will have to write your own, or else search for a third‐party library attempting to vend this functionality.
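A minimal sketch of such a hand-written helper, assuming a hypothetical method name `leftAligned(toWidth:pad:)`. It counts Swift Characters (grapheme clusters), which is only an approximation of how many columns a terminal will actually use:

```swift
extension String {
    /// Hypothetical helper, roughly what "%-10s" does: left-align and
    /// pad to at least `width` Characters. Longer strings are returned
    /// unchanged, never truncated.
    func leftAligned(toWidth width: Int, pad: Character = " ") -> String {
        let missing = width - count
        guard missing > 0 else { return self }
        return self + String(repeating: pad, count: missing)
    }
}

print("[" + "abc".leftAligned(toWidth: 10) + "]")
```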

My honest recommendation is to not even try, and instead design your output assuming nothing will ever line up vertically. It is not just bidirectional text you’ll be fighting with. Unlike with ASCII, in Unicode you can never assume anything will know what to do with all available characters. A control character might be non‐printing on one machine (0 columns), but unrecognized on another and displayed as a placeholder box (1 column). A combining character might be properly combined on one machine (0 columns), but unrecognized and left separate on another (1 column). An emoji sequence could be properly assembled on one machine (1 column), but left as a series of separate emoji on another (∞ columns).


FWIW, if it's for (iOS/macOS) logging, the in-house Logging API (with WWDC video) might also interest you. Otherwise, if you decide to implement your own padding, check out DefaultStringInterpolation to see how to add them.


Thank you.

" print() is part of Swift, but all it does is send bytes to standard output. Whether or not it is displayed correctly is up to your terminal application"

I suppose my question was where the bidi work happens (are RTL bytes reordered in print()?). It is perfectly sensible (actually better design) to let the terminal application handle that kind of thing.

Thank you Lantua, I watched the video. Quite informative, looks like the Logging API supports defining the proper alignment and width for the interpolated string. Nice.
\(myEntry, align: .right(columns: 15)).

One caveat is that this interpolation is available only to the Logger APIs, since it comes from OSLogInterpolation, which is used only by OSLogMessage. OTOH, String and Substring use DefaultStringInterpolation, which lacks the alignment interpolations you want.
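If you do want something similar for plain strings, here is a sketch of extending DefaultStringInterpolation yourself. The `leftAlignedTo:` label is a made-up name for illustration, not an existing API, and the width is counted in Characters rather than terminal columns:

```swift
extension DefaultStringInterpolation {
    // Hypothetical interpolation: "\(value, leftAlignedTo: 10)" appends
    // the value, then pads with spaces up to `width` Characters.
    mutating func appendInterpolation(_ value: String, leftAlignedTo width: Int) {
        appendLiteral(value)
        let missing = width - value.count
        if missing > 0 {
            appendLiteral(String(repeating: " ", count: missing))
        }
    }
}

let row = "\("id", leftAlignedTo: 6)|\("name", leftAlignedTo: 8)|"
print(row)
```

Because the extension is on DefaultStringInterpolation, it applies to ordinary String literals and to print() arguments built from them.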