How to include accented characters in Swift URL without percent encoding?

kristofk · August 8, 2020, 3:05pm

I am trying to make a web request to a URL that needs to keep accented characters instead of percent encoding them. E.g. é must NOT change to e%CC%81. I cannot change this.

These are the allowed characters that shouldn't be percent encoded: AaÁáBbCcDdEeÉéFfGgHhIiÍíJjKkLlMmNnOoÓóÖöŐőPpQqRrSsTtUuÚúÜüŰűVvWwXxYyZz0123456789-

Here is an example of a URL I need:

https://helyesiras.mta.hu/helyesiras/default/suggest?q=hány%20éves

You can try this URL in your web browser to confirm it's working. (The site is in Hungarian.) If you try the proper percent-encoded version of this URL (https://helyesiras.mta.hu/helyesiras/default/suggest?q=ha%CC%81ny%20e%CC%81ves) then the website will give an error. (Also in Hungarian.)

I have my custom encoder to get this URL string. However to make a web request I need to convert the String to URL.

I tried 2 ways:

URL(string:)

let urlStr = "https://helyesiras.mta.hu/helyesiras/default/suggest?q=hány%20éves"
var url = URL(string: urlStr)

// ERROR: Returns nil

URLComponents with percentEncodedQueryItems

var urlComponents = URLComponents()
urlComponents.scheme = "https"
urlComponents.host = "helyesiras.mta.hu"
urlComponents.path = "/helyesiras/default/suggest"
urlComponents.percentEncodedQueryItems = [           // ERROR: invalid characters in percent encoded query items
    URLQueryItem(name: "q", value: "hány%20éves")
]
let url = urlComponents.url

Is it possible to create URLs without Foundation APIs checking their validity? Or can I create my own validation rules?

Karl · August 8, 2020, 3:53pm

URLs are defined as being ASCII. Non-ASCII characters are either percent-encoded or IDNA transformed depending on the protocol scheme and component of the URL.

The problem is that your percent-encoded version is "incorrect". Taking a look in my browser's web inspector, the actual URL requested is: https://helyesiras.mta.hu/helyesiras/default/suggest?q=h%C3%A1ny%20%C3%A9ves. This seems to work, as far as I can tell.

I believe this is a unicode normalisation issue. There are many ways of writing the "á" character, and the server does not appear to normalise its input, meaning it only recognises some of the ways á can be encoded.
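To make the difference concrete, here is a minimal sketch that percent-encodes the UTF-8 bytes of both spellings of the query. `percentEncodeUTF8` is a hypothetical helper written for this illustration, not a Foundation API, and it assumes the RFC 3986 "unreserved" set as the characters to leave alone.

```swift
// Hypothetical helper: percent-encode every UTF-8 byte that is not in the
// RFC 3986 "unreserved" set. No normalisation is applied to the input.
func percentEncodeUTF8(_ text: String) -> String {
    let unreserved = Set(
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~".utf8
    )
    var result = ""
    for byte in text.utf8 {
        if unreserved.contains(byte) {
            // ASCII byte that may appear literally in a URL.
            result.append(Character(Unicode.Scalar(byte)))
        } else {
            // Everything else becomes %XX with two uppercase hex digits.
            let hex = String(byte, radix: 16, uppercase: true)
            result += "%" + (hex.count == 1 ? "0" + hex : hex)
        }
    }
    return result
}

let precomposed = "h\u{E1}ny \u{E9}ves"     // á and é as single scalars
let decomposed = "ha\u{301}ny e\u{301}ves"  // base letter followed by U+0301

print(precomposed == decomposed)        // true: Swift compares by canonical equivalence
print(percentEncodeUTF8(precomposed))   // h%C3%A1ny%20%C3%A9ves
print(percentEncodeUTF8(decomposed))    // ha%CC%81ny%20e%CC%81ves
```

Note that the two strings compare equal in Swift even though their bytes differ, which is part of why this kind of problem is easy to miss.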

kristofk · August 8, 2020, 5:38pm

You are right, I just needed to use a different percent encoding. I found this Percent encoding table for reference.

I implemented my own encoding table and encoder in my project.

~~What I don't understand is why in Swift the default is Windows-1252 instead of UTF-8 and why I cannot change it?~~ It used to be possible with this function.

jrose · August 8, 2020, 5:42pm

The encoding used for unescaped characters in string literals is whatever your file is saved as; if you save it as UTF-8, you'll get UTF-8 content in the string literal. (In fact, Swift doesn't officially support anything but UTF-8 source files; the fact that Windows-1252 works at all is because string literal contents are generally not inspected by the compiler.)

kristofk · August 8, 2020, 6:02pm

I am mistaken, you are right! And I am very confused now...

My project is a command line tool and I get the input as an argument.
You can see in my original question that in the url I get, the "á" is encoded as "a%CC%81".

However now I tried hard coding the input instead of reading it from the argument and the percent encoding is different. "á" is "%C3%A1".

Is it possible that encoding an argument vs a string literal changes the outcome?

jrose · August 8, 2020, 6:05pm

Hm, unfortunately yes. @compnerd, sorry to summon you, but what's the current plan for people who want Unicode arguments on Windows? (Is there a reason that's not the default behavior?)

kristofk · August 8, 2020, 6:06pm

I'm running on macOS.

jrose · August 8, 2020, 6:10pm

Ah, on macOS it might be your Terminal settings: whatever the terminal is set to is the input the Swift tool will see. What do you have selected under Preferences > Profiles > [your profile] > Advanced > Text encoding?

kristofk · August 8, 2020, 6:20pm

It is UTF-8.

But to further complicate things:

I run my project from Xcode and the arguments are from the build scheme > Arguments Passed on Launch.

Is it possible that the Xcode Build Scheme or the Xcode console is the culprit?

jrose · August 8, 2020, 6:36pm

Oh, shoot. I misread your two different cases: there's no encoding issue here. Instead, you're looking at the difference between "U+00E1 LATIN SMALL LETTER A WITH ACUTE" and "U+0061 LATIN SMALL LETTER A, U+0301 COMBINING ACUTE ACCENT". These are both valid ways to encode "á". You can force the use of one or the other using one of the "Normalizing Strings" properties documented under NSString in the Apple Developer Documentation.

However, you now have a question to answer: which form do your servers expect? Are they consistent? Since it looks like this is an arbitrary query field, I'd say the server is at fault for not performing the normalization itself.
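As a rough sketch of what that could look like, assuming the server wants the precomposed form: normalise the input with Foundation's `precomposedStringWithCanonicalMapping` (NFC) and let `URLComponents` do the percent encoding through `queryItems`. The hard-coded `input` stands in for whatever arrives from the command line.

```swift
import Foundation

// Stand-in for a command-line argument that arrived in decomposed form.
let input = "ha\u{301}ny e\u{301}ves"

// NFC: combine base letters and combining marks into precomposed scalars.
let nfc = input.precomposedStringWithCanonicalMapping

print(input.unicodeScalars.count)  // 11
print(nfc.unicodeScalars.count)    // 9

var components = URLComponents()
components.scheme = "https"
components.host = "helyesiras.mta.hu"
components.path = "/helyesiras/default/suggest"
// queryItems takes the unencoded value; URLComponents percent-encodes it.
components.queryItems = [URLQueryItem(name: "q", value: nfc)]

if let url = components.url {
    print(url.absoluteString)
}
```

Using `queryItems` rather than `percentEncodedQueryItems` avoids hand-building the encoded string, so the only custom step left is the normalisation itself.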

kristofk · August 8, 2020, 6:49pm

This is insane! I spent a whole day on this! I guess I learned something new today about string normalisation. Thank you.

I think you are right that the server is where this is usually handled but from now on I'll also normalise my strings in this project. Just to be safe.

Karl · August 8, 2020, 7:04pm

FWIW, U+00E1 is the NFC-normalised form. That's also the W3C's recommended normalisation, so there's a good chance that's the one you want. By pure coincidence, they even use this specific character in their FAQ example.

For example, take the Hungarian word világ. The fourth letter could be stored in memory as a precomposed U+00E1 LATIN SMALL LETTER A WITH ACUTE (a single character) or as a decomposed sequence of U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (two characters).

So if it accepts the composed form and rejects the decomposed form, your options are either NFC or NFKC.
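For plain accented letters the two give the same result; they only diverge on compatibility characters. A small sketch using Foundation's normalisation properties, with the "ﬁ" ligature (U+FB01) picked purely as an illustrative compatibility character:

```swift
import Foundation

let decomposed = "vila\u{301}g"  // a followed by U+0301 COMBINING ACUTE ACCENT
let ligature = "\u{FB01}"        // U+FB01 LATIN SMALL LIGATURE FI

// NFC and NFKC agree on the accented letter: both produce U+00E1.
let nfc = decomposed.precomposedStringWithCanonicalMapping
let nfkc = decomposed.precomposedStringWithCompatibilityMapping
print(Array(nfc.unicodeScalars) == Array(nfkc.unicodeScalars))  // true
print(nfc.unicodeScalars.count)                                 // 5

// They differ on compatibility characters: NFC keeps the ligature,
// NFKC replaces it with the two letters "f" and "i".
let ligatureNFC = ligature.precomposedStringWithCanonicalMapping
let ligatureNFKC = ligature.precomposedStringWithCompatibilityMapping
print(ligatureNFC.unicodeScalars.count)   // 1
print(ligatureNFKC.unicodeScalars.count)  // 2
```

Since NFKC can change what the user actually typed, NFC is the more conservative choice unless the server is known to expect compatibility folding.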

compnerd · August 8, 2020, 7:36pm

@jrose, hmm, pretty sure that we use the Unicode-encoded arguments on Windows. Pretty much everything has been plumbed in UTF-16; if you want to use ASCII/UTF-8, everything has to be converted. That's the only way to guarantee that everything can be correctly accessed.