Pitch: Unicode Named Character Escape Sequence

SDGGiesbrecht · November 30, 2018, 9:52am

This is an entirely additive proposal, so I realize we can simply choose not to use it if it were implemented.

My work falls into these two categories roughly half and half:

The “writing text” part. I translate. Often the text is mixed in with source code of some kind or another, but even then it mostly involves fully readable string literals with the occasional interpolation. As you say, this is not what this proposal is (intended to be) about. It is not useful in this problem domain. But a developer who speaks only one language can easily read a phrase like the following and wrongly conclude that this feature will be a necessity for his translators in order to localize his application correctly:

mattt:

When working with text containing, for example, both Arabic and Latin script, the use of non-printing, directional formatting characters like RIGHT-TO-LEFT MARK U+200F (RLM) may be necessary to achieve the desired results.

The very example provided belongs to this “working with text” domain ("The phrase is مرحبا بالعالم!‏ in Arabic."). I’m just making it clear that the feature is not as useful in this domain as the proposal can make it sound to someone unfamiliar with the topic.

The “fascinating details of Unicode” part. I’ve worked on input methods, which requires working with unpaired controls and isolated combining characters. I’ve worked on letter frequency analysis scripts which needed to carefully handle decomposition and undo canonical reordering in a way more logical to the grammar. I do a lot of work in this domain too.

In this problem domain, in order to do anything right, you need a vastly better understanding of how Unicode works. By the time you reach that level of understanding, you are very familiar with the hex codes and don’t use the names for anything. The hex code carries a lot more useful information: Which block are we dealing with, Latin‐1 (008x) or punctuation (20xx)? Is this ASCII (00xx), BMP (xxxx), or extended (xxxxx) (ergo how many bytes in UTF‐x)? Is it spacing (02xx) or combining (03xx)? Is it a modifier (02xx) or punctuation (20xx)? None of this information is captured in a name very well. APOSTROPHE vs RIGHT SINGLE QUOTATION MARK: which one is actually recommended as an apostrophe? Consult the charts and you’ll find it’s the second one. For these sorts of reasons, I find the names unhelpful in a heavily technical Unicode setting. In every case either the character itself or the hex code is more useful, more communicative, easier to discover in the first place, and faster to input.
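
To make the comparison concrete, here is a rough sketch in Python, chosen only because it already has a `\N{...}` named escape and a standard `unicodedata` module; nothing here is part of the Swift pitch. The helper `describe` and its dictionary keys are hypothetical names invented for this illustration. It lists what can be read off the code point (ASCII or not, BMP or not, encoded length) next to the formal name.

```python
import unicodedata

def describe(ch):
    """Collect facts derivable from the code point, plus the formal name.

    `describe` is a hypothetical helper for illustration only.
    """
    cp = ord(ch)
    return {
        "hex": f"U+{cp:04X}",
        "ascii": cp < 0x80,                               # 00xx below 0080
        "bmp": cp <= 0xFFFF,                              # fits in four hex digits
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
        # Nonzero canonical combining class, as reported by unicodedata.
        "combining": unicodedata.combining(ch) != 0,
        "name": unicodedata.name(ch, "<unnamed>"),
    }

# A named escape and a hex escape denote the same scalar.
assert "\N{RIGHT-TO-LEFT MARK}" == "\u200F"

# APOSTROPHE, RIGHT SINGLE QUOTATION MARK, MODIFIER LETTER APOSTROPHE,
# COMBINING ACUTE ACCENT, RIGHT-TO-LEFT MARK, and one scalar outside the BMP.
for ch in ["'", "\u2019", "\u02BC", "\u0301", "\u200F", "\U0001F600"]:
    print(describe(ch))
```

In the printed rows, the names APOSTROPHE and RIGHT SINGLE QUOTATION MARK give no hint as to which is preferred as an apostrophe, while the hex values at least place them in different ranges (00xx versus 20xx).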