Will the string processing efforts also expose Bidi_Class?

Karl · April 22, 2022, 2:07pm

Is it planned to expose a codepoint's Bidi_Class as part of the string processing efforts?

It has come up a couple of times. We expose a lot of other data, but don't currently expose the things necessary to properly process bidirectional text, and it would be really nice to add.

If it is not planned, is this an area where contributions might be welcome? It seems there's a lot going on right now with string, so it may not be the best time.

Michael_Ilseman · April 22, 2022, 2:55pm

There is no proposal out there adding these, but it would be very good to add API to Unicode.Scalar.Properties for them.

IIRC @Alejandro recently added this data so it should be possible to expose as API now.
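For illustration only, here is a minimal sketch of the shape such an API might take. Everything in it is an assumption: `BidiClassSketch` and `bidiClassSketch` are hypothetical names, the property hangs off `Unicode.Scalar` rather than `Unicode.Scalar.Properties` so that it compiles outside the standard library, and the lookup is a toy table covering a handful of ranges rather than the real Unicode data.

```swift
// Hypothetical sketch: none of these names exist in the standard library.
enum BidiClassSketch {
    case leftToRight     // L
    case rightToLeft     // R
    case arabicLetter    // AL
    case europeanNumber  // EN
    case arabicNumber    // AN
    case other           // anything this toy table does not cover
}

extension Unicode.Scalar {
    /// Toy lookup covering only a few ranges. A real implementation would
    /// read the complete Bidi_Class data from the Unicode Character Database.
    var bidiClassSketch: BidiClassSketch {
        switch value {
        case 0x41...0x5A, 0x61...0x7A: return .leftToRight    // basic Latin letters
        case 0x30...0x39:              return .europeanNumber // ASCII digits
        case 0x05D0...0x05EA:          return .rightToLeft    // Hebrew letters
        case 0x0627...0x064A:          return .arabicLetter   // Arabic letters (subset)
        case 0x0660...0x0669:          return .arabicNumber   // Arabic-Indic digits
        default:                       return .other
        }
    }
}
```

A real proposal would presumably follow the existing conventions of `Unicode.Scalar.Properties` (for example, how `generalCategory` is exposed) and cover every Bidi_Class value.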

Contributions are very, very much welcome! The barrier to contribute should be low, since it's just adding API following existing practices. It's just the overhead of writing a proposal and going through the evolution process. CC @Alejandro to double check this is all that's needed.

It is a great time to pitch/propose this. What release it lands in is a matter of scheduling and SE.

scanon · April 22, 2022, 2:58pm

As a side note, the best time to make this sort of minor improvement is almost always right when you notice the need. If you think "I'll do it later," more often than not you get busy with something and don't get back to it for months or years. Better to implement a fix and start the process right away.

Karl · April 23, 2022, 2:09am

One more thing I happen to need is the Joining_Type property, which is used in shaping. Would this also be appropriate for the standard library's Unicode data?

The reason I need it is for Internationalized Domain Names. There are a bunch of contextual rules for when certain characters are allowed in domain names, most importantly for joiners (there used to be a lot more rules, but now only Bidi and joiners are checked). ZWNJs in particular...

may occur in a formally cursive script (such as Arabic) in a context where it breaks a cursive connection as required for orthographic rules, as in the Persian language, for example.

In this and one other specific context, they are allowed, otherwise they are banned. Luckily the RFC describes a handy rule for checking this, but it depends on the Joining_Type of the neighbouring code-points.
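As a rough sketch of that check (the ZERO WIDTH NON-JOINER rule from RFC 5892, Appendix A.1), assuming the Joining_Type data is available from somewhere: `JoiningTypeSketch`, `isZWNJAllowed`, and the `joiningType` closure are hypothetical names, and the closure stands in for the lookup the standard library does not currently expose. The Virama check uses the existing `canonicalCombiningClass` property.

```swift
// Hypothetical stand-in for Joining_Type; not a standard library type.
enum JoiningTypeSketch {
    case rightJoining  // R
    case leftJoining   // L
    case dualJoining   // D
    case joinCausing   // C
    case nonJoining    // U
    case transparent   // T
}

/// Sketch of the RFC 5892 Appendix A.1 rule for U+200C at `index`.
/// `joiningType` is a caller-supplied lookup (an assumption of this sketch).
func isZWNJAllowed(
    in scalars: [Unicode.Scalar],
    at index: Int,
    joiningType: (Unicode.Scalar) -> JoiningTypeSketch
) -> Bool {
    precondition(scalars.indices.contains(index) && scalars[index].value == 0x200C)

    // Context 1: directly after a Virama (Canonical_Combining_Class 9).
    if index > 0,
       scalars[index - 1].properties.canonicalCombiningClass.rawValue == 9 {
        return true
    }

    // Context 2: (L|D) T* ZWNJ T* (R|D).
    // Walk backwards past transparent scalars to the nearest joining scalar.
    var i = index - 1
    while i >= 0, joiningType(scalars[i]) == .transparent { i -= 1 }
    guard i >= 0 else { return false }
    let before = joiningType(scalars[i])
    guard before == .leftJoining || before == .dualJoining else { return false }

    // Walk forwards past transparent scalars in the same way.
    var j = index + 1
    while j < scalars.count, joiningType(scalars[j]) == .transparent { j += 1 }
    guard j < scalars.count else { return false }
    let after = joiningType(scalars[j])
    return after == .rightJoining || after == .dualJoining
}
```

Passing the lookup as a closure keeps the sketch independent of where the data lives, whether that is a table bundled with a library or a future standard library API.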

I don't know how frequently the property is used outside of that. I wouldn't mind including my own copy in my library, but I could also see about adding it to the standard library's unicode data if it is thought to be generally useful.

Michael_Ilseman · April 23, 2022, 7:54pm

I don't know about Joining_Type specifically. I think it would make sense to either do a small targeted proposal for the bidi classes, or else try for a more complete update. For example, we will be supporting more properties soon with the regex work as well as scripts. The latter would be especially nice to expose as API. All that's missing (AFAIK) is the mechanical work of adding API, doc comments, and the SE process.

You might want to coordinate with @Alejandro if you're interested in that; he's up to date with data details.

Karl · May 5, 2022, 8:04pm

So, it appears that UTS18 Level 2 support (which is what the regex work targets, I believe?) requires these properties:

RL2.7 Full Properties

To meet this requirement, an implementation shall support all of the properties listed below that are in the supported version of the Unicode Standard (or Unicode Technical Standard, respectively), with values that match the Unicode definitions for that version.

UTS18

That list of properties includes both Bidi_Class and Joining_Type.

So I believe we would need to include this data, would we not? I looked through the regex proposals, and it seems that we do want that Level 2 support, and I can't find anything which excludes these particular properties.

But it doesn't appear to be implemented yet. There are still PRs to improve that support (they don't include these properties), so it seems like it's still ongoing work. It's not on the "near-future work" list, but I also couldn't find it as part of any other draft proposals.

So yeah my understanding is that the data is needed anyway (@Alejandro or @nnnnnnnn, can you perhaps confirm?). As you say, exposing the property is just a matter of extending the existing conventions. I'd be happy to help draft that, if help is welcome/wanted.

In case you're curious, I have actually implemented IDNA, using the standard library's NFC normalization via the _Unicode SPI. I figured it would be valuable for additional testing, feedback, and to help try the real, supported APIs when they are ready to ship (I won't be enabling this by default because the APIs are not stable; there's a fallback which gets this from Foundation). It passes all of the UTS46 compatibility tests, except for the things which need these 2 properties. So that's a concrete use-case which can go into a proposal.