A path forward on rationalizing unicode identifiers and operators

David_Sweeris · October 3, 2017, 11:05pm

Keep in mind that Swift already goes far above and beyond in terms of operators

Yep, that's a large part of why I'm such a Swift fan :-D

Fortunately, no one is seriously proposing a major curtailing of the capabilities here, we’re just trying to rationalize the operator set, which is a bit of a mess at present.

I guess I don't really understand why it's currently "a bit of a mess". Maybe I should go take a look at the relevant compiler code to try to get a better understanding of what you're talking about. Or is this purely a matter of finding places where we disagree with Unicode's character classification? If so, I think I'm back at not quite getting it... In any case, I've apparently misunderstood something because I was under the impression that this would lead to a "major curtailing".

in that: (a) it allows overloading of almost all standard operators; (b) it permits the definition of effectively an infinite number of custom operators using characters found in standard operators; (c) it permits the definition of custom precedences for custom operators; and (d) it additionally permits the use of a wide number of Unicode characters for custom operators. Most systems programming languages don't even allow (a), let alone (b) or (c). Even dramatically curtailing (d) leaves Swift with an unusually expansive support for custom operators.
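For readers who have not used points (b) and (c), here is a minimal sketch of a custom operator with a custom precedence. The `**` operator and the `PowerPrecedence` group are hypothetical names chosen for illustration; neither comes from the standard library or from this thread.

```swift
// Hypothetical precedence group: binds tighter than multiplication
// and groups right-to-left.
precedencegroup PowerPrecedence {
    higherThan: MultiplicationPrecedence
    associativity: right
}

// A custom operator spelled with characters found in standard operators
// (point b), given the custom precedence declared above (point c).
infix operator **: PowerPrecedence

// Integer exponentiation by repeated multiplication.
// This sketch only supports non-negative exponents.
func ** (base: Int, exponent: Int) -> Int {
    precondition(exponent >= 0, "negative exponents are not supported in this sketch")
    var result = 1
    for _ in 0..<exponent {
        result *= base
    }
    return result
}

let scaled = 2 * 3 ** 2   // parsed as 2 * (3 ** 2)
let tower = 2 ** 3 ** 2   // parsed as 2 ** (3 ** 2)
```

The precedence belongs to the operator declaration rather than to any one function, which is why it comes up again in the import discussion below.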

Yes, but many of those custom operators won't have a clear meaning because operators are rarely limited to pre-existing symbols like "++++++++" (which doesn't mean anything at all AFAIK), so operators that are widely known within some field probably won't be widely known to the general public, which, IIUC, seems to be your standard for inclusion(?). Please let me know if that's not your position... I hate being misunderstood probably more than the next person, and I wouldn't want to be guilty of that myself.

The approach to operator handling in Swift is very intentional. IMO, it is well known that:

1) Operators can make code significantly easier to understand by reducing noise from complex expressions: writing x.matmul(y) is insane <https://www.python.org/dev/peps/pep-0465/> if you’re doing a lot of matrix multiplies.
2) Operators can be completely opaque to someone who doesn’t know them, and sometimes named functions are more clear.
3) Named functions can also sometimes be completely opaque if you don't know them, e.g. "let x = cholesky(y)"
4) Languages with fixed operator sets that also allow overloading (e.g. C++) end up with those operators being abused.
5) Some code can only be written and maintained by domain experts, and those experts often know the operators.

Swift’s approach is basically to say to users: “ok we allow overloaded operators, but at least if you encounter some operation that you don’t know… you know that you don’t know it”. If you encounter "if ¬x {“ or “a ∩ b” in some source code, at least you can command click, jump to the definition and read what it does: you aren’t misled into thinking that the expression is some familiar thing, but find out later it was overloaded to do something crazy (bitshifts for i/o? really??? :).

Set algebra is an illustrative example, because it is both used by people who are experts and people who are not. As far as policies go, I think it makes sense for Swift libraries to define operator-like things as named functions (e.g. “intersection") and also define operators (“∩”) which can optionally be used in source bases that want them for convenience. The compiler and language cannot know whether a code base is written and maintained by experts who know the symbols and who value their clarity (over the difficulty typing and recognizing them), and this approach allows maintainers of the codebase to pick their own policies.
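A minimal sketch of that policy, assuming a library that wants to offer both spellings. The `∩` and `¬` declarations and the precedence chosen for `∩` are illustrative assumptions, not something the standard library provides.

```swift
// Hypothetical operator declarations; neither is part of the standard library.
infix operator ∩: MultiplicationPrecedence
prefix operator ¬

// The operator forwards to the named method, so both spellings stay in sync.
func ∩ <Element: Hashable>(lhs: Set<Element>, rhs: Set<Element>) -> Set<Element> {
    return lhs.intersection(rhs)
}

// Logical negation spelled with the mathematical symbol.
prefix func ¬ (value: Bool) -> Bool {
    return !value
}

let evens: Set = [2, 4, 6, 8]
let small: Set = [1, 2, 3, 4]

let named = evens.intersection(small)   // for code bases that prefer names
let symbolic = evens ∩ small            // for code bases that prefer the symbol
let differ = ¬(named == symbolic)       // false: both spellings agree
```

Because the operator is only a thin wrapper, a code base that bans the symbol loses nothing, and one that prefers it can command-click through to the named method.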

Oh, yeah, I can't imagine a situation in which I'd think it'd be a good idea to not define a named function to go along with a unicode operator. I'm mainly concerned that we not limit the people in 5) unless we need to. And to be clear, if we actually need to, then I'm fine with doing that... It's just that -- like I said earlier in this message -- I don't clearly understand why this is a problem. That said, there are multiple people more knowledgeable than I am on the topic who are telling me I'm wrong about something that I'm kinda surprised there was disagreement about in the first place... I think it's probably time for me to stop pressing an issue that I apparently don't fully understand and wait to see what's in the actual proposal.

I do think that Ethan’s suggestion upthread is interesting, which suggests considering something like:
import matrixlib (operators: [ᵀ,·,⊗])

Three concerns I see:
- Requiring them today would be a source incompatibility with Swift 4

If we leave the current "import everything" behavior as the default, why would it be a source-breaking change? We could also flip the syntax around and write something like "import matrixlib (operators: -[ᵀ,·,⊗])", where we're saying which operators we want to not import, because clearly (heh, famous last words in this thread) the default value would be the empty list, and an empty list of what not to import implies we should import them all. (I'm not arguing for or against either, just exploring the issue & syntax)

- Multiple modules can define operators, unclear whether this refers to the operator decl or implementations of operators.

If it weren't for precedences, I'd guess it should probably only refer to the implementations, since there's no point (that I can see) of wanting to not import an operator unless you want to use it for something else and the two functions' signatures create ambiguities. Since an operator's precedence is attached to the operator itself rather than the function, it'd probably be better to just pretend it doesn't exist at all.

- Imports are per-module, not per-source-file, so this couldn’t be used to “user-partition” the identifier and operator space. It could be a way to make it clear that the user is opting into these explicitly.

I haven't followed them as closely as I probably should've, but do the recent threads on submodules change anything WRT this?

- Dave Sweeris

···

On Oct 2, 2017, at 10:06 PM, Chris Lattner <clattner@nondot.org> wrote:
On Oct 2, 2017, at 9:12 PM, David Sweeris via swift-evolution <swift-evolution@swift.org> wrote:

Ethan_Tira-Thompson · October 4, 2017, 12:47am

Keep in mind that Swift already goes far above and beyond in terms of operators

Yep, that's a large part of why I'm such a Swift fan :-D

Fortunately, no one is seriously proposing a major curtailing of the capabilities here, we’re just trying to rationalize the operator set, which is a bit of a mess at present.

in that: (a) it allows overloading of almost all standard operators; (b) it permits the definition of effectively an infinite number of custom operators using characters found in standard operators; (c) it permits the definition of custom precedences for custom operators; and (d) it additionally permits the use of a wide number of Unicode characters for custom operators. Most systems programming languages don't even allow (a), let alone (b) or (c). Even dramatically curtailing (d) leaves Swift with an unusually expansive support for custom operators.

Yes, but many of those custom operators won't have a clear meaning because operators are rarely limited to pre-existing symbols like "++++++++" (which doesn't mean anything at all AFAIK), so operators that are widely known within some field probably won't be widely known to the general public, which, IIUC, seems to be your standard for inclusion(?). Please let me know if that's not your position... I hate being misunderstood probably more than the next person, and I wouldn't want to be guilty of that myself.

The approach to operator handling in Swift is very intentional. IMO, it is well known that:

1) Operators can make code significantly easier to understand by reducing noise from complex expressions: writing x.matmul(y) is insane <https://www.python.org/dev/peps/pep-0465/> if you’re doing a lot of matrix multiplies.
2) Operators can be completely opaque to someone who doesn’t know them, and sometimes named functions are more clear.
3) Named functions can also sometimes be completely opaque if you don't know them, e.g. "let x = cholesky(y)"
4) Languages with fixed operator sets that also allow overloading (e.g. C++) end up with those operators being abused.
5) Some code can only be written and maintained by domain experts, and those experts often know the operators.

Well said!

I think comments about poorly chosen operator symbols (e.g. invisible or visually similar) are a bit of a red herring. From a malicious angle, an attacker would rather overload a standard operator than introduce an exotic one, which would draw more attention and doesn’t have pre-existing usage. From a maintenance angle, choosing a poor operator symbol is akin to choosing a poorly named identifier. That’s really for the users to figure out themselves; we shouldn’t try to legislate the equivalent of “no single letter variables”.

Swift’s approach is basically to say to users: “ok we allow overloaded operators, but at least if you encounter some operation that you don’t know… you know that you don’t know it”. If you encounter "if ¬x {“ or “a ∩ b” in some source code, at least you can command click, jump to the definition and read what it does: you aren’t misled into thinking that the expression is some familiar thing, but find out later it was overloaded to do something crazy (bitshifts for i/o? really??? :).

Exactly! If someone has already decided they want an operator for something, better to let them have a choice of a new symbol rather than necessarily overloading one of the standard ones because we’ve restricted the set. I think most of the bad reputation of custom operators comes from the surprising results of developers being forced to shoehorn the “standard” operators into new roles that confuse readers who think they know what an operator is doing. I.e., it’s not the operator that’s dangerous so much as the overloading.

Set algebra is an illustrative example, because it is both used by people who are experts and people who are not. As far as policies go, I think it makes sense for Swift libraries to define operator-like things as named functions (e.g. “intersection") and also define operators (“∩”) which can optionally be used in source bases that want them for convenience. The compiler and language cannot know whether a code base is written and maintained by experts who know the symbols and who value their clarity (over the difficulty typing and recognizing them), and this approach allows maintainers of the codebase to pick their own policies.

I do think that Ethan’s suggestion upthread is interesting, which suggests considering something like:
import matrixlib (operators: [ᵀ,·,⊗])

Three concerns I see:
- Requiring them today would be a source incompatibility with Swift 4

To clarify, I’m only suggesting the qualifier be required for “non-standard” operators, so the source incompatibility would be on par with whatever unicode cleanup is similarly reclassifying characters already in use.

In that vein, this suggestion would dovetail well with such a reclassification effort, as it would give an easy upgrade path for existing code that wants to continue using a particular character, and allows a fairly conservative set of “standard” operators to be whitelisted without sacrificing end-user expressibility, which simplifies the scope of the classification effort.

“Standard” operators could include sections of the mathematical plane even though they aren’t necessarily used by the standard library, if there is desire to reserve such characters exclusively for operators and never identifiers.

- Multiple modules can define operators, unclear whether this refers to the operator decl or implementations of operators.

Hmm, how are conflicting operator declarations handled today? (e.g. different precedence, associativity for the same fixity?)

My thinking is to import all declarations of that operator for a specified module (and so if the declaration isn’t imported, then implementations are hidden too). You would have to specifically import the operator for each module that provides it. If the user imports conflicting declarations, it’s just the same result as today.

And by “all declarations of that operator” I mean if we have a matrix library that defines ᵀ for combinations of matrix, vector, sparse matrix, etc., then the single "import matrixlib (operator: ᵀ)" statement makes all of those available, since we should expect the module to be giving a consistent interpretation of that operator. So in technical terms this is importing all declarations regardless of fixity; I'm not sure if it’s worth getting more granular about importing just prefix but not infix.

Conversely, if the operator isn’t imported, then it’s as if those declarations were all internal to the module, and avoids any conflicts.

So if module A declares an operator ¬ and another module B uses that as identifier, then the client resolves this at import. Either import ¬ from A and lose access to the identifier in B, or ignore the operator from A but retain access to the identifier in B. (Hopefully rational symbol choices would make this a rare situation on par with other global namespace collisions, and good modules should provide less exotic interface fallbacks as well.)

- Imports are per-module, not per-source-file, so this couldn’t be used to “user-partition” the identifier and operator space. It could be a way to make it clear that the user is opting into these explicitly.

Ahh nuts I actually thought imports were per-source-file!

So I guess an intra-module dependency for building the identifier/operator set is still too much of a performance hit? Isn't parsing already collecting all the imports from across the current module?

Well regardless, I’d be willing to live with repeating a per-file import statement for operator specification. A little quirky that the operator attribute only has a file-level scope, but clearly I don’t mind respecifying imports in each file anyway (I kind of feel this is good form so you can move source files around and the dependencies come along.)

Alternatively, we could make a new per-file import specific for operators, orthogonal to module imports, although using similar syntax:
  import operator ᵀ
  import operator ·
  import operator ⊗

I thought about just using “import ᵀ”, but I don’t want to risk confusion with a module name. Might be nice to pass a collection, but since we’re not doing that with module imports then don't start now.

These would be applied similarly to the previous proposal, but would toggle operator visibility globally. This basically just controls the operator character set and nothing more. So the implementation should be really simple: all imported operator declarations are already loaded as normal, but the compiler can only make the connection if the character was listed as an operator in the current file. Initially I wanted an operator declaration in the current file to also serve as updating the character set so you don’t need both, but I see an argument to always require the import (for non-standard operators) just to surface guidance when an import will be needed to access that operator from elsewhere.

Does that help? I liked having per-module control for conflict resolution and also auditing where operators come from, but (naively) this seems like a really simple implementation and if there is demand we could still add a syntax for module-specific filtering later.

-Ethan

···

On Oct 2, 2017, at 10:07 PM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:
On Oct 2, 2017, at 9:12 PM, David Sweeris via swift-evolution <swift-evolution@swift.org> wrote:

jrose · October 4, 2017, 12:54am

- Imports are per-module, not per-source-file, so this couldn’t be used to “user-partition” the identifier and operator space. It could be a way to make it clear that the user is opting into these explicitly.

Imports actually are per-source-file. At times they appear to be per-module because we don't check this properly for extensions, but those aren't strictly per-module either; you also have access to extensions in any module you've recursively imported. Which is terrible.

Jordan

Chris_Lattner · October 4, 2017, 4:44am

I don’t think this is something we have to try hard to avoid. It is true that some characters look similar, particularly in some fonts, but this isn’t new:

  let a1 = 42
  let al = 12
  let b = al + a1

There is a fundamental difference between similar characters and characters that are meant to be visually identical. People judge the quality of a font by its Unicode support, and that means that only "low-quality" fonts would render, say, LATIN CAPITAL LETTER T and GREEK CAPITAL LETTER TAU differently.

As Dave DeLong mentions downthread, it really isn’t different. Different codepoints can look similar in some fonts and different in other fonts. That is reality for a broad range of codepoints.
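To make the T/tau case concrete, here is a small sketch showing that the two letters are distinct code points however a font draws them. The `codePoints(of:)` helper is a hypothetical name introduced for this illustration.

```swift
// Two letters that render identically in many fonts.
let latinT = "T"            // U+0054 LATIN CAPITAL LETTER T
let greekTau = "\u{03A4}"   // U+03A4 GREEK CAPITAL LETTER TAU

// Hypothetical helper: lists the code points of a string as U+XXXX labels.
func codePoints(of text: String) -> [String] {
    return text.unicodeScalars.map { scalar in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        let padding = String(repeating: "0", count: max(0, 4 - hex.count))
        return "U+" + padding + hex
    }
}

// String comparison uses canonical equivalence, and these two letters
// are not canonically equivalent, so the strings compare as different.
let sameText = (latinT == greekTau)
let latinPoints = codePoints(of: latinT)
let greekPoints = codePoints(of: greekTau)
```

The compiler sees two unrelated identifiers; only the human reader is at risk of conflating them.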

All I’m saying is that it isn’t a priority (to me) to “solve” this problem.

If there were real code that was maliciously shadowing to try to cause confusion, then you have a more serious problem on your hands than someone accidentally misunderstanding which one to use.

I'm not sure I understand. If the "more serious problem" you're talking about is that your popular project is a valuable target to subvert, then there is no question that being backdoored would be more serious than people not reading your code right. I don't see how it pushes the problem out of scope, though.

As a security guy, I take my role of thinking about how anything can be abused very seriously. Backdoored open source projects turn up every now and then.

This code is backdoored. I challenge you to spot the bug:

func shellEscape(_ args: [String]) -> [String]?
func isWhitelisted(_ tool: String) -> Bool

func execute(externalTool: String, parameters: [String]) {
    if isWhitelisted(externalTool), let pаrameters = shellEscape(parameters) {
        print("Running tool \(pаrameters[0])")
        system(parameters.joined(separator: " "))
    }
}

All I’m saying is that we shouldn’t complicate the design to solve this problem (IMO). If it falls out of the solution somehow (e.g. just disallow invisible characters) then that’s great of course!

How did you identify the bug in the snippet from above? Is it practical enough that you would, for instance, recommend that the server group do that test on every PR that they receive going forward?

I think that it's hard to build something meaningful without making it look suspicious. It's already kind of fishy that my shellEscape function returns an Optional, and people will eventually figure out that the parameters are not, in fact, shell-escaped. Still, I feel that it should be recognized that security is more than buffer overflows and integer overflows, and if there ever is an underhanded Swift code contest, that'll be my entry.

I don’t see the bug, but I assume you’re doing something evil with “parameters" not getting shell escaped.

Seriously, I get the issue you’re trying to draw attention to, and I respect the fact that good security folks are paranoid :-). That doesn’t make “fixing this” a high priority in itself though. We have to weigh the cost and benefit of solving this. If “solving” this problem makes swift substantially more complex, difficult to specify, or difficult to maintain, or if it prevents this proposal from going through, then MHO is that this is not worth fixing.

The rationale is that if you have evil actors trying to subvert your code, there are a lot easier ways to do so than through this mechanism. Additionally, if you are someone who cares so much, it is trivial to define additional checker tools outside of the compiler to check for such things - just as some people enforce house style rules (like no use of “x!”) in external tools. All problems do not have to be solved through language design.
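As a rough idea of what such an external checker might look like, here is a crude sketch that flags word-like tokens mixing ASCII letters with non-ASCII letters. The function name, the rule itself, and the sample input are all assumptions made up for this illustration; a real tool would more likely lean on Unicode's confusables data than on this heuristic.

```swift
// Hypothetical lint rule: report word-like tokens that mix ASCII letters
// with non-ASCII letters, a common sign of a lookalike character.
func mixedScriptTokens(in source: String) -> [String] {
    var flagged: [String] = []
    var current = ""
    var sawASCIILetter = false
    var sawNonASCIILetter = false

    // Closes the token being built and records it if it mixes both kinds.
    func finishToken() {
        if sawASCIILetter && sawNonASCIILetter {
            flagged.append(current)
        }
        current = ""
        sawASCIILetter = false
        sawNonASCIILetter = false
    }

    for scalar in source.unicodeScalars {
        let isLetter = scalar.properties.isAlphabetic
        let isDigit = scalar.properties.numericType != nil
        if isLetter || isDigit || scalar == "_" {
            current.unicodeScalars.append(scalar)
            if isLetter {
                if scalar.isASCII {
                    sawASCIILetter = true
                } else {
                    sawNonASCIILetter = true
                }
            }
        } else {
            // Anything else (spaces, punctuation, operators) ends the token.
            finishToken()
        }
    }
    finishToken()
    return flagged
}

// Made-up input: the first "total" hides U+0430 CYRILLIC SMALL LETTER A.
let sample = "let tot\u{0430}l = total + 1"
let flagged = mixedScriptTokens(in: sample)
```

This rule also flags legitimate identifiers such as "naïve", so a team that writes non-English identifiers would need something more careful. That is an argument for keeping the policy in a configurable tool rather than in the language.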

-Chris

···

On Oct 3, 2017, at 9:43 AM, Félix Cloutier <felixcloutier@icloud.com> wrote:

Chris_Lattner · October 4, 2017, 4:47am

Keep in mind that Swift already goes far above and beyond in terms of operators

Yep, that's a large part of why I'm such a Swift fan :-D

Fortunately, no one is seriously proposing a major curtailing of the capabilities here, we’re just trying to rationalize the operator set, which is a bit of a mess at present.

I guess I don't really understand why it's currently "a bit of a mess”.

Read the motivation/inconsistency section of:

https://github.com/xwu/swift-evolution/blob/7c2c4df63b1d92a1677461f41bc638f31926c9c3/proposals/NNNN-refining-identifier-and-operator-symbology.md

# Refining identifier and operator symbology

* Proposal: [SE-NNNN](NNNN-refining-identifier-and-operator-symbology.md)
* Authors: [Xiaodi Wu](https://github.com/xwu), [Jacob Bandes-Storch](https://github.com/jtbandes), [Erica Sadun](https://github.com/erica), Jonathan Shapiro, [João Pinheiro](https://github.com/joaopinheiro)
* Review Manager: TBD
* Status: **Awaiting review**

## Introduction

This proposal refines and rationalizes Swift's identifier and operator symbology.

Set algebra is an illustrative example, because it is both used by people who are experts and people who are not. As far as policies go, I think it makes sense for Swift libraries to define operator-like things as named functions (e.g. “intersection") and also define operators (“∩”) which can optionally be used in source bases that want them for convenience. The compiler and language cannot know whether a code base is written and maintained by experts who know the symbols and who value their clarity (over the difficulty typing and recognizing them), and this approach allows maintainers of the codebase to pick their own policies.

Oh, yeah, I can't imagine a situation in which I'd think it'd be a good idea to not define a named function to go along with a unicode operator. I'm mainly concerned that we not limit the people in 5) unless we need to. And to be clear, if we actually need to, then I'm fine with doing that... It's just that -- like I said earlier in this message -- I don't clearly understand why this is a problem.

Sure, that’s fair. This is an issue we’ve been tracking since the Swift 2.x (!) days, so there is a lot of context that is probably not immediately obvious if you haven’t been following it since then. The proposal link above talks about the damage that we still carry.

I do think that Ethan’s suggestion upthread is interesting, which suggests considering something like:
import matrixlib (operators: [ᵀ,·,⊗])

Three concerns I see:
- Requiring them today would be a source incompatibility with Swift 4

If we leave the current "import everything" behavior as the default, why would it be a source-breaking change?

It’s not clear to me that leaving that as the default would actually add anything useful, because if that is the default, no one will opt into typing more gunk into their code for no reason.

-Chris

···

On Oct 3, 2017, at 4:05 PM, David Sweeris <davesweeris@mac.com> wrote:

On Oct 2, 2017, at 10:06 PM, Chris Lattner <clattner@nondot.org> wrote:
On Oct 2, 2017, at 9:12 PM, David Sweeris via swift-evolution <swift-evolution@swift.org> wrote:

Chris_Lattner · October 4, 2017, 4:51am

I do think that Ethan’s suggestion upthread is interesting, which suggests considering something like:
import matrixlib (operators: [ᵀ,·,⊗])

FWIW, I think you should split discussion of this off into a new subthread, because people are probably not paying attention to it buried in this one.

Three concerns I see:
- Requiring them today would be a source incompatibility with Swift 4

To clarify, I’m only suggesting the qualifier be required for “non-standard” operators, so the source incompatibility would be on par with whatever unicode cleanup is similarly reclassifying characters already in use.

In that vein, this suggestion would dovetail well with such a reclassification effort, as it would give an easy upgrade path for existing code that wants to continue using a particular character, and allows a fairly conservative set of “standard” operators to be whitelisted without sacrificing end-user expressibility, which simplifies the scope of the classification effort.

“Standard” operators could include sections of the mathematical plane even though they aren’t necessarily used by the standard library, if there is desire to reserve such characters exclusively for operators and never identifiers.

I’m not sure how this would work. The people objecting to operators seem to be saying that they don’t know what they do. I’m not sure how to rectify that, but the idea of making it more explicit in code is interesting, and maybe there is a way to tie it in somehow. That’s why I’m saying it is potentially interesting to explore this, even though it isn’t immediately apparent to me how this can help with the concerns.

- Imports are per-module, not per-source-file, so this couldn’t be used to “user-partition” the identifier and operator space. It could be a way to make it clear that the user is opting into these explicitly.

Ahh nuts I actually thought imports were per-source-file!

Jordan clarified this downthread.

-Chris

···

On Oct 3, 2017, at 5:48 PM, Ethan Tira-Thompson via swift-evolution <swift-evolution@swift.org> wrote:

David_Sweeris · October 4, 2017, 5:45pm

Oh! I didn't realize the proposal had already been written! Yeah, that clears things up quite a bit, thanks for posting it :-)

Xiaodi Wu, I’m sorry I ever doubted you :-)

- Dave Sweeris

···

On Oct 3, 2017, at 21:47, Chris Lattner <clattner@nondot.org> wrote:

On Oct 3, 2017, at 4:05 PM, David Sweeris <davesweeris@mac.com> wrote:

On Oct 2, 2017, at 10:06 PM, Chris Lattner <clattner@nondot.org> wrote:

On Oct 2, 2017, at 9:12 PM, David Sweeris via swift-evolution <swift-evolution@swift.org> wrote:

hexdreamer · November 2, 2021, 7:34pm

Some research on the topic of malicious whitespace: https://www.trojansource.codes/trojan-source.pdf