emoji in source code failed to compile in linux ubuntu

Bee · December 4, 2015, 5:00am

Hi,

I just installed Swift on my Linux machine. The installation went well and a
Hello World program compiled successfully. However, when I put an
emoji inside the string, compilation failed. The error message is:

$ swiftc test.swift
test.swift:1:32: error: invalid UTF-8 found in source file
print("Hello World from Swift! ")
test.swift:1:35: error: invalid UTF-8 found in source file
print("Hello World from Swift! ")
test.swift:1:1: error: 'print' is unavailable: Please wrap your tuple
argument in parentheses: 'print((...
))'
print("Hello World from Swift! ")
Swift.print:2:13: note: 'print' has been explicitly marked unavailable here
public func print<T>(_: T)
^
$

I put the emoji after the exclamation mark in the string.

Any hints?

Thank you.

···

--
-Bee-

gribozavr · December 4, 2015, 6:29am

Hi Bee,

> Hi,
>
> Just successfully installed Swift on my Linux machine. It went well and a Hello World program compiled successfully. However, when I tried to put an emoji inside of the string, it failed compiling. The error message is:
>
> $ swiftc test.swift
> test.swift:1:32: error: invalid UTF-8 found in source file

This might be a clue... how did you insert the emoji?

> print("Hello World from Swift! ")
> test.swift:1:35: error: invalid UTF-8 found in source file
> print("Hello World from Swift! ")
> test.swift:1:1: error: 'print' is unavailable: Please wrap your tuple argument in parentheses: 'print((...
> ))'
> print("Hello World from Swift! ")
> Swift.print:2:13: note: 'print' has been explicitly marked unavailable here
> public func print<T>(_: T)
> ^

Could you post the output of:

$ xxd < test.swift

since it looks like mail has mangled the emoji and I'm not seeing it
in your message.

Dmitri

···

On Thu, Dec 3, 2015 at 9:00 PM, Bee <bee.ography@gmail.com> wrote:

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com>*/

Bee · December 4, 2015, 10:41am

Here's the screen capture: http://i.imgur.com/0ud0hE4.png

alblue · December 4, 2015, 10:46am

If you run the xxd command it will print out the hex values, which is the important thing (as opposed to a screenshot, which doesn’t have that information). For example, the file might be encoded as UTF-16 or some other variation.

Can you open Terminal and run

xxd < test.swift

and then copy/paste the text into a mail response? Then we can figure out what’s wrong.
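
If xxd is not available, the same check can be sketched in Swift. This is only an illustration: `hexString` and `byteOrderMark` are hypothetical helper names, not standard library functions, and the byte-order-mark test covers just the three most common marks.

```swift
// Hypothetical helper: formats bytes as space-separated, lowercase, two-digit hex.
func hexString(_ bytes: [UInt8]) -> String {
    return bytes.map { byte -> String in
        let hex = String(byte, radix: 16)
        return byte < 0x10 ? "0" + hex : hex
    }.joined(separator: " ")
}

// Hypothetical helper: names the encoding suggested by a leading byte-order mark,
// or returns nil when the data does not start with one of these three marks.
func byteOrderMark(_ bytes: [UInt8]) -> String? {
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) { return "UTF-8" }
    if bytes.starts(with: [0xFF, 0xFE]) { return "UTF-16LE" }
    if bytes.starts(with: [0xFE, 0xFF]) { return "UTF-16BE" }
    return nil
}

// Dump the bytes of a short, ASCII-only source line.
let sourceLine = "print(\"Hi\")"
print(hexString(Array(sourceLine.utf8)))
```

Any byte of 0x80 or above in the dump belongs to a non-ASCII character, and those are the bytes worth comparing against the expected UTF-8 encoding.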

Thanks,

Alex

_______________________________________________
swift-users mailing list
swift-users@swift.org
https://lists.swift.org/mailman/listinfo/swift-users

Bee · December 4, 2015, 10:52am

Here it is:

$ xxd < test.swift

0000000: 7072 696e 7428 2248 656c 6c6f 2057 6f72 print("Hello Wor
0000010: 6c64 2066 726f 6d20 5377 6966 7421 20ed ld from Swift! .
0000020: a0bd edb8 8020 2229 ..... ")

I asked the Koding developers and they said their editor has problems handling
Unicode. They believe the problem is with the editor, not Swift. They
suggest avoiding Unicode characters directly in the code and using
Unicode escape sequences instead. So I think the problem has been explained, although
the ideal solution is still being worked on.

Thank you.

Bee · December 4, 2015, 10:56am

Oh, I'm sorry… I was also talking with another member of the list,
but we forgot to include the list address as a recipient, so our
conversation went private.

FYI, I installed Swift on my Koding VM (64-bit Ubuntu) from koding.com. It
has an online code editor, and that editor still has problems handling Unicode.

I hope this clears things up.

Thank you.

snej · December 4, 2015, 7:25pm

By total coincidence, just this week I implemented support for the above in a JSON parser, so I’m suddenly an expert ;-)

Based on what I know, both of those encodings are correct. UTF-16 surrogate pairs can occur in decoded UTF-8 and need to be decoded in turn into Unicode codepoints; according to the Wikipedia page on UTF-8:

“In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and 983040 4-byte sequences.”

In other words, during UTF-8 encoding, some Unicode codepoints need to be broken into surrogate pairs. Therefore a UTF-8 decoder needs to recognize surrogate pairs and reassemble them into a single codepoint. That’s what’s not happening in this case.

In short, I think this is a bug in the UTF-8 parser being used by the Swift compiler. But as my expertise here is only a few days old, I’m prepared to be corrected.

—Jens

···

On Dec 4, 2015, at 3:28 AM, Quinn The Eskimo! <eskimo1@apple.com> wrote:

I can explain that. U+1F600 is encoded in UTF-16 as d83d de00. If you encode each of these separately as UTF-8, you get ed a0 bd followed by ed b8 80. That's not the correct way to encode U+1F600 as UTF-8, hence the failure.

eskimo · December 4, 2015, 11:28am

I can explain that. U+1F600 is encoded in UTF-16 as d83d de00. If you encode each of these separately as UTF-8, you get ed a0 bd followed by ed b8 80. That's not the correct way to encode U+1F600 as UTF-8, hence the failure.
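
A minimal Swift sketch of that double encoding may make the arithmetic easier to follow. The function names are hypothetical, invented for illustration. The code splits a scalar into its UTF-16 surrogate pair and then applies the three-byte UTF-8 bit layout to each surrogate separately. The bytes in the dump (ed a0 bd ed b8 80) are what this produces for U+1F600; for U+1F603 the last byte would be 83 instead.

```swift
// Splits a supplementary-plane scalar (U+10000...U+10FFFF) into its UTF-16
// surrogate pair. Returns nil for scalars that fit in one UTF-16 unit.
func surrogatePair(for scalar: Unicode.Scalar) -> (high: UInt16, low: UInt16)? {
    guard scalar.value >= 0x10000 else { return nil }
    let offset = scalar.value - 0x10000
    let high = UInt16(0xD800 + (offset >> 10))
    let low = UInt16(0xDC00 + (offset & 0x3FF))
    return (high, low)
}

// Applies the three-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx) to a
// 16-bit value without any validity check. For a surrogate the result is ill-formed.
func threeByteLayout(_ unit: UInt16) -> [UInt8] {
    return [
        0xE0 | UInt8(unit >> 12),
        0x80 | UInt8((unit >> 6) & 0x3F),
        0x80 | UInt8(unit & 0x3F),
    ]
}

// The mistaken path: encode each surrogate on its own.
func doubleEncoded(_ scalar: Unicode.Scalar) -> [UInt8]? {
    guard let pair = surrogatePair(for: scalar) else { return nil }
    return threeByteLayout(pair.high) + threeByteLayout(pair.low)
}

// The well-formed encoding, taken from the standard library for comparison.
func wellFormedUTF8(_ scalar: Unicode.Scalar) -> [UInt8] {
    return Array(String(Character(scalar)).utf8)
}
```

For U+1F600, `doubleEncoded` gives the six bytes seen in the file, while `wellFormedUTF8` gives the four bytes f0 9f 98 80 that the compiler expects.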

Share and Enjoy

···

On 4 Dec 2015, at 11:12, Alex Blewitt <alex.blewitt@gmail.com> wrote:

However it looks like the character is being encoded in your file as: ed a0 bd ed b8 80 which I’m not sure where that’s come from.

--
Quinn "The Eskimo!" <http://www.apple.com/developer/>
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

gribozavr · December 4, 2015, 7:47pm

No, surrogate code points cannot appear in a UTF-8 stream; they can
only appear in UTF-16.

http://www.unicode.org/versions/Unicode8.0.0/ch02.pdf page 30, table 2-3.
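
The same rule can be observed from Swift itself. The sketch below is illustrative only: `isWellFormedUTF8` is a hypothetical name, and it relies on the standard library's `UTF8` codec and the failable `Unicode.Scalar(_: UInt32)` initializer rather than on anything the compiler's lexer is known to use.

```swift
// True when the standard library's UTF-8 decoder reads the whole input
// without reporting an error.
func isWellFormedUTF8(_ bytes: [UInt8]) -> Bool {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue:
            continue          // one scalar decoded, keep reading
        case .emptyInput:
            return true       // reached the end with no errors
        case .error:
            return false      // ill-formed sequence found
        }
    }
}

// A surrogate code point is not a Unicode scalar value, so this is nil.
let highSurrogate = Unicode.Scalar(0xD83D as UInt32)

// Bytes from the dump versus the four-byte encoding of U+1F600.
let fromDump: [UInt8] = [0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80]
let fourByte: [UInt8] = [0xF0, 0x9F, 0x98, 0x80]
print(highSurrogate == nil, isWellFormedUTF8(fromDump), isWellFormedUTF8(fourByte))
```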

Dmitri

···

On Fri, Dec 4, 2015 at 11:25 AM, Jens Alfke <jens@mooseyard.com> wrote:

In other words, during UTF-8 encoding, some Unicode codepoints need to be broken into surrogate pairs. Therefore a UTF-8 decoder needs to recognize surrogate pairs and reassemble them into a single codepoint. That’s what’s not happening in this case.

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com>*/

alblue · December 4, 2015, 11:12am

Yes, I think there is a problem in the way that the Unicode is being entered. The UTF-8 byte sequence for a smiley such as U+1F603 should look like f0 9f 98 83, followed by 0a for the newline.

However, it looks like the character is encoded in your file as ed a0 bd ed b8 80, and I’m not sure where that has come from. It does look like the editor that you’re using isn’t converting the Unicode to the right bytes.

In the meantime, you can print out a smiley by entering a Unicode escape sequence as follows:

print("Hello \u{1f603}")
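
As a small sketch of why the escape helps: the source line itself stays pure ASCII, so the editor has nothing to mis-encode, while the string still holds a single well-formed character at run time. The variable names here are only for illustration.

```swift
// The escape produces one Unicode scalar, U+1F603.
let smiley = "\u{1F603}"

// Encodings computed by the standard library.
let utf8Bytes = Array(smiley.utf8)     // f0 9f 98 83
let utf16Units = Array(smiley.utf16)   // d83d de03

// The source text of the workaround contains only ASCII bytes.
let sourceText = "print(\"Hello \\u{1f603}\")"
let isPureASCII = sourceText.utf8.allSatisfy { $0 < 0x80 }

print("Hello \(smiley)", utf8Bytes.count, utf16Units.count, isPureASCII)
```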

Alex

snej · December 4, 2015, 8:22pm

Thanks for referencing the spec; that was useful to me. However, this looks like an issue where spec conformance clashes with the real-world desire for compatibility. After some more reading I found that it’s actually pretty common to find UTF-8 containing surrogate pairs, mostly due to software that was using 16-bit Unicode before it got formalized as UTF-16. According to Wikipedia <https://en.wikipedia.org/wiki/CESU-8>*, Java encodes UTF-8 this way, as do Oracle and MySQL databases.

If there are enough text editors/processors that do this, it might be a good idea for Swift’s lexer to accept surrogate pairs even if they’re technically invalid.
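
Independently of what the lexer accepts, a file saved this way could be repaired before compiling. The following is a rough sketch under that assumption, not anything the Swift toolchain provides: `repairingCESU8` is a hypothetical helper that rewrites each CESU-8 style surrogate pair as the four-byte UTF-8 form and copies every other byte unchanged, including unpaired surrogates.

```swift
// Hypothetical repair helper: rewrites CESU-8 style surrogate pairs
// (ED A0..AF xx  ED B0..BF xx) as four-byte UTF-8; other bytes are copied as-is.
func repairingCESU8(_ bytes: [UInt8]) -> [UInt8] {
    var output: [UInt8] = []
    output.reserveCapacity(bytes.count)
    var i = 0
    while i < bytes.count {
        if i + 5 < bytes.count,
           bytes[i] == 0xED,
           (0xA0...0xAF).contains(bytes[i + 1]),
           (0x80...0xBF).contains(bytes[i + 2]),
           bytes[i + 3] == 0xED,
           (0xB0...0xBF).contains(bytes[i + 4]),
           (0x80...0xBF).contains(bytes[i + 5]) {
            // Recover the two 16-bit surrogates from their three-byte layouts.
            let high = 0xD000 | (UInt32(bytes[i + 1] & 0x3F) << 6) | UInt32(bytes[i + 2] & 0x3F)
            let low = 0xD000 | (UInt32(bytes[i + 4] & 0x3F) << 6) | UInt32(bytes[i + 5] & 0x3F)
            // Combine them into one scalar value in U+10000...U+10FFFF.
            let scalar = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            // Emit the four-byte UTF-8 encoding of that scalar.
            output.append(0xF0 | UInt8(scalar >> 18))
            output.append(0x80 | UInt8((scalar >> 12) & 0x3F))
            output.append(0x80 | UInt8((scalar >> 6) & 0x3F))
            output.append(0x80 | UInt8(scalar & 0x3F))
            i += 6
        } else {
            output.append(bytes[i])
            i += 1
        }
    }
    return output
}
```

Applied to the six bytes ed a0 bd ed b8 80 from the dump, this yields f0 9f 98 80. Doing the conversion as an explicit, separate step keeps the compiler's input strictly well-formed.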

—Jens

* CESU-8 - Wikipedia

gribozavr · December 4, 2015, 8:40pm

Violating the spec in this part would cause security issues, since
different implementations would disagree on the character data.
Sorry, but we are not doing this. The Unicode spec is unambiguous
here.

Dmitri

Lily_Ballard · December 4, 2015, 9:36pm

The unicode spec is very clear that conforming implementations must
reject any alternative utf-8 encoding, even if it appears to make
sense, because not doing so is a security vulnerability. More
generally, any code point that encodes as 1-3 code units can actually
be rewritten to be encoded as 4 code units (or really, any number of
code units up to 4 that's not less than the canonical encoding). These
are no more valid than encoding surrogate pair code points in utf-8.
The reason why this is considered a security vulnerability is because
any code that attempts to validate or filter a utf-8 stream may not
recognize these alternative encodings, and the validation/filtering can
be bypassed. For example, think of an HTML form validator that
automatically converts < into &lt;. If you passed in an alternative
encoding such as C0 BC, the validator may not recognize it, but if the
browser then interpreted this ill-formed sequence as being the same as
0x3C, you'd have a trivial XSS attack.

More generally, section 3.9 of the Unicode 8.0 standard explicitly lists
the well-formed UTF-8 byte sequences, and this list does not include any
encoding for the surrogate pair range. And conforming implementations
must reject any ill-formed sequence.
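
To make the C0 BC example concrete, here is a short Swift sketch. The variable names are illustrative. It first computes what a lenient decoder would get by blindly applying the two-byte layout, then shows what the standard library's `String(decoding:as:)` does with the same bytes, which is to substitute U+FFFD rather than produce a "<".

```swift
// The two bytes from the example: an overlong, ill-formed spelling of "<" (0x3C).
let overlong: [UInt8] = [0xC0, 0xBC]

// What a lenient decoder would compute from the layout 110xxxxx 10xxxxxx.
let lenientValue = (UInt32(overlong[0] & 0x1F) << 6) | UInt32(overlong[1] & 0x3F)

// The standard library repairs ill-formed input with U+FFFD instead.
let decoded = String(decoding: overlong, as: UTF8.self)

print(lenientValue == 0x3C)        // the lenient reading is "<"
print(decoded.contains("<"))       // the strict reading never yields "<"
```

A filter that looks for the byte 0x3C would not see it in the input, so safety depends on every later consumer rejecting the sequence in the same way.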

-Kevin Ballard
