puzzled with this charset/encoding -related behaviour

Alexandre Karpov
TL;DR: how do I run erl which understands Unicode?

Or, in more detail:

(Disclaimer: this official documentation got me really humbled:
http://www1.erlang.org/doc/apps/stdlib/unicode_usage.html
, and just a little bit scared =) )

Judging by my S/O question, which got 3 upvotes and no answers, I'm not the only one wondering:
https://stackoverflow.com/questions/46735539/erlang-regexp-matching-on-chinese-characters

Here's the gist of the problem:

57> "абв".

[1072,1073,1074]

The codes are the correct Unicode code points for the [Cyrillic] characters, which means my Terminal didn't fail to understand my keyboard's input =) but the Erlang shell didn't recognize the Terminal's input as printable characters. And it is my understanding that this is exactly why this call fails:

25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
** exception error: bad argument
     in function  re:run/3
        called as re:run([1081,1094,1091,46,97,115,100], "^(.*\\..*)$", [{capture,none}])

The reason why this came up is me trying the example from "Programming Erlang" where Joe gives you a lib_find module, and demonstrates reading of MP3 tags from files; because I tried looking for mp3 files on a path which had some Chinese characters in some filenames, this problem arose.

I've tried finding a way to run erl with a different charset (hoping for erl --charset=UTF8 or something), but only found references to file names, which doesn't sound very related to my case.

Thanks!

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: puzzled with this charset/encoding -related behaviour

Attila Rajmund Nohl
2017-10-14 4:21 GMT+02:00 Alexandre Karpov <[hidden email]>:

> TL;DR: how do I run erl which understands Unicode?
>
> Or, in more detail:
>
> (Disclaimer: this official documentation got me really humbled:
> http://www1.erlang.org/doc/apps/stdlib/unicode_usage.html
> , and just a little bit scared =) )
>
> Judging by my S/O question, which got 3 upvotes and no answers, I'm not the
> only one wondering:
> https://stackoverflow.com/questions/46735539/erlang-regexp-matching-on-chinese-characters
>
> Here's the gist of the problem:
>
> 57> "абв".
>
> [1072,1073,1074]
>
> The codes are correct Unicode for the [Cyrillic] characters - which means my
> Terminal didn't fail to understand my keyboard's input =) but Erlang shell
> didn't recognize Terminal's input as printable characters. And it is my
> understanding that this is exactly why this call fails:
>
> 25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]). **
> exception error: bad argument in function re:run/3 called as
> re:run([1081,1094,1091,46,97,115,100], "^(.*\\..*)$", [{capture,none}])

Try

re:run(<<"йцу.asd"/utf8>>, xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
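If the name arrives as a list of code points (for example from a directory listing) rather than as a literal, the same idea can be sketched by converting it to a UTF-8 binary first. This is only an illustration: the module name utf8_subject_demo and the function has_dot/1 are made up for this example, and the pattern is the one that sh_to_awk("*.*") produced in the error message above.

```erlang
-module(utf8_subject_demo).
-export([has_dot/1]).

%% Hypothetical helper: convert character data to a UTF-8 binary and
%% run the regexp on the resulting bytes (no unicode option needed,
%% because every byte is below 256).
-spec has_dot(unicode:chardata()) -> match | nomatch | {error, bad_unicode}.
has_dot(Chars) ->
    case unicode:characters_to_binary(Chars) of
        Bin when is_binary(Bin) ->
            re:run(Bin, "^(.*\\..*)$", [{capture, none}]);
        _ ->
            %% characters_to_binary/1 returns an error or incomplete
            %% tuple when the input is not valid character data.
            {error, bad_unicode}
    end.
```

Calling utf8_subject_demo:has_dot([1081,1094,1091,46,97,115,100]) should give match, since the subject is now plain bytes.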

Re: puzzled with this charset/encoding -related behaviour

zxq9-2
On Saturday, 14 October 2017 10:12:19 Attila Rajmund Nohl wrote:

> Try
>
> re:run(<<"йцу.asd"/utf8>>, xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).

FYI: the SO question has an answer now.

The regex execution needs to be put into unicode mode:

re:run(<<"йцу.asd"/utf8>>, xmerl_regexp:sh_to_awk("*.*"), [unicode, {capture, none}]).
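
A slightly fuller sketch of the same unicode option, this time with a list of code points as the subject and with the captured parts returned, so you can see what was matched. The module name re_unicode_demo, the function split_name/1 and the two-group pattern are assumptions made for this example; they are not from the SO answer.

```erlang
-module(re_unicode_demo).
-export([split_name/1]).

%% Hypothetical helper: split "base.ext" at the last dot.
%% With the unicode option, re accepts code points above 255 in the
%% subject, and {capture, all_but_first, list} returns the groups as
%% lists of code points.
-spec split_name(unicode:chardata()) -> {ok, string(), string()} | nomatch.
split_name(Name) ->
    Opts = [unicode, {capture, all_but_first, list}],
    case re:run(Name, "^(.*)\\.(.*)$", Opts) of
        {match, [Base, Ext]} -> {ok, Base, Ext};
        nomatch -> nomatch
    end.
```

For the subject [1081,1094,1091,46,97,115,100] this should return {ok,[1081,1094,1091],"asd"}; without unicode in the option list the same call raises the bad argument error from the original post.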

-Craig

Re: puzzled with this charset/encoding -related behaviour

Dan Gudmundsson-2
In reply to this post by Attila Rajmund Nohl
The list version works once you add the unicode option:

re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}, unicode]).

The binary one matches, I think, because re then works on bytes and not on UTF-8 characters.

Also, the Erlang shell doesn't know whether a list of integers is meant as a list of integers or as a string,
since both are represented by the same list of integers.

So it tries to guess. By default it guesses that a list containing integers larger than 255
is not a string but a list of integers. You can change that with:

(w)erl +pc unicode

1> "йцу.asd".
"йцу.asd"
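
The two guessing rules can be tried directly, without restarting the shell, using the io_lib predicates. The sketch below is only an approximation of the idea, not the shell's actual code; the module name printable_demo and the function classify/1 are invented for this example.

```erlang
-module(printable_demo).
-export([classify/1]).

%% Hypothetical helper: report under which rule a term looks like text.
%%   latin1  - printable with only ISO Latin-1 characters (the default guess)
%%   unicode - printable only if code points above 255 are allowed
%%   none    - not printable text under either rule
-spec classify(term()) -> latin1 | unicode | none.
classify(Term) ->
    case io_lib:printable_latin1_list(Term) of
        true ->
            latin1;
        false ->
            case io_lib:printable_unicode_list(Term) of
                true -> unicode;
                false -> none
            end
    end.
```

Here classify("abc") gives latin1, classify([1081,1094,1091,46,97,115,100]) gives unicode, and classify([1,2,3]) gives none. io_lib:printable_list/1 is the variant that follows whatever +pc setting the node was started with.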

/Dan



Re: puzzled with this charset/encoding -related behaviour

Alexandre Karpov
Thanks everyone! I didn't realize until this conversation how much more important strings-as-binaries are, compared to simple "strings". Everything _works_ now, of course, but I don't think my understanding has caught up 100%.

"by default it guesses that lists containing integers larger than 255
is not a string but a list of integers" <<< this really set some things straight

But suffer me this follow-up question, Dan. Using +pc unicode indeed gave me a shell that represents lists of integers using the characters found in Unicode mapping; so now, in error messages, I see arguments reported more clearly:

7> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
** exception error: bad argument
     in function  re:run/3
        called as re:run("йцу.asd","^(.*\\..*)$",[{capture,none}])


If I use the binary-string representation, it works just fine, _even_without_ /utf8:

3> re:run(<<"普通話.asd">>, xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).

match

Note that the call above was executed in the shell started _without_ the +pc unicode, and the binary does _not_ have the /utf8>> thingy... This means my understanding is still lacking... binaries are honest and good, strings are fake and evil, but +pc unicode seems to help a little with fake strings... while using binary-strings doesn't _always_ require the /utf8 ... what is this sorcery?!
=)



Re: puzzled with this charset/encoding -related behaviour

Dan Gudmundsson-2


First see the difference here:

4> <<"普通話.asd">>.
<<110,26,113,46,97,115,100>>
i.e. the code points are just truncated to values below 256 (only the low 8 bits of each are kept).

5> <<"普通話.asd"/utf8>>.
<<230,153,174,233,128,154,232,169,177,46,97,115,100>>
Here the code points are UTF-8 encoded.
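
The same two results can be reproduced from the code points themselves, which takes the terminal and source encoding out of the picture. This is a small sketch; the module name binary_encoding_demo and the functions truncated/1 and utf8/1 are made up for this example.

```erlang
-module(binary_encoding_demo).
-export([truncated/1, utf8/1]).

%% Mimics <<"...">> without a type specifier: every code point becomes
%% a single byte, keeping only its low 8 bits.
-spec truncated([char()]) -> binary().
truncated(CodePoints) ->
    << <<C:8>> || C <- CodePoints >>.

%% Mimics <<"..."/utf8>>: every code point is encoded as a UTF-8
%% sequence of one to four bytes.
-spec utf8([char()]) -> binary().
utf8(CodePoints) ->
    << <<C/utf8>> || C <- CodePoints >>.
```

With the code points of "普通話.asd", that is [26222,36890,35441,46,97,115,100], truncated/1 gives <<110,26,113,46,97,115,100>> and utf8/1 gives <<230,153,174,233,128,154,232,169,177,46,97,115,100>>, the same bytes as in the two shell results above.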

Don't give up on lists; they can be useful and fast for some uses.

Also, your regexp matches anything with a dot in it, so it doesn't matter whether the string is handled as a UTF-8 encoded binary or as plain bytes:
you get a match with either representation.

To understand Unicode, play around and try to make it work (on both lists and binaries) with some fancier regexps. Try to match
something with a Unicode character in it, and capture the result so you can see what you matched.
Print your input and result strings with io:format("~ts: ~w~n",[Str, Str]), so you can see both the actual string and its representation,
and test with both binaries and lists as representations.
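
As a starting point for that kind of experiment, here is a sketch that returns the two renderings instead of printing them, so they are easy to compare. The module name show_demo and the function show/1 are invented for this example; io_lib:format/2 is used only so the result can be inspected as a value.

```erlang
-module(show_demo).
-export([show/1]).

%% Hypothetical helper: render a string (list or UTF-8 binary) twice.
%%   ~ts gives the text as characters (code points),
%%   ~w  gives the raw Erlang term, never printed as a string.
-spec show(unicode:chardata()) -> {string(), string()}.
show(Str) ->
    Text = lists:flatten(io_lib:format("~ts", [Str])),
    Raw = lists:flatten(io_lib:format("~w", [Str])),
    {Text, Raw}.
```

Both show([1081,1094,1091]) and show(<<208,185,209,134,209,131>>) should return the same text, [1081,1094,1091], while the raw part is "[1081,1094,1091]" for the list and "<<208,185,209,134,209,131>>" for the binary.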

/Dan


