Regexp Matching on Unicode

Zachary Kessin-2
Hi all,

I am hitting a bit of a wall here. I am building a lexer with leex and I want to match on Unicode characters. There is a regex class, \p{Letter}, but it does not seem to work in Erlang. What I really want is a way to say "match a letter, but not a digit", so \w would not work. Any ideas?

--
Zach Kessin
Twitter: @zkessin
Skype: zachkessin

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Regexp Matching on Unicode

José Valim-2
Make sure to escape the backslash of the \p property escape (write "\\p" in an Erlang string) and to pass the [unicode] option when compiling, and it should be good to go:

28> {ok, Reg} = re:compile("\\p{L}{5}", []).
{ok,{re_pattern,0,0,0,
                <<69,82,67,80,77,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
                  255,255,...>>}}
29> re:run(<<"こんにちは"/utf8>>, Reg).
nomatch

30> {ok, RegUni} = re:compile("\\p{L}{5}", [unicode]).
{ok,{re_pattern,0,1,0,
                <<69,82,67,80,77,0,0,0,0,8,0,0,1,0,0,0,255,255,255,255,
                  255,255,...>>}}
31> re:run(<<"こんにちは"/utf8>>, RegUni).
{match,[{0,15}]}



José Valim
Skype: jv.ptec
Founder and Director of R&D
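
For the original "a letter, but not a digit" requirement, the session above can be wrapped in a small helper. This is only a sketch: the module name uni_letters and the function all_letters/1 are made up for illustration, and it assumes the input is a valid UTF-8 binary.

```erlang
-module(uni_letters).
-export([all_letters/1]).

%% Returns true when Bin (a UTF-8 binary) consists of one or more
%% Unicode letters and nothing else. \p{L} covers letters only, so
%% digits are rejected, unlike \w, which also accepts digits and "_".
all_letters(Bin) when is_binary(Bin) ->
    %% "\\p" in Erlang source is the two characters \p in the pattern.
    %% \A and \z anchor the match to the start and end of the subject.
    {ok, MP} = re:compile("\\A\\p{L}+\\z", [unicode]),
    %% With {capture, none}, re:run/3 returns just match or nomatch.
    re:run(Bin, MP, [{capture, none}]) =:= match.
```

Note that the {0,15} in the transcript is a byte offset and a byte length, not a character count: the five hiragana characters take three bytes each in UTF-8.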

Re: Regexp Matching on Unicode

Hugo Mills-2
In reply to this post by Zachary Kessin-2

   I think that if you want Unicode support, you need to write your own
lexer or use something other than leex, which is a bit limited in what
it supports. I went through this earlier this year and ended up writing
my own, partly for that reason and partly because of the way I
wanted to process block comments.

   Hugo.

--
Hugo Mills             | "There's more than one way to do it" is not a
hugo@... carfax.org.uk | commandment. It is a dire warning.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
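
A hand-written lexer along the lines Hugo describes could lean on re for the Unicode-aware part. The following is a minimal sketch, not Hugo's actual lexer: the module name uni_tokens and the token tags word and other are invented for illustration, and it assumes a valid UTF-8 binary and that re offsets are byte positions in that binary.

```erlang
-module(uni_tokens).
-export([tokens/1]).

%% Splits a UTF-8 binary into runs of letters ({word, Bin}) and runs
%% of everything else ({other, Bin}).
tokens(Bin) when is_binary(Bin) ->
    %% anchored forces each match to start exactly at the given offset.
    {ok, Word}  = re:compile("\\p{L}+", [unicode, anchored]),
    {ok, Other} = re:compile("\\P{L}+", [unicode, anchored]),
    tokens(Bin, 0, {Word, Other}, []).

tokens(Bin, Pos, _Pats, Acc) when Pos >= byte_size(Bin) ->
    lists:reverse(Acc);
tokens(Bin, Pos, {Word, Other} = Pats, Acc) ->
    %% Offsets passed to and returned by re are byte positions in Bin.
    case re:run(Bin, Word, [{offset, Pos}]) of
        {match, [{Pos, Len}]} ->
            Tok = {word, binary:part(Bin, Pos, Len)},
            tokens(Bin, Pos + Len, Pats, [Tok | Acc]);
        nomatch ->
            %% Not a letter here, so a run of non-letters must match.
            {match, [{Pos, Len}]} = re:run(Bin, Other, [{offset, Pos}]),
            Tok = {other, binary:part(Bin, Pos, Len)},
            tokens(Bin, Pos + Len, Pats, [Tok | Acc])
    end.
```

A real lexer would add more token classes (numbers, whitespace, comments) as further anchored patterns tried in order at each position.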

Re: Regexp Matching on Unicode

José Valim-2
In reply to this post by José Valim-2
Apologies, just after Hugo Mills' reply I noticed that your question was about leex and not re.

leex does not support Unicode character classes such as \p or \w. It does accept Unicode input, as well as Unicode characters as literals in your rules, such as [á-ú], the pound sign, etc.


