Erlang re:run regular expression match problem

Mathias
Hi there,

I'm trying to figure out how Erlang's re:run function works.

When executing this:
1> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>", "<point(?:\s|.)*\/>").
{match,[{0,54}]}

I can see that it gives me a match on the complete XML representation
{match,[{0,54}]}.

But what I would really like is one match per element, something
similar to {match,[{0,27},{27,27}]}.

So the output would be read like this:
{0,27} gives the first XML element complete with its attributes,
<point x="12" y="2" z="4"/>, and
{27,27} gives the remaining element.

If anyone can spot why my regexp <point(?:\s|.)*\/> is failing and point me
in the right direction, it will be greatly appreciated.

I know about xmerl but for my trivial case it seems like overkill.

Thx in advance.

BR,
Mathias Stalås

dlfen
Try this:
re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>", "<point[^<point]*\/>",[global]).


________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:[hidden email]


Mathias
Works like a charm!

Many thanks dlfen!

BR,
Mathias

Hynek Vychodil-2
I would not be so quick with the thanks if I were you. It doesn't do what
you want; it works only by accident in this particular example. [^<point]*
means any character except '<', 'p', 'o', 'i', 'n', 't'. [^<]* would work
the same way in this particular example.
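
To make the "works only by accident" point concrete, here is a minimal sketch. The module name charclass_demo and the name attribute in the second input are made up for illustration; only the pattern itself comes from the thread.

```erlang
-module(charclass_demo).
-export([find_points/1]).

%% [^<point] is a negated character class: it rejects any single
%% character that is '<', 'p', 'o', 'i', 'n' or 't'. It does not mean
%% "anything except the word <point".
find_points(Subject) ->
    re:run(Subject, "<point[^<point]*/>", [global]).
```

With the sample from the first post, find_points/1 returns {match,[[{0,27}],[{27,27}]]}, because none of the excluded letters occur inside the attributes. With "<point name=\"a\"/>" it returns nomatch, since the 'n' in name ends the character class before "/>" is reached.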

This would work much more generally:

re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>", "<point.*?\/>",[global])

But in any case you should use an XML parser for XML parsing: XML cannot
be described by a regular grammar, so a regular expression is not the
proper tool for it. You will end up with an error-prone solution.

--
--Hynek (Pichi) Vychodil

Mathias
Hi,

That expression was actually the first one I tried, the only difference being that I ran it from within my application. This was before I knew that file:read_file and UTF-8 don't blend well; I used file:read_file to read in my UTF-8 encoded file. I only tried dlfen's expression from the shell, and there it succeeded. At the point of posting I had tried so many different solutions that I was tired and just wanted my rather simple laboratory application to work. Later I found out that neither the sample he gave nor the one you proposed works from within my program.

Further investigation has led me to believe that re:run/3 has an issue with strings that lack linefeeds.
Putting some non-valid XML in a file and running my rather simple program always yields [[{0,162}]], that is, a single match from the first character to the closing '>' of the last point (see attached doc). It doesn't split the elements up as I expected, which is either the intended behaviour of the module, which I find a bit odd, or (god forbid) a programming fault on my part. Here is the code I use:

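-module(mock).
-export([start/0, read_file/1, decode_data/1, find_pattern/3]).

%% Read point.xml, decode it as UTF-8 and return all pattern matches.
start() ->
    Bin = read_file("point.xml"),
    UnicodeString = decode_data(Bin),
    find_pattern(UnicodeString, "<point.*\/>", [unicode, global]).

%% Return the file contents as a binary, or [] if the file cannot be read.
read_file(File) ->
    case file:read_file(File) of
        {ok, Bin} -> Bin;
        _ -> []
    end.

%% Convert UTF-8 data to a list of Unicode code points, or [] on failure.
decode_data(Data) ->
    case unicode:characters_to_list(Data, utf8) of
        {error, Encoded, Rest} ->
            %% io:format/2 takes its arguments as a single list.
            io:format("Caught error: ~w ~w~n", [Encoded, Rest]),
            [];
        {incomplete, Encoded, Rest} ->
            %% Input ended in the middle of a multi-byte character.
            io:format("Incomplete input: ~w ~w~n", [Encoded, Rest]),
            [];
        List ->
            List
    end.

%% Run the pattern and return the list of matches, or [] if there are none.
find_pattern(Str, Pattern, Options) ->
    case re:run(Str, Pattern, Options) of
        {match, Part} ->
            io:format("find_pattern: ~w~n", [Part]),
            Part;
        nomatch -> []
    end.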

However, adding a linefeed '\n' after each element in the doc gives the expected result:
[[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
which looks strange to me. I haven't read up on the re module that much, but this is my experience.

I have resorted to using xmerl_xpath, which seems to do the job. I guess that, coming from the Java world, I am a bit spoiled by strong support for string manipulation; doing the above there would have taken me less than 10 minutes.

Anyway, thank you both for the effort.

BR,
Mathias Stalås



Attachment: point.xml (226 bytes)

Jesse Gumm
The problem in that last example is that by default * is greedy and .
doesn't match the linefeed (which is why putting \n at the end of each
element worked).

"<point.*\/>" will match from the first instance of "<point" to the very
last "/>" on the same line.

Changing it to

"<point.*?\/>" will make * act "ungreedy", so each match ends at the first
"/>" it finds.

Alternatively, you could use

"<point [^>]*\/>"; then you don't really have to worry about greediness at
all.
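
The three patterns discussed in this reply can be compared side by side. This is only a sketch: the module and function names are made up, and the backslash before "/" is dropped because it is not needed inside an Erlang string.

```erlang
-module(greedy_demo).
-export([greedy/1, lazy/1, negated/1]).

%% Greedy: .* takes as much as it can, so on a single line the match
%% runs from the first "<point" to the last "/>".
greedy(Subject) ->
    re:run(Subject, "<point.*/>", [global]).

%% Lazy: .*? stops at the first "/>" it can reach.
lazy(Subject) ->
    re:run(Subject, "<point.*?/>", [global]).

%% Negated class: [^>]* cannot run past a '>', so greediness is harmless.
negated(Subject) ->
    re:run(Subject, "<point [^>]*/>", [global]).
```

On the one-line sample from the first post, greedy/1 returns {match,[[{0,54}]]}, while lazy/1 and negated/1 both return {match,[[{0,27}],[{27,27}]]}. If the two elements are separated by "\n", greedy/1 returns {match,[[{0,27}],[{28,27}]]}, because . does not cross the linefeed by default. That agrees with the offsets reported for the file with linefeeds.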

-Jesse


--
Jesse Gumm
Sigma Star Systems
414.940.4866
[hidden email]
http://www.sigma-star.com

Mathias
Thank you for the clarification Jesse.
Nicely explained. I have tried it and it worked.

BR,
Mathias Stalås


Morten Krogh
In reply to this post by Jesse Gumm
Hi

Take care, '>' is an allowed character in attribute values, e.g.,

<point id="/>"/>

is valid xml.

If you control the input yourself it is no problem, of course.
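
A small sketch of what goes wrong with that input. The module name and the quote_aware/1 pattern are assumptions added for illustration, not something proposed in the thread; the refinement handles double-quoted values only and is still no substitute for a real parser.

```erlang
-module(quoted_demo).
-export([naive/1, quote_aware/1]).

%% The pattern from the previous reply: stops at the first '>' it meets,
%% even when that '>' sits inside an attribute value.
naive(Subject) ->
    re:run(Subject, "<point [^>]*/>", [global]).

%% Hypothetical refinement: a double-quoted value is consumed as one
%% unit, so a '>' inside the quotes cannot end the match early.
quote_aware(Subject) ->
    re:run(Subject, "<point(?:[^>\"]|\"[^\"]*\")*/>", [global]).
```

For "<point id=\"/>\"/>", which is 16 characters long, naive/1 returns {match,[[{0,13}]]} and so cuts the element off inside the attribute value, while quote_aware/1 returns {match,[[{0,16}]]}.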

Morten.



Mathias
I'm just goofing around at the moment, so no worries.
I have found xmerl_xpath to be a good friend.
It's a pity (IMO) that the documentation on how to use these libraries is,
gently speaking, sparse.
On the other hand, I'm glad that they exist, and there are some good examples
in other open source projects, with some nice code where one can pick
up bits and pieces and figure things out.
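
For readers looking for a starting point, here is a minimal sketch of the xmerl_xpath route. It is not the code used in this thread: the module name, the points/1 function and the <points> wrapper element in the usage note are assumptions, and xmerl needs well-formed input with a single root element.

```erlang
-module(xmerl_demo).
-export([points/1]).

%% Record definitions for #xmlElement{} and #xmlAttribute{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse a well-formed XML string and return one [{Name, Value}] list
%% per <point> element found anywhere in the document.
points(XmlString) ->
    {Doc, _Rest} = xmerl_scan:string(XmlString),
    [[{A#xmlAttribute.name, A#xmlAttribute.value}
      || A <- E#xmlElement.attributes]
     || E <- xmerl_xpath:string("//point", Doc)].
```

Calling xmerl_demo:points("<points><point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/></points>") should give two lists, each holding pairs such as {x,"12"}, with attribute names as atoms and values as strings.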

BR,
Mathias Stalås



On Sun, Oct 31, 2010 at 10:08 PM, Morten Krogh <[hidden email]> wrote:

> Hi
>
> Take care, '>' is an allowed character in attribute values, e.g.,
>
> <point id="/>"/>
>
> is valid xml.
>
> If you control the input yourself it is no problem, of course.
>
> Morten.
>
>
>
> On 10/31/10 5:58 PM, Jesse Gumm wrote:
>
>> The problem in that last example is that by default * is greedy and .
>> doesn't match the linefeed (which is why putting \n at the end of each
>> thing worked).
>>
>> "<point.*\/>" will match from the first instance of "<point" to the very
>> last "/>".
>>
>> Changing it to:
>>
>> "<point.*?\/>" will make * act "ungreedy" and only match up to the first
>> "/>" it finds, then end.
>>
>> Alternatively, you could use
>>
>> "<point [^>]*\/>"; then you don't really have to worry about greediness
>> at all.
>>
>> -Jesse
>>
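
The three patterns Jesse lists can be compared side by side in a small module.
This is only a sketch: the module name greedy_demo and its function names are
hypothetical, and the input is the two-element string from the original post
(54 characters, 27 per element).

```erlang
-module(greedy_demo).
-export([input/0, greedy/0, ungreedy/0, negated/0]).

%% The two <point/> elements from the first message, with no linefeed.
input() ->
    "<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>".

%% Greedy: .* runs to the end and backtracks to the LAST "/>".
greedy() -> re:run(input(), "<point.*/>", [global]).

%% Ungreedy: .*? stops at the FIRST "/>" after each "<point".
ungreedy() -> re:run(input(), "<point.*?/>", [global]).

%% Negated class: [^>]* cannot cross a '>', so greediness is harmless.
negated() -> re:run(input(), "<point [^>]*/>", [global]).
```

greedy/0 gives {match,[[{0,54}]]}, while ungreedy/0 and negated/0 both give
{match,[[{0,27}],[{27,27}]]}, one entry per element.
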
>>
>> On Sun, Oct 31, 2010 at 11:41 AM, Mathias<[hidden email]>
>>  wrote:
>>
>>> Hi,
>>>
>>> That expression was actually the first one I tried, the only
>>> difference being that I ran it from within my application. This was
>>> before I knew that file:read_file and UTF-8 don't blend well. I used
>>> file:read_file to read in my UTF-8 encoded file... I only tried dlfen's
>>> expression from the CLI, and there it succeeded. At the point of posting I
>>> had tried so many different solutions that I was tired and just wanted my
>>> rather simple laboratory application to work. Later I found out that
>>> neither the sample he gave nor the one proposed by you worked from within
>>> my program.
>>>
>>> Further investigation has led me to believe that re:run/3 has an issue
>>> with strings that lack linefeeds.
>>> Putting some non-valid XML in a file and running my rather simple program
>>> always yields [[{0,162}]], a single match from the first char to the last
>>> point's (see attached doc) ending char '>'. It doesn't split them up as
>>> expected, which is either the intended module behaviour, which I find a bit
>>> odd, or a (god forbid) programming fault on my part. Here is the code I use:
>>>
>>> -module(mock).
>>> -export([start/0, read_file/1, decode_data/1, find_pattern/3]).
>>>
>>> start() ->
>>>         Bin = read_file("point.xml"),
>>>         UnicodeString = decode_data(Bin),
>>>         NodeList = find_pattern(UnicodeString, "<point.*\/>", [unicode, global]),
>>>         NodeList.
>>>
>>> %% Return the file contents as a binary, or [] if the file can't be read.
>>> read_file(File) ->
>>>         case file:read_file(File) of
>>>                 {ok, Bin} ->  Bin;
>>>                 _ ->  []
>>>         end.
>>>
>>> %% Decode UTF-8 bytes into a list of code points; [] on bad input.
>>> decode_data(Data) ->
>>>         case unicode:characters_to_list(Data, utf8) of
>>>                 {error, Encoded, Rest} ->
>>>                         %% io:format/2 takes its arguments as one list
>>>                         io:format("Caught error: ~w ~w~n", [Encoded, Rest]),
>>>                         [];
>>>                 {incomplete, Encoded, Rest} ->
>>>                         io:format("Incomplete data: ~w ~w~n", [Encoded, Rest]),
>>>                         [];
>>>                 List ->
>>>                         List
>>>         end.
>>>
>>> %% Return the list of match positions, or [] when nothing matches.
>>> find_pattern(Str, Pattern, Options) ->
>>>         case re:run(Str, Pattern, Options) of
>>>                 {match, Part} ->
>>>                         io:format("find_pattern: ~w~n", [Part]),
>>>                         Part;
>>>                 nomatch ->  []
>>>         end.
>>>
>>> However, adding a linefeed '\n' after each entity in the doc gives the
>>> expected result:
>>> [[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
>>> which looks strange to me. I haven't read up on the re module that much,
>>> but this is my experience.
>>>
>>> I have resigned myself to using xmerl_xpath, which seems to do the job. I
>>> guess that, coming from the Java world, I am a bit spoiled by strong support
>>> for string manipulation; doing the above there would have taken me less
>>> than 10 minutes.
>>>
>>> Anyway, thank you both for the effort.
>>>
>>> BR,
>>> Mathias Stalås
>>>
>>>
>>>
>>>
>>> On Sat, Oct 30, 2010 at 8:47 PM, Hynek Vychodil <[hidden email]> wrote:
>>>
>>>> I would not thank him if I were you. It doesn't do what you want and
>>>> works only by accident in this particular example. [^<point]* means
>>>> any char except '<', 'p', 'o', 'i', 'n', 't'. [^<]* would work the
>>>> same way in this particular example.
>>>>
>>>> This would work much more generally:
>>>>
>>>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>","<point.*?\/>",[global])
>>>>
>>>> but anyway you should use an XML parser for XML parsing, because XML is
>>>> not parseable by a regular grammar, so a regular expression is not the
>>>> proper tool for it. You will end up with an error-prone solution.
>>>>
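
Hynek's point about the character class shows up as soon as an attribute name
or value contains one of the letters p, o, i, n or t. A short sketch; the
module name charclass_demo and the sample element with a name attribute are
invented for the example.

```erlang
-module(charclass_demo).
-export([char_class/1, ungreedy/1]).

%% [^<point] is a set of single characters, not the word "<point":
%% it rejects '<', 'p', 'o', 'i', 'n' and 't' wherever they occur.
char_class(Xml) -> re:run(Xml, "<point[^<point]*/>", [global]).

%% The ungreedy pattern has no such restriction on the attribute text.
ungreedy(Xml) -> re:run(Xml, "<point.*?/>", [global]).
```

With "<point name=\"top\"/>" as input, char_class/1 returns nomatch because
the 'n' of name ends the class immediately, while ungreedy/1 returns
{match,[[{0,19}]]}.
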
>>>> On Sat, Oct 30, 2010 at 1:34 PM, Mathias<[hidden email]>
>>>>  wrote:
>>>>
>>>>> Works like a charm!
>>>>>
>>>>> Many thanks dlfen!
>>>>>
>>>>> BR,
>>>>> Mathias
>>>>>
>>>>> On Sat, Oct 30, 2010 at 1:03 PM, dlfen<[hidden email]>  wrote:
>>>>>
>>>>>  try this.
>>>>>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
>>>>>> z=\"14\"/>","<point[^<point]*\/>",[global]).
>>>>>>
>>>>>>
>>>>>> 在 2010-10-30,下午6:44, Mathias 写道:
>>>>>>
>>>>>>  Hi there,
>>>>>>>
>>>>>>> I'm trying to figure out how Erlangs re:run module works.
>>>>>>>
>>>>>>> When executing this::
>>>>>>> 1>  re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
>>>>>>> z=\"14\"/>","<point(?:\s|.)*\/>").
>>>>>>> {match,[{0,54}]}
>>>>>>>
>>>>>>> I can see that it gives me a match on the complete XML representation
>>>>>>> {match,[{0,54}]}.
>>>>>>>
>>>>>>> But what I really would like to do is for it to give me a subset of
>>>>>>> matches on each entity similar to {match,[{0,26},{27, 26}]}.
>>>>>>>
>>>>>>> so the output would yield  something like this:
>>>>>>> 0-26 gives the first xml entity complete with it's attributes <point
>>>>>>> x="12" y="2" z="4"/> and
>>>>>>> match 27,26 gives the remaining entity.
>>>>>>>
>>>>>>> If anyone can spot why my regexp:<point(?:\s|.)*\/> is failing and
>>>>>>> guide me in the right direction closer to find the solution it will be
>>>>>>> greatly appreciated.
>>>>>>>
>>>>>>> I know about xmerl but for my trivial case it seems like overkill.
>>>>>>>
>>>>>>> Thx in advance.
>>>>>>>
>>>>>>> BR,
>>>>>>> Mathias Stalås
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> --Hynek (Pichi) Vychodil
>>>>
>>>> Analyze your data in minutes. Share your insights instantly. Thrill
>>>> your boss.  Be a data hero!
>>>> Try GoodData now for free: www.gooddata.com
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>