|
Hi there,
I'm trying to figure out how Erlang's re:run function works. When executing this:

    1> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>", "<point(?:\s|.)*\/>").
    {match,[{0,54}]}

I can see that it gives me a single match on the complete XML string, {match,[{0,54}]}.

What I would really like is one match per element, similar to {match,[{0,26},{27,26}]}: the first match covering the first XML element complete with its attributes, <point x="12" y="2" z="4"/>, and the second covering the remaining element.

If anyone can spot why my regexp <point(?:\s|.)*\/> is failing and point me toward a solution, it would be greatly appreciated.

I know about xmerl, but for my trivial case it seems like overkill.

Thanks in advance.

BR,
Mathias Stalås
|
Try this:

    re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>", "<point[^<point]*\/>", [global]).

On 2010-10-30, at 6:44 PM, Mathias wrote:
> If anyone can spot why my regexp:<point(?:\s|.)*\/> is failing and guide me
> in the right direction closer to find the solution it will be greatly
> appreciated.

________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:[hidden email]
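The effect of the global option in the reply above can be seen by running the same pattern with and without it. The following is a minimal sketch, not part of the original thread: the module name global_demo and the function run/0 are made up for illustration, and {capture, all, list} is added only to show the matched text rather than offsets.

```erlang
-module(global_demo).
-export([run/0]).

%% Contrast a single match with the global option, using the pattern
%% from the reply above on the thread's two-element sample string.
run() ->
    Xml = "<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>",
    Pattern = "<point[^<point]*/>",
    %% Without global, re:run/2 stops after the first match.
    {match, First} = re:run(Xml, Pattern),
    %% With global, every non-overlapping match is returned as
    %% a list of {Offset, Length} tuples, one list per match.
    {match, Offsets} = re:run(Xml, Pattern, [global]),
    %% {capture, all, list} returns the matched text instead of offsets.
    {match, Strings} = re:run(Xml, Pattern, [global, {capture, all, list}]),
    {First, Offsets, Strings}.
```

Each element in the sample string is 27 characters long, so the two matches sit at offsets 0 and 27, which is consistent with the {0,54} the original pattern reported for the whole string.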
|
Works like a charm!
Many thanks dlfen!

BR,
Mathias

On Sat, Oct 30, 2010 at 1:03 PM, dlfen <[hidden email]> wrote:
> try this.
> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
> z=\"14\"/>", "<point[^<point]*\/>",[global]).
|
I would not be thanking him if I were you. It doesn't do what you want; it works only by accident in this particular example. [^<point]* means any run of characters other than '<', 'p', 'o', 'i', 'n', 't'. [^<]* would work the same way in this particular example.

This would work much more generally:

    re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>", "<point.*?\/>", [global])

But in any case you should use an XML parser for parsing XML: XML cannot be described by a regular grammar, so a regular expression is not the proper tool for the job. You will end up with an error-prone solution.

On Sat, Oct 30, 2010 at 1:34 PM, Mathias <[hidden email]> wrote:
> Works like a charm!
>
> Many thanks dlfen!

--
--Hynek (Pichi) Vychodil
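The point about [^<point] being a character class rather than "anything except the text <point" can be checked with an input that contains one of the excluded letters. This is a small sketch under assumptions: the module name charclass_demo and the name attribute are invented for the example and do not come from the thread.

```erlang
-module(charclass_demo).
-export([run/0]).

%% Show that [^<point] excludes single characters, not the word "<point".
run() ->
    %% The attribute name contains 'n', which the class excludes.
    Subject = "<point name=\"a\"/>",
    %% The class stops at the 'n' of name=, so "/>" is never reached.
    ClassResult = re:run(Subject, "<point[^<point]*/>", [global]),
    %% The lazy quantifier accepts any character up to the first "/>".
    LazyResult = re:run(Subject, "<point.*?/>", [global]),
    {ClassResult, LazyResult}.
```

Here the character-class pattern should return nomatch, while the lazy pattern matches the whole 17-character element.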
|
Hi,
That expression was actually the first one I tried, the only difference being that I ran it from within my application. This was before I knew that file:read_file and UTF-8 don't blend well. I used file:read_file to read in my UTF-8 encoded file... I only tried dlfen's expression from the CLI, and there it succeeded. By the time I posted I had tried so many different solutions that I was tired and just wanted my rather simple laboratory application to work. Later I found out that neither the sample he gave nor the one you proposed worked from within my program.

Further investigation has led me to believe that re:run/3 has an issue with strings that lack linefeeds. Putting some non-valid XML in a file and running my rather simple program always yields [[{0,162}]], i.e. a single match from the first char to the closing '>' of the last point (see attached doc). It doesn't split the elements up as expected, which is either the intended module behaviour, which I find a bit odd, or (god forbid) a programming fault on my part. Here is the code I use:

    -module(mock).
    -export([start/0, read_file/1, decode_data/1, find_pattern/3]).

    start() ->
        Bin = read_file("point.xml"),
        UnicodeString = decode_data(Bin),
        NodeList = find_pattern(UnicodeString, "<point.*\/>", [unicode, global]),
        NodeList.

    %% Returns the file contents as a binary, or [] if the file cannot be read.
    read_file(File) ->
        case file:read_file(File) of
            {ok, Bin} -> Bin;
            _ -> []
        end.

    %% Decodes UTF-8 data into a list of code points, or [] on a decoding error.
    decode_data(Data) ->
        case unicode:characters_to_list(Data, utf8) of
            {error, Encoded, Rest} ->
                io:format("Caught error: ~w ~w~n", [Encoded, Rest]),
                [];
            List ->
                List
        end.

    %% Returns the list of matches, or [] if the pattern does not match.
    find_pattern(Str, Pattern, Options) ->
        case re:run(Str, Pattern, Options) of
            {match, Part} ->
                io:format("find_pattern: ~w~n", [Part]),
                Part;
            nomatch -> []
        end.

However, adding a linefeed '\n' after each element in the doc gives the expected result:

    [[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]

which to me looks strange. I haven't read up on the re module that much, but this is my experience.

I have resigned myself to using xmerl_xpath, which seems to do the job. I guess that, coming from the Java world, I am a bit spoiled by strong support for string manipulation; doing the above there would have taken me less than 10 minutes.

Anyway, thank you both for the effort.

BR,
Mathias Stalås

On Sat, Oct 30, 2010 at 8:47 PM, Hynek Vychodil <[hidden email]> wrote:
> I would not thanks on your place. It doesn't do what you want but
|
The problem in that last example is that by default * is greedy and . doesn't match the linefeed (which is why putting \n at the end of each element worked).

"<point.*\/>" will match from the first instance of "<point" to the very last "/>" on the same line.

Changing it to:

"<point.*?\/>"

will make * act "ungreedy", so each match ends at the first "/>" it finds.

Alternatively, you could use

"<point [^>]*\/>"

then you don't really have to worry about greediness.

-Jesse

On Sun, Oct 31, 2010 at 11:41 AM, Mathias <[hidden email]> wrote:
> However, adding a linefeed '\n' after each entity in the doc will give the
> expected result:
> [[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
> which to me looks strange.

--
Jesse Gumm
Sigma Star Systems
414.940.4866
[hidden email]
http://www.sigma-star.com
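The greedy, ungreedy, and negated-class behaviour described in this reply can be compared side by side. What follows is a sketch only: the module name greedy_demo and run/0 are hypothetical, and the sample string is the two-element input from the start of the thread rather than the poster's point.xml file, which is not shown.

```erlang
-module(greedy_demo).
-export([run/0]).

%% Compare greedy, lazy, and negated-class patterns on one-line input,
%% then show why a linefeed between elements changes the greedy result.
run() ->
    P1 = "<point x=\"12\" y=\"2\" z=\"4\"/>",
    P2 = "<point x=\"4\" y=\"2\" z=\"14\"/>",
    OneLine = P1 ++ P2,
    %% Greedy: .* runs to the end of the line, then backs up to the last "/>".
    {match, Greedy} = re:run(OneLine, "<point.*/>", [global]),
    %% Lazy: .*? stops at the first "/>" after each "<point".
    {match, Lazy} = re:run(OneLine, "<point.*?/>", [global]),
    %% Negated class: [^>]* cannot run past the '>' that closes the element.
    {match, Negated} = re:run(OneLine, "<point [^>]*/>", [global]),
    %% With a linefeed between elements, "." stops at the line end by
    %% default, so even the greedy pattern yields one match per line.
    {match, GreedyLines} = re:run(P1 ++ "\n" ++ P2, "<point.*/>", [global]),
    {Greedy, Lazy, Negated, GreedyLines}.
```

The last result starts its second match at offset 28 rather than 27 because of the linefeed, the same spacing seen in the [{0,27}],[{28,27}],... output quoted above.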
|
Thank you for the clarification Jesse.
Nicely explained. I have tried it and it worked.

BR,
Mathias Stalås

On Sun, Oct 31, 2010 at 5:58 PM, Jesse Gumm <[hidden email]> wrote:
> The problem in that last example is that by default * is greedy and .
> doesn't match the linefeed (which is why putting \n at the end of each thing
> worked)
|
In reply to this post by Jesse Gumm
Hi
Take care, '>' is an allowed character in attribute values, e.g.,

<point id="/>"/>

is valid XML.

If you control the input yourself it is no problem, of course.

Morten.

On 10/31/10 5:58 PM, Jesse Gumm wrote:
> Alternatively, you could use
>
> "<point [^>]*\/>" then you don't really have to worry about greediness or
> not.
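To see the pitfall with '>' inside an attribute value, both patterns from the earlier reply can be run against that example element. This is an illustrative sketch: the module name attr_demo and the third, quote-aware pattern are assumptions added here, not something proposed in the thread, and that pattern only handles double-quoted attribute values.

```erlang
-module(attr_demo).
-export([run/0]).

%% Run three patterns against an element whose attribute value contains "/>".
run() ->
    %% 16 characters: <point id="/>"/>
    Subject = "<point id=\"/>\"/>",
    %% Both simple patterns stop at the "/>" inside the attribute value.
    Lazy = re:run(Subject, "<point.*?/>"),
    Negated = re:run(Subject, "<point [^>]*/>"),
    %% Hypothetical quote-aware pattern: consume whole name="value" pairs,
    %% so a "/>" inside double quotes cannot end the match early.
    QuoteAware = re:run(Subject, "<point(?:\\s+[^\\s=]+=\"[^\"]*\")*\\s*/>"),
    {Lazy, Negated, QuoteAware}.
```

The first two results cover only the first 13 characters, which is the truncation being warned about; the third covers all 16. Even the quote-aware version ignores single-quoted values, comments, and CDATA, which is one more argument for a real parser when the input is not under your control.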
|
I'm just goofing around at the moment so no worries.
I have found xmerl_xpath to be a good friend. It's a pity (IMO) that the documentation on how to use these libraries is, gently speaking, sparse. On the other hand I'm glad that they exist, and there are some good examples in other open source projects with nice code, where one can pick up bits and pieces and figure things out.

BR,
Mathias Stalås

On Sun, Oct 31, 2010 at 10:08 PM, Morten Krogh <[hidden email]> wrote:
> Take care, '>' is an allowed character in attribute values, e.g.,
>
> <point id="/>"/>
>
> is valid xml.
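Since the thread ends with xmerl_xpath, here is a minimal sketch of what that approach might look like for the point elements. Several things are assumptions: the module name xpath_demo, the <points> wrapper element (a well-formed XML document needs a single root, which the original sample lacks), and the choice to return attributes as {Name, Value} pairs.

```erlang
-module(xpath_demo).
-export([run/0]).

%% Record definitions for #xmlElement{} and #xmlAttribute{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse a small document and collect the attributes of every <point>.
run() ->
    %% The <points> root is added for this sketch; it is not in the thread.
    Xml = "<points>"
          "<point x=\"12\" y=\"2\" z=\"4\"/>"
          "<point x=\"4\" y=\"2\" z=\"14\"/>"
          "</points>",
    %% xmerl_scan:string/1 returns the parsed document and any trailing text.
    {Doc, _Rest} = xmerl_scan:string(Xml),
    %% Select every point element anywhere in the document.
    Points = xmerl_xpath:string("//point", Doc),
    %% Attribute names come back as atoms, values as strings.
    [[{A#xmlAttribute.name, A#xmlAttribute.value}
      || A <- E#xmlElement.attributes]
     || E <- Points].
```

Compared with the regular expressions earlier in the thread, this leaves quoting, whitespace, and line breaks to the parser, at the cost of requiring well-formed input.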
