How to extract string between XML tags


Lloyd R. Prentice-2
Hello,

By now I should know how to do this. But I've fumbled for more time than I have to find an elegant solution.

Can anyone show a better way?

Example string: "<th>Firstname</th>"  % NOTE: could be any valid tag

My kludge:

extract_text(TaggedText) ->
  Split = re:split(TaggedText, "<"),
  Split2 = lists:nth(2, Split),
  Split3 = binary_to_list(Split2),
  Split4 = re:split(Split3, ">"),
  Split5 = lists:nth(2, Split4),
  binary_to_list(Split5).

Surely there's a better way.

Many thanks,

LRP

*********************************************
My books:

THE GOSPEL OF ASHES
http://thegospelofashes.com

Strength is not enough. Do they have the courage
and the cunning? Can they survive long enough to
save the lives of millions?  

FREEIN' PANCHO
http://freeinpancho.com

A community of misfits help a troubled boy find his way

AYA TAKEO
http://ayatakeo.com

Star-crossed love, war and power in an alternative
universe

Available through Amazon or by request from your
favorite bookstore


**********************************************

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: How to extract string between XML tags

Hugo Mills-2
On Tue, Sep 25, 2018 at 05:56:01PM -0400, [hidden email] wrote:

> [...]
> Surely there's a better way.
   XML isn't a regular language, so (in the general case) you can't(*)
use regexes and simple string splitting to parse XML correctly. If
you've got a very constrained input, where you know that it's going to
conform to specific patterns that you can match on, then you might get
away with it, but if that's not the case, you're barking up the wrong
tree with any kind of regex.
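
For the very constrained case described above (exactly one element, no attributes containing ">", no nested markup, no entities), a single anchored pattern can stand in for the chain of splits. This is only a sketch under those assumptions; the module and function names are made up for illustration, and anything outside that shape is rejected rather than guessed at.

```erlang
-module(constrained_extract).
-export([extract_text/1]).

%% Accepts only "<tag>text</tag>" where text contains no "<".
%% Returns {ok, Text} for that shape and error for everything else.
extract_text(Tagged) ->
    Pattern = "^<[^>]+>([^<]*)</[^>]+>$",
    case re:run(Tagged, Pattern, [{capture, all_but_first, list}]) of
        {match, [Text]} -> {ok, Text};
        nomatch -> error
    end.
```

Note that the pattern does not check that the closing tag matches the opening one, and nested input such as "<th>a<b>b</b></th>" returns error instead of a wrong answer.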

   The solution is to swallow the pain and use a proper XML library.
I've had good results with erlsom in my own projects, but there are
several other Erlang XML libraries out there, with various benefits and
issues. I'm sure others will weigh in with their experiences with
those.

  Hugo.

(*) By "can't", I don't mean "it's just too painful". I mean "it's
provably not possible to do it right in all cases".

--
Hugo Mills             | "There's a Martian war machine outside -- they want
hugo@... carfax.org.uk | to talk to you about a cure for the common cold."
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                           Stephen Franklin, Babylon 5


Re: How to extract string between XML tags

Fred Hebert-2
In reply to this post by Lloyd R. Prentice-2
On 09/25, [hidden email] wrote:

>[...]
>Surely there's a better way.
>

The classic answer: https://stackoverflow.com/a/1732454/35344

The nice non-ridiculous one: You will want to use an XML parser to parse
XML. Regular expressions are usually not the right tool.

Let's take this as an example:

1> Str = "<a>aaaa<b>bbbbb<c>ccccc<d>ddd</d>ccc</c>bbb</b>bbbb</a>".
"<a>aaaa<b>bbbbb<c>ccccc<d>ddd</d>ccc</c>bbb</b>bbbb</a>"
2> rr(xmerl).
[xmerl_event,xmerl_fun_states,xmerl_scanner,xmlAttribute,
 xmlComment,xmlContext,xmlDecl,xmlDocument,xmlElement,
 xmlNamespace,xmlNode,xmlNsNode,xmlObj,xmlPI,xmlText]
3> {XML, _} = xmerl_scan:string(Str).
{#xmlElement{
     name = a,expanded_name = a,nsinfo = [],
     namespace = #xmlNamespace{default = [],nodes = []},
     parents = [],pos = 1,attributes = [],
     content =
         [#xmlText{
              parents = [{a,1}],
              pos = 1,language = [],value = "aaaa",type = text},
          #xmlElement{
              name = b,expanded_name = b,nsinfo = [],
              namespace = #xmlNamespace{default = [],nodes = []},
              parents = [{a,1}],
              pos = 2,attributes = [],
              content =
                  [#xmlText{
                       parents = [{b,2},{a,1}],
                       pos = 1,language = [],value = "bbbbb",type = text},
                   #xmlElement{
                       name = c,expanded_name = c,nsinfo = [],
                       namespace = #xmlNamespace{...},
                       parents = [...],...},
                   #xmlText{
                       parents = [{b,2},{a,...}],
                       pos = 3,language = [],
                       value = [...],...}],
              language = [],xmlbase = "/Users/ferd",
              elementdef = undeclared},
          #xmlText{
              parents = [{a,1}],
              pos = 3,language = [],value = "bbbb",type = text}],
     language = [],xmlbase = "/Users/ferd",
     elementdef = undeclared},
 []}

This gives you a parsed XML document. You can use xpath to access nodes,
if you want. XPath defines a syntax to query the insides of XML
documents as strings: https://en.wikipedia.org/wiki/XPath

For example, the /a/b/c string would mean 'within the root document /,
find the node a, and then go find node b in there, and go find node c'.

This gives something like this:

8> xmerl_xpath:string("/a/b/c", XML).
[#xmlElement{
     name = c,expanded_name = c,nsinfo = [],
     namespace = #xmlNamespace{default = [],nodes = []},
     parents = [{b,2},{a,1}],
     pos = 2,attributes = [],
     content =
         [#xmlText{
              parents = [{c,2},{b,2},{a,1}],
              pos = 1,language = [],value = "ccccc",type = text},
          #xmlElement{
              name = d,expanded_name = d,nsinfo = [],
              namespace = #xmlNamespace{default = [],nodes = []},
              parents = [{c,2},{b,2},{a,1}],
              pos = 2,attributes = [],
              content =
                  [#xmlText{
                       parents = [{d,2},{c,2},{b,2},{a,...}],
                       pos = 1,language = [],value = "ddd",type = text}],
              language = [],xmlbase = "/Users/ferd",
              elementdef = undeclared},
          #xmlText{
              parents = [{c,2},{b,2},{a,1}],
              pos = 3,language = [],value = "ccc",type = text}],
     language = [],xmlbase = "/Users/ferd",
     elementdef = undeclared}]

You can see that the <c> node has 3 entries in it: a text node (#xmlText
with the value "ccccc"), an #xmlElement with the name 'd' (so the <d>
tag), and another text node (with the value "ccc").

You can then go and dig within `<d>` by adding to the xpath:

9> xmerl_xpath:string("/a/b/c/d", XML).
[#xmlElement{name = d,expanded_name = d,nsinfo = [],
             namespace = #xmlNamespace{default = [],nodes = []},
             parents = [{c,2},{b,2},{a,1}],
             pos = 2,attributes = [],
             content = [#xmlText{parents = [{d,2},{c,2},{b,2},{a,1}],
                                 pos = 1,language = [],value = "ddd",type = text}],
             language = [],xmlbase = "/Users/ferd",
             elementdef = undeclared}]

And the sole node contained there is the #xmlText record whose value
field is "ddd". If you want to extract the text nodes directly, you can
use the `text()` XPath node test:

18> xmerl_xpath:string("/a/b/c/text()", XML).
[#xmlText{parents = [{c,2},{b,2},{a,1}],
          pos = 1,language = [],value = "ccccc",type = text},
 #xmlText{parents = [{c,2},{b,2},{a,1}],
          pos = 3,language = [],value = "ccc",type = text}]

19> xmerl_xpath:string("/a/b/c/d/text()", XML).
[#xmlText{parents = [{d,2},{c,2},{b,2},{a,1}],
          pos = 1,language = [],value = "ddd",type = text}]

The xmerl structure is kind of cumbersome, but when you have to handle
more complex documents, a real parser with niceties like xpath can do
wonders to handle documents as a logical structure rather than as a
group of tokens to wrangle.
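
Outside the shell, the same steps can be wrapped in a small module. This is a minimal sketch, not the only way to do it: the module name xml_text and the function text_at/2 are hypothetical, while xmerl_scan:string/1, xmerl_xpath:string/2 and the #xmlText record come from xmerl itself. In a module, the record definitions are pulled in with -include_lib instead of rr/1.

```erlang
-module(xml_text).
-export([text_at/2]).

%% Record definitions (#xmlText, #xmlElement, ...) shipped with xmerl.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse Str as XML and return the values of the text nodes
%% selected by XPath, e.g. "/a/b/c/text()".
text_at(XPath, Str) ->
    {Doc, _Rest} = xmerl_scan:string(Str),
    [Value || #xmlText{value = Value} <- xmerl_xpath:string(XPath, Doc)].
```

With the example string from the session above, xml_text:text_at("/a/b/c/text()", Str) should give ["ccccc","ccc"], matching the two #xmlText records shown in the shell output.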

Regards,
Fred.

Re: How to extract string between XML tags

Lloyd R. Prentice-2
Thanks all,

This is definitely useful info for another aspect of my project.

But perhaps there's a simpler solution to my immediate problem if I pose my question more specifically.

Goal: Add tables to erlpress_core.

Problem: Extract cell info from html tables:

table() ->
   "<table style=\"width:100%\">
       <tr>
          <th>Firstname</th>
          <th>Lastname</th>
          <th>Age</th>
       </tr>
       <tr>
          <td>Jill</td>
          <td>Smith</td>
          <td>50</td>
       </tr>
       <tr>
          <td>Eve</td>
          <td>Jackson</td>
          <td>94</td>
       </tr>
     </table>".

My function ep_parse_table/1 gives me:

[[["<th>Firstname</th>"],
  ["<th>Lastname</th>"],
  ["<th>Age</th>"]],
 [["<tr>Jill</tr>"],["<tr>Smith</tr>"],["<tr>50</tr>"]],
 [["<tr>Eve</tr>"],["<tr>Jackson</tr>"],["<tr>94</tr>"]]]

Is it reasonable to require xmerl as a dependency of erlpress_core and go through the many transformations required to extract cell data for every cell in a, perhaps, large table?

How likely is it that a user will include full-fledged XML data in table cells?

If so then maybe we need to suck up the pain.

Or, maybe we just specify that XML data is not permitted in tables submitted to erlpress_core.

Any thoughts?

All the best,

L.




-----Original Message-----
From: "Fred Hebert" <[hidden email]>
Sent: Tuesday, September 25, 2018 6:20pm
To: [hidden email]
Cc: "Erlang/OTP discussions" <[hidden email]>
Subject: Re: [erlang-questions] How to extract string between XML tags

[...]



Re: How to extract string between XML tags

Fred Hebert-2
On 09/25, [hidden email] wrote:

>Thanks all,
>
>This is definitely useful info for another aspect of my project.
>
>But perhaps there's a simpler solution to my immediate problem if I pose my question more specifically.
>
>Goal: Add tables to erlpress_core.
>
>Problem: Extract cell info from html tables:
>
>table() ->
>  ...
>
>My function ep_parse_table/1 gives me:
>
>[[["<th>Firstname</th>"],
>  ["<th>Lastname</th>"],
>  ["<th>Age</th>"]],
> [["<tr>Jill</tr>"],["<tr>Smith</tr>"],["<tr>50</tr>"]],
> [["<tr>Eve</tr>"],["<tr>Jackson</tr>"],["<tr>94</tr>"]]]
>

{XML, _} = xmerl_scan:string(table()),
Rows = xmerl_xpath:string("/table/tr", XML),


From there:

[[[ Text || #xmlText{value=Text} <- Col]
           || #xmlElement{content=Col} <- Cols]
             || #xmlElement{content=Cols} <- Rows].

Gives:

[[["Firstname"],["Lastname"],["Age"]],
 [["Jill"],["Smith"],["50"]],
 [["Eve"],["Jackson"],["94"]]]

If you want to keep the node's type:

[[[ {Name,Text} || #xmlText{value=Text} <- Col]
           || #xmlElement{content=Col, name=Name} <- Cols]
             || #xmlElement{content=Cols} <- Rows].

Gives:

[[[{th,"Firstname"}],[{th,"Lastname"}],[{th,"Age"}]],
 [[{td,"Jill"}],[{td,"Smith"}],[{td,"50"}]],
 [[{td,"Eve"}],[{td,"Jackson"}],[{td,"94"}]]]

This is a bit obtuse because of the nested list comprehensions; I
haven't taken the time to clean the code up.
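
One way the comprehensions could be cleaned up is to give each nesting level its own function. The sketch below is an assumption about what that might look like, not code from erlpress_core; the names table_cells, rows/1, row_cells/1 and cell_text/1 are invented for illustration. It also flattens the result slightly: each cell becomes a single {Tag, Text} tuple instead of a one-element list.

```erlang
-module(table_cells).
-export([rows/1]).

-include_lib("xmerl/include/xmerl.hrl").

%% Parse a <table> string and return one list of {Tag, Text} per <tr>.
rows(TableXml) ->
    {Doc, _Rest} = xmerl_scan:string(TableXml),
    [row_cells(Row) || Row <- xmerl_xpath:string("/table/tr", Doc)].

%% Keep only element children (<th>/<td>); whitespace text nodes
%% between cells do not match the pattern and are skipped.
row_cells(#xmlElement{content = Cells}) ->
    [{Name, cell_text(Content)}
     || #xmlElement{name = Name, content = Content} <- Cells].

%% Concatenate the direct text nodes of one cell into a flat string.
cell_text(Content) ->
    lists:flatten([Value || #xmlText{value = Value} <- Content]).
```

Markup nested inside a cell (for example <b>) is ignored by cell_text/1 here, since only direct text children are collected; whether that is acceptable depends on what syntax the cells are meant to support.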

>Is it reasonable to require xmerl as a dependency of erlpress_core and
>go through the many transformations required to extract cell data for
>every cell in a, perhaps, large table?
>

It's not that bad, considering xmerl is part of the standard library,
but that's a fair concern anyway.

>How likely is it that a user will include full-fledged XML data in table cells?
>

The sad thing is that XML is _simpler_ to parse than HTML and all its
variants (because they are less strict, they allow for more stuff to
happen).

The question though is what is the syntax you aim to support? Should
people be able to style text using tags like <strong>, <em>, <code>,
<tt>, and so on? Or do you expect literal text always? What you accept
or refuse defines what you can deal with.

>If so then maybe we need to suck up the pain.
>
>Or, maybe we just specify that XML data is not permitted in tables
>submitted to erlpress_core.
>
>Any thoughts?
>

You could say XML data is not supported. That does not prevent you from
using the XML parser rather than writing your own.

For example, what does someone do when they want to use the literal
'<td>' string inside a table cell without breaking your own parser?
What's the escape sequence? With an XML parser, you get it for free:
&gt; is > and &lt; is <:

72> xmerl_scan:string("<td> bf&lt;td&gt;aaa</td>").
{#xmlElement{...
             content = [#xmlText{value = " bf<td>aaa" ...}],
             ...},
 []}

You can see the resulting string being " bf<td>aaa" despite already
being in a <td> element. No confusion to be had.

If you don't use the parser, you have to come up with these rules
yourself, and implement them properly. That's a lot of work :)
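
As a small illustration of the escaping point, the shell call above can be turned into a function that returns only the decoded cell text. The module name escaped_cell and the function text/1 are hypothetical; the sketch assumes the input is a single well-formed element.

```erlang
-module(escaped_cell).
-export([text/1]).

-include_lib("xmerl/include/xmerl.hrl").

%% Parse a single element and return its direct text content,
%% with entities such as &lt; and &gt; already decoded by xmerl.
text(Xml) ->
    {#xmlElement{content = Content}, _Rest} = xmerl_scan:string(Xml),
    lists:flatten([Value || #xmlText{value = Value} <- Content]).
```

Calling escaped_cell:text("<td> bf&lt;td&gt;aaa</td>") should return " bf<td>aaa", the same value seen in the shell output above.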



Re: How to extract string between XML tags

Lloyd R. Prentice-2
Hi Fred,

This is gorgeous! in it will go.

Thank you so much.

All the best,

Lloyd

Sent from my iPad

> On Sep 25, 2018, at 7:38 PM, Fred Hebert <[hidden email]> wrote:
>
> [...]


Re: How to extract string between XML tags

PAILLEAU Eric
In reply to this post by Lloyd R. Prentice-2
hello,
sorry did not see this question before.

A simple regexp is possible "<\/?[^>]{1,}>"

re:replace("<th>title <b>bold</b></th>","<\/?[^>]{1,}>","", [global,
{return, list}]).
"title bold"



On 25/09/2018 at 23:56, [hidden email] wrote:

> [...]


Re: How to extract string between XML tags

PAILLEAU Eric
Hello,
BTW  "</?[^>]{1,}>"  works too, no need to escape /  (Perl reflex :) ...)
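
Wrapped in a function, the tag-stripping approach might look like the sketch below. The module name strip_tags and the function strip/1 are made up for illustration, and "+" is used in place of the equivalent "{1,}". Keep in mind what it does not do: entities such as &lt; are left undecoded, and a ">" inside an attribute value would end a tag match early.

```erlang
-module(strip_tags).
-export([strip/1]).

%% Remove every "<...>" or "</...>" run and return what is left.
%% This deletes markup; it does not parse or validate it.
strip(Html) ->
    re:replace(Html, "</?[^>]+>", "", [global, {return, list}]).
```

For the example in this post, strip_tags:strip("<th>title <b>bold</b></th>") returns "title bold".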


On 29/09/2018 at 11:13, PAILLEAU Eric wrote:

> [...]


Re: How to extract string between XML tags

Eckard Brauer
Hello,

just another (a beginner's) question probably leading away from the
initial point:

If I use

T = re:replace("<th>title <b>bold</b></th>",
        "<\([^>]\+\)>\(.*\)</\([^>]\+\)>",
        "\\1 \\2 \\3",
        [global,{return, list}]).

how could I check that T is of the form "X Y X"?
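
One possible answer, offered as a sketch rather than the canonical way: instead of inspecting T afterwards, the regex itself can require that the closing tag repeats the opening one by using a backreference (\1) with re:run. The module name tag_match and the function same_tag/1 are hypothetical.

```erlang
-module(tag_match).
-export([same_tag/1]).

%% Returns {ok, Tag, Inner} when Str has the form <Tag>Inner</Tag>,
%% i.e. the outermost opening and closing tag names are identical,
%% and nomatch otherwise. "\\1" is a backreference to the first group.
same_tag(Str) ->
    Pattern = "^<([^>]+)>(.*)</\\1>$",
    Opts = [dotall, {capture, all_but_first, list}],
    case re:run(Str, Pattern, Opts) of
        {match, [Tag, Inner]} -> {ok, Tag, Inner};
        nomatch -> nomatch
    end.
```

This only compares the outermost pair of tags; it says nothing about whether the markup inside is balanced, which is the part a regex cannot decide in general.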



On Sat, 29 Sep 2018 11:18:14 +0200,
PAILLEAU Eric <[hidden email]> wrote:

> Hello,
> BTW  "</?[^>]{1,}>"  works too, no need to escape /  (Perl
> reflex :) ...)
>
>


--
:)


Re: How to extract string between XML tags

Lloyd R. Prentice-2
In reply to this post by PAILLEAU Eric
Thanks, Eric!

Best wishes,

Lloyd

Sent from my iPad

> On Sep 29, 2018, at 5:13 AM, PAILLEAU Eric <[hidden email]> wrote:
>
> [...]


Re: How to extract string between XML tags

Hugo Mills-2
   Note that this only works if there are no nested tags of the same
type. For example, it'll get this wrong:

<b>Part of a <b>nested tag</b>...</b>

   (And there's *no* regex that can get this right in general)

   Hugo.

On Sat, Sep 29, 2018 at 11:11:19AM -0400, Lloyd R. Prentice wrote:

> Thanks, Eric!
>
> [...]

--
Hugo Mills             | IoT: The S stands for Security.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |


Re: How to extract string between XML tags

PAILLEAU Eric
Hello Hugo,

It works with anything, because this solution does not capture the data
inside tags; it removes the tags. It works even with unbalanced, crappy
HTML.

1> re:replace("<b>Part of a <b>nested tag</b>...</b>","<\/?[^>]{1,}>","", [global, {return, list}]).
"Part of a nested tag..."

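The call above can be wrapped in a small helper. This is only a sketch: the module name tag_strip and the function name strip_tags/1 are made up for illustration, and the regexp is the unescaped form "</?[^>]{1,}>" mentioned elsewhere in this thread. It removes anything that looks like a tag and makes no attempt to check that the tags are balanced.

```erlang
-module(tag_strip).
-export([strip_tags/1]).

%% Remove every opening or closing tag from Text and return what is
%% left as a string. Balancing of tags is not checked.
strip_tags(Text) ->
    re:replace(Text, "</?[^>]{1,}>", "", [global, {return, list}]).
```

For example, tag_strip:strip_tags("<th>Firstname</th>") returns "Firstname".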


Le 29/09/2018 à 17:16, Hugo Mills a écrit :

>     Note that this only works if there's no nested tags of the same
> type. For example, it'll get this wrong:
>
> <b>Part of a <b>nested tag</b>...</b>
>
>     (And there's *no* regex that can get this right in general)
>
>     Hugo.


--
----------------------------------------
    Eric PAILLEAU  |  [hidden email]
----------------------------------------


Re: How to extract string between XML tags

PAILLEAU Eric
In reply to this post by Eckard Brauer
Hello,
if the goal is to make sure that the tags are correctly balanced, it is
better to use the xmerl parser, as Fred proposed.
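
A rough sketch of what the xmerl route could look like, for illustration only. The module name tag_text and the functions text/1 and collect/1 are invented here; xmerl_scan:string/2 and the #xmlElement{} and #xmlText{} records come from the xmerl application shipped with OTP. The sketch assumes the input is a string containing a single root element.

```erlang
-module(tag_text).
-export([text/1]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse Xml (a string) with xmerl and return {ok, Text}, where Text is
%% all character data in document order. Return error if xmerl rejects
%% the input, for example because of a mismatched closing tag.
text(Xml) ->
    try xmerl_scan:string(Xml, [{quiet, true}]) of
        {Root, _Rest} -> {ok, collect(Root)}
    catch
        _:_ -> error
    end.

%% Walk the parsed tree and concatenate the text nodes.
collect(#xmlElement{content = Content}) ->
    lists:flatmap(fun collect/1, Content);
collect(#xmlText{value = Value}) ->
    lists:flatten(Value);
collect(_Other) ->
    [].
```

Because the parser tracks nesting itself, nested tags of the same type are not a special case here, and unbalanced input is reported instead of being silently stripped.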

I see an issue in your regexp
"<\([^>]\+\)>\(.*\)</\([^>]\+\)>": (.*) will match anything, including
tags (I mean it also matches "<").

Use instead
"<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>", i.e. anything that is not the
start of a tag.

But then, for instance, it will not work on nested tags; only the
innermost pair is matched:

1> re:replace("<th>title <b>bold</b></th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 \\3",[global,{return, list}]).
"<th>title b bold b</th>"

Note that the same call can be made twice, once with \\3 and once with
\\1 at the end of the replacement:
2> A = re:replace("<th>title <b>bold</b></th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 \\3",[global,{return, list}]).
"<th>title b bold b</th>"

3> B = re:replace("<th>title <b>bold</b></th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 \\1",[global,{return, list}]).
"<th>title b bold b</th>"

As \\1 MUST BE equal to \\3 when the tags are balanced, the match
4> A = B.
should succeed.

Example with a single tag:

43> A = re:replace("<th>title</th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 \\3",[global,{return, list}]).
"th title th"
44> B = re:replace("<th>title</th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 \\1",[global,{return, list}]).
"th title th"
45> A = B.
"th title th"

But with an unbalanced tag the match fails:

48> A = re:replace("<th>title</b>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 \\3",[global,{return, list}]).
"th title b"
49> B = re:replace("<th>title</b>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 \\1",[global,{return, list}]).
"th title th"
50> A = B.
** exception error: no match of right hand side value "th title th"

Regards


Le 29/09/2018 à 13:30, Eckard Brauer a écrit :

> Hello,
>
> just another (a beginner's) question probably leading away from the
> initial point:
>
> If I use
>
> T = re:replace("<th>title <b>bold</b></th>",
> "<\([^>]\+\)>\(.*\)</\([^>]\+\)>",
> "\\1 \\2 \\3",
> [global,{return, list}]).
>
> how could I check that T is of the form "X Y X"?

Re: How to extract string between XML tags

Eckard Brauer
Thanks for the response.

Yes, indeed the first idea was to check balancing, and I'm aware of
xmerl and the regex issue. Thanks for the hints; the ".*" is there
because "[^<]" wouldn't match nested tags, as you wrote below.

So my first idea was to try

X ++ Y ++ X = re:replace(...)

which of course didn't work. I know the limitations of REs; I just
wanted to check whether there's a way to go about it as with Prolog's
append/2
(append([X,Y,X], [a,b,a]).)
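
For reference, Erlang only accepts ++ in a pattern when its left operand is a literal string, which is why X ++ Y ++ X = ... is rejected by the compiler. One way to get a similar effect is to let the regexp engine do the comparison with a backreference. A minimal sketch, assuming a single non-nested element; the module name tag_match and the function name balanced/1 are invented for illustration:

```erlang
-module(tag_match).
-export([balanced/1]).

%% Return {ok, Tag, Inner} when Text is exactly <Tag>Inner</Tag> with
%% no nested tags, and nomatch otherwise. The \\1 in the pattern is a
%% backreference: the closing tag name must equal the opening one.
balanced(Text) ->
    Pattern = "^<([^>]+)>([^<]+)</\\1>$",
    case re:run(Text, Pattern, [{capture, [1, 2], list}]) of
        {match, [Tag, Inner]} -> {ok, Tag, Inner};
        nomatch -> nomatch
    end.
```

With this, tag_match:balanced("<th>title</th>") gives {ok,"th","title"}, while tag_match:balanced("<th>title</b>") gives nomatch. It still cannot handle nesting; that needs a parser such as xmerl.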

As I wrote, I'm just playing around with Erlang. Most of my work
consists of either C(++) or shell programming with only limited
practice of other languages (e.g. Prolog, very little Lisp, Python,
PHP). So I'm here only for learning so far, and glad for any help I can
get.

Thanks again, best regards
Eckard


Am Sat, 29 Sep 2018 17:59:34 +0200 schrieb PAILLEAU Eric
<[hidden email]>:

> Hello,
> if the question is to be sure that tags are correctly balanced, it is
> better to use xmerl parser like Fred proposed.
>
> I see an issue in your regexp
> "<\([^>]\+\)>\(.*\)</\([^>]\+\)>"   (.*) will catch anything
> including tags (I mean also < )
>
> use instead
> "<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>"  i.e anything that is not a tag
> start.
>
> but for instance it will not work on nested tags :
>
> 1> re:replace("<th>title <b>bold</b></th>","<\([^>]\+\)>\([^<]+\)</\([^>]\+\)>","\\1 \\2 \\3",[global,{return, list}]).
> "<th>title b bold b</th>"


--
:)
