How can I break this string into a list of strings?

Lloyd R. Prentice-2
Hello,

A naive question.

But every time I look into the re module my heart starts pounding. I spend uncountable time figuring out what I need to do. But this time I failed.

Suppose I have the following string:

    "<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>"

I would like to break it into a list:

    ["<h1>Hello!</h1>", "<h2>How are you?</h2>", "<p>Some text\n and more text.</p>"]

string:tokens(MyString, "\n") doesn't work because it would also break up the paragraph.

So I try:

    re:replace(Copy, [$>,$\n], [$>,$",$,,$"], [global, {return, list}]).

But I get:

   "<h1>Hello!</h1>\",\"     <h2>How are you?</h2>\",\"    <p>This is....

I don't want those pesky escaped quotes at the end of the heads. I want the real deal.

But how can I make it happen?

Thanks and holiday cheer to all,

LRP

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: How can I break this string into a list of strings?

Constantine Povietkin
Hi,

Just convert it to binary and use binary:split.
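
A minimal sketch of that suggestion, assuming every block-level element ends with ">" immediately followed by a newline, as in the sample string. The module and function names (split_sketch, split_blocks/1) are made up for illustration; only binary:split/3 and the unicode conversions are standard library calls.

```erlang
-module(split_sketch).
-export([split_blocks/1]).

%% Split a flat string into chunks at every ">\n" boundary.
%% Assumption: a newline directly after ">" always separates two elements.
split_blocks(String) ->
    Bin = unicode:characters_to_binary(String),
    Parts = binary:split(Bin, <<">\n">>, [global]),
    restore(Parts).

%% binary:split/3 drops the separator, so put the ">" back on every
%% part except the last one, which never lost it.
restore([Last]) ->
    [clean(Last)];
restore([Part | Rest]) ->
    [clean(<<Part/binary, ">">>) | restore(Rest)].

%% Convert back to a list and drop the indentation in front of the tag.
clean(Bin) ->
    lists:dropwhile(fun is_blank/1, unicode:characters_to_list(Bin)).

is_blank(C) ->
    C =:= $\s orelse C =:= $\t.
```

Note that a trailing ">\n" at the very end of the input would leave an empty string as the last element, so real input may need one more filtering step.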

Re: How can I break this string into a list of strings?

Kenneth Lakin
In reply to this post by Lloyd R. Prentice-2
On 12/24/2016 01:34 PM, [hidden email] wrote:
> I don't want those pesky escaped quotes at the end of the heads. I want the real deal.

Does it help you to know that those double-quotes aren't _actually_
escaped and that the backslashes are added by the pretty-print code?
Continuing on with your example:

2> B=re:replace(Copy, [$>,$\n], [$>,$",$,,$"], [global, {return, list}]).
%re:replace output elided
3> lists:nth(16, B).
34
4> lists:nth(15, B).
62

ASCII 34 is '"'. ASCII 62 is '>'. There's no "\" in the output, which
would be ASCII 92:

5> lists:foldl(fun(92, Acc) -> Acc ++ [92]; (_, Acc) -> Acc end, [], B).
[]

The pretty-printer is a nice feature (AFAICT, anything the
pretty-printer prints is valid Erlang), but I suspect that being
startled by its escaping is a rite of passage.
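
The same point can be checked without counting list positions. This is only a sketch: the module name quote_demo and the shortened input are made up, and the regular expression is written as a plain string instead of a character list. Printing with ~p shows the escaped form, while ~s prints the characters that are really in the list.

```erlang
-module(quote_demo).
-export([demo/0, has_backslash/1]).

%% True if the string really contains a backslash character (ASCII 92).
has_backslash(String) ->
    lists:member($\\, String).

demo() ->
    Copy = "<h1>Hello!</h1>\n     <h2>How are you?</h2>",
    B = re:replace(Copy, ">\n", ">\",\"", [global, {return, list}]),
    io:format("~p~n", [B]),   % pretty-printed: quotes shown as \"
    io:format("~s~n", [B]),   % raw characters: no backslashes appear
    has_backslash(B).
```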



Re: How can I break this string into a list of strings?

PAILLEAU Eric
In reply to this post by Lloyd R. Prentice-2

Hi,
If the re module makes your heart pound,
you may want to give https://github.com/crownedgrouse/swab a try
and create powerful macros to apply to your buffers.

"Sent from my mobile" Eric



Re: How can I break this string into a list of strings?

Lloyd R. Prentice-2
In reply to this post by Kenneth Lakin
Thanks all, but apologies.

I misstated my problem.

What I really need to do is convert:

 "<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>"

Into:

 ["<h1>Hello!</h1>", "<h2>How are you?</h2>", "<p>Some text\n and more text.</p>"]

Such that:

   length(MyString) returns 3
   
or
 
   [length(String) || String <- MyString] returns [N1, N2, N3]

re and string:tokens/2 got a bit muddled in my mind.

Thanks again,

LRP


Re: How can I break this string into a list of strings?

Lloyd R. Prentice-2
In reply to this post by PAILLEAU Eric
Hi Eric,

Swab looks very cool. But I'll have to wait until I've recovered from my New Year's hangover to dig in sufficiently to understand it.

Up for questions if they occur?

Thanks,

Lloyd

Re: How can I break this string into a list of strings?

zxq9-2
In reply to this post by Lloyd R. Prentice-2
Hi Lloyd!

Maybe we can elevate this to a more general problem, like "how can I interpret a list of angle-brackety notation and return a list that contains the parts of it in a way that is meaningful to my Erlang program?"

Fortunately the days of "it's OK if you have a closing tag, or not, but maybe just in case, sometimes" are over. So these days we know that, at least by HTML5 and XML type rules, we will always have an explicit closing tag, a slash, or one of a very few tags that are "self-closing", meaning that when we encounter them we know right away that we won't encounter a closing one.

We could get into some listy wizardry with lists of regexes operating over your list of characters that flip it between binaries and lists and lists of binaries and other nonsense... but I've never seen that lead to a happy place, an understandable place, and it always winds up having dramatically worse performance characteristics (by every criterion, which is remarkable).

So let's instead decide that we have an accumulator, which is a deep list (so it mimics the way the elements in the HTML look, like a tree of tags), your original list, and the idea that we need a mode for recording the label of the tag (the "h1" or "h2" or whatever), a mode for recording the data enclosed by the tag ("Hello!" and other text we actually care about), and a matching mode to find previous tags that are being closed. Three modes, basically.

To start with I want to move through the list, passing whatever I encounter to an accumulator. This is the simplest explicit recursion we can do -- so let's write a tinkering module for it:

  -module(htmler).
  -export([consume/1]).

  -spec consume(String) -> Result
      when String :: unicode:chardata(),
           Result :: {ok, Structure :: list()}
                   | {error, Reason :: term()}.

  consume(String) ->
      consume(String, "").

  consume([Char | Rest], Acc) ->
      consume(Rest, [Char | Acc]);
  consume([], Acc) ->
      {ok, lists:reverse(Acc)}.

...and now give it a try...

  1> c(htmler).
  {ok,htmler}
  2> htmler:consume("<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>").
  {ok, "<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>"}

OK, works as expected.

If I encounter a $< then I want to start recording the tag label. If I hit whitespace I want to just pass by it until I hit the first meaningful character of the label. I add the label characters to an accumulator for the label. When I encounter whitespace I want to stop recording the tag label, but still move through the next characters until the closing $>.

  -module(htmler).
  -export([consume/1]).

  -spec consume(String) -> Result
      when String :: unicode:chardata(),
           Result :: {ok, Structure :: list()}
                   | {error, Reason :: term()}.

  consume(String) ->
      consume(String, "").

  consume([$< | Rest], Acc) ->
      case consume_label(Rest, "") of
          {ok, Label, Remainder} ->
              consume(Remainder, [Label | Acc]);
          {error, Label} ->
              {error, {bad_tag, Label}}
      end;
  consume([Char | Rest], Acc) ->
      consume(Rest, [Char | Acc]);
  consume([], Acc) ->
      {ok, lists:reverse(Acc)}.


  consume_label([$\s | Rest], Acc) ->
      consume_label(Rest, Acc);
  consume_label([$\t | Rest], Acc) ->
      consume_label(Rest, Acc);
  consume_label(String, Acc) ->
      consume_label_text(String, Acc).

  consume_label_text([$> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
  consume_label_text([$\s | Rest], Acc) ->
      consume_label_close(Rest, Acc);
  consume_label_text([$\t | Rest], Acc) ->
      consume_label_close(Rest, Acc);
  consume_label_text([Char | Rest], Acc) ->
      consume_label_text(Rest, [Char | Acc]);
  consume_label_text([], Acc) ->
      Label = lists:reverse(Acc),
      {error, Label}.

  consume_label_close([$> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
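  consume_label_close([_ | Rest], Acc) ->
      consume_label_close(Rest, Acc);
  consume_label_close([], Acc) ->
      % Input ended before the closing $>, so report the partial label.
      Label = lists:reverse(Acc),
      {error, Label}.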

And now let's give that a try...

  3> c(htmler).
  {ok,htmler}
  4> io:format("~tp~n", [htmler:consume("<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>")]).
  {ok,["h1",72,101,108,108,111,33,"/h1",10,32,32,32,32,32,"h2",72,111,119,32,97,
       114,101,32,121,111,117,63,"/h2",10,32,32,32,32,"p",83,111,109,101,32,116,
       101,120,116,10,32,97,110,100,32,109,111,114,101,32,116,101,120,116,46,
       "/p"]}
  ok

Ah, so we're on the right track, but not quite where we want to be. Looking at this I can tell that we probably don't want to just split things up, but extract some meaning. Maybe what we really want is something like a list of tuples with a label that tells us the kind of thing it is, and then has a list containing the data that the label applies to. Like:

  [{"h1", "Hello!"}, {"h2", "How are you?"}, {"p", "Some text\n and more text."}]

I think we can change this a bit to get to that point, but it will require that we remember the tag we are currently in, in case, for example, there is an <em> tag inside the <p> component or whatever else -- with these kinds of tags it's not about finding a closing tag, it is about finding out which closing tag you've found. Maybe we want some part of this to actually be non-tail recursive -- so we can work on just the tag we are parsing at the moment and not remember everything else.

  -module(htmler).
  -export([consume/1]).

  -spec consume(String) -> Result
      when String :: unicode:chardata(),
           Result :: {ok, Structure :: list()}
                   | {error, Reason :: term()}.

  consume(String) ->
      consume(String, []).

  consume([$< | Rest], Acc) ->
      case consume_element(Rest) of
          {ok, Element, Remainder} ->
              consume(Remainder, [Element | Acc]);
          {error, Element} ->
              {error, {bad_tag, Element}}
      end;
  consume([Char | Rest], Acc) ->
      consume(Rest, [Char | Acc]);
  consume([], Acc) ->
      {ok, lists:reverse(Acc)}.


  consume_element(String) ->
      case consume_label(String, "") of
          {ok, Label, Rest} ->
              {ok, Contents, Remainder} = consume_contents(Label, Rest, ""),
              {ok, {Label, Contents}, Remainder};
          Error ->
              Error
      end.


  consume_label([$\s | Rest], Acc) ->
      consume_label(Rest, Acc);
  consume_label([$\t | Rest], Acc) ->
      consume_label(Rest, Acc);
  consume_label(String, Acc) ->
      consume_label_text(String, Acc).

  consume_label_text([$> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
  consume_label_text([$\s | Rest], Acc) ->
      consume_label_end(Rest, Acc);
  consume_label_text([$\t | Rest], Acc) ->
      consume_label_end(Rest, Acc);
  consume_label_text([Char | Rest], Acc) ->
      consume_label_text(Rest, [Char | Acc]);
  consume_label_text([], Acc) ->
      Label = lists:reverse(Acc),
      {error, Label}.

  consume_label_end([$/, $> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
  consume_label_end([$> | Rest], Acc) ->
      Label = lists:reverse(Acc),
      {ok, Label, Rest};
  consume_label_end([_ | Rest], Acc) ->
      consume_label_end(Rest, Acc).


  consume_contents(Label, [$<, $/ | Rest], Acc) ->
      {ok, Label, Remainder} = consume_label(Rest, ""),
      Contents = lists:reverse(Acc),
      {ok, Contents, Remainder};
  consume_contents(Label, [$< | Rest], Acc) ->
      {ok, Element, Remainder} = consume_element(Rest),
      consume_contents(Label, Remainder, [Element | Acc]);
  consume_contents(Label, [Char | Rest], Acc) ->
      consume_contents(Label, Rest, [Char | Acc]);
  consume_contents(_, "", Acc) ->
      Contents = lists:reverse(Acc),
      {ok, Contents}.

And how will that work out, I wonder...

  13> io:format("~tp~n", [htmler:consume("<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>")]).
  {ok,[{"h1","Hello!"},
       10,32,32,32,32,32,
       {"h2","How are you?"},
       10,32,32,32,32,
       {"p","Some text\n and more text."}]}
  ok

How about with that nested tag case I was worried about earlier?

  14> io:format("~tp~n", [htmler:consume("<h1>Hello!</h1>\n     <h2>How are <em>you?</em></h2>\n    <p>Some text\n and more text.</p>")]).
  {ok,[{"h1","Hello!"},
       10,32,32,32,32,32,
       {"h2",[72,111,119,32,97,114,101,32,{"em","you?"}]},
       10,32,32,32,32,
       {"p","Some text\n and more text."}]}
  ok

Ah! Nice. So it really does nest the "contents" list.

So now we have a more semantically correct split happening, and it lets us descend down a tree of HTML tags -- assuming some really huge things that we can't normally assume on the big steaming pile of errors, exceptions and malformed output we call the web. For example:

  2> htmler:consume("<body><h1>Hello</h1><p>foo</body>").
  ** exception error: no match of right hand side value {ok,"body",[]}
       in function  htmler:consume_contents/3 (htmler.erl, line 66)
       in call from htmler:consume_element/1 (htmler.erl, line 28)
       in call from htmler:consume_contents/3 (htmler.erl, line 70)
       in call from htmler:consume_element/1 (htmler.erl, line 28)
       in call from htmler:consume/2 (htmler.erl, line 13)

Whoops! Forgot the closing </p> tag, so everything catches fire.

  3> htmler:consume("<body><h1>Hello</h1><p>foo</p></body>").
  {ok,[{"body",[{"h1","Hello"},{"p","foo"}]}]}

Ah, much better.

Anyway, handling bad input in HTML is an art, not a science. This could be made dramatically more complex in the interest of accepting as much crap in the tubes as Firefox does -- or left simple and made to return explicit error messages with an indication of what part of consuming the input failed (hint: this is a great way to get a head start on creating a version that accepts bad input). (Also, of course, we would need to put in some cases to catch when the label is one of the implicitly "self-closing" tags, an explicitly self-closed tag, and a few other things to be complete.)
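
As a rough sketch of the "self-closing" check mentioned above: the helper below tests a label against the HTML5 void elements. The module and function names (htmler_void, is_void/1) are hypothetical and not part of the htmler module in this thread; wiring the check into consume_element/1 is left out.

```erlang
-module(htmler_void).
-export([is_void/1]).

%% True if Label names an HTML5 void element, i.e. one that never has
%% a closing tag. A trailing "/" (as in "br/" from "<br/>") is ignored.
is_void(Label) ->
    Name = lists:takewhile(fun(C) -> C =/= $/ end, string:to_lower(Label)),
    lists:member(Name, void_elements()).

void_elements() ->
    ["area", "base", "br", "col", "embed", "hr", "img",
     "input", "link", "meta", "source", "track", "wbr"].
```

A parser could call is_void(Label) right after reading the label and, when it returns true, skip looking for contents and a closing tag.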

Anyway, this doesn't exactly address the problem you have, but hopefully it gives you some ideas about how to parse angle-brackety syntaxes that manage to be both dramatically more complex and profoundly less useful than S-expressions.

Regexes just don't cut it in this particular case.

Merry Christmas! I hope you are having a good holiday with the family!

-Craig


Re: How can I break this string into a list of strings?

zxq9-2
On Sunday, 25 December 2016 14:21:00 zxq9 wrote:
> On Saturday, 24 December 2016 16:34:22 [hidden email] wrote:
> > I would like to break it into a list:
> >
> >     ["<h1>Hello!</h1>", "<h2>How are you?</h2>", "<p>Some text\n and more text.</p>"]
> >

While it is in my mind... we can use a variety of techniques to get output either exactly like, or perhaps more useful but very similar to the above based on the previously presented toy function:

  5> {ok, Split} = htmler:consume("<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>").
  {ok,[{"h1","Hello!"},
       10,32,32,32,32,32,
       {"h2","How are you?"},
       10,32,32,32,32,
       {"p","Some text\n and more text."}]}
  6> [Contents || {_, Contents} <- Split].
  ["Hello!","How are you?","Some text\n and more text."]

Or even:

  7> ["<" ++ Label ++ ">" ++ Contents ++ "</" ++ Label ++ ">" || {Label, Contents} <- Split].
  ["<h1>Hello!</h1>","<h2>How are you?</h2>","<p>Some text\n and more text.</p>"]

But seriously, why would I still want those tags in there at all?


>   14> io:format("~tp~n", [htmler:consume("<h1>Hello!</h1>\n     <h2>How are <em>you?</em></h2>\n    <p>Some text\n and more text.</p>")]).
>   {ok,[{"h1","Hello!"},
>        10,32,32,32,32,32,
>        {"h2",[72,111,119,32,97,114,101,32,{"em","you?"}]},
>        10,32,32,32,32,
>        {"p","Some text\n and more text."}]}

This case is more interesting and won't work with a simple list comprehension to filter out the elements that are not tuples -- but an explicit function would do the trick.
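
One possible shape for such an explicit function, as a sketch only: it assumes the structure produced by the toy htmler:consume/1 above (characters and {Label, Contents} tuples, nested to any depth). The module and function names (htmler_text, text/1, blocks/1) are made up for illustration.

```erlang
-module(htmler_text).
-export([text/1, blocks/1]).

%% Flatten one contents list into plain text by descending into
%% nested {Label, Contents} tuples and dropping the labels.
text(Contents) ->
    lists:flatmap(fun text_item/1, Contents).

text_item({_Label, Inner}) -> text(Inner);
text_item(Char) when is_integer(Char) -> [Char].

%% Text of every top-level element, skipping the bare whitespace
%% characters that sit between the elements.
blocks(Split) ->
    [text(Contents) || {_Label, Contents} <- Split].
```

Calling htmler_text:blocks(Split) on the nested example gives one plain string per top-level element, with the text of the inner <em> folded back into its parent.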

>   3> htmler:consume("<body><h1>Hello</h1><p>foo</p></body>").
>   {ok,[{"body",[{"h1","Hello"},{"p","foo"}]}]}

This last case is more like what you are going to actually be encountering in real HTML and XML docs -- and it is very similar to the case above. The real difference is that your "main list" that was returned is not wrapped in a tuple the way everything else is (well, that's not entirely true -- the function actually does return a tuple: {ok, Contents}). Maybe this could be leveraged to write a general pretty printing or interpretation function?

Anyway, blahblah. I think you get the idea.

Time for pumpkin pie! Wee!

-Craig

Re: How can I break this string into a list of strings?

PAILLEAU Eric
In reply to this post by Lloyd R. Prentice-2
Hi,

Have a look at re:split then.

Something like "(<[^>]+>[^<]*<[^>]+>)" should do the job. Sorry, I am not in front of a computer right now, so I could not test my proposal, but it may give you a start anyway.
Ho ho ho!

Re: How can I break this string into a list of strings?

Pierre Fenoll-2
string:tokens(XMLLikeString, "<>").

Re: How can I break this string into a list of strings?

PAILLEAU Eric
Hi Pierre,

No, this way you are losing the HTML tags.
Lloyd wants to keep the HTML tags, including linefeeds inside the HTML elements.

I tested with re:split(Html, "(<[^>]+>[^<]*<[^>]+>)", [{return, list}]). It works.
Lloyd can then loop over the result and keep only the strings that are non-empty after trimming.
Regards
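
Put together, the re:split call and the filtering step might look like the sketch below. The module name resplit_sketch and the helper is_blank/1 are made up; the regular expression is the one quoted above. It assumes flat, non-nested elements, since the pattern stops at the first closing tag it meets.

```erlang
-module(resplit_sketch).
-export([blocks/1]).

%% Split on whole "<tag>text</tag>" matches. Because the pattern is a
%% capturing group, re:split/3 keeps the matched elements in the result,
%% interleaved with the whitespace that sat between them.
blocks(Html) ->
    Parts = re:split(Html, "(<[^>]+>[^<]*<[^>]+>)", [{return, list}]),
    [Part || Part <- Parts, not is_blank(Part)].

%% True for empty strings and strings made only of whitespace.
is_blank(String) ->
    lists:all(fun(C) -> lists:member(C, " \t\r\n") end, String).
```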

Re: How can I break this string into a list of strings?

Pierre Fenoll-2
Hey Éric

Well it's not clear whether tags are nested in the OP.
So when my assumption applies, I would just read the list created 3 elements at a time, with strings of whitespace removed.

If it doesn't, regexps are known to not be able to parse stack based grammars.
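
A sketch of that three-at-a-time reading, under the stated assumption that tags are not nested and that every element has non-empty text. The module name tokens_sketch and its helpers are made up; string:tokens/2 is the standard (now legacy) function from the earlier message.

```erlang
-module(tokens_sketch).
-export([blocks/1]).

%% Tokenise on "<" and ">", drop whitespace-only tokens, then rebuild
%% each element from its {opening label, text, closing label} triple.
blocks(String) ->
    Tokens = [T || T <- string:tokens(String, "<>"), not is_blank(T)],
    group(Tokens).

%% Crashes with function_clause if the tokens do not come in threes,
%% which is what happens with nested or empty elements.
group([Open, Text, Close | Rest]) ->
    [lists:append(["<", Open, ">", Text, "<", Close, ">"]) | group(Rest)];
group([]) ->
    [].

is_blank(String) ->
    lists:all(fun(C) -> lists:member(C, " \t\r\n") end, String).
```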

Re: How can I break this string into a list of strings?

zxq9-2
On Sunday, 25 December 2016 23:43:56 Pierre Fenoll wrote:
> ...regexps are known to not be able to parse stack based grammars.

There is even a famous post on the subject...
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Re: How can I break this string into a list of strings?

Lloyd R. Prentice-2
In reply to this post by Pierre Fenoll-2
Many thanks all for your excellent fodder for thought.

I would have thought that my question was trivial and the solution a well-worn path. But the issues are deeper than I imagined.

All the best in this new year,

Lloyd

Re: How can I break this string into a list of strings?

Richard A. O'Keefe-2
In reply to this post by Lloyd R. Prentice-2


On 25/12/16 10:34 AM, [hidden email] wrote:
> Suppose I have the following string:
>
>     "<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>"
>
> I would like to break it into a list:
>
>     ["<h1>Hello!</h1>", "<h2>How are you?</h2>", "<p>Some text\n and more text.</p>"]
>
> string:token(MyString, "$\n") doesn't work because it would break the paragraph.

So you don't have "a string", you have an XML fragment that happens to
be stored as a string.  My problem in reading this is that I don't have
the faintest idea what you want to happen IN GENERAL.
  - What is to happen if there is a newline character in an attribute?
    "<img src='...' alt='Two gorillas\nOne cop'>"
  - What is to happen if there is a newline between tokens inside a tag?
    "<a href=\n'....'\n>Anchor Text</a\n>"
  - What is to happen if there is a newline inside an element other than
    a <p> element?
    "<h1>Two gorillas, one cop<br>\nSpoof movie of the year</h1>"
  - What is to happen if there AREN'T newlines?
    "<h1>Hello!</h1><h2>How are you?</h2><p>Some text\n etc.</p>"
    White space between block level elements isn't significant, which
    means that you can't in general expect it to be there or to be
    preserved by other tools.
  - What is to happen if the newlines between elements are doubled?
    "<h1>Hello!</h1>\n\n     <h2>How are you?</h2>\n\n    <p>Some text\n and more text.</p>"
  - What is to happen if there is a newline whitespace sequence
    at the end?
    "<h1>Hello!</h1>\n     <h2>How are you?</h2>\n    <p>Some text\n and more text.</p>\n     "

If I needed to do this, I'd look for an XML library in which I could do
    Fragment = xml:parse_fragment(String),
    [xml:unparse_element(Element) || Element <- Fragment]

(At least, that's my GUESS about what you want to achieve.)
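
The xml: calls above are placeholders for whatever library is at hand. One way to approximate them is with xmerl from OTP, sketched below on the assumption that the fragment is well-formed XML (so an unclosed <br> would be rejected). The module name xmerl_sketch and the wrapping <root> element are illustration choices, needed because xmerl_scan expects a single root element.

```erlang
-module(xmerl_sketch).
-export([blocks/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Wrap the fragment in a dummy root so it parses as one document,
%% then serialise each child element of that root back to markup.
%% Whitespace between elements arrives as #xmlText{} and is skipped.
blocks(String) ->
    {Root, _Rest} = xmerl_scan:string("<root>" ++ String ++ "</root>"),
    [lists:flatten(xmerl:export_simple_content([Element], xmerl_xml))
     || Element = #xmlElement{} <- Root#xmlElement.content].
```

For the sample string this should give one string per top-level element, so length/1 on the result is 3 regardless of how the whitespace between the elements is laid out.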

To me, this seems like a textbook example of why Strings Are Wrong
and regular expressions make it incredibly easy to do the wrong thing.