regexp module with submatches available


Pascal Brisset
We have extended the regexp module in OTP R7B-1 with support for
submatches (the '\(...\)' syntax in SED regular expressions).
This makes it possible to retrieve several components of a match with
a single evaluation of a regexp. For example:

1> RE_URL="\\(.+\\)://\\(.+\\)\\(/.+\\)(\\?\\(.*\\)(&\\(.*\\))*)?",
1> gregexp:groups("http://localhost:81/script?arg&arg2&arg3", RE_URL).
{match,["http","localhost:81","/script","arg","arg2","arg3"]}

gregexp is available from http://www.cellicium.com/erlang/contribs/
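
For readers without gregexp at hand, here is a minimal sketch of the same idea using the re module shipped with current OTP releases. This is an assumption made for illustration, not the gregexp code: re uses PCRE syntax, so groups are written with bare parentheses rather than the SED-style '\(...\)'. The module name url_groups and the simplified URL pattern are hypothetical.

```erlang
-module(url_groups).
-export([groups/2]).

%% Return {match, Substrings} with one string per parenthesised group,
%% or nomatch. Built on re:run/3, not on gregexp.
groups(Subject, Pattern) ->
    re:run(Subject, Pattern, [{capture, all_but_first, list}]).
```

With this sketch, url_groups:groups("http://localhost:81/script", "(.+)://([^/]+)(/.+)") returns {match,["http","localhost:81","/script"]}.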

-- Pascal Brisset <pascal.brisset> +33141986741 --
--- Cellicium R&D | 73 avenue Carnot | 94230 Cachan | France ---




Robert Virding
Pascal Brisset <pascal.brisset> writes:
>We have extended the regexp module in OTP R7B-1 with support for
>submatches (the '\(...\)' syntax in SED regular expressions).
>This makes it possible to retrieve several components of a match with
>a single evaluation of a regexp. For example:
>
>1> RE_URL="\\(.+\\)://\\(.+\\)\\(/.+\\)(\\?\\(.*\\)(&\\(.*\\))*)?",
>1> gregexp:groups("http://localhost:81/script?arg&arg2&arg3", RE_URL).
>{match,["http","localhost:81","/script","arg","arg2","arg3"]}

Something like this is already planned for the next version.  It
follows the AWK style so it only exists in the substitution
functions.  You can extract sub-matches with a \1 - \9 syntax in the
replacement string.  The main question left is whether to change the
old sub/gsub functions or to only have it in a new gensub function.
Gensub is a new function which allows more control.  AWK only has it
in gensub.
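
As an illustration of the \1 - \9 style of sub-match reference in a replacement string, here is a small sketch using re:replace/4 from current OTP. It is not the sub/gsub/gensub interface discussed here, only a comparable mechanism; the module and function names are hypothetical.

```erlang
-module(backref_demo).
-export([swap_date/1]).

%% Rewrite "YYYY-MM-DD" as "DD/MM/YYYY". The replacement string refers
%% to the three groups of the pattern as \1, \2 and \3.
swap_date(Date) ->
    re:replace(Date, "(\\d+)-(\\d+)-(\\d+)", "\\3/\\2/\\1",
               [{return, list}]).
```

For example, backref_demo:swap_date("2001-03-15") returns "15/03/2001". Only the first match is rewritten because the global option is not given.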

Having a call to just match and extract the groups would probably be
useful.  The question is whether to return the actual substrings or
return a list of start/length pairs like match does today.
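
The two return shapes can be compared side by side with the re module of current OTP, which offers both through its capture option. This is a sketch under that assumption, not the planned regexp interface; the module and function names are hypothetical. Note that re reports zero-based offsets, whereas lists:sublist/3 counts from 1.

```erlang
-module(capture_shapes).
-export([as_strings/2, as_positions/2, positions_to_strings/2]).

%% Groups returned as the actual substrings.
as_strings(S, RE) ->
    re:run(S, RE, [{capture, all_but_first, list}]).

%% Groups returned as {Start, Length} pairs (Start is zero-based).
as_positions(S, RE) ->
    re:run(S, RE, [{capture, all_but_first, index}]).

%% Convert pairs to substrings after the fact.
positions_to_strings(S, Pairs) ->
    [slice(S, P) || P <- Pairs].

%% re marks a group that did not take part in the match as {-1,0}.
slice(_S, {-1, 0}) -> [];
slice(S, {Start, Len}) -> lists:sublist(S, Start + 1, Len).
```

Returning pairs keeps the cheaper form available to callers who only need positions, and the substrings can always be derived from them.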

Comments?

        Robert



Pascal Brisset
Robert Virding writes:
 > Having a call to just match and extract the groups would probably be
 > useful.  The question is whether to return the actual substrings or
 > return a list of start/length pairs like match does today.

Actually, gregexp:groups just maps lists:sublist/3 over a list of
start/length pairs returned by gregexp:re_apply/3.

Why not export re_apply/3 ? I'd rather use it than match/2, which
seems to iterate over the whole string to find the first match.
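
A rough sketch of what "apply the regexp at one position only" can look like with the re module of current OTP, assuming that is the behaviour wanted from re_apply/3 (its exact semantics are not spelled out here). The anchored option stops re from scanning ahead for a later match. The module and function names are hypothetical.

```erlang
-module(anchored_demo).
-export([match_at/3]).

%% Try Pattern only at the one-based position Pos of Subject.
%% Returns {match, [{Start, Len}]} with a zero-based Start, or nomatch.
match_at(Subject, Pos, Pattern) ->
    re:run(Subject, Pattern,
           [anchored, {offset, Pos - 1}, {capture, first, index}]).
```

For example, anchored_demo:match_at("key=value", 5, "[a-z]+") returns {match,[{4,5}]}, while the same call with position 4 returns nomatch because the "=" at that position does not match.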

-- Pascal Brisset <pascal.brisset> +33(0)141986741 --
-- Cellicium | 73 avenue Carnot | F-94230 Cachan +33(0)685110788 --


%% From gregexp.erl; re_apply/3 is internal to that module.
groups(S, ParsedRegExp) ->
    case re_apply(S, 1, ParsedRegExp) of
        {match, _RestPos, _Rest, Groups} ->
            %% Turn each {Start,Len} pair into the substring it denotes.
            GetGroup = fun ({Start,Len}) -> lists:sublist(S,Start,Len) end,
            {match, lists:map(GetGroup, Groups)};
        Other -> Other
    end.




Robert Virding
Pascal Brisset <pascal.brisset> writes:

>Robert Virding writes:
> > Having a call to just match and extract the groups would probably be
> > useful.  The question is whether to return the actual substrings or
> > return a list of start/length pairs like match does today.
>
>Actually, gregexp:groups just maps lists:sublist/3 over a list of
>start/length pairs returned by gregexp:re_apply/3.
>
>Why not export re_apply/3 ? I'd rather use it than match/2, which
>seems to iterate over the whole string to find the first match.

Probably will. At the moment I am in a mood to open up some of the
library modules: I have done erl_eval for Joe and will probably do
shell as well. So there may well be a re_apply/3 for those who want to
roll their own, plus some higher-level interfaces for the standard uses.

There are also some bug fixes.

From the manual:

       match(String, RegExp) -> MatchRes

              Finds  the  first,  longest  match  of  the regular
              expression RegExp in String. This function searches
              for  the  longest  possible  match  and returns the
              first one found if there are several expressions of
              the same length.

       first_match(String, RegExp) -> MatchRes

              Finds the first match  of  the  regular  expression
              RegExp  in String. This call is usually faster than
              match and it is also a useful way to ascertain that
              a match exists.

That minus some uninteresting junk about arguments, return values and
errors.  :-)

        Robert