Substring look-up

Olivier Boudeville
Hi,

This may be a silly question but, since the Latin1 -> Unicode switch in
OTP 20.0, is there a (non-obsolete) way in the string module to look up
the index of one string within another, i.e. to find the location of a
given substring?

rstr/2 is supposed to be replaced with find/3, yet the former returns an
index whereas the latter returns a part of the original string. I could
not find a way to obtain a relevant index with any of the newer string
functions - whereas I would guess it is a fairly common need?

To give a bit more context, the goal was to prevent the implementation
of [1] from becoming obsolete; string:substr/3 and string:sub_string/3
are flagged as obsolete and may be replaced by slice/3 (see [2]); yet
what can be done for rstr/2?

(even if a smart use of some function were found to address the
particular need of this replace_extension/3 function, obtaining indexes
of substrings would still be useful in many cases, wouldn't it?)

Thanks in advance for any hint!

Best regards,

Olivier.


[1] Soon obsolete apparently:

% Returns a new filename whose extension has been updated.
%
% Ex: replace_extension("/home/jack/rosie.ttf", ".ttf", ".wav") should
% return "/home/jack/rosie.wav".
%
-spec replace_extension( file_path(), extension(), extension() ) ->
file_path().
replace_extension( FilePath, SourceExtension, TargetExtension ) ->

     case string:rstr( FilePath, SourceExtension ) of

         0 ->
             throw( { extension_not_found, SourceExtension, FilePath } );

         Index ->
             string:substr( FilePath, 1, Index-1 ) ++ TargetExtension

     end.


[2] BTW there is a change in the indexing convention that could be
better advertised in the doc:

 > string:substr("abc",1).
"abc"

 > string:sub_string("abc",1).
"abc"

 > string:slice("abc",1).
"bc"

 > string:slice("abc",0).
"abc"
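
As a small sketch of that convention change: a 1-based start position in the style of string:substr/3 can be mapped onto the 0-based string:slice/3 by subtracting one. The module name slice_sketch and the function substr_compat/3 are hypothetical, introduced only for illustration. Note also that slice/3 counts grapheme clusters, whereas substr/3 counted list elements, so the two only agree on strings where each element is one cluster.

```erlang
-module(slice_sketch).
-export([substr_compat/3]).

%% Hypothetical helper: takes a 1-based Start (string:substr/3 convention)
%% and delegates to string:slice/3, whose Start is 0-based.
substr_compat(String, Start, Length) when is_integer(Start), Start >= 1 ->
    string:slice(String, Start - 1, Length).
```

For instance, slice_sketch:substr_compat("abc", 2, 2) asks for two characters starting at the second one.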

--
Olivier Boudeville

Re: Substring look-up

Dan Gudmundsson-2
There are currently no replacements for those functions. The thought was that traversing a string is now a lot more expensive, so you should traverse it only once,
or at least think about it rather than mechanically swapping each old call for a new API call.

I believe 'string:split(File, Ext, trailing)' or why not 'string:replace(File, OldExt, NewExt, trailing)'
does what you want in this case, or you could use the 'filename' module for handling filenames.

But yes the string api could be extended with a function or two.
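
A minimal sketch of both suggestions, applied to the replace_extension/3 case from the original post. The module name ext_sketch and the function replace_extension_alt/3 are hypothetical. The sketch assumes the first variant should keep the original behaviour of throwing when the extension is absent and of dropping anything after the last occurrence; the string:replace/4 variant instead returns the path unchanged when nothing matches, and keeps any trailing part.

```erlang
-module(ext_sketch).
-export([replace_extension/3, replace_extension_alt/3]).

%% Variant based on string:split/3: splits once at the last occurrence of
%% SourceExtension. A one-element result means that nothing was found.
replace_extension(FilePath, SourceExtension, TargetExtension) ->
    case string:split(FilePath, SourceExtension, trailing) of
        [Prefix, _Suffix] ->
            unicode:characters_to_list([Prefix, TargetExtension]);
        [_Unsplit] ->
            throw({extension_not_found, SourceExtension, FilePath})
    end.

%% Variant based on string:replace/4, which returns a list of chardata
%% parts; they are flattened here into a plain string.
replace_extension_alt(FilePath, SourceExtension, TargetExtension) ->
    unicode:characters_to_list(
      string:replace(FilePath, SourceExtension, TargetExtension, trailing)).
```

Neither variant needs an index: the string is traversed once and the pieces are reassembled directly.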



Re: Substring look-up

zxq9-2
In reply to this post by Olivier Boudeville
On 2021/04/07 6:29, Olivier Boudeville wrote:

> Hi,
>
> It must be a silly question, but, since the Latin1 -> Unicode switch in
> OTP 20.0, is there a (non-obsolete) way in the string module to look-up
> the index of a string into another one, i.e. to find the location of a
> given substring?
>
> rstr/2 is supposed to be replaced with find/3, yet the former returns an
> index whereas the latter returns a part of the original string. I could
> not find a way to obtain a relevant index with any of the newer string
> functions - whereas I would guess it is a fairly common need?

The regex module's default run/2,3 behavior does what you are asking for.

   1> {ok, MP} = re:compile("foo", [unicode]).
   {ok,{re_pattern,0,1,0,<<69,82,...>>}}
   2> re:run("barfoobar", MP).
   {match,[{3,3}]}
   3> re:run("barfoobarfoo", MP).
   {match,[{3,3}]}
   4> re:run("barfoobarfoo", MP, [global]).
   {match,[[{3,3}],[{9,3}]]}

Note here the [global] option makes it continue beyond the first match.
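
One caveat worth sketching: with the unicode option, the offsets reported by re:run are 0-based byte positions in the UTF-8 encoding of the subject, not character positions. The following is a hedged sketch (module name re_index_sketch and function gc_index/2 are hypothetical) that converts such an offset into a grapheme cluster count by measuring the part of the subject that precedes the match. It assumes the pattern is a valid regular expression, so literal search strings containing regex metacharacters would need escaping first.

```erlang
-module(re_index_sketch).
-export([gc_index/2]).

%% Returns the 0-based index, counted in grapheme clusters, of the first
%% match of Regex in String, or 'nomatch'.
gc_index(String, Regex) ->
    %% Work on a UTF-8 binary so that the byte offset can be used directly.
    Subject = unicode:characters_to_binary(String),
    case re:run(Subject, Regex, [unicode]) of
        {match, [{ByteOffset, _ByteLength} | _]} ->
            Prefix = binary:part(Subject, 0, ByteOffset),
            %% string:length/1 counts grapheme clusters.
            string:length(Prefix);
        nomatch ->
            nomatch
    end.
```

On pure ASCII input the result equals the byte offset; it differs as soon as a character before the match takes more than one byte.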

Strings are in a sort of flux at the moment: we have finally got good
Unicode support, and on a broader set of representations than just
strings-as-lists, but while the string library module itself is being
converted and revamped, a few rough edges and obsolete warnings still
linger.

When all else fails, writing a custom function works great to cover the
gap -- luckily none of these sort of functions are particularly
difficult to figure out how to implement!

-Craig

Re: Substring look-up

Olivier Boudeville
In reply to this post by Dan Gudmundsson-2
Hello Dan,

Thanks for your answer; indeed rewriting most uses of string:rstr/2 in terms of string:split/3 and others should be possible, and each string traversal must be expensive now.

Best regards,

Olivier.


[2] Corresponding code:

% Index in a Unicode string, in terms of grapheme clusters (ex: not codepoints,
% not bytes).
%
-type gc_index() :: non_neg_integer().


% Returns the index, in terms of grapheme clusters, of the first occurrence of
% the specified pattern substring (if any) in the specified string.
%
-spec find_substring_index( unicode:chardata(), unicode:chardata() ) ->
                                    gc_index() | 'nomatch'.
find_substring_index( String, SearchPattern ) ->
    find_substring_index( String, SearchPattern, _Direction=leading ).


% Returns the index, in terms of grapheme clusters, of the first or last
% occurrence (depending on the specified direction) of the specified pattern
% substring (if any) in the specified string.
%
-spec find_substring_index( unicode:chardata(), unicode:chardata(),
                            string:direction() ) -> gc_index() | 'nomatch'.
find_substring_index( String, SearchPattern, Direction ) ->
    GCString = string:to_graphemes( String ),
    GCSearchPattern = string:to_graphemes( SearchPattern ),
    PseudoIndex = case Direction of

        leading ->
            string:str( GCString, GCSearchPattern );

        trailing ->
            string:rstr( GCString, GCSearchPattern )

    end,

    case PseudoIndex of

        0 ->
            nomatch;

        % Indexes of grapheme clusters are to start at 0, not 1:
        I ->
            I-1

    end.

Notes:

- probably not very efficient, but may be replaced later by an optimised version, with no API change for the user code

- the point is to reuse the *code* of string:str/2 and string:rstr/2 (even if this API is to disappear)

- maybe such functions could also operate directly on [grapheme_cluster()], to avoid too many conversions
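
For comparison, here is a hedged sketch of a variant that avoids string:str/2 and string:rstr/2 altogether, by combining string:find/3 with string:length/1. The module name find_index_sketch and the function index_via_find/3 are hypothetical. It assumes that the match returned by find/3 starts on a grapheme cluster boundary, which may not hold for unusual patterns (for example a lone combining character), and it costs two extra traversals for the length computations.

```erlang
-module(find_index_sketch).
-export([index_via_find/3]).

%% Returns the 0-based index, in grapheme clusters, of the first (leading)
%% or last (trailing) occurrence of SearchPattern in String, or 'nomatch'.
index_via_find(String, SearchPattern, Direction) ->
    case string:find(String, SearchPattern, Direction) of
        nomatch ->
            nomatch;
        Rest ->
            %% Rest is the part of String starting at the match, so the
            %% index is the number of clusters that precede it.
            string:length(String) - string:length(Rest)
    end.
```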



-- 
Olivier Boudeville

Re: Substring look-up

zxq9-2
On 2021/04/07 19:37, Olivier Boudeville wrote:
> Hello Dan,
>
> Thanks for your answer; indeed rewriting most uses of string:rstr/2 in
> terms of string:split/3 and others should be possible, and each string
> traversal must be expensive now.

Sometimes going the neanderthal route is a simplification:

1. unicode:characters_to_list/1
2. write a custom function to iterate as a list the original way

The more complex the original representation, and the more interesting the
sort of work you want done, the more this approach tends to save me in
both cognitive and processing overhead. My case may be highly optimized
for this, though, as I usually deal with English, German and Japanese
and rarely any other text input languages -- some input forms for other
languages can get pretty interesting and probably don't map as well to
the concept of "characters" after conversion.
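
A minimal sketch of that two-step route, under the assumption that a 0-based index counted in codepoints is what is wanted. The module name cp_index_sketch and the functions codepoint_index/2 and scan/3 are hypothetical.

```erlang
-module(cp_index_sketch).
-export([codepoint_index/2]).

%% Returns the 0-based index, in codepoints, of the first occurrence of
%% Pattern in String, or 'nomatch'. Both arguments may be any chardata.
codepoint_index(String, Pattern) ->
    %% Step 1: flatten both inputs to plain lists of codepoints.
    Haystack = unicode:characters_to_list(String),
    Needle = unicode:characters_to_list(Pattern),
    %% Step 2: iterate over the list "the original way".
    scan(Haystack, Needle, 0).

scan([], [], Index) ->
    Index;
scan([], _Needle, _Index) ->
    nomatch;
scan([_ | Rest] = Haystack, Needle, Index) ->
    case lists:prefix(Needle, Haystack) of
        true -> Index;
        false -> scan(Rest, Needle, Index + 1)
    end.
```

Since the comparison is done per codepoint, the result is only as meaningful as codepoints are for the text at hand.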

-Craig

Re: Substring look-up

Olivier Boudeville
Hello Craig,

Thanks for your answer and for the regex-based solution.

Regarding unicode:characters_to_list/1, my understanding is that it
would mean operating on codepoints rather than on grapheme clusters, and
I suppose that this *may* result in unintended matches if searching for
clusters, i.e. "user-perceived characters" (often the ones that
matter/make sense?).

For example, if grapheme clusters such as GC1=[A,B], GC2=[C,D] and
GC3=[B,C] existed (where the A, B, C and D variables are codepoints),
then searching for a substring that would contain only the GC3 cluster
in a flattened string containing GC1 then GC2 (i.e. [A,B,C,D]) would
match, whereas it should not (unless such a case is known to be
impossible by design?).
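
A simpler, concrete variant of this concern (not the exact overlapping case described above) can be sketched with a base letter followed by a combining mark: "e" plus U+0301 forms a single grapheme cluster, yet a codepoint-level search for "e" alone still matches inside it. The module name gc_vs_cp_sketch and all function names are hypothetical.

```erlang
-module(gc_vs_cp_sketch).
-export([codepoint_match/2, grapheme_match/2]).

%% Searches after flattening both strings to codepoints.
codepoint_match(String, Pattern) ->
    contains(unicode:characters_to_list(String),
             unicode:characters_to_list(Pattern)).

%% Searches after splitting both strings into grapheme clusters.
grapheme_match(String, Pattern) ->
    contains(string:to_graphemes(String), string:to_graphemes(Pattern)).

%% True if Needle occurs as a contiguous sublist of Haystack.
contains([], Needle) ->
    Needle =:= [];
contains([_ | Rest] = Haystack, Needle) ->
    lists:prefix(Needle, Haystack) orelse contains(Rest, Needle).
```

With the string "e\x{301}" and the pattern "e", codepoint_match/2 reports a match while grapheme_match/2 does not, since the only cluster in the string is the accented letter as a whole.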

(disclaimer: I do not claim to understand Unicode correctly; I just
need to cope with various input filenames - not even mentioning the
so-called "raw" ones ;-))

Best regards,

Olivier.




--
Olivier Boudeville