Core Erlang: Associating Function Names to Function Bodies

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Core Erlang: Associating Function Names to Function Bodies

Christopher Meiklejohn-4
As part of a static analysis I'm writing, I'm trying to associate
function names with their function bodies so I can provide information
about the calling functions to users as part of the completed
analysis.  However, I'm running into difficulty performing this
mapping.

The problem I'm running into is the following:

1.) In the Core Erlang AST, named functions are defined by associating
a c_var node in the form of {Function, Arity} to a subtree containing
the c_fun function body.
2.) Also, in the Core Erlang AST, calls to functions are also
represented as c_var nodes in the form of {Function, Arity}.

The cerl_trees:fold, mapfold all perform postorder traversals.  With a
postorder traversal, there may be an arbitrary distance in the fold
between when a function is defined and where it is given a name.  With
reverse postorder, my understanding is that these nodes would be
directly next to each other, which would make the mapping easier.

My initial approach was to attempt to, when a c_var node was found in
the proper form, store a candidate name for the next function that's
found and perform the association that way.  The problem that arises
from this approach is that there can be an arbitrary number of c_var
nodes for function invocations between the definition of the function
and it's actual body.  The same can also happen for any functions that
occur in between as well.

Therefore, my analysis incorrectly associates function bodies to
function names for any function that calls another function in the
body -- incorrectly thinking that the nearest c_var node containing a
function name is it's associated name.

I assume this must have been done for a previous analysis performed on
Erlang, but I haven't found anything.  Would a reverse postorder
traversal make this easier?  Is there a better way to identify the
mapping between c_var nodes and c_funs that are being assigned to
them?  Has this been done before?

Thanks for the advice,
Christopher
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Core Erlang: Associating Function Names to Function Bodies

Christopher Meiklejohn-4
On Sun, Mar 10, 2019 at 1:02 PM Christopher Meiklejohn
<[hidden email]> wrote:

>
> As part of a static analysis I'm writing, I'm trying to associate
> function names with their function bodies so I can provide information
> about the calling functions to users as part of the completed
> analysis.  However, I'm running into difficulty performing this
> mapping.
>
> The problem I'm running into is the following:
>
> 1.) In the Core Erlang AST, named functions are defined by associating
> a c_var node in the form of {Function, Arity} to a subtree containing
> the c_fun function body.
> 2.) Also, in the Core Erlang AST, calls to functions are also
> represented as c_var nodes in the form of {Function, Arity}.
>
> The cerl_trees:fold, mapfold all perform postorder traversals.  With a
> postorder traversal, there may be an arbitrary distance in the fold
> between when a function is defined and where it is given a name.  With
> reverse postorder, my understanding is that these nodes would be
> directly next to each other, which would make the mapping easier.
>
> My initial approach was to attempt to, when a c_var node was found in
> the proper form, store a candidate name for the next function that's
> found and perform the association that way.  The problem that arises
> from this approach is that there can be an arbitrary number of c_var
> nodes for function invocations between the definition of the function
> and it's actual body.  The same can also happen for any functions that
> occur in between as well.
>
> Therefore, my analysis incorrectly associates function bodies to
> function names for any function that calls another function in the
> body -- incorrectly thinking that the nearest c_var node containing a
> function name is it's associated name.
>
> I assume this must have been done for a previous analysis performed on
> Erlang, but I haven't found anything.  Would a reverse postorder
> traversal make this easier?  Is there a better way to identify the
> mapping between c_var nodes and c_funs that are being assigned to
> them?  Has this been done before?

Just for those who are interested, I did figure out one solution to
the problem which I have tried out on several models now that appears
to be working so far.

Here's the rough outline of my currently-working solution:

Since there is no node directly connecting the c_var of a variable
assignment along with the body of the function, I had to perform a
traversal where I try to match function bodies that I find along with
the variables they are being assigned to.  To do this, I did the
following:

1.) I first use cerl_trees:fold to perform a postorder traversal of
the tree, building up a list of nodes to explore.
2.) I then lists:foldr over this, allowing me to perform effectively a
reverse postorder traversal of the graph, where, I will encounter
function bodies (c_fun) prior to encountering function variables.
(c_var)
3.) I perform an analysis that allows me to match up names to bodies
using the following heuristic:

a.) Whenever I encounter a function body, I can determine from the
node whether or not the function body is the body of an anonymous
function or not.  Anonymous functions are annotated with a randomly
assigned label which includes the enclosing function and the line
where the anonymous function was generated.  This allows me to rule
out the top-level functions from functions that are created from
within a top-level function.  When I discover one of these functions,
I keep track of this in a candidate list of potential function body
matches for the function variable I am assigning to.
b.) Whenever I encounter a c_var in the form of {Function, Arity}, I
know that I'm dealing with an application or an assignment.
(Curiously, why is there not a node that directly associates the c_var
to the c_fun -- this would have made things *significantly* easier,
but does not exist in the Core Erlang when performing a traversal via
cerl_trees.)  If the function contains an annotation containing a line
number of invocation, then I know it's the *application* of this
function, and not the *abstraction* of this function.  (Again, why
does c_var treat LHS assignments the same as RHS use of the function
-- if this was more specific, it would have again, made things
*significantly* easier.)  If I'm looking for a candidate match between
body and function variable, then I perform a match based on this.
c.) Otherwise, either the c_var is the use of a function or the c_fun
is the creation of an anonymous function, and it's subsequently
ignored by the matching procedure.

On the 'analysis' branch of this, I have a prototype of this
implementation that I've tested on several modules that appears to be
working.

If this could have been done a different, easier way, I'd appreciate
knowing about it.

Thanks,
Christopher
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Core Erlang: Associating Function Names to Function Bodies

Richard Carlsson-3
The problem seems to mainly be that you would like to be able to use cerl_trees:fold() to do the traversing for you. But in general, a fold doesn't pass the operator any info about the context of each element in relation to its neighbouring elements, in the traversal order.

The representation of Core Erlang syntax trees was specifically designed so that only constructs that have a meaningful semantics on their own will be nodes in the tree. So, you get nodes for literals, calls, tuples, funs, letrecs, etc. (including modules). Many of these have a simple sequence of child nodes, as for a cons cell or a tuple, and in a post-order traversal, you know that you have just seen all the child nodes when you see the tuple node itself. But for some other constructs, like a 'fun (X1,...Xn) -> Body' or a 'case Expr of <Clauses> end' or 'module Name [ <Exports> ... ] attributes [  <Attributes>... ] F1/A1 = fun ... F2/A2 = fun ... end', there is nothing that tells a fold how to separate the subgroups of child nodes, especially when they all can have varying length.

To fix this, a syntax tree representation can choose to use specific grouping nodes, like {module, [{name, [...]}, {exports, [...]}, {attributes, [...]}, {definitions, [...]} ]}, so that a fold would see "aha, I just got done with a 'definitions' group". I guess that's what you wanted, Chris. It does simplify some things in traversals. The drawback, and the reason we decided to avoid all such nodes, is that it forces all code that works on those syntax trees to understand these grouping nodes, and treat them separately from meaning-carrying nodes. In our experience, that is a big pain. I guess what we should have done but never got around to is write some really good tree walking functions that can tell you the context, choose a traversal order, and such things. As is, the cerl:subtrees() function is your best friend. The cerl_trees:fold() was just intended for simple things like checking for occurrences, counting tree sizes and whatnot.

See the example below for how you can label a Core Erlang tree and traverse it to compute mappings between labels and function names.
It's written as a core transform, so you can try it on an arbitrary module foo by compiling it with c(foo,[{core_transform,cerl_funcmap}]).

        /Richard

%% =====================================================================
%% @doc Core transform example for computing and printing a label mapping
%% @author Richard Carlsson <[hidden email]>

-module(cerl_funcmap).

-export([core_transform/2]).

core_transform(T, _Opts) ->
    {T1, _MaxL} = cerl_trees:label(T),
    %%io:format("~n~s.~n", [cerl_prettypr:format(T1)]),
    {Ms, _Ls} = visit(T1),
    io:format("Functions to labels:~n~p.~n", [maps:from_list(Ms)]),
    Rev = lists:usort(
            lists:flatmap(fun ({V, Ls}) -> [{L, V} || L <- Ls] end,
                          Ms)),
    io:format("Labels to functions:~n~p.~n", [maps:from_list(Rev)]),
    T. % return original T

visit(T) ->
    case cerl:type(T) of
        letrec ->
            {visit_defs(cerl:letrec_defs(T), []),
             get_label(T)};
        module ->
            {visit_defs(cerl:module_defs(T), []),
             get_label(T)};
        _ ->
            case cerl:subtrees(T) of
                [] ->
                    {[], get_label(T)};
                Gs ->
                    {Ms, Ls} = visit_list(lists:flatten(Gs)),
                    {Ms, get_label(T) ++ Ls}
            end
    end.

visit_list(Ts) ->
    lists:foldl(fun (T, {Ms0, Ls0}) ->
                        {Ms, Ls} = visit(T),
                        {Ms ++ Ms0, Ls ++ Ls0}
                end,
                {[], []},
                Ts).

visit_defs([{V, T} | Ds], Ms0) ->
    {Ms, Ls} = visit(T),
    Ls1 = ordsets:union(get_label(V), ordsets:from_list(Ls)),
    visit_defs(Ds, [{cerl:var_name(V), Ls1}] ++ Ms ++ Ms0);
visit_defs([], Ms) ->
    Ms.

get_label(T) ->
    case cerl:get_ann(T) of
[{label, L} | _] -> [L];
_ -> []
    end.

%% ============================================================


Den mån 11 mars 2019 kl 15:49 skrev Christopher Meiklejohn <[hidden email]>:
On Sun, Mar 10, 2019 at 1:02 PM Christopher Meiklejohn
<[hidden email]> wrote:
>
> As part of a static analysis I'm writing, I'm trying to associate
> function names with their function bodies so I can provide information
> about the calling functions to users as part of the completed
> analysis.  However, I'm running into difficulty performing this
> mapping.
>
> The problem I'm running into is the following:
>
> 1.) In the Core Erlang AST, named functions are defined by associating
> a c_var node in the form of {Function, Arity} to a subtree containing
> the c_fun function body.
> 2.) Also, in the Core Erlang AST, calls to functions are also
> represented as c_var nodes in the form of {Function, Arity}.
>
> The cerl_trees:fold, mapfold all perform postorder traversals.  With a
> postorder traversal, there may be an arbitrary distance in the fold
> between when a function is defined and where it is given a name.  With
> reverse postorder, my understanding is that these nodes would be
> directly next to each other, which would make the mapping easier.
>
> My initial approach was to attempt to, when a c_var node was found in
> the proper form, store a candidate name for the next function that's
> found and perform the association that way.  The problem that arises
> from this approach is that there can be an arbitrary number of c_var
> nodes for function invocations between the definition of the function
> and it's actual body.  The same can also happen for any functions that
> occur in between as well.
>
> Therefore, my analysis incorrectly associates function bodies to
> function names for any function that calls another function in the
> body -- incorrectly thinking that the nearest c_var node containing a
> function name is it's associated name.
>
> I assume this must have been done for a previous analysis performed on
> Erlang, but I haven't found anything.  Would a reverse postorder
> traversal make this easier?  Is there a better way to identify the
> mapping between c_var nodes and c_funs that are being assigned to
> them?  Has this been done before?

Just for those who are interested, I did figure out one solution to
the problem which I have tried out on several models now that appears
to be working so far.

Here's the rough outline of my currently-working solution:

Since there is no node directly connecting the c_var of a variable
assignment along with the body of the function, I had to perform a
traversal where I try to match function bodies that I find along with
the variables they are being assigned to.  To do this, I did the
following:

1.) I first use cerl_trees:fold to perform a postorder traversal of
the tree, building up a list of nodes to explore.
2.) I then lists:foldr over this, allowing me to perform effectively a
reverse postorder traversal of the graph, where, I will encounter
function bodies (c_fun) prior to encountering function variables.
(c_var)
3.) I perform an analysis that allows me to match up names to bodies
using the following heuristic:

a.) Whenever I encounter a function body, I can determine from the
node whether or not the function body is the body of an anonymous
function or not.  Anonymous functions are annotated with a randomly
assigned label which includes the enclosing function and the line
where the anonymous function was generated.  This allows me to rule
out the top-level functions from functions that are created from
within a top-level function.  When I discover one of these functions,
I keep track of this in a candidate list of potential function body
matches for the function variable I am assigning to.
b.) Whenever I encounter a c_var in the form of {Function, Arity}, I
know that I'm dealing with an application or an assignment.
(Curiously, why is there not a node that directly associates the c_var
to the c_fun -- this would have made things *significantly* easier,
but does not exist in the Core Erlang when performing a traversal via
cerl_trees.)  If the function contains an annotation containing a line
number of invocation, then I know it's the *application* of this
function, and not the *abstraction* of this function.  (Again, why
does c_var treat LHS assignments the same as RHS use of the function
-- if this was more specific, it would have again, made things
*significantly* easier.)  If I'm looking for a candidate match between
body and function variable, then I perform a match based on this.
c.) Otherwise, either the c_var is the use of a function or the c_fun
is the creation of an anonymous function, and it's subsequently
ignored by the matching procedure.

On the 'analysis' branch of this, I have a prototype of this
implementation that I've tested on several modules that appears to be
working.

If this could have been done a different, easier way, I'd appreciate
knowing about it.

Thanks,
Christopher
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions