Diagnosing gen_server call timeouts

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Diagnosing gen_server call timeouts

Roger Lipscombe-2
I've got a gen_server:call that -- very occasionally -- suffers from a
timeout. Obviously this means that my gen_server is already busy doing
something else.

Is there a way that I can instrument gen_server so that it will log
something if it takes too long to return from a callback? That is: my
handle_call is timing out because, presumably, there's another
handle_call (or handle_info, etc.) that's blocking. I'd like *that* to
be logged if it takes longer than, say, 200 milliseconds to return.

Or, I guess, if the message queue length is excessive at any point.

Caveat: I've got ~20,000 of these gen_server processes, and this only
happens intermittently in production, so I'm thinking that tracing
*isn't* the right answer here.
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Diagnosing gen_server call timeouts

Siraaj Khandkar-3
On Thu, Sep 19, 2019 at 1:33 PM Roger Lipscombe <[hidden email]> wrote:

>
> I've got a gen_server:call that -- very occasionally -- suffers from a
> timeout. Obviously this means that my gen_server is already busy doing
> something else.
>
> Is there a way that I can instrument gen_server so that it will log
> something if it takes too long to return from a callback? That is: my
> handle_call is timing out because, presumably, there's another
> handle_call (or handle_info, etc.) that's blocking. I'd like *that* to
> be logged if it takes longer than, say, 200 milliseconds to return.
>
> Or, I guess, if the message queue length is excessive at any point.
>
> Caveat: I've got ~20,000 of these gen_server processes, and this only
> happens intermittently in production, so I'm thinking that tracing
> *isn't* the right answer here.

Idea from an ignorant (of built-in features) point of view: call handler
begins by sends a `{call_started, ID, Other_Interesting_Info}` to a
bookkeeping process, then, before replying, `{call_completed, ID}`.
Bookkeeper thus always knows the current in-progress calls and the stats
for completed ones, so it can now log whichever ones violated your
ostensible SLA. This sounds awfully similar to just posting timer
start/end messages to StatsD or some such, but is more flexible.
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Diagnosing gen_server call timeouts

Jesper Louis Andersen-2
In reply to this post by Roger Lipscombe-2
On Thu, Sep 19, 2019 at 7:33 PM Roger Lipscombe <[hidden email]> wrote:

Is there a way that I can instrument gen_server so that it will log
something if it takes too long to return from a callback? That is: my
handle_call is timing out because, presumably, there's another
handle_call (or handle_info, etc.) that's blocking. I'd like *that* to
be logged if it takes longer than, say, 200 milliseconds to return.


The calling process is terminated by means of an exit signal, by calling `exit(timeout)`. You can capture this with an exception handler and then use erlang:process_info/2 to capture the gen_servers state when this happens (or you can try sys:get_state if it does return at some point). Essentially you turn your calling process into the logger. This works insofar there are relatively few of these errors, and it can explain what the called process is sitting and doing while this happens.

One thing to be aware of is that if you capture the timeout, then the real reply message might arrive late in your mailbox. You will have to handle this accordingly, should you opt to make the process continue.


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Diagnosing gen_server call timeouts

Max Lapshin-2
Copy paste from our video_frame.erl:


send_frame(Stream, #video_frame{content = Content, codec = Codec, flavor = Flavor, dts = DTS} = Frame, SourceTag) when is_pid(Stream) ->
  Frame1 = case SourceTag of
    undefined -> Frame;
    _ -> Frame#video_frame{source = SourceTag}
  end,
  try gen_server:call(Stream, Frame1)
  catch
    exit:{normal, _} ->
      {error, stopped};
    exit:{noproc, _} ->
      {error, stopped};
    exit:{shutdown, _} ->
      {error, stopped};
    exit:{timeout, _}:Stack ->
      Dict = case process_info(Stream, dictionary) of
        {dictionary, Dict_} -> Dict_;
        undefined -> []
      end,
      RemoteStack = process_info(Stream, current_stacktrace),
      Name = case proplists:get_value(name, Dict, <<"dead stream">>) of
        N when is_binary(N) -> N;
        N when is_atom(N) -> N;
        N -> iolist_to_binary(io_lib:format("~p", [N]))
      end,
      Status = proplists:get_value(status, Dict),
      case Status of
        S when S == starting_transcoder -> ok;
        _ -> events:error("failed to send frame ~p,~p,~p,~p, source_tag=~p to ~s (~p) in status ~p\n~p\nremote stack:\n~p", [
                         Content, Codec, Flavor, round(DTS), Frame1#video_frame.source,
                         Name, Stream, Status, Stack, RemoteStack])
      end,
      % [io:format("~10.. s: ~p~n", [K,V]) || {K,V} <- process_info(Stream)]
      {error, timeout}
  end.



this code is helping us for years.  You should remove video specific stuff.

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Diagnosing gen_server call timeouts

Roger Lipscombe-2
Er, so I neglected to mention that the gen_server is remote (same
host, different node), so process_info doesn't work :(

On Fri, 20 Sep 2019 at 01:45, Max Lapshin <[hidden email]> wrote:

>
> Copy paste from our video_frame.erl:
>
>
> send_frame(Stream, #video_frame{content = Content, codec = Codec, flavor = Flavor, dts = DTS} = Frame, SourceTag) when is_pid(Stream) ->
>   Frame1 = case SourceTag of
>     undefined -> Frame;
>     _ -> Frame#video_frame{source = SourceTag}
>   end,
>   try gen_server:call(Stream, Frame1)
>   catch
>     exit:{normal, _} ->
>       {error, stopped};
>     exit:{noproc, _} ->
>       {error, stopped};
>     exit:{shutdown, _} ->
>       {error, stopped};
>     exit:{timeout, _}:Stack ->
>       Dict = case process_info(Stream, dictionary) of
>         {dictionary, Dict_} -> Dict_;
>         undefined -> []
>       end,
>       RemoteStack = process_info(Stream, current_stacktrace),
>       Name = case proplists:get_value(name, Dict, <<"dead stream">>) of
>         N when is_binary(N) -> N;
>         N when is_atom(N) -> N;
>         N -> iolist_to_binary(io_lib:format("~p", [N]))
>       end,
>       Status = proplists:get_value(status, Dict),
>       case Status of
>         S when S == starting_transcoder -> ok;
>         _ -> events:error("failed to send frame ~p,~p,~p,~p, source_tag=~p to ~s (~p) in status ~p\n~p\nremote stack:\n~p", [
>                          Content, Codec, Flavor, round(DTS), Frame1#video_frame.source,
>                          Name, Stream, Status, Stack, RemoteStack])
>       end,
>       % [io:format("~10.. s: ~p~n", [K,V]) || {K,V} <- process_info(Stream)]
>       {error, timeout}
>   end.
>
>
>
> this code is helping us for years.  You should remove video specific stuff.
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Diagnosing gen_server call timeouts

Danil Zagoskin-2
You can use rpc for that:

rpc:call(node(RPid), erlang, process_info, [RPid, [dictionary, current_stacktrace]]).

On Fri, Sep 20, 2019 at 2:57 PM Roger Lipscombe <[hidden email]> wrote:
Er, so I neglected to mention that the gen_server is remote (same
host, different node), so process_info doesn't work :(

On Fri, 20 Sep 2019 at 01:45, Max Lapshin <[hidden email]> wrote:
>
> Copy paste from our video_frame.erl:
>
>
> send_frame(Stream, #video_frame{content = Content, codec = Codec, flavor = Flavor, dts = DTS} = Frame, SourceTag) when is_pid(Stream) ->
>   Frame1 = case SourceTag of
>     undefined -> Frame;
>     _ -> Frame#video_frame{source = SourceTag}
>   end,
>   try gen_server:call(Stream, Frame1)
>   catch
>     exit:{normal, _} ->
>       {error, stopped};
>     exit:{noproc, _} ->
>       {error, stopped};
>     exit:{shutdown, _} ->
>       {error, stopped};
>     exit:{timeout, _}:Stack ->
>       Dict = case process_info(Stream, dictionary) of
>         {dictionary, Dict_} -> Dict_;
>         undefined -> []
>       end,
>       RemoteStack = process_info(Stream, current_stacktrace),
>       Name = case proplists:get_value(name, Dict, <<"dead stream">>) of
>         N when is_binary(N) -> N;
>         N when is_atom(N) -> N;
>         N -> iolist_to_binary(io_lib:format("~p", [N]))
>       end,
>       Status = proplists:get_value(status, Dict),
>       case Status of
>         S when S == starting_transcoder -> ok;
>         _ -> events:error("failed to send frame ~p,~p,~p,~p, source_tag=~p to ~s (~p) in status ~p\n~p\nremote stack:\n~p", [
>                          Content, Codec, Flavor, round(DTS), Frame1#video_frame.source,
>                          Name, Stream, Status, Stack, RemoteStack])
>       end,
>       % [io:format("~10.. s: ~p~n", [K,V]) || {K,V} <- process_info(Stream)]
>       {error, timeout}
>   end.
>
>
>
> this code is helping us for years.  You should remove video specific stuff.
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions


--
Danil Zagoskin | [hidden email]

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Diagnosing gen_server call timeouts

Roger Lipscombe-2
In reply to this post by Roger Lipscombe-2
In the end, I used 'dbg'; here's the relevant snippet:

AssertElapsed =
    fun(Pid, Call, Req, Then) ->
        Now = erlang:monotonic_time(),
        ElapsedMicros = erlang:convert_time_unit(Now - Then, native,
microsecond),
        if
            ElapsedMicros > 5000 ->
                io:format("~s ~p(~p) took ~B us\n",
                    [lists:flatten(ec_date:format("c",
calendar:universal_time())), Call, Req, ElapsedMicros]);
            true ->
                ok
        end,
        {message_queue_len, QueueLen} = erlang:process_info(Pid,
message_queue_len),
        if
            QueueLen > 5 ->
                io:format("~s ~p\n",
[lists:flatten(ec_date:format("c", calendar:universal_time())),
erlang:process_info(Pid, messages)]);
            true ->
                ok
        end
    end.

%% TState :: map(Fun -> [{Time, Arg0}]).
TState0 = #{}.

TFun =
    fun(_Msg = {trace, _Pid, call, {Mod, Fun, [Arg0 | _]}}, TState) ->
        % push an entry timestamp.
        Call = {erlang:monotonic_time(), Arg0},
        maps:update_with(Fun, fun(Calls) -> [Call | Calls] end, [Call], TState);

       (_Msg = {trace, _Pid, return_from, {Mod, Fun, _Arity},
_Result}, TState) ->
        % pop entry timestamp.
        [{Then, Arg0} | Calls] = maps:get(Fun, TState),
        AssertElapsed(Pid, Fun, Arg0, Then),
        maps:update(Fun, Calls, TState);

       (Msg, TState) ->
        io:format("~p\n", [Msg]),
        TState
    end.

dbg:start().
dbg:tracer(process, {TFun, TState0}).
dbg:tpl(Mod, handle_call, '_', dbg:fun2ms(fun(_) -> return_trace() end)).
dbg:tpl(Mod, handle_info, '_', dbg:fun2ms(fun(_) -> return_trace() end)).
dbg:p(Pid, c).

In my particular case, handle_info was occasionally taking ~30
seconds, depending on the input, meaning that if a gen_server:call
occurred immediately afterwards, there was a good chance it would time
out.

Aside: I *always* forget that maps:update_with/4 doesn't apply Fun to Init.

On Thu, 19 Sep 2019 at 18:33, Roger Lipscombe <[hidden email]> wrote:

>
> I've got a gen_server:call that -- very occasionally -- suffers from a
> timeout. Obviously this means that my gen_server is already busy doing
> something else.
>
> Is there a way that I can instrument gen_server so that it will log
> something if it takes too long to return from a callback? That is: my
> handle_call is timing out because, presumably, there's another
> handle_call (or handle_info, etc.) that's blocking. I'd like *that* to
> be logged if it takes longer than, say, 200 milliseconds to return.
>
> Or, I guess, if the message queue length is excessive at any point.
>
> Caveat: I've got ~20,000 of these gen_server processes, and this only
> happens intermittently in production, so I'm thinking that tracing
> *isn't* the right answer here.
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions