exception exit: timeout in gen_server:call


exception exit: timeout in gen_server:call

Roberto Ostinelli

All,
I am experiencing the following error when calling a transaction in poolboy, as per the README:

equery(PoolName, Stmt, Params) ->
    poolboy:transaction(PoolName, fun(Worker) ->
        gen_server:call(Worker, {equery, Stmt, Params})
    end).
** exception exit: {timeout,{gen_server,call,
                                        [keys_db,{checkout,#Ref<0.0.1.156295>,true},5000]}}
     in function  gen_server:call/3 (gen_server.erl, line 212)
     in call from poolboy:checkout/3 (/home/ubuntu/workspace/myapp/_build/default/lib/poolboy/src/poolboy.erl, line 55)
     in call from poolboy:transaction/3 (/home/ubuntu/workspace/myapp/_build/default/lib/poolboy/src/poolboy.erl, line 74)

The process message queue keeps increasing, and I can see the following:

3> erlang:process_info(whereis(keys_db)).  
[{registered_name,keys_db},
 {current_function,{gen,do_call,4}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,11906},
 {messages,[{'$gen_cast',{cancel_waiting,#Ref<0.0.1.138090>}},
            {'$gen_call',{<0.15224.0>,#Ref<0.0.1.139621>},
                         {checkout,#Ref<0.0.1.139620>,true}},
            {'$gen_call',{<0.15139.0>,#Ref<0.0.1.139649>},
                         {checkout,#Ref<0.0.1.139648>,true}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.1.138159>}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.1.138175>}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.1.138232>}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.1.138252>}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.1.138261>}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.1.138286>}},
            {'$gen_call',{<0.15235.0>,#Ref<0.0.1.139774>},
                         {checkout,#Ref<0.0.1.139773>,true}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.2.77777>}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.1.138318>}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.1.138336>}},
            {'$gen_call',{<0.15233.0>,#Ref<0.0.1.139816>},
                         {checkout,#Ref<0.0.1.139815>,true}},
            {'$gen_call',{<0.15245.0>,#Ref<0.0.1.139854>},
                         {checkout,#Ref<0.0.1.139853>,true}},
            {'$gen_call',{<0.15237.0>,#Ref<0.0.2.78173>},
                         {checkout,#Ref<0.0.2.78172>,...}},
            {'$gen_cast',{cancel_waiting,#Ref<0.0.1.138407>}},
            {'$gen_call',{<0.15228.0>,...},{...}},
            {'$gen_call',{...},...},
            {'$gen_call',...},
            {...}|...]},
 {links,[<0.714.1>,<0.817.1>,<0.947.1>,<0.1015.1>,<0.1045.1>,
         <0.1048.1>,<0.1038.1>,<0.983.1>,<0.1002.1>,<0.962.1>,
         <0.877.1>,<0.909.1>,<0.938.1>,<0.892.1>,<0.849.1>,<0.866.1>,
         <0.832.1>,<0.765.1>,<0.789.1>,<0.804.1>|...]},
 {dictionary,[{'$initial_call',{poolboy,init,1}},
              {'$ancestors',[pgpool_sup,<0.673.0>]}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.672.0>},
 {total_heap_size,393326},
 {heap_size,196650},
 {stack_size,33},
 {reductions,14837255},
 {garbage_collection,[{max_heap_size,#{error_logger => true,
                                       kill => true,
                                       size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,10},
                      {minor_gcs,3}]},
 {suspending,[]}]

Does someone have an insight into what may be going wrong? I see that the process status is waiting...

Thank you,
r.


Re: exception exit: timeout in gen_server:call

Roberto Ostinelli
Forgot to mention: this happens in a completely random way.


Re: exception exit: timeout in gen_server:call

Jesper Louis Andersen
The keys_db process is currently executing gen:do_call/4.

It is blocked on some other process somewhere, waiting for it to respond, before it can get back to the 11906 requests in its queue :) Chances are that the next request in the queue will trigger another gen:do_call/4 for the 11905 that are left. And in the meantime, 30 new requests have arrived, so we are now at 11935.

In short, it is likely that something is running too slowly to handle the current load on the system, and furthermore there is no flow control in the system to make the callers reduce the load.

Since it happens randomly, it is likely there is a single large request, issued by a single user now and then, and this request stalls the system: probably enough for the queue to grow and errors to start happening. Put a limit on what can be requested, and force the client to cooperate by having it call in with a "plz continue" token.

You can also ask the system for a more detailed stack trace; see erlang:process_info/2 with the current_stacktrace item when it goes bad. This can often tell you which gen_server call is being made and to whom, narrowing down the problem.
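For example, from a shell on the affected node while the queue is growing (keys_db being the registered pool name from your report):

1> erlang:process_info(whereis(keys_db), current_stacktrace).

The returned frames show exactly which call the pool process itself is blocked in, and towards which process.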


--
J.

Re: exception exit: timeout in gen_server:call

Roberto Ostinelli
This function is nothing but a single Postgres query, a “select * from users where id = ‘123’”, properly indexed.

The only thing I can see is the latency towards the DB (the infrastructure spans AWS regions, unfortunately). It really is that at a certain moment, randomly (sometimes after 5 minutes, other times after 2 days), this happens and there’s no recovery whatsoever.

I could start rejecting HTTP calls if poolboy can’t perform a checkout; I’m not sure how to do that yet (I need to look at the library more closely).
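Something along these lines might work (an untested sketch; it relies on poolboy's non-blocking checkout, which returns the atom 'full' instead of waiting when no worker is free, and equery_nb is just a name I made up):

equery_nb(PoolName, Stmt, Params) ->
    case poolboy:checkout(PoolName, false) of
        full ->
            %% no free worker right now: fail fast and let the HTTP
            %% layer turn this into a 503 instead of piling onto the queue
            {error, overload};
        Worker when is_pid(Worker) ->
            try
                gen_server:call(Worker, {equery, Stmt, Params})
            after
                ok = poolboy:checkin(PoolName, Worker)
            end
    end.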


Re: exception exit: timeout in gen_server:call

Jesper Louis Andersen
On Tue, Dec 3, 2019 at 7:37 PM Roberto Ostinelli <[hidden email]> wrote:
> This function is nothing but a single Postgres query, a “select * from users where id = ‘123’”, properly indexed.


If it transfers 2 gigabytes of data, then this single Postgres query is going to take some time.

If someone is doing updates which require a full table lock on the users table, this query is going to take some time.

> The only thing I can see is the latency towards the DB (the infrastructure spans AWS regions, unfortunately). It really is that at a certain moment, randomly (sometimes after 5 minutes, other times after 2 days), this happens and there’s no recovery whatsoever.


Other tricks:

* If your initial intuitive drill down into the system bears no fruit, start caring about facts.
* Measure the maximal latency over the equery call you've made for a 10-15 second period and plot it (a rough sketch of one way to capture this follows this list).
* We are interested in microstutters in the pacing. If they are present, it is likely there is some problem which then suddenly tips the system over. If not, then the cause is more likely something we don't know about yet.
* The database might be fast, but there is still latency to the first byte, and there is the transfer time to the last byte. If a query takes 50ms, say, then you are only going to run 20 of those per second per connection.
* Pipeline the queries. A query which waits for an answer affects every sibling query as well.
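
A rough way to capture the per-call latency without extra dependencies (timed_equery/3 is only an illustrative name; in practice you would feed the numbers into a histogram rather than log every call):

timed_equery(PoolName, Stmt, Params) ->
    %% timer:tc/1 returns {Microseconds, Result} for the wrapped call
    {Micros, Result} = timer:tc(fun() -> equery(PoolName, Stmt, Params) end),
    error_logger:info_msg("equery took ~p us~n", [Micros]),
    Result.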

Down the line:

* Postgres can log slow queries. Turn that on.
* Postgres can log whenever it holds a lock for more than a certain time window. Turn that on (example settings below).
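
For example, in postgresql.conf (the thresholds are only illustrative):

# log any statement that runs longer than 100 ms
log_min_duration_statement = 100
# log when a session waits on a lock for longer than deadlock_timeout
log_lock_waits = on
deadlock_timeout = 200ms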

Narrow down where the problem can occur by having systems provide facts to you. Don't go for "what is wrong?"; go for "what would I like to know?". This helps observability (in the Control Theory / Charity Majors sense).


Re: exception exit: timeout in gen_server:call

Roberto Ostinelli
> If it transfers 2 gigabytes of data, then this single Postgres query is going to take some time.

Of course, but this is not the case: the data is a very small packet.

> If someone is doing updates which require a full table lock on the users table, this query is going to take some time.

No locks, read only: “select * from users where id = ‘123’”. Writes happen on user registration only, so they are irrelevant here.

> Other tricks:
>
> * If your initial intuitive drill down into the system bears no fruit, start caring about facts.
> * Measure the maximal latency over the equery call you've made for a 10-15 second period. Plot it.
> * We are interested in microstutters in the pacing. If they are present, it is likely there is some problem which then suddenly tips the system over. If not, then it is more likely that it is something we don't know.
> * The database might be fast, but there is still latency to the first byte, and there is the transfer time to the last byte. If a query is 50ms, say, then you are only going to run 20 of those per connection.
> * Pipeline the queries. A query which waits for an answer affects every sibling query as well.
>
> Down the line:
>
> * Postgres can log slow queries. Turn that on.
> * Postgres can log whenever it holds a lock for more than a certain time window. Turn that on.
>
> Narrow down where the problem can occur by having systems provide facts to you. Don't go for "what is wrong?" Go for "What would I like to know?". This helps observability (In the Control Theory / Charity Majors sense).

I did most of those. There are no slow queries; the database is literally sleeping. The issue is the latency, and the variance in the latency of responses.

You nailed it at first: it's a matter of flow control. Simply put, my HTTP server is faster than the queries to the database (due to the extra latency), even though prepared statements are used. This never occurs on other installations where the databases are local, so I simply underestimated this aspect.

As per my previous message, I'm going to add a flow restriction: if there are no available database workers in the pool, I'll reject the HTTP call (I guess with a 503 or similar).
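
Roughly what I have in mind, assuming a Cowboy 2 handler (the handler, the pool name, the equery_nb wrapper from my earlier message and render/1 are all made-up names):

init(Req0, State) ->
    UserId = cowboy_req:binding(id, Req0),
    case equery_nb(keys_db, "select * from users where id = $1", [UserId]) of
        {error, overload} ->
            %% no free DB worker: shed load instead of queueing
            {ok, cowboy_req:reply(503, Req0), State};
        Result ->
            Body = render(Result),  %% made-up serialisation of the rows
            {ok, cowboy_req:reply(200, #{}, Body, Req0), State}
    end.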

Thank you for your help,
r.

Re: exception exit: timeout in gen_server:call

Roger Lipscombe
In reply to this post by Jesper Louis Andersen
On Tue, 3 Dec 2019 at 19:32, Jesper Louis Andersen
<[hidden email]> wrote:
> * Measure the maximal latency over the equery call you've made for a 10-15 second period. Plot it.

We wrap our database calls (and pool checkouts) in
folsom_metrics:histogram_timed_update, which reports percentiles to
graphite. Since folsom is (afaict) abandoned, you could see if (e.g.)
exometer has something similar. We also report connections-in-use,
etc., which is helpful.
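
The pattern is roughly this (arities from memory, so double-check folsom's README; histogram_equery/3 is just an illustrative name):

init_metrics() ->
    %% create the histogram once, at application start
    ok = folsom_metrics:new_histogram(db_equery_time).

histogram_equery(PoolName, Stmt, Params) ->
    %% runs the fun, records its duration in the histogram and
    %% passes the fun's return value through
    folsom_metrics:histogram_timed_update(
        db_equery_time,
        fun() -> equery(PoolName, Stmt, Params) end).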

> * Pipeline the queries. A query which waits for an answer affects every sibling query as well.

Other pooling models are available. Pull, rather than push, might help
-- because it could pull N pending requests and pipeline them. Can't
find the relevant links right now, though.



Re: exception exit: timeout in gen_server:call

Tristan Sloughter
For metrics and tracing, the pgo (https://github.com/erleans/pgo) library is a Postgres client with a built-in pool. It is instrumented with telemetry (https://github.com/beam-telemetry/telemetry) for metrics and can optionally create spans with OpenCensus (https://github.com/census-instrumentation/opencensus-erlang).

Hm, it looks like I haven't yet added `queue_time` to the metrics; I'll try to remember to do that soon when I replace OpenCensus with OpenTelemetry (https://github.com/open-telemetry/opentelemetry-erlang, https://opentelemetry.io/). At the same time I'll expand the available metrics to include information about the pool.

So only the query time is sent in the telemetry event at this time.
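
Attaching a handler to a telemetry event looks roughly like this (the event name and the shape of the measurements below are placeholders; check the pgo source for the names it actually emits):

ok = telemetry:attach(
       <<"log-pgo-query-time">>,
       [pgo, query],
       fun(_EventName, Measurements, Metadata, _Config) ->
           error_logger:info_msg("pgo query: ~p ~p~n", [Measurements, Metadata])
       end,
       undefined).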

Tristan
