Erlang VM hanging on node death

Steve Cohen
Hi all,

We have 12 nodes in our guilds cluster, each running 500,000 processes. We have another cluster, called sessions, with 15 nodes and roughly four million processes on it. Both clusters are part of the same Erlang distribution, since our guilds monitor sessions and vice versa.

Now, when one of our guild servers dies, it generates, as expected, a large number of DOWN messages to the sessions cluster. These messages bog down the sessions servers (obviously) while they process them, but when they're done processing, distribution appears to be completely broken.

By broken, I mean that the nodes are disconnected from one another and not exchanging messages, CPU usage was at zero, and we couldn't even launch a remote console.

I can't imagine this is expected behavior, and was wondering if someone could shed some light on it.
We're open to the idea that we're doing something very, very wrong.


Thanks in advance for the help

--
Steve Cohen

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Erlang VM hanging on node death

Juan Jose Comellas
How long does it take for all the DOWN messages to be sent/processed?

These messages might be preventing the net tick messages (see net_ticktime in http://erlang.org/doc/man/kernel_app.html) from being answered in time. If that happens, a node that fails to respond before net_ticktime expires will be assumed to be disconnected.
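For reference, net_ticktime is a kernel application parameter. A minimal sys.config fragment raising it might look like this (the value below is illustrative; the default is 60 seconds, and all nodes in the cluster should use the same value):

```erlang
%% sys.config fragment (illustrative value): each node must answer a
%% tick within net_ticktime seconds, or its peers will consider the
%% connection lost. The default is 60.
[{kernel, [{net_ticktime, 120}]}].
```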

What happens if, after processing all the DOWN messages, you issue a call to net_kernel:connect_node/1 for each of the nodes that seem to be down?
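As a sketch of that suggestion (the node names here are hypothetical), something along these lines could be pasted into a shell on one of the surviving nodes:

```erlang
%% Attempt to re-establish distribution to every expected node that is
%% currently missing from nodes(). net_kernel:connect_node/1 returns
%% true, false, or ignored for each attempt.
Reconnect = fun(ExpectedNodes) ->
    [{N, net_kernel:connect_node(N)} || N <- ExpectedNodes -- nodes()]
end,
Reconnect(['sessions1@host1', 'sessions2@host2']).
```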


Re: Erlang VM hanging on node death

Steve Cohen
Juan, 
Here's the sequence of events:
1. One of our machines was inadvertently shut off, killing all of the processes on it
2. We immediately saw a drop in CPU across the board on the sessions cluster. CPU on the sessions cluster eventually went to zero.
3. We were completely unable to use remote console on any of the machines in the cluster, and they all needed to be restarted.

So, to answer your question: we don't know how long it took for the DOWN messages to be processed, since we didn't have visibility at the time. We suspected a problem with net_ticktime, but what's confusing to us is that the host that went down went down hard, so the DOWN events should have been generated on the other nodes, not sent across distribution (correct me if I'm wrong here). Also, my intuition is that processing DOWN messages would cause CPU usage on the cluster to go up, but we saw the exact opposite.

Since we couldn't connect to the machines via remote console, we couldn't call connect_node. It was my understanding that the connect call would happen when the node in question reestablished itself. 


--
-Steve


Re: Erlang VM hanging on node death

Lukas Larsson
Hello Steve,

On Mon, Jul 10, 2017 at 4:14 PM, Steve Cohen <[hidden email]> wrote:
> Now, when one of our guild servers dies, as expected it generates a large number of DOWN messages to the sessions cluster. These messages bog down the sessions servers (obviously) while they process them, but when they're done processing, distribution appears to be completely broken.

On Thu, Jul 13, 2017 at 1:10 AM, Steve Cohen <[hidden email]> wrote:
> Here's the sequence of events:
> 1. One of our machines was inadvertently shut off, killing all of the processes on it
> 2. We immediately saw a drop in CPU across the board on the sessions cluster. CPU on the sessions cluster eventually went to zero.
> 3. We were completely unable to use remote console on any of the machines in the cluster, and they all needed to be restarted.

The two scenarios you describe seem to contradict each other: first you say the sessions servers were bogged down, and then that CPU on the sessions cluster went to almost zero. What am I missing?

Did you gather any post mortem dumps from these machines, i.e. an erl_crash.dump or a core dump?

Also, you haven't mentioned which version of Erlang/OTP you are using.
 
> So, to answer your question, we don't know how long it took for down messages to be processed, since we didn't have visibility at the time. We suspected a problem with the net_ticktime, but what's confusing to us is that the host that went down went down hard, so the DOWN events should have been created on the other nodes, not sent across distribution (correct me if I'm wrong here).

When a TCP connection used for Erlang distribution is terminated, all the DOWN messages are (as you say) generated locally.
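To illustrate (the registered name and node are hypothetical): a monitor on a process across a lost connection fires locally with reason noconnection, without any message crossing the wire:

```erlang
%% Monitor a process on a remote node. If the distribution link to that
%% node is lost, the local VM itself generates the DOWN message with
%% reason 'noconnection'.
Ref = erlang:monitor(process, {session_server, 'guilds1@host'}),
receive
    {'DOWN', Ref, process, _Pid, noconnection} ->
        remote_node_lost
end.
```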
 
> Also, my intuition is that processing DOWN messages would cause CPU usage on the cluster to go up, but we saw the exact opposite.


With the power-off of the machine, are you sure that the TCP layer caught the shutdown? If it didn't, then the next fail-safe is net_ticktime.
 
> Since we couldn't connect to the machines via remote console, we couldn't call connect_node. It was my understanding that the connect call would happen when the node in question reestablished itself.

Yes, it should reconnect when needed. It is quite strange that you couldn't connect via remote shell. A crash dump or core dump would really help in understanding what is going on.
 



Re: Erlang VM hanging on node death

Steve Cohen
Lukas,
The second situation is more representative of what happened: CPU quickly trended toward zero, and the VMs were unresponsive. That state was stable, and it didn't generate an erl_crash.dump or a core dump. Next time this happens, we'll try to trigger one.

Since we couldn't get into the VMs, all we have to go on is telemetry, which isn't as accurate as being in the remote console. If it helps, I'd be glad to share our telemetry data. The entire cluster immediately experienced a drop in CPU. It was quite strange.

Agreed about the remote shell, I guess without a dump, we're stuck.


--
-Steve


Re: Erlang VM hanging on node death

Juan Jose Comellas
Steve, is it possible that the processes were trying to contact other processes on the node that went down and were blocking on responses from them? If so, do you have timeouts on the calls you make to those processes? Also, are you referencing the remote processes on the node that went down by pid or by global name?

As Lukas said, it's difficult to know what happened without more information, but the answers to the questions above might shed some light on the cause(s) of the problem.
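To make the timeout point concrete (the server name and node below are hypothetical): gen_server:call/2 uses a default 5000 ms timeout, but a call made with a timeout of infinity will block forever if the remote side never answers. An explicit, finite timeout can be caught:

```erlang
%% A call to a process on another node, with an explicit timeout.
%% On timeout (or if the target process doesn't exist), the caller
%% exits; wrapping the call in try/catch turns that into a value.
try
    gen_server:call({session_registry, 'sessions1@host'}, lookup, 5000)
catch
    exit:{timeout, _} -> call_timed_out;
    exit:{noproc, _}  -> no_such_process
end.
```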


Re: Erlang VM hanging on node death

Jesper Louis Andersen
Very little to go by.

My recommendation would be to analyze bottom-up through the layers if possible. Telemetry of IO is useful, and the Erlang VM tracks this; telemetry of port and process counts as well. A sharp drop in CPU load suggests processes are blocked on something, waiting for it to happen. By working from the network up, you can often gain valuable information that rules out hypotheses along the way.

What does your logger say about the situation? Anything odd in those logs? Do you run with system_monitor enabled, and does it mention anything? Crash logs? Do you run exometer_core or something similar in the cluster? (You should: establish a baseline of typical operation so you know when things look weird. You really want measurements on the critical paths, at least sampled; otherwise you have no chance of scaling the load over time.) What does the kernel say about the TCP send queues on the distribution channels? Are the TCP windows there closed or open?
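One way to look at the distribution send queues from inside a node (a sketch; assumes the default TCP distribution, where the controlling entity for each connection is a port):

```erlang
%% List per-node socket statistics for the distribution connections.
%% send_pend is the number of bytes waiting in the driver's send queue;
%% a large, growing value suggests a peer that has stopped reading.
[begin
     {ok, Stats} = inet:getstat(Port, [send_pend, send_oct, recv_oct]),
     {Node, Stats}
 end
 || {Node, Port} <- erlang:system_info(dist_ctrl), is_port(Port)].
```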

The usual way is to form some kind of hypothesis, then devise a method to either confirm or reject it, then form a new one, and so on. Write down everything you find in a document shared among the investigators (Git or Google Docs, etc.). Track knowledge. One thing that is very important to look out for is assumptions: if you have an assumption, you should figure out a way to determine whether it is true. You are most often led astray by an incorrect assumption somewhere in the chain of events, and that has you hunting for a problem in a corner of the system where no problems occur.

A haphazard list of other things to check:

- Any nasty NIFs?
- DTrace on the boxes? (A godsend when things go wrong.)
- Consider establishing an Erlang shell early if it is a resource problem, so one is handy when things start going wrong.
- Can you provoke the error in a smaller test cluster, possibly by artificially constraining its network devices as well?
- If you can't create a shell, something rather central could be hosed. Try to figure out whether you ran out of resources, etc.
- A slow disk coupled with synchronous calls can easily block a machine.
- Excessive debug logging can too, but that would max out the CPU load.
- The problem might not even be in Erlang. Everything from faulty hardware, a faulty kernel, or a bad cloud provider to Elixir or LFE might be the culprit. Narrowing down the list of likely candidates is important.





Re: Erlang VM hanging on node death

Dániel Szoboszlay
Hi,

It's just a guess, but maybe the rex processes (the servers accepting rpc calls) get blocked for a long time, or get into some deadlock, following the node crash. This would explain why you can't open a new remote shell: that request, too, goes via rpc.

Try using spawn/4 to start a process on one of the inaccessible nodes, or make raw gen_server:call-s using {Name, Node}; these requests don't have to go through the rex server. Maybe you can debug the problem with these tools. Or, if these techniques don't work, you can be sure that the problem is somewhere deep within ERTS...

Cheers,
Daniel
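A sketch of both rex-bypassing probes Daniel mentions (the node and server names are hypothetical):

```erlang
%% 1. spawn/4 travels over raw distribution, not through the rex server:
%%    ask the remote node to send us a message and see if it arrives.
Self = self(),
spawn('sessions1@host', erlang, send, [Self, dist_alive]),
Probe = receive dist_alive -> ok after 5000 -> no_reply end,

%% 2. A direct gen_server:call with {Name, Node} also bypasses rex.
Status = try gen_server:call({session_manager, 'sessions1@host'}, status, 5000)
         catch exit:_ -> unreachable
         end,
{Probe, Status}.
```

If the spawn/4 probe answers but the remote shell still can't be opened, that points the finger at rex rather than at distribution itself.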

On Thu, 13 Jul 2017 at 21:43 Jesper Louis Andersen <[hidden email]> wrote:
Very little to go by.

My recommendation would be to analyze bottom up through the layers if possible. Telemetry of IO is useful. The Erlang VM tracks this. Telemetry of port and process count as well. A sharp drop in CPU load would suggest processes are blocked on something and waiting for stuff to happen. By working from the network and up, you can often gain valuable information which can be used to rule out hypothesis underway.

What does your logger say about the situation? Anything odd in those logs? Do you run with system_monitor enabled, and does it mention something? Crash logs? Do you run with exometer_core or something similar in the cluster? (You should! Establish a baseline of typical operation so you know when things look weird. You really want measurements on the critical paths, at least sampled; otherwise you have no chance of scaling the load over time.) What does the kernel say about TCP send queues on the distribution channels? Are the TCP windows there closed or open?
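For the distribution send queues specifically, one way to peek from inside the VM is a sketch like the following. It assumes the default TCP carrier for distribution, where each channel's controlling entity is an inet port:

```erlang
%% Sketch: inspect per-node distribution channels from inside the VM.
%% queue_size is the driver's pending output bytes; send_pend is data
%% queued in the inet driver waiting for the TCP socket to drain.
[{Node,
  erlang:port_info(Port, queue_size),
  inet:getstat(Port, [send_pend, send_cnt, recv_cnt])}
 || {Node, Port} <- erlang:system_info(dist_ctrl), is_port(Port)].
```

A large, growing queue_size/send_pend on a channel suggests the peer (or the network) is not draining, which fits a closed TCP window.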

The usual way is to form some kind of hypothesis, then devise a method to either confirm or reject it, then form a new one, and so on. Write down everything you find in a document which is shared among investigators (Git or Google Docs, etc.). Track knowledge. One thing which is very important to look out for is assumptions. If you have an assumption, you should figure out a way to determine whether it is true or not. You are most often led astray by an incorrect assumption somewhere in the chain of events, and that has you hunting for a problem in a corner of the system where no problems occur.

Other haphazard list of stuff:

- any nasty NIFs?
- dTrace on the boxes? (Godsend when things go wrong)
- Consider establishing an Erlang shell early if it is a resource problem. Then one is handy when things start going wrong.
- Can you provoke the error in a smaller test cluster? Possibly by artificially resource-constraining its network devices as well?
- If you can't create a shell, something rather central could be hosed. Try figuring out if you ran out of resources etc.
- A slow disk coupled with synchronous calls can easily block a machine
- Excessive debug logging too, but that would max out the CPU load
- The problem might not even be in Erlang. Everything from faulty hardware, faulty kernel, bad cloud provider, to Elixir or LFE might be the culprit. Narrowing down the list of likely candidates is important.




On Thu, Jul 13, 2017 at 7:58 PM Juan Jose Comellas <[hidden email]> wrote:
Steve, is it possible that the processes were trying to contact other processes on the node that went down and were blocking on responses from them? If so, do you have timeouts in the calls you make to those processes? Also, are you referencing the remote processes on the node that went down by pid or by global name?

As Lukas said, it's difficult to know what happened without more information, but the answers to the questions above might shed some light on the cause(s) of the problem.

On Thu, Jul 13, 2017 at 2:26 PM, Steve Cohen <[hidden email]> wrote:
Lukas,
The second situation is more representative of what happened: CPU quickly trended toward zero, and the VMs were unresponsive. The situation was stable, but didn't generate an erl_crash.dump or a core dump. Next time this happens, we'll try to trigger one.

Since we couldn't get into the VMs, all we have to go on is telemetry, which isn't as accurate as being in the remote console. If it helps, I'd be glad to share our telemetry data. The entire cluster immediately experienced a drop in CPU. It was quite strange.

Agreed about the remote shell, I guess without a dump, we're stuck.


On Thu, Jul 13, 2017 at 12:12 AM, Lukas Larsson <[hidden email]> wrote:
Hello Steve,

On Mon, Jul 10, 2017 at 4:14 PM, Steve Cohen <[hidden email]> wrote:
Now, when one of our guild servers dies, as expected it generates a large number of DOWN messages to the sessions cluster. These messages bog down the sessions servers (obviously) while they process them, but when they're done processing, distribution appears to be completely broken. 


On Thu, Jul 13, 2017 at 1:10 AM, Steve Cohen <[hidden email]> wrote: 
Here's the sequence of events:
1. One of our machines was inadvertently shut off, killing all of the processes on it
2. We immediately saw a drop in CPU across the board on the sessions cluster. CPU on the sessions cluster eventually went to zero.
3. We were completely unable to use remote console on any of the machines in the cluster, and they all needed to be restarted.

The two scenarios you are describing seem to contradict each other. First you talk about the sessions servers being bogged down, and then you say the CPU of the sessions cluster went to almost zero. What is it that I'm missing?

Did you gather any post mortem dumps from these machines, i.e. an erl_crash.dump or a core dump?

Also, you forgot to mention which version of Erlang/OTP you are using.
 
So, to answer your question, we don't know how long it took for down messages to be processed, since we didn't have visibility at the time.  We suspected a problem with the net_ticktime, but what's confusing to us is that the host that went down went down hard, so the DOWN events should have been created on the other nodes, not sent across distribution (correct me if I'm wrong here).

When a TCP connection used for the erlang distribution is terminated, all the down messages are (as you say) generated locally. 
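A small illustration of this point (the registered name and node below are hypothetical): when the connection is lost, the 'DOWN' message carries reason noconnection and is produced by the local VM; the dead node sends nothing.

```erlang
%% Sketch: monitors across a lost distribution channel fire locally
%% with reason noconnection (names are hypothetical).
Ref = monitor(process, {some_registered_proc, 'guilds3@host-b'}),
receive
    {'DOWN', Ref, process, _Pid, noconnection} ->
        %% generated by the local VM when the channel went down
        connection_lost
after 0 ->
    still_up
end.
```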
 
Also, my intuition is that processing DOWN messages would cause CPU usage on the cluster to go up, but we saw the exact opposite.  


With the poweroff of the machine, are you sure that the TCP layer caught the shutdown? If it didn't, then the next fail-safe is the net_ticktime.
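For reference, net_ticktime is plain kernel configuration; the default is 60 seconds, and an unresponsive node is detected within roughly net_ticktime plus or minus 25%. A sketch (the value 120 is just an example):

```erlang
%% sys.config sketch: raise the tick time if large DOWN storms are
%% starving the tick messages.
[{kernel, [{net_ticktime, 120}]}].

%% At runtime:
net_kernel:get_net_ticktime().      %% current value
net_kernel:set_net_ticktime(120).   %% transition to a new value
```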
 
Since we couldn't connect to the machines via remote console, we couldn't call connect_node. It was my understanding that the connect call would happen when the node in question reestablished itself. 

Yes, it should re-connect when needed. It is quite strange that you couldn't connect via remote shell. A crash dump or core dump would really help to understand what is going on.
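If the nodes become reachable again but the connections have not re-formed on their own, a sketch of forcing reconnection from a node that still has a working shell (node names are hypothetical):

```erlang
%% Sketch: force reconnection attempts and check the result.
Down = ['sessions1@host-a', 'sessions2@host-b'],
[{N, net_kernel:connect_node(N)} || N <- Down],
%% connect_node/1 returns true | false | ignored
nodes().   %% successfully reconnected nodes should now be listed
```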
 

