** exception exit: noconnection

Adam Lindberg-7
Hi!

I’m running some tests using distributed Erlang. I set up a cluster of Erlang nodes doing Distributed Systems™ stuff, and a hidden node that has a connection to each of the nodes in that cluster. The hidden node orchestrates the test by starting all Erlang nodes as ports. It then starts a process (gen_server) on each node that manipulates stuff on that node. It also loads some mock modules, among other things. The hidden node also has some managing gen_servers running locally, which some of the mocks make RPC calls to from the cluster nodes (to simulate and orchestrate mocked hardware components).

Now I wanted to test how my system behaves when killing some random nodes, chaos monkey style. So I picked the easiest option of using rpc:cast(RandomClusterNode, erlang, halt, [137]). However, now my test dies with the following obscure error: ** exception exit: noconnection. This even happens when first spawning a fun that then calls erlang:halt(137) (so as to avoid the RPC connection somehow breaking).
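A minimal sketch of the chaos-monkey step described above, for readers who want to reproduce it. The module name chaos_sketch and both function names are hypothetical; only rpc:cast/4 and erlang:halt/1 come from the post.

```erlang
-module(chaos_sketch).
-export([pick_victim/1, halt_random_node/1]).

%% Pick one node at random from a non-empty list of node names.
pick_victim(Nodes) when Nodes =/= [] ->
    lists:nth(rand:uniform(length(Nodes)), Nodes).

%% Ask a randomly chosen node to halt with exit status 137.
%% rpc:cast/4 does not wait for a reply, so it returns true even
%% though the target node goes away.
halt_random_node(Nodes) ->
    Victim = pick_victim(Nodes),
    true = rpc:cast(Victim, erlang, halt, [137]),
    Victim.
```

Any process on the calling node that is linked to a process on the halted node will then be sent an exit signal with reason noconnection.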

After searching a bit on the Internet it seems to be some internal uncatchable (!) error generated by Erlang [1][2], but it is not at all clear when it happens or how to avoid it. After some debugging in the gen_servers running on the hidden node, I can see the error by setting process_flag(trap_exit, true) and printing it in terminate/2, but I still can’t catch it. I can’t even catch it in the shell by enclosing my run in a try-catch block! It’s barely mentioned in the official documentation [3]. Most likely I’m setting up my test nodes and the application/test code in a way that generates this error, but I have no idea what exactly leads to it.

I guess I have two problems:

1. What is the error, and how can I handle / avoid it?
2. Why is it not documented?

Cheers,
Adam


[1]: http://erlang.org/pipermail/erlang-questions/2012-April/066219.html
[2]: http://erlang.org/pipermail/erlang-questions/2013-April/073246.html
[3]: http://erlang.org/doc/getting_started/robustness.html


Re: ** exception exit: noconnection

Lukas Larsson-8


On Tue, Dec 3, 2019 at 11:56 AM Adam Lindberg <[hidden email]> wrote:
> 1. What is the error, and how can I handle / avoid it?

I'm not sure, but could it be that your process is linked to a process on the remote side, and that what you are getting is a broken link error?
 
Re: ** exception exit: noconnection

Adam Lindberg-7
I have indeed linked processes. I realized that this is perhaps why the exception is “uncatchable” in the shell: the shell process dies because it is linked to my test processes, while the function running the test hasn’t encountered an error yet.

Do links in Erlang always crash with {'EXIT', Pid, noconnection} when a node dies?

Cheers,
Adam

Re: ** exception exit: noconnection

Lukas Larsson-8
On Wed, Dec 4, 2019 at 9:54 AM Adam Lindberg <[hidden email]> wrote:
> Do links in Erlang always crash with {'EXIT', Pid, noconnection} when a node dies?

Yes, it should be. It is also the reason given in monitor messages.
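
As a sketch of the monitor side of this: a 'DOWN' message carries the same kind of reason, so a monitoring process can tell a lost node connection (noconnection) from a process that never existed (noproc). The module name down_sketch and its two helpers are hypothetical; erlang:monitor/2 and erlang:demonitor/2 are the standard BIFs.

```erlang
-module(down_sketch).
-export([watch/1, await_down/2]).

%% Target may be a pid or a {RegisteredName, Node} tuple.
watch(Target) ->
    erlang:monitor(process, Target).

%% Wait for the 'DOWN' message that belongs to Ref and return its
%% reason, or the atom timeout if nothing arrives in time.
await_down(Ref, Timeout) ->
    receive
        {'DOWN', Ref, process, _Object, Reason} -> Reason
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        timeout
    end.
```

Unlike a link, a monitor never takes the monitoring process down; it only delivers a message, so the caller decides what to do with a noconnection reason.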

Re: ** exception exit: noconnection

Adam Lindberg-7
Ah, thanks! That’s good to know. :-)

Maybe I missed it but I can’t find this documented anywhere (in e.g. erlang:link/1 or erlang:monitor/2). The only place I can find it referenced is in an example in the Getting Started User’s Guide: http://erlang.org/doc/getting_started/robustness.html

Perhaps it should be documented more prominently?

The next question I need to clarify is: can gen_server processes never receive exit messages as normal info messages? If I enable trap_exit, I eventually only receive a call to terminate with the noconnection error...

Cheers,
Adam

Re: ** exception exit: noconnection

Lukas Larsson-8


On Wed, Dec 4, 2019 at 10:59 AM Adam Lindberg <[hidden email]> wrote:
> Ah, thanks! That’s good to know. :-)
>
> Maybe I missed it but I can’t find this documented anywhere (in e.g. erlang:link/1 or erlang:monitor/2). The only place I can find it referenced is in an example in the Getting Started User’s Guide: http://erlang.org/doc/getting_started/robustness.html

It is mentioned under the Info section in the erlang:monitor/2 documentation.

> Perhaps it should be documented more prominently?

Yes it should, just as noproc is.

> The next question I need to clarify is: can gen_server processes never receive exit messages as normal info messages? If I enable trap_exit, I eventually only receive a call to terminate with the noconnection error...

I'm not sure I understand what you mean.

When trapping exits, a gen_server will get the exit in either the terminate or the handle_info callback. Which one depends on which process sends the exit signal. If it is the "parent" process, i.e. the process that started the gen_server, then the terminate callback will be called. If it is some other process, it is the handle_info callback that is called. The assumption here is that if the parent exits for any reason, you want to terminate your gen_server, but if a child or peer exits, then you want to handle that and possibly continue running.
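
A minimal sketch of the handle_info case described above, assuming a gen_server that traps exits and is linked to a peer that is not its parent. The module name exit_trap_sketch and the link_to/2 and seen/1 helpers are hypothetical and exist only for illustration.

```erlang
-module(exit_trap_sketch).
-behaviour(gen_server).
-export([start_link/0, link_to/2, seen/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

start_link() -> gen_server:start_link(?MODULE, [], []).

%% Link the server to Pid (for example a process on another node).
link_to(Server, Pid) -> gen_server:call(Server, {link_to, Pid}).

%% Return the exit reasons trapped so far, oldest first.
seen(Server) -> gen_server:call(Server, seen).

init([]) ->
    process_flag(trap_exit, true),
    {ok, []}.

handle_call({link_to, Pid}, _From, Seen) ->
    true = link(Pid),
    {reply, ok, Seen};
handle_call(seen, _From, Seen) ->
    {reply, lists:reverse(Seen), Seen}.

handle_cast(_Msg, Seen) -> {noreply, Seen}.

%% Exit signals from linked processes other than the parent arrive
%% here as ordinary messages, including reason noconnection.
handle_info({'EXIT', _Pid, Reason}, Seen) ->
    {noreply, [Reason | Seen]};
handle_info(_Other, Seen) ->
    {noreply, Seen}.

%% Reached when the parent exits (among other shutdown paths).
terminate(_Reason, _Seen) -> ok.
```

If the same 'EXIT' came from the process that called start_link/0, gen_server would call terminate/2 instead and stop, which matches what Adam observed.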
 

Re: ** exception exit: noconnection

Adam Lindberg-7
On 4. Dec 2019, at 11:31, Lukas Larsson <[hidden email]> wrote:

>
>
>
> On Wed, Dec 4, 2019 at 10:59 AM Adam Lindberg <[hidden email]> wrote:
> Ah, thanks! That’s good to know. :-)
>
> Maybe I missed it but I can’t find this documented anywhere (in e.g. erlang:link/1 or erlang:monitor/2). The only place I can find it referenced is in an example in the Getting Started User’s Guide: http://erlang.org/doc/getting_started/robustness.html
>
>
> It is mentioned under the Info section in the erlang:monitor/2 documentation.

Interesting, didn’t show up very early in my search results. Thanks for the pointers.

>  
> Perhaps it should be documented more prominently?
>
> Yes it should, just as noproc is.

That would be great. I’ll prepare a PR.

>  
> The next question I need to clarify is: can gen_server processes never receive exit messages as normal info messages? If I enable trap_exit, I eventually only receive a call to terminate with the noconnection error...
>
> I'm not sure I understand what you mean.
>
> When trapping exits, a gen_server will get the exit in either the terminate or the handle_info callback. Which one depends on which process sends the exit signal. If it is the "parent" process, i.e. the process that started the gen_server, then the terminate callback will be called. If it is some other process, it is the handle_info callback that is called. The assumption here is that if the parent exits for any reason, you want to terminate your gen_server, but if a child or peer exits, then you want to handle that and possibly continue running.

Yeah, that it is coming from the parent is most likely my case. I think I painted myself into a very obscure corner here. I start some linked, unsupervised gen_server processes from a shell function, then run the tests with the help of those. Once the test process on the system under test dies with 'noconnection', the exit signal arrives at the shell process, which is the parent of the test processes.

One thing that I think could be improved is the error printout in the shell:

    (test@host)1> my_test:start().
    Running...
    ** exception exit: noconnection
    (test@host)2>

Contrast with:

    (test@host)3> exit(foo).
    ** exception exit: foo

In the first case, it is actually not the function that raises the exception, but the shell process that receives an exit signal. It would be nice if there was a visual difference here. The intuitive thing to do is to run "catch my_test:start()", which obviously does nothing since it is not the function that crashes; it is a linked process started by the function that sends an exit signal to the running shell process. Perhaps something along the lines of:

    (test@host)1> my_test:start().
    Running...
    ** shell process received exit signal: noconnection
    (test@host)2>
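
Until the shell prints something like that, one way to sidestep the problem is to run the test in a throwaway process and monitor it, so that exit signals from linked processes hit the throwaway process rather than the shell. This is only a sketch; the module name link_signal_sketch and run_isolated/1 are hypothetical.

```erlang
-module(link_signal_sketch).
-export([run_isolated/1]).

%% Run Fun in a separate, monitored (not linked) process. Anything Fun
%% links to can only take down that worker, never the caller.
run_isolated(Fun) ->
    Caller = self(),
    Tag = make_ref(),
    {Pid, Ref} = spawn_monitor(fun() -> Caller ! {Tag, Fun()} end),
    receive
        {Tag, Result} ->
            %% The result is sent before the worker exits, so it is
            %% seen before the 'DOWN' message; drop the monitor.
            erlang:demonitor(Ref, [flush]),
            {ok, Result};
        {'DOWN', Ref, process, Pid, Reason} ->
            {died, Reason}
    end.
```

With this, link_signal_sketch:run_isolated(fun my_test:start/0) would return {died, noconnection} as an ordinary value instead of restarting the shell process, whereas a try-catch around my_test:start() cannot help because no exception is raised in the calling code.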

Cheers,
Adam

Re: ** exception exit: noconnection

Roger Lipscombe-2
On Wed, 4 Dec 2019 at 11:17, Adam Lindberg <[hidden email]> wrote:
> Perhaps something along the lines of:
>
>     (test@host)1> my_test:start().
>     Running...
>     ** shell process received exit signal: noconnection
>     (test@host)2>

Except: the shell process isn't receiving an exit signal. It's being
killed. You can see that if you examine self() before and after that
message -- the REPL restarts the shell process.

To make this more obvious, I have a custom prompt which displays the
shell's pid: https://github.com/rlipscombe/rl_erl_prompt [1].

But even with that said, the message could be improved, certainly.

[1]: I note in passing that there's actually attempted support for
colour in there. I could never get it working.

Re: ** exception exit: noconnection

Adam Lindberg-7
It does receive an exit signal, and then dies because of it, no?

To further split hairs: it’s not killed by anyone specifically (i.e. exit(ShellProcess, kill)), it dies just like any other Erlang process because it receives an exit signal from a linked process. I’m pretty sure there is _some_ process _somewhere_ that also catches the error and makes sure the printout is done. I don’t know if it is the current shell process or some higher-level manager process though (as I didn’t read the source code).

And to go even further down the rabbit hole: there is no such thing as “killing” an Erlang process. You can only send exit signals. It’s just that there is a special exit signal (‘kill’) that is uncatchable and where the VM makes sure the process dies.
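
A small sketch of that difference, assuming nothing beyond the standard BIFs: a process that traps exits turns an ordinary exit signal into a message, but a signal with reason kill still terminates it, and observers then see the reason killed. The module name exit_signal_sketch and probe/1 are hypothetical.

```erlang
-module(exit_signal_sketch).
-export([probe/1]).

%% Start a process that traps exits, send it an exit signal with the
%% given reason, and report what happened to it.
probe(Reason) ->
    Parent = self(),
    {Pid, Ref} = spawn_monitor(fun() ->
        process_flag(trap_exit, true),
        Parent ! {ready, self()},
        receive
            {'EXIT', _From, Trapped} -> Parent ! {trapped, self(), Trapped}
        end
    end),
    %% Wait until trap_exit is set before sending the signal.
    receive {ready, Pid} -> ok end,
    exit(Pid, Reason),
    receive
        {trapped, Pid, Got} ->
            erlang:demonitor(Ref, [flush]),
            {trapped, Got};
        {'DOWN', Ref, process, Pid, DownReason} ->
            {died, DownReason}
    end.
```

Here probe(noconnection) yields {trapped, noconnection}, while probe(kill) yields {died, killed}.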

Cheers,
Adam

Re: ** exception exit: noconnection

Roger Lipscombe-2
On Wed, 4 Dec 2019 at 15:06, Adam Lindberg <[hidden email]> wrote:
> It does receive an exit signal, and then dies because of it, no?

Sorry. Lack of precision: it's not *trapping* exit signals by default,
so it gets default-killed. The process that owns the shell process
traps *that*, but can't tell the difference, so emitting a different
message for the two cases might not be that simple.
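
To illustrate why the owner cannot tell the two cases apart, here is a simplified sketch (it does not reproduce the real shell internals): a trapping owner receives the same {'EXIT', Pid, Reason} message whether the child called exit/1 itself or was taken down by an incoming exit signal. The module name exit_origin_sketch and its functions are hypothetical.

```erlang
-module(exit_origin_sketch).
-export([observe/1]).

%% Return the 'EXIT' reason a trapping owner sees for a linked child
%% that ends in the given way (raised | signalled).
observe(Mode) ->
    Old = process_flag(trap_exit, true),
    Child = start_child(Mode),
    Reason = receive {'EXIT', Child, R} -> R end,
    process_flag(trap_exit, Old),
    Reason.

%% The child calls exit/1 itself, like exit(foo) typed in the shell.
start_child(raised) ->
    spawn_link(fun() -> exit(noconnection) end);
%% The child idles and is terminated by an exit signal from elsewhere,
%% like a shell process linked to a process on a node that went down.
start_child(signalled) ->
    Pid = spawn_link(fun() -> receive after infinity -> ok end end),
    exit(Pid, noconnection),
    Pid.
```

Both observe(raised) and observe(signalled) return noconnection, so an owner that wants to print different messages would need extra information from the child.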

Re: ** exception exit: noconnection

Adam Lindberg-7
Got it.

I mostly got tripped up by not being able to wrap the code in a try-catch and catch the error. Therefore I think it would make sense to print something else when the shell actually receives an exit signal, so users can understand that this case is special somehow.

Cheers,
Adam
