Automatically reconnecting nodes when they come back online

Scott Lystig Fritchie-2
To all who know more about this than I do:

First, I'm just beginning to learn about Erlang/OTP, so I figured I'd
use it to implement something useful.

Part of what I'd like to build will involve a "conductor" controller
node that directs some other "player" nodes to all do something at
approximately the same time - ultimately to actually test the
operation of another piece of distributed software.  As part of those
operations, I expect the player nodes may sometimes crash (actually
cause a Windows BSOD in some cases) and then eventually come back to
life.

What I'm wondering about is what some folks have found to be good ways
of getting nodes to rejoin the cluster when they come back to life.
The way I'm thinking about it now is that the player nodes will be
passive in the sense that they won't actively connect to any other
nodes - they'll only get connected when the conductor node invites
them in.  I'm also not looking for fault tolerance on the conductor
node at this point; if that one fails badly I'll just get some coffee
and rerun the scenario again.

My first two thoughts were:
1.  When the conductor node connects up the player nodes it would also
spawn a process whose sole job is to periodically ping the other nodes
to ensure they're connected.  Then when one goes down, those pings
will just fail during that time but when the node comes back a ping
will reconnect it to the other nodes.  All this time, I'd be
monitoring the node up/down messages.
2.  I'd start by monitoring all the nodes as the conductor connects
them and, when receiving a node down message, spawn a process whose job
it is to periodically ping only that node until it comes back.
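
For illustration, option 1 could be sketched roughly as below. This is only a minimal sketch: the module name `pinger`, the `stop` message, and the interval argument are assumptions made for the example, not something from an existing library. `net_adm:ping/1` returns `pong` when the node answers and `pang` otherwise.

```erlang
-module(pinger).
-export([start/2, ping_all/1]).

%% Spawn a process that pings every node in Nodes once per IntervalMs
%% milliseconds. Send it the atom 'stop' to make it exit.
%% (Module name, 'stop' message, and arguments are hypothetical.)
start(Nodes, IntervalMs) ->
    spawn(fun() -> loop(Nodes, IntervalMs) end).

loop(Nodes, IntervalMs) ->
    _ = ping_all(Nodes),
    receive
        stop -> ok
    after IntervalMs ->
        loop(Nodes, IntervalMs)
    end.

%% Ping each node once and return [{Node, pong | pang}].
%% A successful ping also (re)establishes the connection to that node.
ping_all(Nodes) ->
    [{Node, net_adm:ping(Node)} || Node <- Nodes].
```

On the conductor this might be started with something like `pinger:start(['player1@host1', 'player2@host2'], 5000)`, where the node names are placeholders.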

Are there some good practices out there for systems that want to
behave like this?

Thanks in advance,

/stt

Joseph Wayne Norton
I don't have a direct answer to your question.

However, are you aware of the slave module?

Some of the recipe(s) in this module might be of use to you.

https://github.com/norton/qc/blob/master/src/qc_slave.erl


Scott Lystig Fritchie-2
On Fri, Apr 26, 2013 at 1:13 PM, Joseph Wayne Norton
<norton> wrote:

> I don't have a direct answer to your question.
>
> However, are you aware of the slave module?
>
> Some of the recipe(s) in this module might be of use to you.
>
> https://github.com/norton/qc/blob/master/src/qc_slave.erl

I'm not aware of it yet but I'll take a look...

Thanks,
/stt

Scott Lystig Fritchie-2
It looks like the slave module won't quite work in my case, since I'll
likely be in a heterogeneous environment where the controller runs Linux
but, unfortunately, the machines under test run Windows.

I will keep that in mind, though, if I need that functionality now
that I know it exists. :)

/stt

Ignas Vyšniauskas
In reply to this post by Scott Lystig Fritchie-2
On 04/26/2013 07:00 PM, Scott Thoman wrote:
> My first two thoughts were: 1.  When the conductor node connects up
> the player nodes it would also spawn a process whose sole job is to
> periodically ping the other nodes to ensure they're connected.  Then
> when one goes down, those pings will just fail during that time but
> when the node comes back a ping will reconnect it to the other
> nodes. All this time, I'd be monitoring the node up/down messages. 2.
> I'd start by monitoring all the nodes as the conductor connects them
> and when receiving a node down message, spawn a process whose job it
> is to periodically ping only that node until it comes back.

* you don't need a pinging mechanism, just use the existing
`net_kernel:monitor_nodes(true)` and handle the events.
* if you can afford a fixed node name for at least the "conductor" node,
then you can do something along the lines you described yourself --
should be trivial.
* otherwise you can try to hack things using `net_adm:world()` or
something like that for "dynamic" node discovery
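
As a rough sketch of the first point: a process that calls `net_kernel:monitor_nodes(true)` receives `{nodeup, Node}` and `{nodedown, Node}` messages. The module name `node_watcher` and the `player_up` / `player_down` / `stop` messages below are made up for the example.

```erlang
-module(node_watcher).
-export([start/1]).

%% Start a watcher process that forwards node status changes to Owner.
%% (Module name and the forwarded message shapes are hypothetical.)
start(Owner) ->
    spawn(fun() ->
                  %% Subscribe this process to {nodeup, Node} and
                  %% {nodedown, Node} messages.
                  ok = net_kernel:monitor_nodes(true),
                  loop(Owner)
          end).

loop(Owner) ->
    receive
        {nodeup, Node} ->
            Owner ! {player_up, Node},
            loop(Owner);
        {nodedown, Node} ->
            Owner ! {player_down, Node},
            loop(Owner);
        stop ->
            ok
    end.
```

Note that these events only report connection changes. As far as I understand it, a `nodeup` arrives once a connection has actually been established, so if the players stay passive, something on the conductor side still has to attempt the connection (for example with `net_kernel:connect_node/1` or `net_adm:ping/1`).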

Also, take a look at the `{sync_nodes_optional, NodeList}` parameter of
`kernel`.
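
A possible `sys.config` fragment for that, as a sketch only: the node names and the timeout value are placeholders. To my knowledge the `sync_nodes_*` parameters are only consulted while the node starts, and no synchronization happens unless `sync_nodes_timeout` is also set, so this would not by itself cover reconnecting a player that crashes later.

```erlang
%% sys.config (node names and timeout are placeholders)
[{kernel,
  [%% Nodes to try to connect to at startup; startup continues
   %% even if some of them are not reachable.
   {sync_nodes_optional, ['player1@host1', 'player2@host2']},
   %% How long (in ms) to wait for the nodes above at startup.
   {sync_nodes_timeout, 30000}]}].
```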

--
Ignas

Dmitry Kolesnikov
Hello,

I am using the following approach for a similar issue.

- net_kernel:monitor_nodes(true) allows your process to receive nodeup/nodedown events.

- The player nodes require a list of 'seed' nodes in their config file. Each player should connect to those seed nodes at boot time. If it cannot connect to any of the seeds, the player node has to die with an alarm.
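
The seed-node step could look roughly like the sketch below. The module name `seed_connect`, the application name `player`, and the `seed_nodes` config key are assumptions for the example; raising an actual alarm is left out, and the node simply exits when no seed answers.

```erlang
-module(seed_connect).
-export([boot/0, connect_seeds/1, connect_seeds/2]).

%% Read the seed list from the application environment and exit if
%% none of the seeds can be reached.
%% (The 'player' application and 'seed_nodes' key are hypothetical.)
boot() ->
    Seeds = application:get_env(player, seed_nodes, []),
    case connect_seeds(Seeds) of
        {ok, _Connected} -> ok;
        {error, Reason} -> exit(Reason)
    end.

%% Try every seed with net_adm:ping/1.
connect_seeds(Seeds) ->
    connect_seeds(Seeds, fun net_adm:ping/1).

%% Same, but with the ping function passed in so the logic can be
%% exercised without a running cluster.
connect_seeds(Seeds, PingFun) ->
    case [S || S <- Seeds, PingFun(S) =:= pong] of
        [] -> {error, no_seed_reachable};
        Connected -> {ok, Connected}
    end.
```

The config file would then carry an entry such as `{player, [{seed_nodes, ['conductor@host0']}]}`, again with a placeholder node name.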

- Dmitry


On Apr 29, 2013, at 9:41 AM, Ignas Vyšniauskas <baliulia> wrote:

> * you don't need a pinging mechanism, just use the existing
> `net_kernel:monitor_nodes(true)` and handle the events.
> * if you can afford a fixed node name for at least the "conductor" node,
> then you can do something along the lines you described yourself --
> should be trivial.
> * otherwise you can try to hack things using `net_adm:world()` or
> something like that for "dynamic" node discovery
>
> Also, take a look at the `{sync_nodes_optional, NodeList}` parameter of
> `kernel`.
>
> --
> Ignas