I have just finished writing the small application required for the Erlang
certification, and it was interesting because it made me work with some things
that I had never done before...
And some questions came up. Maybe you can shed some light:
As far as I can tell, there isn't a way to automatically connect nodes. In
order to (for example) access a global server and allow for its node to
crash and come back online transparently, one has to know that node's name
and its host name. I don't really like the idea of hardcoding the node/host
names... Or one has to have a central registrar node which all new nodes
must connect to, but that is just moving away the problem, because that
central node might crash too.
Is it as I think, that there is no way to go around this problem, or have I
missed something?
Another question is about io. If there is a process reading from the
standard input, is there any way to cancel that input request and let the
process keep running?
"Vlad Dumitrescu" <vladdu> wrote:
> As far as I can tell, there isn't a way to automatically connect
> nodes. In order to (for example) access a global server and allow
> for its node to crash and come back online transparently, one has
> to know that node's name and its host name.
There are two sides of this problem.
First, there is the initial connection. For this, you'll have to provide
the node name to the system in some way. Once you have done this,
global can be used to automatically set up a fully connected net.
There is no auto-discovery mechanism in the standard distribution, but
it is quite simple to write your own, either using broadcast or
multicast. [We're using a broadcast mechanism for some nodes in our
systems.] You might have to think about security issues though. It
all depends on the application.
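A minimal sketch of what such a broadcast-based discovery could look like,
using gen_udp and net_adm:ping. This is only an illustration, not the
mechanism used in the system mentioned above: the module name, the port
number and the "ERLNODE " message prefix are all assumptions made up for
this example. It also assumes all nodes share the same cookie and sit on
the same subnet.

```erlang
-module(node_discovery).
-export([announce/0, listen/0, encode/1, decode/1]).

%% Hypothetical UDP port, chosen for this sketch only.
-define(PORT, 45892).

%% Build the announcement payload for a node name.
encode(Node) when is_atom(Node) ->
    <<"ERLNODE ", (atom_to_binary(Node, utf8))/binary>>.

%% Parse a payload; anything without the expected prefix is ignored.
%% Note: turning untrusted network data into atoms can exhaust the
%% atom table, which is one of the security issues to think about.
decode(<<"ERLNODE ", Name/binary>>) -> {ok, binary_to_atom(Name, utf8)};
decode(_) -> ignore.

%% Broadcast this node's name once on the local subnet.
announce() ->
    {ok, Sock} = gen_udp:open(0, [binary, {broadcast, true}]),
    ok = gen_udp:send(Sock, {255,255,255,255}, ?PORT, encode(node())),
    gen_udp:close(Sock).

%% Wait for announcements and try to connect to every node heard from.
listen() ->
    {ok, Sock} = gen_udp:open(?PORT, [binary, {active, false},
                                      {reuseaddr, true}]),
    loop(Sock).

loop(Sock) ->
    {ok, {_Ip, _Port, Packet}} = gen_udp:recv(Sock, 0),
    case decode(Packet) of
        {ok, Node} when Node =/= node() -> net_adm:ping(Node);
        _ -> ok
    end,
    loop(Sock).
```

One process per node would run node_discovery:listen(), and a newly
started node would call node_discovery:announce(). Once the first
connection exists, global takes care of connecting the rest.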
Second, once all nodes have contact, you'd like to make sure that all
nodes keep their connections. If one node crashes and restarts, you're
back to initial start, which can be handled. A worse situation is if
you have at least three nodes, and because of network/host loads, one
of the TCP connections times out. In this case, you don't have a
fully connected net anymore, and global stops working(*). Probably
Mnesia as well. This is a big defect in global(**). [In our system,
each node runs a 'pinger' process, which starts to periodically ping
each node as it goes down, until it either comes back up, or is
removed from the system. Once it's up again, you might end up with
a partitioned network which regained its contact, which is another
difficult problem to solve. We solve it by restarting one of the
partitions, and some db magic :) ]
[*] Unfortunately, it doesn't even detect this situation, so the
result might be that the name registry becomes inconsistent, or that
global:sync() hangs (which means that the global handshake procedure
hangs or failed), or it crashes (which is the best of the three).
[**] Since I designed one incarnation of global, you can blame me ;)
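The 'pinger' idea described above could be sketched roughly as follows.
This is not the actual code from that system, only a guess at the general
shape: the module name, the 5 second interval and the {remove, Node}
message are assumptions for the example. It relies on
net_kernel:monitor_nodes/1 for nodeup/nodedown messages and on
net_adm:ping/1 to retry nodes that are down.

```erlang
-module(pinger).
-export([start/1, stop/1]).

%% Hypothetical retry interval, in milliseconds.
-define(INTERVAL, 5000).

%% Start watching the given list of nodes; returns the pinger pid.
start(Nodes) ->
    spawn(fun() ->
        net_kernel:monitor_nodes(true),
        %% Treat every node we are not connected to as down from the start.
        Down = [N || N <- Nodes, not lists:member(N, nodes())],
        loop(Nodes, Down)
    end).

stop(Pid) -> Pid ! stop, ok.

loop(Known, Down) ->
    receive
        stop -> ok;
        {nodedown, N} ->
            %% Only start pinging nodes that belong to the system.
            case lists:member(N, Known) of
                true  -> loop(Known, lists:usort([N | Down]));
                false -> loop(Known, Down)
            end;
        {nodeup, N} ->
            loop(Known, lists:delete(N, Down));
        {remove, N} ->
            %% The node was removed from the system: stop pinging it.
            loop(lists:delete(N, Known), lists:delete(N, Down))
    after ?INTERVAL ->
        %% No message for a while: retry every node still marked as down.
        StillDown = [N || N <- Down, net_adm:ping(N) =/= pong],
        loop(Known, StillDown)
    end.
```

Note that the timeout restarts whenever a message arrives, so this is a
"ping when idle" loop rather than a strict periodic timer. It also does
nothing about the partitioned-network case, which needs a separate policy.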
> I don't really like the idea of hardcoding the node/host names...
You should never have to do that of course. [In our system, each node
is added by an operator (he doesn't know he's adding an Erlang node of
course), which provides the IP address of another box in the system.
We contact the node on that box, and store the new node name in the
configuration files in the rest of the system. Auto-discovery could
be used instead, and we probably will do that for some special systems
in the future.]
>A worse situation is if
>you have at least three nodes, and because of network/host loads, one
>of the TCP connections times out. In this case, you don't have a
>fully connected net anymore, and global stops working(*). Probably
>Mnesia as well. This is a big defect in global(**). [In our system,
>each node runs a 'pinger' process, which starts to periodically ping
>each node as it goes down, until it either comes back up, or is
>removed from the system. Once it's up again, you might end up with
>a partitioned network which regained its contact, which is another
>difficult problem to solve. We solve it by restarting one of the
>partitions, and some db magic :) ]
In our system, the AXD 301, we do something similar, but also
enable the flag '-kernel dist_auto_connect once', in order to
handle partitioned networks in a controlled manner. This flag
makes sure that two nodes can't reconnect, once separated, without
at least one of the nodes restarting. In addition to this, we have
a "backdoor ping" (UDP-based) to detect communication failures:
if we get a ping from a known node that's not in the node list,
we have a partitioned network.
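The decision rule for the backdoor ping can be sketched as a small
function. This is not the AXD 301 code; the module and function names are
made up for the example, and the UDP transport itself is left out (it
could be a gen_udp socket that carries the sender's node name). The node
is assumed to be started with 'erl -kernel dist_auto_connect once', so
that a lost connection is not silently re-established.

```erlang
-module(partition_check).
-export([classify/3, on_backdoor_ping/2]).

%% classify(Sender, Known, Connected) -> ok | partitioned | unknown
%% Sender:    node name carried in the backdoor ping
%% Known:     nodes that are supposed to be part of the system
%% Connected: nodes we currently have a distribution connection to
classify(Sender, Known, Connected) ->
    case {lists:member(Sender, Known), lists:member(Sender, Connected)} of
        {false, _}    -> unknown;      % not one of our nodes, ignore it
        {true, true}  -> ok;           % normal case, still connected
        {true, false} -> partitioned   % alive, but not in the node list
    end.

%% Apply the rule against the live node list of this node.
on_backdoor_ping(Sender, Known) ->
    classify(Sender, Known, [node() | nodes()]).
```

What to do on 'partitioned' is a policy question, for instance restarting
one of the partitions as described earlier in the thread.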
One way to handle the auto-connect problem could be to let mnesia
connect. If your system is set up so that you have a few mnesia
nodes that handle the persistent database, and other nodes that
just run diskless mnesia clients, you can start the diskless
clients with '-mnesia extra_db_nodes <persistent nodes>'. Then,
the diskless clients will attempt to find at least one of the
persistent nodes in order to retrieve the mnesia schema.
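As an illustration, the same setting can be given in a config file instead
of on the command line. The node names below are hypothetical, and
schema_location ram is an assumption that fits a diskless client; adjust
both to the actual system.

```erlang
%% client.config: mnesia settings for a diskless client node.
%% 'db1@host1' and 'db2@host2' are hypothetical persistent nodes.
[
 {mnesia, [
   {extra_db_nodes, ['db1@host1', 'db2@host2']},
   {schema_location, ram}
 ]}
].
```

The client would then be started with something like
'erl -sname client1 -config client' and call mnesia:start(). The node
list can also be supplied at runtime with
mnesia:change_config(extra_db_nodes, Nodes), which returns the nodes it
managed to connect to.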
Ulf Wiger tfn: +46 8 719 81 95
Senior System Architect mob: +46 70 519 81 95
Strategic Product & System Management ATM Multiservice Networks
Data Backbone & Optical Services Division Ericsson Telecom AB