mnesia recovery

Evans, Matthew
Hi,

This is a rather convoluted question.

We have a distributed system with disc copies/disc only copies of mnesia tables on nodes A and B.  Other nodes in the system C through M have RAM only copies of those tables.

Ordinarily, if node A fails and recovers shortly afterwards, we are fine, since mnesia is smart enough to re-sync data back to node A from node B.

We hit a situation yesterday where node A failed. Some time later the whole distributed system was restarted, but node B never recovered.

The logic is such that startup is effectively blocked since we know the "good" data is on node B.

How should we handle this in the field? If, for reasons beyond our control, node B cannot be recovered easily, is there a way to get the data from node B to node A (I am assuming we can still access the partition on node B)?

Would it be possible to:

1) Stop mnesia on all nodes
2) Copy the contents of the mnesia directory from node B to node A (minus the schema definitions)
3) Empty the mnesia directory on node B
4) Restart everything

In this case I am hoping that mnesia would see node A as good and node B as having no data and would copy data to the new node B.

Basically, this situation needs to be resolved in the field by engineers with little or no Erlang skills. Certainly escripts could be written to help.

Thanks

Matt

Re: mnesia recovery

Igor Ribeiro Sucupira
On Thu, Jul 15, 2010 at 10:41 AM, Evans, Matthew <[hidden email]> wrote:
> Hi,
>
> This is a rather convoluted question.
>
> We have a distributed system with disc copies/disc only copies of mnesia tables on nodes A and B.  Other nodes in the system C
> through M have RAM only copies of those tables.

I'm assuming all nodes have exactly the same tables and the same data
(including the schema). Is that the case? If it's not, could you
describe the pool in more detail?

> Ordinarily if node A fails and recovers shortly later we are fine since mnesia is smart enough to re-sync data back to node A from
> node B.
>
> We hit a situation yesterday where node A failed, some time later the whole distributed system was restarted but node B never
> recovered.

What does that mean? Is node B corrupted? Or is it just refusing to
start because the other nodes are down and B is not the most
up-to-date node? I don't see any other case for "never recovered" and
I'm assuming you have the former (corruption), since you said the
other nodes were restarted and that B has "good" data.

> The logic is such that startup is effectively blocked since we know the "good" data is on node B.
>
> How to handle this in the field? If, for reasons beyond our control node B can not be recovered easily, I am wondering is there a
> way to get the data from node B to node A (I am assuming we can access the partition on node B)?

Assuming B is the most up-to-date node and has some corrupted tables,
you can copy the working files of those tables from some other node to
node B (yeah... they may be outdated, but there's not much else to do in
this case) and then start node B. Everything should work fine.

If that's not your problem, maybe this function could help you, anyway:
http://erlang.org/doc/man/mnesia.html#force_load_table-1

I've used force_load_table/1 in situations where Mnesia was refusing
to load the table in some node because it believed its copy was not
current (but I knew it was).
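
A minimal sketch of that approach, assuming Mnesia is already started on
the node whose copy you trust. The module and function names
(force_load_sketch, load_or_force/2) are made up for illustration; only
mnesia:wait_for_tables/2 and mnesia:force_load_table/1 are real Mnesia
calls. Forcing a load can discard newer updates held by a node that is
still down, so it only makes sense when you are sure the local copy is
the one to keep.

```erlang
-module(force_load_sketch).
-export([load_or_force/2]).

%% Tabs:    tables this node is expected to load (caller-supplied).
%% Timeout: milliseconds to wait for a normal load before forcing.
load_or_force(Tabs, Timeout) ->
    case mnesia:wait_for_tables(Tabs, Timeout) of
        ok ->
            %% Everything loaded the normal way; nothing to force.
            ok;
        {timeout, Stuck} ->
            %% Mnesia is still waiting for another node for these tables.
            %% force_load_table/1 returns 'yes' on success.
            [{Tab, mnesia:force_load_table(Tab)} || Tab <- Stuck];
        {error, Reason} ->
            {error, Reason}
    end.
```

From the Erlang shell this could be called as
force_load_sketch:load_or_force([my_table], 10000), where my_table is a
placeholder table name.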

Good luck.
Igor.


________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:[hidden email]


Re: mnesia recovery

Dan Gudmundsson-2
Also, you can bootstrap a disc_node like you do with a ram_node.

Delete everything in the mnesia_dir (but not the directory itself),
start mnesia and call mnesia:change_config(extra_db_nodes, [AliveAndKicking]).

That should copy every table that the node is supposed to have a copy
of, provided the table is available somewhere in the system.
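
A rough sketch of that sequence, assuming the contents of the Mnesia
directory on the recovering node have already been deleted and that
AliveNode is a reachable node that is running Mnesia. The names
rejoin_sketch and rejoin/2 are hypothetical. Waiting on
mnesia:system_info(local_tables) is an assumption about which tables
you care about, not part of the advice above.

```erlang
-module(rejoin_sketch).
-export([rejoin/2]).

%% AliveNode: a node that is up and running Mnesia, e.g. 'a@host1'
%%            (placeholder name).
%% Timeout:   milliseconds to wait for the tables to be copied over.
rejoin(AliveNode, Timeout) ->
    ok = mnesia:start(),
    case mnesia:change_config(extra_db_nodes, [AliveNode]) of
        {ok, [_ | _]} ->
            %% Connected to at least one node; wait for the tables
            %% that this node holds a replica of.
            Tabs = mnesia:system_info(local_tables),
            mnesia:wait_for_tables(Tabs, Timeout);
        {ok, []} ->
            %% change_config succeeded but no node could be reached.
            {error, {could_not_connect, AliveNode}};
        {error, Reason} ->
            {error, Reason}
    end.
```

Wrapped in an escript or called through rpc, something like this could
be handed to field engineers, but it should be tested against your own
schema first.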

/Dan

On Fri, Jul 16, 2010 at 5:24 AM, Igor Ribeiro Sucupira <[hidden email]> wrote:
> [...]


Re: mnesia recovery

Håkan Mattsson
On Fri, Jul 16, 2010 at 8:23 AM, Dan Gudmundsson <[hidden email]> wrote:
> Also you can bootstrap a disc_node like you do with a ram_node..
>
> Delete everything on the mnesia_dir (not the dir), start mnesia and
> call mnesia:change_config(extra_db_nodes, [AliveAndKicking]).
>
> That should copy every table that the node should have a copy of and
>  if it is available in the system

This method can be very useful if Mnesia's files have become corrupted or if
you have replaced the hardware.

Otherwise you can use mnesia:set_master_nodes/1 to force Mnesia to
load its tables from another node.
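
A small sketch of that alternative. The names master_nodes_sketch and
load_from/1 are hypothetical. It assumes the call is made on the node
that should load from MasterNode and that this node still has its
schema on disc. mnesia:set_master_nodes/1 bypasses the normal
table-load algorithm at startup, so use it only when you are sure which
node holds the data you want to keep.

```erlang
-module(master_nodes_sketch).
-export([load_from/1]).

%% MasterNode: the node whose replicas should be treated as
%%             authoritative, e.g. 'b@host2' (placeholder name).
load_from(MasterNode) ->
    case mnesia:set_master_nodes([MasterNode]) of
        ok ->
            %% Returns ok if Mnesia starts (or is already running).
            mnesia:start();
        {error, Reason} ->
            {error, Reason}
    end.
```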

/Håkan
