Mnesia tables don't finish syncing

Oliver Korpilla
Hello.

I have a system with multiple nodes: one is central, and the others start later and connect to it.

All tables are RAM copies, even the schema.

On startup of the slave nodes I do:
1) Connect nodes
2) Spawn a process on central and do mnesia:add_table_copy for each table for my new node (including schema)
3) mnesia:wait_for_tables for all of these tables

I repeat steps 2) and 3) up to 5 times in total, waiting 2s between attempts. In about 3% of cases (measured over regression test runs) it does not work, no matter how long I wait.

Step 2) behaves as expected. The calls to mnesia:add_table_copy return:
a) {atomic, ok} on the first try
b) {aborted, {already_exists, <table>, <node>}} on all further tries

In the failure cases, the list of tables I am still waiting for does not get shorter, no matter how many times I retry.
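The procedure above can be sketched roughly as follows. This is a minimal sketch, not my actual code: the module name replica_join and the functions add_copies/2, classify/1 and wait/2 are hypothetical names chosen for illustration. It treats {aborted, {already_exists, ...}} as success, on the assumption that this result only means the schema already lists a replica for that node, not that the replica has finished loading.

```erlang
-module(replica_join).
-export([add_copies/2, classify/1, wait/2]).

%% Hypothetical helper, meant to run on the central node (step 2).
%% Requests a RAM copy of every table in Tabs for Node and returns
%% a list of {Tab, ok | {error, Reason}}.
add_copies(Node, Tabs) ->
    [{Tab, classify(mnesia:add_table_copy(Tab, Node, ram_copies))}
     || Tab <- Tabs].

%% Map the return values described in the post onto ok / error.
%% Assumption: already_exists means the replica is already registered.
classify({atomic, ok}) -> ok;
classify({aborted, {already_exists, _Tab, _Node}}) -> ok;
classify({aborted, Reason}) -> {error, Reason}.

%% Meant to run on the new node (step 3). Returns ok, or the list of
%% tables that were still not accessible when the timeout expired.
wait(Tabs, TimeoutMs) ->
    case mnesia:wait_for_tables(Tabs, TimeoutMs) of
        ok -> ok;
        {timeout, Pending} -> {pending, Pending};
        {error, Reason} -> {error, Reason}
    end.
```

In the failure cases, wait/2 would keep returning {pending, Tabs} with the same tables on every retry.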

Could people on this list please suggest where I could look for faults, and how I could find them?

So far I cannot pinpoint the actual error. The only symptom is that accessing such a table fails with "badarg". I don't see why the sync fails. How can I get more logs, or a report that identifies the actual problem?

Thank you and kind regards,
Oliver
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Re: Mnesia tables don't finish syncing

Oliver Korpilla
Hello,

I made progress with this. I can get Mnesia to sync like this:

1) worker node comes up
2) connects to central node (repeat until connection is made)
3) starts Mnesia
4) spawn on central node: mnesia:change_config(extra_db_nodes, nodes())
5) call locally mnesia:set_master_nodes([<central node>])

As for the DB tables this works. Simply works.
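The sequence above can be sketched like this. It is only a sketch under assumptions: the module name worker_join, the function names, and the retry count and delay are hypothetical, and I use rpc:call here instead of a plain spawn so that the result of mnesia:change_config/2 is visible to the worker.

```erlang
-module(worker_join).
-export([join/1, connect/3]).

%% Runs on the worker node. Central is the central node's name.
join(Central) ->
    ok = connect(Central, 10, 1000),          % step 2 (values are assumptions)
    ok = mnesia:start(),                      % step 3
    %% Step 4: evaluate nodes() on the central node and pass the result
    %% to mnesia:change_config/2, also on the central node.
    case rpc:call(Central, erlang, nodes, []) of
        Nodes when is_list(Nodes) ->
            case rpc:call(Central, mnesia, change_config,
                          [extra_db_nodes, Nodes]) of
                {ok, _Connected} ->
                    %% Step 5: local call on the worker.
                    mnesia:set_master_nodes([Central]);
                Other ->
                    {error, Other}
            end;
        BadRpc ->
            {error, BadRpc}
    end.

%% Retry the connection until it succeeds or the retries run out.
connect(_Node, 0, _DelayMs) ->
    {error, not_connected};
connect(Node, Retries, DelayMs) ->
    case net_kernel:connect_node(Node) of
        true -> ok;
        _ -> timer:sleep(DelayMs),
             connect(Node, Retries - 1, DelayMs)
    end.
```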

However, I have now traded one sync problem ("Does Mnesia sync its tables between nodes?") for another ("Has Mnesia on the central node finished starting?").

If the central node boots after the worker nodes, the spawned call (step 4 above) sporadically crashes on the central node and takes Mnesia down with it.

Error message is then {case_clause, starting, [...]

mnesia:change_config does not simply return an error code. As far as I can see, the call itself fails and takes Mnesia with it. (Please correct me if I misunderstand.) I do retry when anything other than {ok, _} is returned, but this seems to crash Mnesia altogether.

I looked for a way to detect this interim state and protect myself against it, and tried the following:

Query mnesia:system_info(running_db_nodes) and check if central node is in the list.

But apparently this list is updated before Mnesia is really up, and my call into mnesia:change_config/2 still crashes.
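One alternative guard that might be worth trying is to poll mnesia:system_info(is_running) on the central node, which returns yes, no, starting or stopping, and to call mnesia:change_config/2 only once it reports yes. The following is a minimal sketch: the module name mnesia_ready and both function names are hypothetical, and whether this check actually closes the window described above is an assumption I have not verified. A small race between the check and the later call remains in any case.

```erlang
-module(mnesia_ready).
-export([central_running/1, wait_until/3]).

%% Build a probe that asks the central node for Mnesia's run state.
%% The probe yields yes | no | starting | stopping, or {badrpc, Reason}.
central_running(Central) ->
    fun() -> rpc:call(Central, mnesia, system_info, [is_running]) end.

%% Call Probe repeatedly until it returns yes or the retries run out.
wait_until(_Probe, 0, _DelayMs) ->
    {error, timeout};
wait_until(Probe, Retries, DelayMs) ->
    case Probe() of
        yes -> ok;
        _NotYet -> timer:sleep(DelayMs),
                   wait_until(Probe, Retries - 1, DelayMs)
    end.
```

A worker could then run mnesia_ready:wait_until(mnesia_ready:central_running(Central), 20, 500) before step 4, where the retry count and delay are arbitrary example values.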

How can I synchronize DB startup between nodes better?

Thank you and best regards,
Oliver 
 

Sent: Thursday, 01 June 2017 at 11:15
From: "Oliver Korpilla" <[hidden email]>
To: [hidden email]
Subject: [erlang-questions] Mnesia tables don't finish syncing