|
Howdy.  Is there a difference of opinion/definition on what
"synchronous" in Mnesia's synchronous disk logging means?

In the context of disc_copies tables ... it seems to me that Mnesia's
use of the phrase means:

  * The transaction coordinator waits synchronously for 2PC votes
    from all participants.

  * Each participant uses disk_log:log/2 or disk_log:blog/2 to
    record local votes and commit/abort decisions, but participants
    are *not* using the disk_log:sync/1 to force the log to disk.

The disk_log:sync/1 function has an extremely high penalty, but
sometimes that penalty is worth the cost.  For example, some
read+write transactions may contain data that you *really* do not want
to lose.  For data that important, if all replicas suddenly lose
power, it is possible to lose the logs and thus the newly-updated data
before it is written safely to disk on each replica machine.

But I can't find a Mnesia transaction knob/button that I can
twist/press to request that level of safety.  Is there such a thing?

-Scott
|
On Mon, 23 Jan 2006, Scott Lystig Fritchie wrote:
SLF> Howdy.  Is there a difference of opinion/definition on what
SLF> "synchronous" in Mnesia's synchronous disk logging means?
SLF>
SLF> In the context of disc_copies tables ... it seems to me that Mnesia's
SLF> use of the phrase means:
SLF>
SLF>   * The transaction coordinator waits synchronously for 2PC votes
SLF>     from all participants.

In Mnesia the coordinator always waits synchronously for 2PC (and 3PC)
votes from all participants, regardless of whether the transaction is
"synchronous" or not. In "synchronous" transactions, the coordinator
also waits for the participants to complete their part of the commit
work in the transaction before control is returned to the caller.

SLF>   * Each participant uses disk_log:log/2 or disk_log:blog/2 to
SLF>     record local votes and commit/abort decisions, but participants
SLF>     are *not* using the disk_log:sync/1 to force the log to disk.

Correct.

SLF> The disk_log:sync/1 function has an extremely high penalty, but
SLF> sometimes that penalty is worth the cost.  For example, some
SLF> read+write transactions may contain data that you *really* do not want
SLF> to lose.  For data that important, if all replicas suddenly lose
SLF> power, it is possible to lose the logs and thus the newly-updated data
SLF> before it is written safely to disk on each replica machine.

I agree that such a feature can be useful. At least if there are no
write caches enabled in the disk hardware. Otherwise you could lose
some data anyway in case of a power failure.

SLF> But I can't find a Mnesia transaction knob/button that I can
SLF> twist/press to request that level of safety.  Is there such a thing?

No, currently there is no such thing in Mnesia.

/Håkan
|
>>>>> "hm" == Hakan Mattsson <[hidden email]> writes:
hm> In Mnesia the coordinator does always wait synchronously for 2PC
hm> (and 3PC) votes from all participants, regardless of the
hm> transaction being "synchronous" or not.

That makes sense ... the coordinator can do Very Bad Things if it
doesn't gather all votes.

hm> I agree that such a feature can be useful. At least if there
hm> are no write caches enabled in the disk hardware. Otherwise you
hm> could lose some data anyway in case of a power failure.

Even if your disk subsystem(*) has an NVRAM write-back cache, there is
risk of data loss unless you explicitly call the fsync(2) system call.

With Mnesia using the disk_log module, which in turn usually uses
write(2) only, you are not certain that the OS will have copied
write(2)'s data to the disk device.  In most cases, the kernel can
(and will) wait for many seconds before flushing that data to the disk
device.

SLF> But I can't find a Mnesia transaction knob/button that I can
SLF> twist/press to request that level of safety.  Is there such a
SLF> thing?

hm> No, currently there is no such thing in Mnesia.

That's what I'd thought.

Assuming that I wanted to try to add that to Mnesia ... I think I'd
need to add extra info to the commit record that's sent to each
participant.  Something that said: this log record is important enough
to use fsync after writing.  Hm.

I suppose a poor man's safety net would be to run a shell script like
this on each Mnesia node with disc_copies or disc_only_copies:

    while [ 1 ]; do
        sync
        sleep 1
    done

Easy to do, doesn't require code changes, and would limit worst-case
data loss to roughly 1-2 seconds.  (Assuming that disk_log and the
file Port that disk_log uses do not do any buffering.)  On the other
hand, performance may suck.

Too bad disk drives are so darn slow.

-Scott

(*) Even if the disk logical device is a NVRAM/solid-state disk drive.
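The difference between a plain write(2) and a write followed by fsync(2), as discussed above, can be sketched at the shell level. This is only an illustration and not how Mnesia or disk_log behave: `durable_append` is a hypothetical helper name, and the sketch assumes a dd that supports `oflag=append` and `conv=fsync` (GNU dd does).

```shell
#!/bin/sh
# Hypothetical helper (not part of Mnesia or disk_log): append one record
# to a log file and ask the kernel to flush it before returning.
durable_append() {
    log_file=$1
    record=$2
    # oflag=append with conv=notrunc appends instead of truncating the file.
    # conv=fsync makes dd call fsync(2) on the output file before it exits,
    # so the data is not left sitting only in the kernel's page cache.
    printf '%s\n' "$record" |
        dd of="$log_file" oflag=append conv=notrunc,fsync 2>/dev/null
}

# Usage: append two illustrative commit records, then show the file.
log=$(mktemp)
durable_append "$log" "commit tid=1"
durable_append "$log" "commit tid=2"
cat "$log"
rm -f "$log"
```

Each call pays the fsync cost once per record, which is the same penalty the thread attributes to disk_log:sync/1. Whether the data then survives a power failure still depends on the disk's own write cache.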
|
Talking about poor man's solutions, you can also use mnesia:dump_log(),
which closes the files after the operation.

Log dumping is otherwise automatic, and you can control it by time or
by number of transactions; see the manual.

A per-transaction disk sync option requires some hacking, though.

/Dan

--
Dan Gudmundsson               Project:    Mnesia, Erlang/OTP
Ericsson Utvecklings AB       Phone:      +46  8 727 5762
UAB/F/P                       Mobile:     +46 70 519 9469
S-125 25 Stockholm            Visit addr: Armborstv 1
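The "time or number of transactions" controls Dan mentions appear to correspond to the Mnesia configuration parameters `dump_log_time_threshold` (milliseconds) and `dump_log_write_threshold` (number of log writes), which can be passed on the `erl` command line. The sketch below only builds and prints such a command line as a dry run. `mnesia_dump_flags` is a hypothetical helper, and the values are illustrative, not recommendations.

```shell
#!/bin/sh
# Hypothetical helper: build the erl flags that set Mnesia's automatic
# log-dump thresholds. The values are supplied by the caller.
mnesia_dump_flags() {
    time_ms=$1   # dump_log_time_threshold, in milliseconds
    writes=$2    # dump_log_write_threshold, in writes to the transaction log
    printf '%s' "-mnesia dump_log_time_threshold $time_ms -mnesia dump_log_write_threshold $writes"
}

# Dry run: print the command instead of starting an Erlang node.
echo "erl $(mnesia_dump_flags 10000 50)"
```

Lower thresholds make Mnesia dump its log more often. That is not the same as an fsync per transaction, which is the feature the thread concludes Mnesia does not offer.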
|
In reply to this post by Scott Lystig Fritchie
On 2006-01-25 22:54, Scott Lystig Fritchie wrote:
...deleted

> Even if your disk subsystem(*) has an NVRAM write-back cache, there is
> risk of data loss unless you explicitly call the fsync(2) system call.

If you are running Linux, also remember this:

''The Linux fsync man page says: "It does not necessarily ensure that
the entry in the directory containing the file has also reached disk.
For that an explicit fsync on the file descriptor of the directory is
also needed."''
(http://archives.postgresql.org/pgsql-hackers/2004-10/msg01037.php)

bengt
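The point about the directory entry can be sketched in shell as well. This assumes a sync(1) that accepts file operands and calls fsync(2) on each of them (GNU coreutils 8.24 or later does; older versions ignore the operands and sync everything). `sync_file_and_dir` and the file name `example.log` are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: flush a file and then the directory that holds its
# entry, following the advice in the fsync man page quoted above.
sync_file_and_dir() {
    file=$1
    dir=$(dirname "$file")
    sync "$file" || return 1   # flush the file's data and metadata
    sync "$dir"                # flush the directory entry for the file
}

# Usage: create a new log file in a scratch directory and flush both.
tmpdir=$(mktemp -d)
printf 'commit tid=1\n' > "$tmpdir/example.log"
sync_file_and_dir "$tmpdir/example.log" && echo "synced"
rm -rf "$tmpdir"
```

The directory flush matters mainly when the file was just created or renamed. For a log file that already exists, flushing the file itself covers newly appended data.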
