Mnesia disk performance (was RE: multi-attribute mnesia indexes?)


Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

Sean Hinde-2
> One thing about mnesia is that it's not really prepared for
> applications that write constantly to disk-based tables.

It is not optimal, I agree. There are some relatively simple things that
could be done to improve it, though.

One simple idea would be to have independently specified paths to the
various log and dets files. Certainly having the log file on its own disk
could substantially increase performance of the dumper.

Files could also be striped across multiple disks using RAID type systems.

Another more complex enhancement would be to treat the log file as a simple
recovery log and use a memory-based store as the actual source for data to
be propagated into the dets files. This could even just contain a list of
keys which have been updated in each transaction, and the dumper could get
the data from the main memory table (with some extra machinery for detection
of multiple updates of the same record. Hmmm).

Or there could be a separate UNIX thread which runs through the log and does
the propagation into the main dets files.

I'm sure there are many things which can be done - though splitting out the
path of the different files could perhaps be the simplest and most effective
(allowing one to throw more hardware bandwidth at the problem).
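To make the idea concrete, a split configuration might look something like
this (sketch only: 'dir' is a real mnesia application parameter today, but
'log_dir' and 'dets_dir' are invented here to illustrate the proposal):

```erlang
%% Sketch only: 'dir' is a real mnesia parameter, but 'log_dir' and
%% 'dets_dir' are hypothetical -- this is just what independently
%% specified paths might look like.
application:load(mnesia),
application:set_env(mnesia, dir,      "/disk1/mnesia"),  %% schema, default files
application:set_env(mnesia, log_dir,  "/disk2/mnesia"),  %% log on its own spindle
application:set_env(mnesia, dets_dir, "/disk3/mnesia"),  %% dets files elsewhere
mnesia:start().
```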

- Sean




Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

Shawn Pearce
Sean Hinde <Sean.Hinde> scrawled:
> > One thing about mnesia is that it's not really prepared for
> > applications that write constantly to disk-based tables.
>
> It is not optimal I agree. There are some relatively simple things which
> could be done to improve this though.
>
> One simple idea would be to have independently specified paths to the
> various log and dets files. Certainly having the log file on its own disk
> could substantially increase performance of the dumper.

Oracle does this with their database and it is a big performance
booster.  The other thing they do is allow a table to be striped
across multiple disks by making a table exist in multiple file system
files at once.  (They stripe disk allocations across the files.)  This
does help to manage larger tables as well.

> Files could also be striped across multiple disks using RAID type systems.

We don't have a RAID system on this machine yet, but you're correct: we
really should be working with RAID if we're serious.  (Which we
aren't yet; we're still in development.  I had just hoped for better
performance before RAID was added, as IMHO a good RAID array only adds
so much before it too becomes a bottleneck.)

> Another more complex enhancement would be to treat the log file as a simple
> recovery log and use a memory based store as the actual source for data to
> be propagated into the dets files. This could even just contain a list of
> keys which have been updated in each transaction and the dumper could get
> the data from the main memory table (with some extra stuff for detection of
> multiple updates of the same record. Hmmm).

Again, Oracle does this.  They identify a single record by its physical
disk position (and never, ever move the record).  They can then make
their log merely a redo log (indeed, that is its name).  Changes to the
data table are made in memory and written to disk in bulk, so the log
never has to be replayed against the table's data file(s); that saves
time later on.

I for one would like to be able to force my (write) transactions to
wait for the log writer to finish writing them, if the problem is that
mnesia will overload like this.  Should I step up the frequency of my
log writer to force it to write more often, and hopefully at least
simulate the writes waiting for the log?

--
Shawn.

  ``If this had been a real
    life, you would have
    received instructions
    on where to go and what
    to do.''



Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

Andi Kleen
On Tue, Jan 02, 2001 at 03:49:16PM -0500, Shawn Pearce wrote:
> Oracle does this with their database and it is a big performance
> booster.  The other thing they do is allow a table to be striped
> across multiple disks by making a table exist in multiple file system
> files at once.  (They stripe disk allocations across the files.)  This
> does help to manage larger tables as well.

Nearly all modern OSes can do that themselves using volume managers and
software RAID -- it would probably be a waste of time to implement it in
Mnesia too.


-Andi



Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

Shawn Pearce
Andi Kleen <ak> scrawled:
> On Tue, Jan 02, 2001 at 03:49:16PM -0500, Shawn Pearce wrote:
> > Oracle does this with their database and it is a big performance
> > booster.  The other thing they do is allow a table to be striped
> > across multiple disks by making a table exist in multiple file system
> > files at once.  (They stripe disk allocations across the files.)  This
> > does help to manage larger tables as well.
>
> Nearly all modern OSes can do that themselves using volume managers and
> software RAID -- it would probably be a waste of time to implement it in
> Mnesia too.

This is true, and I agree.  However, it does allow Oracle to easily
handle >2GB datafiles on Unixes that cannot deal with it.  It also
lets you stick to 32 bit file offsets by adding a file number
``prefix''.

Keep in mind it's nice to be able to split backups onto tapes by
designing the database datafiles such that one data file fits onto a
tape, or a cluster of datafiles fits onto a tape.  What if we have a
100GB database: how do we dump it onto 20GB tapes?  If it's one huge
file, it's harder to dump than a collection of 10GB files, or 1GB
files that can be put on tape 20 (or 19) at a time.

But striping may be out of the question.  Maybe it's just a linear
concatenation?

Anyway, just a thought on top of my other comments with Mnesia.

--
Shawn.

  ``If this had been a real
    life, you would have
    received instructions
    on where to go and what
    to do.''



Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

Ulf Wiger-4
On Tue, 2 Jan 2001, Shawn Pearce wrote:

>Andi Kleen <ak> scrawled:
>> On Tue, Jan 02, 2001 at 03:49:16PM -0500, Shawn Pearce wrote:
>> > Oracle does this with their database and it is a big performance
>> > booster.  The other thing they do is allow a table to be striped
>> > across multiple disks by making a table exist in multiple file system
>> > files at once.  (They stripe disk allocations across the files.)  This
>> > does help to manage larger tables as well.
>>
>> Nearly all modern OSes can do that themselves using volume managers and
>> software RAID -- it would probably be a waste of time to implement it in
>> Mnesia too.
>
>This is true, and I agree.  However, it does allow Oracle to easily
>handle >2GB datafiles on Unixes that cannot deal with it.  It also
>lets you stick to 32 bit file offsets by adding a file number
>``prefix''.

I must confess that I haven't kept up with OS vs DBMS design in the
past few years, but it used to be commonly accepted that you simply
couldn't build a really fast DBMS on top of the standard file and
memory management provided by the leading operating systems --
certainly not if you wanted similar behaviour across multiple
platforms. I don't know if it's still true...

One DBMS I worked with was Cincom's SUPRA. I remember that you had the
choice upon installing the database whether you wanted it to reside on
a normal file system (good for testing) or on a raw partition (good
for speed). In the case of the raw partition, SUPRA would use its own
file I/O driver.

I agree that the problem is worse for mnesia, since it must deal with
variable-size objects. I know that work is ongoing to rewrite dets for
much better performance. One of the biggies is that an Erlang program
will be able to perform multiple disk operations in one instruction
to the file driver. This should be a big booster for dets.

/Uffe
--
Ulf Wiger                                    tfn: +46  8 719 81 95
Senior System Architect                      mob: +46 70 519 81 95
Strategic Product & System Management    ATM Multiservice Networks
Data Backbone & Optical Services Division      Ericsson Telecom AB




job regulation (was Re: Mnesia disk performance (was RE: multi-attribute mnesia indexes?))

Ulf Wiger-4
In reply to this post by Shawn Pearce
On Tue, 2 Jan 2001, Shawn Pearce wrote:

>I for one would like to be able to force my transactions (write that
>is) to wait for the log writer to finish writing them if the problem
>is that mnesia will overload like this.  Should I step up the
>frequency of my log writer in order to force it to write more
>frequently, and hopefully at least simulate that the writes are
>waiting for the log?

The right way to do this is probably to introduce job scheduling for
transactions.

At AXD 301, we've had reason to think very hard about load regulation
-- with pretty fantastic results, I might add: the AXD 301 should be
almost impervious to denial-of-service attacks, and easily meets the
BellCore requirement of 90% throughput at 150% continuous load.

What we've found though, is that for really effective load regulation,
_all_ significant jobs in the system should be made queuable in a job
scheduler. Since we can't do this with mnesia (no hooks for load
regulation), our job scheduler samples the "background load" and takes
this into account.

I would really like to see a generic framework for plugging a load
regulator into an OTP system. The default behaviour should be "no
regulation", and it should be possible to use different load
regulators (but with common semantics) for different products.

For mnesia, I lean towards the following three measures:

- provide for synchronous transactions:
  these would work similarly to sync_dirty, i.e. they return when
  the transaction is committed on all nodes.

- make it possible/configurable to force a log dump before returning
  to the caller. I suggested a wrapper before (still haven't tested
  it myself). Would that method work?

- make it possible to hook log dumps into a load regulator. This
  would require a framework for load regulation as mentioned above.
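For the second point, the wrapper could be as simple as something like
this (untested sketch: mnesia:dump_log/0 is a real call, but using it as
a commit barrier here is the speculative part):

```erlang
%% Untested sketch of a "synchronous" write transaction: run the
%% transaction, then force a log dump before returning to the caller.
sync_transaction(Fun) ->
    case mnesia:transaction(Fun) of
        {atomic, Res} ->
            mnesia:dump_log(),   %% propagate the log into the dets files now
            {atomic, Res};
        Abort ->
            Abort
    end.
```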

To complete the picture, the programs starting transactions would of
course also be load regulated. As mnesia cannot know the priority and
total cost of a transaction, this would have to be done by the
applications.

/Uffe
--
Ulf Wiger                                    tfn: +46  8 719 81 95
Senior System Architect                      mob: +46 70 519 81 95
Strategic Product & System Management    ATM Multiservice Networks
Data Backbone & Optical Services Division      Ericsson Telecom AB




Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

Andi Kleen
In reply to this post by Ulf Wiger-4
On Wed, Jan 03, 2001 at 10:08:25AM +0100, Ulf Wiger wrote:
> I must confess that I haven't kept up with OS vs DBMS design in the
> past few years, but it used to be commonly accepted that you simply
> couldn't build a really fast DBMS on top of the standard file and
> memory management provided by the leading operating systems --
> certainly not if you wanted similar behaviour across multiple
> platforms. I don't know if it's still true...

File systems have caught up a lot, and databases are often run on
files these days.  On a modern extent-based fs with preallocated files
you basically do raw I/O within the extents.

If you want 100% the same behaviour everywhere you will of course need
to do it from scratch, but it will cost you in effort and increase the
maintenance burden on administrators a lot.


>
> One DBMS I worked with was Cincom's SUPRA. I remember that you had the
> choice upon installing the database whether you wanted it to reside on
> a normal file system (good for testing) or on a raw partition (good
> for speed). In the case of the raw partition, SUPRA would use its own
> file I/O driver.

Using your own drivers is not really an option anymore, except for some
very special cases.


-Andi



Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

David Gould
In reply to this post by Andi Kleen
On Tue, Jan 02, 2001 at 10:05:56PM +0100, Andi Kleen wrote:

> On Tue, Jan 02, 2001 at 03:49:16PM -0500, Shawn Pearce wrote:
> > Oracle does this with their database and it is a big performance
> > booster.  The other thing they do is allow a table to be striped
> > across multiple disks by making a table exist in multiple file system
> > files at once.  (They stripe disk allocations across the files.)  This
> > does help to manage larger tables as well.
>
> Nearly all modern OSes can do that themselves using volume managers and
> software RAID -- it would probably be a waste of time to implement it in
> Mnesia too.
>
> -Andi

(Psst, Andi, what are you doing on this list? ;-) )

Anyway, it is still very worthwhile to be able to place tables and indexes
by name from the DB or application, and not rely on RAID or LVM systems to
do this automagically. The DBA or app designer can know quite a lot about
access patterns and paths, and use this to assign table fragments and
indexes that are used at the same time to separate spindles or even
controllers/buses.  And one almost always wants to place transaction
(undo/redo) logs on separate spindles/paths from data spaces.

It is useful to have an LVM or RAID system to make managing storage easier,
safer, and more flexible, but in the end you need to be able to place
specific chunks'o'stuff (technical database term) onto specific logical
volumes or RAID sets anyway, to control parallelism of spindles/actuators
and not overload controllers or buses.

How hard would it be to add the capability of specifying a path of some
kind for mnesia/dets tables/indexes/logs, etc.?

-dg

--
David Gould                                                 dg
SuSE, Inc.,  580 2cd St. #210,  Oakland, CA 94607          510.628.3380
why would you want to own /dev/null? "ooo! ooo! look! i stole nothing!
i'm the thief of nihilism! i'm the new god of zen monks."



Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

David Gould
In reply to this post by Ulf Wiger-4
On Wed, Jan 03, 2001 at 10:08:25AM +0100, Ulf Wiger wrote:
>
> I must confess that I haven't kept up with OS vs DBMS design in the
> past few years, but it used to be commonly accepted that you simply
> couldn't build a really fast DBMS on top of the standard file and
> memory management provided by the leading operating systems --
> certainly not if you wanted similar behaviour across multiple
> platforms. I don't know if it's still true...

Mostly still true in the hardcore DBMS world, but better/easier than it
used to be.
 
> choice upon installing the database whether you wanted it to reside on
> a normal file system (good for testing) or on a raw partition (good
> for speed). In the case of the raw partition, SUPRA would use its own
> file I/O driver.

This is still common; it really depends on whether the OS provides both
unbuffered and asynchronous I/O to files (e.g. VMS, NT, some Unixes), or
only to raw disks (e.g. some Unixes, current Linux (work in progress)).

The real issue is unbuffered and asynchronous I/O: DBMSs are happy to use
a filesystem if it can provide these features.

-dg

--
David Gould                                                 dg
SuSE, Inc.,  580 2cd St. #210,  Oakland, CA 94607          510.628.3380
why would you want to own /dev/null? "ooo! ooo! look! i stole nothing!
i'm the thief of nihilism! i'm the new god of zen monks."



Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

Claes Wikström
In reply to this post by David Gould
On Wed, Jan 03, 2001 at 12:07:21PM -0800, David Gould wrote:
> On Tue, Jan 02, 2001 at 10:05:56PM +0100, Andi Kleen wrote:
>
> How hard would it be to add the capability of specifying a path of some
> kind for mnesia/dets tables/indexes/logs, etc.?
>

This would be pretty straightforward: maybe one directory per table
(specified in create_table) and one directory for the logs.


/klacke



--
Claes Wikstrom                        -- Caps lock is nowhere and
Alteon WebSystems                     -- everything is under control          
http://www.bluetail.com/~klacke       --




Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

Shawn Pearce
Klacke <klacke> scrawled:
> On Wed, Jan 03, 2001 at 12:07:21PM -0800, David Gould wrote:
> > On Tue, Jan 02, 2001 at 10:05:56PM +0100, Andi Kleen wrote:
> >
> > How hard would it be to add the capability of specifying a path of some
> > kind for mnesia/dets tables/indexes/logs, etc.?
> >
>
> This would be pretty straightforward, maybe one directory per table
> (specified in create_table) and one directory for the logs

As a create_table option?

    {atomic, ok} = mnesia:create_table(sequence,
                     [{disc_copies, [node()]},
                      {directory, "/mnesia01"},
                      {attributes, record_info(fields, sequence)}])

My only concern is that not all nodes in a distributed database would
necessarily want to store the tables at the same directory path.
(Different drive structures, for instance.)

What about making the path always relative to the mnesia directory
given on the command line to erts?  Then if you want to relocate the
files of a table to another disk, you can make a symlink in the mnesia
data directory that points to a directory on the other disk.  So long as
all nodes have the same structure in their mnesia directory, you're safe.
This assumes that all OSes have a good notion of symlinks.  I'm no
Windows guru, but that makes me think perhaps it's not a good idea
on Windows.
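On Unix the relocation itself would be trivial, e.g. (paths invented for
illustration; file:make_symlink/2 is a real call but fails on platforms
without symlinks):

```erlang
%% Paths are invented for illustration.  The mnesia directory stays
%% where erts expects it; the table's files live on another disk.
ok = file:make_symlink("/disk2/mnesia/sequence",
                       "/var/mnesia/Mnesia.foo@bar/sequence").
```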

The other option is to record, either in schema or in a new table that
goes with schema, the name of a table, the name of the node, and the
directory path of the storage file for the table.  If mnesia could
bring the system tables 'schema' and 'table_storage' online while
ignoring all other tables, the user could move the file, update the
mnesia system table 'table_storage', and then bring the system up the
rest of the way.  Oracle does this.  I'm sure other database vendors do
too.

Maybe rather than storing the locations in dets, they could be stored in
a config file that is parsed at mnesia startup.  That way it's changeable
before one starts mnesia.  If an entry doesn't exist in the config
file, the default location (the mnesia directory) could be used.
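Such a config file could be read with file:consult/1 (a real function;
the file format and function below are invented for the sketch):

```erlang
%% Sketch: read {Table, Node, Dir} placements with file:consult/1.
%% The file format is invented; a table_storage.config might contain:
%%   {sequence, 'db@host1', "/disk2/mnesia/sequence"}.
load_placements(File) ->
    case file:consult(File) of
        {ok, Placements} -> Placements;   %% [{Table, Node, Dir}]
        {error, enoent}  -> []            %% no config file: use defaults
    end.
```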

--
Shawn.

  ``If this had been a real
    life, you would have
    received instructions
    on where to go and what
    to do.''



Mnesia disk performance (was RE: multi-attribute mnesia indexes?)

David Gould
On Wed, Jan 03, 2001 at 06:35:58PM -0500, Shawn Pearce wrote:

> Klacke <klacke> scrawled:
> > On Wed, Jan 03, 2001 at 12:07:21PM -0800, David Gould wrote:
> > > On Tue, Jan 02, 2001 at 10:05:56PM +0100, Andi Kleen wrote:
> > >
> > > How hard would it be to add the capability of specifying a path of some
> > > kind for mnesia/dets tables/indexes/logs, etc.?
> > >
> >
> > This would be pretty straightforward, maybe one directory per table
> > (specified in create_table) and one directory for the logs
>
> As a create_table option?
>
>     {atomic, ok} = mnesia:create_table(sequence,
>                      [{disc_copies, [node()]},
>                       {directory, "/mnesia01"},
>                       {attributes, record_info(fields, sequence)}])
>
> My only concern is that not all nodes in a distributed database would
> necessarily want to store the tables at the same directory path.
> (Different drive structures, for instance.)

Exactly.  It really needs to be a per-table/per-node option.

> What about making the path always relative to the mnesia directory
> given on the command line to erts?  Then if you want to relocate the
> files of a table to another disk, you can make a symlink in the mnesia
> data directory to point to a directory on the other disk.  So long as
> all nodes have the same structure of their mnesia directory, you're safe.

This could work.

> This is assuming that all OSes have a good notion of symlinks.  I'm no
> Windows guru, but this makes me think that perhaps its not a good idea
> to do with Windows.

Er, except on Windows.
 
> The other option is to make either in schema or a new table that goes
> with schema that records the name of a table, the name of the node
> and the directory path of the storage file for the table.  If mnesia
> could bring the system tables 'schema' and 'table_storage' online,
> and ignore all other tables, the user could move the file, update the
> mnesia system table 'table_storage', and then bring the system up the
> rest of the way.  Oracle does this.  I'm sure other database vendors do
> too.

This is how PostgreSQL, Sybase, and Informix do it too. As far as I know
it is the "usual way". Of course, for Mnesia this would want to be a
replicated table indexed on {node, table}.

-dg

--
David Gould                                                 dg
SuSE, Inc.,  580 2cd St. #210,  Oakland, CA 94607          510.628.3380
why would you want to own /dev/null? "ooo! ooo! look! i stole nothing!
i'm the thief of nihilism! i'm the new god of zen monks."