Quantcast

Index Overhead In Mnesia

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Index Overhead In Mnesia

Ben Hood
Hi,

I'm looking into the rate of inserting rows in mnesia.

Having written the attached test (that can be parameterized to insert  
an arbitrary amount of rows in arbitrary chunk sizes), I've found out  
so far that the highest throughput seems to be somebody where between  
50 and 200 per transaction.

What surprised me a bit is the magnitude of the effect that index  
maintenance has on the rate of insertion.

If I place secondary indexes on two non-key attributes, the throughput  
drops off considerably.

For example, inserting 10000 rows in batches of 1000 whilst  
maintaining 2 non-key indexes produces the following rates of  
insertion per batch:

rate:insert(10000,1000).
Batch rate = 10688
Batch rate = 7182
Batch rate = 5001
Batch rate = 4072
Batch rate = 3300
Batch rate = 2866
Batch rate = 2377
Batch rate = 2166
Batch rate = 1807
Batch rate = 1303

The Batch rate is the amount of inserts per second in each batch.

This tallies up with the idea that at the beginning the index overhead  
is tiny, but grows on each insertion, which is normal.

I just didn't think that the throughput would drop off so sharply.

Does anybody know if I'm doing something completely wrong or if there  
is a much better way to use mnesia with large tables?

Thanks,

Ben

-module(rate).

-compile(export_all).

-record(a, {id,first,second}).

init() ->
     mnesia:create_schema([node()]),
     mnesia:start(),
     mnesia:delete_table(a),
     mnesia:create_table(a,
                         [{attributes, record_info(fields, a)}]),
     mnesia:add_table_index(a,first),
     mnesia:add_table_index(a,second),
     ok.

insert(N,BatchSize) ->
     mnesia:clear_table(a),
     batch(N, BatchSize).

batch(0,_) -> ok;
batch(N,BS) ->
     F = fun() -> write(#a{first = BS,second = BS},BS) end,
     {Time,_} = timer:tc(mnesia,transaction,[F]),
     io:format("Batch rate = ~p~n",[round(BS / Time * 1000000)]),
     batch(N - BS, BS).

write(_,0) -> ok;
write(X,N) ->
     mnesia:write(X#a{id = now()}),
     write(X,N-1).
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Hynek Vychodil
It is not obvious, but try change your write:

write(_,0) -> ok;
write(X,N) ->
    T = now(),
    mnesia:write(X#a{id = T, first = T, second = T}),
    write(X,N-1).

and think what happened to indexes in previous version :-)

On Tue, Jun 10, 2008 at 1:51 PM, Ben Hood <[hidden email]> wrote:
Hi,

I'm looking into the rate of inserting rows in mnesia.

Having written the attached test (that can be parameterized to insert
an arbitrary amount of rows in arbitrary chunk sizes), I've found out
so far that the highest throughput seems to be somebody where between
50 and 200 per transaction.

What surprised me a bit is the magnitude of the effect that index
maintenance has on the rate of insertion.

If I place secondary indexes on two non-key attributes, the throughput
drops off considerably.

For example, inserting 10000 rows in batches of 1000 whilst
maintaining 2 non-key indexes produces the following rates of
insertion per batch:

rate:insert(10000,1000).
Batch rate = 10688
Batch rate = 7182
Batch rate = 5001
Batch rate = 4072
Batch rate = 3300
Batch rate = 2866
Batch rate = 2377
Batch rate = 2166
Batch rate = 1807
Batch rate = 1303

The Batch rate is the amount of inserts per second in each batch.

This tallies up with the idea that at the beginning the index overhead
is tiny, but grows on each insertion, which is normal.

I just didn't think that the throughput would drop off so sharply.

Does anybody know if I'm doing something completely wrong or if there
is a much better way to use mnesia with large tables?

Thanks,

Ben

-module(rate).

-compile(export_all).

-record(a, {id,first,second}).

init() ->
    mnesia:create_schema([node()]),
    mnesia:start(),
    mnesia:delete_table(a),
    mnesia:create_table(a,
                        [{attributes, record_info(fields, a)}]),
    mnesia:add_table_index(a,first),
    mnesia:add_table_index(a,second),
    ok.

insert(N,BatchSize) ->
    mnesia:clear_table(a),
    batch(N, BatchSize).

batch(0,_) -> ok;
batch(N,BS) ->
    F = fun() -> write(#a{first = BS,second = BS},BS) end,
    {Time,_} = timer:tc(mnesia,transaction,[F]),
    io:format("Batch rate = ~p~n",[round(BS / Time * 1000000)]),
    batch(N - BS, BS).

write(_,0) -> ok;
write(X,N) ->
    mnesia:write(X#a{id = now()}),
    write(X,N-1).
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions



--
--Hynek (Pichi) Vychodil
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Ben Hood
Hynek,

On 10 Jun 2008, at 13:46, Hynek Vychodil wrote:

> It is not obvious, but try change your write:
>
> write(_,0) -> ok;
> write(X,N) ->
>     T = now(),
>     mnesia:write(X#a{id = T, first = T, second = T}),
>     write(X,N-1).
>
> and think what happened to indexes in previous version :-)

Very interesting!

So, if I understand you correctly, what I conclude is that the index  
performance degrades when you have lots of identical keys.

Do you know why this is the case?

Does the index perhaps maintain a list of values for each key and  
hence the set of values has to be updated continuously?

If so, is this a bug or a feature?

Thx,

Ben
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Hynek Vychodil
I think this is feature of bag and duplicate_bag ets table types and is mentioned in manual.

On Tue, Jun 10, 2008 at 5:00 PM, Ben Hood <[hidden email]> wrote:
Hynek,


On 10 Jun 2008, at 13:46, Hynek Vychodil wrote:

It is not obvious, but try change your write:

write(_,0) -> ok;
write(X,N) ->
   T = now(),
   mnesia:write(X#a{id = T, first = T, second = T}),
   write(X,N-1).

and think what happened to indexes in previous version :-)

Very interesting!

So, if I understand you correctly, what I conclude is that the index performance degrades when you have lots of identical keys.

Do you know why this is the case?

Does the index perhaps maintain a list of values for each key and hence the set of values has to be updated continuously?

If so, is this a bug or a feature?

Thx,

Ben



--
--Hynek (Pichi) Vychodil
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Hynek Vychodil
In reply to this post by Ben Hood
I think this is feature of bag and duplicate_bag ets table types and is mentioned in manual.

On Tue, Jun 10, 2008 at 5:00 PM, Ben Hood <[hidden email]> wrote:
Hynek,


On 10 Jun 2008, at 13:46, Hynek Vychodil wrote:

It is not obvious, but try change your write:

write(_,0) -> ok;
write(X,N) ->
   T = now(),
   mnesia:write(X#a{id = T, first = T, second = T}),
   write(X,N-1).

and think what happened to indexes in previous version :-)

Very interesting!

So, if I understand you correctly, what I conclude is that the index performance degrades when you have lots of identical keys.

Do you know why this is the case?

Does the index perhaps maintain a list of values for each key and hence the set of values has to be updated continuously?

If so, is this a bug or a feature?

Thx,

Ben



--
--Hynek (Pichi) Vychodil
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Ben Hood
Hynek,

On Tue, Jun 10, 2008 at 4:30 PM, Hynek Vychodil
<[hidden email]> wrote:
> I think this is feature of bag and duplicate_bag ets table types and is
> mentioned in manual.

Maybe, but my table does not specify a type, so it be a set by default.

According to the mnesia manual, the three types of table are set,
ordered_set or bag, whereas in ets/dets you can have a duplicate_bag
as well.

In the mnesia man page it states:

"Indices do not come free, they occupy space which is proportional to
the size of the table. They also cause insertions into the table to
execute slightly slower."

But I can't see anything specific about using bags and indices.

I googled this post by Ulf a few years ago, which may shed some light
onto the situation:

http://www.erlang.org/pipermail/erlang-questions/2005-July/016083.html

There seems to be a reference to a (now unmaintained) version of a
mnesia index, which I assume to be the rdbms module in jungerl.

Does anybody know if that addresses the issue, or even exactly what
the underlying problem is?

Ben
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Sean Hinde
In reply to this post by Ben Hood
Hi,

> So, if I understand you correctly, what I conclude is that the index
> performance degrades when you have lots of identical keys.

Yep!

> Do you know why this is the case?
>
> Does the index perhaps maintain a list of values for each key and
> hence the set of values has to be updated continuously?

Yes

> If so, is this a bug or a feature?

It is a long standing "feature" of mnesia secondary indexes. It is one  
that I also ran into a few weeks into my use of Erlang/mnesia.

I guess it is in the same category as repeated queue scanning for  
selective receive - easy to workaround in most cases, so no major  
incentive to change the behaviour.

Sean

_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Ben Hood
> I guess it is in the same category as repeated queue scanning for selective
> receive - easy to workaround in most cases, so no major incentive to change
> the behaviour.

So you're saying that one should decompose the duplicates into another
table that just contains a canonical representation of each unique
combination and just reference these from the original table?

Ben
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Hynek Vychodil


On Tue, Jun 10, 2008 at 9:44 PM, Ben Hood <[hidden email]> wrote:
> I guess it is in the same category as repeated queue scanning for selective
> receive - easy to workaround in most cases, so no major incentive to change
> the behaviour.

So you're saying that one should decompose the duplicates into another
table that just contains a canonical representation of each unique
combination and just reference these from the original table?

Ben

or don't use index if cardinality of element is far less than cardinality of table. Index is useless in this case anyway.

--
--Hynek (Pichi) Vychodil
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Scott Lystig Fritchie
In reply to this post by Sean Hinde
Sean Hinde <[hidden email]> wrote:

>> If so, is this a bug or a feature?

sh> It is a long standing "feature" of mnesia secondary indexes. It is
sh> one that I also ran into a few weeks into my use of Erlang/mnesia.

Yes, "feature" is a good word for it.

sh> I guess it is in the same category as repeated queue scanning for
sh> selective receive - easy to workaround in most cases, so no major
sh> incentive to change the behaviour.

At least one workaround has been mentioned elsewhere in this thread,
IIRC.  Another one is to change the term stored in each secondary
index'ed column from:

  term()                     ... where term() is too popular, e.g. 5

to something like:

  {term(), much_less_frequent_filler_term()}

... where the 2nd term is calculated by something like:

  {_, _, FillerNumber} = erlang:now()

Because the entire term is unique (or has very few duplicates), the
secondary index bag (or duplicate_bag, I forget) doesn't blow up with
O(N^2) behavior.

This causes problems such as:

* needing to use index_match_object() (e.g. with a pattern of {5, '_'})
instead of index_read() to fetch records matching 5.

* needing to use {DesiredTerm, '_'} in other places that you perform
queries, such as using QLC.

* perhaps others that I've forgotten?

-Scott
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Index Overhead In Mnesia

Ben Hood
Scott,

On Thu, Jun 12, 2008 at 12:32 AM, Scott Lystig Fritchie
<[hidden email]> wrote:

>
>  term()                     ... where term() is too popular, e.g. 5
>
> to something like:
>
>  {term(), much_less_frequent_filler_term()}
>
> ... where the 2nd term is calculated by something like:
>
>  {_, _, FillerNumber} = erlang:now()
>
> Because the entire term is unique (or has very few duplicates), the
> secondary index bag (or duplicate_bag, I forget) doesn't blow up with
> O(N^2) behavior.

Doesn't this render the index useless though? Is this not just
creating more index entries so that reading the index is the same as
doing a full table scan?

Ben
_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
Loading...