|
Hi,
I'm looking into the rate of inserting rows in mnesia. Having written the attached test (that can be parameterized to insert an arbitrary amount of rows in arbitrary chunk sizes), I've found out so far that the highest throughput seems to be somebody where between 50 and 200 per transaction. What surprised me a bit is the magnitude of the effect that index maintenance has on the rate of insertion. If I place secondary indexes on two non-key attributes, the throughput drops off considerably. For example, inserting 10000 rows in batches of 1000 whilst maintaining 2 non-key indexes produces the following rates of insertion per batch: rate:insert(10000,1000). Batch rate = 10688 Batch rate = 7182 Batch rate = 5001 Batch rate = 4072 Batch rate = 3300 Batch rate = 2866 Batch rate = 2377 Batch rate = 2166 Batch rate = 1807 Batch rate = 1303 The Batch rate is the amount of inserts per second in each batch. This tallies up with the idea that at the beginning the index overhead is tiny, but grows on each insertion, which is normal. I just didn't think that the throughput would drop off so sharply. Does anybody know if I'm doing something completely wrong or if there is a much better way to use mnesia with large tables? Thanks, Ben -module(rate). -compile(export_all). -record(a, {id,first,second}). init() -> mnesia:create_schema([node()]), mnesia:start(), mnesia:delete_table(a), mnesia:create_table(a, [{attributes, record_info(fields, a)}]), mnesia:add_table_index(a,first), mnesia:add_table_index(a,second), ok. insert(N,BatchSize) -> mnesia:clear_table(a), batch(N, BatchSize). batch(0,_) -> ok; batch(N,BS) -> F = fun() -> write(#a{first = BS,second = BS},BS) end, {Time,_} = timer:tc(mnesia,transaction,[F]), io:format("Batch rate = ~p~n",[round(BS / Time * 1000000)]), batch(N - BS, BS). write(_,0) -> ok; write(X,N) -> mnesia:write(X#a{id = now()}), write(X,N-1). _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
It is not obvious, but try change your write:
write(_,0) -> ok; write(X,N) -> T = now(), mnesia:write(X#a{id = T, first = T, second = T}), write(X,N-1). and think what happened to indexes in previous version :-) On Tue, Jun 10, 2008 at 1:51 PM, Ben Hood <[hidden email]> wrote: Hi, -- --Hynek (Pichi) Vychodil _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
Hynek,
On 10 Jun 2008, at 13:46, Hynek Vychodil wrote: > It is not obvious, but try change your write: > > write(_,0) -> ok; > write(X,N) -> > T = now(), > mnesia:write(X#a{id = T, first = T, second = T}), > write(X,N-1). > > and think what happened to indexes in previous version :-) Very interesting! So, if I understand you correctly, what I conclude is that the index performance degrades when you have lots of identical keys. Do you know why this is the case? Does the index perhaps maintain a list of values for each key and hence the set of values has to be updated continuously? If so, is this a bug or a feature? Thx, Ben _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
I think this is feature of bag and duplicate_bag ets table types and is mentioned in manual.
On Tue, Jun 10, 2008 at 5:00 PM, Ben Hood <[hidden email]> wrote: Hynek, -- --Hynek (Pichi) Vychodil _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
In reply to this post by Ben Hood
I think this is feature of bag and duplicate_bag ets table types and is mentioned in manual.
On Tue, Jun 10, 2008 at 5:00 PM, Ben Hood <[hidden email]> wrote: Hynek, -- --Hynek (Pichi) Vychodil _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
Hynek,
On Tue, Jun 10, 2008 at 4:30 PM, Hynek Vychodil <[hidden email]> wrote: > I think this is feature of bag and duplicate_bag ets table types and is > mentioned in manual. Maybe, but my table does not specify a type, so it be a set by default. According to the mnesia manual, the three types of table are set, ordered_set or bag, whereas in ets/dets you can have a duplicate_bag as well. In the mnesia man page it states: "Indices do not come free, they occupy space which is proportional to the size of the table. They also cause insertions into the table to execute slightly slower." But I can't see anything specific about using bags and indices. I googled this post by Ulf a few years ago, which may shed some light onto the situation: http://www.erlang.org/pipermail/erlang-questions/2005-July/016083.html There seems to be a reference to a (now unmaintained) version of a mnesia index, which I assume to be the rdbms module in jungerl. Does anybody know if that addresses the issue, or even exactly what the underlying problem is? Ben _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
In reply to this post by Ben Hood
Hi,
> So, if I understand you correctly, what I conclude is that the index > performance degrades when you have lots of identical keys. Yep! > Do you know why this is the case? > > Does the index perhaps maintain a list of values for each key and > hence the set of values has to be updated continuously? Yes > If so, is this a bug or a feature? It is a long standing "feature" of mnesia secondary indexes. It is one that I also ran into a few weeks into my use of Erlang/mnesia. I guess it is in the same category as repeated queue scanning for selective receive - easy to workaround in most cases, so no major incentive to change the behaviour. Sean _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
> I guess it is in the same category as repeated queue scanning for selective
> receive - easy to workaround in most cases, so no major incentive to change > the behaviour. So you're saying that one should decompose the duplicates into another table that just contains a canonical representation of each unique combination and just reference these from the original table? Ben _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
On Tue, Jun 10, 2008 at 9:44 PM, Ben Hood <[hidden email]> wrote:
or don't use index if cardinality of element is far less than cardinality of table. Index is useless in this case anyway. -- --Hynek (Pichi) Vychodil _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
In reply to this post by Sean Hinde
Sean Hinde <[hidden email]> wrote:
>> If so, is this a bug or a feature? sh> It is a long standing "feature" of mnesia secondary indexes. It is sh> one that I also ran into a few weeks into my use of Erlang/mnesia. Yes, "feature" is a good word for it. sh> I guess it is in the same category as repeated queue scanning for sh> selective receive - easy to workaround in most cases, so no major sh> incentive to change the behaviour. At least one workaround has been mentioned elsewhere in this thread, IIRC. Another one is to change the term stored in each secondary index'ed column from: term() ... where term() is too popular, e.g. 5 to something like: {term(), much_less_frequent_filler_term()} ... where the 2nd term is calculated by something like: {_, _, FillerNumber} = erlang:now() Because the entire term is unique (or has very few duplicates), the secondary index bag (or duplicate_bag, I forget) doesn't blow up with O(N^2) behavior. This causes problems such as: * needing to use index_match_object() (e.g. with a pattern of {5, '_'}) instead of index_read() to fetch records matching 5. * needing to use {DesiredTerm, '_'} in other places that you perform queries, such as using QLC. * perhaps others that I've forgotten? -Scott _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
|
Scott,
On Thu, Jun 12, 2008 at 12:32 AM, Scott Lystig Fritchie <[hidden email]> wrote: > > term() ... where term() is too popular, e.g. 5 > > to something like: > > {term(), much_less_frequent_filler_term()} > > ... where the 2nd term is calculated by something like: > > {_, _, FillerNumber} = erlang:now() > > Because the entire term is unique (or has very few duplicates), the > secondary index bag (or duplicate_bag, I forget) doesn't blow up with > O(N^2) behavior. Doesn't this render the index useless though? Is this not just creating more index entries so that reading the index is the same as doing a full table scan? Ben _______________________________________________ erlang-questions mailing list [hidden email] http://www.erlang.org/mailman/listinfo/erlang-questions |
| Powered by Nabble | Edit this page |
