Why so slow kmeans implementation

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Why so slow kmeans implementation

Andrea Peruffo
is a benchmark of different languages on a kmeans clustering algo.

I made an implementation of it but is terribly slow...

A question is about the data structure "dict" I used, is that the proper use case or there is anything better?
I have checked with "ets" but looks even slower...

Is there anything wrong with the provided implementation?

Thanks in advance.



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Why so slow kmeans implementation

dmkolesnikov
Hello,

dict is not a fastest data structure structure, ets out performs it by level of magnitude.
if you have an issue with "ets performance” then you are doing something wrong.  
Personally, I prefer to use gb_trees, rbtree or binary-search tree instead of dict. 

Could you share your test data set for kmean? 
We can look deeply where the issue is.

Best Regards, 
Dmitry

On 24 Feb 2015, at 10:40, Andrea Peruffo <[hidden email]> wrote:

is a benchmark of different languages on a kmeans clustering algo.

I made an implementation of it but is terribly slow...

A question is about the data structure "dict" I used, is that the proper use case or there is anything better?
I have checked with "ets" but looks even slower...

Is there anything wrong with the provided implementation?

Thanks in advance.


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Why so slow kmeans implementation

Björn Gustavsson-4
In reply to this post by Andrea Peruffo
On Tue, Feb 24, 2015 at 9:40 AM, Andrea Peruffo
<[hidden email]> wrote:

> This:
> https://github.com/andreaferretti/kmeans
> is a benchmark of different languages on a kmeans clustering algo.
>
> I made an implementation of it but is terribly slow...
>
> A question is about the data structure "dict" I used, is that the proper use
> case or there is anything better?
> I have checked with "ets" but looks even slower...
>

Using sofs was 10 times faster on my computer:

groupBy(L, Fn) ->
TableId = groupBy(L, Fn, []),
values(TableId).

values([]) -> [];
values([{_, V} | T]) ->
[V | values(T)].

groupBy([H|T], Fn, Acc) ->
    Pair = {erlang:phash2(Fn(H)),H},
    groupBy(T, Fn, [Pair|Acc]);
groupBy([], _, Acc) ->
    L0 = sofs:relation(Acc),
    L = sofs:relation_to_family(L0),
    sofs:to_external(L).

https://github.com/andreaferretti/kmeans/pull/12

/Bjorn

--
Björn Gustavsson, Erlang/OTP, Ericsson AB
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Why so slow kmeans implementation

Loïc Hoguin-3
On 02/24/2015 10:53 AM, Björn Gustavsson wrote:

> On Tue, Feb 24, 2015 at 9:40 AM, Andrea Peruffo
> <[hidden email]> wrote:
>> This:
>> https://github.com/andreaferretti/kmeans
>> is a benchmark of different languages on a kmeans clustering algo.
>>
>> I made an implementation of it but is terribly slow...
>>
>> A question is about the data structure "dict" I used, is that the proper use
>> case or there is anything better?
>> I have checked with "ets" but looks even slower...
>>
>
> Using sofs was 10 times faster on my computer:

If only a kind soul could improve the sofs documentation so it is
understandable by lesser human beings... :-)

--
Loïc Hoguin
http://ninenines.eu
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Why so slow kmeans implementation

Hynek Vychodil
I second that.

BTW the process dictionary version is 50 times faster on mine computer.


Anyway I would definitely not use Erlang for this sort of task in the production. I would write NIF or use port or C-node for this sort of task.

On Tue, Feb 24, 2015 at 11:36 AM, Loïc Hoguin <[hidden email]> wrote:
On 02/24/2015 10:53 AM, Björn Gustavsson wrote:
On Tue, Feb 24, 2015 at 9:40 AM, Andrea Peruffo
<[hidden email]> wrote:
This:
https://github.com/andreaferretti/kmeans
is a benchmark of different languages on a kmeans clustering algo.

I made an implementation of it but is terribly slow...

A question is about the data structure "dict" I used, is that the proper use
case or there is anything better?
I have checked with "ets" but looks even slower...


Using sofs was 10 times faster on my computer:

If only a kind soul could improve the sofs documentation so it is understandable by lesser human beings... :-)

--
Loïc Hoguin
http://ninenines.eu

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Why so slow kmeans implementation

Hynek Vychodil
In reply to this post by Andrea Peruffo
The main problem is the behaviour of dict:append/3. From documentation:

For example append/3 could be defined as:

append(Key, Val, D) ->
    update(Key, fun (Old) -> Old ++ [Val] end, [Val], D).

dict module is implemented in pure erlang which means Old ++ [Val] is what exactly happens behind. When Old part is growing it becomes the problem due its quadratic complexity. The solution is use dict:update/4 in the way:

preppend(Key, Val, D) ->
    dict:update(Key, fun (Old) -> [Val|Old] end, [Val], D).

With this fix, it is around 24 times faster on mine computer. https://github.com/andreaferretti/kmeans/pull/14

Faster process dictionary version is available at https://github.com/pichi/kmeans/tree/proc_dict
It is not good practice but can be an inspiration how to write fast code when there is not other option like NIF, C-port or C-node.


On Tue, Feb 24, 2015 at 9:40 AM, Andrea Peruffo <[hidden email]> wrote:
is a benchmark of different languages on a kmeans clustering algo.

I made an implementation of it but is terribly slow...

A question is about the data structure "dict" I used, is that the proper use case or there is anything better?
I have checked with "ets" but looks even slower...

Is there anything wrong with the provided implementation?

Thanks in advance.



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions