ETS performance visualisation


ETS performance visualisation

Kacper Mentel
Hi! 

My name is Kacper and I'm participating in this year's edition of Google Summer of Code under the BEAM Community. My GSoC project is an extension to Erlang Performance Lab (erlangpl), an open-source project aimed at analysing the performance of Erlang and Elixir systems. The project provides web-based visualisations of the analysis, which can help developers better understand their systems, observe system behaviour and identify weaknesses.

My work is focused on visualising ETS table performance. If you want to learn more about my GSoC project or erlangpl, here are the links to have a look at:


The extension to EPL I'm going to implement will deliver three views which can be seen as layers through which you can look at the ETS tables. The idea is simple: if you want to trace a problem related to an ETS table, you start from the top-level view (the most general one) and go down through the subsequent layers (views) until you reach the one that helps you identify the source of the problem. The scheme of the views looks as follows: cluster view (shows all the nodes in a cluster) -> node view (shows all ETS tables on a node) -> ETS details (shows more detailed information about a single table).

I've implemented the first view so far. Here is the screenshot: https://drive.google.com/file/d/0BzXT6yCa9ckQVzF0RnhibG9LbFE/view?usp=sharing

Now I'm working on the second view (node view). Here are some screenshots:

It shows all the ETS tables on a node linked to their owner processes. The color of a table reflects the amount of memory allocated for it (red -> the heaviest, blue -> the lightest). This is only an example of how we can visualise ETS-related data.

The most important questions to you are: 
- What is worth measuring and visualising in terms of ETS performance? 
- What kind of information can help developers to find ETS related problems?

Here are some of my (and the EPL team's) ideas; a rough sketch of how some of them could be collected follows the list:

- the amount of memory allocated for a particular ETS table,
- memory utilization/fragmentation - the real size of a particular table in bytes compared to the amount of memory allocated for that table,
- the number of rows in a particular table,
- some statistics about the locks protecting ETS tables (collected using the lcnt module),
- mean access time to the table over a given time window.
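
Here is a rough sketch of how the simpler per-table metrics above (allocated memory and row count) could be gathered on a node. This is not the actual EPL code; the module and function names are just illustrative:

    -module(ets_metrics).
    -export([snapshot/0]).

    %% Collect basic metrics for every ETS table on the local node.
    snapshot() ->
        WordSize = erlang:system_info(wordsize),
        [table_info(Tab, WordSize) || Tab <- ets:all()].

    %% A production version would also handle tables deleted concurrently,
    %% in which case ets:info/2 returns undefined.
    table_info(Tab, WordSize) ->
        #{table  => Tab,
          owner  => ets:info(Tab, owner),
          rows   => ets:info(Tab, size),              %% number of objects
          memory => ets:info(Tab, memory) * WordSize  %% allocated memory in bytes
         }.

Lock statistics and access times would need more machinery (lcnt and tracing, respectively), but simple calls like the above already cover memory and row counts.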

I would be very grateful for your feedback, suggestions and ideas.

Cheers,
Kacper


Re: ETS performance visualisation

Bastien Chamagne

Hi Kacper, amazing idea.

Without thinking too much I'd say "writes per second" and "reads per second" are good metrics, along with whether write_concurrency or read_concurrency is enabled.
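
For reference, both of those creation options can be read back from a live table with ets:info/2, so a view could flag them per table. A minimal sketch (the helper name is just illustrative):

    %% Return the concurrency options a table was created with.
    concurrency_flags(Tab) ->
        {ets:info(Tab, read_concurrency), ets:info(Tab, write_concurrency)}.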

Good luck on this project!



Re: ETS performance visualisation

Jesper Louis Andersen-2
On Wed, Jul 5, 2017 at 12:46 PM Kacper Mentel <[hidden email]> wrote:

- mean access time to the table over a given time window.


This is a pet peeve of mine: the mean access time is often misleading, and usually completely wrong.

1. You can only use the mean for anything once you know the statistical model the data fits. Say we *assume* the data is normally distributed. Then we can use the mean for something, as long as we also report the variance. But a general rule of computer science is that data is rarely normally distributed. It is much more common for data to be (bi-)modal: there is a fast case, and then a slow code path for some pathological case. Thus, any mention of the mean will report a number in between the fast and slow cases, where there is no actual data!

2. Reporting the median (50th percentile) is slightly better, but it signals "I don't care about half of my customers" in the sense that you ignore half of the requests. For anything I do, I'm far more interested in the 90th, 95th, 99th, 99.9th, 99.99th and 99.999th percentiles and the maximal value than in the mean. Tracking this is easily done with HdrHistogram (see Gil Tene's work; the idea is to make histogram buckets follow the structure of a floating-point representation with exponent and mantissa, which keeps the resolution high around 0.0). A naive percentile sketch follows point 3 below.

3. I'm interested in a histogram over the latencies. But since histograms require you to pick the width of the bars, a kernel density plot is almost always what I go for here.
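
To make the percentile point concrete, here is a minimal, naive sketch of reporting percentiles from a list of measured access times (say, in microseconds). It sorts every sample, so a real collector would use an HdrHistogram-style structure instead; the function name is just for illustration:

    %% Naive percentile report over a list of latency samples.
    %% Example: percentiles(Samples, [50, 90, 99, 99.9, 99.99]).
    percentiles(Samples, Ps) ->
        Sorted = lists:sort(Samples),
        N = length(Sorted),
        [{P, lists:nth(max(1, round(P / 100 * N)), Sorted)} || P <- Ps].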

One of the interesting things I've found is that if you plot the above, the conclusions tend to change quite a lot. For instance, the algorithm which is *really* fast in the common case may be *really* slow when it hits the slow path, perhaps so slow it is unusable. But if you report the mean, the system can "hide" the slow query by amortizing it over the fast ones. I don't find this to be fair.

Another takeaway is that improving the 99th percentile tends to improve the latency curve for the system as a whole. ETS is a system in which lookups should not take more than 1-2 microseconds, but that means it should also hold for the 99.99th percentile.

Finally, I have a hunch the {read_concurrency, true} option will have a far greater impact on parallel access to the table if you have a high number of cores. Reporting the mean would allow the system to "hide" that it is stalling one core.

Aside: if you haven't already, your work should have a section describing how the test cases work around the problem of "coordinated omission", in which the load generator coordinates with the system under test and thereby hides request latencies that are really higher than they appear.

Have fun working on the project! Take or leave the above suggestions as you see fit!
