Hello,
We run a large scale WebRTC-based VOIP application, and currently we are evaluating whether or not dirty CPU schedulers can help us get more consistent performance with some critical NIFs. We've created a test that mimics our production load and have run it on AWS EC2 instances of various sizes. When dirty CPU schedulers are working as expected, we do see measurable improvement in performance, but unfortunately we have seen that they can drop to 0 utilization. They do not recover from this state.
The timings from 2 of our 12 NIFs are in the vicinity of the 1 msec rule of thumb, so we have experimented with them both on and off dirty CPU schedulers. We've left the other NIFs on the traditional schedulers. More detailed NIF timings can be found below.
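For reference, a minimal sketch (not our production harness; the module and function names are hypothetical) of how per-call NIF timings like the percentiles below can be gathered from the Erlang side with timer:tc/1, using a simple nearest-rank percentile:

    measure(N, Fun) ->
        %% collect N call durations in microseconds and sort them
        Samples = lists:sort([element(1, timer:tc(Fun)) || _ <- lists:seq(1, N)]),
        Pct = fun(P) -> lists:nth(max(1, round(P * N)), Samples) end,
        #{median => Pct(0.50), p90 => Pct(0.90),
          p99 => Pct(0.99), p999 => Pct(0.999)}.

    %% e.g. measure(10000, fun() -> webrtc_nif:add_data_10ms(Frame) end).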
At the end of this message is the concerning data (from recon:scheduler_usage/1). We have schedulers 1-36 as the traditional schedulers bound to the first hardware thread of each core, and schedulers 37-72 are dirty CPU schedulers. Based on mpstat it appears the dirty CPU schedulers are not bound to threads. After about 10 minutes of running the load test, schedulers 42, 43, 44, 61, 63, 64, 66, 68, 72 went to 0 utilization and stayed there indefinitely. As time goes on, more dirty CPU schedulers enter this 0-utilization state. Schedulers 1-36 do not exhibit this behaviour. We have reproduced this consistently on other EC2 c5 instances of various sizes. We have tried various beam adjustments, but we have seen the same behaviour each time. Some examples of things we have tried:
+sbwt very_long +sbwtdcpu very_long +swt very_low +swtdcpu very_low +sfwi 500
Is there anything else that we can try to avoid this 0-utilization issue? We are happy to provide more data if needed.
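As a workaround for detection, a rough monitoring sketch (assuming recon is loaded and that, as on this node, scheduler_wall_time covers the dirty CPU schedulers after the normal ones) that flags dirty CPU schedulers stuck at exactly 0.0 utilization:

    check_dirty_cpu() ->
        Usage = recon:scheduler_usage(1000),
        Normal = erlang:system_info(schedulers_online),
        DirtyCpu = erlang:system_info(dirty_cpu_schedulers_online),
        %% dirty CPU scheduler ids follow the normal scheduler ids
        Stuck = [Id || {Id, Util} <- Usage,
                       Id > Normal, Id =< Normal + DirtyCpu,
                       Util == 0.0],
        case Stuck of
            []    -> ok;
            _List -> error_logger:warning_msg(
                       "dirty CPU schedulers at 0 utilization: ~p~n", [Stuck])
        end.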
NIF: webrtc_acm_add_data_10ms
median: 606 usec
90 percentile: 1230 usec
99 percentile: 2285 usec
99.9 percentile: 3223 usec

NIF: webrtc_acm_playout_data_10ms
median: 98 usec
90 percentile: 371 usec
99 percentile: 410 usec
99.9 percentile: 449 usec
EC2 instance: c5.18xlarge
OS: Linux 4.4.0-1073-aws #83-Ubuntu SMP Sat Nov 17 00:26:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
beam flags: +P 4194304 +sbt db +SP 50:50 +sub true +A 128 +K true
Erlang banner: Erlang/OTP 21 [erts-10.2] [source] [64-bit] [smp:36:36] [ds:36:36:10] [async-threads:128] [hipe]
cpu_topology: [{node,[{processor,[{core,[{thread,{logical,0}}, {thread,{logical,36}}]}, {core,[{thread,{logical,1}},{thread,{logical,37}}]}, {core,[{thread,{logical,2}},{thread,{logical,38}}]}, {core,[{thread,{logical,3}},{thread,{logical,39}}]}, {core,[{thread,{logical,4}},{thread,{logical,40}}]}, {core,[{thread,{logical,5}},{thread,{logical,41}}]}, {core,[{thread,{logical,6}},{thread,{logical,42}}]}, {core,[{thread,{logical,7}},{thread,{logical,43}}]}, {core,[{thread,{logical,8}},{thread,{logical,44}}]}, {core,[{thread,{logical,9}},{thread,{logical,45}}]}, {core,[{thread,{logical,10}},{thread,{logical,46}}]}, {core,[{thread,{logical,11}},{thread,{logical,47}}]}, {core,[{thread,{logical,12}},{thread,{logical,48}}]}, {core,[{thread,{logical,13}},{thread,{logical,49}}]}, {core,[{thread,{logical,14}},{thread,{logical,50}}]}, {core,[{thread,{logical,15}},{thread,{logical,51}}]}, {core,[{thread,{logical,16}},{thread,{logical,52}}]}, {core,[{thread,{logical,17}},{thread,{logical,53}}]}]}]}, {node,[{processor,[{core,[{thread,{logical,18}}, {thread,{logical,54}}]}, {core,[{thread,{logical,19}},{thread,{logical,55}}]}, {core,[{thread,{logical,20}},{thread,{logical,56}}]}, {core,[{thread,{logical,21}},{thread,{logical,57}}]}, {core,[{thread,{logical,22}},{thread,{logical,58}}]}, {core,[{thread,{logical,23}},{thread,{logical,59}}]}, {core,[{thread,{logical,24}},{thread,{logical,60}}]}, {core,[{thread,{logical,25}},{thread,{logical,61}}]}, {core,[{thread,{logical,26}},{thread,{logical,62}}]}, {core,[{thread,{logical,27}},{thread,{logical,63}}]}, {core,[{thread,{logical,28}},{thread,{logical,64}}]}, {core,[{thread,{logical,29}},{thread,{logical,65}}]}, {core,[{thread,{logical,30}},{thread,{logical,66}}]}, {core,[{thread,{logical,31}},{thread,{logical,67}}]}, {core,[{thread,{logical,32}},{thread,{logical,68}}]}, {core,[{thread,{logical,33}},{thread,{logical,69}}]}, {core,[{thread,{logical,34}},{thread,{logical,70}}]}, {core,[{thread,{logical,35}},{thread,{logical,71}}]}]}]}]
scheduler_bindings: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22, 23,24,25,26,27,28,29,30,31,32,33,34,35}
CPU utilization (from mpstat) during our load test:
Linux 4.4.0-1073-aws (isw-core-x-qa-awsoh02) 01/04/2019 _x86_64_ (72 CPU)
08:55:15 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle 08:55:15 PM all 51.58 0.00 1.64 0.01 0.00 0.42 0.00 0.00 0.00 46.36 08:55:15 PM 0 62.34 0.00 2.69 0.06 0.00 0.00 0.00 0.00 0.00 34.90 08:55:15 PM 1 62.84 0.00 2.58 0.02 0.00 0.00 0.00 0.00 0.00 34.55 08:55:15 PM 2 62.04 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.44 08:55:15 PM 3 62.17 0.00 2.62 0.00 0.00 0.00 0.00 0.00 0.00 35.21 08:55:15 PM 4 62.14 0.00 2.54 0.02 0.00 0.00 0.00 0.00 0.00 35.30 08:55:15 PM 5 62.06 0.00 2.51 0.00 0.00 0.00 0.00 0.00 0.00 35.43 08:55:15 PM 6 61.93 0.00 2.58 0.01 0.00 0.00 0.02 0.00 0.00 35.46 08:55:15 PM 7 61.99 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.48 08:55:15 PM 8 62.01 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.47 08:55:15 PM 9 61.88 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.63 08:55:15 PM 10 62.18 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.34 08:55:15 PM 11 61.89 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.64 08:55:15 PM 12 61.98 0.00 2.50 0.00 0.00 0.00 0.00 0.00 0.00 35.52 08:55:15 PM 13 61.80 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.66 08:55:15 PM 14 61.96 0.00 2.56 0.00 0.00 0.00 0.00 0.00 0.00 35.48 08:55:15 PM 15 62.11 0.00 2.57 0.00 0.00 0.00 0.00 0.00 0.00 35.31 08:55:15 PM 16 62.03 0.00 2.55 0.00 0.00 0.00 0.00 0.00 0.00 35.42 08:55:15 PM 17 61.90 0.00 2.53 0.00 0.00 0.00 0.00 0.00 0.00 35.57 08:55:15 PM 18 58.36 0.00 2.41 0.09 0.00 3.72 0.00 0.00 0.00 35.43 08:55:15 PM 19 58.29 0.00 2.45 0.02 0.00 4.00 0.00 0.00 0.00 35.24 08:55:15 PM 20 58.45 0.00 2.36 0.02 0.00 3.73 0.00 0.00 0.00 35.44 08:55:15 PM 21 58.56 0.00 2.34 0.04 0.00 3.54 0.00 0.00 0.00 35.51 08:55:15 PM 22 58.12 0.00 2.46 0.00 0.00 3.84 0.00 0.00 0.00 35.58 08:55:15 PM 23 58.24 0.00 2.36 0.01 0.00 3.82 0.00 0.00 0.00 35.57 08:55:15 PM 24 58.84 0.00 2.42 0.00 0.00 3.60 0.00 0.00 0.00 35.14 08:55:15 PM 25 58.35 0.00 2.43 0.00 0.00 3.72 0.00 0.00 0.00 35.51 08:55:15 PM 26 62.09 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45 08:55:15 PM 27 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45 08:55:15 PM 28 62.15 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.37 08:55:15 PM 29 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45 08:55:15 PM 30 62.74 0.00 2.45 0.00 0.00 0.00 0.00 0.00 0.00 34.81 08:55:15 PM 31 62.14 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.39 08:55:15 PM 32 62.35 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.13 08:55:15 PM 33 62.13 0.00 2.42 0.00 0.00 0.00 0.00 0.00 0.00 35.46 08:55:15 PM 34 62.08 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45 08:55:15 PM 35 62.15 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.35 08:55:15 PM 36 43.19 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.01 08:55:15 PM 37 43.09 0.00 0.83 0.01 0.00 0.00 0.00 0.00 0.00 56.06 08:55:15 PM 38 42.59 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.63 08:55:15 PM 39 42.60 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.58 08:55:15 PM 40 42.94 0.00 0.76 0.02 0.00 0.00 0.00 0.00 0.00 56.28 08:55:15 PM 41 42.11 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.13 08:55:15 PM 42 43.26 0.00 0.85 0.01 0.00 0.00 0.00 0.00 0.00 55.88 08:55:15 PM 43 42.23 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 56.99 08:55:15 PM 44 42.24 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.95 08:55:15 PM 45 41.65 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.55 08:55:15 PM 46 41.76 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.46 08:55:15 PM 47 42.66 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.52 08:55:15 PM 48 41.83 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.40 08:55:15 PM 49 42.04 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.13 08:55:15 PM 50 41.55 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.67 
08:55:15 PM 51 42.37 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 56.84 08:55:15 PM 52 41.66 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.00 57.57 08:55:15 PM 53 41.77 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.45 08:55:15 PM 54 40.33 0.00 0.88 0.01 0.00 0.00 0.00 0.00 0.00 58.79 08:55:15 PM 55 41.62 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.59 08:55:15 PM 56 41.51 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.67 08:55:15 PM 57 40.35 0.00 0.76 0.00 0.00 0.00 0.00 0.00 0.00 58.89 08:55:15 PM 58 41.92 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.28 08:55:15 PM 59 41.48 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 57.72 08:55:15 PM 60 40.83 0.00 0.75 0.01 0.00 0.00 0.00 0.00 0.00 58.41 08:55:15 PM 61 40.95 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 58.26 08:55:15 PM 62 40.85 0.00 0.80 0.01 0.00 0.00 0.00 0.00 0.00 58.35 08:55:15 PM 63 42.11 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.12 08:55:15 PM 64 42.22 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.99 08:55:15 PM 65 41.55 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 57.67 08:55:15 PM 66 41.68 0.00 0.81 0.00 0.00 0.00 0.00 0.00 0.00 57.51 08:55:15 PM 67 41.06 0.00 0.87 0.01 0.00 0.00 0.00 0.00 0.00 58.07 08:55:15 PM 68 41.80 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.41 08:55:15 PM 69 41.38 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.87 08:55:15 PM 70 42.17 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.08 08:55:15 PM 71 41.26 0.00 0.81 0.00 0.00 0.00 0.01 0.00 0.00 57.92
1> rp(recon:scheduler_usage(1000)). [{1,0.8581586716045433}, {2,0.8607597847736659}, {3,0.8501543449289042}, {4,0.8619718426658023}, {5,0.8660226578705742}, {6,0.8555763933364806}, {7,0.8557195972097537}, {8,0.8668284625654983}, {9,0.858588675506532}, {10,0.8650501369640686}, {11,0.8615557214275474}, {12,0.8523392789683996}, {13,0.8469329348407398}, {14,0.8615637534991641}, {15,0.8591930739986557}, {16,0.8578596164201188}, {17,0.8549696086049403}, {18,0.8677098230685564}, {19,0.8556328340054443}, {20,0.8642542459479063}, {21,0.8670655179064826}, {22,0.8562638308551557}, {23,0.8511202058055239}, {24,0.8697339264798696}, {25,0.8594823462601853}, {26,0.8634550008001265}, {27,0.8568590130172198}, {28,0.864025757952233}, {29,0.8575614770622326}, {30,0.8611760239143281}, {31,0.8589789125669829}, {32,0.8537754468447621}, {33,0.8675538912040871}, {34,0.8607522803519515}, {35,0.8639389099445753}, {36,0.8743759133171144}, {37,0.9526507413773687}, {38,0.948925611917577}, {39,0.9511904612810083}, {40,0.9581015421004746}, {41,0.9534977201467516}, {42,0.0}, {43,0.0}, {44,0.0}, {45,0.9522518499948341}, {46,0.9453306755503674}, {47,0.9586374601140047}, {48,0.9574611845671873}, {49,0.950914304057791}, {50,0.9582124294398642}, {51,0.9448074463900029}, {52,0.9474771999511407}, {53,0.9450778098401229}, {54,0.0}, {55,0.9478167911711937}, {56,0.9612587860764551}, {57,0.943468696358506}, {58,0.9509498642900455}, {59,0.9603595134495414}, {60,0.9517397912369205}, {61,0.0}, {62,0.9587029756970658}, {63,0.0}, {64,0.0}, {65,0.9526167902065421}, {66,0.0}, {67,0.9475668680715805}, {68,0.0}, {69,0.9630301804839606}, {70,0.9475359592879946}, {71,0.9541718747681374}, {72,0.0}]
Thanks for your time,
Jesse Stimpson
Hi,
This sounds like the issue I bumped into.
After some time, many of the dirty schedulers are no longer performing any work.
This turned out to be related to a corrupted list of sleeping dirty schedulers.
I haven’t found what triggered it in my case (maybe a GC?). For me it took approx. 1 day before it happened.
Doing a Multi Scheduler Block can trigger the problem where the list of sleeping dirty schedulers gets corrupted.
After that, the sleepers list may contain only one or a few schedulers. That list is later used to wake up schedulers when there is work to do.
For each enqueued dirty-cpu or dirty-io job, the runtime wakes the first scheduler in the sleepers list.
But if this list only points to a few schedulers, only those are ever woken up to pick jobs from the queue.
Since fixing that specific problem I have not seen the collapse of dirty schedulers anymore.
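To make the failure mode concrete, here is a toy Erlang sketch (an illustration only, not ERTS code; all names are hypothetical) of why wakeups that only ever consult a truncated sleepers list leave the missing schedulers idle forever:

    %% Simplified model: wake the head of the sleepers list for each job;
    %% a woken scheduler is re-appended when it goes back to sleep.
    %% Schedulers absent from the list are never woken, no matter how
    %% many jobs are enqueued.
    stranded(Jobs, Sleepers0, AllSchedulers) ->
        {_Sleepers, Woken} =
            lists:foldl(
              fun(_Job, {[S | Rest], Acc}) -> {Rest ++ [S], [S | Acc]};
                 (_Job, {[], Acc})         -> {[], Acc}
              end,
              {Sleepers0, []},
              lists:seq(1, Jobs)),
        AllSchedulers -- lists:usort(Woken).

    %% stranded(1000, lists:seq(1, 36), lists:seq(1, 36)) -> []
    %% stranded(1000, [1, 2], lists:seq(1, 36))           -> [3, 4, ..., 36]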
It would be great if you can test and see if it fixes your problem as well: https://github.com/erlang/otp/pull/2027
Any feedback on the bug fix is appreciated.
/Jonas
> On 8 Jan 2019, at 17:13, Jesse Stimpson < [hidden email]> wrote:
>
> Hello,
>
> We run a large scale WebRTC-based VOIP application, and currently we are evaluating whether or not dirty CPU schedulers can help us get more consistent performance with some critical NIFs. We've created a test that mimics our production load and have run it on AWS EC2 instances of various sizes. When dirty CPU schedulers are working as expected, we do see measurable improvement in performance, but unfortunately we have seen that they can drop to 0 utilization. They do not recover from this state.
>
> The timings from 2 of our 12 NIFs are in the vicinity of the 1 msec rule of thumb, so we have experimented with them both on and off dirty CPU schedulers. We've left the other NIFs on the traditional schedulers. More detailed NIF timings can be found below.
>
> At the end of this message is the concerning data (from recon:scheduler_usage/1).We have schedulers 1-36 as the traditional schedulers bound to the first hardware thread of each core, and schedulers 37-72 are dirty CPU schedulers. Based on mpstat it appears the dirty CPU schedulers are not bound to threads. After about 10 minutes of running the load test, schedulers 42, 43, 44, 61, 63, 64, 66, 68, 72 went to at 0 utilization and stayed there indefinitely. As time goes on, more dirty CPU schedulers enter this 0 utilization state. Schedulers 1-36 do not exhibit this behaviour. We have reproduced this consistently on other EC2 c5 instances of various sizes.
>
> We have tried various beam adjustments, but we have seen the same behaviour each time. Some examples of things we have tried:
> +sbwt very_long +sbwtdcpu very_long +swt very_low +swtdcpu very_low
> +sfwi 500
>
> Is there anything else that we can try to avoid this 0-utilization issue? We are happy to provide more data if needed.
>
> NIF: webrtc_acm_add_data_10ms
> median: 606 usec
> 90 percentile: 1230 usec
> 99 percentile: 2285 usec
> 99.9 percentile: 3223 usec
>
> NIF: webrtc_acm_playout_data_10ms
> median: 98 usec
> 90 percentile: 371 usec
> 99 percentile: 410 usec
> 99.9 percentile: 449 usec
>
> EC2 instance: c5.18xlarge
>
> OS: Linux 4.4.0-1073-aws #83-Ubuntu SMP Sat Nov 17 00:26:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> beam flags: +P 4194304 +sbt db +SP 50:50 +sub true +A 128 +K true
>
> Erlang banner: Erlang/OTP 21 [erts-10.2] [source] [64-bit] [smp:36:36] [ds:36:36:10] [async-threads:128] [hipe]
>
> cpu_topology:
> [{node,[{processor,[{core,[{thread,{logical,0}},
> {thread,{logical,36}}]},
> {core,[{thread,{logical,1}},{thread,{logical,37}}]},
> {core,[{thread,{logical,2}},{thread,{logical,38}}]},
> {core,[{thread,{logical,3}},{thread,{logical,39}}]},
> {core,[{thread,{logical,4}},{thread,{logical,40}}]},
> {core,[{thread,{logical,5}},{thread,{logical,41}}]},
> {core,[{thread,{logical,6}},{thread,{logical,42}}]},
> {core,[{thread,{logical,7}},{thread,{logical,43}}]},
> {core,[{thread,{logical,8}},{thread,{logical,44}}]},
> {core,[{thread,{logical,9}},{thread,{logical,45}}]},
> {core,[{thread,{logical,10}},{thread,{logical,46}}]},
> {core,[{thread,{logical,11}},{thread,{logical,47}}]},
> {core,[{thread,{logical,12}},{thread,{logical,48}}]},
> {core,[{thread,{logical,13}},{thread,{logical,49}}]},
> {core,[{thread,{logical,14}},{thread,{logical,50}}]},
> {core,[{thread,{logical,15}},{thread,{logical,51}}]},
> {core,[{thread,{logical,16}},{thread,{logical,52}}]},
> {core,[{thread,{logical,17}},{thread,{logical,53}}]}]}]},
> {node,[{processor,[{core,[{thread,{logical,18}},
> {thread,{logical,54}}]},
> {core,[{thread,{logical,19}},{thread,{logical,55}}]},
> {core,[{thread,{logical,20}},{thread,{logical,56}}]},
> {core,[{thread,{logical,21}},{thread,{logical,57}}]},
> {core,[{thread,{logical,22}},{thread,{logical,58}}]},
> {core,[{thread,{logical,23}},{thread,{logical,59}}]},
> {core,[{thread,{logical,24}},{thread,{logical,60}}]},
> {core,[{thread,{logical,25}},{thread,{logical,61}}]},
> {core,[{thread,{logical,26}},{thread,{logical,62}}]},
> {core,[{thread,{logical,27}},{thread,{logical,63}}]},
> {core,[{thread,{logical,28}},{thread,{logical,64}}]},
> {core,[{thread,{logical,29}},{thread,{logical,65}}]},
> {core,[{thread,{logical,30}},{thread,{logical,66}}]},
> {core,[{thread,{logical,31}},{thread,{logical,67}}]},
> {core,[{thread,{logical,32}},{thread,{logical,68}}]},
> {core,[{thread,{logical,33}},{thread,{logical,69}}]},
> {core,[{thread,{logical,34}},{thread,{logical,70}}]},
> {core,[{thread,{logical,35}},{thread,{logical,71}}]}]}]}]
>
> scheduler_bindings: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
> 23,24,25,26,27,28,29,30,31,32,33,34,35}
>
> CPU utilization (from mpstat) during our load test:
> Linux 4.4.0-1073-aws (isw-core-x-qa-awsoh02) 01/04/2019 _x86_64_ (72 CPU)
>
> 08:55:15 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 08:55:15 PM all 51.58 0.00 1.64 0.01 0.00 0.42 0.00 0.00 0.00 46.36
> 08:55:15 PM 0 62.34 0.00 2.69 0.06 0.00 0.00 0.00 0.00 0.00 34.90
> 08:55:15 PM 1 62.84 0.00 2.58 0.02 0.00 0.00 0.00 0.00 0.00 34.55
> 08:55:15 PM 2 62.04 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.44
> 08:55:15 PM 3 62.17 0.00 2.62 0.00 0.00 0.00 0.00 0.00 0.00 35.21
> 08:55:15 PM 4 62.14 0.00 2.54 0.02 0.00 0.00 0.00 0.00 0.00 35.30
> 08:55:15 PM 5 62.06 0.00 2.51 0.00 0.00 0.00 0.00 0.00 0.00 35.43
> 08:55:15 PM 6 61.93 0.00 2.58 0.01 0.00 0.00 0.02 0.00 0.00 35.46
> 08:55:15 PM 7 61.99 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.48
> 08:55:15 PM 8 62.01 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.47
> 08:55:15 PM 9 61.88 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.63
> 08:55:15 PM 10 62.18 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.34
> 08:55:15 PM 11 61.89 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.64
> 08:55:15 PM 12 61.98 0.00 2.50 0.00 0.00 0.00 0.00 0.00 0.00 35.52
> 08:55:15 PM 13 61.80 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.66
> 08:55:15 PM 14 61.96 0.00 2.56 0.00 0.00 0.00 0.00 0.00 0.00 35.48
> 08:55:15 PM 15 62.11 0.00 2.57 0.00 0.00 0.00 0.00 0.00 0.00 35.31
> 08:55:15 PM 16 62.03 0.00 2.55 0.00 0.00 0.00 0.00 0.00 0.00 35.42
> 08:55:15 PM 17 61.90 0.00 2.53 0.00 0.00 0.00 0.00 0.00 0.00 35.57
> 08:55:15 PM 18 58.36 0.00 2.41 0.09 0.00 3.72 0.00 0.00 0.00 35.43
> 08:55:15 PM 19 58.29 0.00 2.45 0.02 0.00 4.00 0.00 0.00 0.00 35.24
> 08:55:15 PM 20 58.45 0.00 2.36 0.02 0.00 3.73 0.00 0.00 0.00 35.44
> 08:55:15 PM 21 58.56 0.00 2.34 0.04 0.00 3.54 0.00 0.00 0.00 35.51
> 08:55:15 PM 22 58.12 0.00 2.46 0.00 0.00 3.84 0.00 0.00 0.00 35.58
> 08:55:15 PM 23 58.24 0.00 2.36 0.01 0.00 3.82 0.00 0.00 0.00 35.57
> 08:55:15 PM 24 58.84 0.00 2.42 0.00 0.00 3.60 0.00 0.00 0.00 35.14
> 08:55:15 PM 25 58.35 0.00 2.43 0.00 0.00 3.72 0.00 0.00 0.00 35.51
> 08:55:15 PM 26 62.09 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 27 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 28 62.15 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.37
> 08:55:15 PM 29 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 30 62.74 0.00 2.45 0.00 0.00 0.00 0.00 0.00 0.00 34.81
> 08:55:15 PM 31 62.14 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.39
> 08:55:15 PM 32 62.35 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.13
> 08:55:15 PM 33 62.13 0.00 2.42 0.00 0.00 0.00 0.00 0.00 0.00 35.46
> 08:55:15 PM 34 62.08 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 35 62.15 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.35
> 08:55:15 PM 36 43.19 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.01
> 08:55:15 PM 37 43.09 0.00 0.83 0.01 0.00 0.00 0.00 0.00 0.00 56.06
> 08:55:15 PM 38 42.59 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.63
> 08:55:15 PM 39 42.60 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.58
> 08:55:15 PM 40 42.94 0.00 0.76 0.02 0.00 0.00 0.00 0.00 0.00 56.28
> 08:55:15 PM 41 42.11 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.13
> 08:55:15 PM 42 43.26 0.00 0.85 0.01 0.00 0.00 0.00 0.00 0.00 55.88
> 08:55:15 PM 43 42.23 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 56.99
> 08:55:15 PM 44 42.24 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.95
> 08:55:15 PM 45 41.65 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.55
> 08:55:15 PM 46 41.76 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.46
> 08:55:15 PM 47 42.66 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.52
> 08:55:15 PM 48 41.83 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.40
> 08:55:15 PM 49 42.04 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.13
> 08:55:15 PM 50 41.55 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 51 42.37 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 56.84
> 08:55:15 PM 52 41.66 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.00 57.57
> 08:55:15 PM 53 41.77 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.45
> 08:55:15 PM 54 40.33 0.00 0.88 0.01 0.00 0.00 0.00 0.00 0.00 58.79
> 08:55:15 PM 55 41.62 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.59
> 08:55:15 PM 56 41.51 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 57 40.35 0.00 0.76 0.00 0.00 0.00 0.00 0.00 0.00 58.89
> 08:55:15 PM 58 41.92 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.28
> 08:55:15 PM 59 41.48 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 57.72
> 08:55:15 PM 60 40.83 0.00 0.75 0.01 0.00 0.00 0.00 0.00 0.00 58.41
> 08:55:15 PM 61 40.95 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 58.26
> 08:55:15 PM 62 40.85 0.00 0.80 0.01 0.00 0.00 0.00 0.00 0.00 58.35
> 08:55:15 PM 63 42.11 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.12
> 08:55:15 PM 64 42.22 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.99
> 08:55:15 PM 65 41.55 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 66 41.68 0.00 0.81 0.00 0.00 0.00 0.00 0.00 0.00 57.51
> 08:55:15 PM 67 41.06 0.00 0.87 0.01 0.00 0.00 0.00 0.00 0.00 58.07
> 08:55:15 PM 68 41.80 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.41
> 08:55:15 PM 69 41.38 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.87
> 08:55:15 PM 70 42.17 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.08
> 08:55:15 PM 71 41.26 0.00 0.81 0.00 0.00 0.00 0.01 0.00 0.00 57.92
>
>
> 1> rp(recon:scheduler_usage(1000)).
> [{1,0.8581586716045433},
> {2,0.8607597847736659},
> {3,0.8501543449289042},
> {4,0.8619718426658023},
> {5,0.8660226578705742},
> {6,0.8555763933364806},
> {7,0.8557195972097537},
> {8,0.8668284625654983},
> {9,0.858588675506532},
> {10,0.8650501369640686},
> {11,0.8615557214275474},
> {12,0.8523392789683996},
> {13,0.8469329348407398},
> {14,0.8615637534991641},
> {15,0.8591930739986557},
> {16,0.8578596164201188},
> {17,0.8549696086049403},
> {18,0.8677098230685564},
> {19,0.8556328340054443},
> {20,0.8642542459479063},
> {21,0.8670655179064826},
> {22,0.8562638308551557},
> {23,0.8511202058055239},
> {24,0.8697339264798696},
> {25,0.8594823462601853},
> {26,0.8634550008001265},
> {27,0.8568590130172198},
> {28,0.864025757952233},
> {29,0.8575614770622326},
> {30,0.8611760239143281},
> {31,0.8589789125669829},
> {32,0.8537754468447621},
> {33,0.8675538912040871},
> {34,0.8607522803519515},
> {35,0.8639389099445753},
> {36,0.8743759133171144},
> {37,0.9526507413773687},
> {38,0.948925611917577},
> {39,0.9511904612810083},
> {40,0.9581015421004746},
> {41,0.9534977201467516},
> {42,0.0},
> {43,0.0},
> {44,0.0},
> {45,0.9522518499948341},
> {46,0.9453306755503674},
> {47,0.9586374601140047},
> {48,0.9574611845671873},
> {49,0.950914304057791},
> {50,0.9582124294398642},
> {51,0.9448074463900029},
> {52,0.9474771999511407},
> {53,0.9450778098401229},
> {54,0.0},
> {55,0.9478167911711937},
> {56,0.9612587860764551},
> {57,0.943468696358506},
> {58,0.9509498642900455},
> {59,0.9603595134495414},
> {60,0.9517397912369205},
> {61,0.0},
> {62,0.9587029756970658},
> {63,0.0},
> {64,0.0},
> {65,0.9526167902065421},
> {66,0.0},
> {67,0.9475668680715805},
> {68,0.0},
> {69,0.9630301804839606},
> {70,0.9475359592879946},
> {71,0.9541718747681374},
> {72,0.0}]
>
>
> Thanks for your time,
>
> Jesse Stimpson
Rikard and Jonas,
Thank you for your response. With the patch at OTP PR-2027, we do observe that the dirty CPU schedulers no longer drop to zero utilization, but we also see that the performance of the traditional schedulers suffers. Specifically, when we apply our load test, scheduler usage toggles between the following two states. I've snipped the output of similar values for readability.
[{1,0.8933020805456553}, ... {36,0.8882068476700309}, {37,0.4932569692588096}, ... {72,0.49842881346116297}]
and
[{1,1.0}, ... {36,1.0}, {37,0.42909590032946493}, ... {72,0.4641339106204525}]
When applying this same load to the un-patched Erlang 21.2, the traditional schedulers are utilized at about 75%. So it appears that the 2027 patch may put undesirable pressure on the schedulers as a trade-off for consistent performance from the dirty CPU schedulers. Are you observing a similar response on your system?
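For anyone reproducing this, a small helper sketch (illustrative only, assuming recon and this box's 36+36 scheduler split) for reducing recon:scheduler_usage/1 output to per-class averages like the figures above:

    summarize(Usage) ->
        Avg = fun(L) -> lists:sum([U || {_, U} <- L]) / length(L) end,
        %% ids 1-36 are normal schedulers on this node, 37-72 dirty CPU
        {Normal, DirtyCpu} =
            lists:partition(fun({Id, _}) -> Id =< 36 end, Usage),
        #{normal_avg => Avg(Normal), dirty_cpu_avg => Avg(DirtyCpu)}.

    %% e.g. summarize(recon:scheduler_usage(1000)).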
Thanks, Jesse

Hi,
This sounds like the issue I bumped into.
After some time, many of the dirty schedulers are not performing any work anymore.
Which was related to a corrupted list of sleeping dirty schedulers.
I haven’t found what triggered it in my case (maybe a GC?). For me it took approx. 1 day before it happened.
Doing Multi Scheduler Block can trigger the problem with the list of sleeping dirty schedulers which gets corrupted.
The sleepers list can contain only one or some schedulers. That list is later used to wake up schedulers when there is work to do.
For each enqueued dirty-cpu or dirty-io job, it wakes the first scheduler in the sleepers list.
But if this list only point to a few schedulers only those are woken up to pick jobs from the queue.
Since fixing that specific problem I have not seen the collapse of dirty schedulers anymore.
It would be great if you can test and see if it fixes your problem as well, https://github.com/erlang/otp/pull/2027
Any feedback on the bug fix is appreciated.
/Jonas
> On 8 Jan 2019, at 17:13, Jesse Stimpson <[hidden email]> wrote:
>
> Hello,
>
> We run a large scale WebRTC-based VOIP application, and currently we are evaluating whether or not dirty CPU schedulers can help us get more consistent performance with some critical NIFs. We've created a test that mimics our production load and have run it on AWS EC2 instances of various sizes. When dirty CPU schedulers are working as expected, we do see measurable improvement in performance, but unfortunately we have seen that they can drop to 0 utilization. They do not recover from this state.
>
> The timings from 2 of our 12 NIFs are in the vicinity of the 1 msec rule of thumb, so we have experimented with them both on and off dirty CPU schedulers. We've left the other NIFs on the traditional schedulers. More detailed NIF timings can be found below.
>
> At the end of this message is the concerning data (from recon:scheduler_usage/1).We have schedulers 1-36 as the traditional schedulers bound to the first hardware thread of each core, and schedulers 37-72 are dirty CPU schedulers. Based on mpstat it appears the dirty CPU schedulers are not bound to threads. After about 10 minutes of running the load test, schedulers 42, 43, 44, 61, 63, 64, 66, 68, 72 went to at 0 utilization and stayed there indefinitely. As time goes on, more dirty CPU schedulers enter this 0 utilization state. Schedulers 1-36 do not exhibit this behaviour. We have reproduced this consistently on other EC2 c5 instances of various sizes.
>
> We have tried various beam adjustments, but we have seen the same behaviour each time. Some examples of things we have tried:
> +sbwt very_long +sbwtdcpu very_long +swt very_low +swtdcpu very_low
> +sfwi 500
>
> Is there anything else that we can try to avoid this 0-utilization issue? We are happy to provide more data if needed.
>
> NIF: webrtc_acm_add_data_10ms
> median: 606 usec
> 90 percentile: 1230 usec
> 99 percentile: 2285 usec
> 99.9 percentile: 3223 usec
>
> NIF: webrtc_acm_playout_data_10ms
> median: 98 usec
> 90 percentile: 371 usec
> 99 percentile: 410 usec
> 99.9 percentile: 449 usec
>
> EC2 instance: c5.18xlarge
>
> OS: Linux 4.4.0-1073-aws #83-Ubuntu SMP Sat Nov 17 00:26:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> beam flags: +P 4194304 +sbt db +SP 50:50 +sub true +A 128 +K true
>
> Erlang banner: Erlang/OTP 21 [erts-10.2] [source] [64-bit] [smp:36:36] [ds:36:36:10] [async-threads:128] [hipe]
>
> cpu_topology:
> [{node,[{processor,[{core,[{thread,{logical,0}},
> {thread,{logical,36}}]},
> {core,[{thread,{logical,1}},{thread,{logical,37}}]},
> {core,[{thread,{logical,2}},{thread,{logical,38}}]},
> {core,[{thread,{logical,3}},{thread,{logical,39}}]},
> {core,[{thread,{logical,4}},{thread,{logical,40}}]},
> {core,[{thread,{logical,5}},{thread,{logical,41}}]},
> {core,[{thread,{logical,6}},{thread,{logical,42}}]},
> {core,[{thread,{logical,7}},{thread,{logical,43}}]},
> {core,[{thread,{logical,8}},{thread,{logical,44}}]},
> {core,[{thread,{logical,9}},{thread,{logical,45}}]},
> {core,[{thread,{logical,10}},{thread,{logical,46}}]},
> {core,[{thread,{logical,11}},{thread,{logical,47}}]},
> {core,[{thread,{logical,12}},{thread,{logical,48}}]},
> {core,[{thread,{logical,13}},{thread,{logical,49}}]},
> {core,[{thread,{logical,14}},{thread,{logical,50}}]},
> {core,[{thread,{logical,15}},{thread,{logical,51}}]},
> {core,[{thread,{logical,16}},{thread,{logical,52}}]},
> {core,[{thread,{logical,17}},{thread,{logical,53}}]}]}]},
> {node,[{processor,[{core,[{thread,{logical,18}},
> {thread,{logical,54}}]},
> {core,[{thread,{logical,19}},{thread,{logical,55}}]},
> {core,[{thread,{logical,20}},{thread,{logical,56}}]},
> {core,[{thread,{logical,21}},{thread,{logical,57}}]},
> {core,[{thread,{logical,22}},{thread,{logical,58}}]},
> {core,[{thread,{logical,23}},{thread,{logical,59}}]},
> {core,[{thread,{logical,24}},{thread,{logical,60}}]},
> {core,[{thread,{logical,25}},{thread,{logical,61}}]},
> {core,[{thread,{logical,26}},{thread,{logical,62}}]},
> {core,[{thread,{logical,27}},{thread,{logical,63}}]},
> {core,[{thread,{logical,28}},{thread,{logical,64}}]},
> {core,[{thread,{logical,29}},{thread,{logical,65}}]},
> {core,[{thread,{logical,30}},{thread,{logical,66}}]},
> {core,[{thread,{logical,31}},{thread,{logical,67}}]},
> {core,[{thread,{logical,32}},{thread,{logical,68}}]},
> {core,[{thread,{logical,33}},{thread,{logical,69}}]},
> {core,[{thread,{logical,34}},{thread,{logical,70}}]},
> {core,[{thread,{logical,35}},{thread,{logical,71}}]}]}]}]
>
> scheduler_bindings: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
> 23,24,25,26,27,28,29,30,31,32,33,34,35}
>
> CPU utilization (from mpstat) during our load test:
> Linux 4.4.0-1073-aws (isw-core-x-qa-awsoh02) 01/04/2019 _x86_64_ (72 CPU)
>
> 08:55:15 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 08:55:15 PM all 51.58 0.00 1.64 0.01 0.00 0.42 0.00 0.00 0.00 46.36
> 08:55:15 PM 0 62.34 0.00 2.69 0.06 0.00 0.00 0.00 0.00 0.00 34.90
> 08:55:15 PM 1 62.84 0.00 2.58 0.02 0.00 0.00 0.00 0.00 0.00 34.55
> 08:55:15 PM 2 62.04 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.44
> 08:55:15 PM 3 62.17 0.00 2.62 0.00 0.00 0.00 0.00 0.00 0.00 35.21
> 08:55:15 PM 4 62.14 0.00 2.54 0.02 0.00 0.00 0.00 0.00 0.00 35.30
> 08:55:15 PM 5 62.06 0.00 2.51 0.00 0.00 0.00 0.00 0.00 0.00 35.43
> 08:55:15 PM 6 61.93 0.00 2.58 0.01 0.00 0.00 0.02 0.00 0.00 35.46
> 08:55:15 PM 7 61.99 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.48
> 08:55:15 PM 8 62.01 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.47
> 08:55:15 PM 9 61.88 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.63
> 08:55:15 PM 10 62.18 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.34
> 08:55:15 PM 11 61.89 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.64
> 08:55:15 PM 12 61.98 0.00 2.50 0.00 0.00 0.00 0.00 0.00 0.00 35.52
> 08:55:15 PM 13 61.80 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.66
> 08:55:15 PM 14 61.96 0.00 2.56 0.00 0.00 0.00 0.00 0.00 0.00 35.48
> 08:55:15 PM 15 62.11 0.00 2.57 0.00 0.00 0.00 0.00 0.00 0.00 35.31
> 08:55:15 PM 16 62.03 0.00 2.55 0.00 0.00 0.00 0.00 0.00 0.00 35.42
> 08:55:15 PM 17 61.90 0.00 2.53 0.00 0.00 0.00 0.00 0.00 0.00 35.57
> 08:55:15 PM 18 58.36 0.00 2.41 0.09 0.00 3.72 0.00 0.00 0.00 35.43
> 08:55:15 PM 19 58.29 0.00 2.45 0.02 0.00 4.00 0.00 0.00 0.00 35.24
> 08:55:15 PM 20 58.45 0.00 2.36 0.02 0.00 3.73 0.00 0.00 0.00 35.44
> 08:55:15 PM 21 58.56 0.00 2.34 0.04 0.00 3.54 0.00 0.00 0.00 35.51
> 08:55:15 PM 22 58.12 0.00 2.46 0.00 0.00 3.84 0.00 0.00 0.00 35.58
> 08:55:15 PM 23 58.24 0.00 2.36 0.01 0.00 3.82 0.00 0.00 0.00 35.57
> 08:55:15 PM 24 58.84 0.00 2.42 0.00 0.00 3.60 0.00 0.00 0.00 35.14
> 08:55:15 PM 25 58.35 0.00 2.43 0.00 0.00 3.72 0.00 0.00 0.00 35.51
> 08:55:15 PM 26 62.09 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 27 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 28 62.15 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.37
> 08:55:15 PM 29 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 30 62.74 0.00 2.45 0.00 0.00 0.00 0.00 0.00 0.00 34.81
> 08:55:15 PM 31 62.14 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.39
> 08:55:15 PM 32 62.35 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.13
> 08:55:15 PM 33 62.13 0.00 2.42 0.00 0.00 0.00 0.00 0.00 0.00 35.46
> 08:55:15 PM 34 62.08 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 35 62.15 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.35
> 08:55:15 PM 36 43.19 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.01
> 08:55:15 PM 37 43.09 0.00 0.83 0.01 0.00 0.00 0.00 0.00 0.00 56.06
> 08:55:15 PM 38 42.59 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.63
> 08:55:15 PM 39 42.60 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.58
> 08:55:15 PM 40 42.94 0.00 0.76 0.02 0.00 0.00 0.00 0.00 0.00 56.28
> 08:55:15 PM 41 42.11 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.13
> 08:55:15 PM 42 43.26 0.00 0.85 0.01 0.00 0.00 0.00 0.00 0.00 55.88
> 08:55:15 PM 43 42.23 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 56.99
> 08:55:15 PM 44 42.24 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.95
> 08:55:15 PM 45 41.65 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.55
> 08:55:15 PM 46 41.76 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.46
> 08:55:15 PM 47 42.66 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.52
> 08:55:15 PM 48 41.83 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.40
> 08:55:15 PM 49 42.04 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.13
> 08:55:15 PM 50 41.55 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 51 42.37 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 56.84
> 08:55:15 PM 52 41.66 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.00 57.57
> 08:55:15 PM 53 41.77 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.45
> 08:55:15 PM 54 40.33 0.00 0.88 0.01 0.00 0.00 0.00 0.00 0.00 58.79
> 08:55:15 PM 55 41.62 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.59
> 08:55:15 PM 56 41.51 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 57 40.35 0.00 0.76 0.00 0.00 0.00 0.00 0.00 0.00 58.89
> 08:55:15 PM 58 41.92 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.28
> 08:55:15 PM 59 41.48 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 57.72
> 08:55:15 PM 60 40.83 0.00 0.75 0.01 0.00 0.00 0.00 0.00 0.00 58.41
> 08:55:15 PM 61 40.95 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 58.26
> 08:55:15 PM 62 40.85 0.00 0.80 0.01 0.00 0.00 0.00 0.00 0.00 58.35
> 08:55:15 PM 63 42.11 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.12
> 08:55:15 PM 64 42.22 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.99
> 08:55:15 PM 65 41.55 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 66 41.68 0.00 0.81 0.00 0.00 0.00 0.00 0.00 0.00 57.51
> 08:55:15 PM 67 41.06 0.00 0.87 0.01 0.00 0.00 0.00 0.00 0.00 58.07
> 08:55:15 PM 68 41.80 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.41
> 08:55:15 PM 69 41.38 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.87
> 08:55:15 PM 70 42.17 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.08
> 08:55:15 PM 71 41.26 0.00 0.81 0.00 0.00 0.00 0.01 0.00 0.00 57.92
>
>
> 1> rp(recon:scheduler_usage(1000)).
> [{1,0.8581586716045433},
> {2,0.8607597847736659},
> {3,0.8501543449289042},
> {4,0.8619718426658023},
> {5,0.8660226578705742},
> {6,0.8555763933364806},
> {7,0.8557195972097537},
> {8,0.8668284625654983},
> {9,0.858588675506532},
> {10,0.8650501369640686},
> {11,0.8615557214275474},
> {12,0.8523392789683996},
> {13,0.8469329348407398},
> {14,0.8615637534991641},
> {15,0.8591930739986557},
> {16,0.8578596164201188},
> {17,0.8549696086049403},
> {18,0.8677098230685564},
> {19,0.8556328340054443},
> {20,0.8642542459479063},
> {21,0.8670655179064826},
> {22,0.8562638308551557},
> {23,0.8511202058055239},
> {24,0.8697339264798696},
> {25,0.8594823462601853},
> {26,0.8634550008001265},
> {27,0.8568590130172198},
> {28,0.864025757952233},
> {29,0.8575614770622326},
> {30,0.8611760239143281},
> {31,0.8589789125669829},
> {32,0.8537754468447621},
> {33,0.8675538912040871},
> {34,0.8607522803519515},
> {35,0.8639389099445753},
> {36,0.8743759133171144},
> {37,0.9526507413773687},
> {38,0.948925611917577},
> {39,0.9511904612810083},
> {40,0.9581015421004746},
> {41,0.9534977201467516},
> {42,0.0},
> {43,0.0},
> {44,0.0},
> {45,0.9522518499948341},
> {46,0.9453306755503674},
> {47,0.9586374601140047},
> {48,0.9574611845671873},
> {49,0.950914304057791},
> {50,0.9582124294398642},
> {51,0.9448074463900029},
> {52,0.9474771999511407},
> {53,0.9450778098401229},
> {54,0.0},
> {55,0.9478167911711937},
> {56,0.9612587860764551},
> {57,0.943468696358506},
> {58,0.9509498642900455},
> {59,0.9603595134495414},
> {60,0.9517397912369205},
> {61,0.0},
> {62,0.9587029756970658},
> {63,0.0},
> {64,0.0},
> {65,0.9526167902065421},
> {66,0.0},
> {67,0.9475668680715805},
> {68,0.0},
> {69,0.9630301804839606},
> {70,0.9475359592879946},
> {71,0.9541718747681374},
> {72,0.0}]
>
>
> Thanks for your time,
>
> Jesse Stimpson
-- 
Jesse Stimpson | Site Reliability Engineering | m: 9199950424 | RepublicWireless.com
Hi Jesse, we have not observed the behaviour you see. Can you share microstate accounting stats for the different cases? Maybe that could shed some light on what the schedulers are doing?
1> msacc:start(1000), msacc:print().
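For comparing runs, one possible workflow (a sketch assuming runtime_tools' msacc; the file name is arbitrary) is to dump each snapshot to disk and re-print it later, optionally as a system-wide summary:

    msacc:start(1000),
    msacc:to_file("/tmp/msacc_during_load.stats"),
    %% later, e.g. after applying the patch or on another node:
    Stats = msacc:from_file("/tmp/msacc_during_load.stats"),
    msacc:print(Stats, #{system => true}).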
And maybe also running perf top as root on one of the scheduler threads that are fully utilised can be interesting:
# perf top --tid <traditional scheduler thread id>
/Jonas

On Wed, Jan 9, 2019 at 10:52 PM Jesse Stimpson <[hidden email]> wrote:

Rikard and Jonas,
Thank you for your response. With the patch at OTP PR-2027, we do observe that the dirty CPU schedulers do not drop to zero utilization, but in addition we see that the performance of the traditional schedulers suffers. Specifically, when we apply our load test, scheduler usage toggles in between the following two states. I've snipped the output of similar values for readability.
[{1,0.8933020805456553}, ... {36,0.8882068476700309}, {37,0.4932569692588096}, ... {72,0.49842881346116297}]
and
[{1,1.0}, ... {36,1.0}, {37,0.42909590032946493}, ... {72,0.4641339106204525}]
When applying this same load to the un-patched Erlang 21.2, the traditional schedulers are utilized at about 75%. So it appears that the 2027 patch may put undesirable pressure on the schedulers as a trade-off for consistent performance from the dirty CPU schedulers. Are you observing a similar response on your system?
Thanks, Jesse
Hi,
This sounds like the issue I bumped into.
After some time, many of the dirty schedulers are not performing any work anymore.
Which was related to a corrupted list of sleeping dirty schedulers.
I haven’t found what triggered it in my case (maybe a GC?). For me it took approx. 1 day before it happened.
Doing Multi Scheduler Block can trigger the problem with the list of sleeping dirty schedulers which gets corrupted.
The sleepers list can contain only one or some schedulers. That list is later used to wake up schedulers when there is work to do.
For each enqueued dirty-cpu or dirty-io job, it wakes the first scheduler in the sleepers list.
But if this list only point to a few schedulers only those are woken up to pick jobs from the queue.
Since fixing that specific problem I have not seen the collapse of dirty schedulers anymore.
It would be great if you can test and see if it fixes your problem as well, https://github.com/erlang/otp/pull/2027
Any feedback on the bug fix is appreciated.
/Jonas
> On 8 Jan 2019, at 17:13, Jesse Stimpson <[hidden email]> wrote:
>
> Hello,
>
> We run a large scale WebRTC-based VOIP application, and currently we are evaluating whether or not dirty CPU schedulers can help us get more consistent performance with some critical NIFs. We've created a test that mimics our production load and have run it on AWS EC2 instances of various sizes. When dirty CPU schedulers are working as expected, we do see measurable improvement in performance, but unfortunately we have seen that they can drop to 0 utilization. They do not recover from this state.
>
> The timings from 2 of our 12 NIFs are in the vicinity of the 1 msec rule of thumb, so we have experimented with them both on and off dirty CPU schedulers. We've left the other NIFs on the traditional schedulers. More detailed NIF timings can be found below.
>
> At the end of this message is the concerning data (from recon:scheduler_usage/1).We have schedulers 1-36 as the traditional schedulers bound to the first hardware thread of each core, and schedulers 37-72 are dirty CPU schedulers. Based on mpstat it appears the dirty CPU schedulers are not bound to threads. After about 10 minutes of running the load test, schedulers 42, 43, 44, 61, 63, 64, 66, 68, 72 went to at 0 utilization and stayed there indefinitely. As time goes on, more dirty CPU schedulers enter this 0 utilization state. Schedulers 1-36 do not exhibit this behaviour. We have reproduced this consistently on other EC2 c5 instances of various sizes.
>
> We have tried various beam adjustments, but we have seen the same behaviour each time. Some examples of things we have tried:
> +sbwt very_long +sbwtdcpu very_long +swt very_low +swtdcpu very_low
> +sfwi 500
>
> Is there anything else that we can try to avoid this 0-utilization issue? We are happy to provide more data if needed.
>
> NIF: webrtc_acm_add_data_10ms
> median: 606 usec
> 90 percentile: 1230 usec
> 99 percentile: 2285 usec
> 99.9 percentile: 3223 usec
>
> NIF: webrtc_acm_playout_data_10ms
> median: 98 usec
> 90 percentile: 371 usec
> 99 percentile: 410 usec
> 99.9 percentile: 449 usec
>
> EC2 instance: c5.18xlarge
>
> OS: Linux 4.4.0-1073-aws #83-Ubuntu SMP Sat Nov 17 00:26:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> beam flags: +P 4194304 +sbt db +SP 50:50 +sub true +A 128 +K true
>
> Erlang banner: Erlang/OTP 21 [erts-10.2] [source] [64-bit] [smp:36:36] [ds:36:36:10] [async-threads:128] [hipe]
>
> cpu_topology:
> [{node,[{processor,[{core,[{thread,{logical,0}},
> {thread,{logical,36}}]},
> {core,[{thread,{logical,1}},{thread,{logical,37}}]},
> {core,[{thread,{logical,2}},{thread,{logical,38}}]},
> {core,[{thread,{logical,3}},{thread,{logical,39}}]},
> {core,[{thread,{logical,4}},{thread,{logical,40}}]},
> {core,[{thread,{logical,5}},{thread,{logical,41}}]},
> {core,[{thread,{logical,6}},{thread,{logical,42}}]},
> {core,[{thread,{logical,7}},{thread,{logical,43}}]},
> {core,[{thread,{logical,8}},{thread,{logical,44}}]},
> {core,[{thread,{logical,9}},{thread,{logical,45}}]},
> {core,[{thread,{logical,10}},{thread,{logical,46}}]},
> {core,[{thread,{logical,11}},{thread,{logical,47}}]},
> {core,[{thread,{logical,12}},{thread,{logical,48}}]},
> {core,[{thread,{logical,13}},{thread,{logical,49}}]},
> {core,[{thread,{logical,14}},{thread,{logical,50}}]},
> {core,[{thread,{logical,15}},{thread,{logical,51}}]},
> {core,[{thread,{logical,16}},{thread,{logical,52}}]},
> {core,[{thread,{logical,17}},{thread,{logical,53}}]}]}]},
> {node,[{processor,[{core,[{thread,{logical,18}},
> {thread,{logical,54}}]},
> {core,[{thread,{logical,19}},{thread,{logical,55}}]},
> {core,[{thread,{logical,20}},{thread,{logical,56}}]},
> {core,[{thread,{logical,21}},{thread,{logical,57}}]},
> {core,[{thread,{logical,22}},{thread,{logical,58}}]},
> {core,[{thread,{logical,23}},{thread,{logical,59}}]},
> {core,[{thread,{logical,24}},{thread,{logical,60}}]},
> {core,[{thread,{logical,25}},{thread,{logical,61}}]},
> {core,[{thread,{logical,26}},{thread,{logical,62}}]},
> {core,[{thread,{logical,27}},{thread,{logical,63}}]},
> {core,[{thread,{logical,28}},{thread,{logical,64}}]},
> {core,[{thread,{logical,29}},{thread,{logical,65}}]},
> {core,[{thread,{logical,30}},{thread,{logical,66}}]},
> {core,[{thread,{logical,31}},{thread,{logical,67}}]},
> {core,[{thread,{logical,32}},{thread,{logical,68}}]},
> {core,[{thread,{logical,33}},{thread,{logical,69}}]},
> {core,[{thread,{logical,34}},{thread,{logical,70}}]},
> {core,[{thread,{logical,35}},{thread,{logical,71}}]}]}]}]
>
> scheduler_bindings: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
> 23,24,25,26,27,28,29,30,31,32,33,34,35}
>
> CPU utilization (from mpstat) during our load test:
> Linux 4.4.0-1073-aws (isw-core-x-qa-awsoh02) 01/04/2019 _x86_64_ (72 CPU)
>
> 08:55:15 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 08:55:15 PM all 51.58 0.00 1.64 0.01 0.00 0.42 0.00 0.00 0.00 46.36
> 08:55:15 PM 0 62.34 0.00 2.69 0.06 0.00 0.00 0.00 0.00 0.00 34.90
> 08:55:15 PM 1 62.84 0.00 2.58 0.02 0.00 0.00 0.00 0.00 0.00 34.55
> 08:55:15 PM 2 62.04 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.44
> 08:55:15 PM 3 62.17 0.00 2.62 0.00 0.00 0.00 0.00 0.00 0.00 35.21
> 08:55:15 PM 4 62.14 0.00 2.54 0.02 0.00 0.00 0.00 0.00 0.00 35.30
> 08:55:15 PM 5 62.06 0.00 2.51 0.00 0.00 0.00 0.00 0.00 0.00 35.43
> 08:55:15 PM 6 61.93 0.00 2.58 0.01 0.00 0.00 0.02 0.00 0.00 35.46
> 08:55:15 PM 7 61.99 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.48
> 08:55:15 PM 8 62.01 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.47
> 08:55:15 PM 9 61.88 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.63
> 08:55:15 PM 10 62.18 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.34
> 08:55:15 PM 11 61.89 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.64
> 08:55:15 PM 12 61.98 0.00 2.50 0.00 0.00 0.00 0.00 0.00 0.00 35.52
> 08:55:15 PM 13 61.80 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.66
> 08:55:15 PM 14 61.96 0.00 2.56 0.00 0.00 0.00 0.00 0.00 0.00 35.48
> 08:55:15 PM 15 62.11 0.00 2.57 0.00 0.00 0.00 0.00 0.00 0.00 35.31
> 08:55:15 PM 16 62.03 0.00 2.55 0.00 0.00 0.00 0.00 0.00 0.00 35.42
> 08:55:15 PM 17 61.90 0.00 2.53 0.00 0.00 0.00 0.00 0.00 0.00 35.57
> 08:55:15 PM 18 58.36 0.00 2.41 0.09 0.00 3.72 0.00 0.00 0.00 35.43
> 08:55:15 PM 19 58.29 0.00 2.45 0.02 0.00 4.00 0.00 0.00 0.00 35.24
> 08:55:15 PM 20 58.45 0.00 2.36 0.02 0.00 3.73 0.00 0.00 0.00 35.44
> 08:55:15 PM 21 58.56 0.00 2.34 0.04 0.00 3.54 0.00 0.00 0.00 35.51
> 08:55:15 PM 22 58.12 0.00 2.46 0.00 0.00 3.84 0.00 0.00 0.00 35.58
> 08:55:15 PM 23 58.24 0.00 2.36 0.01 0.00 3.82 0.00 0.00 0.00 35.57
> 08:55:15 PM 24 58.84 0.00 2.42 0.00 0.00 3.60 0.00 0.00 0.00 35.14
> 08:55:15 PM 25 58.35 0.00 2.43 0.00 0.00 3.72 0.00 0.00 0.00 35.51
> 08:55:15 PM 26 62.09 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 27 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 28 62.15 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.37
> 08:55:15 PM 29 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 30 62.74 0.00 2.45 0.00 0.00 0.00 0.00 0.00 0.00 34.81
> 08:55:15 PM 31 62.14 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.39
> 08:55:15 PM 32 62.35 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.13
> 08:55:15 PM 33 62.13 0.00 2.42 0.00 0.00 0.00 0.00 0.00 0.00 35.46
> 08:55:15 PM 34 62.08 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 35 62.15 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.35
> 08:55:15 PM 36 43.19 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.01
> 08:55:15 PM 37 43.09 0.00 0.83 0.01 0.00 0.00 0.00 0.00 0.00 56.06
> 08:55:15 PM 38 42.59 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.63
> 08:55:15 PM 39 42.60 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.58
> 08:55:15 PM 40 42.94 0.00 0.76 0.02 0.00 0.00 0.00 0.00 0.00 56.28
> 08:55:15 PM 41 42.11 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.13
> 08:55:15 PM 42 43.26 0.00 0.85 0.01 0.00 0.00 0.00 0.00 0.00 55.88
> 08:55:15 PM 43 42.23 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 56.99
> 08:55:15 PM 44 42.24 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.95
> 08:55:15 PM 45 41.65 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.55
> 08:55:15 PM 46 41.76 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.46
> 08:55:15 PM 47 42.66 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.52
> 08:55:15 PM 48 41.83 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.40
> 08:55:15 PM 49 42.04 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.13
> 08:55:15 PM 50 41.55 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 51 42.37 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 56.84
> 08:55:15 PM 52 41.66 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.00 57.57
> 08:55:15 PM 53 41.77 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.45
> 08:55:15 PM 54 40.33 0.00 0.88 0.01 0.00 0.00 0.00 0.00 0.00 58.79
> 08:55:15 PM 55 41.62 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.59
> 08:55:15 PM 56 41.51 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 57 40.35 0.00 0.76 0.00 0.00 0.00 0.00 0.00 0.00 58.89
> 08:55:15 PM 58 41.92 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.28
> 08:55:15 PM 59 41.48 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 57.72
> 08:55:15 PM 60 40.83 0.00 0.75 0.01 0.00 0.00 0.00 0.00 0.00 58.41
> 08:55:15 PM 61 40.95 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 58.26
> 08:55:15 PM 62 40.85 0.00 0.80 0.01 0.00 0.00 0.00 0.00 0.00 58.35
> 08:55:15 PM 63 42.11 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.12
> 08:55:15 PM 64 42.22 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.99
> 08:55:15 PM 65 41.55 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 66 41.68 0.00 0.81 0.00 0.00 0.00 0.00 0.00 0.00 57.51
> 08:55:15 PM 67 41.06 0.00 0.87 0.01 0.00 0.00 0.00 0.00 0.00 58.07
> 08:55:15 PM 68 41.80 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.41
> 08:55:15 PM 69 41.38 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.87
> 08:55:15 PM 70 42.17 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.08
> 08:55:15 PM 71 41.26 0.00 0.81 0.00 0.00 0.00 0.01 0.00 0.00 57.92
>
>
> 1> rp(recon:scheduler_usage(1000)).
> [{1,0.8581586716045433},
> {2,0.8607597847736659},
> {3,0.8501543449289042},
> {4,0.8619718426658023},
> {5,0.8660226578705742},
> {6,0.8555763933364806},
> {7,0.8557195972097537},
> {8,0.8668284625654983},
> {9,0.858588675506532},
> {10,0.8650501369640686},
> {11,0.8615557214275474},
> {12,0.8523392789683996},
> {13,0.8469329348407398},
> {14,0.8615637534991641},
> {15,0.8591930739986557},
> {16,0.8578596164201188},
> {17,0.8549696086049403},
> {18,0.8677098230685564},
> {19,0.8556328340054443},
> {20,0.8642542459479063},
> {21,0.8670655179064826},
> {22,0.8562638308551557},
> {23,0.8511202058055239},
> {24,0.8697339264798696},
> {25,0.8594823462601853},
> {26,0.8634550008001265},
> {27,0.8568590130172198},
> {28,0.864025757952233},
> {29,0.8575614770622326},
> {30,0.8611760239143281},
> {31,0.8589789125669829},
> {32,0.8537754468447621},
> {33,0.8675538912040871},
> {34,0.8607522803519515},
> {35,0.8639389099445753},
> {36,0.8743759133171144},
> {37,0.9526507413773687},
> {38,0.948925611917577},
> {39,0.9511904612810083},
> {40,0.9581015421004746},
> {41,0.9534977201467516},
> {42,0.0},
> {43,0.0},
> {44,0.0},
> {45,0.9522518499948341},
> {46,0.9453306755503674},
> {47,0.9586374601140047},
> {48,0.9574611845671873},
> {49,0.950914304057791},
> {50,0.9582124294398642},
> {51,0.9448074463900029},
> {52,0.9474771999511407},
> {53,0.9450778098401229},
> {54,0.0},
> {55,0.9478167911711937},
> {56,0.9612587860764551},
> {57,0.943468696358506},
> {58,0.9509498642900455},
> {59,0.9603595134495414},
> {60,0.9517397912369205},
> {61,0.0},
> {62,0.9587029756970658},
> {63,0.0},
> {64,0.0},
> {65,0.9526167902065421},
> {66,0.0},
> {67,0.9475668680715805},
> {68,0.0},
> {69,0.9630301804839606},
> {70,0.9475359592879946},
> {71,0.9541718747681374},
> {72,0.0}]
>
>
> Thanks for your time,
>
> Jesse Stimpson
--

Jesse Stimpson | Site Reliability Engineering | m: 9199950424 | RepublicWireless.com
Jonas,
Below is an excerpt from msacc for one scheduler while the system is under moderate load (not saturated as in my previous data). On a periodic basis, the schedulers under the 2027 patch have a significantly higher utilization in the emulator column. At times I saw this spike occur every 5 minutes on the nose. Other times it was less predictable. But it typically occurs on the order of minutes (3 - 7 minutes). The full output is included as well.
Thread         aux   check_io  emulator    gc    other   port   sleep
scheduler(36)  0.68%   0.22%    35.16%   0.55%  55.38%  1.35%   6.66%
scheduler(36)  0.39%   0.08%    76.21%   0.48%  19.52%  1.03%   2.29%
I've not been able to get the thread id for a scheduler yet to run perf. There is a past mailing list thread suggesting use of "ps -Leo pid,tid,comm | grep beam", but this is not returning the thread ids on my current system.
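To catch the periodic spike without sitting at the shell, something like the following loop could work (a sketch only; the interval, count, and path are assumptions):

    periodic_msacc(0) ->
        ok;
    periodic_msacc(Count) ->
        %% sample a 1 s window, dump it to a timestamped file, wait a minute
        msacc:start(1000),
        File = io_lib:format("/tmp/msacc_~p.stats", [erlang:system_time(second)]),
        msacc:to_file(lists:flatten(File)),
        timer:sleep(60000),
        periodic_msacc(Count - 1).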
Average thread real-time : 1000868 us
Accumulated system run-time : 52140693 us
Average scheduler run-time : 929256 us

Thread    aux    check_io    emulator    gc    other    port    sleep
Stats per thread: async( 0) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async( 1) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async( 2) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async( 3) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async( 4) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async( 5) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async( 6) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async( 7) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async( 8) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async( 9) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(10) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(11) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(12) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(13) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(14) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(15) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(16) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(17) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(18) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(19) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(20) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(21) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(22) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(23) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(24) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(25) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(26) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(27) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(28) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(29) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(30) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(31) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(32) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(33) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(34) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(35) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(36) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(37) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(38) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(39) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(40) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(41) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(42) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(43) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(44) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(45) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(46) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(47) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(48) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(49) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(50) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(51) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(52) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(53) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(54) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(55) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(56) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(57) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(58) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(59) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(60) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(61) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(62) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(63) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(64) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(65) 0.00% 0.00% 
0.00% 0.00% 0.00% 0.00% 100.00% async(66) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(67) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(68) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(69) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(70) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(71) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(72) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(73) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(74) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(75) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(76) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(77) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(78) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(79) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(80) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(81) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(82) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(83) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(84) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(85) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(86) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(87) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(88) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(89) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(90) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(91) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(92) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(93) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(94) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(95) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(96) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(97) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(98) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(99) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% async(**) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% aux( 1) 3.61% 1.11% 0.00% 0.00% 0.03% 0.00% 95.24% dirty_cpu_( 1) 0.00% 0.00% 35.00% 0.00% 19.49% 0.00% 45.51% dirty_cpu_( 2) 0.00% 0.00% 34.84% 0.00% 18.28% 0.00% 46.89% 
dirty_cpu_( 3) 0.00% 0.00% 35.72% 0.00% 19.16% 0.00% 45.11% dirty_cpu_( 4) 0.00% 0.00% 30.97% 0.00% 16.72% 0.00% 52.30% dirty_cpu_( 5) 0.00% 0.00% 30.20% 0.00% 18.21% 0.00% 51.58% dirty_cpu_( 6) 0.00% 0.00% 33.56% 0.00% 17.62% 0.00% 48.82% dirty_cpu_( 7) 0.00% 0.00% 33.94% 0.00% 17.94% 0.00% 48.12% dirty_cpu_( 8) 0.00% 0.00% 32.99% 0.00% 18.24% 0.00% 48.77% dirty_cpu_( 9) 0.00% 0.00% 31.68% 0.00% 17.47% 0.00% 50.86% dirty_cpu_(10) 0.00% 0.00% 34.06% 0.00% 19.18% 0.00% 46.75% dirty_cpu_(11) 0.00% 0.00% 33.91% 0.00% 17.87% 0.00% 48.23% dirty_cpu_(12) 0.00% 0.00% 30.02% 0.00% 19.52% 0.00% 50.46% dirty_cpu_(13) 0.00% 0.00% 32.33% 0.00% 18.73% 0.00% 48.94% dirty_cpu_(14) 0.00% 0.00% 36.54% 0.00% 19.58% 0.00% 43.88% dirty_cpu_(15) 0.00% 0.00% 37.19% 0.00% 18.62% 0.00% 44.19% dirty_cpu_(16) 0.00% 0.00% 30.57% 0.00% 17.45% 0.00% 51.99% dirty_cpu_(17) 0.00% 0.00% 30.65% 0.00% 18.08% 0.00% 51.28% dirty_cpu_(18) 0.00% 0.00% 36.95% 0.00% 19.01% 0.00% 44.04% dirty_cpu_(19) 0.00% 0.00% 34.25% 0.00% 18.26% 0.00% 47.49% dirty_cpu_(20) 0.00% 0.00% 34.76% 0.00% 17.07% 0.00% 48.17% dirty_cpu_(21) 0.00% 0.00% 34.37% 0.00% 18.78% 0.00% 46.85% dirty_cpu_(22) 0.00% 0.00% 29.20% 0.00% 17.98% 0.00% 52.82% dirty_cpu_(23) 0.00% 0.00% 32.41% 0.00% 18.01% 0.00% 49.58% dirty_cpu_(24) 0.00% 0.00% 32.83% 0.00% 17.95% 0.00% 49.23% dirty_cpu_(25) 0.00% 0.00% 34.42% 0.00% 21.18% 0.00% 44.40% dirty_cpu_(26) 0.00% 0.00% 34.99% 0.00% 18.62% 0.00% 46.39% dirty_cpu_(27) 0.00% 0.00% 31.30% 0.00% 17.76% 0.00% 50.94% dirty_cpu_(28) 0.00% 0.00% 32.23% 0.00% 17.42% 0.00% 50.35% dirty_cpu_(29) 0.00% 0.00% 33.25% 0.00% 18.83% 0.00% 47.92% dirty_cpu_(30) 0.00% 0.00% 34.15% 0.00% 18.93% 0.00% 46.93% dirty_cpu_(31) 0.00% 0.00% 33.88% 0.00% 17.94% 0.00% 48.17% dirty_cpu_(32) 0.00% 0.00% 35.30% 0.00% 17.34% 0.00% 47.36% dirty_cpu_(33) 0.00% 0.00% 34.17% 0.00% 17.96% 0.00% 47.87% dirty_cpu_(34) 0.00% 0.00% 32.21% 0.00% 18.94% 0.00% 48.85% dirty_cpu_(35) 0.00% 0.00% 35.37% 0.00% 19.08% 0.00% 45.56% dirty_cpu_(36) 0.00% 0.00% 31.21% 0.00% 17.71% 0.00% 51.07% dirty_io_s( 1) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% dirty_io_s( 2) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% dirty_io_s( 3) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% dirty_io_s( 4) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% dirty_io_s( 5) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% dirty_io_s( 6) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% dirty_io_s( 7) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% dirty_io_s( 8) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% dirty_io_s( 9) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% dirty_io_s(10) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% poll( 0) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% scheduler( 1) 0.72% 0.18% 36.27% 0.51% 55.29% 1.59% 5.44% scheduler( 2) 0.62% 0.19% 33.97% 0.55% 54.90% 1.49% 8.27% scheduler( 3) 0.64% 0.22% 34.90% 0.53% 55.50% 1.43% 6.78% scheduler( 4) 0.56% 0.16% 31.66% 0.49% 55.55% 1.33% 10.25% scheduler( 5) 0.71% 0.20% 35.66% 0.58% 54.59% 1.49% 6.77% scheduler( 6) 0.59% 0.23% 31.19% 0.48% 57.81% 1.28% 8.42% scheduler( 7) 0.65% 0.23% 34.66% 0.56% 56.09% 1.60% 6.22% scheduler( 8) 0.78% 0.23% 35.04% 0.59% 55.32% 1.58% 6.47% scheduler( 9) 0.64% 0.21% 35.12% 0.47% 56.39% 1.50% 5.66% scheduler(10) 0.80% 0.21% 37.18% 0.53% 54.64% 1.59% 5.04% scheduler(11) 0.63% 0.20% 31.56% 0.53% 57.86% 1.48% 7.74% scheduler(12) 0.69% 0.22% 37.50% 0.57% 54.38% 1.75% 4.89% scheduler(13) 0.78% 0.21% 35.19% 0.55% 55.96% 1.54% 5.76% scheduler(14) 0.70% 0.22% 31.07% 0.48% 54.35% 1.30% 11.88% scheduler(15) 0.64% 0.23% 36.66% 0.51% 54.74% 1.66% 5.56% 
scheduler(16) 0.68% 0.18% 33.76% 0.50% 55.47% 1.28% 8.12% scheduler(17) 0.55% 0.28% 27.34% 0.43% 57.13% 1.24% 13.03% scheduler(18) 0.64% 0.18% 32.85% 0.47% 56.12% 1.32% 8.42% scheduler(19) 0.76% 0.24% 35.84% 0.53% 55.13% 1.56% 5.94% scheduler(20) 0.57% 0.19% 27.58% 0.42% 57.17% 1.12% 12.96% scheduler(21) 0.67% 0.20% 33.02% 0.44% 56.36% 1.36% 7.95% scheduler(22) 0.81% 0.23% 34.44% 0.54% 56.18% 1.35% 6.46% scheduler(23) 0.72% 0.22% 36.49% 0.53% 55.47% 1.53% 5.04% scheduler(24) 0.65% 0.24% 34.68% 0.45% 55.89% 1.44% 6.65% scheduler(25) 0.71% 0.21% 35.66% 0.46% 55.12% 1.65% 6.18% scheduler(26) 0.73% 0.20% 33.23% 0.46% 56.45% 1.40% 7.54% scheduler(27) 0.71% 0.20% 35.40% 0.60% 54.82% 1.47% 6.79% scheduler(28) 0.61% 0.18% 31.44% 0.48% 58.58% 1.37% 7.34% scheduler(29) 0.73% 0.22% 36.53% 0.52% 55.65% 1.61% 4.75% scheduler(30) 0.71% 0.17% 36.66% 0.48% 55.07% 1.49% 5.44% scheduler(31) 0.76% 0.21% 37.13% 0.50% 54.88% 1.51% 5.02% scheduler(32) 0.75% 0.17% 32.85% 0.49% 56.76% 1.35% 7.62% scheduler(33) 0.72% 0.23% 34.21% 0.54% 54.55% 1.42% 8.34% scheduler(34) 0.73% 0.19% 35.86% 0.58% 55.21% 1.41% 6.03% scheduler(35) 0.62% 0.19% 36.14% 0.51% 55.15% 1.57% 5.82% scheduler(36) 0.68% 0.22% 35.16% 0.55% 55.38% 1.35% 6.66%
Stats per type: async 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% aux 3.61% 1.11% 0.00% 0.00% 0.03% 0.00% 95.24% dirty_cpu_sche 0.00% 0.00% 33.37% 0.00% 18.36% 0.00% 48.27% dirty_io_sched 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% poll 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% scheduler 0.69% 0.21% 34.27% 0.51% 55.72% 1.46% 7.15%
Average thread real-time : 1001205 us Accumulated system run-time : 53032153 us Average scheduler run-time : 976005 us
Thread            aux   check_io  emulator    gc    other   port    sleep

Stats per thread:
  async( 0..**)   0.00%   0.00%     0.00%   0.00%   0.00%   0.00%  100.00%   (all 128 async threads identical)
  aux( 1)         4.15%   0.43%     0.00%   0.00%   0.21%   0.00%   95.21%
  dirty_cpu_( 1)  0.00%   0.00%    27.65%   0.00%  19.43%   0.00%   52.92%
  dirty_cpu_( 2)  0.00%   0.00%    33.04%   0.00%  21.14%   0.00%   45.82%
  dirty_cpu_( 3)  0.00%   0.00%    27.82%   0.00%  18.95%   0.00%   53.23%
  dirty_cpu_( 4)  0.00%   0.00%    28.83%   0.00%  18.41%   0.00%   52.76%
  dirty_cpu_( 5)  0.00%   0.00%    31.85%   0.00%  19.13%   0.00%   49.02%
  dirty_cpu_( 6)  0.00%   0.00%    37.99%   0.00%  22.95%   0.00%   39.06%
  dirty_cpu_( 7)  0.00%   0.00%    30.23%   0.00%  19.39%   0.00%   50.38%
  dirty_cpu_( 8)  0.00%   0.00%    27.41%   0.00%  19.20%   0.00%   53.39%
  dirty_cpu_( 9)  0.00%   0.00%    25.97%   0.00%  17.06%   0.00%   56.97%
  dirty_cpu_(10)  0.00%   0.00%    33.72%   0.00%  22.95%   0.00%   43.33%
  dirty_cpu_(11)  0.00%   0.00%    29.36%   0.00%  17.68%   0.00%   52.96%
  dirty_cpu_(12)  0.00%   0.00%    31.76%   0.00%  21.85%   0.00%   46.39%
  dirty_cpu_(13)  0.00%   0.00%    29.66%   0.00%  16.06%   0.00%   54.28%
  dirty_cpu_(14)  0.00%   0.00%    28.40%   0.00%  19.99%   0.00%   51.61%
  dirty_cpu_(15)  0.00%   0.00%    30.92%   0.00%  17.43%   0.00%   51.65%
  dirty_cpu_(16)  0.00%   0.00%    26.79%   0.00%  16.10%   0.00%   57.11%
  dirty_cpu_(17)  0.00%   0.00%    27.51%   0.00%  17.93%   0.00%   54.56%
  dirty_cpu_(18)  0.00%   0.00%    29.95%   0.00%  17.58%   0.00%   52.47%
  dirty_cpu_(19)  0.00%   0.00%    32.60%   0.00%  20.00%   0.00%   47.40%
  dirty_cpu_(20)  0.00%   0.00%    28.34%   0.00%  16.17%   0.00%   55.49%
  dirty_cpu_(21)  0.00%   0.00%    30.26%   0.00%  16.87%   0.00%   52.87%
  dirty_cpu_(22)  0.00%   0.00%    32.03%   0.00%  19.08%   0.00%   48.88%
  dirty_cpu_(23)  0.00%   0.00%    28.59%   0.00%  19.82%   0.00%   51.59%
  dirty_cpu_(24)  0.00%   0.00%    29.55%   0.00%  17.38%   0.00%   53.07%
  dirty_cpu_(25)  0.00%   0.00%    30.51%   0.00%  18.91%   0.00%   50.58%
  dirty_cpu_(26)  0.00%   0.00%    29.72%   0.00%  17.58%   0.00%   52.70%
  dirty_cpu_(27)  0.00%   0.00%    28.25%   0.00%  18.94%   0.00%   52.81%
  dirty_cpu_(28)  0.00%   0.00%    33.06%   0.00%  19.75%   0.00%   47.19%
  dirty_cpu_(29)  0.00%   0.00%    27.28%   0.00%  19.40%   0.00%   53.32%
  dirty_cpu_(30)  0.00%   0.00%    34.10%   0.00%  22.20%   0.00%   43.70%
  dirty_cpu_(31)  0.00%   0.00%    32.76%   0.00%  21.75%   0.00%   45.49%
  dirty_cpu_(32)  0.00%   0.00%    30.09%   0.00%  22.01%   0.00%   47.90%
  dirty_cpu_(33)  0.00%   0.00%    31.58%   0.00%  20.70%   0.00%   47.72%
  dirty_cpu_(34)  0.00%   0.00%    29.87%   0.00%  20.26%   0.00%   49.86%
  dirty_cpu_(35)  0.00%   0.00%    31.85%   0.00%  21.93%   0.00%   46.21%
  dirty_cpu_(36)  0.00%   0.00%    28.77%   0.00%  18.50%   0.00%   52.73%
  dirty_io_s( 1..10) 0.00% 0.00%    0.00%   0.00%   0.00%   0.00%  100.00%   (all 10 dirty I/O schedulers identical)
  poll( 0)        0.00%   0.00%     0.00%   0.00%   0.00%   0.00%  100.00%
  scheduler( 1)   0.40%   0.09%    76.76%   0.50%  18.81%   1.15%    2.30%
  scheduler( 2)   0.36%   0.14%    77.57%   0.49%  18.74%   1.08%    1.61%
  scheduler( 3)   0.35%   0.08%    76.81%   0.46%  18.96%   1.03%    2.32%
  scheduler( 4)   0.36%   0.08%    78.06%   0.40%  18.20%   1.11%    1.78%
  scheduler( 5)   0.33%   0.07%    76.77%   0.40%  19.36%   0.88%    2.19%
  scheduler( 6)   0.31%   0.06%    74.23%   0.38%  19.66%   0.81%    4.54%
  scheduler( 7)   0.40%   0.08%    76.56%   0.48%  18.86%   1.25%    2.36%
  scheduler( 8)   0.48%   0.07%    75.85%   0.50%  19.36%   1.40%    2.34%
  scheduler( 9)   0.36%   0.07%    76.28%   0.52%  19.21%   1.39%    2.17%
  scheduler(10)   0.32%   0.07%    75.76%   0.43%  19.01%   1.19%    3.23%
  scheduler(11)   0.35%   0.07%    76.43%   0.52%  19.00%   1.19%    2.43%
  scheduler(12)   0.31%   0.06%    75.31%   0.41%  19.55%   1.02%    3.35%
  scheduler(13)   0.38%   0.07%    77.07%   0.59%  18.84%   1.33%    1.73%
  scheduler(14)   0.41%   0.18%    77.68%   0.48%  18.25%   1.34%    1.66%
  scheduler(15)   0.34%   0.08%    76.80%   0.44%  19.12%   1.05%    2.16%
  scheduler(16)   0.38%   0.06%    77.68%   0.44%  18.22%   1.13%    2.08%
  scheduler(17)   0.36%   0.06%    76.70%   0.49%  18.93%   1.19%    2.27%
  scheduler(18)   0.40%   0.08%    77.05%   0.44%  18.97%   1.25%    1.82%
  scheduler(19)   0.35%   0.07%    75.97%   0.45%  19.40%   1.07%    2.69%
  scheduler(20)   0.37%   0.07%    76.11%   0.46%  19.42%   1.14%    2.43%
  scheduler(21)   0.37%   0.08%    76.33%   0.47%  19.34%   1.01%    2.40%
  scheduler(22)   0.46%   0.08%    77.41%   0.50%  18.75%   0.97%    1.83%
  scheduler(23)   0.43%   0.08%    75.50%   0.51%  19.85%   1.13%    2.51%
  scheduler(24)   0.39%   0.07%    74.49%   0.49%  20.29%   1.07%    3.20%
  scheduler(25)   0.36%   0.07%    74.69%   0.48%  19.32%   0.98%    4.10%
  scheduler(26)   0.35%   0.07%    75.96%   0.41%  18.79%   0.98%    3.44%
  scheduler(27)   0.36%   0.06%    76.86%   0.43%  19.18%   0.96%    2.15%
  scheduler(28)   0.32%   0.08%    75.08%   0.46%  19.71%   1.02%    3.34%
  scheduler(29)   0.43%   0.14%    77.10%   0.43%  18.80%   1.22%    1.88%
  scheduler(30)   0.35%   0.06%    76.45%   0.47%  18.64%   1.11%    2.92%
  scheduler(31)   0.38%   0.08%    78.35%   0.52%  18.21%   1.17%    1.29%
  scheduler(32)   0.37%   0.06%    74.47%   0.50%  19.71%   1.34%    3.56%
  scheduler(33)   0.38%   0.06%    77.12%   0.44%  18.59%   1.12%    2.29%
  scheduler(34)   0.46%   0.07%    75.58%   0.51%  19.87%   0.93%    2.58%
  scheduler(35)   0.38%   0.13%    75.24%   0.46%  19.68%   1.03%    3.07%
  scheduler(36)   0.39%   0.08%    76.21%   0.48%  19.52%   1.03%    2.29%

Stats per type:
  async           0.00%   0.00%     0.00%   0.00%   0.00%   0.00%  100.00%
  aux             4.15%   0.43%     0.00%   0.00%   0.21%   0.00%   95.21%
  dirty_cpu_sche  0.00%   0.00%    30.22%   0.00%  19.29%   0.00%   50.48%
  dirty_io_sched  0.00%   0.00%     0.00%   0.00%   0.00%   0.00%  100.00%
  poll            0.00%   0.00%     0.00%   0.00%   0.00%   0.00%  100.00%
  scheduler       0.38%   0.08%    76.34%   0.47%  19.11%   1.11%    2.51%
Hi Jesse,
We have not observed the behaviour you see. Can you share microstate accounting stats for the different cases? Maybe that could shed some light on what the schedulers are doing?
1> msacc:start(1000), msacc:print().
Maybe also running perf top as root on one of the scheduler threads that are fully utilised could be interesting:
# perf top --tid <traditional scheduler thread id>
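If it is easier to share, the same window can also be dumped to a file and read back later. A small sketch using the msacc API from runtime_tools (the file name is just an example):

    1> msacc:start(1000).                    %% run microstate accounting for ~1 second, then stop
    2> msacc:print().                        %% per-thread table for that window
    3> msacc:to_file("msacc_load_test.dat"). %% save the same counters so they can be attached or diffed
    4> msacc:print(msacc:from_file("msacc_load_test.dat")). %% reprint a saved dump later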
/Jonas
On Wed, Jan 9, 2019 at 10:52 PM Jesse Stimpson <[hidden email]> wrote:
Rikard and Jonas,
Thank you for your response. With the patch from OTP PR-2027 applied, the dirty CPU schedulers no longer drop to zero utilization, but we now see that the performance of the traditional schedulers suffers. Specifically, when we apply our load test, scheduler usage toggles between the following two states. I've snipped runs of similar values for readability.
[{1,0.8933020805456553}, ... {36,0.8882068476700309}, {37,0.4932569692588096}, ... {72,0.49842881346116297}]
and
[{1,1.0}, ... {36,1.0}, {37,0.42909590032946493}, ... {72,0.4641339106204525}]
When applying this same load to the un-patched Erlang 21.2, the traditional schedulers are utilized at about 75%. So it appears that the 2027 patch may put undesirable pressure on the schedulers as a trade-off for consistent performance from the dirty CPU schedulers. Are you observing a similar response on your system?
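As a side note, any dirty CPU scheduler that has gone (near) idle is easy to flag from the same recon data. A rough one-liner, where the id range 37-72 and the 1% cutoff are specific to this 36+36 configuration:

    1> rp([Id || {Id, Usage} <- recon:scheduler_usage(1000), Id > 36, Usage < 0.01]).
    %% lists the dirty CPU scheduler ids that did essentially no work during the 1-second sample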
Thanks, Jesse
Hi,
This sounds like the issue I bumped into.
After some time, many of the dirty schedulers were not performing any work anymore, which turned out to be related to a corrupted list of sleeping dirty schedulers.
I haven’t found what triggered it in my case (maybe a GC?). For me it took approx. 1 day before it happened.
Doing a multi-scheduler block can trigger the problem where the list of sleeping dirty schedulers gets corrupted.
After the corruption, the sleepers list may contain only one or a few schedulers. That list is later used to wake up schedulers when there is work to do.
For each enqueued dirty-CPU or dirty-I/O job, the runtime wakes the first scheduler in the sleepers list.
But if that list only points to a few schedulers, only those are ever woken up to pick jobs from the queue.
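For reference, the multi-scheduler block mentioned above is the operation toggled from Erlang as below; this is only a sketch of the standard system_flag API, not code from the PR:

    %% While multi-scheduling is blocked, only one normal scheduler schedules
    %% processes and the dirty schedulers are blocked as well.
    erlang:system_flag(multi_scheduling, block),
    %% ... whatever required the block runs here ...
    erlang:system_flag(multi_scheduling, unblock),
    erlang:system_info(multi_scheduling).  %% enabled | blocked | blocked_normal | disabled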
Since fixing that specific problem I have not seen the collapse of dirty schedulers anymore.
It would be great if you could test and see whether it fixes your problem as well: https://github.com/erlang/otp/pull/2027
Any feedback on the bug fix is appreciated.
/Jonas
> On 8 Jan 2019, at 17:13, Jesse Stimpson <[hidden email]> wrote:
>
> Hello,
>
> We run a large scale WebRTC-based VOIP application, and currently we are evaluating whether or not dirty CPU schedulers can help us get more consistent performance with some critical NIFs. We've created a test that mimics our production load and have run it on AWS EC2 instances of various sizes. When dirty CPU schedulers are working as expected, we do see measurable improvement in performance, but unfortunately we have seen that they can drop to 0 utilization. They do not recover from this state.
>
> The timings from 2 of our 12 NIFs are in the vicinity of the 1 msec rule of thumb, so we have experimented with them both on and off dirty CPU schedulers. We've left the other NIFs on the traditional schedulers. More detailed NIF timings can be found below.
>
> At the end of this message is the concerning data (from recon:scheduler_usage/1).We have schedulers 1-36 as the traditional schedulers bound to the first hardware thread of each core, and schedulers 37-72 are dirty CPU schedulers. Based on mpstat it appears the dirty CPU schedulers are not bound to threads. After about 10 minutes of running the load test, schedulers 42, 43, 44, 61, 63, 64, 66, 68, 72 went to at 0 utilization and stayed there indefinitely. As time goes on, more dirty CPU schedulers enter this 0 utilization state. Schedulers 1-36 do not exhibit this behaviour. We have reproduced this consistently on other EC2 c5 instances of various sizes.
>
> We have tried various beam adjustments, but we have seen the same behaviour each time. Some examples of things we have tried:
> +sbwt very_long +sbwtdcpu very_long +swt very_low +swtdcpu very_low
> +sfwi 500
>
> Is there anything else that we can try to avoid this 0-utilization issue? We are happy to provide more data if needed.
>
> NIF: webrtc_acm_add_data_10ms
> median: 606 usec
> 90 percentile: 1230 usec
> 99 percentile: 2285 usec
> 99.9 percentile: 3223 usec
>
> NIF: webrtc_acm_playout_data_10ms
> median: 98 usec
> 90 percentile: 371 usec
> 99 percentile: 410 usec
> 99.9 percentile: 449 usec
>
> EC2 instance: c5.18xlarge
>
> OS: Linux 4.4.0-1073-aws #83-Ubuntu SMP Sat Nov 17 00:26:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> beam flags: +P 4194304 +sbt db +SP 50:50 +sub true +A 128 +K true
>
> Erlang banner: Erlang/OTP 21 [erts-10.2] [source] [64-bit] [smp:36:36] [ds:36:36:10] [async-threads:128] [hipe]
>
> cpu_topology:
> [{node,[{processor,[{core,[{thread,{logical,0}},
> {thread,{logical,36}}]},
> {core,[{thread,{logical,1}},{thread,{logical,37}}]},
> {core,[{thread,{logical,2}},{thread,{logical,38}}]},
> {core,[{thread,{logical,3}},{thread,{logical,39}}]},
> {core,[{thread,{logical,4}},{thread,{logical,40}}]},
> {core,[{thread,{logical,5}},{thread,{logical,41}}]},
> {core,[{thread,{logical,6}},{thread,{logical,42}}]},
> {core,[{thread,{logical,7}},{thread,{logical,43}}]},
> {core,[{thread,{logical,8}},{thread,{logical,44}}]},
> {core,[{thread,{logical,9}},{thread,{logical,45}}]},
> {core,[{thread,{logical,10}},{thread,{logical,46}}]},
> {core,[{thread,{logical,11}},{thread,{logical,47}}]},
> {core,[{thread,{logical,12}},{thread,{logical,48}}]},
> {core,[{thread,{logical,13}},{thread,{logical,49}}]},
> {core,[{thread,{logical,14}},{thread,{logical,50}}]},
> {core,[{thread,{logical,15}},{thread,{logical,51}}]},
> {core,[{thread,{logical,16}},{thread,{logical,52}}]},
> {core,[{thread,{logical,17}},{thread,{logical,53}}]}]}]},
> {node,[{processor,[{core,[{thread,{logical,18}},
> {thread,{logical,54}}]},
> {core,[{thread,{logical,19}},{thread,{logical,55}}]},
> {core,[{thread,{logical,20}},{thread,{logical,56}}]},
> {core,[{thread,{logical,21}},{thread,{logical,57}}]},
> {core,[{thread,{logical,22}},{thread,{logical,58}}]},
> {core,[{thread,{logical,23}},{thread,{logical,59}}]},
> {core,[{thread,{logical,24}},{thread,{logical,60}}]},
> {core,[{thread,{logical,25}},{thread,{logical,61}}]},
> {core,[{thread,{logical,26}},{thread,{logical,62}}]},
> {core,[{thread,{logical,27}},{thread,{logical,63}}]},
> {core,[{thread,{logical,28}},{thread,{logical,64}}]},
> {core,[{thread,{logical,29}},{thread,{logical,65}}]},
> {core,[{thread,{logical,30}},{thread,{logical,66}}]},
> {core,[{thread,{logical,31}},{thread,{logical,67}}]},
> {core,[{thread,{logical,32}},{thread,{logical,68}}]},
> {core,[{thread,{logical,33}},{thread,{logical,69}}]},
> {core,[{thread,{logical,34}},{thread,{logical,70}}]},
> {core,[{thread,{logical,35}},{thread,{logical,71}}]}]}]}]
>
> scheduler_bindings: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
> 23,24,25,26,27,28,29,30,31,32,33,34,35}
>
> CPU utilization (from mpstat) during our load test:
> Linux 4.4.0-1073-aws (isw-core-x-qa-awsoh02) 01/04/2019 _x86_64_ (72 CPU)
>
> 08:55:15 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 08:55:15 PM all 51.58 0.00 1.64 0.01 0.00 0.42 0.00 0.00 0.00 46.36
> 08:55:15 PM 0 62.34 0.00 2.69 0.06 0.00 0.00 0.00 0.00 0.00 34.90
> 08:55:15 PM 1 62.84 0.00 2.58 0.02 0.00 0.00 0.00 0.00 0.00 34.55
> 08:55:15 PM 2 62.04 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.44
> 08:55:15 PM 3 62.17 0.00 2.62 0.00 0.00 0.00 0.00 0.00 0.00 35.21
> 08:55:15 PM 4 62.14 0.00 2.54 0.02 0.00 0.00 0.00 0.00 0.00 35.30
> 08:55:15 PM 5 62.06 0.00 2.51 0.00 0.00 0.00 0.00 0.00 0.00 35.43
> 08:55:15 PM 6 61.93 0.00 2.58 0.01 0.00 0.00 0.02 0.00 0.00 35.46
> 08:55:15 PM 7 61.99 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.48
> 08:55:15 PM 8 62.01 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.47
> 08:55:15 PM 9 61.88 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.63
> 08:55:15 PM 10 62.18 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.34
> 08:55:15 PM 11 61.89 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.64
> 08:55:15 PM 12 61.98 0.00 2.50 0.00 0.00 0.00 0.00 0.00 0.00 35.52
> 08:55:15 PM 13 61.80 0.00 2.54 0.00 0.00 0.00 0.00 0.00 0.00 35.66
> 08:55:15 PM 14 61.96 0.00 2.56 0.00 0.00 0.00 0.00 0.00 0.00 35.48
> 08:55:15 PM 15 62.11 0.00 2.57 0.00 0.00 0.00 0.00 0.00 0.00 35.31
> 08:55:15 PM 16 62.03 0.00 2.55 0.00 0.00 0.00 0.00 0.00 0.00 35.42
> 08:55:15 PM 17 61.90 0.00 2.53 0.00 0.00 0.00 0.00 0.00 0.00 35.57
> 08:55:15 PM 18 58.36 0.00 2.41 0.09 0.00 3.72 0.00 0.00 0.00 35.43
> 08:55:15 PM 19 58.29 0.00 2.45 0.02 0.00 4.00 0.00 0.00 0.00 35.24
> 08:55:15 PM 20 58.45 0.00 2.36 0.02 0.00 3.73 0.00 0.00 0.00 35.44
> 08:55:15 PM 21 58.56 0.00 2.34 0.04 0.00 3.54 0.00 0.00 0.00 35.51
> 08:55:15 PM 22 58.12 0.00 2.46 0.00 0.00 3.84 0.00 0.00 0.00 35.58
> 08:55:15 PM 23 58.24 0.00 2.36 0.01 0.00 3.82 0.00 0.00 0.00 35.57
> 08:55:15 PM 24 58.84 0.00 2.42 0.00 0.00 3.60 0.00 0.00 0.00 35.14
> 08:55:15 PM 25 58.35 0.00 2.43 0.00 0.00 3.72 0.00 0.00 0.00 35.51
> 08:55:15 PM 26 62.09 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 27 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 28 62.15 0.00 2.48 0.00 0.00 0.00 0.00 0.00 0.00 35.37
> 08:55:15 PM 29 62.09 0.00 2.46 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 30 62.74 0.00 2.45 0.00 0.00 0.00 0.00 0.00 0.00 34.81
> 08:55:15 PM 31 62.14 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.39
> 08:55:15 PM 32 62.35 0.00 2.52 0.00 0.00 0.00 0.00 0.00 0.00 35.13
> 08:55:15 PM 33 62.13 0.00 2.42 0.00 0.00 0.00 0.00 0.00 0.00 35.46
> 08:55:15 PM 34 62.08 0.00 2.47 0.00 0.00 0.00 0.00 0.00 0.00 35.45
> 08:55:15 PM 35 62.15 0.00 2.49 0.00 0.00 0.00 0.00 0.00 0.00 35.35
> 08:55:15 PM 36 43.19 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.01
> 08:55:15 PM 37 43.09 0.00 0.83 0.01 0.00 0.00 0.00 0.00 0.00 56.06
> 08:55:15 PM 38 42.59 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.63
> 08:55:15 PM 39 42.60 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.58
> 08:55:15 PM 40 42.94 0.00 0.76 0.02 0.00 0.00 0.00 0.00 0.00 56.28
> 08:55:15 PM 41 42.11 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.13
> 08:55:15 PM 42 43.26 0.00 0.85 0.01 0.00 0.00 0.00 0.00 0.00 55.88
> 08:55:15 PM 43 42.23 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 56.99
> 08:55:15 PM 44 42.24 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 56.95
> 08:55:15 PM 45 41.65 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.55
> 08:55:15 PM 46 41.76 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.46
> 08:55:15 PM 47 42.66 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 56.52
> 08:55:15 PM 48 41.83 0.00 0.76 0.01 0.00 0.00 0.00 0.00 0.00 57.40
> 08:55:15 PM 49 42.04 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.13
> 08:55:15 PM 50 41.55 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 51 42.37 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 56.84
> 08:55:15 PM 52 41.66 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.00 57.57
> 08:55:15 PM 53 41.77 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.45
> 08:55:15 PM 54 40.33 0.00 0.88 0.01 0.00 0.00 0.00 0.00 0.00 58.79
> 08:55:15 PM 55 41.62 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 57.59
> 08:55:15 PM 56 41.51 0.00 0.82 0.00 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 57 40.35 0.00 0.76 0.00 0.00 0.00 0.00 0.00 0.00 58.89
> 08:55:15 PM 58 41.92 0.00 0.79 0.01 0.00 0.00 0.00 0.00 0.00 57.28
> 08:55:15 PM 59 41.48 0.00 0.80 0.00 0.00 0.00 0.00 0.00 0.00 57.72
> 08:55:15 PM 60 40.83 0.00 0.75 0.01 0.00 0.00 0.00 0.00 0.00 58.41
> 08:55:15 PM 61 40.95 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 58.26
> 08:55:15 PM 62 40.85 0.00 0.80 0.01 0.00 0.00 0.00 0.00 0.00 58.35
> 08:55:15 PM 63 42.11 0.00 0.77 0.01 0.00 0.00 0.00 0.00 0.00 57.12
> 08:55:15 PM 64 42.22 0.00 0.78 0.01 0.00 0.00 0.00 0.00 0.00 56.99
> 08:55:15 PM 65 41.55 0.00 0.78 0.00 0.00 0.00 0.00 0.00 0.00 57.67
> 08:55:15 PM 66 41.68 0.00 0.81 0.00 0.00 0.00 0.00 0.00 0.00 57.51
> 08:55:15 PM 67 41.06 0.00 0.87 0.01 0.00 0.00 0.00 0.00 0.00 58.07
> 08:55:15 PM 68 41.80 0.00 0.79 0.00 0.00 0.00 0.00 0.00 0.00 57.41
> 08:55:15 PM 69 41.38 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.87
> 08:55:15 PM 70 42.17 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 57.08
> 08:55:15 PM 71 41.26 0.00 0.81 0.00 0.00 0.00 0.01 0.00 0.00 57.92
>
>
> 1> rp(recon:scheduler_usage(1000)).
> [{1,0.8581586716045433},
> {2,0.8607597847736659},
> {3,0.8501543449289042},
> {4,0.8619718426658023},
> {5,0.8660226578705742},
> {6,0.8555763933364806},
> {7,0.8557195972097537},
> {8,0.8668284625654983},
> {9,0.858588675506532},
> {10,0.8650501369640686},
> {11,0.8615557214275474},
> {12,0.8523392789683996},
> {13,0.8469329348407398},
> {14,0.8615637534991641},
> {15,0.8591930739986557},
> {16,0.8578596164201188},
> {17,0.8549696086049403},
> {18,0.8677098230685564},
> {19,0.8556328340054443},
> {20,0.8642542459479063},
> {21,0.8670655179064826},
> {22,0.8562638308551557},
> {23,0.8511202058055239},
> {24,0.8697339264798696},
> {25,0.8594823462601853},
> {26,0.8634550008001265},
> {27,0.8568590130172198},
> {28,0.864025757952233},
> {29,0.8575614770622326},
> {30,0.8611760239143281},
> {31,0.8589789125669829},
> {32,0.8537754468447621},
> {33,0.8675538912040871},
> {34,0.8607522803519515},
> {35,0.8639389099445753},
> {36,0.8743759133171144},
> {37,0.9526507413773687},
> {38,0.948925611917577},
> {39,0.9511904612810083},
> {40,0.9581015421004746},
> {41,0.9534977201467516},
> {42,0.0},
> {43,0.0},
> {44,0.0},
> {45,0.9522518499948341},
> {46,0.9453306755503674},
> {47,0.9586374601140047},
> {48,0.9574611845671873},
> {49,0.950914304057791},
> {50,0.9582124294398642},
> {51,0.9448074463900029},
> {52,0.9474771999511407},
> {53,0.9450778098401229},
> {54,0.0},
> {55,0.9478167911711937},
> {56,0.9612587860764551},
> {57,0.943468696358506},
> {58,0.9509498642900455},
> {59,0.9603595134495414},
> {60,0.9517397912369205},
> {61,0.0},
> {62,0.9587029756970658},
> {63,0.0},
> {64,0.0},
> {65,0.9526167902065421},
> {66,0.0},
> {67,0.9475668680715805},
> {68,0.0},
> {69,0.9630301804839606},
> {70,0.9475359592879946},
> {71,0.9541718747681374},
> {72,0.0}]
>
>
> Thanks for your time,
>
> Jesse Stimpson
> _______________________________________________
> erlang-questions mailing list
> [hidden email]
> http://erlang.org/mailman/listinfo/erlang-questions
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
My apologies as I've been distracted by other matters recently. I hope to be able to report back within the next few days.
Thanks
Did Rickard’s patch (PR-2093) fix your dirty scheduler issue?
Best /Frank
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Behavior of the schedulers appears to have the same issue with 2093 patch.
But I did notice something new in the msacc output. There is a very brief period, at approx the same time as the normal schedulers usage spikes, where all the dirty cpu schedulers have a significant sleep time. I've included timestamped excerpts below, starting with the increase in dirty cpu sleep, and ending with a "steady state" utilization.
Thread aux check_io emulator gc other port sleep
{{2019,1,15},{21,57,34}} dirty_cpu_( 1) 0.00% 0.00% 26.80% 0.00% 22.75% 0.00% 50.45% scheduler( 1) 0.17% 0.00% 98.22% 0.50% 0.18% 0.77% 0.16%
{{2019,1,15},{21,57,36}} dirty_cpu_( 1) 0.00% 0.00% 76.92% 0.00% 17.73% 0.00% 5.35% scheduler( 1) 0.61% 0.00% 94.20% 1.23% 0.61% 3.35% 0.00%
{{2019,1,15},{21,57,38}} dirty_cpu_( 1) 0.00% 0.00% 87.16% 0.00% 6.52% 0.00% 6.31% scheduler( 1) 0.94% 0.14% 93.72% 1.51% 0.63% 2.84% 0.22%
{{2019,1,15},{21,57,39}} dirty_cpu_( 1) 0.00% 0.00% 87.92% 0.00% 6.85% 0.00% 5.23% scheduler( 1) 0.98% 0.14% 90.88% 1.32% 3.71% 2.71% 0.26%
{{2019,1,15},{21,57,40}} dirty_cpu_( 1) 0.00% 0.00% 67.47% 0.00% 19.51% 0.00% 13.02% scheduler( 1) 1.09% 0.17% 81.95% 1.07% 12.01% 3.08% 0.64%
{{2019,1,15},{21,57,41}} dirty_cpu_( 1) 0.00% 0.00% 64.25% 0.00% 18.86% 0.00% 16.88% scheduler( 1) 0.99% 0.28% 74.67% 0.98% 19.56% 2.80% 0.71%
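For anyone who wants to collect a similar trace, a loop along these lines produces one timestamped msacc snapshot per window (a sketch; the helper name and the 1-second window are made up, and the excerpts above only show the dirty_cpu_( 1) and scheduler( 1) rows out of the full table):

    sample_msacc(0) ->
        ok;
    sample_msacc(N) ->
        msacc:start(1000),                          %% collect ~1 s of microstate accounting
        io:format("~p~n", [calendar:local_time()]), %% e.g. {{2019,1,15},{21,57,34}}
        msacc:print(),                              %% per-thread table for that window
        sample_msacc(N - 1).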
Thank you for the continuing support!
Jesse
On Tue, Jan 15, 2019 at 4:14 PM Jesse Stimpson <[hidden email]> wrote:
My apologies as I've been distracted by other matters recently. I hope to be able to report back within the next few days.
Thanks
Did Rickard’s patch (PR-2093) fixed your Dirty scheduler issue?
Best /Frank
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
On 2019-01-15 23:11, Jesse Stimpson wrote:
> Behavior of the schedulers appears to have the same issue with 2093 patch.
>
> But I did notice something new in the msacc output. There is a very
> brief period, at approx the same time as the normal schedulers usage
> spikes, where all the dirty cpu schedulers have a significant sleep
> time. I've included timestamped excerpts below, starting with the
> increase in dirty cpu sleep, and ending with a "steady state" utilization.
>
We just released OTP-21.2.3 containing PR-2093.
I don't think PR-2093 causes the spikes. This change does not affect how
work is moved between normal and dirty schedulers; it only prevents the
"loss" of dirty schedulers.
If a process is scheduled on a dirty scheduler it won't make progress
until it has executed on a dirty scheduler, and vice versa (for normal
schedulers). This is the same both before and after PR-2093. Since dirty
schedulers aren't "lost" after PR-2093, progress of such processes will
happen earlier, which of course changes the behavior, but that is due to
the work load.
Regards,
Rickard
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Thanks Rickard.
@Jonas: did the PR2093 help in your case?
/Frank
On 2019-01-15 23:11, Jesse Stimpson wrote:
> Behavior of the schedulers appears to have the same issue with 2093 patch.
>
> But I did notice something new in the msacc output. There is a very
> brief period, at approx the same time as the normal schedulers usage
> spikes, where all the dirty cpu schedulers have a significant sleep
> time. I've included timestamped excerpts below, starting with the
> increase in dirty cpu sleep, and ending with a "steady state" utilization.
>
We just released OTP-21.2.3 containing PR-2093.
I don't think PR-2093 cause the spikes. This change does not affect how
work is moved between normal and dirty schedulers, only prevents the
"loss" of dirty schedulers.
If a process is scheduled on a dirty scheduler it wont make progress
until it has executed on a dirty scheduler and vice versa (for normal
schedulers). This is the same both before and after PR-2093. Since dirty
schedulers aren't "lost" after PR-2093 progress of such processes will
happen earlier which of course change the behavior, but that is due to
the work load.
Regards,
Rickard
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
It's possible that during our tests the utilization spike was masked by the collapse issue fixed in the recent PRs. Is there any other analysis I can provide on the utilization spike/sleep behavior we're seeing, or any other debugging or code reading you recommend? As far as I can tell, there's nothing about our workload that would cause periodic behavior like this. The application is slinging RTP audio via UDP to remote endpoints at a 20 msec ptime. Each function call for the NIF in question adds 10 msec of audio to the WebRTC buffer.
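For anyone reproducing this, percentiles like the NIF timings quoted earlier in the thread can be collected with nothing more than timer:tc/3 around the call. A rough shell sketch, where webrtc_acm, Handle and Frame are placeholders for the real wrapper module and arguments:

    1> Pctl = fun(Samples, P) ->
                  Sorted = lists:sort(Samples),
                  lists:nth(max(1, round(P * length(Sorted))), Sorted)
              end.
    2> Samples = [element(1, timer:tc(webrtc_acm, add_data_10ms, [Handle, Frame]))
                  || _ <- lists:seq(1, 1000)].   %% microseconds per call
    3> {Pctl(Samples, 0.5), Pctl(Samples, 0.90), Pctl(Samples, 0.99)}.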
As always, the level of support from the Erlang community is second to none. Thanks to all for your time!
Jesse
On 2019-01-15 23:11, Jesse Stimpson wrote:
> Behavior of the schedulers appears to have the same issue with 2093 patch.
>
> But I did notice something new in the msacc output. There is a very
> brief period, at approx the same time as the normal schedulers usage
> spikes, where all the dirty cpu schedulers have a significant sleep
> time. I've included timestamped excerpts below, starting with the
> increase in dirty cpu sleep, and ending with a "steady state" utilization.
>
We just released OTP-21.2.3 containing PR-2093.
I don't think PR-2093 cause the spikes. This change does not affect how
work is moved between normal and dirty schedulers, only prevents the
"loss" of dirty schedulers.
If a process is scheduled on a dirty scheduler it wont make progress
until it has executed on a dirty scheduler and vice versa (for normal
schedulers). This is the same both before and after PR-2093. Since dirty
schedulers aren't "lost" after PR-2093 progress of such processes will
happen earlier which of course change the behavior, but that is due to
the work load.
Regards,
Rickard
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Glad to hear your dirty scheduler collapse issue was solved by Rickard’s PR.
The 21.2.3 release fixed a similar problem we had with dirty schedulers.
Jesse: can’t you rewrite your dirty NIF as a yielding one? There’s a nice slide deck from “Andrew Bennett” describing this technique (page 58):
Best /Frank
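The Erlang-facing side of a chunked approach can be as small as the sketch below: the NIF does a bounded slice of work per call and an Erlang loop drives it until the frame is consumed; with enif_schedule_nif and enif_consume_timeslice the same chunking can instead be hidden entirely inside the NIF so the caller sees a single call. All names here (webrtc_nif, add_data_begin/continue) are hypothetical:

    %% Hypothetical chunked API: each NIF call handles a bounded slice of the
    %% 10 msec frame and returns {more, Cont} until everything is consumed.
    add_data_10ms(Handle, Frame) ->
        drive(webrtc_nif:add_data_begin(Handle, Frame)).

    drive({more, Cont})   -> drive(webrtc_nif:add_data_continue(Cont));
    drive({done, Result}) -> Result.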
It's possible that during our tests the utilization spike was masked by the collapse issue fixed in the recent PRs. Is there any other analysis I can provide on the utilization spike/sleep behavior we're seeing, or any other debugging or code reading you recommend? As far as I can tell, there's nothing about our workload that would cause periodic behavior like this. The application is slinging RTP audio via udp to remote endpoints at a 20 msec ptime. Each function call for the NIF in question adds 10 msec of audio to the WebRTC buffer.
As always, the level of support from the Erlang community is second to none. Thanks to all for your time!
On 2019-01-15 23:11, Jesse Stimpson wrote:
> Behavior of the schedulers appears to have the same issue with 2093 patch.
>
> But I did notice something new in the msacc output. There is a very
> brief period, at approx the same time as the normal schedulers usage
> spikes, where all the dirty cpu schedulers have a significant sleep
> time. I've included timestamped excerpts below, starting with the
> increase in dirty cpu sleep, and ending with a "steady state" utilization.
>
We just released OTP-21.2.3 containing PR-2093.
I don't think PR-2093 cause the spikes. This change does not affect how
work is moved between normal and dirty schedulers, only prevents the
"loss" of dirty schedulers.
If a process is scheduled on a dirty scheduler it wont make progress
until it has executed on a dirty scheduler and vice versa (for normal
schedulers). This is the same both before and after PR-2093. Since dirty
schedulers aren't "lost" after PR-2093 progress of such processes will
happen earlier which of course change the behavior, but that is due to
the work load.
Regards,
Rickard
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
why do you do it via nif?
On Wed, Jan 16, 2019 at 6:00 PM Jesse Stimpson <[hidden email]> wrote:
It's possible that during our tests the utilization spike was masked by the collapse issue fixed in the recent PRs. Is there any other analysis I can provide on the utilization spike/sleep behavior we're seeing, or any other debugging or code reading you recommend? As far as I can tell, there's nothing about our workload that would cause periodic behavior like this. The application is slinging RTP audio via udp to remote endpoints at a 20 msec ptime. Each function call for the NIF in question adds 10 msec of audio to the WebRTC buffer.
As always, the level of support from the Erlang community is second to none. Thanks to all for your time!
Jesse
On 2019-01-15 23:11, Jesse Stimpson wrote:
> Behavior of the schedulers appears to have the same issue with 2093 patch.
>
> But I did notice something new in the msacc output. There is a very
> brief period, at approx the same time as the normal schedulers usage
> spikes, where all the dirty cpu schedulers have a significant sleep
> time. I've included timestamped excerpts below, starting with the
> increase in dirty cpu sleep, and ending with a "steady state" utilization.
>
We just released OTP-21.2.3 containing PR-2093.
I don't think PR-2093 cause the spikes. This change does not affect how
work is moved between normal and dirty schedulers, only prevents the
"loss" of dirty schedulers.
If a process is scheduled on a dirty scheduler it wont make progress
until it has executed on a dirty scheduler and vice versa (for normal
schedulers). This is the same both before and after PR-2093. Since dirty
schedulers aren't "lost" after PR-2093 progress of such processes will
happen earlier which of course change the behavior, but that is due to
the work load.
Regards,
Rickard
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
To clarify about our workload: the NIF execution itself is around 1 msec, while the data on which it operates represents 10 msec of audio. Apologies if my last message was unclear.
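Since those calls sit right around the 1 msec guideline, one cheap way to see whether they actually hurt the normal schedulers when run there is erlang:system_monitor/2 with the long_schedule option. A sketch with an arbitrary 10 ms threshold:

    %% Receive a message whenever a process or port stays scheduled in for
    %% more than 10 ms without being preempted (long NIF/BIF calls are a common cause).
    erlang:system_monitor(self(), [{long_schedule, 10}]),
    receive
        {monitor, PidOrPort, long_schedule, Info} ->
            io:format("long schedule: ~p ~p~n", [PidOrPort, Info])
    after 60000 ->
        no_long_schedule_seen
    end.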
Out of convenience, we're using the open-source WebRTC project to take advantage of its built-in PLC, FEC, Opus, etc. The project is written in C++, so we have integrated with it via a NIF. Unfortunately, rewriting it as a yielding NIF, or rewriting it in Erlang, is not as straightforward as we would like, although I admit that would alleviate our scheduling issues.
We'll continue our testing with 21.2.3 and report back if there appear to be any other leads.
Thanks,
Jesse

why do you do it via nif?
On Wed, Jan 16, 2019 at 6:00 PM Jesse Stimpson < [hidden email]> wrote: It's possible that during our tests the utilization spike was masked by the collapse issue fixed in the recent PRs. Is there any other analysis I can provide on the utilization spike/sleep behavior we're seeing, or any other debugging or code reading you recommend? As far as I can tell, there's nothing about our workload that would cause periodic behavior like this. The application is slinging RTP audio via udp to remote endpoints at a 20 msec ptime. Each function call for the NIF in question adds 10 msec of audio to the WebRTC buffer.
As always, the level of support from the Erlang community is second to none. Thanks to all for your time!
Jesse
On 2019-01-15 23:11, Jesse Stimpson wrote:
> Behavior of the schedulers appears to have the same issue with 2093 patch.
>
> But I did notice something new in the msacc output. There is a very
> brief period, at approx the same time as the normal schedulers usage
> spikes, where all the dirty cpu schedulers have a significant sleep
> time. I've included timestamped excerpts below, starting with the
> increase in dirty cpu sleep, and ending with a "steady state" utilization.
>
We just released OTP-21.2.3 containing PR-2093.
I don't think PR-2093 cause the spikes. This change does not affect how
work is moved between normal and dirty schedulers, only prevents the
"loss" of dirty schedulers.
If a process is scheduled on a dirty scheduler it wont make progress
until it has executed on a dirty scheduler and vice versa (for normal
schedulers). This is the same both before and after PR-2093. Since dirty
schedulers aren't "lost" after PR-2093 progress of such processes will
happen earlier which of course change the behavior, but that is due to
the work load.
Regards,
Rickard
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
After more testing with 21.2.3, it appears that the behavior we are seeing, with dirty schedulers going to sleep, occurs only when the dirty schedulers run on CPU hyperthreads. On our test instance, I disabled hyperthreading and used +SP 50:50, and the problem has gone away. This gives our NIF workload and non-NIF workload an even share of CPU resources, which works well for our purposes.
It's unclear to us why hyperthreading causes this behavior, but we do wonder if it may have something to do with AWS virtualization. In any case, thanks to everyone for the assistance!
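For anyone else experimenting with the split, the effective scheduler configuration can be inspected, and the online counts adjusted, at runtime. A sketch using standard erlang:system_info/1 and erlang:system_flag/2 calls, with 18 chosen purely as an example:

    1> {erlang:system_info(schedulers), erlang:system_info(schedulers_online)}.
    2> {erlang:system_info(dirty_cpu_schedulers), erlang:system_info(dirty_cpu_schedulers_online)}.
    3> erlang:system_info(scheduler_bindings).  %% bindings of the normal schedulers (dirty schedulers are not bound)
    4> erlang:system_flag(schedulers_online, 18),
       erlang:system_flag(dirty_cpu_schedulers_online, 18).
    %% dirty_cpu_schedulers_online can never exceed schedulers_online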
Best,
Jesse
On Wed, Jan 23, 2019 at 1:14 PM Jesse Stimpson <[hidden email]> wrote:
To clarify about our workload, the NIF execution itself is around 1 msec, while the data on which it operates represents 10 msec of audio. Apologies if my last message was unclear.
Out of convenience, we're using the open source WebRTC project to take advantage of their built in PLC, FEC, Opus, etc. The project is written in C++, so we have integrated with it via NIF. Unfortunately, re-writing using a yielding NIF, or re-writing to Erlang is not as straightforward as we would like. Although I admit it would alleviate our scheduling issues.
We'll continue our testing with 21.2.3 and report back if there appear to be any other leads.
Thanks, Jesse
why do you do it via nif?
On Wed, Jan 16, 2019 at 6:00 PM Jesse Stimpson < [hidden email]> wrote: It's possible that during our tests the utilization spike was masked by the collapse issue fixed in the recent PRs. Is there any other analysis I can provide on the utilization spike/sleep behavior we're seeing, or any other debugging or code reading you recommend? As far as I can tell, there's nothing about our workload that would cause periodic behavior like this. The application is slinging RTP audio via udp to remote endpoints at a 20 msec ptime. Each function call for the NIF in question adds 10 msec of audio to the WebRTC buffer.
As always, the level of support from the Erlang community is second to none. Thanks to all for your time!
Jesse
On 2019-01-15 23:11, Jesse Stimpson wrote:
> Behavior of the schedulers appears to have the same issue with 2093 patch.
>
> But I did notice something new in the msacc output. There is a very
> brief period, at approx the same time as the normal schedulers usage
> spikes, where all the dirty cpu schedulers have a significant sleep
> time. I've included timestamped excerpts below, starting with the
> increase in dirty cpu sleep, and ending with a "steady state" utilization.
>
We just released OTP-21.2.3 containing PR-2093.
I don't think PR-2093 cause the spikes. This change does not affect how
work is moved between normal and dirty schedulers, only prevents the
"loss" of dirty schedulers.
If a process is scheduled on a dirty scheduler it wont make progress
until it has executed on a dirty scheduler and vice versa (for normal
schedulers). This is the same both before and after PR-2093. Since dirty
schedulers aren't "lost" after PR-2093 progress of such processes will
happen earlier which of course change the behavior, but that is due to
the work load.
Regards,
Rickard
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
--
Jesse Stimpson | Site Reliability Engineering | m: 919-995-0424 | RepublicWireless.com
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions