Interesting benchmark performance (was RE: Pitiful benchmark performance)

Interesting benchmark performance (was RE: Pitiful benchmark performance)

Sean Hinde-2
> Maybe the time has come to adaptively adjust the number of reductions
> before yielding (ie, rescheduling)?

Perhaps :)

> Here is a portable, straightforward approach: instead of compiling in
> a constant number of reductions, the number of remaining reductions
> should be loaded from the process structure when checking for
> yielding.

As far as I can make out, the current behaviour based on the number of
reductions generally works very well when, each time the system is called
upon to do some work, there are at least 2000 reductions to be done. One
possible improvement (which I mailed before, but it got lost) could be:

<suggestion>

The main emulator loop appears to be the code at the very end of sys.c. This
calls schedule() and waits for it to return before any i/o is checked.

schedule() in erl_process.c will *only* return when INPUT_REDUCTIONS have
all been used up.

As far as I can make out the only reason the echo/server benchmark doesn't
deadlock is that there is enough going on in the background in an otherwise
quiescent system to eventually use up the 2000 function calls required to
check i/o.

Maybe schedule() could be modified so that, in addition to returning after
INPUT_REDUCTIONS, it keeps track of activity in each pass through all the
processes and returns if no activity has been detected for a whole pass.
This should ensure that responsiveness is not compromised, while not adding
significantly to the context-switching load.

</suggestion>
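
A minimal sketch of that suggestion in C (hypothetical throughout:
pick_next_process(), execute_process() and the process struct are stand-ins
for whatever erl_process.c really uses; only the INPUT_REDUCTIONS idea is
from this thread):

    #define INPUT_REDUCTIONS 2000

    struct process;
    struct process *pick_next_process(void);  /* NULL marks end of a pass */
    int execute_process(struct process *p);   /* returns reductions used */

    int schedule(void)
    {
        int reds_left = INPUT_REDUCTIONS;
        int ran_something = 0;

        for (;;) {
            struct process *p = pick_next_process();

            if (p == NULL) {
                /* End of one pass over the run queue: if nothing ran,
                 * return early so the caller can check i/o instead of
                 * waiting for the full reduction budget to drain. */
                if (!ran_something)
                    return reds_left;
                ran_something = 0;
                continue;
            }

            reds_left -= execute_process(p);
            ran_something = 1;

            if (reds_left <= 0)
                return 0;    /* budget used up: time to check i/o */
        }
    }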

Since then I have tried to make the echo server benchmark run faster by
adjusting the INPUT_REDUCTIONS and CONTEXT_REDS values and recompiling the
emulator, with absolutely no success. The only thing I have found to work is
running a busy loop in the background!!!


> The interesting part, then, is deciding how many reductions you get
> when you're scheduled. A simple approach is to permit the system to
> set the reductions-per-yield at runtime (per process or for the entire
> node), by using a BIF. But this must be supplemented by some way to
> measure activity, so that the decision can be made systematically.
> (Alternatively, one could take the approach that reductions-per-yield
> is set _only_ inside the runtime, to avoid messing around with BIFs.)

The simplest method could be to make it a beam startup parameter rather than
a constant. Then for weird applications the value could easily be tuned for
max performance.
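
As a sketch of what that might look like (the +sched_reds flag and all the
names below are invented for illustration; they are not real beam options):

    #include <stdlib.h>
    #include <string.h>

    static int context_reds = 2000;  /* default quantum, tunable at boot */

    void parse_boot_args(int argc, char **argv)
    {
        int i;
        for (i = 1; i < argc; i++)
            if (strcmp(argv[i], "+sched_reds") == 0 && i + 1 < argc)
                context_reds = atoi(argv[++i]);
    }

    /* The yield check then loads its budget from a variable (or from the
     * process structure, as quoted above) instead of using a compiled-in
     * constant: */
    struct process { int fcalls; };    /* fcalls: reductions remaining */

    void start_slice(struct process *p)
    {
        p->fcalls = context_reds;
    }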

> A second, orthogonal, topic to consider is how well a "reduction"
> corresponds to a time tick. A reduction can vary quite a bit in the
> amount of time it requires, because of BIFs: today, there are ways to
> bump reductions when a long-running BIF begins.
>
> Another approach to yielding might be to measure the available time
> slice in hardware cycles, rather than procedure calls. All desktop
> processors have cycle counters, for example, so it is viable for a
> wide range of systems. Unfortunately, the counters are often somewhat
> messy to work with.

It could be an interesting exercise. There appears to have been a large
amount of effort put into all the BIFs to try to make them as fair as
possible. Possibly this is heavier than making them all interruptible based
on a time slice??
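
For what it's worth, reading the counter itself is a one-liner on x86; the
messy part is elsewhere (frequency scaling, drift between CPUs). A hedged
sketch, with an invented SLICE_CYCLES quantum:

    #include <stdint.h>

    static inline uint64_t read_cycles(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

    #define SLICE_CYCLES 2000000ULL  /* hypothetical: ~1ms on a 2GHz CPU */

    int slice_expired(uint64_t slice_start)
    {
        return read_cycles() - slice_start >= SLICE_CYCLES;
    }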

Still perplexed

- Sean







Interesting benchmark performance (was RE: Pitiful benchmark performance)

Per Hedeland-4
Sean Hinde <Sean.Hinde> wrote:
>The main emulator loop appears to be the code at the very end of sys.c. This
>calls schedule() and waits for it to return before any i/o is checked.
>
>schedule() in erl_process.c will *only* return when INPUT_REDUCTIONS have
>all been used up.

Of course not, that would be totally broken - when no processes are
runnable, it will return immediately:

        case 0:                 /* No process at all */
            return 0;

This will happen when all processes are sitting in 'receive', which is
in some sense the "normal" case. Note also that this will cause the
subsequent poll()/select() to be *blocking* (until the next scheduled
timeout, basically) instead of polling.
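
In outline, the loop at the end of sys.c then behaves roughly like this
(names invented; only the block-versus-poll distinction is the point):

    int schedule(void);           /* returns 0 when nothing is runnable */
    void check_io(int may_block);

    void emulator_loop(void)
    {
        for (;;) {
            int runnable = schedule();

            /* Nothing runnable: block in poll()/select() until the next
             * timeout or fd event. Otherwise just poll, since Erlang
             * work is already waiting. */
            check_io(runnable == 0);
        }
    }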

>As far as I can make out the only reason the echo/server benchmark doesn't
>deadlock is that there is enough going on in the background in an otherwise
>quiescent system to eventually use up the 2000 function calls required to
>check i/o.

Unless your code is doing it, there is basically nothing "going on in
the background". I.e. if there is nothing to do, the whole runtime will
sit in a blocking poll()/select() call (you can easily verify this by
using 'strace' or equivalent on the beam process), which should provide
optimal responsiveness to I/O.

So, while the improved benchmark performance with a background busy loop
is "interesting", it seems to me that the explanations so far are on the
wrong track - unfortunately I don't have a right track to suggest, but
by all means keep looking. :-) I'm pretty sure it isn't due to any
intentional optimization for a "busy system", though.

--Per Hedeland
per



Interesting benchmark performance (was RE: Pitiful benchmark performance)

Thomas Lindgren-2
In reply to this post by Sean Hinde-2

> The simplest method could be to make it a beam startup parameter rather than
> a constant. Then for weird applications the value could easily be tuned for
> max performance.

This could be sufficient if the system behaves in roughly the same way
all the time (eg, amount of concurrency or I/O is roughly the same),
though tuning (and possibly retuning) the system could be a chore if
you have to restart it to change the magic number.

(If performance is reasonably robust over #reductions, the need to
change the number is probably small, though.)

> It could be an interesting exercise. There appears to have been a large
> amount of effort put into all the BIFs to try to make them as fair as
> possible. Possibly this is heavier than making them all interruptible based
> on a time slice??

Basically, the problem is that a reduction need not be well-correlated
to a fixed elapsed time. When and if this is a problem, a cycle-based
time slice could be (more) useful.

Having BIFs be non-interruptible (or composed of "atomic pieces") is,
however, a nice property, and I think we should keep it that way as far
as possible. Basing rescheduling on elapsed cycles means we would at
least switch "soon after" completing a long-running BIF, rather than
after completing a possibly large number of (costly?) reductions.
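
A sketch of that ordering, reusing the hypothetical read_cycles() and
SLICE_CYCLES from the counter sketch earlier in the thread (call_bif() and
yield() are stand-ins, not the real dispatch code):

    #include <stdint.h>

    uint64_t read_cycles(void);        /* see the rdtsc sketch above */
    #define SLICE_CYCLES 2000000ULL    /* hypothetical quantum */

    struct process;
    void call_bif(struct process *p);  /* may run for a long time */
    void yield(struct process *p);

    void dispatch(struct process *p, uint64_t slice_start)
    {
        call_bif(p);

        /* The cycle test catches the overrun as soon as the BIF
         * returns, however few reductions it was charged. */
        if (read_cycles() - slice_start >= SLICE_CYCLES)
            yield(p);
    }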

Digression: one approach for dynamically modifying the schedule in
regular Erlang could be the following. Assume we have a BIF
CostlyOp(). We can then wrap it as:

       ...
       T1 = erlang:now(),
       CostlyOp(),     %% CostlyOp is bound to a fun wrapping the BIF
       T2 = erlang:now(),
       %% charge this process for the wall-clock time the BIF consumed
       erlang:bump_reductions(time_to_reds(time_sub(T2, T1))),
       ...

where time_sub/2 computes the elapsed time between T1 and T2, and
time_to_reds/1 converts elapsed time into 'average reductions'. The
drawbacks are code bloat, extra execution cost, imprecision (eg, no
knowledge of how much time remains in the current slice) and the need to
manually place these wrappers.

                         Thomas
--
Thomas Lindgren thomas+junk
Alteon WebSystems