Mysterious gen_server timeouts in MIX

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Mysterious gen_server timeouts in MIX

John R. Ashmun
I have implemented an emulator for MIX, Donald Knuth's notional 1960's
computer, as an Erlang application, as a way of learning to use Erlang.

As part of the emulator's startup currently, MIX loads Knuth's Program P
into its memory, then jumps to the starting location of Program P.  Program
P calculates the first 500 prime numbers, pops them into a table, and then
prints 10 prime numbers at a time on what it considers to be its line
printer.

The emulator usually runs fine, and when it does, it carries out Program P
perfectly.

Sometimes, however, things get stuck, and I have been unable to learn what
goes wrong.  When there is a problem, I usually see these two error reports:

=ERROR REPORT==== 1-Nov-2013::11:33:54===
** Generic server <0.38.0> terminating
** Last message in was timeout
** When server state == []
** Reason for termination ==
** {timeout,{gen_server,call,[io_controller,{wait_until_not_busy,18}]}}

=ERROR REPORT==== 1-Nov-2013::11:33:54===
** Generic server <0.4095.0> terminating
** Last message in was {'$gen_cast',{write,1995}}
** When Server state == {state,18,<0.4058.0>,24}
** Reason for termination ==
** {timeout,{gen_server,call,[io_controller,{set_ready,18}]}}

The first one seems clearly to tell me that MIX's io_controller
gen_server's API wait_until_not_busy( ) function sent the message that it
should have, but the handle_call( ) never found the readiness status for
MIX device 18, its line printer, not to be busy, and then the gen_server
was timed out.  This should never happen, but then there is the second
report to consider.

The second report confuses me.  Program P attempts to write a title (FIRST
FIVE HUNDRED PRIMES) to MIX's line printer.  This is what the $gen_cast is
attempting to do (1995 is the MIX address of the first character of the
title).  A MIX I/O operation (in this case, the write) begins by waiting
until the I/O unit that's addressed is not busy.  Apparently that's not
true over the timeout period on this run; here is the mystery:  the
emulator's I/O device gen_servers are all initialized with their readiness
state set to ready, their readiness is changed to busy only by a request to
perform an I/O operation, and the operation is actually carried out by an
io_operation gen_server that is started by the I/O device gen_server.  That
io_operation gen_server is the entity that requests the io_controller to
set the readiness of the device back to ready, once the io_operation has
performed the I/O, or the output, in this case, to the file that plays the
role of the line printer, in this case.  It shouldn't be possible for
io_controller:set_ready( Device ) to be called before io_controller:write(
Device, Address ).  Is this what the error report is telling me happened?

All right, you say, you've programmed your emulator incompetently, and I
suppose I may have.  Why, then, does all this work perfectly, over and over
again, on another machine, or even on the same machine, before going into
failure mode?  (I have been unable to identify anything causing things to
start failing nor to start working, and I'm not making changes other than
sometimes running using application:start( 'MIX' ) and sometimes booting
MIX as an Erlang release -- sometimes compiling using +debug_info (or not)
has seemed to change from success to failure (or, equally likely, the
opposite).

I need advice, please.

Regards,
John Ashmun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20131101/cd0db89d/attachment.html>

Reply | Threaded
Open this post in threaded view
|

Mysterious gen_server timeouts in MIX

Bob Ippolito-2
gen_server has a default timeout of 2 seconds, so if your call takes long
enough for whatever reason this can happen. Try changing all if your
gen_server calls to explicitly specify a longer or infinite timeout.

On Friday, November 1, 2013, John R. Ashmun wrote:

> I have implemented an emulator for MIX, Donald Knuth's notional 1960's
> computer, as an Erlang application, as a way of learning to use Erlang.
>
> As part of the emulator's startup currently, MIX loads Knuth's Program P
> into its memory, then jumps to the starting location of Program P.  Program
> P calculates the first 500 prime numbers, pops them into a table, and then
> prints 10 prime numbers at a time on what it considers to be its line
> printer.
>
> The emulator usually runs fine, and when it does, it carries out Program P
> perfectly.
>
> Sometimes, however, things get stuck, and I have been unable to learn what
> goes wrong.  When there is a problem, I usually see these two error reports:
>
> =ERROR REPORT==== 1-Nov-2013::11:33:54===
> ** Generic server <0.38.0> terminating
> ** Last message in was timeout
> ** When server state == []
> ** Reason for termination ==
> ** {timeout,{gen_server,call,[io_controller,{wait_until_not_busy,18}]}}
>
> =ERROR REPORT==== 1-Nov-2013::11:33:54===
> ** Generic server <0.4095.0> terminating
> ** Last message in was {'$gen_cast',{write,1995}}
> ** When Server state == {state,18,<0.4058.0>,24}
> ** Reason for termination ==
> ** {timeout,{gen_server,call,[io_controller,{set_ready,18}]}}
>
> The first one seems clearly to tell me that MIX's io_controller
> gen_server's API wait_until_not_busy( ) function sent the message that it
> should have, but the handle_call( ) never found the readiness status for
> MIX device 18, its line printer, not to be busy, and then the gen_server
> was timed out.  This should never happen, but then there is the second
> report to consider.
>
> The second report confuses me.  Program P attempts to write a title (FIRST
> FIVE HUNDRED PRIMES) to MIX's line printer.  This is what the $gen_cast is
> attempting to do (1995 is the MIX address of the first character of the
> title).  A MIX I/O operation (in this case, the write) begins by waiting
> until the I/O unit that's addressed is not busy.  Apparently that's not
> true over the timeout period on this run; here is the mystery:  the
> emulator's I/O device gen_servers are all initialized with their readiness
> state set to ready, their readiness is changed to busy only by a request to
> perform an I/O operation, and the operation is actually carried out by an
> io_operation gen_server that is started by the I/O device gen_server.  That
> io_operation gen_server is the entity that requests the io_controller to
> set the readiness of the device back to ready, once the io_operation has
> performed the I/O, or the output, in this case, to the file that plays the
> role of the line printer, in this case.  It shouldn't be possible for
> io_controller:set_ready( Device ) to be called before io_controller:write(
> Device, Address ).  Is this what the error report is telling me happened?
>
> All right, you say, you've programmed your emulator incompetently, and I
> suppose I may have.  Why, then, does all this work perfectly, over and over
> again, on another machine, or even on the same machine, before going into
> failure mode?  (I have been unable to identify anything causing things to
> start failing nor to start working, and I'm not making changes other than
> sometimes running using application:start( 'MIX' ) and sometimes booting
> MIX as an Erlang release -- sometimes compiling using +debug_info (or not)
> has seemed to change from success to failure (or, equally likely, the
> opposite).
>
> I need advice, please.
>
> Regards,
> John Ashmun
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20131101/5e861eb1/attachment.html>

Reply | Threaded
Open this post in threaded view
|

Mysterious gen_server timeouts in MIX

Vance Shipley-2
On Fri, Nov 01, 2013 at 10:08:01PM -0700, Bob Ippolito wrote:
}  gen_server has a default timeout of 2 seconds,

The default is actually 5 seconds.


--
        -Vance

Reply | Threaded
Open this post in threaded view
|

Mysterious gen_server timeouts in MIX

Vance Shipley-2
In reply to this post by John R. Ashmun
On Fri, Nov 01, 2013 at 09:21:29PM -0700, John R. Ashmun wrote:
}  sometimes running using application:start( 'MIX' ) and sometimes booting
}  MIX as an Erlang release -- sometimes compiling using +debug_info (or not)
}  has seemed to change from success to failure (or, equally likely, the
}  opposite).

If your program is correct it's just too busy to respond before the
timeout.  In my experience however when you get to the point where a
message roundtrip is over 5 seconds you are in bad design territory.

You may have a race condition.  You say you have several gen_server
behaviour processes communicating with each other.  Hopefully this
communication is accomplished asynchronously.  Whenever you use a
receive statement, which you effectively do when you call a function
like gen_server:call/2, your process stops being a generic behaviour
until the call is completed.  No system messages or any others will
be processed during this time.

Here is an example of a pathological program which will eventually
deadlock:

-module(race).
-export([init/1, handle_info/2, handle_call/3, terminate/2]).
-record(state, {name, count}).

init([Name]) ->
        State = #state{name = Name, count = 0},
        random:seed(now()),
        Timeout = random:uniform(4000) + 1000,
        {ok, State, Timeout}.

handle_info(timeout, #state{name = Name, count = Count} = State) ->
        NewCount = gen_server:call(Name, Count),
        {noreply, State#state{count = NewCount}, random:uniform(1000)}.

handle_call(N, _From, State) ->
        {reply, N + 1, State, 0}.

terminate(_Reason, _State) ->
        ok.

Run it like this:

     1> gen_server:start({local, ping}, race, [pong], []),
     1> gen_server:start({local, pong}, race, [ping], []).
     {ok,<0.35.0>}

... and some time later:

     =ERROR REPORT==== 2-Nov-2013::14:47:10 ===
     ** Generic server pong terminating
     ** Last message in was timeout
     ** When Server state == {state,ping,103446400}
     ** Reason for termination ==
     ** {timeout,{gen_server,call,[ping,103446400]}}
     
     =ERROR REPORT==== 2-Nov-2013::14:48:55 ===
     ** Generic server ping terminating
     ** Last message in was timeout
     ** When Server state == {state,pong,103446400}
     ** Reason for termination ==
     ** {{timeout,{gen_server,call,[ping,103446400]}},
         {gen_server,call,[pong,103446400]}}

Here the processes have called each other a hundred million times
before deadlocking.

--
        -Vance


Reply | Threaded
Open this post in threaded view
|

Mysterious gen_server timeouts in MIX

Vance Shipley-2
On Sat, Nov 02, 2013 at 03:15:11PM +0530, Vance Shipley wrote:
}  Here is an example of a pathological program which will eventually
}  deadlock:
...
}  handle_call(N, _From, State) ->
}   {reply, N + 1, State, 0}.

Small correction, the previous example probably timed out because my
MacBookPro went to sleep.  The small change here demonstrates the
problem quicker and more reliably:

        handle_call(N, _From, State) ->
                {reply, N + 1, State, random:uniform(100)}.

The small random timeout greatly increases the chance of entering a race.

--
        -Vance

Reply | Threaded
Open this post in threaded view
|

Mysterious gen_server timeouts in MIX

John R. Ashmun
In reply to this post by Bob Ippolito-2
Thanks, Bob.

I blush to write that I had considered this approach, but then I decided to
try to learn why a gen_server timeout could ever happen to the emulator
before making any changes.

This is an emulation of a really simple CPU:  It fetches an instruction,
increments its program counter, carries out the instruction, and repeats.
There are not a lot of moving parts, and everything except I/O instructions
is synchronous.

Regards,
John Ashmun



On Fri, Nov 1, 2013 at 10:08 PM, Bob Ippolito <bob> wrote:

> gen_server has a default timeout of 2 seconds, so if your call takes long
> enough for whatever reason this can happen. Try changing all if your
> gen_server calls to explicitly specify a longer or infinite timeout.
>
> On Friday, November 1, 2013, John R. Ashmun wrote:
>
>> I have implemented an emulator for MIX, Donald Knuth's notional 1960's
>> computer, as an Erlang application, as a way of learning to use Erlang.
>>
>> As part of the emulator's startup currently, MIX loads Knuth's Program P
>> into its memory, then jumps to the starting location of Program P.  Program
>> P calculates the first 500 prime numbers, pops them into a table, and then
>> prints 10 prime numbers at a time on what it considers to be its line
>> printer.
>>
>> The emulator usually runs fine, and when it does, it carries out Program
>> P perfectly.
>>
>> Sometimes, however, things get stuck, and I have been unable to learn
>> what goes wrong.  When there is a problem, I usually see these two error
>> reports:
>>
>> =ERROR REPORT==== 1-Nov-2013::11:33:54===
>> ** Generic server <0.38.0> terminating
>> ** Last message in was timeout
>> ** When server state == []
>> ** Reason for termination ==
>> ** {timeout,{gen_server,call,[io_controller,{wait_until_not_busy,18}]}}
>>
>> =ERROR REPORT==== 1-Nov-2013::11:33:54===
>> ** Generic server <0.4095.0> terminating
>> ** Last message in was {'$gen_cast',{write,1995}}
>> ** When Server state == {state,18,<0.4058.0>,24}
>> ** Reason for termination ==
>> ** {timeout,{gen_server,call,[io_controller,{set_ready,18}]}}
>>
>> The first one seems clearly to tell me that MIX's io_controller
>> gen_server's API wait_until_not_busy( ) function sent the message that it
>> should have, but the handle_call( ) never found the readiness status for
>> MIX device 18, its line printer, not to be busy, and then the gen_server
>> was timed out.  This should never happen, but then there is the second
>> report to consider.
>>
>> The second report confuses me.  Program P attempts to write a title
>> (FIRST FIVE HUNDRED PRIMES) to MIX's line printer.  This is what the
>> $gen_cast is attempting to do (1995 is the MIX address of the first
>> character of the title).  A MIX I/O operation (in this case, the write)
>> begins by waiting until the I/O unit that's addressed is not busy.
>> Apparently that's not true over the timeout period on this run; here is the
>> mystery:  the emulator's I/O device gen_servers are all initialized with
>> their readiness state set to ready, their readiness is changed to busy only
>> by a request to perform an I/O operation, and the operation is actually
>> carried out by an io_operation gen_server that is started by the I/O device
>> gen_server.  That io_operation gen_server is the entity that requests the
>> io_controller to set the readiness of the device back to ready, once the
>> io_operation has performed the I/O, or the output, in this case, to the
>> file that plays the role of the line printer, in this case.  It shouldn't
>> be possible for io_controller:set_ready( Device ) to be called before
>> io_controller:write( Device, Address ).  Is this what the error report is
>> telling me happened?
>>
>> All right, you say, you've programmed your emulator incompetently, and I
>> suppose I may have.  Why, then, does all this work perfectly, over and over
>> again, on another machine, or even on the same machine, before going into
>> failure mode?  (I have been unable to identify anything causing things to
>> start failing nor to start working, and I'm not making changes other than
>> sometimes running using application:start( 'MIX' ) and sometimes booting
>> MIX as an Erlang release -- sometimes compiling using +debug_info (or not)
>> has seemed to change from success to failure (or, equally likely, the
>> opposite).
>>
>> I need advice, please.
>>
>> Regards,
>> John Ashmun
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20131103/5c4247f5/attachment.html>

Reply | Threaded
Open this post in threaded view
|

Mysterious gen_server timeouts in MIX

John R. Ashmun
In reply to this post by Vance Shipley-2
Thank you, Vance.

I have some experience with multithreaded applications and avoiding race
conditions.  I appreciate your providing examples.

I probably ought to have provided a few more details, and I would like to
stress that I see the Erlang VM/MIX emulator combination go into and come
out of the regime of failing via gen_server timeouts without my having made
any changes at all to my Erlang source files:  I haven't changed anything
in any of them since October 24.

I see or have seen this baffling behavior on two machines using werl:

HP Pavilion desktop with AMD A8-3800 APU w/8Gb RAM -- Microsoft Windows 7
Home Premium & erts5.9

Newer Dell laptop with Intel(R) Core(TM) i3-3217U w/6Gb RAM -- Microsoft
Windows 8 & erts5.10.2

At the moment, the emulator runs as intended on the desktop.  As I
mentioned, it runs Program P, and in so doing, it consumes roughly 4
seconds on a wall clock.  It has been doing this on this machine only since
October 24, which was when I commented out several io:format( ) calls which
had been announcing what instruction it was currently executing.  Before I
commented those out, it would run successfully sometimes and time out
sometimes.  It hasn't exhibited the timeouts since then.  I run it here
using -boot & -config options when starting werl.

On the laptop, when the emulation of Program P succeeds, it consumes
roughly 6 seconds on the wall clock.  It began running successfully on
October 26, without my having made any changes to sources, after having
consistently failed with timeouts for two or three days of desultory
attempts.  For no very good reason, and because I couldn't think of any
source code changes I could usefully make, I simply recompiled the sources
using +debug_info, and it subsequently succeeded.  I then took a trip to
Las Vegas, Nevada, and when I came back, seeing that the session windows
were still open, I ran the emulator again.  On that attempt, the timeouts
happened.  Again for no very good reason, I recompiled things, this time
without +debug_info, and now the emulator doesn't time out.  On the laptop
I start it using application:start( 'MIX' ). in the Erlang shell; I haven't
written the .rel file for this installation yet.

It looks to me as if the Erlang VM is starving for a resource sometimes.
Anthony Molinaro recently reported here a mysterious freeze while trying to
access a file.  The MIX emulator opens a file for each of its 20 I/O
devices.  I don't know that I am encountering something similar, not am I
certain what such a freeze would imply about the behavior of this emulator,
although it could lead to a gen_server timeout, I imagine.

Regards,
John Ashmun


On Sat, Nov 2, 2013 at 4:35 AM, Vance Shipley <vances> wrote:

> On Sat, Nov 02, 2013 at 03:15:11PM +0530, Vance Shipley wrote:
> }  Here is an example of a pathological program which will eventually
> }  deadlock:
> ...
> }  handle_call(N, _From, State) ->
> }       {reply, N + 1, State, 0}.
>
> Small correction, the previous example probably timed out because my
> MacBookPro went to sleep.  The small change here demonstrates the
> problem quicker and more reliably:
>
>         handle_call(N, _From, State) ->
>                 {reply, N + 1, State, random:uniform(100)}.
>
> The small random timeout greatly increases the chance of entering a race.
>
> --
>         -Vance
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20131103/31f96b48/attachment.html>