OTP 22.1 socket.erl somehow breaks message delivery or scheduler

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Andreas Schultz-3
Hi,

After converting an application to socket.erl in OTP 22.1, the test suites started to fail with random timeouts. It took me a while to figure out that gen_server:calls arrived multiple seconds late in the server.

I have a demonstration at https://gist.github.com/RoadRunnr/311a7679fff6fbdf367c455b960f1ba8. It implements a simple UDP echo server with socket.erl. The client uses gen_udp to send messages and wait for the response.
The client also sends Erlang ping message to the server and expects to get a pong answer back. The socket.erl based server is supposed to not block (and as far as I can tell, it does not), it therefore should be able to answer the Erlang ping message all the time.
There are also some simple busy loop process running to get some load. Without them the problem is not reproducible.

The sample is failing in about 20% off the test runs, when it does the output is something like:

$ ~/stest.escript
Server Pid <0.78.0>
Server Addr #{addr => {127,0,0,1},family => inet,port => 38959}
ping timeout
           round trip    Clnt/Srvr    Srvr/Clnt     ProcPing
...
      85: ******** ns, ******** ns,    57675 ns,    42332 ns
...

The failure happens because a 'ping' message is not see in time by the receive clause in the server process. It seems that either the process is not scheduled for some time (multiple seconds), or the scanning of the mailbox is missing the message.

I have ruled out that the UDP messages are being dropped, otherwise the clients gen_udp:recv would never return.

Does anyone have a clue what might cause this? Or point out where my sample is broken.

Many thanks
Andreas

--

Andreas Schultz

 


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Björn Gustavsson-4
Thanks for reporting this issue.

We have investigated the issue and found that it is a bug in the compiler.

https://bugs.erlang.org/browse/ERL-1076

https://github.com/erlang/otp/pull/2434

/Björn

On Wed, Oct 23, 2019 at 3:53 PM Andreas Schultz
<[hidden email]> wrote:

>
> Hi,
>
> After converting an application to socket.erl in OTP 22.1, the test suites started to fail with random timeouts. It took me a while to figure out that gen_server:calls arrived multiple seconds late in the server.
>
> I have a demonstration at https://gist.github.com/RoadRunnr/311a7679fff6fbdf367c455b960f1ba8. It implements a simple UDP echo server with socket.erl. The client uses gen_udp to send messages and wait for the response.
> The client also sends Erlang ping message to the server and expects to get a pong answer back. The socket.erl based server is supposed to not block (and as far as I can tell, it does not), it therefore should be able to answer the Erlang ping message all the time.
> There are also some simple busy loop process running to get some load. Without them the problem is not reproducible.
>
> The sample is failing in about 20% off the test runs, when it does the output is something like:
>
> $ ~/stest.escript
> Server Pid <0.78.0>
> Server Addr #{addr => {127,0,0,1},family => inet,port => 38959}
> ping timeout
>            round trip    Clnt/Srvr    Srvr/Clnt     ProcPing
> ...
>       85: ******** ns, ******** ns,    57675 ns,    42332 ns
> ...
>
> The failure happens because a 'ping' message is not see in time by the receive clause in the server process. It seems that either the process is not scheduled for some time (multiple seconds), or the scanning of the mailbox is missing the message.
>
> I have ruled out that the UDP messages are being dropped, otherwise the clients gen_udp:recv would never return.
>
> Does anyone have a clue what might cause this? Or point out where my sample is broken.
>
> Many thanks
> Andreas
>
> --
>
> Andreas Schultz
>
>
>
> _______________________________________________
> erlang-questions mailing list
> [hidden email]
> http://erlang.org/mailman/listinfo/erlang-questions



--
Björn Gustavsson, Erlang/OTP, Ericsson AB
Reply | Threaded
Open this post in threaded view
|

Re: OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Andreas Schultz-3
Am Mo., 28. Okt. 2019 um 15:22 Uhr schrieb Björn Gustavsson <[hidden email]>:
Thanks for reporting this issue.

We have investigated the issue and found that it is a bug in the compiler.

wow. I have to say I'm a bit scared of OTP 22.x at the moment because of all the "bugs in the compiler" changes lately.
Did the rewrite of the compiler in 22.0 cause all this or was it present before and just went unnoticed?

Anyway, many thanks for finding this,
Andreas


https://bugs.erlang.org/browse/ERL-1076

https://github.com/erlang/otp/pull/2434

/Björn

On Wed, Oct 23, 2019 at 3:53 PM Andreas Schultz
<[hidden email]> wrote:
>
> Hi,
>
> After converting an application to socket.erl in OTP 22.1, the test suites started to fail with random timeouts. It took me a while to figure out that gen_server:calls arrived multiple seconds late in the server.
>
> I have a demonstration at https://gist.github.com/RoadRunnr/311a7679fff6fbdf367c455b960f1ba8. It implements a simple UDP echo server with socket.erl. The client uses gen_udp to send messages and wait for the response.
> The client also sends Erlang ping message to the server and expects to get a pong answer back. The socket.erl based server is supposed to not block (and as far as I can tell, it does not), it therefore should be able to answer the Erlang ping message all the time.
> There are also some simple busy loop process running to get some load. Without them the problem is not reproducible.
>
> The sample is failing in about 20% off the test runs, when it does the output is something like:
>
> $ ~/stest.escript
> Server Pid <0.78.0>
> Server Addr #{addr => {127,0,0,1},family => inet,port => 38959}
> ping timeout
>            round trip    Clnt/Srvr    Srvr/Clnt     ProcPing
> ...
>       85: ******** ns, ******** ns,    57675 ns,    42332 ns
> ...
>
> The failure happens because a 'ping' message is not see in time by the receive clause in the server process. It seems that either the process is not scheduled for some time (multiple seconds), or the scanning of the mailbox is missing the message.
>
> I have ruled out that the UDP messages are being dropped, otherwise the clients gen_udp:recv would never return.
>
> Does anyone have a clue what might cause this? Or point out where my sample is broken.
>
> Many thanks
> Andreas
>
> --
>
> Andreas Schultz
>
>
>
> _______________________________________________
> erlang-questions mailing list
> [hidden email]
> http://erlang.org/mailman/listinfo/erlang-questions



--
Björn Gustavsson, Erlang/OTP, Ericsson AB


--

Andreas Schultz

-- 

Principal Engineer

t: +49 391 819099-224

------------------------------- enabling your networks -----------------------------

Travelping GmbH 

Roentgenstraße 13

39108 Magdeburg

Germany

t: +49 391 819099-0

f: +49 391 819099-299

e: [hidden email]

w: https://www.travelping.com/

   
Company registration: Amtsgericht Stendal  Reg. No.: HRB 10578
Geschaeftsfuehrer: Holger Winkelmann VAT ID: DE236673780

 

Reply | Threaded
Open this post in threaded view
|

Re: OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Jaka Bac
Hi,

There were big improvements in compiler that were released with OTP 22. 

If you are interested in details, see

Jaka

On Mon, Oct 28, 2019, 16:39 Andreas Schultz <[hidden email]> wrote:
Am Mo., 28. Okt. 2019 um 15:22 Uhr schrieb Björn Gustavsson <[hidden email]>:
Thanks for reporting this issue.

We have investigated the issue and found that it is a bug in the compiler.

wow. I have to say I'm a bit scared of OTP 22.x at the moment because of all the "bugs in the compiler" changes lately.
Did the rewrite of the compiler in 22.0 cause all this or was it present before and just went unnoticed?

Anyway, many thanks for finding this,
Andreas


https://bugs.erlang.org/browse/ERL-1076

https://github.com/erlang/otp/pull/2434

/Björn

On Wed, Oct 23, 2019 at 3:53 PM Andreas Schultz
<[hidden email]> wrote:
>
> Hi,
>
> After converting an application to socket.erl in OTP 22.1, the test suites started to fail with random timeouts. It took me a while to figure out that gen_server:calls arrived multiple seconds late in the server.
>
> I have a demonstration at https://gist.github.com/RoadRunnr/311a7679fff6fbdf367c455b960f1ba8. It implements a simple UDP echo server with socket.erl. The client uses gen_udp to send messages and wait for the response.
> The client also sends Erlang ping message to the server and expects to get a pong answer back. The socket.erl based server is supposed to not block (and as far as I can tell, it does not), it therefore should be able to answer the Erlang ping message all the time.
> There are also some simple busy loop process running to get some load. Without them the problem is not reproducible.
>
> The sample is failing in about 20% off the test runs, when it does the output is something like:
>
> $ ~/stest.escript
> Server Pid <0.78.0>
> Server Addr #{addr => {127,0,0,1},family => inet,port => 38959}
> ping timeout
>            round trip    Clnt/Srvr    Srvr/Clnt     ProcPing
> ...
>       85: ******** ns, ******** ns,    57675 ns,    42332 ns
> ...
>
> The failure happens because a 'ping' message is not see in time by the receive clause in the server process. It seems that either the process is not scheduled for some time (multiple seconds), or the scanning of the mailbox is missing the message.
>
> I have ruled out that the UDP messages are being dropped, otherwise the clients gen_udp:recv would never return.
>
> Does anyone have a clue what might cause this? Or point out where my sample is broken.
>
> Many thanks
> Andreas
>
> --
>
> Andreas Schultz
>
>
>
> _______________________________________________
> erlang-questions mailing list
> [hidden email]
> http://erlang.org/mailman/listinfo/erlang-questions



--
Björn Gustavsson, Erlang/OTP, Ericsson AB


--

Andreas Schultz

-- 

Principal Engineer

t: +49 391 819099-224

------------------------------- enabling your networks -----------------------------

Travelping GmbH 

Roentgenstraße 13

39108 Magdeburg

Germany

t: +49 391 819099-0

f: +49 391 819099-299

e: [hidden email]

w: https://www.travelping.com/

   
Company registration: Amtsgericht Stendal  Reg. No.: HRB 10578
Geschaeftsfuehrer: Holger Winkelmann VAT ID: DE236673780

 

Reply | Threaded
Open this post in threaded view
|

Re: OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Björn Gustavsson-4
In reply to this post by Andreas Schultz-3
On Mon, Oct 28, 2019 at 4:39 PM Andreas Schultz <[hidden email]> wrote:
Am Mo., 28. Okt. 2019 um 15:22 Uhr schrieb Björn Gustavsson <[hidden email]>:
Thanks for reporting this issue.

We have investigated the issue and found that it is a bug in the compiler.

wow. I have to say I'm a bit scared of OTP 22.x at the moment because of all the "bugs in the compiler" changes lately.
Did the rewrite of the compiler in 22.0 cause all this or was it present before and just went unnoticed?

It is a new bug in OTP 22. This particular
bug was in the new optimization pass for
optimizing receives.

There are more details in our blog, but the short
story is that we had reached a dead end in the
compiler and that it was very time consuming
and error prone to add new optimizations to
the compiler.

Most of the reported compiler bugs in OTP 22
was of the kind that causes the compiler to crash.
Those bugs are annoying but basically harmless in 
that they can't subtly change the meaning of your
program. Unfortunately, a few bugs were of the kind
that the compiler generated incorrect code.

Moving forward, the new compiler architecture
will give us a solid ground for further improvements
of the compiler.

/Björn


--
Björn Gustavsson, Erlang/OTP, Ericsson AB
Reply | Threaded
Open this post in threaded view
|

Re: OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Andreas Schultz-3
Am Di., 29. Okt. 2019 um 07:07 Uhr schrieb Björn Gustavsson <[hidden email]>:
On Mon, Oct 28, 2019 at 4:39 PM Andreas Schultz <[hidden email]> wrote:
Am Mo., 28. Okt. 2019 um 15:22 Uhr schrieb Björn Gustavsson <[hidden email]>:
Thanks for reporting this issue.

We have investigated the issue and found that it is a bug in the compiler.

wow. I have to say I'm a bit scared of OTP 22.x at the moment because of all the "bugs in the compiler" changes lately.
Did the rewrite of the compiler in 22.0 cause all this or was it present before and just went unnoticed?

It is a new bug in OTP 22. This particular
bug was in the new optimization pass for
optimizing receives.

There are more details in our blog, but the short
story is that we had reached a dead end in the
compiler and that it was very time consuming
and error prone to add new optimizations to
the compiler.

I'm sure that over time all the benefits of the new compiler architecture are well worth the price.
But for OTP 22.x it has led to a few, but highly frustrating problems. Even the few incorrect code generation bugs have led to very hard to understand problems for the users.

I'm not sure that scope and impact of this bug here is even fully understood.

My demonstration code used a bare receive, but the code that actually triggered it used a plain gen_server. The result of the bug was that gen_sever:calls seemed to arrive extremely late (multiple seconds).
It would therefore seem that the incorrect code was present in main receive loop of gen_server (and probably also gen_statem and gen_event). This would mean almost all Erlang applications on OTP 22.x could be affected.

The effects might go unnoticed in many tests cases in other projects, until it causes unexplainable failures.

Since all this was introduced in OTP 22, the sensible suggestion for everyone seem to be test OTP 22.x as well as they can, but to stay away from it for production use.

Most of the reported compiler bugs in OTP 22
was of the kind that causes the compiler to crash.
Those bugs are annoying but basically harmless in 
that they can't subtly change the meaning of your
program. Unfortunately, a few bugs were of the kind
that the compiler generated incorrect code.

Moving forward, the new compiler architecture
will give us a solid ground for further improvements
of the compiler.

I'm sure it will and I'm sure the current pain is well worth it.

Regards,
Andreas


/Björn


--
Björn Gustavsson, Erlang/OTP, Ericsson AB

Reply | Threaded
Open this post in threaded view
|

Re: OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Björn Gustavsson-4
On Tue, Oct 29, 2019 at 9:14 AM Andreas Schultz
<[hidden email]> wrote:

>
> I'm sure that over time all the benefits of the new compiler architecture are well worth the price.
> But for OTP 22.x it has led to a few, but highly frustrating problems. Even the few incorrect code generation bugs have led to very hard to understand problems for the users.
>
> I'm not sure that scope and impact of this bug here is even fully understood.
>
> My demonstration code used a bare receive, but the code that actually triggered it used a plain gen_server. The result of the bug was that gen_sever:calls seemed to arrive extremely late (multiple seconds).
> It would therefore seem that the incorrect code was present in main receive loop of gen_server (and probably also gen_statem and gen_event). This would mean almost all Erlang applications on OTP 22.x could be affected.
>
> The effects might go unnoticed in many tests cases in other projects, until it causes unexplainable failures.
>
> Since all this was introduced in OTP 22, the sensible suggestion for everyone seem to be test OTP 22.x as well as they can, but to stay away from it for production use.
>

We have done some more investigating.

The issues you saw are caused by TWO bugs,
the compiler bug and a bug in the run-time
system.

The bug in the compiler caused a position in
the message queue to be saved even when it
was not guaranteed that a receive would be
executed. The only module in OTP that was
hit by this bug was the socket module, and
in this case it would have been harmless
without the other bug.

The bug in the run-time system would cause
the saved position to be used for a receive that
was not supposed to use the saved position.
This bug was introduced in OTP 21, but it is
unlikely that it could ever be triggered by
code emitted by the compiler in OTP 21.

So what does that mean for production use
of OTP 22.x?

We know that OTP 22 is already used in
production. The problems you saw were
caused by the use of the socket module that
was hit by the compiler bug. Unless one has
another module that uses receive in a similar
way to how socket uses it, using gen_server
and the other modules is perfectly safe.

Since the socket module was introduced
in OTP 22 and is still experimental, it
would be advisable to avoid it in production
use for the moment.

The fix for the compiler should be enough
to fix the kind of problems you saw, but we
will of course fix the bug in the run-time
system as well.

/Björn

--
Björn Gustavsson, Erlang/OTP, Ericsson AB
Reply | Threaded
Open this post in threaded view
|

Re: OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Andreas Schultz-3
Hi Björn,

Thanks for the detailed and  thorough explanation. 

Can you point me to the fix for the second bug?

I'm currently chasing what looks very much like the same issue, only the message that is not arriving in time is originating in the inet driver (gen_tcp) this time.
But it could also be something different or me being too stupid. A complete explanation is here http://erlang.org/pipermail/erlang-questions/2019-September/098419.html

Many thanks,
Andreas

Am Di., 29. Okt. 2019 um 12:40 Uhr schrieb Björn Gustavsson <[hidden email]>:
On Tue, Oct 29, 2019 at 9:14 AM Andreas Schultz
<[hidden email]> wrote:
>
> I'm sure that over time all the benefits of the new compiler architecture are well worth the price.
> But for OTP 22.x it has led to a few, but highly frustrating problems. Even the few incorrect code generation bugs have led to very hard to understand problems for the users.
>
> I'm not sure that scope and impact of this bug here is even fully understood.
>
> My demonstration code used a bare receive, but the code that actually triggered it used a plain gen_server. The result of the bug was that gen_sever:calls seemed to arrive extremely late (multiple seconds).
> It would therefore seem that the incorrect code was present in main receive loop of gen_server (and probably also gen_statem and gen_event). This would mean almost all Erlang applications on OTP 22.x could be affected.
>
> The effects might go unnoticed in many tests cases in other projects, until it causes unexplainable failures.
>
> Since all this was introduced in OTP 22, the sensible suggestion for everyone seem to be test OTP 22.x as well as they can, but to stay away from it for production use.
>

We have done some more investigating.

The issues you saw are caused by TWO bugs,
the compiler bug and a bug in the run-time
system.

The bug in the compiler caused a position in
the message queue to be saved even when it
was not guaranteed that a receive would be
executed. The only module in OTP that was
hit by this bug was the socket module, and
in this case it would have been harmless
without the other bug.

The bug in the run-time system would cause
the saved position to be used for a receive that
was not supposed to use the saved position.
This bug was introduced in OTP 21, but it is
unlikely that it could ever be triggered by
code emitted by the compiler in OTP 21.

So what does that mean for production use
of OTP 22.x?

We know that OTP 22 is already used in
production. The problems you saw were
caused by the use of the socket module that
was hit by the compiler bug. Unless one has
another module that uses receive in a similar
way to how socket uses it, using gen_server
and the other modules is perfectly safe.

Since the socket module was introduced
in OTP 22 and is still experimental, it
would be advisable to avoid it in production
use for the moment.

The fix for the compiler should be enough
to fix the kind of problems you saw, but we
will of course fix the bug in the run-time
system as well.

/Björn

--
Björn Gustavsson, Erlang/OTP, Ericsson AB


--

Andreas Schultz

Reply | Threaded
Open this post in threaded view
|

Re: OTP 22.1 socket.erl somehow breaks message delivery or scheduler

Lukas Larsson-8


On Tue, Oct 29, 2019 at 1:07 PM Andreas Schultz <[hidden email]> wrote:
Hi Björn,

Thanks for the detailed and  thorough explanation. 

Can you point me to the fix for the second bug?


It is most likely not the fix we are going to merge as it is too conservative about when the message queue optimization can trigger, but it is correct.

Lukas
 

I'm currently chasing what looks very much like the same issue, only the message that is not arriving in time is originating in the inet driver (gen_tcp) this time.
But it could also be something different or me being too stupid. A complete explanation is here http://erlang.org/pipermail/erlang-questions/2019-September/098419.html

Many thanks,
Andreas

Am Di., 29. Okt. 2019 um 12:40 Uhr schrieb Björn Gustavsson <[hidden email]>:
On Tue, Oct 29, 2019 at 9:14 AM Andreas Schultz
<[hidden email]> wrote:
>
> I'm sure that over time all the benefits of the new compiler architecture are well worth the price.
> But for OTP 22.x it has led to a few, but highly frustrating problems. Even the few incorrect code generation bugs have led to very hard to understand problems for the users.
>
> I'm not sure that scope and impact of this bug here is even fully understood.
>
> My demonstration code used a bare receive, but the code that actually triggered it used a plain gen_server. The result of the bug was that gen_sever:calls seemed to arrive extremely late (multiple seconds).
> It would therefore seem that the incorrect code was present in main receive loop of gen_server (and probably also gen_statem and gen_event). This would mean almost all Erlang applications on OTP 22.x could be affected.
>
> The effects might go unnoticed in many tests cases in other projects, until it causes unexplainable failures.
>
> Since all this was introduced in OTP 22, the sensible suggestion for everyone seem to be test OTP 22.x as well as they can, but to stay away from it for production use.
>

We have done some more investigating.

The issues you saw are caused by TWO bugs,
the compiler bug and a bug in the run-time
system.

The bug in the compiler caused a position in
the message queue to be saved even when it
was not guaranteed that a receive would be
executed. The only module in OTP that was
hit by this bug was the socket module, and
in this case it would have been harmless
without the other bug.

The bug in the run-time system would cause
the saved position to be used for a receive that
was not supposed to use the saved position.
This bug was introduced in OTP 21, but it is
unlikely that it could ever be triggered by
code emitted by the compiler in OTP 21.

So what does that mean for production use
of OTP 22.x?

We know that OTP 22 is already used in
production. The problems you saw were
caused by the use of the socket module that
was hit by the compiler bug. Unless one has
another module that uses receive in a similar
way to how socket uses it, using gen_server
and the other modules is perfectly safe.

Since the socket module was introduced
in OTP 22 and is still experimental, it
would be advisable to avoid it in production
use for the moment.

The fix for the compiler should be enough
to fix the kind of problems you saw, but we
will of course fix the bug in the run-time
system as well.

/Björn

--
Björn Gustavsson, Erlang/OTP, Ericsson AB


--

Andreas Schultz