BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Ulf Wiger-2
I had the brilliant idea of using application permissions for a particular use case. This seemed to work perfectly, until I ran `rebar3 shell`, and spotted some disturbing behavior.

The bug, apparently, lies in that `application:ensure_all_started(A)` ends up busy-looping if A depends on B, and permission(B) -> false. What's worse, for each call to start(B), the application controller notices the permission flag, returns `ok` and inserts an entry in its internal `start_p_false` list. This amounts to a memory leak.

I commented it in a tweet, then decided to try to find the source, esp. since I suspected `application:ensure_all_started/1`.


In short, if permission(B) -> false, what happens is:
start(A) -> {error, {not_started, B}}
start(B) -> ok
start(A) -> {error,  {not_started, B}}
... [repeat endlessly]

Now, it could be fixed by adding a permission check in the looping function, but this raises the question of what should happen in the above case. Three alternatives:

1. ensure_all_started(A) returns {error, {not_permitted, B}}, or something
2. the call hangs until the flag(s) change, but start(B) is only called once.
3. Warn against the use of permissions in the docs, and deprecate them.

I'm assuming that most of you may not even know about permissions. They were introduced back in 1996-97 (I believe), when I and Martin Björklund were going back and forth on how to support distributed applications and cluster control. Eventually, this led to dist_ac and the protocol being defined, so that users could write a controller app taking control of an application and giving instructions on where it should run. In the AXD301, this was done by the RCM application. I believe I talked about it at the EUC 1997, but it's hard to find information about that on the web. :)

Anyway, permissions were left in the API, and ARE documented.

Thoughts?

BR,
Ulf Wiger
Reply | Threaded
Open this post in threaded view
|

Re: BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Mikael Pettersson-5
On Sat, Mar 20, 2021 at 8:46 AM Ulf Wiger <[hidden email]> wrote:

>
> I had the brilliant idea of using application permissions for a particular use case. This seemed to work perfectly, until I ran `rebar3 shell`, and spotted some disturbing behavior.
>
> The bug, apparently, lies in that `application:ensure_all_started(A)` ends up busy-looping if A depends on B, and permission(B) -> false. What's worse, for each call to start(B), the application controller notices the permission flag, returns `ok` and inserts an entry in its internal `start_p_false` list. This amounts to a memory leak.
>
> I commented it in a tweet, then decided to try to find the source, esp. since I suspected `application:ensure_all_started/1`.
>
> https://twitter.com/uwiger/status/1372944356781531136
>
> In short, if permission(B) -> false, what happens is:
> start(A) -> {error, {not_started, B}}
> start(B) -> ok
> start(A) -> {error,  {not_started, B}}
> ... [repeat endlessly]
>
> Now, it could be fixed by adding a permission check in the looping function, but this raises the question of what should happen in the above case. Three alternatives:
>
> 1. ensure_all_started(A) returns {error, {not_permitted, B}}, or something
> 2. the call hangs until the flag(s) change, but start(B) is only called once.
> 3. Warn against the use of permissions in the docs, and deprecate them.
>
> I'm assuming that most of you may not even know about permissions. They were introduced back in 1996-97 (I believe), when I and Martin Björklund were going back and forth on how to support distributed applications and cluster control. Eventually, this led to dist_ac and the protocol being defined, so that users could write a controller app taking control of an application and giving instructions on where it should run. In the AXD301, this was done by the RCM application. I believe I talked about it at the EUC 1997, but it's hard to find information about that on the web. :)
>
> Anyway, permissions were left in the API, and ARE documented.
>
> Thoughts?

I know we've used the permissions mechanism occasionally during
maintenance or live upgrades. Off-hand I don't know if we'd want
alternative 1 or 2 (my colleague Daniel Szoboszlay might know more
about this).

/Mikael
Reply | Threaded
Open this post in threaded view
|

Re: BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Ulf Wiger-2
When I started looking closer into this, it would appear as if there is a long-standing bug in the application_controller regarding permissions.

And with "long-standing" I mean that it was there even when Kostis did some Tidier-based cleanup 11 years ago. Kostis didn't introduce it, though.

When servicing a start request, the application_controller, if permission(App) == false, adds a new entry to the `start_p_false` list, i.e. a new entry for each request.

... but when servicing a subsequent {permit_application, App, true}, it uses lists:keydelete/3 to remove the App from the `start_p_false` list.

lists:keydelete/3 obviously only removes the first matching entry.

Earlier in that function, it also only locates the first pending request (or rather, chronologically the last), and uses the `From` in `spawn_starter()'.

The rest of the pending requests should be handled somewhere - likely in `handle_application_started/3`, but aren't.

BR,
Ulf W

On Sat, Mar 20, 2021 at 1:45 PM Mikael Pettersson <[hidden email]> wrote:
On Sat, Mar 20, 2021 at 8:46 AM Ulf Wiger <[hidden email]> wrote:
>
> I had the brilliant idea of using application permissions for a particular use case. This seemed to work perfectly, until I ran `rebar3 shell`, and spotted some disturbing behavior.
>
> The bug, apparently, lies in that `application:ensure_all_started(A)` ends up busy-looping if A depends on B, and permission(B) -> false. What's worse, for each call to start(B), the application controller notices the permission flag, returns `ok` and inserts an entry in its internal `start_p_false` list. This amounts to a memory leak.
>
> I commented it in a tweet, then decided to try to find the source, esp. since I suspected `application:ensure_all_started/1`.
>
> https://twitter.com/uwiger/status/1372944356781531136
>
> In short, if permission(B) -> false, what happens is:
> start(A) -> {error, {not_started, B}}
> start(B) -> ok
> start(A) -> {error,  {not_started, B}}
> ... [repeat endlessly]
>
> Now, it could be fixed by adding a permission check in the looping function, but this raises the question of what should happen in the above case. Three alternatives:
>
> 1. ensure_all_started(A) returns {error, {not_permitted, B}}, or something
> 2. the call hangs until the flag(s) change, but start(B) is only called once.
> 3. Warn against the use of permissions in the docs, and deprecate them.
>
> I'm assuming that most of you may not even know about permissions. They were introduced back in 1996-97 (I believe), when I and Martin Björklund were going back and forth on how to support distributed applications and cluster control. Eventually, this led to dist_ac and the protocol being defined, so that users could write a controller app taking control of an application and giving instructions on where it should run. In the AXD301, this was done by the RCM application. I believe I talked about it at the EUC 1997, but it's hard to find information about that on the web. :)
>
> Anyway, permissions were left in the API, and ARE documented.
>
> Thoughts?

I know we've used the permissions mechanism occasionally during
maintenance or live upgrades. Off-hand I don't know if we'd want
alternative 1 or 2 (my colleague Daniel Szoboszlay might know more
about this).

/Mikael
Reply | Threaded
Open this post in threaded view
|

Re: BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Dániel Szoboszlay
In reply to this post by Mikael Pettersson-5
Hi,

The use case Mikael mentioned is that we have an application (not a distributed application by the way) that may or may not be started at the time of the upgrade; but if it is started, it needs to be stopped just before the upgrade and restarted right after it. This can be easily achieved by simply disabling the application for the time of the upgrade. The application controller will remember whether it was started when we disabled it and will restart if iaccordingly when we enable it again. It is just more convenient this way than storing the original started/stopped state of the application somewhere else for the time of the upgrade.

For our use case 1. and 2. are equally acceptable, but 3. is obviously not.

I'm not sure which solution should be preferred. I have a feeling that there are some significant semantic inconsistencies on how application dependencies and permissions work. Like application:start(B) would return ok when B is not permitted, but then you cannot start A. But if later on you permit B it gets started, but A does not. And even though you can only start A if B is both permitted and started, once A is running, stopping or disabling B would leave A running. Maybe all applications shall have a new state that says "start blocked by disabled dependency". Then ensure_all_started(A) could simply return ok, and leave A in this state until you permit B.

/Daniel

On Sat, 20 Mar 2021 at 13:45, Mikael Pettersson <[hidden email]> wrote:
On Sat, Mar 20, 2021 at 8:46 AM Ulf Wiger <[hidden email]> wrote:
>
> I had the brilliant idea of using application permissions for a particular use case. This seemed to work perfectly, until I ran `rebar3 shell`, and spotted some disturbing behavior.
>
> The bug, apparently, lies in that `application:ensure_all_started(A)` ends up busy-looping if A depends on B, and permission(B) -> false. What's worse, for each call to start(B), the application controller notices the permission flag, returns `ok` and inserts an entry in its internal `start_p_false` list. This amounts to a memory leak.
>
> I commented it in a tweet, then decided to try to find the source, esp. since I suspected `application:ensure_all_started/1`.
>
> https://twitter.com/uwiger/status/1372944356781531136
>
> In short, if permission(B) -> false, what happens is:
> start(A) -> {error, {not_started, B}}
> start(B) -> ok
> start(A) -> {error,  {not_started, B}}
> ... [repeat endlessly]
>
> Now, it could be fixed by adding a permission check in the looping function, but this raises the question of what should happen in the above case. Three alternatives:
>
> 1. ensure_all_started(A) returns {error, {not_permitted, B}}, or something
> 2. the call hangs until the flag(s) change, but start(B) is only called once.
> 3. Warn against the use of permissions in the docs, and deprecate them.
>
> I'm assuming that most of you may not even know about permissions. They were introduced back in 1996-97 (I believe), when I and Martin Björklund were going back and forth on how to support distributed applications and cluster control. Eventually, this led to dist_ac and the protocol being defined, so that users could write a controller app taking control of an application and giving instructions on where it should run. In the AXD301, this was done by the RCM application. I believe I talked about it at the EUC 1997, but it's hard to find information about that on the web. :)
>
> Anyway, permissions were left in the API, and ARE documented.
>
> Thoughts?

I know we've used the permissions mechanism occasionally during
maintenance or live upgrades. Off-hand I don't know if we'd want
alternative 1 or 2 (my colleague Daniel Szoboszlay might know more
about this).

/Mikael
Reply | Threaded
Open this post in threaded view
|

Re: BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Ulf Wiger-2
In reply to this post by Ulf Wiger-2
Hmm, trying some more with OTP 24, it addresses the problem with the memory growth, but still isn't permission-aware.

Consider test apps a and b, where a depends on b.

15> application:permit(b,false).
ok
16> application:ensure_all_started(a).
{error,{a,{not_started,b}}}
17> application:which_applications().
[{stdlib,"ERTS  CXC 138 10","3.15"},
 {kernel,"ERTS  CXC 138 10","8.0"}]
18> application:permit(b,true).
ok
19> application:which_applications().
[{b,"test app","0.1"},
 {stdlib,"ERTS  CXC 138 10","3.15"},
 {kernel,"ERTS  CXC 138 10","8.0"}]

The call to application:ensure_all_started(a) fails, and supposedly all child apps that were started will have been stopped again, and it does look that way.

But if we later permit b to run, it turns out that the start request wasn't actually removed, and b pops up.

This is for sure a much less serious problem than the previous one.

However, I'm not sure if returning error is actually the right thing to do there. The call SHOULD probably hang.

Comments?

BR,
Ulf W

On Mon, Mar 22, 2021 at 1:23 PM Ulf Wiger <[hidden email]> wrote:
When I started looking closer into this, it would appear as if there is a long-standing bug in the application_controller regarding permissions.

And with "long-standing" I mean that it was there even when Kostis did some Tidier-based cleanup 11 years ago. Kostis didn't introduce it, though.

When servicing a start request, the application_controller, if permission(App) == false, adds a new entry to the `start_p_false` list, i.e. a new entry for each request.

... but when servicing a subsequent {permit_application, App, true}, it uses lists:keydelete/3 to remove the App from the `start_p_false` list.

lists:keydelete/3 obviously only removes the first matching entry.

Earlier in that function, it also only locates the first pending request (or rather, chronologically the last), and uses the `From` in `spawn_starter()'.

The rest of the pending requests should be handled somewhere - likely in `handle_application_started/3`, but aren't.

BR,
Ulf W

On Sat, Mar 20, 2021 at 1:45 PM Mikael Pettersson <[hidden email]> wrote:
On Sat, Mar 20, 2021 at 8:46 AM Ulf Wiger <[hidden email]> wrote:
>
> I had the brilliant idea of using application permissions for a particular use case. This seemed to work perfectly, until I ran `rebar3 shell`, and spotted some disturbing behavior.
>
> The bug, apparently, lies in that `application:ensure_all_started(A)` ends up busy-looping if A depends on B, and permission(B) -> false. What's worse, for each call to start(B), the application controller notices the permission flag, returns `ok` and inserts an entry in its internal `start_p_false` list. This amounts to a memory leak.
>
> I commented it in a tweet, then decided to try to find the source, esp. since I suspected `application:ensure_all_started/1`.
>
> https://twitter.com/uwiger/status/1372944356781531136
>
> In short, if permission(B) -> false, what happens is:
> start(A) -> {error, {not_started, B}}
> start(B) -> ok
> start(A) -> {error,  {not_started, B}}
> ... [repeat endlessly]
>
> Now, it could be fixed by adding a permission check in the looping function, but this raises the question of what should happen in the above case. Three alternatives:
>
> 1. ensure_all_started(A) returns {error, {not_permitted, B}}, or something
> 2. the call hangs until the flag(s) change, but start(B) is only called once.
> 3. Warn against the use of permissions in the docs, and deprecate them.
>
> I'm assuming that most of you may not even know about permissions. They were introduced back in 1996-97 (I believe), when I and Martin Björklund were going back and forth on how to support distributed applications and cluster control. Eventually, this led to dist_ac and the protocol being defined, so that users could write a controller app taking control of an application and giving instructions on where it should run. In the AXD301, this was done by the RCM application. I believe I talked about it at the EUC 1997, but it's hard to find information about that on the web. :)
>
> Anyway, permissions were left in the API, and ARE documented.
>
> Thoughts?

I know we've used the permissions mechanism occasionally during
maintenance or live upgrades. Off-hand I don't know if we'd want
alternative 1 or 2 (my colleague Daniel Szoboszlay might know more
about this).

/Mikael
Reply | Threaded
Open this post in threaded view
|

Re: BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Fred Hebert-2
I initially implemented `application:ensure_all_started/1,2' because it made for an easier get-started mechanism than getting a full blown release, and avoided having people do the annoying "try until it works" garbage routine by hand (which the function is now doing), or having to just line up all the start calls at the beginning of a test run.

I would be in favour of just letting it fail and return back {error, {b, {permit, false}}} -- there are more variations we could use, but this term allows to:
- specifically mention it's b that isn't allowed to start
- uses the 'permit' keyword that matches the function name, such that googling something like "erlang permit false" will yield things such as the doc page for application:permit/2 (or this thread, given how google indexing works)

That being said, it does not line up with the current API usage, which would probably need some fixing of some sort? Look at this session:

1> application:load(ssl).
ok
2> application:permit(ssl, false).
ok
3> application:ensure_all_started(crypto).
{ok,[crypto]}
4> application:ensure_all_started(public_key).
{ok,[asn1,public_key]}
5> application:start(ssl).
ok
6> application:stop(ssl).
{error,{not_started,ssl}}


Starting the application "works" even if it does not with permissions; the call silently fails while returning it succeeded. Since the current documentation for ensure_all_started states the following:

The function reports {error, {AppName,Reason}} for errors, where Reason is any possible reason returned by start/1,2 when starting a specific dependency.

We are in a situation where ensure_all_started can't or won't know that permissions are at play (I didn't when I implemented it), and it appears that we would need to do an explicit permission check before each run to properly return the error explaining the issue without breaking the compatibility of start/1,2. The gotcha here is that there's apparently no way to access this status aside from application_controller internals that would need some extra visibility, which start/1,2 calls on its own as well.

Regards,
Fred.

On Mon, Mar 22, 2021 at 1:06 PM Ulf Wiger <[hidden email]> wrote:
Hmm, trying some more with OTP 24, it addresses the problem with the memory growth, but still isn't permission-aware.

Consider test apps a and b, where a depends on b.

15> application:permit(b,false).
ok
16> application:ensure_all_started(a).
{error,{a,{not_started,b}}}
17> application:which_applications().
[{stdlib,"ERTS  CXC 138 10","3.15"},
 {kernel,"ERTS  CXC 138 10","8.0"}]
18> application:permit(b,true).
ok
19> application:which_applications().
[{b,"test app","0.1"},
 {stdlib,"ERTS  CXC 138 10","3.15"},
 {kernel,"ERTS  CXC 138 10","8.0"}]

The call to application:ensure_all_started(a) fails, and supposedly all child apps that were started will have been stopped again, and it does look that way.

But if we later permit b to run, it turns out that the start request wasn't actually removed, and b pops up.

This is for sure a much less serious problem than the previous one.

However, I'm not sure if returning error is actually the right thing to do there. The call SHOULD probably hang.

Comments?

BR,
Ulf W

On Mon, Mar 22, 2021 at 1:23 PM Ulf Wiger <[hidden email]> wrote:
When I started looking closer into this, it would appear as if there is a long-standing bug in the application_controller regarding permissions.

And with "long-standing" I mean that it was there even when Kostis did some Tidier-based cleanup 11 years ago. Kostis didn't introduce it, though.

When servicing a start request, the application_controller, if permission(App) == false, adds a new entry to the `start_p_false` list, i.e. a new entry for each request.

... but when servicing a subsequent {permit_application, App, true}, it uses lists:keydelete/3 to remove the App from the `start_p_false` list.

lists:keydelete/3 obviously only removes the first matching entry.

Earlier in that function, it also only locates the first pending request (or rather, chronologically the last), and uses the `From` in `spawn_starter()'.

The rest of the pending requests should be handled somewhere - likely in `handle_application_started/3`, but aren't.

BR,
Ulf W

On Sat, Mar 20, 2021 at 1:45 PM Mikael Pettersson <[hidden email]> wrote:
On Sat, Mar 20, 2021 at 8:46 AM Ulf Wiger <[hidden email]> wrote:
>
> I had the brilliant idea of using application permissions for a particular use case. This seemed to work perfectly, until I ran `rebar3 shell`, and spotted some disturbing behavior.
>
> The bug, apparently, lies in that `application:ensure_all_started(A)` ends up busy-looping if A depends on B, and permission(B) -> false. What's worse, for each call to start(B), the application controller notices the permission flag, returns `ok` and inserts an entry in its internal `start_p_false` list. This amounts to a memory leak.
>
> I commented it in a tweet, then decided to try to find the source, esp. since I suspected `application:ensure_all_started/1`.
>
> https://twitter.com/uwiger/status/1372944356781531136
>
> In short, if permission(B) -> false, what happens is:
> start(A) -> {error, {not_started, B}}
> start(B) -> ok
> start(A) -> {error,  {not_started, B}}
> ... [repeat endlessly]
>
> Now, it could be fixed by adding a permission check in the looping function, but this raises the question of what should happen in the above case. Three alternatives:
>
> 1. ensure_all_started(A) returns {error, {not_permitted, B}}, or something
> 2. the call hangs until the flag(s) change, but start(B) is only called once.
> 3. Warn against the use of permissions in the docs, and deprecate them.
>
> I'm assuming that most of you may not even know about permissions. They were introduced back in 1996-97 (I believe), when I and Martin Björklund were going back and forth on how to support distributed applications and cluster control. Eventually, this led to dist_ac and the protocol being defined, so that users could write a controller app taking control of an application and giving instructions on where it should run. In the AXD301, this was done by the RCM application. I believe I talked about it at the EUC 1997, but it's hard to find information about that on the web. :)
>
> Anyway, permissions were left in the API, and ARE documented.
>
> Thoughts?

I know we've used the permissions mechanism occasionally during
maintenance or live upgrades. Off-hand I don't know if we'd want
alternative 1 or 2 (my colleague Daniel Szoboszlay might know more
about this).

/Mikael
Reply | Threaded
Open this post in threaded view
|

Re: BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Ulf Wiger-2
Permissions are a bit tricky. :)

The issue I encountered was the use of `application:ensure_all_started()` in `rebar3 shell`, which I think is perfectly appropriate use of the function.
But if that use is supposed to mimic the starting of applications via a boot script, then it should either return `ok` (counter-intuitive), or hang until the permission flag(s) turn to true.

The `application:close()` function should definitely remove the corresponding entries in `start_p_false`, and if there are such entries, not complain that the app hasn't been started.

Basically, once an application has been started, the permission flag should toggle it on or off - if you call `permit(A, false)` on running application A, it should be stopped. A subsequent call to `permit(A, true)` should start it again.

This is why I think it might be better for `ensure_all_started()` to hang rather than returning an error tuple. Alternatively, that a new function, `await_all_started()` does this, if that's what's desired.
The question then becomes which one 'rebar3 shell` should use...

BR,
Ulf W

On Tue, Mar 23, 2021 at 1:13 PM Fred Hebert <[hidden email]> wrote:
I initially implemented `application:ensure_all_started/1,2' because it made for an easier get-started mechanism than getting a full blown release, and avoided having people do the annoying "try until it works" garbage routine by hand (which the function is now doing), or having to just line up all the start calls at the beginning of a test run.

I would be in favour of just letting it fail and return back {error, {b, {permit, false}}} -- there are more variations we could use, but this term allows to:
- specifically mention it's b that isn't allowed to start
- uses the 'permit' keyword that matches the function name, such that googling something like "erlang permit false" will yield things such as the doc page for application:permit/2 (or this thread, given how google indexing works)

That being said, it does not line up with the current API usage, which would probably need some fixing of some sort? Look at this session:

1> application:load(ssl).
ok
2> application:permit(ssl, false).
ok
3> application:ensure_all_started(crypto).
{ok,[crypto]}
4> application:ensure_all_started(public_key).
{ok,[asn1,public_key]}
5> application:start(ssl).
ok
6> application:stop(ssl).
{error,{not_started,ssl}}


Starting the application "works" even if it does not with permissions; the call silently fails while returning it succeeded. Since the current documentation for ensure_all_started states the following:

The function reports {error, {AppName,Reason}} for errors, where Reason is any possible reason returned by start/1,2 when starting a specific dependency.

We are in a situation where ensure_all_started can't or won't know that permissions are at play (I didn't when I implemented it), and it appears that we would need to do an explicit permission check before each run to properly return the error explaining the issue without breaking the compatibility of start/1,2. The gotcha here is that there's apparently no way to access this status aside from application_controller internals that would need some extra visibility, which start/1,2 calls on its own as well.

Regards,
Fred.

On Mon, Mar 22, 2021 at 1:06 PM Ulf Wiger <[hidden email]> wrote:
Hmm, trying some more with OTP 24, it addresses the problem with the memory growth, but still isn't permission-aware.

Consider test apps a and b, where a depends on b.

15> application:permit(b,false).
ok
16> application:ensure_all_started(a).
{error,{a,{not_started,b}}}
17> application:which_applications().
[{stdlib,"ERTS  CXC 138 10","3.15"},
 {kernel,"ERTS  CXC 138 10","8.0"}]
18> application:permit(b,true).
ok
19> application:which_applications().
[{b,"test app","0.1"},
 {stdlib,"ERTS  CXC 138 10","3.15"},
 {kernel,"ERTS  CXC 138 10","8.0"}]

The call to application:ensure_all_started(a) fails, and supposedly all child apps that were started will have been stopped again, and it does look that way.

But if we later permit b to run, it turns out that the start request wasn't actually removed, and b pops up.

This is for sure a much less serious problem than the previous one.

However, I'm not sure if returning error is actually the right thing to do there. The call SHOULD probably hang.

Comments?

BR,
Ulf W

On Mon, Mar 22, 2021 at 1:23 PM Ulf Wiger <[hidden email]> wrote:
When I started looking closer into this, it would appear as if there is a long-standing bug in the application_controller regarding permissions.

And with "long-standing" I mean that it was there even when Kostis did some Tidier-based cleanup 11 years ago. Kostis didn't introduce it, though.

When servicing a start request, the application_controller, if permission(App) == false, adds a new entry to the `start_p_false` list, i.e. a new entry for each request.

... but when servicing a subsequent {permit_application, App, true}, it uses lists:keydelete/3 to remove the App from the `start_p_false` list.

lists:keydelete/3 obviously only removes the first matching entry.

Earlier in that function, it also only locates the first pending request (or rather, chronologically the last), and uses the `From` in `spawn_starter()'.

The rest of the pending requests should be handled somewhere - likely in `handle_application_started/3`, but aren't.

BR,
Ulf W

On Sat, Mar 20, 2021 at 1:45 PM Mikael Pettersson <[hidden email]> wrote:
On Sat, Mar 20, 2021 at 8:46 AM Ulf Wiger <[hidden email]> wrote:
>
> I had the brilliant idea of using application permissions for a particular use case. This seemed to work perfectly, until I ran `rebar3 shell`, and spotted some disturbing behavior.
>
> The bug, apparently, lies in that `application:ensure_all_started(A)` ends up busy-looping if A depends on B, and permission(B) -> false. What's worse, for each call to start(B), the application controller notices the permission flag, returns `ok` and inserts an entry in its internal `start_p_false` list. This amounts to a memory leak.
>
> I commented it in a tweet, then decided to try to find the source, esp. since I suspected `application:ensure_all_started/1`.
>
> https://twitter.com/uwiger/status/1372944356781531136
>
> In short, if permission(B) -> false, what happens is:
> start(A) -> {error, {not_started, B}}
> start(B) -> ok
> start(A) -> {error,  {not_started, B}}
> ... [repeat endlessly]
>
> Now, it could be fixed by adding a permission check in the looping function, but this raises the question of what should happen in the above case. Three alternatives:
>
> 1. ensure_all_started(A) returns {error, {not_permitted, B}}, or something
> 2. the call hangs until the flag(s) change, but start(B) is only called once.
> 3. Warn against the use of permissions in the docs, and deprecate them.
>
> I'm assuming that most of you may not even know about permissions. They were introduced back in 1996-97 (I believe), when I and Martin Björklund were going back and forth on how to support distributed applications and cluster control. Eventually, this led to dist_ac and the protocol being defined, so that users could write a controller app taking control of an application and giving instructions on where it should run. In the AXD301, this was done by the RCM application. I believe I talked about it at the EUC 1997, but it's hard to find information about that on the web. :)
>
> Anyway, permissions were left in the API, and ARE documented.
>
> Thoughts?

I know we've used the permissions mechanism occasionally during
maintenance or live upgrades. Off-hand I don't know if we'd want
alternative 1 or 2 (my colleague Daniel Szoboszlay might know more
about this).

/Mikael
Reply | Threaded
Open this post in threaded view
|

Re: BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Michael Truog
Most usage of application:ensure_all_started/1 wouldn't want it to hang due to permissions because that wouldn't be fail-fast and the usage would often be for testing that needs to fail (the system state is not as expected).

Your desire to have the permissions cause the application start sequence to hang would be part of manual interaction with the application start sequence.  So, it seems best to have that in a separate function that is specifically for manual interaction with the applications, like application:await_all_started/1 .

I would want to avoid functionality that can cause the execution to hang, so I would avoid permissions and application:await_all_started/1 to keep the execution of higher-level source code predictable.  One way of resolving these separate use-cases is to have any hang-execution functionality in a separate module, like application_shell to clearly indicate the functionality is for manual interaction only.

Best Regards,
Michael

On 3/23/21 5:45 AM, Ulf Wiger wrote:
Permissions are a bit tricky. :)

The issue I encountered was the use of `application:ensure_all_started()` in `rebar3 shell`, which I think is perfectly appropriate use of the function.
But if that use is supposed to mimic the starting of applications via a boot script, then it should either return `ok` (counter-intuitive), or hang until the permission flag(s) turn to true.

The `application:close()` function should definitely remove the corresponding entries in `start_p_false`, and if there are such entries, not complain that the app hasn't been started.

Basically, once an application has been started, the permission flag should toggle it on or off - if you call `permit(A, false)` on running application A, it should be stopped. A subsequent call to `permit(A, true)` should start it again.

This is why I think it might be better for `ensure_all_started()` to hang rather than returning an error tuple. Alternatively, that a new function, `await_all_started()` does this, if that's what's desired.
The question then becomes which one 'rebar3 shell` should use...

BR,
Ulf W

On Tue, Mar 23, 2021 at 1:13 PM Fred Hebert <[hidden email]> wrote:
I initially implemented `application:ensure_all_started/1,2' because it made for an easier get-started mechanism than getting a full blown release, and avoided having people do the annoying "try until it works" garbage routine by hand (which the function is now doing), or having to just line up all the start calls at the beginning of a test run.

I would be in favour of just letting it fail and return back {error, {b, {permit, false}}} -- there are more variations we could use, but this term allows to:
- specifically mention it's b that isn't allowed to start
- uses the 'permit' keyword that matches the function name, such that googling something like "erlang permit false" will yield things such as the doc page for application:permit/2 (or this thread, given how google indexing works)

That being said, it does not line up with the current API usage, which would probably need some fixing of some sort? Look at this session:

1> application:load(ssl).
ok
2> application:permit(ssl, false).
ok
3> application:ensure_all_started(crypto).
{ok,[crypto]}
4> application:ensure_all_started(public_key).
{ok,[asn1,public_key]}
5> application:start(ssl).
ok
6> application:stop(ssl).
{error,{not_started,ssl}}


Starting the application "works" even if it does not with permissions; the call silently fails while returning it succeeded. Since the current documentation for ensure_all_started states the following:

The function reports {error, {AppName,Reason}} for errors, where Reason is any possible reason returned by start/1,2 when starting a specific dependency.

We are in a situation where ensure_all_started can't or won't know that permissions are at play (I didn't when I implemented it), and it appears that we would need to do an explicit permission check before each run to properly return the error explaining the issue without breaking the compatibility of start/1,2. The gotcha here is that there's apparently no way to access this status aside from application_controller internals that would need some extra visibility, which start/1,2 calls on its own as well.

Regards,
Fred.

On Mon, Mar 22, 2021 at 1:06 PM Ulf Wiger <[hidden email]> wrote:
Hmm, trying some more with OTP 24, it addresses the problem with the memory growth, but still isn't permission-aware.

Consider test apps a and b, where a depends on b.

15> application:permit(b,false).
ok
16> application:ensure_all_started(a).
{error,{a,{not_started,b}}}
17> application:which_applications().
[{stdlib,"ERTS  CXC 138 10","3.15"},
 {kernel,"ERTS  CXC 138 10","8.0"}]
18> application:permit(b,true).
ok
19> application:which_applications().
[{b,"test app","0.1"},
 {stdlib,"ERTS  CXC 138 10","3.15"},
 {kernel,"ERTS  CXC 138 10","8.0"}]

The call to application:ensure_all_started(a) fails, and supposedly all child apps that were started will have been stopped again, and it does look that way.

But if we later permit b to run, it turns out that the start request wasn't actually removed, and b pops up.

This is for sure a much less serious problem than the previous one.

However, I'm not sure if returning error is actually the right thing to do there. The call SHOULD probably hang.

Comments?

BR,
Ulf W

On Mon, Mar 22, 2021 at 1:23 PM Ulf Wiger <[hidden email]> wrote:
When I started looking closer into this, it would appear as if there is a long-standing bug in the application_controller regarding permissions.

And with "long-standing" I mean that it was there even when Kostis did some Tidier-based cleanup 11 years ago. Kostis didn't introduce it, though.

When servicing a start request, the application_controller, if permission(App) == false, adds a new entry to the `start_p_false` list, i.e. a new entry for each request.

... but when servicing a subsequent {permit_application, App, true}, it uses lists:keydelete/3 to remove the App from the `start_p_false` list.

lists:keydelete/3 obviously only removes the first matching entry.

Earlier in that function, it also only locates the first pending request (or rather, chronologically the last), and uses the `From` in `spawn_starter()'.

The rest of the pending requests should be handled somewhere - likely in `handle_application_started/3`, but aren't.

BR,
Ulf W

On Sat, Mar 20, 2021 at 1:45 PM Mikael Pettersson <[hidden email]> wrote:
On Sat, Mar 20, 2021 at 8:46 AM Ulf Wiger <[hidden email]> wrote:
>
> I had the brilliant idea of using application permissions for a particular use case. This seemed to work perfectly, until I ran `rebar3 shell`, and spotted some disturbing behavior.
>
> The bug, apparently, lies in that `application:ensure_all_started(A)` ends up busy-looping if A depends on B, and permission(B) -> false. What's worse, for each call to start(B), the application controller notices the permission flag, returns `ok` and inserts an entry in its internal `start_p_false` list. This amounts to a memory leak.
>
> I commented it in a tweet, then decided to try to find the source, esp. since I suspected `application:ensure_all_started/1`.
>
> https://twitter.com/uwiger/status/1372944356781531136
>
> In short, if permission(B) -> false, what happens is:
> start(A) -> {error, {not_started, B}}
> start(B) -> ok
> start(A) -> {error,  {not_started, B}}
> ... [repeat endlessly]
>
> Now, it could be fixed by adding a permission check in the looping function, but this raises the question of what should happen in the above case. Three alternatives:
>
> 1. ensure_all_started(A) returns {error, {not_permitted, B}}, or something
> 2. the call hangs until the flag(s) change, but start(B) is only called once.
> 3. Warn against the use of permissions in the docs, and deprecate them.
>
> I'm assuming that most of you may not even know about permissions. They were introduced back in 1996-97 (I believe), when I and Martin Björklund were going back and forth on how to support distributed applications and cluster control. Eventually, this led to dist_ac and the protocol being defined, so that users could write a controller app taking control of an application and giving instructions on where it should run. In the AXD301, this was done by the RCM application. I believe I talked about it at the EUC 1997, but it's hard to find information about that on the web. :)
>
> Anyway, permissions were left in the API, and ARE documented.
>
> Thoughts?

I know we've used the permissions mechanism occasionally during
maintenance or live upgrades. Off-hand I don't know if we'd want
alternative 1 or 2 (my colleague Daniel Szoboszlay might know more
about this).

/Mikael

Reply | Threaded
Open this post in threaded view
|

Re: BUG: fatal interaction between application:ensure_all_started(A) and permit(B, false)

Ulf Wiger-2
In reply to this post by Ulf Wiger-2
Ah, so yet another bug! :)

I guess this how you know that a feature hasn't really been used much.

BR,
Ulf W

Den fre 26 mars 2021 17:06Martin Björklund <[hidden email]> skrev:
Ulf Wiger <[hidden email]> wrote:
> Permissions are a bit tricky. :)

Indeed.  This discussion prompted me to use permissions for a use case
of my own, but the system behaved really strange.  Eventually I found
the problem.  Have a look at this:

Eshell V11.1.4  (abort with ^G)
1> application:load(sasl).
ok
2> application:permit(sasl, false).
ok
3> application:start(sasl).
ok
4> application:permit(sasl, true).
ok
5> application:which_applications().
[{sasl,"SASL  CXC 138 11","4.0.1"},
 {stdlib,"ERTS  CXC 138 10","3.14"},
 {kernel,"ERTS  CXC 138 10","7.2"}]


This is fine and as expected.  But now let's throw in an additional
permit false:


Eshell V11.1.4  (abort with ^G)
1> application:load(sasl).
ok
2> application:permit(sasl, false).
ok
3> application:start(sasl).
ok
4> application:permit(sasl, false).       % <-- HERE
ok
5> application:permit(sasl, true).
ok
6> application:which_applications().
[{stdlib,"ERTS  CXC 138 10","3.14"},
 {kernel,"ERTS  CXC 138 10","7.2"}]


Now sasl isn't started!

It turns out that the second permit false will remove the
application from the start_p_false list in application_controller.erl:

    %% start requested but not started because permit was false
    {false, {true, _Appl}, false, {value, _Tuple}, false, false} ->
       update_permissions(AppName, Bool),
       SS = S#state{start_p_false = keydelete(AppName, 1, SPF)},
       {reply, ok, SS};

This doesn't seem right.  I think this should be a noop.


/martin