Supervisor child pid

Supervisor child pid

Zsolt Laky
Dear Team,

Finding a child's Pid under a supervisor can be costly with the approach I found on the net, supervisor:which_children/1: with a large number of children it returns a huge list of {Id, Child, Type, Modules} tuples, which then has to be searched for MyChildId.
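For reference, a minimal sketch of the which_children/1 approach described above. The module and function names (child_lookup, lookup_by_scan) are made up for illustration. The supervisor builds the whole child list and copies it to the caller, so the cost grows with the number of children.

```erlang
-module(child_lookup).
-export([lookup_by_scan/2]).

%% Linear scan: fetch the full child list from the supervisor, then
%% search it for the wanted Id.
-spec lookup_by_scan(Sup, Id) -> {ok, pid()} | {error, not_registered} when
      Sup :: atom() | pid(),
      Id  :: term().
lookup_by_scan(Sup, Id) ->
    case lists:keyfind(Id, 1, supervisor:which_children(Sup)) of
        {Id, Pid, _Type, _Modules} when is_pid(Pid) ->
            {ok, Pid};
        _ ->
            %% No such Id, or the child is 'undefined' or 'restarting'.
            {error, not_registered}
    end.
```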

What I did is:

-spec lookup(Name) -> Result when
    Name   :: term(),
    Result :: {ok, pid()} | {error, not_registered}.
lookup(Name) ->
    %% Throwaway spec: only the id matters. If a child with this id is
    %% already running, the supervisor answers without calling start.
    FakeChild = #{id => Name,
                  start => {?MODULE, ignore_start_child, []},
                  restart => temporary,
                  shutdown => 2000,
                  type => worker,
                  modules => dynamic},
    case supervisor:start_child(actor_sup, FakeChild) of
        {error, {already_started, Pid}} ->
            {ok, Pid};
        _Else -> %% e.g. {error, {ignored, _}} when no such child exists
            {error, not_registered}
    end.

%% Must be exported: the supervisor calls it via apply/3.
-spec ignore_start_child() ->
    {error, ignored}.
ignore_start_child() ->
    {error, ignored}.

I ask the experts if they see any pitfall I might have missed.

Thanks in advance and kind regards,
Zsolt

Re: Supervisor child pid

Roger Lipscombe-2
supervisor:get_child_spec/2 seems less brittle and more understandable.


Re: Supervisor child pid

Zsolt Laky
Thanks for your suggestion Roger, however get_child_spec/2 does not give the Pid for the child back.
Zsolt



Re: Supervisor child pid

zxq9-2
In reply to this post by Zsolt Laky
On 2020/03/04 4:08, Zsolt Laky wrote:
> Finding a child Pid under a supervisor

...
> I ask the experts if they see any pitfall I might have missed.

The most important issue starts back at the premise:

1. Why is the system designed in such a way that interrogation of a
supervisor about its children is necessary?
2. Why is the performance of that interrogation so important?

-Craig

Re: Supervisor child pid

Zsolt Laky
Hi Craig,

1. It is not necessary. I could use gproc as a registry for supervised proc_lib:spawned processes. But since the supervisor itself already has its own registry, why shouldn't I use that instead?
2. With thousands of medium-lived processes (like session handling processes) I believe performance is important.

Kind regards,
Zsolt



Re: Supervisor child pid

zxq9-2
On 2020/03/04 14:03, Zsolt Laky wrote:
> HI Craig,
>
> 1. It is not necessary. I could use gproc as a registry for supervised proc_lib:spawned processes. Since supervisor itself has its registry, why should not I use that instead?

Because the meaning of a process in a system is more than its PID.
Whatever part of the system knows the relevant meanings can be written
in such a way that it *already* knows a map of meanings to PIDs.

It can take a bit of experience to identify the natural places for this,
especially in a large system.

> 2. With thousands of medium lived processes (like session handling processes) I believe performance is important.

Performance is important, sure, but not because X set of PIDs happen to
be children of supervisor Y. Simplicity tends to beat performance in
most cases anyway.

Writing a system in such a way that there is a process that is
responsible for the proper and sensible operation of a service within
the system (whether it is session management, chat clients, game mob
balance/spawning, stream routing, or whatever else) means that typically
such a service manager is both the location where settings for the
service reside and the place where the current status of the system is
known ("The service is configured to only accept N client connections of
type X at a given moment before closing the listen socket and reporting
to a higher level manager" <- these sort of tasks).

If you need, for example, a way to route DMs in a chat system, the
service manager itself is nearly always the simplest and fastest place
to implement the lookup -- it knows a LOT more than the supervisor of
the client connections ever could and quite often features of the system
will evolve to the point that the additional service information or
session meta will impact the *way* that the task that requires routing
will function anyway.

Chat channels (or game zones or instances or whatever else needs to
group a set of workers that means something inside the system but is
arbitrary from the perspective of the supervisor) tend to be processes
of their own and already know the PIDs of the member sessions being
handled, and their enrolled lists are tiny fractions of a supervisor's
set of children.

The only time where a list of just the PIDs is a meaningful piece of
information is when some sort of system level service task is being
performed, and these tend to be tasks that block, do not occur in a
tight loop, and are not time sensitive.

Once a system (and the team or person writing it) evolves to the point
that it is really addressing the problem to be solved in a natural way,
the issue of looking up children of a given supervisor just doesn't come
up. I remember several times when I *thought* I had this problem only to
find that it just evaporated from the needs/tasks list the moment some
other area of the system was restructured.

I now regard the feeling that "I need to look up the list of a
supervisor's children" as a code smell in just about every case other
than writing system tools.

-Craig
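
A minimal sketch of the kind of service manager described above, under the assumption that each worker enrolls itself with the manager when it starts. The module name session_man and the functions enroll/1 and lookup/1 are hypothetical, not taken from the thread. The manager keeps its own Name -> Pid map and monitors the workers, so entries disappear when a worker dies.

```erlang
-module(session_man).
-behaviour(gen_server).

-export([start_link/0, enroll/1, lookup/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, none, []).

%% Called by a worker to register itself under Name.
enroll(Name) ->
    gen_server:call(?MODULE, {enroll, Name, self()}).

lookup(Name) ->
    gen_server:call(?MODULE, {lookup, Name}).

init(none) ->
    %% names: Name => Pid, monitors: MonitorRef => Name
    {ok, #{names => #{}, monitors => #{}}}.

handle_call({enroll, Name, Pid}, _From,
            State = #{names := Names, monitors := Mons}) ->
    case maps:find(Name, Names) of
        {ok, Existing} ->
            {reply, {error, {already_enrolled, Existing}}, State};
        error ->
            Ref = erlang:monitor(process, Pid),
            NewState = State#{names := Names#{Name => Pid},
                              monitors := Mons#{Ref => Name}},
            {reply, ok, NewState}
    end;
handle_call({lookup, Name}, _From, State = #{names := Names}) ->
    Reply = case maps:find(Name, Names) of
                {ok, Pid} -> {ok, Pid};
                error     -> {error, not_registered}
            end,
    {reply, Reply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Drop the entry of a worker that has terminated.
handle_info({'DOWN', Ref, process, _Pid, _Reason},
            State = #{names := Names, monitors := Mons}) ->
    case maps:take(Ref, Mons) of
        {Name, Mons1} ->
            {noreply, State#{names := maps:remove(Name, Names),
                             monitors := Mons1}};
        error ->
            {noreply, State}
    end;
handle_info(_Other, State) ->
    {noreply, State}.
```

Because the manager owns the map, service-level rules (limits, permissions, extra session metadata) have an obvious place to live next to the lookup.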

Re: Supervisor child pid

Roger Lipscombe-2
On Wed, 4 Mar 2020 at 08:33, zxq9 <[hidden email]> wrote:
> I now regard the feeling that "I need to look up the list of a
> supervisor's children" as a code smell in just about every case other
> than writing system tools.

fwiw, our current system has a custom supervisor implementation in one
(and only one) place which deals specifically with "start a child with
this ID; if it already exists, give me back the current pid"; it's a
simplification/enhancement of a simple_one_for_one supervisor (which
doesn't otherwise track IDs).

But, we *also* have a process registry tracking the same processes,
but for different purposes. Whether I'd do that again in hindsight,
I'm not sure; at the time we identified the need for a custom
supervisor, and I *think* some of those constraints are still valid.
Our process registry is also custom, because we had slightly different
needs than most of the extant process registry implementations
offered.

--
Roger.

Re: Supervisor child pid

zxq9-2
On 2020/03/04 18:22, Roger Lipscombe wrote:

> On Wed, 4 Mar 2020 at 08:33, zxq9 <[hidden email]> wrote:
>> I now regard the feeling that "I need to look up the list of a
>> supervisor's children" as a code smell in just about every case other
>> than writing system tools.
>
> fwiw, our current system has a custom supervisor implementation in one
> (and only one) place which deals specifically with "start a child with
> this ID; if it already exists, give me back the current pid"; it's a
> simplification/enhancement of a simple_one_for_one supervisor (which
> doesn't otherwise track IDs).
>
> But, we *also* have a process registry tracking the same processes,
> but for different purposes. Whether I'd do that again in hindsight,
> I'm not sure; at the time we identified the need for a custom
> supervisor, and I *think* some of those constraints are still valid.
> Our process registry is also custom, because we had slightly different
> needs than most of the extant process registry implementations
> offered.

I can totally relate to this story!

It is very common in systems that I work on these days for a "service
manager" sort of process to be a sibling of the simple_one_for_one that
owns the workers in the service, and the service manager to have a few
process registry-ish sort of functions. These nearly always wind up
doing quite a few extra things beyond just looking up a PID ("is A
allowed to look B up? Does that cost a point?" etc.) and that pays off
in the end because pure Name -> PID lookup isn't nearly as important
as having a proper abstraction for a service as an entity of its own.

In any case, there is usually a need for internal services to have the
supervisor bits *and* service management bits, and this quite often
results in implementation of some form of registry within the manager
part (or glomming this together inside a custom supervisor). One of the
most common design problems I see with folks new to architecture of
concurrent systems is to fail to identify when they are writing an
internal service and leave the responsibility for managing it *as* a
service to functions that are either a) scattered through the system and
have to do cheetah flips querying various components to build up enough
state to make service-level management decisions or b) push everything
into a God process that is supposed to know everything and winds up
becoming a SPOF and SPBN at the same time.

-Craig

Re: Supervisor child pid

Zsolt Laky
In reply to this post by Roger Lipscombe-2
We are in a somewhat similar situation to Roger's.

The only reason our processes are supervised is so that we can use release_handler for live upgrades. We had our own registry before we started using supervisors, and all "supervised" processes are "temporary", so they are never restarted. The idea is: if we have to link our processes to a supervisor anyway (for that one specific reason), why should we implement a separate registry for finding them?

I am grateful for your thoughts and enjoy discussing system design. Nevertheless, the original question is whether the small piece of code provided does what it was written for without breaking the functionality of a one_for_one supervisor in this specific use case.

-spec lookup(Name) -> Result when
    Name   :: term(),
    Result :: {ok, pid()} | {error, not_registered}.
lookup(Name) ->
    FakeChild = #{id => Name,
                  start => {?MODULE, ignore_start_child, []},
                  restart => temporary,
                  shutdown => 2000,
                  type => worker,
                  modules => dynamic},
    case supervisor:start_child(my_sup, FakeChild) of
        {error, {already_started, Pid}} ->
            {ok, Pid};
        _Else -> %% {error, {ignored, _}}
            {error, not_registered}
    end.

-spec ignore_start_child() ->
    {error, ignored}.
ignore_start_child() ->
    {error, ignored}.

Zsolt
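
One way to check the behaviour in isolation is a small self-contained module that acts as its own one_for_one supervisor. This is only a sketch: the module name lookup_demo and the idle worker are invented for the experiment, and lookup/1 is the code above with the supervisor name swapped for ?MODULE.

```erlang
-module(lookup_demo).
-behaviour(supervisor).

-export([start_link/0, start_worker/1, lookup/1]).
-export([init/1, worker_start_link/0, ignore_start_child/0]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    {ok, {#{strategy => one_for_one, intensity => 1, period => 5}, []}}.

%% Start a real temporary child under the given id.
start_worker(Id) ->
    supervisor:start_child(?MODULE,
                           #{id => Id,
                             start => {?MODULE, worker_start_link, []},
                             restart => temporary}).

%% A trivial worker that idles until it receives 'stop'. The start
%% function runs in the supervisor, so spawn_link links to it.
worker_start_link() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

%% The lookup from the post: try to start a fake child with the same id.
lookup(Id) ->
    Fake = #{id => Id,
             start => {?MODULE, ignore_start_child, []},
             restart => temporary},
    case supervisor:start_child(?MODULE, Fake) of
        {error, {already_started, Pid}} -> {ok, Pid};
        _Else                           -> {error, not_registered}
    end.

ignore_start_child() ->
    {error, ignored}.
```

Two things to keep in mind when reading the result: every lookup is a synchronous call to the supervisor process, and for children that are not temporary a stopped child whose spec is still present makes start_child return {error, already_present}, which this lookup also reports as not_registered.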



Re: Supervisor child pid

Roger Lipscombe-2
In reply to this post by zxq9-2
On Wed, 4 Mar 2020 at 10:01, zxq9 <[hidden email]> wrote:
> It is very common in systems that I work on these days for a "service
> manager" sort of process to be a sibling of the simple_one_for_one that
> owns the workers in the service, and the service manager to have a few
> process registry-ish sort of functions.

I had a dig in the code, and discovered the justification for a custom
supervisor:

- We needed simple_one_for_one's "every child is the same" semantics.
- We needed one_for_one's find-or-start semantics (simple_one_for_one
ignores IDs).
- We did *not* want max-restarts; children restart infinitely. We
definitely did *not* want the supervisor killing itself.
- Default restart strategy is transient.

But (and I think this is an important distinction): while the
supervisor has find-or-start, i.e. uniqueness, we still need a
registry in order to *find* those processes from other places.

Other features of our registry: processes can have names and
properties; names are owned by a single process, but a process can
have multiple names; properties are many:many. Unlike gproc, which
already does a lot of that, we also had the need that registering a
name transferred it to the new process, rather than failing. I also,
at the time, found gproc's API overly confusing.

The supervisor and registry are not siblings (in the strict sense that
they have a common parent supervisor), but they're "siblings" in the
way that they're used. Another reason for our registry to not be a
strict sibling is that we also use it (in a different application)
paired with ranch.
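
The find-or-start part of that list can be approximated with a stock one_for_one supervisor. The sketch below is not Roger's custom supervisor; the module name keyed_sup and the idle worker are assumptions made for illustration. Every child is built from the same template and only the id differs.

```erlang
-module(keyed_sup).
-behaviour(supervisor).

-export([start_link/0, find_or_start/1]).
-export([init/1, start_worker/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    {ok, {#{strategy => one_for_one, intensity => 10, period => 1}, []}}.

%% Start a child for Id, or return the pid of the one already running.
-spec find_or_start(term()) -> {ok, pid()} | {error, term()}.
find_or_start(Id) ->
    Spec = #{id => Id,
             start => {?MODULE, start_worker, [Id]},
             restart => transient},
    case supervisor:start_child(?MODULE, Spec) of
        {ok, Pid} when is_pid(Pid)      -> {ok, Pid};
        {error, {already_started, Pid}} -> {ok, Pid};
        {error, Reason}                 -> {error, Reason};
        Unexpected                      -> {error, {unexpected, Unexpected}}
    end.

%% Placeholder worker: idles until it receives 'stop'.
start_worker(_Id) ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.
```

A stock supervisor still enforces its restart intensity, so this does not cover the "no max-restarts" requirement listed above; that part is what needs a custom implementation.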

Re: Supervisor child pid

Jesper Louis Andersen-2
In reply to this post by Zsolt Laky
On Tue, Mar 3, 2020 at 8:51 PM Zsolt Laky <[hidden email]> wrote:
> Dear Team,
>
> Finding a child Pid under a supervisor can be costly with the suggestion I found on the net with supervisor:which_children/1 as with a high number of children it returns a huge list with {Id, Child, Type, Modules} to find the MyChildId in.


Usually, I tend to keep the supervision domain separate from the communication domain. The reason is that supervision forms a tree and communication forms a hypergraph. That is, if you want to know the Pid of a process, you look it up in a registry, like gproc.

Mixing the life-time domain with the communication domain locks down your lifetime domain against changes. Also note that you can often cache Pids in processes depending on where they sit in a supervision tree. If you have a sibling in a one-for-all supervisor, you can rely on the sibling not changing Pid, as if it does, then you are getting terminated anyway. The same often happens if the Pid is sitting higher up the tree and you are sharing a common path. You can often rely on the fact that if it goes away, you are getting terminated very quickly.

As an aside, this policy is somewhat the same as what you have in Go with its "context" package. A context.Context encodes when it is time for your goroutine to terminate, but it doesn't encode the actual policy of why you are terminating. This is generally a good decoupling of concerns, since you might want to use that goroutine in a different setting with a different policy. I view supervision trees in Erlang in much the same way: they encode policy separately from work, so the policy can change over time without the worker having to change.

It *is* more work up front, but in the longer run, it tends to win you robustness, and also efficiency, which you have learned as well.
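
A small sketch of looking a Pid up in a registry rather than in the supervisor. To stay free of dependencies it uses OTP's own global module as a stand-in, not gproc; global is cluster-wide and heavier than a node-local registry, so treat this as an illustration of the shape only. The module name name_registry is made up.

```erlang
-module(name_registry).
-export([register_self/1, whereis_name/1]).

%% Register the calling process under an arbitrary term.
-spec register_self(term()) -> ok | {error, already_registered}.
register_self(Name) ->
    case global:register_name(Name, self()) of
        yes -> ok;
        no  -> {error, already_registered}
    end.

%% Look the name up without involving any supervisor.
-spec whereis_name(term()) -> {ok, pid()} | {error, not_registered}.
whereis_name(Name) ->
    case global:whereis_name(Name) of
        Pid when is_pid(Pid) -> {ok, Pid};
        undefined            -> {error, not_registered}
    end.
```

The supervision tree can then be rearranged freely, since nothing that communicates with a process depends on where it sits in the tree.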


Re: Supervisor child pid

Zsolt Laky
Hi Jesper,

Thanks for your valuable notes. I see the point and agree. In a system with "real" supervision that is the model to be followed.

In our special case, with single-layer one_for_one supervision where all children are "temporary", as I see it the supervisor does nothing but track the Ids: when a child terminates for any reason it is not restarted, just removed from the supervisor's child list. I see it as a kind of registry, with the advantage that the Ids can be any term() instead of an atom() as with the standard register/2.
Moreover, if the spawned process attempts to add itself to this supervisor (and routes the request to the existing Pid if supervisor:start_child returns {error, {already_started, Pid}}), performance can be pretty good, as the supervisor process does not wait for the "child" process to spin up with a possibly long init function, which would block other children from starting.

In all other cases it seems to be a solution to be avoided, I agree.

Thanks all for your relevant thoughts, I learned a lot! Again.

Cheers
Zsolt



Re: Supervisor child pid

Roger Lipscombe-2
On Thu, 5 Mar 2020 at 06:06, Zsolt Laky <[hidden email]> wrote:
> performance can be pretty good as the supervisor process does not wait the "child" process to spin up with a possible long init function, blocking other childs to start.

If this is your problem, then it's possible to make the child process
start asynchronously. I wrote something on my blog, here:
https://blog.differentpla.net/blog/2018/06/12/gen-server-enter-loop/
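
A minimal sketch of that idea, assuming the usual proc_lib plus gen_server:enter_loop/3 pattern (this is not copied from the blog post). The module name async_init_server, the boot/1 function and the 100 ms sleep standing in for slow setup are all invented for illustration.

```erlang
-module(async_init_server).
-behaviour(gen_server).

-export([start_link/0, get_state/1, boot/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Returns as soon as the new process calls proc_lib:init_ack/2.
start_link() ->
    proc_lib:start_link(?MODULE, boot, [self()]).

get_state(Pid) ->
    gen_server:call(Pid, get_state).

boot(Parent) ->
    %% Unblock the parent (for example a supervisor) right away...
    proc_lib:init_ack(Parent, {ok, self()}),
    %% ...then do the slow part and enter the gen_server loop.
    State = slow_setup(),
    gen_server:enter_loop(?MODULE, [], State).

slow_setup() ->
    timer:sleep(100),
    #{ready => true}.

%% Required by the behaviour, but not called on this start path.
init(_Args) ->
    ignore.

handle_call(get_state, _From, State) ->
    {reply, State, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Calls made while the setup is still running simply wait in the mailbox until the process enters the loop.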

Re: Supervisor child pid

Fred Youhanaie-2

On 05/03/2020 08:11, Roger Lipscombe wrote:
> On Thu, 5 Mar 2020 at 06:06, Zsolt Laky <[hidden email]> wrote:
>> performance can be pretty good as the supervisor process does not wait the "child" process to spin up with a possible long init function, blocking other childs to start.
>
> If this is your problem, then it's possible to make the child process
> start asynchronously. I wrote something on my blog, here:
> https://blog.differentpla.net/blog/2018/06/12/gen-server-enter-loop/
>

I'll just add that since OTP 21.0 we have a new callback available in gen_server: handle_continue. This allows init to return immediately and then continue the rest of the initialization asynchronously within the callback, before the first client message is handled.


Cheers,
Fred
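
A minimal sketch of the handle_continue approach, with a made-up module name (continue_server) and a 100 ms sleep standing in for the slow part of the initialization.

```erlang
-module(continue_server).
-behaviour(gen_server).

-export([start_link/0, get_state/1]).
-export([init/1, handle_continue/2, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

get_state(Pid) ->
    gen_server:call(Pid, get_state).

%% Return at once so the caller of start_link/0 is not blocked, and
%% ask gen_server to run handle_continue/2 next.
init([]) ->
    {ok, #{ready => false}, {continue, finish_init}}.

%% Runs before any message from the mailbox is handled.
handle_continue(finish_init, State) ->
    timer:sleep(100),
    {noreply, State#{ready := true}}.

handle_call(get_state, _From, State) ->
    {reply, State, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```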