Supervisor and configuration update — how to do it?

Supervisor and configuration update — how to do it?

Max Lapshin-2
Hi.

Our Flussonic has a large amount of code devoted to smooth configuration
updates. Almost all configuration parameters can be changed on the fly. This
is very important because we handle large amounts of video traffic: it
is very bad to leave 20 000 people without video just because you want to
reconfigure a single channel.


We have to fight with supervisor concepts: we do not accept
configuration in start_link/init, because it may change later.

The code usually looks like this:


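init([Options]) ->
  {ok, update_options(Options, #state{})}.

%% Apply new options to the running process without a restart.
handle_info({update_options, Options}, #state{} = State) ->
  {noreply, update_options(Options, State)};
handle_info(_Other, State) ->
  {noreply, State}.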



This approach doesn't work well with a supervisor: if the process is
restarted, it will be restarted with the old Options.

My question is: is this approach an antipattern in OTP?
Maybe someone else has already run into this and has some ideas about
it, such as modifying supervisor.erl to change the MFA?

Re: Supervisor and configuration update — how to do it?

Loïc Hoguin-3
In Ranch 2.0 the options are put in an ets table, and then fetched from
those tables in the supervisor init functions. The same could be done
using persistent_term.
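
A minimal sketch of that idea, under stated assumptions: the module name `chan_sup`, the table name `chan_options`, and the worker module `chan_worker` are hypothetical and do not come from Ranch. The point it illustrates is that `init/1` looks the options up on every (re)start, so a restarted supervisor sees the latest values rather than the ones frozen in its parent's child spec.

```erlang
-module(chan_sup).
-behaviour(supervisor).

-export([create_table/0, set_options/2, start_link/1, init/1]).

-define(TAB, chan_options).  % hypothetical table name

%% Call once from a long-lived process: an ETS table dies with its owner.
create_table() ->
    ets:new(?TAB, [named_table, public, set, {read_concurrency, true}]).

%% Store (or overwrite) the options for one named channel.
set_options(Name, Options) ->
    ets:insert(?TAB, {Name, Options}).

start_link(Name) ->
    supervisor:start_link(?MODULE, Name).

init(Name) ->
    %% Fetched at every (re)start, not captured once in a child spec.
    [{Name, Options}] = ets:lookup(?TAB, Name),
    Child = #{id => worker,
              %% chan_worker is a hypothetical worker module.
              start => {chan_worker, start_link, [Name, Options]}},
    {ok, {#{strategy => one_for_all}, [Child]}}.
```

Only the supervisor's parent needs to know the channel name; everything that may change later lives in the table.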


--
Loïc Hoguin
https://ninenines.eu

Re: Supervisor and configuration update — how to do it?

Michael Truog
In reply to this post by Max Lapshin-2
On 5/19/20 10:36 PM, Max Lapshin wrote:

> Hi.
>
> Our flussonic has big amount of code related to smooth configuration
> update. Almost all configuration parameters can be changed on fly: it
> is very important, because we handle big amounts of video traffic.  It
> is very bad to leave 20 000 people without video if you want to
> reconfigure single channel.
>
>
> We have to fight with supervisor concepts: we do not accept
> configuration in start_link/init, because it may change later.
I prefer putting configuration in start_link/init and changing it with a
message.  That keeps the approach functional, pursuing referential
transparency, making sure the update of the configuration info is not
error-prone because the sequence for the update gets well represented in
a sequence of functions that are called (easy to test, easy to add
asserts to, easy to reason about and make assumptions about).  I am
assuming you don't use the process dictionary.  Things are normally not
completely referentially transparent due to catching exceptions and
messages getting received, but often that processing is isolated in
function calls for that purpose.

I understand the tendency to pursue ets, mnesia, persistent_term or a
global data store for (transient read-write) configuration, because that
approach seems easy with mutable data and time is spent trying to ignore
the data inconsistency problems that occur as things change.  Arguing
for global data is similar to arguing about the benefits of global
variables and imperative programming, so often people coming from an
imperative programming background will feel compelled to find a way to
use global data to solve problems. It's best to resist this temptation in
a language that supports functional programming.

Best Regards,
Michael

Re: Supervisor and configuration update — how to do it?

Tristan Sloughter-4
In reply to this post by Max Lapshin-2
You can update the childspecs with either an appup or manually using the steps an appup does of suspending the process and replacing its state with `sys` calls.

Can't use that for the top level supervisor but if that crashes you go through the application start process anyway where you can pick up the new config state to pass along?
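
One way to read the manual route, sketched under assumptions: for a supervisor, the `sys` sequence a release upgrade performs is suspend, change code, resume, and the supervisor's own `code_change` re-runs the callback module's `init/1` and adopts the child specs it returns. The module and function names below are hypothetical. Because `init/1` is re-run with the original start arguments, the new values have to come from something `init/1` reads, such as a table or the application env.

```erlang
-module(sup_reload).
-export([reload_child_specs/2]).

%% Ask a running supervisor to re-run CallbackModule:init/1 and adopt the
%% returned child specs. Running children are not restarted by this; the
%% refreshed specs are what the supervisor uses for later restarts.
reload_child_specs(SupRef, CallbackModule) ->
    ok = sys:suspend(SupRef),
    try
        %% OldVsn and Extra are not used by the supervisor's code_change.
        sys:change_code(SupRef, CallbackModule, undefined, [])
    after
        ok = sys:resume(SupRef)
    end.
```

The `try ... after` makes sure the supervisor is resumed even if the code change call fails.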


Re: Supervisor and configuration update — how to do it?

Michael Truog
On 5/20/20 6:39 AM, Tristan Sloughter wrote:
> You can update the childspecs with either an appup or manually using the steps an appup does of suspending the process and replacing its state with `sys` calls.
>
> Can't use that for the top level supervisor but if that crashes you go through the application start process anyway where you can pick up the new config state to pass along?
I think that if configuration is going through the start_link/init, from
the supervisor childspec, that means the maximum restart intensity
(MaxR) should be 0 for the top-level supervisor.  That would avoid
relying on the supervisor's automatic restart because the childspec
start_link arguments cannot easily be changed (in an atomic way).  That
would also mean that a single Erlang process run by the top-level
supervisor would be tracking the configuration state changes, making
those changes happen in an ordered sequential way (to avoid the data
inconsistency problems that can occur with global data used
concurrently).  The configuration data can always be split among
separate Erlang processes that own separate parts of the configuration,
if that is necessary with parts of the configuration that do not have
dependencies.

Best Regards,
Michael


Re: Supervisor and configuration update — how to do it?

Ameretat Reith
I used to separate worker parameters into dynamic and static ones. Static
parameters were set by the supervisor, while dynamic ones were loaded from the
application env on worker initialization _or_ at runtime. Each worker needing
dynamic parameters had a `do_load_options` function that invoked
`application:get_env` for each dynamic parameter of interest. It was
called on worker initialization, e.g. in `init/1` for gen_servers, to
populate the state. And like your approach, there was a `handle_call` clause
that invoked `do_load_options` and reset the state. This message was sent to
the workers found by `supervisor:which_children`.

I had scripts (Juju hooks and then a homebuilt deployment system hook) that
could set the application env in sys.config on configuration changes. That
hook would then make an RPC call (via rpcterms) to the release to reload
`sys.config`. This `reload_sys_config` calls
`application_controller:change_application_data` to update the envs and then
runs the procedures that call `do_load_options` on the workers.
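
A rough sketch of the worker side of this scheme, not the original code: the module name `dyn_worker`, the application name `my_app`, and the two option keys with their defaults are all made up for illustration. Only `do_load_options`, `application:get_env` and `supervisor:which_children` come from the description above.

```erlang
-module(dyn_worker).
-behaviour(gen_server).

-export([start_link/1, reload_all/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(APP, my_app).  % hypothetical application name

start_link(StaticArgs) ->
    gen_server:start_link(?MODULE, StaticArgs, []).

%% Ask every running worker under Sup to re-read its dynamic options.
reload_all(Sup) ->
    [gen_server:call(Pid, reload_options)
     || {_Id, Pid, worker, _Mods} <- supervisor:which_children(Sup),
        is_pid(Pid)].

init(StaticArgs) ->
    %% Static parameters arrive from the supervisor; dynamic ones are loaded.
    {ok, do_load_options(#{static => StaticArgs})}.

handle_call(reload_options, _From, State) ->
    {reply, ok, do_load_options(State)}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Dynamic parameters are read from the application env on every call,
%% so a restarted worker picks up the current values as well.
do_load_options(State) ->
    State#{chunk_seconds => application:get_env(?APP, chunk_seconds, 5),
           max_clients   => application:get_env(?APP, max_clients, 1000)}.
```

Since `init/1` and the reload path share `do_load_options`, a restart and an explicit reload end up with the same state.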

Re: Supervisor and configuration update — how to do it?

Oleg Tarasenko
Hey Max,

We also had a similar problem in production. Our configuration (and states) were the results of computations over a long period of time.
The system itself was managed by Mesos, so it was hard to use Erlang hot updates, as was suggested earlier (another issue is
that hot updates are not trivial when you have multiple stateful processes).

Losing the state was quite painful for us, as it would take at least a couple of minutes to restore it, which would result
in at least 100K failed requests from customers. And because of our SLAs, it would have cost us quite a bit.

What worked for us was Redis caching. It gave us reasonable performance and was external to our system (so it was easier to
re-boot nodes).

Best regards,
Oleg




Re: Supervisor and configuration update — how to do it?

zxq9-2
In reply to this post by Max Lapshin-2
On 2020/05/20 14:36, Max Lapshin wrote:
> We have to fight with supervisor concepts: we do not accept
> configuration in start_link/init, because it may change later.

Two thoughts on dynamic supervision and a final thought that leads to a
question:

- Thought 1: Supervision tree declarations
It is possible to write a declarative structure that defines a
supervision tree and a function to read down it, spawning the
appropriate supervisors and workers as it goes along.

Because this is possible it is also possible to make the supervision
tree definition be the dynamic output of another function. There is no
necessity for the parameters for supervision start to be statically
defined as literals -- the input for this could come from anywhere.

(I've experimented with dynamic generation and parsing of the definition
code a good bit, but the use cases are very narrow in production code
-- a boringly obvious, readable, static definition is the right answer
*almost* all the time. I might make a lib for the "read this supervision
tree declaration and start it up" if there is any interest in it. It
does make reading a new project for the first time pretty easy because
the structure of the project is obvious by glancing only at the tree
definition instead of chasing ideas around the codebase.)

- Thought 2: Dynamic supervision definitions usually suck
It is quite a common experience to find projects in the wild (or
especially in consulting) where the supervisor code is dynamic, so
it is not easy to understand at a glance what the final
structure and relationship of the resulting processes will be. It is
also common in such projects for supervision to be more of a "plate"
than a "tree" and to lack any truly robust recovery capability beyond the
first layer.

This is nearly always a bad approach.

- Thought 3: What does "dynamic" mean?
When we start a supervisor it starts with a restart strategy and its
child definitions. Until now "dynamic supervision" has meant "start
parameters at start time", but not dynamically updating the supervisor's
overall rules on the fly.

We can dynamically add and remove child definitions and start/stop
children by making requests to a supervisor. Why can we not also check
and update supervisor rules and childspecs on the fly?

The use cases for this would be extremely limited, but it doesn't seem
hard to implement a complete complement to supervisor:get_childspec/2
(with the exception of trying to change a simple_one_for_one to anything
else or vice versa).

   supervisor:get_childspec/2
   supervisor:set_childspec/3
   supervisor:get_restart_strategy/1
   supervisor:set_restart_strategy/2

Thoughts?

-Craig

Re: Supervisor and configuration update — how to do it?

Max Lapshin-2
I have the same repeated pattern in many places:

I start a process; it has some configuration that can be read and
changed on the fly without disconnecting sockets or releasing resources.

This process has some sibling helpers that are launched after it and
that are connected to it. Usually this is a one-for-all-of-them
strategy.

I am thinking of trying to find some common and reasonable pattern here by
editing the supervisor.

Right now we mostly use an external configuration converger: a process that
sleeps for several seconds, then wakes up and checks whether the
whole system is properly configured, i.e. all required processes are
started or killed.

It works, but it is not as smooth as it could be.

Re: Supervisor and configuration update — how to do it?

Michael Truog
On 5/22/20 7:52 AM, Max Lapshin wrote:

> I have the same repeated pattern in many places:
>
> I start process, it has some configuration that can be read and
> changed on fly without disconnecting sockets and releasing resources.
>
> This process has some siblings-helpers that are launched after it and
> that are connected to him. Usually this is a one-for-all-of-them
> strategy.
>
> I think of trying to find some common and reasonable pattern here with
> editing supervisor there.
>
> Right now we mostly use external configuration converger: process that
> sleeps for several seconds and then wakes up and starts checking if
> whole system if properly configured: all required processes are
> started or killed.
>
> It works, but it is not as smooth as it can be.
The external configuration converger sounds like a problem to me because
your description doesn't sound like fail-fast behavior, if the converger
doesn't always control the configuration changes (it sounds like it
doesn't because you said it was external).  The sleep delay before it
resolves a configuration problem makes the failures slow.

If the process that owns the configuration (I call the one in
CloudI "configurator") does synchronous requests when changing
configuration (spawn_link with gen_server:call, or some similar approach,
can make the synchronous requests occur in parallel), the response
tells you whether each change succeeded or not.  That allows the
configuration process to fail fast.  That also lets process restarts
be reserved for unexpected errors (as much as possible, the bugs
developers are unable to anticipate).
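
A sketch of the "spawn_link with gen_server:call" idea, assuming each worker answers a `{update_options, Options}` call; the module name `config_push`, the message shape, and the 5 second timeout are illustrative choices, not CloudI code.

```erlang
-module(config_push).
-export([push/2]).

%% Send Options to every worker in parallel and collect one result per
%% worker, so the caller learns right away which updates failed.
push(Workers, Options) ->
    Parent = self(),
    Refs = [start_request(Parent, W, Options) || W <- Workers],
    [receive {Ref, W, Result} -> {W, Result} end || Ref <- Refs].

%% One short-lived helper process per worker makes the synchronous call.
start_request(Parent, Worker, Options) ->
    Ref = make_ref(),
    spawn_link(fun() ->
        Result = try
                     gen_server:call(Worker, {update_options, Options}, 5000)
                 catch
                     %% A dead or slow worker becomes an error result
                     %% instead of crashing the configuration owner.
                     Class:Reason -> {error, {Class, Reason}}
                 end,
        Parent ! {Ref, Worker, Result}
    end),
    Ref.
```

The total wait is bounded by the slowest worker rather than by the sum of all calls, and the result list can be reported back to whoever requested the change.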

Best Regards,
Michael

Re: Supervisor and configuration update — how to do it?

Max Lapshin-2
I'm afraid that I haven't got your idea.


Look, I have a live video stream. It collects video frames into big
chunks. There is a configuration setting that allows changing the size of
these chunks, from 1 second to 10 seconds.

This reconfiguration must be done without any restarts.

How would you advise doing this configuration?

Re: Supervisor and configuration update — how to do it?

Tristan Sloughter-4
In this example you could use persistent_term: if the value fits in one word, updating it does not trigger a global garbage collection.
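
A small sketch of what that could look like for the chunk size, assuming it is stored as a small integer number of seconds; the module name, the key, and the default of 5 are hypothetical.

```erlang
-module(chunk_cfg).
-export([set_chunk_seconds/1, chunk_seconds/0]).

%% Namespaced key to avoid clashing with other persistent terms.
-define(KEY, {?MODULE, chunk_seconds}).

%% A small integer is an immediate term, so it fits in one word.
set_chunk_seconds(Seconds)
  when is_integer(Seconds), Seconds >= 1, Seconds =< 10 ->
    persistent_term:put(?KEY, Seconds).

%% Cheap to read from any process; falls back to 5 if nothing was set yet.
chunk_seconds() ->
    persistent_term:get(?KEY, 5).
```

A stream process would call `chunk_cfg:chunk_seconds()` each time it closes a chunk, so a new value takes effect on the next chunk without any message or restart.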


Re: Supervisor and configuration update — how to do it?

Bob Gustafson-2
In reply to this post by Max Lapshin-2
To zero in on your exact situation:

1) Is your video stream continuous, or does it stop after 10 seconds?

2) Is the configuration change done manually by a human, or is it done
programmatically according to some preset criteria?

3) Could you save the whole video stream in one chunk and then,
after the capture, chop it into chunks of the desired size?

4) Following along with 3), you could cache the continuing stream and
chop it into chunks of the desired size from the cache as the capture
continues. Each chunk is saved and deleted from the past end of the
capture cache.

Do any of these scenarios match your circumstances?

BobG


Re: Supervisor and configuration update — how to do it?

Max Lapshin-2
In reply to this post by Tristan Sloughter-4
persistent_term is a global variable, which makes it extremely hard to test
and maintain.

We are trying to move away from that approach and use Erlang as purely
as possible.

I do not see any difference between persistent_term and
application:get_env: both of them
are global variables.

Re: Supervisor and configuration update — how to do it?

Max Lapshin-2
In reply to this post by Bob Gustafson-2
I'm speaking about a surveillance camera or a TV channel.

2000 IP cameras are recording 24/7. Or it is a Super Bowl stream.

You cannot even think about stopping the process. All reconfigurations are
made on the fly.


Re: Supervisor and configuration update — how to do it?

Michael Truog
In reply to this post by Max Lapshin-2
On 5/25/20 11:08 AM, Max Lapshin wrote:

> I'm afraid that I haven't got your idea.
>
>
> Look, I have a live video stream. It collects video frames into big
> chunks. There is a configuration setting that allows to change size of
> these chunks: from 1 seconds to 10 seconds.
>
> This configuration must be done without any restarts.
>
> How do you advice to do this configuration?
I am a fan of using process messages to change configuration, for two
main reasons:
1) The configuration data stays consistent because the changes are
controlled (i.e., the state changes go through sequential states,
keeping the number of possible states small, making things easier to
test and understand)
2) Specific errors can be handled when configuration problems occur,
probably by reporting the problem to whoever changed the
configuration and either avoiding or limiting the configuration change.
A configuration error during startup should cause the startup to fail,
to avoid running erroneously for an undefined length of time.

Assuming a gen_server process, use a call if the configuration change is
able to return an error.  If no error is possible, e.g., if a record
field is updated and that is all that happens, use a cast.  The
configuration change might be slow when changing 1000s of processes, but
the messages can be done in parallel and the benefits (everything can be
tested in a controlled way, all the configuration states are valid, no
inconsistent configuration data will occur, etc.) are better than
pursuing the global data approach.
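
To make the call/cast distinction concrete, here is a minimal sketch for the chunk-size example from earlier in the thread. The module name `video_stream`, the `title` field, and the option map are assumptions for illustration only; the 1 to 10 second range is the one mentioned above.

```erlang
-module(video_stream).
-behaviour(gen_server).

-export([start_link/1, set_chunk_seconds/2, set_title/2]).
-export([init/1, handle_call/3, handle_cast/2]).

-record(state, {chunk_seconds = 5, title = <<>>}).

start_link(Options) ->
    gen_server:start_link(?MODULE, Options, []).

%% This change can be rejected, so it is a call: the caller sees the error.
set_chunk_seconds(Pid, Seconds) ->
    gen_server:call(Pid, {set_chunk_seconds, Seconds}).

%% This change only updates a record field, so a cast is enough.
set_title(Pid, Title) ->
    gen_server:cast(Pid, {set_title, Title}).

init(Options) ->
    Seconds = maps:get(chunk_seconds, Options, 5),
    %% A bad initial value fails startup instead of running misconfigured.
    case validate(Seconds) of
        ok -> {ok, #state{chunk_seconds = Seconds}};
        {error, Reason} -> {stop, Reason}
    end.

handle_call({set_chunk_seconds, Seconds}, _From, State) ->
    case validate(Seconds) of
        ok -> {reply, ok, State#state{chunk_seconds = Seconds}};
        %% Reject the change and keep the previous, valid configuration.
        {error, _} = Error -> {reply, Error, State}
    end.

handle_cast({set_title, Title}, State) ->
    {noreply, State#state{title = Title}}.

validate(S) when is_integer(S), S >= 1, S =< 10 -> ok;
validate(S) -> {error, {bad_chunk_seconds, S}}.
```

Every state the process can reach has passed `validate/1`, which is the "all configuration states are valid" property argued for above.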

Best Regards,
Michael