|
I would really like to discourage people from avoiding
selective receive because it's "expensive". It can be
expensive on very large message queues, but this is
a pretty rare error condition, and fairly easily observable.

(I know of projects that have banned use of selective receive
for this reason, but without having thought much about what
to use instead, and when.)

You can use erlang:system_monitor/2 to quickly detect if a
process is growing memory in a strange way.

An old legacy Ericsson system implemented selective receive
in a way that the message queue could hold at most 6 messages.
Any more than that was obviously an error.

I think it might be useful to be able to specify such a limit as
a spawn option, perhaps together with maximum heap size.
Exceeding the limit could perhaps lead to the process being
killed (which might seem backwards in the case of the message
queue, but at least gives a visible indication), or that the sender
process would be suspended (which could potentially lead to the
whole system stopping.)

BR,
Ulf W

2008/5/31 Christopher Atkins <[hidden email]>:
> Hello, I tried (poorly--I'm a complete novice) to implement a benchmark from
> your earlier statement. I didn't do the same thing (load up the message
> mailbox before consuming them), but what I did write led to a perplexing (to
> me) discovery. If I uncomment the line in [loop1/0] below, performance for
> that loop degrades by an order of magnitude. Why is that?
>
> -module(test_receive).
> -compile(export_all).
>
> start() ->
>     statistics(runtime),
>     statistics(wall_clock),
>     PidLoop1 = spawn(?MODULE, loop1,[]),
>     sender(PidLoop1, 10000000),
>     {_, Loop1Time1} = statistics(runtime),
>     {_, Loop1Time2} = statistics(wall_clock),
>     io:format("Sent ~p messages in ~p /~p~n", [100000, Loop1Time1, Loop1Time2]),
>     statistics(runtime),
>     statistics(wall_clock),
>     PidLoop2 = spawn(?MODULE, loop2,[]),
>     sender(PidLoop2, 10000000),
>     {_, Loop2Time1} = statistics(runtime),
>     {_, Loop2Time2} = statistics(wall_clock),
>     io:format("Sent ~p messages in ~p /~p~n", [100000, Loop2Time1, Loop2Time2]).
>
> sender(_, 0) -> void;
> sender(Pid, N) ->
>     if
>         N rem 2 =:= 2 ->
>             Pid ! test2;
>         true ->
>             Pid ! test1
>     end,
>     sender(Pid, N - 1).
>
> proc1(F) ->
>     receive
>         start -> spawn_link(F)
>     end.
>
> loop1() ->
>     receive
>         %%test1 -> loop1();
>         test2 -> loop1()
>     end.
>
> loop2() ->
>     receive
>         _ -> loop2()
>     end.
>
>
> ------------------------------------------------------------------------
> Message: 2
> Date: Fri, 30 May 2008 18:07:18 +0200
> From: "Per Melin" <[hidden email]>
> Subject: Re: [erlang-questions] eep: multiple patterns
> To: "Sean Hinde" <[hidden email]>
> Cc: Erlang Questions <[hidden email]>
> Message-ID:
>        <[hidden email]>
> Content-Type: text/plain; charset=ISO-8859-1
>
> 2008/5/30 Per Melin <[hidden email]>:
>> If I send 100k 'foo' messages and then 100k 'bar' messages to a
>> process, and then do a catch-all receive until there are no messages
>> left, that takes 0.03 seconds.
>>
>> If I do a selective receive of only the 'bar' messages, it takes 90
>> seconds.
>
> I found my old test code:
>
> -module(selective).
>
> -export([a/2, c/2]).
>
> a(Atom, N) ->
>     spawn(fun() -> b(Atom, N) end).
>
> b(Atom, N) ->
>     spam_me(foo, N),
>     spam_me(bar, N),
>     R = timer:tc(?MODULE, c, [Atom, N]),
>     io:format("TC: ~p~n", [R]).
>
> c(Atom, N) ->
>     receive
>         Atom -> c(Atom, N - 1)
>     after 0 ->
>         N
>     end.
>
> spam_me(Msg, Copies) ->
>     lists:foreach(fun(_) -> self() ! Msg end, lists:duplicate(Copies, 0)).
>
> ---
>
> 2> selective:a(bar, 100000).
> <0.38.0>
> TC: {124130689,0}
> 3> selective:a(foo, 100000).
> <0.40.0>
> TC: {23176,0}

_______________________________________________
erlang-questions mailing list
[hidden email]
http://www.erlang.org/mailman/listinfo/erlang-questions
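
A minimal sketch of the erlang:system_monitor/2 suggestion in Ulf's post, not a definitive setup. The module name heap_watch and the threshold argument are assumptions made for illustration. The large_heap option reports heap size after a garbage collection, which is only an indirect hint that a mailbox is growing, so the handler also looks up message_queue_len for the reported process. Note that only one process at a time can be the system monitor.

```erlang
-module(heap_watch).
-export([start/1]).

%% Start a process that registers itself as the system monitor and
%% reports any process whose heap exceeds MaxHeapWords (hypothetical
%% threshold, in words) after a garbage collection.
start(MaxHeapWords) ->
    spawn(fun() ->
        erlang:system_monitor(self(), [{large_heap, MaxHeapWords}]),
        loop()
    end).

loop() ->
    receive
        {monitor, Pid, large_heap, Info} ->
            %% process_info/2 returns undefined if Pid has already died.
            QLen = erlang:process_info(Pid, message_queue_len),
            io:format("large heap in ~p: ~p, ~p~n", [Pid, Info, QLen]),
            loop();
        _Other ->
            %% Catch-all so the monitor's own mailbox cannot fill up.
            loop()
    end.
```

Calling heap_watch:start(1000000) from the shell would be one way to try it; the right threshold depends entirely on the system being watched.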
|
Actually, I assume that in just about all cases where you
have a process that needs selective receive semantics,
it's probably perfectly ok to set a low limit on the maximum
length of the message queue. A buffering process could
be placed in front of it, which might also normally do
dispatch. It would not use selective receive, and so wouldn't
suffer much from a large message queue.

BR,
Ulf W

2008/5/31 Ulf Wiger <[hidden email]>:
> An old legacy Ericsson system implemented selective receive
> in a way that the message queue could hold at most 6 messages.
> Any more than that was obviously an error.
>
> I think it might be useful to be able to specify such a limit as
> a spawn option, perhaps together with maximum heap size.
> Exceeding the limit could perhaps lead to the process being
> killed (which might seem backwards in the case of the message
> queue, but at least gives a visible indication), or that the sender
> process would be suspended (which could potentially lead to the
> whole system stopping.)
>
> BR,
> Ulf W
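
One way the buffering process described above might look, as a sketch under stated assumptions. The module name front_buffer and the {work, BufferPid, Msg} / {ack, WorkerPid} protocol are hypothetical; the thread does not define one. The buffer never uses selective receive: its receive ends in a catch-all, so it always takes the first message in its mailbox and keeps the backlog in a queue data structure instead. The worker therefore has at most one outstanding message from the buffer.

```erlang
-module(front_buffer).
-export([start/1]).

%% Start a buffer in front of Worker. Clients send to the buffer pid.
start(Worker) ->
    spawn(fun() -> loop(Worker, queue:new(), idle) end).

loop(Worker, Q, State) ->
    receive
        {ack, Worker} ->
            %% Worker finished one message; hand over the next, if any.
            case queue:out(Q) of
                {{value, Msg}, Q1} ->
                    Worker ! {work, self(), Msg},
                    loop(Worker, Q1, busy);
                {empty, _} ->
                    loop(Worker, Q, idle)
            end;
        Msg when State =:= idle ->
            %% Worker is free: forward immediately.
            Worker ! {work, self(), Msg},
            loop(Worker, Q, busy);
        Msg ->
            %% Worker is busy: keep the message in the buffer's own queue.
            loop(Worker, queue:in(Msg, Q), busy)
    end.
```

The worker is then free to use selective receive on a mailbox that stays short, while the buffer could also drop or reject messages once its queue passes some limit.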
|
2008/5/31 Ulf Wiger <[hidden email]>:
> Actually, I assume that in just about all cases where you
> have a process that needs selective receive semantics,
> it's probably perfectly ok to set a low limit on the maximum
> length of the message queue. A buffering process could
> be placed in front of it, which might also normally do
> dispatch. It would not use selective receive, and so wouldn't
> suffer much from a large message queue.

The last time selective receive broke things for me was actually not
in my own code, but in Mnesia.

When Mnesia loads a distributed table from another node it subscribes
to table events before it starts to copy the table, and then ignores
those table event messages while it's (selectively) receiving the
table contents. Depending on the size of the table and the rate at
which the table is updated on the other node, this can make your
message queue grow until you run out of memory.

This is not a case where a long queue obviously is an error. Except
perhaps in the design.

> 2008/5/31 Ulf Wiger <[hidden email]>:
>
>> An old legacy Ericsson system implemented selective receive
>> in a way that the message queue could hold at most 6 messages.
>> Any more than that was obviously an error.
>>
>> I think it might be useful to be able to specify such a limit as
>> a spawn option, perhaps together with maximum heap size.
>> Exceeding the limit could perhaps lead to the process being
>> killed (which might seem backwards in the case of the message
>> queue, but at least gives a visible indication), or that the sender
>> process would be suspended (which could potentially lead to the
>> whole system stopping.)
>>
>> BR,
>> Ulf W
|
> I would really like to discourage people from avoiding
> selective receive because it's "expensive".

I would second that. Selective receive is similar to thinking
single threaded in a multi-threaded environment (the approach
that erlang in general supports). Isolate a group of related
messages using the selective part and then you don't have to
worry about all the other interleaved interruptions that may occur.

But as Ulf said, we aren't aware of any books on how to
structure your messaging architecture which take you stepwise
up from a simple architecture to a complicated selective receive.
I do caution a beginner to start simple and build up; understand
how the message queue works by creating test scenarios that
produce specific results.

One of the early admonishments one hears is to always have
a catch-all clause in your receive statements, which of course
eliminates selective receive and causes your code to process
messages in the order they arrived. To get around this, you
can split the receive into separate functions, and then call one
function to handle one logical message stream and another
function to handle a different logical message stream.

The thing to watch out for in the split receive case is the
missing message:

receive
    {a, How} -> do_stuff();
    {a, When, Why} -> do_stuff()
after 500 -> timeout
end.

receive
    {b, How} -> do_stuff();
    {b, When, Why} -> do_stuff()
after 500 -> timeout
end.

Now you can handle the two streams independently, maybe
giving more time to 'a' stream items than 'b' stream items.
But suppose you accidentally send a message with {c, X}
and it only happens once an hour. You will gradually get
a queue which fills up with {c, X} messages, but you won't
notice the slowdown for a few days.

Whenever you have disjoint receive statements, you need to
take care that there is a technique for emptying unexpected
messages. Even though your queue should never get long,
a new programmer on the staff may send a new message to
your process without you knowing and it will take a while to
discover the cause.

jay
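
A minimal sketch of one "technique for emptying unexpected messages", assuming the only expected messages are tuples tagged a or b as in the fragments above. The module name drain and the IS_EXPECTED macro are hypothetical. The guard is written so that it cannot fail on non-tuple messages, which means stray atoms or other terms are drained too, while expected messages stay on the queue in their original order.

```erlang
-module(drain).
-export([drain_unexpected/1]).

%% True for the message shapes this process knows how to handle.
-define(IS_EXPECTED(M),
        (is_tuple(M) andalso tuple_size(M) >= 2 andalso
         (element(1, M) =:= a orelse element(1, M) =:= b))).

%% Remove every message that is not an 'a' or 'b' stream item and
%% return how many were dropped. Call it with Count = 0.
drain_unexpected(Count) ->
    receive
        Msg when not ?IS_EXPECTED(Msg) ->
            io:format("dropping unexpected message ~p~n", [Msg]),
            drain_unexpected(Count + 1)
    after 0 ->
        Count
    end.
```

Calling this between the two receive statements, or on a timer, costs one scan of the queue per call, so it is cheap as long as the queue is short, which is exactly the condition it is meant to preserve.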
|
I wrote:
> > Whenever you have disjoint receive statements, you need to
> > take care that there is a technique for emptying unexpected
> > messages.

Edwin Fine accidentally replied only to me directly:

> Is this a good place to use the catch-all, or is there a better
> technique? I ask this as a newcomer to Erlang.

(This posting also gives an alternative example to Valentin's
priority problem suggestion)

Consider a case where you are doing a scatter / gather algorithm
to spread processing across nodes or across different processing
algorithms. To make it concrete, suppose we have a database with
5 different tables and we need to collect information from each table
to assemble into a single view to the user.

The standard approach is to use the DB capability to join the tables.
This introduces a single point access problem since the database
server is doing all the work while the initiating process waits. Instead
we put each of the tables in a different DB, flat file or ets table. Then
we create a process for each one that provides caching and an access
interface using messages. They may end up on the same machine or
on 5 different machines, but we will get parallelism on the I/O and
possibly on the cache and assembly processing (if there are multiple
cores or multiple machines in the case of cache and assembly).

What does the code look like?

[Assume getQueries(UserId) generates a list of queries that are related
to the database information we would like to display and that the length
of this list matches the number of DB processes we have.]

doUserQuery(UserId) ->
    Queries = getQueries(UserId),
    QueryRef = make_ref(),
    %% DbPids is assumed to be bound to the list of DB process pids.
    [Pid ! {getData, QueryRef, UserId, Query}
     || {Pid, Query} <- lists:zip(DbPids, Queries)],
    Responses = collectResponses(QueryRef),
    display_db_info(Responses),
    erlang:send_after(1000, self(), {cleanup, QueryRef}).

This is a pretty hokey approach -- you would want something better
than a 1 second delay to tell you whether to eliminate old messages
from the queue, but it is a concrete example to describe why you would
want to use selective receive and what to do to make sure it doesn't
cause you a problem.

collectResponses(QueryRef) ->
    collectResponses(QueryRef, []).

collectResponses(QueryRef, Responses) ->
    receive
        {responseData, QueryRef, _UserId, Results} ->
            collectResponses(QueryRef, [Results | Responses])
    after 100 ->
        Responses
    end.

Again, my hokey example collects results as long as they are present
or no new ones show up for 100 milliseconds.

What we have so far is a single request message sent to 5 processes
and a function which implements selective receive to collect only the
messages that are in response to the initial request from a variety of
responders (hopefully all, but not if some are slow to respond).

What happens if we have a slow responding database, but it does
actually produce results after 1/2 second? It was too slow to be
collected but it puts messages on the queue anyway. If we have no
mechanism to clear them, they will build up and cause things to
gradually slow down. So at some higher level we need the following
code:

main() ->
    receive
        %% Throw away late arriving results from a previous request
        {cleanup, QueryRef} -> dumpOldQueryResults(QueryRef)
    after 0 ->
        %% No cleanup pending: wait for either kind of message
        receive
            {cleanup, OldRef} -> dumpOldQueryResults(OldRef);
            {userRequest, UserId} -> doUserQuery(UserId)
        end
    end,
    main().

dumpOldQueryResults(QueryRef) ->
    receive
        {responseData, QueryRef, _UserId, _Results} ->
            dumpOldQueryResults(QueryRef)
    after 0 ->
        ok
    end.

In the main function, we give priority to cleaning up old messages.
This will keep the queue short, however, it ensures a full queue scan
for every user request. As long as the queue is short, that won't hurt
us. Dumping old messages just cycles as fast as it can accepting
messages that have our unique token and ignoring the rest of the data
in the message. If there are no clean up messages remaining, we then
accept a new user request (which will necessarily cause the message
queue to grow for a short period) and display the results.

What did we see? Selective receive used in 3 different ways:

1) To collect the results of a request (a two-way session conversation)
2) To handle self notifications for maintenance + user requests
3) To handle old messages from an expired session

It turns out the {cleanup, QueryRef} message is not necessary in the
above example and we can just consume all {responseData, ...}
messages inside main(), but it depends on how new requests are
placed on the queue and whether timing allows two requests to be
interleaved in the results set (you don't want to remove all the
responseData for a pending request that has not had time to collect
results yet). Structuring as above gave more explicit different uses of
selective receive.

The problem remaining in the code above is that there is no "catch all"
clause. Do we worry about that? It depends on how the system
evolves. If you interface to a known protocol and you have covered all
the messages supported via selective receive, then you could do
without a catch all. If your system is evolving or there are other
processes or programmers who might inject new message types, you
need a catch all in the main/0 function (although you have to be
careful not to consume something that should stay on the queue).

I have not tried this code, nor have I typed it into an erl prompt, so I
can't guarantee it even compiles. Mostly it should give you ideas about
ways to use selective receive.

What if we didn't have selective receive? I see two choices:

1) Start a thread and open a new socket to the databases for each user
   request. Maintain the conversations as independent channels.
2) Create a hash table of messages received related to each request.
   This requires managing the conversation correlations yourself.

Both of these approaches are much more code than selective receive
requires and the complexity of concepts does not decrease, so
selective receive is a better approach and a useful feature of erlang.

Is there a better way to manage the conversations rather than the
whole cleanup back channel messaging? If you can spawn a new
process for each request, the responses will go to privately owned
message queues. When enough responses have arrived, or enough time
has passed, the newly spawned request process returns its results and
terminates. Any messages stuck on the queue are eliminated. Any
future messages are silently discarded since there is no process to
receive them. If the backend DB process were monitoring the request
process, it could even interrupt its response to discard the results
rather than waiting for processing to complete and pass them on to a
non-existent process.

With erlang, there are many architectural choices when you consider
the uses of messaging and selective receive.

jay
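
The process-per-request alternative from the last paragraphs of Jay's post could be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: the module name scatter_gather, the {getData, Ref, ReplyTo, Request} request shape and the three-element {responseData, Ref, Data} reply are all hypothetical, and Timeout is applied per reply rather than as an overall deadline.

```erlang
-module(scatter_gather).
-export([query/3]).

%% Send Request to every pid in DbPids from a short-lived collector
%% process and return whatever replies arrive in time. Late replies go
%% to the dead collector and are discarded by the runtime.
query(DbPids, Request, Timeout) ->
    Caller = self(),
    {Collector, MRef} = spawn_monitor(fun() ->
        Ref = make_ref(),
        [P ! {getData, Ref, self(), Request} || P <- DbPids],
        Results = gather(Ref, length(DbPids), [], Timeout),
        Caller ! {results, self(), Results}
    end),
    receive
        {results, Collector, Results} ->
            erlang:demonitor(MRef, [flush]),
            Results;
        {'DOWN', MRef, process, Collector, Reason} ->
            {error, Reason}
    end.

%% Collect up to N replies tagged with Ref; give up after Timeout ms
%% without a new reply.
gather(_Ref, 0, Acc, _Timeout) ->
    lists:reverse(Acc);
gather(Ref, N, Acc, Timeout) ->
    receive
        {responseData, Ref, Data} ->
            gather(Ref, N - 1, [Data | Acc], Timeout)
    after Timeout ->
        lists:reverse(Acc)
    end.
```

The caller's own mailbox only ever sees one message per request, so no cleanup back channel is needed on its side.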
|
Jay,
Thanks for a very detailed and informative response. Although it
obviously depends on circumstances, I feel that, given Erlang's
extremely fast process creation time and small process size, I would
first consider your last option, namely, to create an individual
process per request, and use an ETS table to coordinate responses.

If there are very many responses to be collected for each request, I
would intuitively imagine in my "Erlang newbie fog" that using an ETS
table with its constant-time performance and no-garbage-collection
characteristics would be better on average than using selective
receive, which I understand has to do a linear scan and move
unprocessed messages to another area.

Of course, intuition often does not stand up to the reality of
performance measurements, so it would be interesting to see a
benchmark of the various architectural options you have described,
perhaps as a function of response time vs. request rate.

Regards,
Edwin Fine

On Sun, Jun 1, 2008 at 8:09 PM, Jay Nelson <[hidden email]> wrote:
|
On Jun 1, 2008, at 8:35 PM, Edwin Fine wrote:

> Jay,
>
> Thanks for a very detailed and informative response. Although it
> obviously depends on circumstances, I feel that, given Erlang's
> extremely fast process creation time and small process size, I
> would first consider your last option, namely, to create an
> individual process per request, and use an ETS table to coordinate
> responses. If there are very many responses to be collected for
> each request, I would intuitively imagine in my "Erlang newbie fog"
> that using an ETS table with its constant-time performance and
> no-garbage-collection characteristics would be better on average than
> using selective receive, which I understand has to do a linear scan
> and move unprocessed messages to another area. Of course, intuition
> often does not stand up to the reality of performance measurements,
> so it would be interesting to see a benchmark of the various
> architectural options you have described, perhaps as a function of
> response time vs. request rate.

If you spawn a separate process for each, there is no need for an ets
table. Just have the process send the results back "en masse".

Dying PID's final message:

    Caller ! {responseData, QueryRef, AllTheDataAssembledAsNeeded}

The caller's main loop can just:

    receive
        {responseData, QueryRef, Results} -> do_something(Results)
    end.

If you need to pass it back to another process, just arrange:

    QueryRef = {make_ref(), UltimatePidToSendResults}

Then the receive pattern above can become:

    {responseData, {Ref, EndPid}, Results} ->
        EndPid ! {response, self(), Results}

In erlang you find that you lose code where in other languages you
must add code. You don't check for errors, just code like it will
succeed. Don't reconstruct structured data as a hash table or tree
when the message can be tagged and returned to you with the correct
classification as you knew it to start with.

jay
|
Ulf Wiger wrote:
> I would really like to discourage people from avoiding
> selective receive because it's "expensive". It can be
> expensive on very large message queues, but this is
> a pretty rare error condition, and fairly easily observable.

  i think the "issue" of how the emu deals with huge in-queues is pretty
uninteresting.
  in my personal experience, every single time this has come up the real
problem has turned out to be lack of proper flow control (typically
using {active,true} sockets). having 100k messages in an in-queue is not
a realistic use case.
  the fact that this is not, afaik, particularly well documented is of
course a problem.

  mats
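
For readers who have not met the flow-control problem Mats mentions, here is a small sketch of the usual alternative to {active,true}: re-arming the socket with {active, once} after each packet, so that the socket cannot deliver data faster than the process consumes it. The module name active_once and the Handler callback are assumptions for illustration; inet:setopts/2 and the {tcp, ...} message shapes are the standard gen_tcp ones.

```erlang
-module(active_once).
-export([loop/2]).

%% Receive from Socket one packet at a time. The socket delivers at
%% most one {tcp, ...} message per setopts call, so the mailbox cannot
%% be flooded by a fast peer; TCP back-pressure does the rest.
loop(Socket, Handler) ->
    ok = inet:setopts(Socket, [{active, once}]),
    receive
        {tcp, Socket, Data} ->
            Handler(Data),
            loop(Socket, Handler);
        {tcp_closed, Socket} ->
            ok;
        {tcp_error, Socket, Reason} ->
            {error, Reason}
    end.
```

The process calling loop/2 must be the controlling process of the socket, otherwise the {tcp, ...} messages go elsewhere.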
|
2008/6/2 Mats Cronqvist <[hidden email]>:
This is true - but if one has no prior experience of this situation,
it is hard to understand why a system is behaving sluggishly. What
would be nice is having an option, as Ulf suggested earlier, to have
bounded message queues (kill the process if the message queue length
exceeds a certain value). That way, flow control problems will be more
readily visible to users.

In real life situations, when a process gets into this state, the only
way to fix it is to kill that process as it will probably never catch
up. This has been discussed before:
http://www.erlang.org/pipermail/erlang-questions/2006-January/018364.html

It also fits in well with the "Let it crash" philosophy.

cheers
Chandru
|
Chandru wrote:
> 2008/6/2 Mats Cronqvist <[hidden email] <mailto:[hidden email]>>:
>
>     Ulf Wiger wrote:
>     > I would really like to discourage people from avoiding
>     > selective receive because it's "expensive". It can be
>     > expensive on very large message queues, but this is
>     > a pretty rare error condition, and fairly easily observable.
>
>     i think the "issue" of how the emu deals with huge in-queues is
>     pretty uninteresting.
>     in my personal experience, every single time this has come up the
>     real problem has turned out to be lack of proper flow control
>     (typically using {active,true} sockets). having 100k messages in
>     an in-queue is not a realistic use case.
>     the fact that this is not, afaik, particularly well documented is
>     of course a problem.
>
> This is true - but if one has no prior experience of this situation,
> it is hard to understand why a system is behaving sluggishly. What
> will be nice is having an option, as Ulf suggested earlier, to have
> bounded message queues (kill the process if the message queue length
> exceeds a certain value). That way, flow control problems will be more
> readily visible to users.

  true enough.

  mats
|
On Tue, Jun 3, 2008 at 8:28 AM, Mats Cronqvist <[hidden email]> wrote:
> What
> will be nice is having an option, as Ulf suggested earlier, to have
> bounded message queues (kill the process if the message queue length
> exceeds a certain value).

+1

P.S. Sorry Mats for sending this only to You previously

--
Gleb Peregud
http://gleber.pl/

Every minute is to be grasped.
Time waits for nobody.
-- Inscription on a Zen Gong
|
On 2 Jun 2008, at 14:31, Chandru wrote:

> 2008/6/2 Mats Cronqvist <[hidden email]>:
> > Ulf Wiger wrote:
> > > I would really like to discourage people from avoiding
> > > selective receive because it's "expensive". It can be
> > > expensive on very large message queues, but this is
> > > a pretty rare error condition, and fairly easily observable.
> >
> > i think the "issue" of how the emu deals with huge in-queues is
> > pretty uninteresting.
> > in my personal experience, every single time this has come up the
> > real problem has turned out to be lack of proper flow control
> > (typically using {active,true} sockets). having 100k messages in
> > an in-queue is not a realistic use case.
> > the fact that this is not, afaik, particularly well documented is
> > of course a problem.
>
> This is true - but if one has no prior experience of this situation,
> it is hard to understand why a system is behaving sluggishly. What
> will be nice is having an option, as Ulf suggested earlier, to have
> bounded message queues (kill the process if the message queue length
> exceeds a certain value). That way, flow control problems will be
> more readily visible to users. In real life situations, when a
> process gets into this state, the only way to fix it is to kill that
> process as it will probably never catch up. This has been discussed
> before: http://www.erlang.org/pipermail/erlang-questions/2006-January/018364.html
>
> It also fits in well with the "Let it crash" philosophy.

I respectfully disagree. It is nigh on impossible to predict where
there might be some error that leads to a large queue, and this would
lead to "defensive programming" where every process has a short max
length. This would result in random crashes and loss of data for those
uncommon situations in a generally well designed system where there
might be a legitimate short term peak in queue lengths.

We already have a mechanism to restart if a queue grows too large
(actually 2 - process_info monitoring, and out of memory !)

cheers,
Sean
|
On Tue, Jun 3, 2008 at 12:11 PM, Sean Hinde <[hidden email]> wrote:
> I respectfully disagree. It is nigh on impossible to predict where
> there might be some error that leads to a large queue, and this would
> lead to "defensive programming" where every process has a short max
> length. This would result in random crashes and loss of data for those
> uncommon situations in an generally well designed system where there
> might be a legitimate short term peak in queue lengths.
>
> We already have a mechanism to restart if a queue grows too large
> (actually 2 - process_info monitoring, and out of memory !)

Maybe more-queued-than-a-set-threshold could be made into a traceable event?

What happened to the thread about creating a dtrace provider for erlang?
|
On 3 Jun 2008, at 11:38, Christian S wrote:

> On Tue, Jun 3, 2008 at 12:11 PM, Sean Hinde <[hidden email]> wrote:
>> I respectfully disagree. It is nigh on impossible to predict where
>> there might be some error that leads to a large queue, and this would
>> lead to "defensive programming" where every process has a short max
>> length. This would result in random crashes and loss of data for
>> those uncommon situations in an generally well designed system where
>> there might be a legitimate short term peak in queue lengths.
>>
>> We already have a mechanism to restart if a queue grows too large
>> (actually 2 - process_info monitoring, and out of memory !)
>
> Maybe more-queued-than-a-set-threshold could be made into a
> traceable event?

Could be nice yes. I still think it would also be much better if the
system didn't slow to a crawl if queues grow large - this is the
effect that almost guarantees the need for a restart. To quote Chandru
"In real life situations, when a process gets into this state, the
only way to fix it is to kill that process as it will *probably never
catch up*" (emphasis mine).

Both slowdown effects (GC copying and selective receive repeated
searching) seem quite amenable to smart optimisations.

> What happened to the thread about creating a dtrace provider for
> erlang?

I was left with the impression someone went away to start implementing
stuff ..

Cheers,
Sean
|
2008/6/3 Sean Hinde <[hidden email]>:
I agree it is nearly impossible to predict this - but what options
does a programmer have without this bounded queue facility?

1. Introduce message queue monitoring for every process which is
   potentially long lived, which imho is extra boilerplate code which
   reduces readability of core functionality. Also there will be
   different ways of doing it depending on how your process is
   structured (gen_fsm, gen_server, gen_event, pure erlang...). If all
   that one does upon detecting this condition is clear the message
   queue by discarding messages, or terminate the process, wouldn't it
   be good to have this built-in?

2. Have another process which monitors the entire system - which is
   not very scalable when you have hundreds of thousands of processes.

3. Wait for the system to crash in live and then figure out what
   happened.

cheers
Chandru
|
2008/6/3 Vlad Dumitrescu <[hidden email]>:
Hi,

I agree. This would be useful too.

cheers
Chandru
|
On 3 Jun 2008, at 12:30, Chandru wrote: > > We already have a mechanism to restart if a queue grows too large > (actually 2 - process_info monitoring, and out of memory !) > > > I agree it is nearly impossible to predict this - but what options > does a programmer have without this bounded queue facility. Well, I guess, mostly you need to have a design that doesn't lead to massive queue build up under sustained overload :-). This might mean input load regulation, or tweaking the process structure (the logger process problem). The system is unlikely to be performing to spec during this whole period of queue build up followed by cyclic restart - it doesn't really matter if the system restarts because it runs out of memory or cyclic restarts one process inside. It is still an outage for customers of the system. All you need to know is that it has crashed and why, so you can fix the bug. The erl_crash dump will tell you about the huge message queue. > 1. Introduce message queue monitoring for every process which is > potentially long lived, which imho is extra boiler plate code which > reduces readability of core functionality. Also there will be > different ways of doing it depending on how your process is > structured (gen_fsm, gen_server, gen_event, pure erlang...). If all > that one does upon detecting this condition is clear the message > queue by discarding messages, or terminate the process, wouldn't it > be good to have this built-in? Another option - fix the system so that it doesn't get into that state. > 2. have another process which monitors the entire system - which > is not very scalable when you have hundreds of thousands of processes. > > 3. Wait for the system to crash in live and then figure out what > happened. Exactly. It is a bad bug that leads to such queue build up. Crashing is fine in this case, and probably preferable to lingering onwards silently failing to provide service. 
Cheers,
Sean
|
2008/6/3 Sean Hinde <[hidden email]>:
Of course :-) But as you say, sometimes it is hard to predict it so the design probably didn't cater for it.

> The system is unlikely to be performing to spec during this whole period of queue build up followed by cyclic restart - it doesn't really matter if the system restarts because it runs out of memory or cyclic restarts one process inside. It is still an outage for customers of the system.

I have seen Erlang nodes die a few times without producing an erl_crash.dump. Sometimes it is because Ops got impatient and brutally killed all Erlang-related processes. Even if you did allow the system to run out of memory, for a system with a lot of memory, it will take a long time. All the while, the system will not be responding as it should be.
I'm all for fixing the system - all I'm asking for is facilities to detect this with less pain.
Exactly my point. I guess we both agree that it should crash. The disagreement seems to be about *when and how* it should crash. I would prefer that the process in question crash because, in all probability, its callers have timed out and are not expecting a response anyway.

cheers
Chandru
|
On 3 Jun 2008, at 13:49, Chandru wrote:

> 2008/6/3 Sean Hinde <[hidden email]>:
>
> Well, I guess, mostly you need to have a design that doesn't lead to
> massive queue build up under sustained overload :-). This might mean
> input load regulation, or tweaking the process structure (the logger
> process problem).
>
> Of course :-) But as you say, sometimes it is hard to predict it so
> the design probably didn't cater for it.

All telecom systems are soak tested at X times overload on all external interfaces before going into service, right :-) Although not all web systems perhaps !!

> Another option - fix the system so that it doesn't get into that
> state.
>
> I'm all for fixing the system - all I'm asking for is facilities to
> detect this with less pain.

If it is just detection you are after then have a process that calls process_info to get the queue length of all processes in the system once per minute and raise an alarm if any are above a threshold. That is not much overhead at all, and can be done without introducing new features.

> Exactly my point. I guess we both agree that it should crash. The
> disagreement seems to be about *when and how* it should crash. I
> would prefer that the process in question crash because, in all
> probability, its callers have timed out and are not expecting a
> response anyway.

Either way in all likelihood the same fault will manifest itself again within a few seconds.

I can't help but imagine the proposed feature misused in all sorts of quite disgusting ways. Shudder!

Sean
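For illustration, the periodic check described above (poll process_info for every process, alarm above a threshold) might look roughly like the sketch below. The module name queue_watch, its function names, and the use of error_logger:warning_msg/2 as a stand-in for a real alarm are assumptions made for this example, not something from the thread; erlang:processes/0 and erlang:process_info/2 are standard BIFs.

```erlang
-module(queue_watch).
-export([start/2, long_queues/1]).

%% Hypothetical helper: spawn a watcher that scans all processes every
%% IntervalMs milliseconds and reports queues longer than Threshold.
start(Threshold, IntervalMs) ->
    spawn(fun() -> loop(Threshold, IntervalMs) end).

%% Return [{Pid, Len}] for every live process whose message queue is
%% longer than Threshold. process_info/2 returns 'undefined' for a
%% process that has died in the meantime; the pattern in the second
%% generator silently skips those.
long_queues(Threshold) ->
    [{Pid, Len} || Pid <- erlang:processes(),
                   {message_queue_len, Len} <-
                       [erlang:process_info(Pid, message_queue_len)],
                   Len > Threshold].

loop(Threshold, IntervalMs) ->
    case long_queues(Threshold) of
        [] ->
            ok;
        Offenders ->
            %% Assumption: a log line stands in for "raise an alarm".
            error_logger:warning_msg("long message queues: ~p~n",
                                     [Offenders])
    end,
    timer:sleep(IntervalMs),
    loop(Threshold, IntervalMs).
```

Something like queue_watch:start(1000, 60000) would then check once per minute for queues above 1000 messages; both numbers are arbitrary and would need tuning per system.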
|
2008/6/3 Sean Hinde <[hidden email]>:
> If it is just detection you are after then have a process that calls
> process_info to get the queue length of all processes in the system
> once per minute and raise an alarm if any are above a threshold. That
> is not much overhead at all, and can be done without introducing new
> features.

Just for fun, I made a few additions to plain_fsm, to play around with this. The idea is that since you have a hook there anyway, you might parameterize that hook so that it can check certain limits upon receive.

The example program fsm_example.erl had a state with an extended_receive and a timeout clause. I added an option to tell plain_fsm to react if the message queue grew past 3 messages:

    spawn_link() ->
        plain_fsm:spawn_link(?MODULE, fun() ->
                                          process_flag(trap_exit,true),
                                          queue_limit(),
                                          idle(mystate)
                                      end).

    queue_limit() ->
        plain_fsm:store_options(
          [{watch, [{queue, 3, fun(S) ->
                                   io:format("msg queue too long!~n"),
                                   flush(),
                                   S
                               end}]}
          ]).

Testing the code in the shell:

    1> P = fsm_example:spawn_link().
    <0.33.0>
    timeout in idle
    timeout in idle
    2> [P ! hi || _ <- lists:seq(1,10)].
    [hi,hi,hi,hi,hi,hi,hi,hi,hi,hi]
    timeout in idle
    msg queue too long!
    timeout in idle
    timeout in idle

In the current version, you can insert checks for message queue length and heap size, and run_queue, as a quick and dirty way to detect CPU overload.

I haven't checked it in in Jungerl - not convinced yet that it's a good idea. If anyone wants to play with it, I can send you the code.

Anyway, you're absolutely right in that this kind of check can be made fairly easily without introducing new 'features'.

BR,
Ulf W
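For readers without the modified plain_fsm, the same kind of check-on-receive can be approximated in plain Erlang. The sketch below is not the plain_fsm code from this post; the module name queue_guard and the functions check_queue/3 and flush/0 are hypothetical names chosen for the example. It only shows the idea of comparing the process's own queue length against a limit and calling a user-supplied fun when the limit is exceeded.

```erlang
-module(queue_guard).
-export([check_queue/3, flush/0]).

%% Hypothetical hook: if the calling process has more than Limit
%% messages queued, return OnOverflow(State); otherwise return State
%% unchanged. Meant to be called just before a receive.
check_queue(Limit, OnOverflow, State) ->
    {message_queue_len, Len} =
        erlang:process_info(self(), message_queue_len),
    if
        Len > Limit -> OnOverflow(State);
        true        -> State
    end.

%% Discard everything currently in the caller's mailbox and return
%% the number of messages dropped.
flush() ->
    flush(0).

flush(N) ->
    receive
        _ -> flush(N + 1)
    after 0 ->
        N
    end.
```

A loop would call something like S1 = queue_guard:check_queue(3, fun(S) -> queue_guard:flush(), S end, S0) before each receive. Whether discarding messages is an acceptable reaction is exactly the design question debated in this thread, so the flush here is only one possible handler.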
