|
In the case of asynchronous signal communication, is it possible to discard the queued signals if the process detects that it will not be able to process them?
I remember that process_info returns the signal queue length, so it is possible to set a threshold. But I did not find anything for discarding.
Thanks,
Jozsef
_______________________________________________ erlang-questions mailing list [hidden email] http://erlang.org/mailman/listinfo/erlang-questions |
|
On 2011-06-15, at 00:22 , József Bérces wrote:
> In case async signal communication, is it possible to discard the signals queuing if the process detects that it will not be able process them?
>
> I remember that the process info returns the signal queue length, so it is possible to set a threshold. But I did not find anything for discarding.

Any message read from a process's mailbox (via a receive statement) is removed from the mailbox. If you don't act on it, it will be discarded. So the way to discard a message is "read it, and ignore it". |
|
Thanks.
Actually, the real question is the performance of going through the inbox and discarding messages one by one.
If we are talking about an overloaded system that, for any reason, cannot tell the sender to reduce the traffic, then probably the only way to avoid a system crash is to discard the messages and notify the operators (raise an alarm, write an event log, etc.) about this. So the discard must be very quick, as we are already overloaded. Do we have something for that? |
|
On Wed, Jun 15, 2011 at 6:02 AM, József Bérces <[hidden email]> wrote:
> Thanks.
>
> Actually, the real question is the performance of going through the inbox and discarding 1 by 1.
>
> If we are talking about an overloaded system that, for any reason, cannot tell the sender to reduce the traffic, then probably the only way to avoid a system crash is to discard the messages and notify the operators (raise an alarm, write event log, etc.) about this. So the discard shall be very quick as we are already overloaded. Do we have something for that?

    flush() ->
        receive
            _Message -> flush()
        after 0 ->
            ok
        end.

is really fast. If you only _think_ that it is not fast, then you are on the wrong track, because you are going to build a highly loaded system based on guesses. You must have benchmarks that resemble the real load. And if those benchmarks show that it is not fast enough, then you will need to do something more. |
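A minimal sketch of how the queue-length threshold from the original question could be combined with the flush loop above. The module name mailbox_guard and the function maybe_flush/1 are made up for illustration; only erlang:process_info/2 and receive ... after 0 are standard Erlang. The flush returns the number of dropped messages so that the caller can raise an alarm or write a log entry.

```erlang
-module(mailbox_guard).
-export([flush/0, maybe_flush/1]).

%% Drop every message currently in the caller's mailbox and
%% return how many were dropped.
flush() ->
    flush(0).

flush(Dropped) ->
    receive
        _Any -> flush(Dropped + 1)
    after 0 ->
        Dropped
    end.

%% Flush only when the caller's mailbox holds more than Threshold
%% messages (Threshold is a hypothetical, application-chosen limit).
maybe_flush(Threshold) ->
    {message_queue_len, Len} =
        erlang:process_info(self(), message_queue_len),
    case Len > Threshold of
        true  -> {flushed, flush()};
        false -> {ok, Len}
    end.
```

Calling mailbox_guard:maybe_flush(10000) at the top of a server loop would drop the whole backlog in one pass once it exceeds 10000 messages. Whether that is fast enough is something only a benchmark under realistic load can tell.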
|
In reply to this post by József Bérces (LA/ETH)
On Wed, Jun 15, 2011 at 04:02, József Bérces <[hidden email]> wrote:
> If we are talking about an overloaded system that, for any reason, cannot tell the sender to reduce the traffic, then probably the only way to avoid a system crash is to discard the messages and notify the operators (raise an alarm, write event log, etc.) about this. So the discard shall be very quick as we are already overloaded. Do we have something for that?

I dare say that you want to concentrate on avoidance rather than recovery. In other words, you want to build some flow control into the application so that the problem never occurs in the first place.

The problem with recovering by emptying the queue is that, from the senders' point of view, the messages are silently lost, so they won't react and will probably push us into another recovery phase soon thereafter. If we have a flow-control system built in, the senders will know, mainly due to timeouts, and can hence react to the situation. Effectively, you are implementing queue bounds through the flow-control scheme, so you can limit how many messages are in the queue and keep the queue loaded at a sustainable rate.

Another reason I'd prefer avoidance is that recovery is often far more expensive in the long run. You will perhaps try to resend the messages or redo a lot of heavy computation. Avoiding that in the first place will also improve the throughput.

-- J. |
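One way to read the flow-control suggestion above is a request/acknowledge scheme in which each sender has at most one outstanding request and notices overload through a timeout. The sketch below is an assumption about what such a scheme could look like, not the poster's code; the module name bounded_call, the message shapes and echo_server/0 are hypothetical.

```erlang
-module(bounded_call).
-export([call/3, echo_server/0]).

%% Send Request to Server and wait for the matching reply.
%% The caller blocks until the reply, a server crash or the timeout,
%% so it cannot pile up more than one message in the server's queue.
call(Server, Request, TimeoutMs) ->
    Ref = erlang:monitor(process, Server),
    Server ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            {ok, Reply};
        {'DOWN', Ref, process, _Pid, Reason} ->
            {error, {server_down, Reason}}
    after TimeoutMs ->
        %% A late reply may still arrive afterwards; a real system
        %% would have to decide how to drop it.
        erlang:demonitor(Ref, [flush]),
        {error, timeout}
    end.

%% Trivial server used only to exercise call/3: it answers every
%% request with the request itself.
echo_server() ->
    receive
        {request, From, Ref, Request} ->
            From ! {reply, Ref, Request},
            echo_server()
    end.
```

With this shape the server's queue length is bounded by the number of concurrent callers, and a caller that gets {error, timeout} can back off instead of sending more. OTP's gen_server:call/3 follows the same general idea.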
|
Hi all,
I was reading the "discarding signals" thread and thought that maybe a better approach to handling this condition would be to kill the process if the mailbox exceeds a certain size. It seems to me that it would be a cleaner approach than flushing the mailbox and attempting to recover, and it also fits better with Erlang's "let it crash" philosophy.

Basically, the way I see it, we'd have this implemented at the system level. One could set a flag, similar to, say, fullsweep_after, either per process or system wide. The default could be infinity, which reverts to the current behavior.

Any thoughts on this? Any technical or philosophical arguments, pro or contra?

For the record, I do believe that one should try as hard as possible to avoid getting into this situation to begin with, but occasionally it happens and it is one of Erlang's more spectacular failure modes.

Cheers, Mihai |
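The proposal above is for a flag inside the runtime system. As a rough user-level approximation, a separate watchdog process can poll the queue length and kill the offender. This is only a sketch under that assumption: the module name queue_watchdog and its parameters are hypothetical, it works only for local pids, and the polling adds some overhead of its own.

```erlang
-module(queue_watchdog).
-export([start/3]).

%% Start a watchdog that checks Pid every IntervalMs milliseconds and
%% kills it once its mailbox holds more than MaxLen messages.
start(Pid, MaxLen, IntervalMs) ->
    spawn(fun() -> loop(Pid, MaxLen, IntervalMs) end).

loop(Pid, MaxLen, IntervalMs) ->
    case erlang:process_info(Pid, message_queue_len) of
        undefined ->
            %% The watched process is already gone; nothing left to do.
            ok;
        {message_queue_len, Len} when Len > MaxLen ->
            %% Untrappable kill; a supervisor can restart the process
            %% with an empty mailbox.
            exit(Pid, kill);
        {message_queue_len, _Len} ->
            timer:sleep(IntervalMs),
            loop(Pid, MaxLen, IntervalMs)
    end.
```

If the watched process runs under a supervisor, the kill shows up as an ordinary abnormal exit, which is where the "let it crash" argument comes in.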
|
I think we will end up in the same place as the discussion about how not to kill an Erlang system due to OOM. |
|
In reply to this post by Mihai Balea
I once made an experimental version of plain_fsm which could enforce a message queue limit. I don't think I ever checked it in, but I remember that it wasn't terribly hard to implement.
Plain_fsm itself nowadays lives at: http://github.com/esl/plain_fsm
BR, Ulf
On 15 Jun 2011, at 18:02, Mihai Balea wrote: |
|
In reply to this post by Max Lapshin-2
On Jun 15, 2011, at 12:07 PM, Max Lapshin wrote:
> I think we will stop in the same place, as with discussion: How not to
> kill erlang system due to OOM.

Are you referring to this thread?
http://erlang.2086793.n4.nabble.com/Why-Beam-smp-crashes-when-memory-is-over-tt2118397.html#none

Mihai |
|
On Wed, Jun 15, 2011 at 8:32 PM, Mihai Balea <[hidden email]> wrote:
> Are you referring to this thread?
>
> http://erlang.2086793.n4.nabble.com/Why-Beam-smp-crashes-when-memory-is-over-tt2118397.html#none

Exactly. |
|
Thanks for all the thoughts and suggestions. If I got it right, there were two main branches:
1. Avoid the congestion situation
2. Detect and kill/restart the problematic process(es)
The problem with these approaches is that Erlang applications do not just talk among themselves but also receive input from other nodes. Those nodes can be very numerous and uncontrollable.
As an example, take a mobile network, where the traffic is generated by millions of subscribers using mobile devices from many vendors. In this case we (1) cannot control the volume of the traffic and (2) cannot make sure that all the devices follow the protocol.
So there can be situations when we cannot avoid congestion simply because the source of the traffic is beyond our reach.
Killing and restarting is not the right way either:
- A restart causes a total outage for a while, which is very unwelcome to the users (e.g. network operators) of our boxes
- Erlang is advertised as robust, but killing and restarting is not a sign of robustness. So the user can easily call us liars: "You say your node is robust but it is restarting frequently!"
So I still believe that very quick discarding of signals is key to real robustness. Obviously, it should be used *only* in the right circumstances, but in those cases it would be the only way to keep the node alive and minimize the traffic loss.
Then the question is still open: is discarding one by one the best we can do, or is there something more efficient for getting rid of the excess traffic? |
|
On Wed, Jun 15, 2011 at 8:11 PM, József Bérces <[hidden email]> wrote:
> Thanks for all the thoughts and suggestions. If I got it right, there were two main branches:
>
> 1. Avoid the congestion situation
> 2. Detect and kill/restart the problematic process(es)
>
> The problem with these approaches that the Erlang applications are not just playing with themselves but receive input from other nodes. Those nodes can be very numerous and uncontrollable.
>
> As an example, just let's take the mobile network where the traffic is generated by millions of subscribers using mobile devices from many vendors. In this case we (1) cannot control the volume of the traffic and (2) cannot make sure that all the devices follow the protocol.
> So there can be situations when we cannot avoid congestion simply because the source of the traffic is beyond our reach.

The Erlang distribution protocol is only suitable for connecting a relatively small number of trusted nodes on a LAN.

If you were to expertly implement such an application you would have some Erlang nodes speaking to these mobile devices, but with another protocol (probably over TCP), and then you would have as much control as you need over the other details. For example, you can avoid congestion by rate limiting or refusing to accept new connections. When the Erlang nodes speak to each other (with or without Erlang distribution), you also control that protocol and can avoid congestion there as well.

-bob |
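The reply above mentions rate limiting at the edge without naming a technique. A token bucket is one common choice, shown here purely as an illustration; the module name token_bucket and the map layout are assumptions, and the clock is passed in explicitly (in milliseconds) so that the logic stays a pure function.

```erlang
-module(token_bucket).
-export([new/3, take/2]).

%% RatePerSec: tokens added per second. Burst: bucket capacity.
%% NowMs: current time in milliseconds, supplied by the caller.
new(RatePerSec, Burst, NowMs) ->
    #{rate => RatePerSec, burst => Burst,
      tokens => float(Burst), last => NowMs}.

%% Try to take one token at time NowMs. Returns {accept, Bucket} when
%% the request may proceed and {reject, Bucket} when it should be
%% refused (for example by closing or not accepting the connection).
take(#{rate := Rate, burst := Burst, tokens := Tokens, last := Last} = B,
     NowMs) ->
    ElapsedMs = max(0, NowMs - Last),
    Refilled = min(float(Burst), Tokens + ElapsedMs * Rate / 1000),
    case Refilled >= 1.0 of
        true  -> {accept, B#{tokens := Refilled - 1.0, last := NowMs}};
        false -> {reject, B#{tokens := Refilled, last := NowMs}}
    end.
```

A connection acceptor could keep one bucket in its loop state and call take/2 with erlang:monotonic_time(millisecond) before accepting each new connection.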
|
On Thu, Jun 16, 2011 at 8:55 AM, Bob Ippolito <[hidden email]> wrote:
I think I get the gist of it, but could someone quantify how "small" a "relatively small number" is here? Fifty, a few hundred, a couple of thousand? What is the largest 'Erlang cloud' (i.e. hosts running Erlang processes communicating across nodes in a cluster) that has been seen?

> If you were to expertly implement such an application you would have ...

In the telecom world such a situation is pretty common. However, consider that even to discard a message (due to congestion) that starts a new transaction, I may need to determine things like priority, transaction ID or application-level session ID. I would then need the ability to decode that much of the message in the rate-limiter process, which, as I understand the suggestion, would be written in C/C++ and communicate over IP or another protocol. Duplicating the decode logic would, I guess, be unavoidable in most such cases. Or is there a better behavioral pattern someone has figured out? |
|
On Jun 16, 2011, at 8:06 AM, Banibrata Dutta wrote:
> I think I get the jist of it, but could someone quantify, as to how "small" is "relatively small number" here ? Fifty, few hundreds, couple of thousands ?
> What is the largest 'Erlang cloud' (i.e. hosts running Erlang processes communicating accross nodes in a cluster), that has been seen ?

I had a cluster of 150 nodes on Amazon EC2, one node per small instance. That was too much with mesh connections between nodes, e.g. processes exchanging messages everywhere. Heavy message traffic between nodes caused frequent network splits (partitions). I had to resort to limiting traffic to make it manageable.

--
for hire: mac osx device driver ninja, kernel extensions and usb drivers
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont |
|
In reply to this post by József Bérces (LA/ETH)
I sense a few misconceptions here. First, I assume that the proposal for "kill and restart" is to kill and restart the particular process that has an inbox that is "too full." This does not mean that the client crashes -- it means that those messages are lost. This is no different to how Erlang systems are robust in the face of rare crashing bugs (as opposed to, say, C++, where generally the entire system goes down because of a rare stray pointer bug, for example). Basically, crashing and re-starting an Erlang worker process is just one way of clearing out the message queue, and also making sure that any possible state corruption goes away because the process re-starts "afresh." The Erlang/OTP supervision tree is designed to work in this mode.
Second, when you have an amount of load that comes in, and you cannot control it, then what is generally done is to simply model the load, model the application, and provision enough server hardware that you can keep up with the load. In an emergency (an unexpected surge that doubles load compared to anything seen before), you'll additionally want capabilities to reject some part of the incoming requests. For HTTP, this is where status 503 (Service Unavailable) comes in, for example. I'm assuming all your clients use some common protocol, like TCP or HTTPS or whatever, and that you do appropriate protection against un-trusted data at that layer.
When it comes to cluster sizes, we're running an 11-node cluster with >100,000 users and it's running mostly idle on a gigabit switch. I would consider this a "smallish cluster." We're planning on increasing our data rates a lot in the future, though -- at some point, we'll need to provision to 10 GBps. We scale using a crossbar and consistent hashing.
I've heard of clusters that do a million users per node, and use broadcast to all other nodes in a cluster of 50 nodes. That also scales on available networking hardware, as long as most users are not generators of large or frequent packets.
I would advise against single-core nodes or cloud-based nodes that don't have local networking, because these get much less work done per node (and per network packet) than larger systems. Buying a single server from Dell today, you get 12 cores and 24 hardware threads even on the low end. Next year, that number will be 40, 80 or even 160 (for the higher end).
So, in your case, I would suggest making sure that you know what the protocol is that clients use to connect to the server, and then making sure that you have some way of reporting temporary capacity overload to the clients, and then making sure that you have good metrics on the utilization of the server cluster (CPU, memory, network bandwidth, etc) so that you can put in more hardware when needed. If CPU goes 100% for a long time (which would be a precondition for a queue to fill up), start rejecting requests and log an alert for the operator to buy more hardware.
I also recommend modeling the traffic across the backplane of the Erlang nodes. How much data do you send per user "event" to other users, and how many other users? Broadcast or point-to-point? Sum it all up, double it, and see if you can still swing that on your current network backplane. If not, buy a bigger network, or start working on ways to compress/reduce the data stream :-)
Sincerely, jw
-- Americans might object: there is no way we would sacrifice our living standards for the benefit of people in the rest of the world. Nevertheless, whether we get there willingly or not, we shall soon have lower consumption rates, because our present rates are unsustainable. |
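The "sum it all up, double it" estimate suggested above can be written down as a few lines of arithmetic. This is only a back-of-the-envelope sketch: the module name backplane_model, the parameter names and any numbers fed into it are hypothetical and not taken from the thread.

```erlang
-module(backplane_model).
-export([required_bits_per_sec/4, fits/5]).

%% Estimated backplane traffic in bits per second, with the 2x safety
%% margin suggested above.
%%   Users               - number of connected users
%%   EventsPerUserPerSec - events each user generates per second
%%   BytesPerEvent       - payload sent to each recipient per event
%%   Fanout              - number of other users each event is sent to
required_bits_per_sec(Users, EventsPerUserPerSec, BytesPerEvent, Fanout) ->
    2 * 8 * Users * EventsPerUserPerSec * BytesPerEvent * Fanout.

%% True when the estimate fits within a link of LinkBitsPerSec.
fits(Users, EventsPerUserPerSec, BytesPerEvent, Fanout, LinkBitsPerSec) ->
    required_bits_per_sec(Users, EventsPerUserPerSec,
                          BytesPerEvent, Fanout) =< LinkBitsPerSec.
```

For example, 100,000 users each sending one 100-byte event per second to 10 other users comes to 1.6 Gbit/s after doubling, which no longer fits on a single gigabit link.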
|
Jon,
On Jun 16, 2011, at 9:09 AM, Jon Watte wrote:
> I would advise against single-core nodes or cloud-based nodes that don't have local networking,

What do you mean by "don't have local networking"? I thought the loopback network interface is a given.

> I also recommend modeling the traffic across the backplane of the Erlang nodes. How much data do you send per user "event" to other users, and how many other users? Broadcast or point-to-point? Sum it all up, double it, and see if you can still swing that on your current network backplane. If not, buy a bigger network, or start working on ways to compress/reduce the data stream :-)

I would also keep in mind that messages between Erlang processes and inter-node "kernel pings" go over the same socket. This means that pings start to lag as the message traffic gets heavy. Erlang nodes split when pings are significantly delayed, and then you're in deep doodoo. This used to be the case a couple of years ago; I don't think anything has changed since. |
|
In reply to this post by Jon Watte
On Jun 16, 2011, at 1:09 AM, Jon Watte wrote:
> First, I assume that the proposal for "kill and restart" is to kill and restart the particular process that has an inbox that is "too full." This does not mean that the client crashes -- it means that those messages are lost. This is no different to how Erlang systems are robust in the face of rare crashing bugs (as opposed to, say, C++, where generally the entire system goes down because of a rare stray pointer bug, for example). Basically, crashing and re-starting an Erlang worker process is just one way of clearing out the message queue, and also making sure that any possible state corruption goes away because the process re-starts "afresh." The Erlang/OTP supervision tree is designed to work in this mode.

That is exactly what I was proposing. When you kill just one offending process, you will probably lose one call (or transaction, etc.) but you don't bog down a scheduler or even an entire VM. This matters especially because processes with growing message queues tend to exhibit runaway memory consumption as well, which will eventually bring down the VM (see the thread Max mentioned).

Mihai |
|
In reply to this post by Joel Reymont
By "local networking" I mean "single-hop networking" -- you really want your Erlang cluster to be "local" in the sense that all the nodes share a single, non-blocking switch, for as far as you can push that (I know you can get 48-port 10-gig switches reasonably affordably; I imagine density is getting even higher).
When you lease hardware on a cloud, you don't know how far away each instance will be, and it's pretty clear that you will share network resources with other virtual instances, so it's not a great fit for applications that are highly communications driven or require low latency. The latency is even more problematic, because virtualization will cause random scheduling jitter to the order of 30 milliseconds at best, and > 1000 milliseconds at worst, last I measured. I've never measured worse than 3 ms jitter on "bare metal" Linux-based installations.
Sincerely, jw |
