Troubleshooting a high-load scenario

Troubleshooting a high-load scenario

Joel Reymont
Folks,

I have a test harness that launches poker bots against a poker  
server. The harness is written in Erlang but the poker server is C++  
on Windows. The poker server uses completion ports and async IO.

I'm running into trouble with just 500 bots playing on the server,  
launched from the same VM. It appears that the bots get their  
commands up to 1 minute late. I'm trying to troubleshoot this and I'm  
looking for ideas. I would like to believe that it's not Erlang  
running out of steam but the C++ server :-).

I read the packets manually since the length header is little-endian and
counts the whole packet (including the 4-byte length field itself). I enabled
{nodelay, true} on the socket since I always write complete packets  
to the socket.
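A sketch of that manual framing (the module name and exact layout are assumptions, not the real protocol); the built-in {packet, 4} mode can't be used here because it expects a big-endian length that excludes the header:

```erlang
-module(frame).
-export([decode/1]).

%% Decode one packet: a 4-byte little-endian length that counts the whole
%% packet, header included, followed by the payload (Len - 4 bytes).
decode(<<Len:32/little, Rest/binary>>) when Len >= 4 ->
    PayloadLen = Len - 4,
    <<Payload:PayloadLen/binary, Tail/binary>> = Rest,
    {Payload, Tail}.
```

Anything after the payload comes back as Tail so the caller can carry it into the next decode.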

I use selective receive and used to have no flow control in my socket  
reader. It would just read packet length, read packet and send the  
whole packet to its parent. Message queues were filling up when I was  
doing that so I only read the next network message once the current  
one has been processed.
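That one-packet-in-flight scheme might look like this (a sketch; the {packet, ...}/{ack, ...} exchange with the parent is a made-up protocol):

```erlang
-module(reader).
-export([loop/2]).

%% Flow-controlled socket reader: read one little-endian length-prefixed
%% packet, hand it to the parent, then block until the parent acknowledges
%% before touching the socket again. While it blocks, unread data backs up
%% into the kernel buffers and TCP flow control throttles the sender.
%% (Assumes every packet has a payload: recv(Sock, 0) means "any amount".)
loop(Sock, Parent) ->
    {ok, <<Len:32/little>>} = gen_tcp:recv(Sock, 4),
    {ok, Packet} = gen_tcp:recv(Sock, Len - 4),  %% Len counts its own 4 bytes
    Parent ! {packet, self(), Packet},
    receive
        {ack, Parent} -> loop(Sock, Parent)
    end.
```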

I'm using the default socket buffer size for sending and receiving.  
I'm not sure what the default buffer size is, as it's not stated in 'man
inet'. I do not have the source code for the poker server and I'm not  
sure what IOCP does in the scenario when I delay reading from the  
socket on my end. I'm being told by the client's techs that I could  
be getting the command 1 minute late because I'm reading it from the  
socket 1 minute late and the command sits in the network buffers all  
the while.

How do I troubleshoot this scenario? The bots don't do much
processing themselves; they basically make a decision and shoot a command
back. They don't even react to all commands. The server spits out  
packets all the time, though, since all bots in the game get game  
notifications and table state updates from the lobby.

        Thanks, Joel

--
http://wagerlabs.com/






RE: Troubleshooting a high-load scenario

Ulf Wiger (TN/EAB)
 
Joel Reymont wrote:
>
> I use selective receive and used to have no flow control in
> my socket reader. It would just read packet length, read
> packet and send the whole packet to its parent. Message
> queues were filling up when I was doing that so I only read
> the next network message once the current one has been processed.

Have you tried setting your socket reader to high priority?

  process_flag(priority, high).


I usually set middle-man processes to high priority since
they don't originate any load anyway. They should do their
thing and get out of the way as quickly as possible.
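The flag can also be set atomically at spawn time with spawn_opt (the fun body here is just a stand-in for a real reader):

```erlang
%% Equivalent to the process calling process_flag(priority, high) itself,
%% but with no window where it runs at normal priority first.
Pid = spawn_opt(fun() -> receive stop -> ok end end, [{priority, high}]),
{priority, high} = erlang:process_info(Pid, priority),
Pid ! stop.
```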

Regards,
Uffe

Re: Troubleshooting a high-load scenario

Joel Reymont
Are you suggesting this _with_ the socket reader waiting for the  
message to be processed before reading the next one?

On Jan 17, 2006, at 11:43 AM, Ulf Wiger (AL/EAB) wrote:

> Have you tried setting your socket reader to high priority?
>
>   process_flag(priority, high).
>
>
> I usually set middle-man processes to high priority since
> they don't originate any load anyway. They should do their
> thing and get out of the way as quickly as possible.

--
http://wagerlabs.com/






RE: Troubleshooting a high-load scenario

Ulf Wiger (TN/EAB)
In reply to this post by Joel Reymont

> Are you suggesting this _with_ the socket reader waiting for
> the message to be processed before reading the next one?

No. That seems a bit like reversed flow control anyway.
Do you expect the server to overload the bots?
Doesn't the majority of the load originate with the bots?

/Uffe

Re: Troubleshooting a high-load scenario

Joel Reymont
The majority of the load originates with the server. It keeps sending  
moves by other players, table state updates, etc. etc. etc. How does  
that change your answer?

On Jan 17, 2006, at 11:57 AM, Ulf Wiger (AL/EAB) wrote:

>
>> Are you suggesting this _with_ the socket reader waiting for
>> the message to be processed before reading the next one?
>
> No. That seems a bit like reversed flow control anyway.
> Do you expect the server to overload the bots?
> Doesn't the majority of the load originate with the bots?
>
> /Uffe

--
http://wagerlabs.com/






Re: Troubleshooting a high-load scenario

Joel Reymont
In reply to this post by Ulf Wiger (TN/EAB)
Assume that there are 10 bots sitting at a table and playing.
For every message that a bot sends to the poker server there will be
at least 10 messages and probably more like 20-30 sent to each bot
by the poker server.

It appears that the bots cannot cope with such an influx of messages
but I would like to be sure that the issue is indeed in my harness
and not in the poker server.

On Jan 17, 2006, at 11:57 AM, Ulf Wiger (AL/EAB) wrote:

>
>> Are you suggesting this _with_ the socket reader waiting for
>> the message to be processed before reading the next one?
>
> No. That seems a bit like reversed flow control anyway.
> Do you expect the server to overload the bots?
> Doesn't the majority of the load originate with the bots?

--
http://wagerlabs.com/






Re: Troubleshooting a high-load scenario

Matthias Lang
Joel Reymont writes:

 > It appears that the bots cannot cope with such an influx of messages
 > but I would like to be sure that the issue is indeed in my harness
 > and not in the poker server.

Are you working to understand the system, or just twiddling random
knobs in public in the hope of sudden "problem disappearance"?

I'd _start_ with this experiment

  1. Find a value N such that

         a) a load generator running N bots runs acceptably

     AND b) a load generator running M bots, where N < M < 2N,
            does not run acceptably.

  2. Use two load generators (i.e. separate, otherwise idle
     machines!), each running N bots.

The system's behaviour in those two runs will most likely give you a
strong indication of where the bottleneck is. But random twiddling is
less ambitious and more amusing, especially for the audience.

Next step: sniff the network and analyse the traffic.

Matthias

N.B. Your description of the problem leaves open the possibility of
the number of messages being quadratically related to the number of
subscribers. My experiment above is set up for a linear relation.

Re: Troubleshooting a high-load scenario

Joel Reymont
In reply to this post by Joel Reymont

On Jan 17, 2006, at 1:18 PM, Roger Larsson wrote:

> Do you simulate decision time? If not you can easily create an  
> overload
> by only a few bots.

What do you mean by simulating decision time? My goal is to respond  
as fast as possible within the limits given by the server. I'm given  
5, 15, or 30 seconds to respond, for example.

> Use tcpdump to see when the messages actually arrives.

I'll try this indeed.

> Why selective receive here? Shouldn't all received messages be  
> processed?

Yes, they should be processed. It's natural, though, to write

receive
  {tcp, ... } -> ...;
  {script, ...} -> ...;
  {timeout, ...} -> ...
end

as opposed to

receive
   X -> process(X)
end
...
process({tcp, ...}) -> ...;
process({script, ...}) -> ...;
etc

There seems to be a clear benefit to the latter approach when your  
queues start getting large. Is this correct?
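The benefit of the latter form is that a catch-all pattern never leaves unmatched messages behind, so a long queue is scanned once rather than rescanned on every receive. A catch-all drain with a zero timeout makes the second style concrete (a sketch; the module name is invented):

```erlang
-module(mbox).
-export([drain/1]).

%% Pull every queued message exactly once, in arrival order, returning as
%% soon as the mailbox is momentarily empty (after 0 means "don't block").
drain(Acc) ->
    receive
        Msg -> drain([Msg | Acc])
    after 0 ->
        lists:reverse(Acc)
    end.
```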

        Thanks, Joel

--
http://wagerlabs.com/






Re: Troubleshooting a high-load scenario

Joel Reymont
In reply to this post by Matthias Lang

On Jan 17, 2006, at 1:28 PM, Matthias Lang wrote:

> Are you working to understand the system, or just twiddling random
> knobs in public in the hope of sudden "problem disappearance"?

I'm trying to understand what knobs to twiddle. I'm having trouble  
with this and thus I'm asking the public.

> I'd _start_ with this experiment
>
>   1. Find a value N such that
>
>          a) a load generator running N bots runs acceptably

I have established that 500 bots from one VM run fine.

>      AND b) a load generator running M bots, where N < M < 2N,
>             does not run acceptably.

I have established that 1000 bots do not run fine on one VM. Running  
two VMs with 500 bots each fails also.

>   2. Use two load generators (i.e. separate, otherwise idle
>      machines!), each running N bots.

We ran that and it appears that the bottleneck could be on the  
server. One machine running 500 bots is fine. Two machines running  
500 bots is not.

> Next step: sniff the network and analyse the traffic.

I will look into that.

> N.B. Your description of the problem leaves open the possibility of
> the number of messages being quadratically related to the number of
> subscribers. My experiment above is set up for a linear relation.

Every bot gets notifications of other bots. So whenever 1 bot acts  
everyone else gets notification. 2 bots would generate 2 messages for  
every action, 10 bots would generate 10 messages, etc.

        Thanks, Joel

--
http://wagerlabs.com/






Re: Troubleshooting a high-load scenario

Matthias Lang

Joel's test case #1
 > I have established that 500 bots from one VM run fine.

Joel's test case #2
 > I have established that 1000 bots do not run fine on one VM. Running  
 > two VMs with 500 bots each fails also.

Joel's test case #3
 > We ran that and it appears that the bottleneck could be on the  
 > server. One machine running 500 bots is fine. Two machines running  
 > 500 bots is not.

Joel's model of the problem:
 > Every bot gets notifications of other bots. So whenever 1 bot acts  
 > everyone else gets notification. 2 bots would generate 2 messages for  
 > every action, 10 bots would generate 10 messages, etc.

That's not linear. Yet you've chosen bot numbers as though the system
was linear, which makes your results mostly useless.

Here's a page which introduces the difference between linear and
quadratic relations:

  http://chatterbeeshomework.homestead.com/chatterbeesmath3.html

In test case #3, your bots have to deal with twice as many incoming
messages per second as in test case #1. So the numbers you have chosen
mean that your results don't allow you to conclude that the server is
the problem.
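Under Joel's stated broadcast model the totals grow with the square of the bot count; a back-of-envelope sketch (A, a per-bot action rate, is an invented constant):

```erlang
-module(scale).
-export([rate/2]).

%% Every action by one of N bots is broadcast to all N bots, so the total
%% message rate is N * N * A for a per-bot action rate A: doubling N
%% quadruples the load rather than doubling it.
rate(N, A) -> N * N * A.
```

On those assumptions, two machines of 500 bots against one server carry four times the message traffic of one machine of 500, not twice.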

Or is there some throttling mechanism you're not telling us about?
I.e. is there something which makes the bots' message generation rate
decrease as you increase the number of bots?

Matthias

Re: Troubleshooting a high-load scenario

Joel Reymont

On Jan 17, 2006, at 2:34 PM, Matthias Lang wrote:

> Joel's model of the problem:
>> Every bot gets notifications of other bots. So whenever 1 bot acts
>> everyone else gets notification. 2 bots would generate 2 messages for
>> every action, 10 bots would generate 10 messages, etc.

I should clarify... Bots only respond to bet requests. That's all  
they do.
The messages are generated by the poker server notifying other bots  
of changes
in game and server state.

> In test case #3, your bots have to deal with twice as many incoming
> messages per second as in test case #1. So the numbers you have chosen
> mean that your results don't allow you to conclude that the server is
> the problem.

I see what you are saying but...

> Or is there some throttling mechanism you're not telling us about?
> I.e. is there something which makes the bots' message generation rate
> decrease as you increase the number of bots?

I deal in tables. A table holds 10 bots. Only bots at this table get  
notified
when another bot takes action. All bots get notified when a bot joins  
a table
but that's once per test.

The throttling mechanism is that the tables running on machine A have  
nothing
to do with tables running on machine B unless bots from both machines  
join
a single table. Even then bots are limited to receiving notifications of
actions taken by bots sitting at the same table.
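If that per-table model holds, the totals are linear in the bot count; a sketch under those assumptions (TableSize and the per-bot action rate A are invented constants):

```erlang
-module(tables).
-export([rate/3]).

%% N bots split into tables of TableSize. Per table: TableSize actors, each
%% acting A times/sec, each action notifying TableSize bots, giving
%% TableSize * TableSize * A messages. With N / TableSize tables the total
%% is N * TableSize * A -- linear in N, unlike the broadcast model's N*N*A.
rate(N, TableSize, A) -> N * TableSize * A.
```

On those numbers, 1000 bots generate only twice the traffic of 500, which would make the doubling from test #1 to test #3 a fair comparison.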

Does this make my diagnosis correct?

        Joel

--
http://wagerlabs.com/






RE: Troubleshooting a high-load scenario

Ulf Wiger (TN/EAB)
In reply to this post by Joel Reymont

I stick to my recommendation.

As a rule, I don't think one gains much by
having messages stay in the socket buffer,
since you can't prioritize them, for example.

It's usually better to have an agent suck the
messages out of the buffer and get them as quickly
as possible to a point where they can be parsed,
tagged and prioritized. Once messages have been
identified and prioritized, effective load
control can take place.

For example, your bots respond to bet requests,
but could probably dispense with other messages
rather cheaply. The efficiency of the bots will
also increase if they can process more than one
message per time slice. The efficiency of the
socket reader also increases if it's allowed to
read and dispatch several messages at once.
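A sketch of that tag-and-prioritize step (the opcode and tags are invented; the real protocol is unknown): the reader classifies each packet before forwarding, and the bot can then serve urgent messages first with a selective receive.

```erlang
-module(prio).
-export([classify/1]).

%% Hypothetical classification: opcode 1 is a bet request the bot must
%% answer in time; everything else is a notification that can wait or,
%% under overload, be dropped.
classify(<<1, _/binary>> = Pkt)   -> {urgent, Pkt};
classify(Pkt) when is_binary(Pkt) -> {background, Pkt}.
```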

/Uffe

> -----Original Message-----
> From: Joel Reymont [mailto:[hidden email]]
> Sent: den 17 januari 2006 13:06
> To: Ulf Wiger (AL/EAB)
> Cc: Erlang Questions
> Subject: Re: Troubleshooting a high-load scenario
>
> The majority of the load originates with the server. It keeps
> sending moves by other players, table state updates, etc.
> etc. etc. How does that change your answer?
>
> On Jan 17, 2006, at 11:57 AM, Ulf Wiger (AL/EAB) wrote:
>
> >
> >> Are you suggesting this _with_ the socket reader waiting for the
> >> message to be processed before reading the next one?
> >
> > No. That seems a bit like reversed flow control anyway.
> > Do you expect the server to overload the bots?
> > Doesn't the majority of the load originate with the bots?
> >
> > /Uffe
>
> --
> http://wagerlabs.com/
>
>
>
>
>
>

RE: Troubleshooting a high-load scenario

Matthias Lang
Ulf Wiger (AL/EAB) writes:

 > As a rule, I don't think one gains much by
 > having messages stay in the socket buffer,
 > since you can't prioritize them, for example.

There is one important gain: TCP flow control, i.e. in protocols which
don't have some other means for flow control, you can make the remote
socket block by letting the buffer fill. Assuming you're not using
{active, true}.
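A self-contained loopback sketch of that: in passive mode the receiver throttles simply by not calling recv/2; data then queues in the kernel buffers until the sender's TCP window closes.

```erlang
-module(backpressure).
-export([demo/0]).

%% Passive-mode pair on the loopback interface. Nothing is delivered to the
%% Erlang side until recv/2 is called, so the receiver controls the pace
%% and TCP flow control pushes back on the sender for free.
demo() ->
    {ok, LSock}  = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port}   = inet:port(LSock),
    {ok, Client} = gen_tcp:connect("localhost", Port,
                                   [binary, {active, false}, {nodelay, true}]),
    {ok, Server} = gen_tcp:accept(LSock),
    ok = gen_tcp:send(Client, <<"ping">>),
    {ok, <<"ping">>} = gen_tcp:recv(Server, 4),  %% reading reopens the window
    [gen_tcp:close(S) || S <- [Client, Server, LSock]],
    ok.
```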

Matthias

RE: Troubleshooting a high-load scenario

Ulf Wiger (TN/EAB)
In reply to this post by Joel Reymont

This is true.

/U

> -----Original Message-----
> From: Matthias Lang [mailto:[hidden email]]
> Sent: den 17 januari 2006 17:36
> To: Ulf Wiger (AL/EAB)
> Cc: Erlang Questions
> Subject: RE: Troubleshooting a high-load scenario
>
> Ulf Wiger (AL/EAB) writes:
>
>  > As a rule, I don't think one gains much by
>  > having messages stay in the socket buffer,
>  > since you can't prioritize them, for example.
>
> There is one important gain: TCP flow control, i.e. in
> protocols which don't have some other means for flow control,
> you can make the remote socket block by letting the buffer
> fill. Assuming you're not using {active, true}.
>
> Matthias
>

Re: Troubleshooting a high-load scenario

J. Pablo Fernández
In reply to this post by Joel Reymont
On Tuesday 17 January 2006 10:50, Joel Reymont wrote:
> > Use tcpdump to see when the messages actually arrives.
>
> I'll try this indeed.

ethereal is another great tool to debug network applications.
--
Pupeno <[hidden email]> (http://pupeno.com)
We sell: Getting to know and collecting coins and banknotes from around the world:
http://pupeno.com/vendo/#monedas
