gen_udp:recv()

Avinash Dhumane-5

Hello!

 

I am taking a short-cut to inquire about this, by posting to you all (rather than going through the relevant documentation).

 

Is gen_udp:recv() to be invoked per datagram, or can it return a binary of multiple datagrams concatenated?

 

I am asking this because recv() takes a Length argument, and I don’t understand why it is there. Should it specify the length of the expected datagram? What if the datagrams are of variable length? Should it, then, specify the maximum length?

 

The default receive buffer size is 8K bytes. If the Length argument to recv() is set to this value, will it still return one datagram, or multiple?

 

In contrast, in gen_tcp:recv() the Length argument can be specified as zero, in which case all available bytes are returned as a binary (as per the documentation). Does gen_udp:recv() do something similar – return all the datagrams that have arrived (if Length is specified as zero)?

 

My need for the socket is {active, false}. I had tried with {active, true} in the past, and got the datagrams as individual messages in the inbox of my process. So, I assume that there will be one call to recv() per datagram. I do not have the setup to test {active, false}; hence, I am not sure about the behaviour of (and rationale for) the Length argument to recv().
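
In sketch form, this is the passive-mode shape I have in mind (the port, the timeout and the handler are made up for illustration):

-module(udp_passive).
-export([start/0]).

%% Sketch of a passive-mode receiver; port 5000, the 5-second timeout
%% and handle_packet/3 are placeholders.
start() ->
    {ok, Socket} = gen_udp:open(5000, [binary, {active, false}]),
    loop(Socket).

loop(Socket) ->
    %% Length is 0 here; my question is what this argument controls.
    case gen_udp:recv(Socket, 0, 5000) of
        {ok, {Address, Port, Packet}} ->
            handle_packet(Address, Port, Packet),
            loop(Socket);
        {error, timeout} ->
            loop(Socket)
    end.

handle_packet(_Address, _Port, Packet) ->
    io:format("got ~p bytes~n", [byte_size(Packet)]).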

 

Please!

 

Thanks

Avinash

 


Re: gen_udp:recv()

zxq9-2
Hi Avinash,

On 2021/04/05 22:42, Avinash Dhumane wrote:
> I am taking a short-cut to inquire about this, by posting to you all
> (rather than going through the relevant documentation).

Oh! The humanity!

> Is gen_udp:recv() to be invoked per datagram, or can it return a binary
> of multiple datagrams concatenated?

It returns a single datagram, with some metadata about the transmission
(port, address, etc.).

> I am asking this because recv() takes a Length argument, and I don’t
> understand why it is there. Should it specify the length of the expected
> datagram? What if the datagrams are of variable length? Should it, then,
> specify the maximum length?

This goes back to memory allocation issues when you are allocating
memory manually. The maximum size you want to receive might be based on
the size of the buffer you're willing to set aside for the incoming
data -- not the kind of thing people have to contemplate as carefully as
they did in earlier eras of networking.

There are some other lower level UDP dials and knobs that allow you to
peek at the size and UDP headers and whatnot to figure out what a
reasonable balance between getting an entire datagram, being safe, and
not being wasteful might be -- and typically you know what protocol
you're working with.

Anyway, in Erlang this is pretty much irrelevant. The common way to
receive UDP traffic (and TCP traffic, for that matter) is to leave all
the low level stuff to the runtime and inside your code just receive it
as an Erlang message and not deal with gen_udp:recv/2,3 at all.
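
A minimal sketch of that pattern (the port number is arbitrary):

-module(udp_active).
-export([start/0]).

%% Each datagram arrives as one {udp, ...} message in the mailbox;
%% there is no Length argument to think about at all.
start() ->
    {ok, Socket} = gen_udp:open(5000, [binary, {active, true}]),
    loop(Socket).

loop(Socket) ->
    receive
        {udp, Socket, Address, Port, Packet} ->
            io:format("~p:~p sent ~p bytes~n",
                      [Address, Port, byte_size(Packet)]),
            loop(Socket)
    end.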

-Craig

Re: gen_udp:recv()

zxq9-2
On 2021/04/06 23:14, Łukasz Niemier wrote:
>> Anyway, in Erlang this is pretty much irrelevant. The common way to receive UDP traffic (and TCP traffic, for that matter) is to leave all the low level stuff to the runtime and inside your code just receive it as an Erlang message and not deal with gen_udp:recv/2,3 at all.
>
> Fortunately the new `socket` module has a `socket:recv/1` function that fetches all the data that is in the socket right now. So it should be perfectly possible to add a `gen_udp:recv/1` that would receive a whole packet at once. Also, calling `gen_udp:recv(Socket, 1500)` should be safe enough, as a single UDP packet cannot be larger than the MTU.

Indeed. However, I can't think of any cases where I would want to mess
with that rather than receive packets as messages.

-Craig

Re: gen_udp:recv()

Max Lapshin-2
Well, I like the idea of receiving 20-200 UDP messages in a single
binary, with a separate structure holding the mapping of this data.

For us it may help to remove a special driver that does the same.

Re: gen_udp:recv()

Max Lapshin-2
Exactly.

We receive an MPEG-TS _stream_ in multicast messages and it is very
painful to receive so many messages (it can be around 90K per second in
total, or 300-900 per stream per second) one by one. Reducing this by
about 10x decreases CPU usage a lot.




On Tue, Apr 6, 2021 at 6:31 PM Łukasz Niemier <[hidden email]> wrote:

>
> > Well, I like the idea to receive 20-200 UDP messages in a single
> > binary with separate structure holding mapping of this data.
>
> Then it sounds like you want a stream protocol instead of a packet protocol.
> The whole idea of a packet protocol is to receive messages in packets.
>
> --
>
> Łukasz Niemier
> [hidden email]
>

RE: gen_udp:recv()

Avinash Dhumane-5

Our case is similar. The stock exchange disseminates market movements (called ticks) as a stream of variable-length messages (datagrams or packets), each less than 50 bytes and each stamped with a sequence number, over multicast (UDP) streams. Each stream is replicated so that if a packet is missed (dropped) – due to the nature of the underlying UDP – it may be received on the replicated stream (before falling back on a TCP-based recovery server).

 

The message rate (without replication) exceeds 100K per second. Issuing as many calls at the Erlang-application level, viz. gen_udp:recv, is impractical, though I suppose that at the underlying OS system-call level each datagram must be received using a separate call. Therefore, I believed that Erlang might provide an abstraction which packs a number of datagrams (packets/messages) that have arrived so far and delivers them to the application for further handling. A one-to-one call from app to Erlang, and further from Erlang to the OS, for each datagram would seem too taxing.

 

 


Re: gen_udp:recv()

Jesper Louis Andersen-2
In reply to this post by Avinash Dhumane-5
On Mon, Apr 5, 2021 at 4:21 PM Avinash Dhumane <[hidden email]> wrote:

> My need for the socket is {active, false}. I had tried with {active, true} in the past, and got the datagrams as individual messages in the inbox of my process. So, I assume that there will be one call to recv() per datagram. I do not have the setup to test {active, false}; hence, I am not sure about the behaviour of (and rationale for) the Length argument to recv().


It's due to the argument being there in the low-level socket handling (in C). You need to specify the size of the buffer to write the result into. It matters somewhat. While Ethernet normally specifies a limit of 1500 bytes (minus header, minus tagging, ...), a local loopback interface can easily be 16k, 64k, or more. And if you are considering a high count of receivers, it might be wise to limit the maximal size you are willing to receive.

As for the overhead: if you have many small messages and need to have a high processing rate, I'm going to suggest you either try {active, N} or go the route Max laid out by moving the processing loop into C to get it fast enough. You can work backwards: 100k messages per second means you have to process messages at a rate of one per 10 microseconds. At those rates, it isn't uncommon to need some kind of micro-batching to keep up.
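
For illustration, an {active, N} loop might look like this (the port, the value of N and the handler are invented):

-module(udp_active_n).
-export([start/0]).

%% The socket delivers up to N datagrams as messages, then sends
%% {udp_passive, Socket} so the owner can re-arm it -- a natural
%% flow-control point.
start() ->
    {ok, Socket} = gen_udp:open(5000, [binary, {active, 100}]),
    loop(Socket).

loop(Socket) ->
    receive
        {udp, Socket, _Address, _Port, Packet} ->
            handle(Packet),
            loop(Socket);
        {udp_passive, Socket} ->
            ok = inet:setopts(Socket, [{active, 100}]),
            loop(Socket)
    end.

handle(_Packet) ->
    ok.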


--
J.

Re: gen_udp:recv()

Andreas Schultz-3
In reply to this post by Avinash Dhumane-5
On Linux and BSD, the recvmmsg system call can read multiple packets at once. Adding it to gen_udp might be impractical, but the new socket module would be a good place for a function that uses that system call.



--

Andreas Schultz


Re: gen_udp:recv()

Max Lapshin-2
In reply to this post by Avinash Dhumane-5
I suppose that you really need the same thing as me.

You need to receive _one_ big binary with an extra field containing
message boundaries and other meta info.
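
If such an API existed, the consuming side might look something like this (entirely hypothetical -- no such gen_udp call exists today):

-module(packed_udp).
-export([handle_packed/2]).

%% Hypothetical: recv returns one big binary plus a list of
%% {Offset, Length} boundaries describing the datagrams inside it.
handle_packed(Blob, Boundaries) ->
    [handle_datagram(binary:part(Blob, Offset, Length))
     || {Offset, Length} <- Boundaries].

handle_datagram(Datagram) ->
    %% Placeholder for the real per-message work.
    byte_size(Datagram).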


RE: gen_udp:recv()

Avinash Dhumane-5

Indeed!

 

As Jesper too pointed out, sub-10-microsecond turnaround times seem impossible to achieve in Erlang; the best I got is 60 microseconds, and the average is on the order of hundreds of microseconds (close to a millisecond).

 

I am taking the “tick-to-order” processing (in stock-market parlance) out of Erlang and into Rust to meet this ultra-low latency requirement.

 

Thanks all for confirming the overall approach.

 

 


Re: gen_udp:recv()

zxq9-2
In reply to this post by Avinash Dhumane-5
On 2021/04/07 19:25, Avinash Dhumane wrote:
> would seem too taxing.

TRY IT or you'll never know.

-Craig

Re: gen_udp:recv()

Raimo Niskanen-11
In reply to this post by Jesper Louis Andersen-2
On Wed, Apr 07, 2021 at 02:11:59PM +0200, Jesper Louis Andersen wrote:

> As for the overhead: if you have many small messages and need to have a
> high processing rate, I'm going to suggest you either try {active, N} or go
> the route Max laid out by moving the processing loop into C to get it fast
> enough. You can work backwards: 100k messages per second means you have to
> process messages at a rate of one per 10 microseconds. At those rates, it
> isn't uncommon you need some kind of microbatching to keep up.

There is also an option, {read_packets,N}, that is useful to combine with
{active,N} or {active,true}.  It sets how many datagrams to read in a tight
loop before leaving the inet_drv ready-for-input callback.  The default is 5.
If you set it too high you can flood the VM with more datagrams than
it can manage.
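
For example (the numbers are examples only and need tuning against the real load):

%% read_packets bounds the tight read loop in the driver;
%% recbuf sizes the kernel-side receive buffer.
{ok, Socket} = gen_udp:open(5000, [binary,
                                   {active, 100},
                                   {read_packets, 100},
                                   {recbuf, 1048576}]).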

/ Raimo Niskanen



--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Re: gen_udp:recv()

Jesper Louis Andersen-2
In reply to this post by Jesper Louis Andersen-2
On Thu, Apr 8, 2021 at 9:51 PM [hidden email] <[hidden email]> wrote:

> Presuming that by “micro-batching” you mean more than one message being delivered at the same time, I don’t think that would be the only way to accomplish rates of 100k messages per second.
> In fact, in one of our projects we have accomplished rates of above 200k UDP messages per second per Erlang node (using Erlang’s gen_udp module).


It just means that you avoid processing messages one-by-one, and instead set up a short interval (5ms, say) and process in batches within that interval. In practice, this will "feel" instant in the system, but it amortizes some back-and-forth overhead over multiple messages, reducing it. In a certain sense, the kernel buffer on a UDP connection is already doing micro-batching when it delivers messages up to the VM. The actual number of messages you can process is somewhat dependent on the hardware configuration as well, and probably also on kernel tuning at these rates.
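
A rough sketch of that idea in plain Erlang (the 5 ms window and the 1000-message cap are invented numbers; the socket is assumed to be active and owned by this process):

-module(udp_batch).
-export([batch_loop/1]).

%% Drain whatever has arrived within a short window, then process the
%% accumulated datagrams in one go to amortize per-message overhead.
batch_loop(Socket) ->
    Batch = collect(Socket, 1000, []),
    process_batch(lists:reverse(Batch)),
    batch_loop(Socket).

collect(_Socket, 0, Acc) ->
    Acc;
collect(Socket, N, Acc) ->
    receive
        {udp, Socket, _Addr, _Port, Packet} ->
            collect(Socket, N - 1, [Packet | Acc])
    after 5 ->
        %% Nothing more within 5 ms; hand over what we have.
        Acc
    end.

process_batch(Packets) ->
    %% Placeholder for the real batch processing.
    length(Packets).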


> I am not sure what process_flag(message_queue_data, off_heap) actually does. I mean, I do understand that a particular process queue is allocated from some "private space", but I am not sure how such memory is managed.
>
> It would be great if anyone can shed some light on this.


In normal operation, the message queue is part of the process heap[0]. So when it's time to do a GC run, all of the messages on the heap are also scanned. This takes time proportional to the size of the message queue, especially if the message queue is large and the rest of the heap is small. But note that a message arriving in the queue can't point to other data in the heap. This leads to the idea of storing the messages off-heap, in their own private space. This then improves GC times because we don't have to traverse data in the queue anymore, and it's safe because the lifetime analysis is that messages in the queue can't keep heap data alive.

On the flip side, when you have messages off_heap, sending messages to the process is slightly slower. This has to do with the optimizations and locking pertaining to message passing. The normal way is that you allocate temporary space for the messages from the sending process and then hook these small allocations into the message queue of the receiving process. They are "temporary" private off-heap spaces, which don't require holding a lock for long, and don't require a lock on the memory area of the receiving process heap at all. Messages are then "internalized" to the process, and the next garbage collection will move the messages into the main heap, so we can free up all the smaller memory spaces. With off_heap, we need to manage the off-heap area, and this requires some extra locking, which can potentially conflict more, slowing down message delivery.
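
Setting it is a one-liner in the process that owns the busy mailbox (the node-wide default can also be changed with the +hmqd emulator flag):

%% Run in the receiving process, ideally before the queue gets hot.
process_flag(message_queue_data, off_heap).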

--
J.

Re: gen_udp:recv()

Jesper Louis Andersen-2
In reply to this post by Avinash Dhumane-5
On Wed, Apr 7, 2021 at 3:50 PM Avinash Dhumane <[hidden email]> wrote:

 

> I am taking the “tick-to-order” processing (in stock-market parlance) out of Erlang and into Rust to meet this ultra-low latency requirement.


I know Jane Street has a separate OCaml ingester which handles message feeds of this form and transforms them into a more digestible format for the rest of their systems. The key point is to demux the stream quickly and send the relevant information to another CPU core for further processing. It also cuts down traffic in the case where you can throw away a lot of messages from the feed you aren't interested in. At 10us per message, it isn't as if your general processing budget is very high. However, if you have a larger machine, the pure Erlang solution is to use something like {active, N} or {active, true} and then move messages into other processes as fast as possible, as Valentin suggests. To me, 60-200us per UDP message sounds way too high, and my hunch is that this is due to configuration.
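
A sketch of that demux shape (the 16-bit stream id at the front of each datagram is hypothetical; Workers is assumed to be a map from 0..K-1 to worker pids):

-module(udp_demux).
-export([loop/2]).

%% The receiver does the bare minimum -- pull a routing key out of the
%% datagram -- and ships the packet to a worker process, ideally running
%% on another core, for the heavy per-message work.
loop(Socket, Workers) when map_size(Workers) > 0 ->
    receive
        {udp, Socket, _Addr, _Port, <<StreamId:16, _/binary>> = Packet} ->
            Worker = maps:get(StreamId rem map_size(Workers), Workers),
            Worker ! {tick, Packet},
            loop(Socket, Workers)
    end.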



RE: gen_udp:recv()

Avinash Dhumane-5

The “tick-to-order” pipeline indeed needs to be vertically partitioned so that the ticks (market movements) do not sit in the queue waiting to be attended to. More so because the ticks arrive in bursts; the next microsecond might bring no tick, or just one tick, or ten, or a hundred, or might even bring eight hundred. If we pick them up off the wire packet-by-packet (i.e., tick-by-tick) and process them sequentially, not only will we lose many of them, but the last tick (of the burst) will lose its value by the time it gets attention.

 

Each tick to be attended to (and not skipped) undergoes intense computation before its culmination in an order (buy and/or sell) that tries to materialise the opportunity envisioned by the trader. This is what High-Frequency Algorithmic Trading is all about, where every microsecond gained matters. But in pure Erlang programs, especially when the processing is vertically partitioned (as opposed to horizontally, where we give the entire application to a single “request” or single “client” and spawn as many instances as there are requests or clients), so that in the vertical case we hop from one process to the next, the latency until the order is fired is totally unpredictable – that’s why the range is from 60 microseconds to several hundred microseconds.

 

I had even tried collocating the entire computation (from receiving the tick to firing the order) in a single Erlang process just to see how fast it could go; but the bare-metal performance of C is impossible to match (let alone beat). Then again, programming and debugging in C/C++ is naturally expensive; safe Rust at least takes many worries off one's head. Erlang programming is still my preferred choice “for understanding” the data structures and control structures required for the application.

 

 


Re: gen_udp:recv()

zxq9-2
On 2021/04/10 15:47, Avinash Dhumane wrote:
> The “tick-to-order” pipeline indeed needs to be /vertically/ partitioned so
> that the ticks (market movements) do not sit in the queue waiting to be
> attended to. More so because the ticks arrive in bursts; the next
> microsecond might bring no tick, or just one tick, or ten, or a hundred,
> or might even bring eight hundred.

Why are the ticks not partitioned by type? Is it only possible to
receive ALL ticks available in the market at a single address:port? This
seems... very weird and not what I've seen in HFT, at least.

-Craig

RE: gen_udp:recv()

Avinash Dhumane-5

The market is indeed partitioned into several multicast streams, and each stream is replicated to cover for tick loss, backed up by a TCP-based recovery server.

 
