Packets deduplication

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Packets deduplication

Alexander Petrovsky-2
Hi!

I have the stream of packets with ID (int), and I need to check is the packet is uniq (by ID) or not?

Incoming rate is about 20k pps and ID is monotonically grows. What's the best way and data structure fit for this problem?

--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Danil Zagoskin-2
Hi!

If ID grows monotinically and you have no plans of recovering after packet reordering, then you can just keep the previous ID.
  - If CurID > PrevID, CurID is unique;
  - If CurID == PrevID, it is not unique;
  - If CurID < PrevID, it is a bug or reordering, let it crash.


On Thu, Feb 18, 2016 at 3:01 PM, Alexander Petrovsky <[hidden email]> wrote:
Hi!

I have the stream of packets with ID (int), and I need to check is the packet is uniq (by ID) or not?

Incoming rate is about 20k pps and ID is monotonically grows. What's the best way and data structure fit for this problem?

--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
Danil Zagoskin | [hidden email]

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Alexander Petrovsky-2
Ouch, I forgot to say, the IDs can be sparse, so they not really monotonically, they just grow, and can be reordered.

2016-02-18 16:20 GMT+03:00 Danil Zagoskin <[hidden email]>:
Hi!

If ID grows monotinically and you have no plans of recovering after packet reordering, then you can just keep the previous ID.
  - If CurID > PrevID, CurID is unique;
  - If CurID == PrevID, it is not unique;
  - If CurID < PrevID, it is a bug or reordering, let it crash.


On Thu, Feb 18, 2016 at 3:01 PM, Alexander Petrovsky <[hidden email]> wrote:
Hi!

I have the stream of packets with ID (int), and I need to check is the packet is uniq (by ID) or not?

Incoming rate is about 20k pps and ID is monotonically grows. What's the best way and data structure fit for this problem?

--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
Danil Zagoskin | [hidden email]



--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Jesper Louis Andersen-2
A simple solution that may work in the end is to keep track of a window of bits:

* shift the window as the ID increases
* check older packets against the window

This gives you "SACK" support, is fast to check against and fast to expire old data due to the shifting.

Simply using an ordered list or a map could also work if you clean it up eventually. It depends on how fast your CPU core is.


On Thu, Feb 18, 2016 at 2:24 PM, Alexander Petrovsky <[hidden email]> wrote:
Ouch, I forgot to say, the IDs can be sparse, so they not really monotonically, they just grow, and can be reordered.

2016-02-18 16:20 GMT+03:00 Danil Zagoskin <[hidden email]>:
Hi!

If ID grows monotinically and you have no plans of recovering after packet reordering, then you can just keep the previous ID.
  - If CurID > PrevID, CurID is unique;
  - If CurID == PrevID, it is not unique;
  - If CurID < PrevID, it is a bug or reordering, let it crash.


On Thu, Feb 18, 2016 at 3:01 PM, Alexander Petrovsky <[hidden email]> wrote:
Hi!

I have the stream of packets with ID (int), and I need to check is the packet is uniq (by ID) or not?

Incoming rate is about 20k pps and ID is monotonically grows. What's the best way and data structure fit for this problem?

--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
Danil Zagoskin | [hidden email]



--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
J.

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Alexander Petrovsky-2
Yep, right now I try to do something similar with bitmap and sliding window, that will shift. Could you explain about maps?

Thanks!

2016-02-18 16:37 GMT+03:00 Jesper Louis Andersen <[hidden email]>:
A simple solution that may work in the end is to keep track of a window of bits:

* shift the window as the ID increases
* check older packets against the window

This gives you "SACK" support, is fast to check against and fast to expire old data due to the shifting.

Simply using an ordered list or a map could also work if you clean it up eventually. It depends on how fast your CPU core is.


On Thu, Feb 18, 2016 at 2:24 PM, Alexander Petrovsky <[hidden email]> wrote:
Ouch, I forgot to say, the IDs can be sparse, so they not really monotonically, they just grow, and can be reordered.

2016-02-18 16:20 GMT+03:00 Danil Zagoskin <[hidden email]>:
Hi!

If ID grows monotinically and you have no plans of recovering after packet reordering, then you can just keep the previous ID.
  - If CurID > PrevID, CurID is unique;
  - If CurID == PrevID, it is not unique;
  - If CurID < PrevID, it is a bug or reordering, let it crash.


On Thu, Feb 18, 2016 at 3:01 PM, Alexander Petrovsky <[hidden email]> wrote:
Hi!

I have the stream of packets with ID (int), and I need to check is the packet is uniq (by ID) or not?

Incoming rate is about 20k pps and ID is monotonically grows. What's the best way and data structure fit for this problem?

--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
Danil Zagoskin | [hidden email]



--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
J.



--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Jesper Louis Andersen-2
Keep a map, M :: #{ integer() => bool() } and ask if there is_key/2 on the ID's. Once in a while, you GC the map through maps:with(lists:seq(ID1, ID2), M)  and make it small again.

You would have to measure if this is actually faster/slower and by how much it is a slower solution. It isn't a priori given it would be either. jlouis/eministat will help determine if it is the case.


On Thu, Feb 18, 2016 at 2:43 PM, Alexander Petrovsky <[hidden email]> wrote:
Yep, right now I try to do something similar with bitmap and sliding window, that will shift. Could you explain about maps?

Thanks!

2016-02-18 16:37 GMT+03:00 Jesper Louis Andersen <[hidden email]>:
A simple solution that may work in the end is to keep track of a window of bits:

* shift the window as the ID increases
* check older packets against the window

This gives you "SACK" support, is fast to check against and fast to expire old data due to the shifting.

Simply using an ordered list or a map could also work if you clean it up eventually. It depends on how fast your CPU core is.


On Thu, Feb 18, 2016 at 2:24 PM, Alexander Petrovsky <[hidden email]> wrote:
Ouch, I forgot to say, the IDs can be sparse, so they not really monotonically, they just grow, and can be reordered.

2016-02-18 16:20 GMT+03:00 Danil Zagoskin <[hidden email]>:
Hi!

If ID grows monotinically and you have no plans of recovering after packet reordering, then you can just keep the previous ID.
  - If CurID > PrevID, CurID is unique;
  - If CurID == PrevID, it is not unique;
  - If CurID < PrevID, it is a bug or reordering, let it crash.


On Thu, Feb 18, 2016 at 3:01 PM, Alexander Petrovsky <[hidden email]> wrote:
Hi!

I have the stream of packets with ID (int), and I need to check is the packet is uniq (by ID) or not?

Incoming rate is about 20k pps and ID is monotonically grows. What's the best way and data structure fit for this problem?

--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
Danil Zagoskin | [hidden email]



--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
J.



--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815




--
J.

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Felix Gallo-2
Alexander -- what's the application supposed to do if it receives some packets in a non-monotonic order?  If the answer is "ignore", then you could just check that the packet ID is greater than the last packet ID and everything else just falls out naturally.


If you still need to process out-of-order packets, then consider each packet's "has been seen" value as a single bit.  There is some packet count window C beyond which duplicates are not possible.  Maintain a queue of bytes totalling at least C bits, as well as the id number at which the byte queue starts.  When a new packet comes in, if it would require using one or more new bytes, push those to the front and drop the equivalent number off the back, and update the low byte queue id.  Create a 'test byte' which has only the new bit set in the appropriate position for that byte.  use binary-and ("band") to check to see if that bit has been seen in the appropriate byte in the byte queue; if the resulting byte is greater than zero, the bit has already been seen; do whatever processing is necessary.  Otherwise, the bit has not been seen; set the bit in the byte queue, process, and continue.

This is probably fastest in C/C++ but Erlang should be fine as long as the sparsity factor between packet ids is not generally gigantic (i.e. it's usually 1, rarely 100).  If it's gigantic, then I'd probably use a queue of small pooled bloom filters in the same way so that I wouldn't have to do as much allocation.  As always, measure and compare.

F.

On Thu, Feb 18, 2016 at 6:58 AM, Jesper Louis Andersen <[hidden email]> wrote:
Keep a map, M :: #{ integer() => bool() } and ask if there is_key/2 on the ID's. Once in a while, you GC the map through maps:with(lists:seq(ID1, ID2), M)  and make it small again.

You would have to measure if this is actually faster/slower and by how much it is a slower solution. It isn't a priori given it would be either. jlouis/eministat will help determine if it is the case.


On Thu, Feb 18, 2016 at 2:43 PM, Alexander Petrovsky <[hidden email]> wrote:
Yep, right now I try to do something similar with bitmap and sliding window, that will shift. Could you explain about maps?

Thanks!

2016-02-18 16:37 GMT+03:00 Jesper Louis Andersen <[hidden email]>:
A simple solution that may work in the end is to keep track of a window of bits:

* shift the window as the ID increases
* check older packets against the window

This gives you "SACK" support, is fast to check against and fast to expire old data due to the shifting.

Simply using an ordered list or a map could also work if you clean it up eventually. It depends on how fast your CPU core is.


On Thu, Feb 18, 2016 at 2:24 PM, Alexander Petrovsky <[hidden email]> wrote:
Ouch, I forgot to say, the IDs can be sparse, so they not really monotonically, they just grow, and can be reordered.

2016-02-18 16:20 GMT+03:00 Danil Zagoskin <[hidden email]>:
Hi!

If ID grows monotinically and you have no plans of recovering after packet reordering, then you can just keep the previous ID.
  - If CurID > PrevID, CurID is unique;
  - If CurID == PrevID, it is not unique;
  - If CurID < PrevID, it is a bug or reordering, let it crash.


On Thu, Feb 18, 2016 at 3:01 PM, Alexander Petrovsky <[hidden email]> wrote:
Hi!

I have the stream of packets with ID (int), and I need to check is the packet is uniq (by ID) or not?

Incoming rate is about 20k pps and ID is monotonically grows. What's the best way and data structure fit for this problem?

--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
Danil Zagoskin | [hidden email]



--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




--
J.



--
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Phone: +7 914 8 820 815




--
J.

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

dmkolesnikov
In reply to this post by Alexander Petrovsky-2
Hello Alexandr,

The right data structure is either bloom filter or scalable bloom filter.

I've played with standard bloom here
https://github.com/fogfish/feta/blob/master/src/bloom.erl

20k might require you to do tuning and reflect the filter to ETS.

The ETS as such or any cache based on ETS might help you as well. Cache is needed to implement TTL for your ID's

Best Regards,
Dmitry

Sent from my iPhone

> On 18 Feb 2016, at 14:01, Alexander Petrovsky <[hidden email]> wrote:
>
> Hi!
>
> I have the stream of packets with ID (int), and I need to check is the packet is uniq (by ID) or not?
>
> Incoming rate is about 20k pps and ID is monotonically grows. What's the best way and data structure fit for this problem?
>
> --
> Петровский Александр / Alexander Petrovsky,
>
> Skype: askjuise
> Phone: +7 914 8 820 815
>
> _______________________________________________
> erlang-questions mailing list
> [hidden email]
> http://erlang.org/mailman/listinfo/erlang-questions
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Jesper Louis Andersen-2

On Thu, Feb 18, 2016 at 7:28 PM, <[hidden email]> wrote:
The right data structure is either bloom filter or scalable bloom filter.

You also need to delete elements, which a cuckoo filter can do.

The problem with bloom filters in this situation is the probability of error as far as I see. They either say NOPE or MAYBE??? The other problem is that it probably wont beat integer operations on 60 bit words since it requires a hash.


--
J.

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Felix Gallo-2
you'd use the bloom filter first to determine if it's possible that you've seen it before, and then if you get a hit, just do a lookup into some other structure.  Depending on your hash functions and the sparsity, it could be significantly faster; consider, for example, if the sequence is

1,2,3,4,2^1024,2^1024+1,2^1024+2...

just about anything except a bloom filter is going to have possibly-interesting behavior when the 2^1024 shows up.  The bloom filter may give you some false positives and will generally be slower, but if you're trying to maintain a packet train for, like, remote medical surgery, you might accept the extra space and the slightly higher average latency for less jitter.

On Thu, Feb 18, 2016 at 10:45 AM, Jesper Louis Andersen <[hidden email]> wrote:

On Thu, Feb 18, 2016 at 7:28 PM, <[hidden email]> wrote:
The right data structure is either bloom filter or scalable bloom filter.

You also need to delete elements, which a cuckoo filter can do.

The problem with bloom filters in this situation is the probability of error as far as I see. They either say NOPE or MAYBE??? The other problem is that it probably wont beat integer operations on 60 bit words since it requires a hash.


--
J.

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Max Lapshin-2
Perhaps structure for keeping fetched packets can look like:

WindowBeginning

then list of binaries of N bytes each, which are bitmaps. Some of binaries can be replaced by atom 'true' if all bits are set

and WindowEnd


My question here is: when to send retransmit request?  I suppose that this question is very close to TCP problems, so it cannot have a simple answer, but maybe you have some good enough answer for you?

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Packets deduplication

Alexander Petrovsky-2
In reply to this post by Felix Gallo-2
Right now I'm using something like bitmaps as Felix suggested, drop older bits and check uniques with band, and yes, I need to process reordered packets. I think bloom filter is not good solution here due the very-very old ID's can't be deleted and memory will grow, and looks like latency will grow too. But I will look at scalable bloom filter, it may be suitable.

Thanks!

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions