R18 Unbounded SSL Session ETS Table Growth

Ben Murphy-2
I've seen in production that the ssl_session_cache ETS table can
become very large, at which point new SSL connections start to
take > 5 seconds to establish. The root cause is that multiple
SSL sessions are stored for a particular SSL connection configuration
even though only 1 (the most recent) is needed.

The ETS table is keyed by {Host, Port, SessionID}, but there are a bunch
of other parameters that need to match for a session to be resumed; for
example, the client certificate and compression algorithm also need to
match. So what the current code does is create a new entry in the
table for each connection (even if session reuse is not enabled!!), and
then when you create a new connection it iterates through all the
sessions stored for that {Host,Port} and checks that the other
parameters match.
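
A minimal sketch of the lookup pattern described above, using a toy table rather than the real ssl_session_cache code. The module name, the nested key layout, and the map standing in for the #session{} record are all assumptions made for illustration.

```erlang
-module(session_scan_sketch).
-export([demo/0]).

%% Toy stand-in for the client session table. Hypothetical layout:
%% key {{Host, Port}, SessionId}, value a map (not the real #session{} record).
demo() ->
    Tab = ets:new(toy_session_cache, [ordered_set, protected]),
    HostPort = {"example.com", 443},
    Params = #{compression => 0, cert => undefined},
    %% One entry is added per connection, all sharing the same {Host, Port}.
    [ets:insert(Tab, {{HostPort, <<N>>}, Params}) || N <- lists:seq(1, 5)],
    %% An unrelated host that the lookup must not return.
    ets:insert(Tab, {{{"other.example", 443}, <<1>>}, Params}),
    %% Fetch every candidate stored for this {Host, Port} ...
    Candidates = ets:match(Tab, {{HostPort, '$1'}, '$2'}),
    %% ... then check the remaining parameters one candidate at a time.
    Matching = [Id || [Id, Session] <- Candidates, Session =:= Params],
    ets:delete(Tab),
    {length(Candidates), Matching}.
```

Every stored entry for the host is visited on each new connection, so the cost of a lookup grows with the number of entries kept for that {Host,Port}.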

Looking at the code, sessions are only removed from this table when their
lifetime is reached (configurable, but defaults to 24 hours) or
if a FATAL error happens on a connection with that ID.

In the pathological case where a server supplies a session ID but
never supports resuming it, the session table grows at
the same rate as new connections are established. This makes
establishing N connections take O(N^2) work. Also, when
{reuse_sessions, false} has been supplied the session should not be
added to the table at all; as it stands, a new entry is added
every time and is only removed after 24 hours.
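
The O(N^2) figure can be checked with a small counting sketch. It assumes, hypothetically, that no session is ever resumed or removed, so connection number K has to examine the K - 1 entries stored before it. The module and function names are made up for this example.

```erlang
-module(session_growth_sketch).
-export([comparisons/1]).

%% Total candidate checks performed while establishing N connections,
%% under the assumption that connection K scans K - 1 stored entries.
comparisons(N) when is_integer(N), N >= 0 ->
    lists:sum([K - 1 || K <- lists:seq(1, N)]).
```

The sum is N*(N-1)/2, so comparisons(1000) comes to 499500 checks for only 1000 connections.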

We've witnessed the catastrophic slowdown occur when making
requests against a server that normally resumes sessions properly. I
suspect this is because a) it started failing to resume sessions
because of some failure on their side or b) its session lifetime was
considerably less than 24 hours and Erlang started to try to resume
expired sessions and continuously created new sessions because of this.
I think it is also important to note that while the
register_unique_session fix would fix the memory leak if it worked, in
this situation it would cause a new session to be created each time
and make SSL session caching pointless until the Erlang session
expired. I think it would be preferable to create a new session and
delete the old one to preserve the uniqueness, but I'm not sure how
this could be done in ETS without creating a race that would generate
multiple sessions. The other alternative would be to delete sessions
that are known not to resume. For example, if you try to resume a
session and the server no longer knows about it, the client can tell,
because it has to go through the whole handshake.
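
A rough sketch of the "delete sessions that are known not to resume" idea. This is not OTP code: handshake_done/4, the key layout, and the atoms used as session values are hypothetical. It relies on the fact that a server which resumes a session echoes the offered session ID, while a different ID in the reply means a full handshake took place.

```erlang
-module(session_replace_sketch).
-export([handshake_done/4, demo/0]).

%% OfferedId is the ID the client tried to resume (or undefined);
%% the last argument is {ReturnedId, Session} from the server's reply.
handshake_done(_Tab, _HostPort, Id, {Id, _Session}) ->
    %% Same ID came back: the session was resumed, nothing to store.
    resumed;
handshake_done(Tab, HostPort, undefined, {ReturnedId, Session}) ->
    %% Nothing was offered: simply remember the new session.
    ets:insert(Tab, {{HostPort, ReturnedId}, Session}),
    new;
handshake_done(Tab, HostPort, OfferedId, {ReturnedId, Session}) ->
    %% Resumption was refused: drop the stale entry, keep the new one.
    ets:delete(Tab, {HostPort, OfferedId}),
    ets:insert(Tab, {{HostPort, ReturnedId}, Session}),
    replaced.

demo() ->
    Tab = ets:new(toy_session_cache, [ordered_set, protected]),
    HostPort = {"example.com", 443},
    new = handshake_done(Tab, HostPort, undefined, {<<"a">>, s1}),
    resumed = handshake_done(Tab, HostPort, <<"a">>, {<<"a">>, s1}),
    replaced = handshake_done(Tab, HostPort, <<"a">>, {<<"b">>, s2}),
    Entries = ets:tab2list(Tab),
    ets:delete(Tab),
    Entries.
```

The delete and insert are two separate ETS operations, so this only avoids duplicates if a single owner process performs both. It does not settle the race concern raised above for concurrent connections to the same host.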

I think this was meant to be fixed by the register_unique_session
function, but the fix does not work because it assumes the return value
of select_session is [#session{}] when it is really [ [binary(),
#session{}] ].
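
To illustrate the shape mismatch, here is a small sketch that assumes the lookup is built on something like ets:match/2, which returns one list of bindings per matching object. The table, key layout, and the atom session_a standing in for a #session{} record are hypothetical.

```erlang
-module(select_shape_sketch).
-export([demo/0]).

demo() ->
    Tab = ets:new(toy_session_cache, [ordered_set, protected]),
    HostPort = {"example.com", 443},
    ets:insert(Tab, {{HostPort, <<"id1">>}, session_a}),
    %% ets:match/2 yields [[Id, Session]], not a flat list of sessions.
    Rows = ets:match(Tab, {{HostPort, '$1'}, '$2'}),
    %% Treating each element as a bare session never finds the stored one ...
    FoundNaive = lists:member(session_a, Rows),
    %% ... while unpacking the [Id, Session] pairs first does.
    FoundUnpacked = lists:member(session_a, [S || [_Id, S] <- Rows]),
    ets:delete(Tab),
    {Rows, FoundNaive, FoundUnpacked}.
```

A uniqueness check written against the flat shape therefore concludes that no equivalent session exists and registers a new one every time.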

https://github.com/erlang/otp/blob/maint/lib/ssl/src/ssl_manager.erl#L564

This is an example of the broken behaviour with {reuse_sessions, false}
(should work on R16B02 and R18).

1>  application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2> ets:info(element(2, sys:get_state(whereis(ssl_manager)))).
[{compressed,false},
 {memory,107},
 {owner,<0.45.0>},
 {heir,none},
 {name,ssl_otp_session_cache},
 {size,0},
 {node,nonode@nohost},
 {named_table,false},
 {type,ordered_set},
 {keypos,1},
 {protection,protected}]
3> ssl:close(element(2, ssl:connect("google.com", 443,
[{reuse_sessions, false}]))).
ok
4> ssl:close(element(2, ssl:connect("google.com", 443,
[{reuse_sessions, false}]))).
ok
5> ssl:close(element(2, ssl:connect("google.com", 443,
[{reuse_sessions, false}]))).
ok
6> ssl:close(element(2, ssl:connect("google.com", 443,
[{reuse_sessions, false}]))).
ok
7> ssl:close(element(2, ssl:connect("google.com", 443,
[{reuse_sessions, false}]))).
ok
8> ssl:close(element(2, ssl:connect("google.com", 443,
[{reuse_sessions, false}]))).
ok
9> ssl:close(element(2, ssl:connect("google.com", 443,
[{reuse_sessions, false}]))).
ok
10> ssl:close(element(2, ssl:connect("google.com", 443,
[{reuse_sessions, false}]))).
ok
11> ssl:close(element(2, ssl:connect("google.com", 443,
[{reuse_sessions, false}]))).
ok
12> ets:info(element(2, sys:get_state(whereis(ssl_manager)))).
[{compressed,false},
 {memory,881},
 {owner,<0.45.0>},
 {heir,none},
 {name,ssl_otp_session_cache},
 {size,9},
 {node,nonode@nohost},
 {named_table,false},
 {type,ordered_set},
 {keypos,1},
 {protection,protected}]
_______________________________________________
erlang-bugs mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-bugs

Re: R18 Unbounded SSL Session ETS Table Growth

Stefan Grundmann
Hi,

I think that the default ssl session cache callback module
is unsuitable for Erlang TLS servers that see connections from lots
of clients.

In use cases where session reuse is not needed, setting

  {session_cb,null_ssl_session_cb}

in the ssl application environment, together with the callback module

  -module(null_ssl_session_cb).
  -behaviour(ssl_session_cache_api).
  -export([delete/2,foldl/3,init/1,lookup/2,
           select_session/2,terminate/1,update/3]).
  %% Every callback is a no-op, so no session is ever stored or found.
  init(_) -> invalid.
  terminate(_) -> ok.
  lookup(_,_) -> undefined.
  update(_,_,_) -> ok.
  delete(_,_) -> ok.
  foldl(_,Acc,_) -> Acc.
  select_session(_,_) -> [].

provides an easy workaround.
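
For reference, one way to put that setting into the ssl application environment is a sys.config entry. This fragment is a sketch that assumes the null_ssl_session_cb module above is compiled and on the code path before ssl starts.

```erlang
%% sys.config (fragment): select the callback module used by the ssl
%% application for its session cache.
[
 {ssl, [{session_cb, null_ssl_session_cb}]}
].
```

The same value can also be set at runtime with application:set_env(ssl, session_cb, null_ssl_session_cb), provided that happens before the ssl application is started.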

best regards

Stefan Grundmann


Re: R18 Unbounded SSL Session ETS Table Growth

Ingela Anderton Andin-4
In reply to this post by Ben Murphy-2

Hi!

See inline comments below:

On 09/02/2015 04:15 PM, Ben Murphy wrote:

> I think this was meant to be fixed by the register_unique_session
> function, but the fix does not work because it assumes the return value
> of select_session is [#session{}] when it is really [ [binary(),
> #session{}] ].

Thank you for the detailed explanations and suggestions.
Yes, there is definitely a bug here. I am analyzing the best way to
fix it now.
I think in the short run we will fix register_unique_session so that it
works as intended. We will also analyze whether there are further
improvements that can be made to keep the session table "fresh" and small.

> https://github.com/erlang/otp/blob/maint/lib/ssl/src/ssl_manager.erl#L564


Regards Ingela Erlang/OTP team - Ericsson AB




