crypto:hmac/3 using hardware acceleration

crypto:hmac/3 using hardware acceleration

Ben Browitt
Profiling my server with eprof shows that my app spends most of its time in crypto:hmac(sha, Key, Data), while crypto:stream_encrypt/2 using aes_ctr is much faster.

This answer [1] says that the HMAC_* routines are software-based and don't use hardware acceleration. The hmac NIF seems to use them [2].

Is my conclusion correct?
Is it possible to use the EVP_* functions instead of the HMAC_* functions in the NIF?
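
For anyone wanting to reproduce this kind of comparison outside a profiler, a minimal sketch follows. It is not the poster's setup: the module name hmac_vs_ctr, the payload size and the iteration count are placeholders, and it calls crypto:mac/4 and crypto:crypto_one_time/5 (available from OTP 22.1) in place of crypto:hmac/3 and crypto:stream_encrypt/2, on the assumption that they exercise comparable OpenSSL code.

```erlang
-module(hmac_vs_ctr).
-export([compare/2]).

%% Time N calls each of HMAC-SHA1 and AES-128-CTR over the same
%% Size-byte payload; return the mean microseconds per call for both.
compare(Size, N) when Size > 0, N > 0 ->
    Data   = crypto:strong_rand_bytes(Size),
    MacKey = crypto:strong_rand_bytes(20),
    CtrKey = crypto:strong_rand_bytes(16),
    IV     = crypto:strong_rand_bytes(16),
    Mac = fun() -> crypto:mac(hmac, sha, MacKey, Data) end,
    Ctr = fun() -> crypto:crypto_one_time(aes_128_ctr, CtrKey, IV, Data, true) end,
    #{hmac_sha1_us => avg_us(Mac, N), aes_128_ctr_us => avg_us(Ctr, N)}.

%% One timer:tc/1 measurement around the whole batch, divided by N.
avg_us(F, N) ->
    {Micros, ok} = timer:tc(fun() -> repeat(F, N) end),
    Micros / N.

repeat(_F, 0) -> ok;
repeat(F, N) -> _ = F(), repeat(F, N - 1).
```

For example, hmac_vs_ctr:compare(1000, 100000) returns a map with the two averages; the absolute values depend on the CPU and on how OpenSSL was built.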

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: crypto:hmac/3 using hardware acceleration

Hans Nilsson R (AL/EAB)
Yes, it's possible to use the EVP_* functions. But it is not something that we plan to do at the moment.

But pull-requests are always welcome :)

/Hans

(In reply to Ben Browitt's message of Tue, 2019-05-07 at 19:19 +0300.)

Re: crypto:hmac/3 using hardware acceleration

Ben Browitt
I've tested the speed with and without EVP. EVP is slower because Intel CPUs don't have hardware acceleration for SHA.
So it's best to leave it without EVP for now. Thanks.
openssl speed sha1
openssl speed -evp sha1

(In reply to Hans Nilsson R's message of Wed, May 8, 2019 at 1:06 PM.)

Re: crypto:hmac/3 using hardware acceleration

zxq9-2
On Wednesday, 8 May 2019, 14:15:51 JST, Ben Browitt wrote:
> I've tested the speed with and without evp. evp is slower because Intel
> cpus don't have hardware acceleration for sha.
> So it's best to leave it without evp for now. Thanks.
> openssl speed sha1
> openssl speed -evp sha1

I think it depends on how your OpenSSL was built and which processor
family you have. IIRC Intel has SHA1 hardware support, and AMD has had
SHA1 and SHA256 hardware instructions since Ryzen.

It may also depend on whether you are running virtualized and whether the
hypervisor is exposing the instructions.

In the base case I imagine it would "just work", but not if this is
disabled in a vanilla Linux/BSD/whatever distribution binary, or if
your system is set to a mode that restricts some instructions.

-Craig

Re: crypto:hmac/3 using hardware acceleration

Ben Browitt
AWS [1] and GCP [2] provide AMD EPYC servers with SHA hardware acceleration.
Intel Ice Lake servers will also have SHA hardware acceleration [3].
Is there a chance OTP 23 could use EVP for SHA? This would give a large performance boost.


(In reply to Craig's message of Wed, May 8, 2019 at 4:34 PM.)

Re: [erlang-questions] crypto:hmac/3 using hardware acceleration

Hans Nilsson R (AL/EAB)
Crypto uses the EVP interface for hash and mac (as well as ciphers), with some conditions:

Since OTP-22.1:
The hash functions in crypto (hash/2, hash_init/1, hash_update/2 and hash_final/1) use the EVP interface if the underlying cryptolib is OpenSSL 1.0.0 or higher.

Since OTP-22.1.3:
The mac functions (mac, macN, mac_init, mac_update, mac_final and mac_finalN) use the EVP interface if the underlying cryptolib is OpenSSL 1.1.1 or higher.
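
To see which of these conditions could apply on a given node, the OTP release and the version of the linked cryptolib can be read at runtime. A small sketch; the module name crypto_lib_info is made up, and it only reports versions. It does not prove which OpenSSL interface a particular call ends up using.

```erlang
-module(crypto_lib_info).
-export([report/0]).

%% Collect the OTP release and the name/version of the cryptolib that
%% the crypto application is linked against.
report() ->
    {ok, _} = application:ensure_all_started(crypto),
    %% crypto:info_lib/0 returns a list of {Name, VersionNumber, VersionString}.
    [{Name, VerNum, VerStr} | _] = crypto:info_lib(),
    #{otp_release    => erlang:system_info(otp_release),
      cryptolib      => Name,
      version_number => VerNum,
      version_string => VerStr}.
```

Calling crypto_lib_info:report() in the shell gives a map whose version_string can be compared by eye against the OpenSSL 1.0.0 and 1.1.1 thresholds mentioned above.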

/Hans

(In reply to Ben Browitt's message of 18 February 2020, 17:55.)

Re: crypto:hmac/3 using hardware acceleration

Ben Browitt
Thank you Hans, that's great.
I probably missed it in the release notes.
I'll benchmark and compare HMAC on servers with and without SHA hardware acceleration.

(In reply to Hans Nilsson R's message of Wed, Feb 19, 2020 at 5:18 PM.)

Re: [erlang-questions] crypto:hmac/3 using hardware acceleration

Hans Nilsson R (AL/EAB)
Well, it isn't very clear in the release notes, so it's not surprising that you didn't know about it.

I'm VERY interested in the results of your benchmarking!

/Hans

(In reply to Ben Browitt's message of 19 February 2020, 18:14.)

Re: crypto:hmac/3 using hardware acceleration

Ben Browitt
Hope to test soon. AMD servers on GCP will probably be available in the next few days.
This is how I'm going to benchmark, unless someone has a better suggestion:
Key = crypto:strong_rand_bytes(20),
Data = crypto:strong_rand_bytes(1000),
MacLength = 10,
%% TC(M, F, A, N): time N calls of M:F(A), drop the first (warm-up) sample,
%% print min/max/median/average in microseconds and return the median.
TC = fun(TC_M, TC_F, TC_A, TC_N) when TC_N > 1 ->
         TC_L = tl([begin
                        {TC_T, _Result} = timer:tc(TC_M, TC_F, TC_A),
                        TC_T
                    end || _ <- lists:seq(1, TC_N)]),
         TC_Min = lists:min(TC_L),
         TC_Max = lists:max(TC_L),
         TC_Med = lists:nth(round((TC_N - 1) / 2), lists:sort(TC_L)),
         TC_Avg = round(lists:foldl(fun(TC_X, TC_Sum) -> TC_X + TC_Sum end, 0, TC_L) / (TC_N - 1)),
         io:format("Range: ~b - ~b mics~nMedian: ~b mics ~nAverage: ~b mics~n",
                   [TC_Min, TC_Max, TC_Med, TC_Avg]),
         TC_Med
     end.
TC(crypto, macN, [hmac, sha, Key, Data, MacLength], 1000).

And with:
openssl speed -evp sha1

(In reply to Hans Nilsson R's message of Thu, Feb 20, 2020 at 2:15 PM.)

Re: crypto:hmac/3 using hardware acceleration

Ben Browitt
I've compared:
Intel Skylake - no SHA hardware extension, N1 machine type on GCP [1]
Second generation AMD EPYC Rome processor - has SHA hardware extension, N2D machine type on GCP [2]

Ubuntu 18.04
OpenSSL 1.1.1
OTP-22.2.7 (erlang-solutions deb package)

openssl speed -evp sha1 on AMD EPYC is about 2X faster than Intel Skylake.
crypto:macN/5 on AMD EPYC is about 25% faster than Intel Skylake.

It doesn't seem like crypto:macN/5 on AMD is using the SHA hardware extension. The 25% increase is probably just because Skylake is several years older than the second-generation AMD EPYC.
Is my test correct?

Tests:
Key = crypto:strong_rand_bytes(20),
Data = crypto:strong_rand_bytes(1000),
MacLength = 10,
%% TC(M, F, A, N): time N calls of M:F(A), drop the first (warm-up) sample,
%% print min/max/median/average in microseconds and return the median.
TC = fun(TC_M, TC_F, TC_A, TC_N) when TC_N > 1 ->
         TC_L = tl([begin
                        {TC_T, _Result} = timer:tc(TC_M, TC_F, TC_A),
                        TC_T
                    end || _ <- lists:seq(1, TC_N)]),
         TC_Min = lists:min(TC_L),
         TC_Max = lists:max(TC_L),
         TC_Med = lists:nth(round((TC_N - 1) / 2), lists:sort(TC_L)),
         TC_Avg = round(lists:foldl(fun(TC_X, TC_Sum) -> TC_X + TC_Sum end, 0, TC_L) / (TC_N - 1)),
         io:format("Range: ~b - ~b mics~nMedian: ~b mics ~nAverage: ~b mics~n",
                   [TC_Min, TC_Max, TC_Med, TC_Avg]),
         TC_Med
     end.
TC(crypto, macN, [hmac, sha, Key, Data, MacLength], 1000000).

1) Intel Skylake
TC(crypto, macN, [hmac, sha, Key, Data, MacLength], 1000000).
Range: 3 - 729 mics
Median: 4 mics
Average: 4 mics

openssl speed sha1
Doing sha1 for 3s on 16 size blocks: 19026044 sha1's in 2.99s
Doing sha1 for 3s on 64 size blocks: 11512925 sha1's in 2.98s
Doing sha1 for 3s on 256 size blocks: 5769743 sha1's in 2.98s
Doing sha1 for 3s on 1024 size blocks: 1927668 sha1's in 2.98s
Doing sha1 for 3s on 8192 size blocks: 265026 sha1's in 2.98s
Doing sha1 for 3s on 16384 size blocks: 133488 sha1's in 2.98s
OpenSSL 1.1.1  11 Sep 2018
built on: Tue Nov 12 16:58:35 2019 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-kxN_24/openssl-1.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha1            101811.61k   247257.45k   495655.77k   662393.30k   728554.70k   733915.23k

openssl speed -evp sha1
Doing sha1 for 3s on 16 size blocks: 11590063 sha1's in 2.99s
Doing sha1 for 3s on 64 size blocks: 8259388 sha1's in 2.97s
Doing sha1 for 3s on 256 size blocks: 4853323 sha1's in 2.99s
Doing sha1 for 3s on 1024 size blocks: 1796528 sha1's in 2.98s
Doing sha1 for 3s on 8192 size blocks: 259970 sha1's in 2.99s
Doing sha1 for 3s on 16384 size blocks: 131515 sha1's in 2.99s
OpenSSL 1.1.1  11 Sep 2018
built on: Tue Nov 12 16:58:35 2019 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-kxN_24/openssl-1.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha1             62020.40k   177980.08k   415535.35k   617330.43k   712265.63k   720649.42k

2) AMD EPYC
TC(crypto, macN, [hmac, sha, Key, Data, MacLength], 1000000).
Range: 3 - 515 mics
Median: 3 mics
Average: 3 mics

openssl speed sha1
Doing sha1 for 3s on 16 size blocks: 39862496 sha1's in 3.00s
Doing sha1 for 3s on 64 size blocks: 25451866 sha1's in 3.00s
Doing sha1 for 3s on 256 size blocks: 13073739 sha1's in 3.00s
Doing sha1 for 3s on 1024 size blocks: 4463324 sha1's in 3.00s
Doing sha1 for 3s on 8192 size blocks: 622138 sha1's in 3.00s
Doing sha1 for 3s on 16384 size blocks: 314316 sha1's in 3.00s
OpenSSL 1.1.1  11 Sep 2018
built on: Tue Nov 12 16:58:35 2019 UTC
options:bn(64,64) rc4(8x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-kxN_24/openssl-1.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha1            212599.98k   542973.14k  1115625.73k  1523481.26k  1698851.50k  1716584.45k

openssl speed -evp sha1
Doing sha1 for 3s on 16 size blocks: 17719869 sha1's in 3.00s
Doing sha1 for 3s on 64 size blocks: 14559842 sha1's in 3.00s
Doing sha1 for 3s on 256 size blocks: 9433054 sha1's in 3.00s
Doing sha1 for 3s on 1024 size blocks: 3938020 sha1's in 3.00s
Doing sha1 for 3s on 8192 size blocks: 607605 sha1's in 2.99s
Doing sha1 for 3s on 16384 size blocks: 309279 sha1's in 3.00s
OpenSSL 1.1.1  11 Sep 2018
built on: Tue Nov 12 16:58:35 2019 UTC
options:bn(64,64) rc4(8x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-kxN_24/openssl-1.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha1             94505.97k   310609.96k   804953.94k  1344177.49k  1664715.77k  1689075.71k
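
One thing that may limit the Erlang numbers above: each call is timed on its own and reported in whole microseconds, so a median of 3 vs 4 microseconds cannot resolve much. A possible alternative, sketched under the assumption that the loop overhead is small next to the crypto call, is to time a whole batch once and divide. batch_bench and ns_per_call/4 are hypothetical names, not something used in the thread.

```erlang
-module(batch_bench).
-export([ns_per_call/4]).

%% Run apply(M, F, A) N times inside a single timer:tc/1 measurement
%% and return the mean cost per call in nanoseconds (as a float).
ns_per_call(M, F, A, N) when N > 0 ->
    {Micros, ok} = timer:tc(fun() -> loop(M, F, A, N) end),
    Micros * 1000 / N.

loop(_M, _F, _A, 0) -> ok;
loop(M, F, A, N) ->
    _ = apply(M, F, A),
    loop(M, F, A, N - 1).
```

With Key, Data and MacLength bound as in the test above, batch_bench:ns_per_call(crypto, macN, [hmac, sha, Key, Data, MacLength], 1000000) gives a per-call figure with sub-microsecond resolution, at the cost of losing the min/max/median breakdown.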




(In reply to Ben Browitt's message of Thu, Feb 20, 2020 at 4:34 PM.)

Re: crypto:hmac/3 using hardware acceleration

Ben Browitt
I've run a better test, using just hash(sha, Data) instead of hmac (which does more work), and got a 2x speedup (1 microsecond vs. 2 microseconds).
Is there a way to improve the crypto:macN/5 speed, to get it closer to a 100% gain instead of just 25%?
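
As background on the extra work that hmac does: per RFC 2104, HMAC runs the hash twice, once over the padded key plus the data and once over the padded key plus the first digest. The sketch below spells that out for SHA-1 on top of crypto:hash/2. It is only meant to illustrate the construction, not how the crypto NIF or OpenSSL implements it; hmac_by_hand is a made-up module name.

```erlang
-module(hmac_by_hand).
-export([hmac_sha1/2]).

-define(BLOCK, 64).  %% SHA-1 block size in bytes

%% HMAC(K, M) = H((K0 xor opad) || H((K0 xor ipad) || M))
hmac_sha1(Key, Data) ->
    K0   = pad_key(Key),
    IPad = crypto:exor(K0, binary:copy(<<16#36>>, ?BLOCK)),
    OPad = crypto:exor(K0, binary:copy(<<16#5c>>, ?BLOCK)),
    Inner = crypto:hash(sha, [IPad, Data]),   %% first pass: key block + data
    crypto:hash(sha, [OPad, Inner]).          %% second pass: key block + digest

%% Keys longer than one block are hashed first; shorter keys are
%% zero-padded up to the block size.
pad_key(Key) when byte_size(Key) > ?BLOCK ->
    pad_key(crypto:hash(sha, Key));
pad_key(Key) ->
    PadLen = ?BLOCK - byte_size(Key),
    <<Key/binary, 0:(PadLen * 8)>>.
```

For a 20-byte key and 1000 bytes of data this means hashing 1064 bytes and then 84 bytes, versus a single pass over the data for crypto:hash/2; any fixed per-call setup cost comes on top of that.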

We can disable the SHA hardware extension, to test its effect, by starting the shell with the following environment variable:
OPENSSL_ia32cap=:~0x20000000 erl
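
For reference, the 0x20000000 in that value is bit 29. As far as the OpenSSL documentation for OPENSSL_ia32cap describes it, the part after the colon addresses the second 64-bit capability word, where bit 29 flags the SHA extensions, and the ~ prefix clears the listed bits. A tiny sketch of the arithmetic (ia32cap_mask is a hypothetical helper, not part of OTP or OpenSSL):

```erlang
-module(ia32cap_mask).
-export([sha_mask/0, env_value/0]).

%% Bit 29 of the second capability word, i.e. 16#20000000.
sha_mask() -> 1 bsl 29.

%% Build the string to assign to OPENSSL_ia32cap: a leading colon to
%% skip the first word, then ~ and the mask in hex to clear that bit.
env_value() ->
    lists:flatten(io_lib:format(":~~0x~.16b", [sha_mask()])).
```

ia32cap_mask:env_value() reproduces the value used above, which makes it easier to adapt the same approach to another capability bit.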

Benchmark code:
Data = crypto:strong_rand_bytes(1024),
%% TC(M, F, A, N): time N calls of M:F(A), drop the first (warm-up) sample,
%% print min/max/median/average in microseconds and return the median.
TC = fun(TC_M, TC_F, TC_A, TC_N) when TC_N > 1 ->
         TC_L = tl([begin
                        {TC_T, _Result} = timer:tc(TC_M, TC_F, TC_A),
                        TC_T
                    end || _ <- lists:seq(1, TC_N)]),
         TC_Min = lists:min(TC_L),
         TC_Max = lists:max(TC_L),
         TC_Med = lists:nth(round((TC_N - 1) / 2), lists:sort(TC_L)),
         TC_Avg = round(lists:foldl(fun(TC_X, TC_Sum) -> TC_X + TC_Sum end, 0, TC_L) / (TC_N - 1)),
         io:format("Range: ~b - ~b mics~nMedian: ~b mics ~nAverage: ~b mics~n",
                   [TC_Min, TC_Max, TC_Med, TC_Avg]),
         TC_Med
     end.
TC(crypto, hash, [sha, Data], 1000000).

(In reply to Ben Browitt's message of Fri, Feb 21, 2020 at 2:38 AM.)
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha1             94505.97k   310609.96k   804953.94k  1344177.49k  1664715.77k  1689075.71k
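
For anyone reading the tables: each throughput figure is just the iteration count times the block size, divided by the elapsed seconds and by 1000. A minimal sketch in the Erlang shell, using only the AMD EPYC 16-byte figures printed above (`Throughput`, `Plain` and `Evp` are names introduced here for illustration):

```erlang
%% Throughput in thousands of bytes per second, the unit `openssl speed` prints.
Throughput = fun(Count, BlockSize, Seconds) ->
    Count * BlockSize / Seconds / 1000
end.

%% AMD EPYC, 16-byte blocks; counts and times copied from the runs above.
Plain = Throughput(39862496, 16, 3.00),
Evp   = Throughput(17719869, 16, 3.00),
io:format("plain: ~.2fk  evp: ~.2fk  evp/plain: ~.2f~n",
          [Plain, Evp, Evp / Plain]).
```

The same calculation on the 16384-byte rows gives a ratio close to 1, which suggests the -evp cost on this machine is mostly per-call overhead rather than per-byte cost.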




On Thu, Feb 20, 2020 at 4:34 PM Ben Browitt <[hidden email]> wrote:
Hope to test soon. AMD servers on GCP will probably be available in the next few days.
This is how I'm going to benchmark, unless someone has a better suggestion:
Key = crypto:strong_rand_bytes(20),
Data = crypto:strong_rand_bytes(1000),
MacLength = 10,
TC = fun(TC_M, TC_F, TC_A, TC_N) when TC_N > 0 ->
    %% Time TC_N calls (microseconds each) and discard the first sample.
    TC_L = tl([begin {TC_T, _Result} = timer:tc(TC_M, TC_F, TC_A), TC_T end
               || _ <- lists:seq(1, TC_N)]),
    TC_Min = lists:min(TC_L),
    TC_Max = lists:max(TC_L),
    TC_Med = lists:nth(round((TC_N - 1) / 2), lists:sort(TC_L)),
    TC_Avg = round(lists:foldl(fun(TC_X, TC_Sum) -> TC_X + TC_Sum end, 0, TC_L) / (TC_N - 1)),
    io:format("Range: ~b - ~b mics~nMedian: ~b mics ~nAverage: ~b mics~n",
              [TC_Min, TC_Max, TC_Med, TC_Avg]),
    TC_Med
end.
TC(crypto, macN, [hmac, sha, Key, Data, MacLength], 1000).

And with:
openssl speed -evp sha1

On Thu, Feb 20, 2020 at 2:15 PM Hans Nilsson R <[hidden email]> wrote:
Well, it isn't super clear in the release notes, so it is not strange you didn't know it.

I'm VERY interested in the results of your benchmarking!

/Hans

From: Ben Browitt <[hidden email]>
Sent: 19 February 2020 18:14
To: Hans Nilsson R <[hidden email]>
Cc: [hidden email] <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: [erlang-questions] crypto:hmac/3 using hardware acceleration
 
Thank you Hans, that's great.
I probably missed it in the release notes.
I'll benchmark and compare hmac on servers with and without SHA hardware acceleration.

On Wed, Feb 19, 2020 at 5:18 PM Hans Nilsson R <[hidden email]> wrote:
Crypto uses the EVP interface for hash and mac (as well as ciphers) with some conditions:

Since OTP-22.1:
The hash functions in crypto (hash/2, hash_init/1, hash_update/2 and hash_final/1) use the EVP interface if the underlying cryptolib is OpenSSL 1.0.0 or higher.

Since OTP-22.1.3:
The mac functions (mac, macN, mac_init, mac_update, mac_final and mac_finalN) use the EVP interface if the underlying cryptolib is OpenSSL 1.1.1 or higher.

/Hans
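
To check which case applies on a given node, something like the following can be pasted into the Erlang shell. It is only a sketch: it prints the cryptolib that crypto is linked against and computes an HMAC-SHA1 through the mac functions listed above. The variable names (`MKey`, `MData`, `Full`, `Tag`) are made up for this example.

```erlang
%% Which cryptolib is crypto linked against?
%% crypto:info_lib/0 returns [{Name, VersionNumber, VersionString}].
[{Name, _VerNum, VerStr}] = crypto:info_lib(),
io:format("~s: ~s~n", [Name, VerStr]).

%% HMAC-SHA1 through the mac API.
MKey  = crypto:strong_rand_bytes(20),
MData = crypto:strong_rand_bytes(1000),
Full  = crypto:mac(hmac, sha, MKey, MData),        % full 20-byte tag
Tag   = crypto:macN(hmac, sha, MKey, MData, 10),   % limited to 10 bytes
Tag   = binary:part(Full, 0, 10).                  % macN is a prefix of mac
```

If the version string reports OpenSSL 1.1.1 or higher and the node runs OTP 22.1.3 or later, the mac calls should go through EVP according to the conditions above.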

From: erlang-questions <[hidden email]> on behalf of Ben Browitt <[hidden email]>
Sent: 18 February 2020 17:55
To: [hidden email] <[hidden email]>
Cc: [hidden email] <[hidden email]>
Subject: Re: [erlang-questions] crypto:hmac/3 using hardware acceleration
 
AWS [1] and GCP [2] provide AMD EPYC servers with SHA hardware acceleration.
Intel Ice Lake servers will also have SHA hardware acceleration [3].
Is there a chance OTP 23 could use EVP for SHA? This would give a large performance boost on those machines.


On Wed, May 8, 2019 at 4:34 PM <[hidden email]> wrote:
On Wednesday, 8 May 2019 14:15:51 JST Ben Browitt wrote:
> I've tested the speed with and without evp. evp is slower because Intel
> cpus don't have hardware acceleration for sha.
> So it's best to leave it without evp for now. Thanks.
> openssl speed sha1
> openssl speed -evp sha1

I think it depends on how your openssl was built and which processor
family you have. IIRC Intel has SHA1 hardware support, and AMD has
SHA1 and SHA256 hardware instructions since Ryzen.

May also depend on if you are running virtualized and whether the
hypervisor is exposing the instructions.

In the base case I imagine it would "just work", but not if this is
disabled in a vanilla Linux/BSD/whatever distribution binary, or if
your system is set to a mode that restricts some instructions.

-Craig
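
One way to see whether the instructions are visible inside a guest, sketched for the Erlang shell. This assumes a Linux x86 system, where the kernel lists the SHA extensions as the `sha_ni` flag in /proc/cpuinfo; `HasShaNi` is a helper name invented for this example, not part of OTP.

```erlang
%% True if any "flags" line of a /proc/cpuinfo dump lists sha_ni.
%% Assumption: Linux x86, where the SHA extensions appear as "sha_ni".
HasShaNi = fun(CpuInfo) ->
    Lines = string:split(CpuInfo, "\n", all),
    lists:any(fun(Line) ->
        case string:split(Line, ":") of
            [Field, Flags] ->
                string:trim(Field) =:= "flags" andalso
                    lists:member("sha_ni", string:lexemes(Flags, " "));
            _ ->
                false
        end
    end, Lines)
end.

%% On the machine under test (assumes /proc/cpuinfo is readable):
HasShaNi(os:cmd("cat /proc/cpuinfo")).
```

A `false` here on hardware that should have the extensions would point at the hypervisor or a restricted CPU mode, not at OpenSSL or OTP.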