-#!/usr/bin/env perl
+#! /usr/bin/env perl
+# Copyright 2010-2016 The OpenSSL Project Authors. All Rights Reserved.
+#
+# Licensed under the OpenSSL license (the "License"). You may not use
+# this file except in compliance with the License. You can obtain a copy
+# in the file LICENSE in the source distribution or at
+# https://www.openssl.org/source/license.html
+
#
# ====================================================================
# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
# The module implements "4-bit" GCM GHASH function and underlying
# single multiplication operation in GF(2^128). "4-bit" means that it
# uses 256 bytes per-key table [+64/128 bytes fixed table]. It has two
-# code paths: vanilla x86 and vanilla MMX. Former will be executed on
-# 486 and Pentium, latter on all others. MMX GHASH features so called
+# code paths: vanilla x86 and vanilla SSE. Former will be executed on
+# 486 and Pentium, latter on all others. SSE GHASH features so called
# "528B" variant of "4-bit" method utilizing additional 256+16 bytes
# of per-key storage [+512 bytes shared table]. Performance results
# are for streamed GHASH subroutine and are expressed in cycles per
# processed byte, less is better:
#
-# gcc 2.95.3(*) MMX assembler x86 assembler
+# gcc 2.95.3(*) SSE assembler x86 assembler
#
-# Pentium 100/112(**) - 50
-# PIII 63 /77 12.2 24
-# P4 96 /122 18.0 84(***)
-# Opteron 50 /71 10.1 30
-# Core2 54 /68 8.6 18
+# Pentium 105/111(**) - 50
+# PIII 68 /75 12.2 24
+# P4 125/125 17.8 84(***)
+# Opteron 66 /70 10.1 30
+# Core2 54 /67 8.4 18
+# Atom 105/105 16.8 53
+# VIA Nano 69 /71 13.0 27
#
# (*) gcc 3.4.x was observed to generate few percent slower code,
# which is one of reasons why 2.95.3 results were chosen,
# another reason is lack of 3.4.x results for older CPUs;
+# comparison with SSE results is not completely fair, because C
+# results are for vanilla "256B" implementation, while
+# assembler results are for "528B";-)
# (**) second number is result for code compiled with -fPIC flag,
# which is actually more relevant, because assembler code is
# position-independent;
#
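+# As a back-of-envelope check of the storage figures above: the "4-bit"
+# per-key table is 16 Htable entries of 16 bytes, i.e. 16*16=256 bytes,
+# and the fixed rem_4bit table is 16 entries of 4 or 8 bytes, i.e.
+# 64/128 bytes. The "528B" variant additionally keeps Htable[]>>4 and a
+# 16-byte (u8)(Htable[]<<4) strip, 256+16 extra bytes or 528 bytes of
+# per-key storage in total, while its shared table is the 256-entry,
+# 2-bytes-per-entry (512-byte) rem_8bit.
+#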
# To summarize, it's >2-5 times faster than gcc-generated code. To
# anchor it to something else SHA1 assembler processes one byte in
-# 11-13 cycles on contemporary x86 cores. As for choice of MMX in
-# particular, see comment at the end of the file...
+# ~7 cycles on contemporary x86 cores. As for choice of MMX/SSE
+# in particular, see comment at the end of the file...
# May 2010
#
-# Add PCLMULQDQ version performing at 2.13 cycles per processed byte.
+# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
# The question is how close is it to theoretical limit? The pclmulqdq
# instruction latency appears to be 14 cycles and there can't be more
# than 2 of them executing at any given time. This means that single
# Before we proceed to this implementation let's have closer look at
# the best-performing code suggested by Intel in their white paper.
# By tracing inter-register dependencies Tmod is estimated as ~19
-# cycles and Naggr is 4, resulting in 2.05 cycles per processed byte.
-# As implied, this is quite optimistic estimate, because it does not
-# account for Karatsuba pre- and post-processing, which for a single
-# multiplication is ~5 cycles. Unfortunately Intel does not provide
-# performance data for GHASH alone, only for fused GCM mode. But
-# we can estimate it by subtracting CTR performance result provided
-# in "AES Instruction Set" white paper: 3.54-1.38=2.16 cycles per
-# processed byte or 5% off the estimate. It should be noted though
-# that 3.54 is GCM result for 16KB block size, while 1.38 is CTR for
-# 1KB block size, meaning that real number is likely to be a bit
-# further from estimate.
+# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per
+# processed byte. As implied, this is quite optimistic estimate,
+# because it does not account for Karatsuba pre- and post-processing,
+# which for a single multiplication is ~5 cycles. Unfortunately Intel
+# does not provide performance data for GHASH alone. But benchmarking
+# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
+# alone resulted in 2.46 cycles per byte out of 16KB buffer. Note that
+# the result accounts even for pre-computing of degrees of the hash
+# key H, but its portion is negligible at 16KB buffer size.
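+# (For the record, the 2.05 figure appears to follow from the
+# (Tmul+Tmod/Naggr) equation referred to below, if one takes
+# Tmul=~2*14=28 cycles for a single Karatsuba multiplication, two
+# 14-cycle pclmulqdq latencies back to back, and divides by the
+# 16-byte block: (28+19/4)/16=~2.05 with Tmod=~19 and Naggr=4.)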
#
# Moving on to the implementation in question. Tmod is estimated as
# ~13 cycles and Naggr is 2, giving asymptotic performance of ...
# 2.16. How is it possible that measured performance is better than
# optimistic theoretical estimate? There is one thing Intel failed
-# to recognize. By fusing GHASH with CTR former's performance is
-# really limited to above (Tmul + Tmod/Naggr) equation. But if GHASH
-# procedure is detached, the modulo-reduction can be interleaved with
-# Naggr-1 multiplications and under ideal conditions even disappear
-# from the equation. So that optimistic theoretical estimate for this
-# implementation is ... 28/16=1.75, and not 2.16. Well, it's probably
-# way too optimistic, at least for such small Naggr. I'd argue that
-# (28+Tproc/Naggr), where Tproc is time required for Karatsuba pre-
-# and post-processing, is more realistic estimate. In this case it
-# gives ... 1.91 cycles per processed byte. Or in other words,
-# depending on how well we can interleave reduction and one of the
-# two multiplications the performance should be betwen 1.91 and 2.16.
-# As already mentioned, this implementation processes one byte [out
-# of 1KB buffer] in 2.13 cycles, while x86_64 counterpart - in 2.07.
-# x86_64 performance is better, because larger register bank allows
-# to interleave reduction and multiplication better.
+# to recognize. By serializing GHASH with CTR in same subroutine
+# former's performance is really limited to above (Tmul + Tmod/Naggr)
+# equation. But if GHASH procedure is detached, the modulo-reduction
+# can be interleaved with Naggr-1 multiplications at instruction level
+# and under ideal conditions even disappear from the equation. So that
+# optimistic theoretical estimate for this implementation is ...
+# 28/16=1.75, and not 2.16. Well, it's probably way too optimistic,
+# at least for such small Naggr. I'd argue that (28+Tproc/Naggr),
+# where Tproc is time required for Karatsuba pre- and post-processing,
+# is more realistic estimate. In this case it gives ... 1.91 cycles.
+# Or in other words, depending on how well we can interleave reduction
+# and one of the two multiplications the performance should be between
+# 1.91 and 2.16. As already mentioned, this implementation processes
+# one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart
+# - in 2.02. x86_64 performance is better, because the larger register
+# bank allows better interleaving of reduction and multiplication.
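+#
+# All estimates above are instances of the same (Tmul+Tmod/Naggr)/16
+# arithmetic with Tmul=28: (28+13/2)/16=~2.16 for the serialized case,
+# 28/16=1.75 for the ideal one, and (28+5/2)/16=~1.91 for the suggested
+# (28+Tproc/Naggr) form with Tproc=~5, e.g.
+# perl -e 'printf "%.2f %.2f %.2f\n",(28+13/2)/16,28/16,(28+5/2)/16'
+# prints "2.16 1.75 1.91".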
#
# Does it make sense to increase Naggr? To start with it's virtually
# impossible in 32-bit mode, because of limited register bank
# providing access to a Westmere-based system on behalf of Intel
# Open Source Technology Centre.
+# January 2010
+#
+# Tweaked to optimize transitions between integer and FP operations
+# on same XMM register, PCLMULQDQ subroutine was measured to process
+# one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on Westmere.
+# The minor regression on Westmere is outweighed by ~15% improvement
+# on Sandy Bridge. Strangely enough attempt to modify 64-bit code in
+# similar manner resulted in almost 20% degradation on Sandy Bridge,
+# where original 64-bit code processes one byte in 1.95 cycles.
+
+#####################################################################
+# For reference, AMD Bulldozer processes one byte in 1.98 cycles in
+# 32-bit mode and 1.89 in 64-bit.
+
+# February 2013
+#
+# Overhaul: aggregate Karatsuba post-processing, improve ILP in
+# reduction_alg9. Resulting performance is 1.96 cycles per byte on
+# Westmere, 1.95 - on Sandy/Ivy Bridge, 1.76 - on Bulldozer.
+
$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
push(@INC,"${dir}","${dir}../../perlasm");
require "x86asm.pl";
+$output=pop;
+open STDOUT,">$output";
+
&asm_init($ARGV[0],"ghash-x86.pl",$x86only = $ARGV[$#ARGV] eq "386");
$sse2=0;
&static_label("rem_4bit");
-if (0) {{ # "May" MMX version is kept for reference...
+if (!$sse2) {{ # pure-MMX "May" version...
$S=12; # shift factor for rem_4bit
# effective address calculation and finally merge of value to Z.hi.
# Reference to rem_4bit is scheduled so late that I had to >>4
# rem_4bit elements. This resulted in 20-45% improvement
-# on contemporary µ-archs.
+# on contemporary µ-archs.
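+# ($S=12 here means the rem_4bit constants at the end of the file are
+# stored >>4 relative to their natural <<48 position, i.e. at <<44, to
+# match that late scheduling; the "June" version below uses the natural
+# $S=16.)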
{
my $cnt;
my $rem_4bit = "eax";
&function_end("gcm_ghash_4bit_mmx");
\f
}} else {{ # "June" MMX version...
- # ... has "April" gcm_gmult_4bit_mmx with folded loop.
- # This is done to conserve code size...
+ # ... has slower "April" gcm_gmult_4bit_mmx with folded
+ # loop. This is done to conserve code size...
$S=16; # shift factor for rem_4bit
sub mmx_loop() {
######################################################################
# Below subroutine is "528B" variant of "4-bit" GCM GHASH function
# (see gcm128.c for details). It provides further 20-40% performance
-# improvement over *previous* version of this module.
+# improvement over above mentioned "May" version.
&static_label("rem_8bit");
{ my @lo = ("mm0","mm1","mm2");
my @hi = ("mm3","mm4","mm5");
my @tmp = ("mm6","mm7");
- my $off1=0,$off2=0,$i;
+ my ($off1,$off2,$i) = (0,0,);
&add ($Htbl,128); # optimize for size
&lea ("edi",&DWP(16+128,"esp"));
&lea ("ebp",&DWP(16+256+128,"esp"));
# decompose Htable (low and high parts are kept separately),
- # generate Htable>>4, save to stack...
+ # generate Htable[]>>4, (u8)(Htable[]<<4), save to stack...
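+	# Resulting stack layout, roughly: the 16-byte (u8)(Htable[]<<4)
+	# strip at esp+0, then the Htable[] and Htable[]>>4 copies with
+	# low and high halves split into four 128-byte slots starting at
+	# esp+16, i.e. 16+4*128=528 bytes of per-key data, with the
+	# current input block cached right past it (the 528+$j references
+	# further down).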
for ($i=0;$i<18;$i++) {
&mov ("edx",&DWP(16*$i+8-128,$Htbl)) if ($i<16);
my @red = ("mm0","mm1","mm2");
my $tmp = "mm3";
- &xor ($dat,&DWP(12,"ecx")); # merge input
+ &xor ($dat,&DWP(12,"ecx")); # merge input data
&xor ("ebx",&DWP(8,"ecx"));
&pxor ($Zhi,&QWP(0,"ecx"));
&lea ("ecx",&DWP(16,"ecx")); # inp+=16
&and (&LB($nlo),0x0f);
&shr ($nhi[1],4);
&pxor ($red[0],$red[0]);
- &rol ($dat,8); # next byte
+ &rol ($dat,8); # next byte
&pxor ($red[1],$red[1]);
&pxor ($red[2],$red[2]);
# Just like in "May" version modulo-schedule for critical path in
- # 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. Final xor
- # is scheduled so late that rem_8bit is shifted *right* by 16,
- # which is why last argument to pinsrw is 2, which corresponds to
- # <<32...
+ # 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. Final 'pxor'
+ # is scheduled so late that rem_8bit[] has to be shifted *right*
+ # by 16, which is why last argument to pinsrw is 2, which
+ # corresponds to <<32=<<48>>16...
for ($j=11,$i=0;$i<15;$i++) {
if ($i>0) {
&pxor ($Zlo,$tmp);
&pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
- &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^H[nhi]<<4
+ &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4)
} else {
&movq ($Zlo,&QWP(16,"esp",$nlo,8));
&movq ($Zhi,&QWP(16+128,"esp",$nlo,8));
}
&mov (&LB($nlo),&LB($dat));
- &mov ($dat,&DWP(528+$j,"esp")) if (--$j%4==0);
+ &mov ($dat,&DWP(528+$j,"esp")) if (--$j%4==0);
&movd ($rem[0],$Zlo);
- &movz ($rem[1],&LB($rem[1])) if ($i>0);
- &psrlq ($Zlo,8);
+ &movz ($rem[1],&LB($rem[1])) if ($i>0);
+ &psrlq ($Zlo,8); # Z>>=8
&movq ($tmp,$Zhi);
&mov ($nhi[0],$nlo);
&pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo]
&pxor ($Zhi,&QWP(16+128,"esp",$nlo,8));
- &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); #$rem[0]); # rem^H[nhi]<<4
+ &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4)
&pxor ($Zlo,$tmp);
&pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
&psllq ($red[1],4);
&movd ($rem[0],$Zlo);
- &psrlq ($Zlo,4);
+ &psrlq ($Zlo,4); # Z>>=4
&movq ($tmp,$Zhi);
&psrlq ($Zhi,4);
- &shl ($rem[0],4);
+ &shl ($rem[0],4); # rem<<4
&pxor ($Zlo,&QWP(16,"esp",$nhi[1],8)); # Z^=H[nhi]
&psllq ($tmp,60);
&pxor ($Zhi,$red[1]);
&movd ($dat,$Zlo);
- &pinsrw ($red[2],&WP(0,$rem_8bit,$rem[0],2),3);
+ &pinsrw ($red[2],&WP(0,$rem_8bit,$rem[0],2),3); # last is <<48
- &psllq ($red[0],12);
+ &psllq ($red[0],12); # correct by <<16>>4
&pxor ($Zhi,$red[0]);
&psrlq ($Zlo,32);
&pxor ($Zhi,$red[2]);
&static_label("bswap");
sub clmul64x64_T2 { # minimal "register" pressure
-my ($Xhi,$Xi,$Hkey)=@_;
+my ($Xhi,$Xi,$Hkey,$HK)=@_;
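+# Carry-less Karatsuba: with X=Xh:Xl and H=Hh:Hl split in 64-bit
+# halves, X*H = Xh*Hh<<128 ^ Xl*Hl ^ (Xh*Hh^Xl*Hl^(Xh^Xl)*(Hh^Hl))<<64,
+# i.e. three pclmulqdq instead of four. $HK, when supplied, is the
+# pre-computed Hh^Hl ("Karatsuba salt" prepared by gcm_init_clmul
+# below); otherwise it is derived here.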
&movdqa ($Xhi,$Xi); #
&pshufd ($T1,$Xi,0b01001110);
- &pshufd ($T2,$Hkey,0b01001110);
+ &pshufd ($T2,$Hkey,0b01001110) if (!defined($HK));
&pxor ($T1,$Xi); #
- &pxor ($T2,$Hkey);
+ &pxor ($T2,$Hkey) if (!defined($HK));
+ $HK=$T2 if (!defined($HK));
&pclmulqdq ($Xi,$Hkey,0x00); #######
&pclmulqdq ($Xhi,$Hkey,0x11); #######
- &pclmulqdq ($T1,$T2,0x00); #######
- &pxor ($T1,$Xi); #
- &pxor ($T1,$Xhi); #
+ &pclmulqdq ($T1,$HK,0x00); #######
+ &xorps ($T1,$Xi); #
+ &xorps ($T1,$Xhi); #
&movdqa ($T2,$T1); #
&psrldq ($T1,8);
# below. Algorithm 9 was therefore chosen for
# further optimization...
-sub reduction_alg9 { # 17/13 times faster than Intel version
+sub reduction_alg9 { # 17/11 times faster than Intel version
my ($Xhi,$Xi) = @_;
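+	# Fold the 256-bit Xhi:Xi product modulo the bit-reflected GCM
+	# polynomial x^128+x^7+x^2+x+1 (cf. 0x1c2_polynomial below):
+	# 1st phase adds Xi shifted left by 57, 62 and 63 (composed as
+	# <<5, <<1, <<57), 2nd phase adds the result shifted right by
+	# 1, 2 and 7 plus the high half.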
# 1st phase
- &movdqa ($T1,$Xi) #
+ &movdqa ($T2,$Xi); #
+ &movdqa ($T1,$Xi);
+ &psllq ($Xi,5);
+ &pxor ($T1,$Xi); #
&psllq ($Xi,1);
&pxor ($Xi,$T1); #
- &psllq ($Xi,5); #
- &pxor ($Xi,$T1); #
&psllq ($Xi,57); #
- &movdqa ($T2,$Xi); #
+ &movdqa ($T1,$Xi); #
&pslldq ($Xi,8);
- &psrldq ($T2,8); #
- &pxor ($Xi,$T1);
- &pxor ($Xhi,$T2); #
+ &psrldq ($T1,8); #
+ &pxor ($Xi,$T2);
+ &pxor ($Xhi,$T1); #
# 2nd phase
&movdqa ($T2,$Xi);
+ &psrlq ($Xi,1);
+ &pxor ($Xhi,$T2); #
+ &pxor ($T2,$Xi);
&psrlq ($Xi,5);
&pxor ($Xi,$T2); #
&psrlq ($Xi,1); #
- &pxor ($Xi,$T2); #
- &pxor ($T2,$Xhi);
- &psrlq ($Xi,1); #
- &pxor ($Xi,$T2); #
+ &pxor ($Xi,$Xhi) #
}
&function_begin_B("gcm_init_clmul");
&clmul64x64_T2 ($Xhi,$Xi,$Hkey);
&reduction_alg9 ($Xhi,$Xi);
+ &pshufd ($T1,$Hkey,0b01001110);
+ &pshufd ($T2,$Xi,0b01001110);
+ &pxor ($T1,$Hkey); # Karatsuba pre-processing
&movdqu (&QWP(0,$Htbl),$Hkey); # save H
+ &pxor ($T2,$Xi); # Karatsuba pre-processing
&movdqu (&QWP(16,$Htbl),$Xi); # save H^2
+ &palignr ($T2,$T1,8); # low part is H.lo^H.hi
+ &movdqu (&QWP(32,$Htbl),$T2); # save Karatsuba "salt"
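+	# Htbl now holds H at offset 0, H^2 at 16 and, at 32, the packed
+	# Karatsuba constants: low qword H.lo^H.hi, high qword
+	# (H^2).lo^(H^2).hi, picked up later via pclmulqdq immediates
+	# 0x00 and 0x10 respectively.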
&ret ();
&function_end_B("gcm_init_clmul");
&movdqu ($Xi,&QWP(0,$Xip));
&movdqa ($T3,&QWP(0,$const));
- &movdqu ($Hkey,&QWP(0,$Htbl));
+ &movups ($Hkey,&QWP(0,$Htbl));
&pshufb ($Xi,$T3);
+ &movups ($T2,&QWP(32,$Htbl));
- &clmul64x64_T2 ($Xhi,$Xi,$Hkey);
+ &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$T2);
&reduction_alg9 ($Xhi,$Xi);
&pshufb ($Xi,$T3);
&movdqu ($Xn,&QWP(16,$inp)); # Ii+1
&pshufb ($T1,$T3);
&pshufb ($Xn,$T3);
+ &movdqu ($T3,&QWP(32,$Htbl));
&pxor ($Xi,$T1); # Ii+Xi
- &clmul64x64_T2 ($Xhn,$Xn,$Hkey); # H*Ii+1
- &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2
-
+ &pshufd ($T1,$Xn,0b01001110); # H*Ii+1
+ &movdqa ($Xhn,$Xn);
+ &pxor ($T1,$Xn); #
&lea ($inp,&DWP(32,$inp)); # i+=2
+
+ &pclmulqdq ($Xn,$Hkey,0x00); #######
+ &pclmulqdq ($Xhn,$Hkey,0x11); #######
+ &pclmulqdq ($T1,$T3,0x00); #######
+ &movups ($Hkey,&QWP(16,$Htbl)); # load H^2
+ &nop ();
+
&sub ($len,0x20);
&jbe (&label("even_tail"));
+ &jmp (&label("mod_loop"));
-&set_label("mod_loop");
- &clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi)
- &movdqu ($T1,&QWP(0,$inp)); # Ii
- &movdqu ($Hkey,&QWP(0,$Htbl)); # load H
+&set_label("mod_loop",32);
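+	# Two blocks per iteration: effectively Xi is replaced with
+	# reduce((Xi^I[2k])*H^2 ^ I[2k+1]*H), scheduled so that the
+	# modulo reduction overlaps one of the multiplications, as
+	# discussed at the top of the file.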
+ &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi)
+ &movdqa ($Xhi,$Xi);
+ &pxor ($T2,$Xi); #
+ &nop ();
- &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi)
- &pxor ($Xhi,$Xhn);
+ &pclmulqdq ($Xi,$Hkey,0x00); #######
+ &pclmulqdq ($Xhi,$Hkey,0x11); #######
+ &pclmulqdq ($T2,$T3,0x10); #######
+ &movups ($Hkey,&QWP(0,$Htbl)); # load H
- &movdqu ($Xn,&QWP(16,$inp)); # Ii+1
- &pshufb ($T1,$T3);
- &pshufb ($Xn,$T3);
+ &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi)
+ &movdqa ($T3,&QWP(0,$const));
+ &xorps ($Xhi,$Xhn);
+ &movdqu ($Xhn,&QWP(0,$inp)); # Ii
+ &pxor ($T1,$Xi); # aggregated Karatsuba post-processing
+ &movdqu ($Xn,&QWP(16,$inp)); # Ii+1
+ &pxor ($T1,$Xhi); #
- &movdqa ($T3,$Xn); #&clmul64x64_TX ($Xhn,$Xn,$Hkey); H*Ii+1
- &movdqa ($Xhn,$Xn);
- &pxor ($Xhi,$T1); # "Ii+Xi", consume early
+ &pshufb ($Xhn,$T3);
+ &pxor ($T2,$T1); #
- &movdqa ($T1,$Xi) #&reduction_alg9($Xhi,$Xi); 1st phase
+ &movdqa ($T1,$T2); #
+ &psrldq ($T2,8);
+ &pslldq ($T1,8); #
+ &pxor ($Xhi,$T2);
+ &pxor ($Xi,$T1); #
+ &pshufb ($Xn,$T3);
+ &pxor ($Xhi,$Xhn); # "Ii+Xi", consume early
+
+ &movdqa ($Xhn,$Xn); #&clmul64x64_TX ($Xhn,$Xn,$Hkey); H*Ii+1
+ &movdqa ($T2,$Xi); #&reduction_alg9($Xhi,$Xi); 1st phase
+ &movdqa ($T1,$Xi);
+ &psllq ($Xi,5);
+ &pxor ($T1,$Xi); #
&psllq ($Xi,1);
&pxor ($Xi,$T1); #
- &psllq ($Xi,5); #
- &pxor ($Xi,$T1); #
&pclmulqdq ($Xn,$Hkey,0x00); #######
+ &movups ($T3,&QWP(32,$Htbl));
&psllq ($Xi,57); #
- &movdqa ($T2,$Xi); #
+ &movdqa ($T1,$Xi); #
&pslldq ($Xi,8);
- &psrldq ($T2,8); #
- &pxor ($Xi,$T1);
- &pshufd ($T1,$T3,0b01001110);
+ &psrldq ($T1,8); #
+ &pxor ($Xi,$T2);
+ &pxor ($Xhi,$T1); #
+ &pshufd ($T1,$Xhn,0b01001110);
+ &movdqa ($T2,$Xi); # 2nd phase
+ &psrlq ($Xi,1);
+ &pxor ($T1,$Xhn);
&pxor ($Xhi,$T2); #
- &pxor ($T1,$T3);
- &pshufd ($T3,$Hkey,0b01001110);
- &pxor ($T3,$Hkey); #
-
&pclmulqdq ($Xhn,$Hkey,0x11); #######
- &movdqa ($T2,$Xi); # 2nd phase
+ &movups ($Hkey,&QWP(16,$Htbl)); # load H^2
+ &pxor ($T2,$Xi);
&psrlq ($Xi,5);
&pxor ($Xi,$T2); #
&psrlq ($Xi,1); #
- &pxor ($Xi,$T2); #
- &pxor ($T2,$Xhi);
- &psrlq ($Xi,1); #
- &pxor ($Xi,$T2); #
-
+ &pxor ($Xi,$Xhi) #
&pclmulqdq ($T1,$T3,0x00); #######
- &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2
- &pxor ($T1,$Xn); #
- &pxor ($T1,$Xhn); #
-
- &movdqa ($T3,$T1); #
- &psrldq ($T1,8);
- &pslldq ($T3,8); #
- &pxor ($Xhn,$T1);
- &pxor ($Xn,$T3); #
- &movdqa ($T3,&QWP(0,$const));
&lea ($inp,&DWP(32,$inp));
&sub ($len,0x20);
&ja (&label("mod_loop"));
&set_label("even_tail");
- &clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi)
+ &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi)
+ &movdqa ($Xhi,$Xi);
+ &pxor ($T2,$Xi); #
- &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi)
- &pxor ($Xhi,$Xhn);
+ &pclmulqdq ($Xi,$Hkey,0x00); #######
+ &pclmulqdq ($Xhi,$Hkey,0x11); #######
+ &pclmulqdq ($T2,$T3,0x10); #######
+ &movdqa ($T3,&QWP(0,$const));
+
+ &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi)
+ &xorps ($Xhi,$Xhn);
+ &pxor ($T1,$Xi); # aggregated Karatsuba post-processing
+ &pxor ($T1,$Xhi); #
+
+ &pxor ($T2,$T1); #
+
+ &movdqa ($T1,$T2); #
+ &psrldq ($T2,8);
+ &pslldq ($T1,8); #
+ &pxor ($Xhi,$T2);
+ &pxor ($Xi,$T1); #
&reduction_alg9 ($Xhi,$Xi);
&test ($len,$len);
&jnz (&label("done"));
- &movdqu ($Hkey,&QWP(0,$Htbl)); # load H
+ &movups ($Hkey,&QWP(0,$Htbl)); # load H
&set_label("odd_tail");
&movdqu ($T1,&QWP(0,$inp)); # Ii
&pshufb ($T1,$T3);
&set_label("bswap",64);
&data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
&data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2); # 0x1c2_polynomial
-}} # $sse2
-
-&set_label("rem_4bit",64);
- &data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S);
- &data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S);
- &data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S);
- &data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S);
&set_label("rem_8bit",64);
&data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E);
&data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E);
&data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E);
&data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE);
&data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE);
+}} # $sse2
+
+&set_label("rem_4bit",64);
+ &data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S);
+ &data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S);
+ &data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S);
+ &data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S);
}}} # !$x86only
&asciz("GHASH for x86, CRYPTOGAMS by <appro\@openssl.org>");
&asm_finish();
+close STDOUT;
+
# A question was raised about choice of vanilla MMX. Or rather why wasn't
# SSE2 chosen instead? In addition to the fact that MMX runs on legacy
# CPUs such as PIII, "4-bit" MMX version was observed to provide better
# per-invocation lookup table setup. Latter means that table size is
# chosen depending on how much data is to be hashed in every given call,
# more data - larger table. Best reported result for Core2 is ~4 cycles
-# per processed byte out of 64KB block. Recall that this number accounts
-# even for 64KB table setup overhead. As discussed in gcm128.c we choose
-# to be more conservative in respect to lookup table sizes, but how
-# do the results compare? Minimalistic "256B" MMX version delivers ~11
-# cycles on same platform. As also discussed in gcm128.c, next in line
-# "8-bit Shoup's" method should deliver twice the performance of "4-bit"
-# one. It should be also be noted that in SSE2 case improvement can be
-# "super-linear," i.e. more than twice, mostly because >>8 maps to
-# single instruction on SSE2 register. This is unlike "4-bit" case when
-# >>4 maps to same amount of instructions in both MMX and SSE2 cases.
+# per processed byte out of 64KB block. This number accounts even for
+# 64KB table setup overhead. As discussed in gcm128.c we choose to be
+# more conservative in respect to lookup table sizes, but how do the
+# results compare? Minimalistic "256B" MMX version delivers ~11 cycles
+# on same platform. As also discussed in gcm128.c, next in line "8-bit
+# Shoup's" or "4KB" method should deliver twice the performance of
+# "256B" one, in other words not worse than ~6 cycles per byte. It
+# should also be noted that in SSE2 case improvement can be "super-
+# linear," i.e. more than twice, mostly because >>8 maps to single
+# instruction on SSE2 register. This is unlike "4-bit" case when >>4
+# maps to same amount of instructions in both MMX and SSE2 cases.
# Bottom line is that switch to SSE2 is considered to be justifiable
# only in case we choose to implement "8-bit" method...
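+# (Concretely, with Z held in one SSE2 register Z>>=8 is a single
+# psrldq-by-1 instruction, while Z>>=4 still takes a shift/shift/or
+# sequence to propagate the low nibble of the high half into the low
+# one, on MMX and SSE2 alike.)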