+
+# A question was raised about the choice of vanilla MMX. Or rather,
+# why wasn't SSE2 chosen instead? In addition to the fact that MMX
+# runs on legacy CPUs such as the PIII, the "4-bit" MMX version was
+# observed to provide better performance than the *corresponding* SSE2
+# one even on contemporary CPUs. SSE2 results were provided by
+# Peter-Michael Hager. He maintains an SSE2 implementation featuring a
+# full range of lookup-table sizes, but with per-invocation lookup
+# table setup. The latter means that the table size is chosen
+# depending on how much data is to be hashed in each given call: more
+# data, larger table. The best reported result for Core2 is ~4 cycles
+# per processed byte out of a 64KB block. Recall that this number
+# accounts even for the 64KB table setup overhead. As discussed in
+# gcm128.c, we choose to be more conservative with respect to lookup
+# table sizes, but how do the results compare? As per the table at the
+# beginning, the minimalistic MMX version delivers ~11 cycles on the
+# same platform. As also discussed in gcm128.c, the next-in-line
+# "8-bit Shoup's" method should deliver twice the performance of the
+# "4-bit" one. It should also be noted that in the SSE2 case the
+# improvement can be "super-linear," i.e. more than twofold, mostly
+# because >>8 maps to a single instruction on an SSE2 register. This
+# is unlike the "4-bit" case, where >>4 maps to the same number of
+# instructions in both the MMX and SSE2 cases. The bottom line is that
+# a switch to SSE2 is considered justifiable only in case we choose to
+# implement the "8-bit" method...