.explicit
.text
.ident	"ia64.S, Version 2.0"
.ident	"IA-64 ISA artwork by Andy Polyakov"
//
// ====================================================================
// Written by Andy Polyakov for the OpenSSL project.
//
// Rights for redistribution and usage in source and binary forms are
// granted according to the OpenSSL license. Warranty of any kind is
// disclaimed.
// ====================================================================
//
// Version 2.x is the Itanium2 re-tune. A few words about how Itanium2
// is different from Itanium from this module's viewpoint. Most notably,
// is it "wider" than Itanium? Can you experience the loop scalability
// discussed in the commentary sections? Not really:-( Itanium2 has 6
// integer ALU ports, i.e. it's 2 ports wider, but that's not enough to
// spin twice as fast, as I need 8 IALU ports. The number of floating
// point ports is the same, i.e. 2, while I need 4. In other words, to
// this module Itanium2 remains effectively as "wide" as Itanium. Yet it
// is essentially different with respect to this module, and a re-tune
// was required, because some instruction latencies have changed. Most
// noticeably those used intensively:
//
//			Itanium		Itanium2
//	ldf8		9		6		L2 hit
//	ld8		2		1		L1 hit
//	getf		2		5
//	xma[->getf]	7[+1]		4[+0]
//	add[->st8]	1[+1]		1[+0]
//
// What does it mean? You might ratiocinate that the original code
// should run just faster, because the sum of latencies is smaller...
// Wrong! Note that the getf latency increased. This means that if a
// loop is scheduled for lower latency (and they are), then it will
// suffer from a stall condition and the code will therefore turn
// anti-scalable; e.g. the original bn_mul_words spun at 5*n on
// Itanium2, i.e. 2.5 times slower than the expected 2*n! What to do?
// Reschedule the loops for Itanium2? But then Itanium would exhibit
// anti-scalability. So I've chosen to schedule for the worst latency
// of every instruction, aiming for the best *all-round* performance.
//
// Q.	How much faster does it get?
// A.	Here is the output from 'openssl speed rsa dsa' for vanilla
//	0.9.6a compiled with gcc version 2.96 20000731 (Red Hat
//	Linux 7.1 2.96-81):
//
//	                 sign    verify    sign/s verify/s
//	rsa  512 bits   0.0036s   0.0003s    275.3   2999.2
//	rsa 1024 bits   0.0203s   0.0011s     49.3    894.1
//	rsa 2048 bits   0.1331s   0.0040s      7.5    250.9
//	rsa 4096 bits   0.9270s   0.0147s      1.1     68.1
//	                 sign    verify    sign/s verify/s
//	dsa  512 bits   0.0035s   0.0043s    288.3    234.8
//	dsa 1024 bits   0.0111s   0.0135s     90.0     74.2
//
//	And here is similar output, but for this assembler
//	implementation:-)
//
//	                 sign    verify    sign/s verify/s
//	rsa  512 bits   0.0021s   0.0001s    549.4   9638.5
//	rsa 1024 bits   0.0055s   0.0002s    183.8   4481.1
//	rsa 2048 bits   0.0244s   0.0006s     41.4   1726.3
//	rsa 4096 bits   0.1295s   0.0018s      7.7    561.5
//	                 sign    verify    sign/s verify/s
//	dsa  512 bits   0.0012s   0.0013s    891.9    756.6
//	dsa 1024 bits   0.0023s   0.0028s    440.4    376.2
//
//	Yes, you may argue that it's not a fair comparison, as it's
//	possible to craft the C implementation with the BN_UMULT_HIGH
//	inline assembler macro. But of course! Here is the output
//	with the macro:
//
//	                 sign    verify    sign/s verify/s
//	rsa  512 bits   0.0020s   0.0002s    495.0   6561.0
//	rsa 1024 bits   0.0086s   0.0004s    116.2   2235.7
//	rsa 2048 bits   0.0519s   0.0015s     19.3    667.3
//	rsa 4096 bits   0.3464s   0.0053s      2.9    187.7
//	                 sign    verify    sign/s verify/s
//	dsa  512 bits   0.0016s   0.0020s    613.1    510.5
//	dsa 1024 bits   0.0045s   0.0054s    221.0    183.9
//
//	My code is still way faster, huh:-) And I believe that even
//	higher performance can be achieved.
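//
// (As an aside, such a BN_UMULT_HIGH macro for IA-64/gcc could look
// along the following lines. This is only a sketch, assuming it is
// built on xmpy.hu, which multiplies two 64-bit values held in FP
// registers and returns the upper 64 bits of the 128-bit product;
// consult the actual OpenSSL headers for the definitive definition.)
//
// #define BN_UMULT_HIGH(a,b)	({		\
//	register BN_ULONG ret;			\
//	asm ("xmpy.hu %0 = %1, %2"		\
//	     : "=f"(ret)			\
//	     : "f"(a), "f"(b));			\
//	ret;	})
//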
//	Note that as keys get longer, the performance gain is larger.
//	Why? According to the profiler there is another player in the
//	field, namely BN_from_montgomery, consuming a larger and larger
//	portion of CPU time as the keysize decreases. I therefore
//	consider putting effort into an assembler implementation of the
//	following routine:
//
// void bn_mul_add_mont (BN_ULONG *rp,BN_ULONG *np,int nl,BN_ULONG n0)
// {
// int	i,j;
// BN_ULONG v;
// BN_ULONG *nrp=&rp[nl];	// carries accumulate past rp[nl-1]
//
// for (i=0; i<nl; i++)
//	{
//	v=bn_mul_add_words(rp,np,nl,(rp[0]*n0)&BN_MASK2);
//	nrp++;
//	rp++;
//	if (((nrp[-1]+=v)&BN_MASK2) < v)
//		for (j=0; ((++nrp[j])&BN_MASK2) == 0; j++) ;
//	}
// }
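//
// (Why the multiplier works, assuming n0 is the usual Montgomery
// constant, i.e. n0 == -np[0]^-1 mod 2^BN_BITS2: with
// m = (rp[0]*n0)&BN_MASK2, the low word update amounts to
// rp[0] + m*np[0] == rp[0]*(1 + n0*np[0]) == 0 mod 2^BN_BITS2, so
// every pass clears the lowest word and the intermediate result can
// simply be advanced by one word, which is what the rp++ above does.)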