bfsha1 0.1.2a

This is currently the fastest single-hash SHA1 brute forcer. On a GTS 450 the next fastest is Hashcat-lite v0.10, and mine is about 15.6% faster. I learned something from Atom back on September 28-29, 2012: using one constant is better than using multiple constants. I didn't do a release then because I was too lazy. Anyway, here's the faster version with cleaner code, and oh, did I mention source code?

bfsha1 {--benchmark|hash} [gpu-device-num]

So the four ways to use this are:
  • bfsha1 --benchmark
  • bfsha1 --benchmark 0
  • bfsha1 ffffffffffffffffffffffffffffffffffffffff
  • bfsha1 ffffffffffffffffffffffffffffffffffffffff 0
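
The usage line above can be checked with a small argument parser. This is only a sketch of how such parsing might look, not the code bfsha1 actually uses: the struct bfsha1_opts, the names is_sha1_hex and parse_args, and the default device of 0 are all assumptions made for illustration.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical option struct; the real program may store these differently. */
struct bfsha1_opts {
    int benchmark;      /* 1 if --benchmark was given */
    const char *hash;   /* 40-char hex digest, or NULL in benchmark mode */
    int device;         /* GPU device number; a default of 0 is an assumption */
};

/* Returns 1 if s is exactly 40 hexadecimal characters (a SHA1 digest). */
static int is_sha1_hex(const char *s)
{
    size_t i;
    if (strlen(s) != 40) return 0;
    for (i = 0; i < 40; i++)
        if (!isxdigit((unsigned char)s[i])) return 0;
    return 1;
}

/* Returns 0 on success, -1 if the arguments do not match the usage line. */
static int parse_args(int argc, char **argv, struct bfsha1_opts *o)
{
    if (argc < 2 || argc > 3) return -1;
    o->benchmark = 0;
    o->hash = NULL;
    o->device = 0;
    if (strcmp(argv[1], "--benchmark") == 0)
        o->benchmark = 1;
    else if (is_sha1_hex(argv[1]))
        o->hash = argv[1];
    else
        return -1;
    if (argc == 3) {
        /* Optional second argument: a non-negative decimal device number. */
        char *end;
        long d = strtol(argv[2], &end, 10);
        if (*argv[2] == '\0' || *end != '\0' || d < 0) return -1;
        o->device = (int)d;
    }
    return 0;
}
```

Each of the four invocations listed above would be accepted by parse_args, while anything that is neither --benchmark nor a 40-character hex string would be rejected.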

You only get stats when it finishes, which happens when it:
  • Cracks the password
  • Finishes brute forcing "[ -~]{8}" (95^8 = 6,634,204,312,890,625 ~ 2^52.56)
  • Completes a --benchmark run (60 * 95 * 95 * blocks * threadsPerBlock passwords)
**** THIS ONLY WORKS WITH NEWER CARDS (Compute Capability 2.x) ****
The block count or threads per block may be set too high for some cards to run. All I know is that it works on a GTS 450 and runs at about 244 M/s (after a fresh restart) on 64-bit Windows 7 with driver version 306.23. It was compiled with CUDA 4.1.
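
To put the numbers above together: at about 244 M/s, exhausting the full 95^8 keyspace works out to a bit under 315 days in the worst case. The sketch below does that arithmetic along with the --benchmark count. The function names are made up for illustration, and blocks and threads_per_block are left as parameters because their actual values are not stated here.

```c
#include <stdint.h>

/* Number of passwords of exactly `len` characters over a `charset`-symbol alphabet. */
static uint64_t keyspace(uint64_t charset, unsigned len)
{
    uint64_t n = 1;
    while (len--)
        n *= charset;   /* 95^8 fits comfortably in 64 bits */
    return n;
}

/* Days needed to try `count` passwords at `rate` passwords per second. */
static double days_to_exhaust(uint64_t count, double rate)
{
    return (double)count / rate / 86400.0;
}

/* Passwords tested by --benchmark: 60 * 95 * 95 * blocks * threadsPerBlock. */
static uint64_t benchmark_count(uint64_t blocks, uint64_t threads_per_block)
{
    return 60ULL * 95 * 95 * blocks * threads_per_block;
}
```

For example, days_to_exhaust(keyspace(95, 8), 244e6) gives the worst-case run time on the GTS 450 figure quoted above; on average a password would be found in about half that time.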
I did a few things that made it faster but made no sense, such as:
d_foundPw[0] = pw0;
d_foundPw[1] = pw1;
d_foundPw[2] = pw2;
vs
d_foundPw[0] = 1;
Apparently writing 12 bytes is faster than writing 4 bytes to global memory.
#define ROL(x,s) (((x) << (s)) + ((x) >> (32 - (s))))
vs
#define ROL(x,s)  ((x) << (s)) + ((x) >> (32 - (s)))
I checked to see if I needed the outer parentheses around this and took them out because I thought it might be faster, but it's slower. I don't know whether things like these are the same across all cards or are specific to the GTS 450, so this may very well be slower than Hashcat-lite on other cards.
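
On the correctness side of the parentheses question, here is a small sketch in plain C (not taken from the bfsha1 source) of when the two ROL variants agree. The assumption is that ROL is only ever used as one term of a sum, or next to ^, & and |, which all bind more loosely than +. In those contexts both macros expand to the same value; with a tighter-binding operator such as * they do not. The names ROL_PAREN, ROL_BARE and the helper functions are made up for this example, and it says nothing about speed.

```c
#include <stdint.h>

/* Fully parenthesized rotate-left, as in the first variant above. */
#define ROL_PAREN(x,s) (((x) << (s)) + ((x) >> (32 - (s))))
/* Variant without the outer parentheses, as in the second version. */
#define ROL_BARE(x,s)   ((x) << (s)) + ((x) >> (32 - (s)))

/* SHA1-style use: the rotate is one term in a sum, so both give the same value. */
static uint32_t sum_paren(uint32_t a, uint32_t f, uint32_t k)
{
    return ROL_PAREN(a, 5) + f + k;
}

static uint32_t sum_bare(uint32_t a, uint32_t f, uint32_t k)
{
    return ROL_BARE(a, 5) + f + k;
}

/* Multiplication binds tighter than +, so the bare macro only scales its
   first half: this expands to 3u * (a << 5) + (a >> 27). */
static uint32_t mul_paren(uint32_t a)
{
    return 3u * ROL_PAREN(a, 5);
}

static uint32_t mul_bare(uint32_t a)
{
    return 3u * ROL_BARE(a, 5);
}
```

With a = 0x80000001, sum_paren and sum_bare return the same value, while mul_paren and mul_bare differ. So dropping the outer parentheses is safe only as long as every use stays in the first kind of context.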