There is definitely some data weirdness.
My old LinkedIn password plaintext is
cud5dfyy – yes I know that publishing a plaintext is probably stupid but it’s short enough to bruteforce in Hashcat, has been changed, was only used for LinkedIn and is really really old, so I might as well save you the trouble.
If somehow I get hacked because of this, so be it; at least morally I am OK.
Anyway – following a comment by Chris Goggans on Facebook:
Most of the hashes are valid SHA1 but with the first six positions of the hash apparently overwritten with 000000, 000001, 000002, etc.
The SHA1 for “password” is 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8, but this does not appear in the file.
The string “000001e4c9b93f3f0682250b6cf8331b7ee68fd8” does.
Either 1) LinkedIn was trying to be clever and used a modified SHA1 routine which never actually saved the whole hash and does a crypt&compare of the hash from position 6 onward, or 2) The guy who put out the dump purposely screwed with it before releasing.
Now, there are still 2.9M hashes in the dump without the 00000. I haven’t done anything with those yet to see if they are real.
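The overwrite described above is easy to reproduce by hand. This is a minimal sketch, not anything taken from the dump's author: `sha1_hex` and `mask_hash` are hypothetical helper names, and it assumes GNU coreutils `sha1sum` is on the PATH (on a Mac, `shasum` would be the likely substitute). It masks five hex digits, which is what the "password" example shows.

```shell
#!/bin/sh
# Hypothetical helpers; assumes sha1sum (GNU coreutils) is installed.

# Print the 40-character hex SHA1 of the argument, with no trailing newline hashed.
sha1_hex() {
    printf '%s' "$1" | sha1sum | cut -c1-40
}

# Overwrite the first five hex digits of a hash with zeros, as seen in the dump.
mask_hash() {
    printf '00000%s\n' "$(printf '%s' "$1" | cut -c6-)"
}

sha1_hex password                    # the real SHA1
mask_hash "$(sha1_hex password)"     # the form that appears in the file
```

The second line printed should be the string that does appear in the file, while the first is the one that does not.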
Chris is sound so I decided to run a little test:
+ grep b85868504e74bbcd6e58ab4f86d9d20b83 combo_not.txt
+ echo :::: found cud5dfyy
Yep, if you accept Goggans’ theory, and if losing six (actually five) hex characters of prefix is not enough to induce hash collisions, then my old LinkedIn password is in there twice.
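On the collision worry: five hex digits are 20 bits, so 140 of SHA1's 160 bits survive the masking. A rough birthday-bound check, as a sketch only: `collision_bound` is a hypothetical helper, and the figure of ten million hashes is a deliberately generous assumption, not a count taken from the dump.

```shell
#!/bin/sh
# Sketch only: birthday bound p <= n^2 / 2^(bits+1), reported as log2(p).
# The hash count passed in below is an assumed figure, not a measured one.

collision_bound() {
    awk -v n="$1" -v dropped="$2" 'BEGIN {
        bits = 160 - 4 * dropped                    # bits left after dropping hex digits
        log2p = 2 * log(n) / log(2) - (bits + 1)    # log2 of n^2 / 2^(bits+1)
        printf "remaining bits: %d\n", bits
        printf "log2 of collision bound: %.1f\n", log2p
    }'
}

collision_bound 10000000 5
```

A large negative log2 means an accidental match between two different passwords is not a realistic explanation for a hit.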
Thus lashing-up a mini password cracker:
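$ more test-stupid.sh
# hash each word, drop the first six hex digits, look for the rest in the dump
file=combo_not.txt
while read word
do
  pattern=`printf '%s' "$word" | openssl sha1 | sed -e 's/^......//'`
  grep "$pattern" "$file" && echo :::: found $word
done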
$ ./test-stupid.sh < /usr/share/dict/words
00000bca9701606b01b6245d587d26c31b63a433
:::: found aardvark
000006b960572398e02f82878e2dfeadb4518899
:::: found aardwolf
00000c1e41f74b4e4a5950a0dda602fda275e4a1
:::: found abacate
0000058b1c71d517644ff6a4ed5e5421b83c4fca
:::: found abacinate
00000267f9f1e4469f8eb7bf45704218293412db
:::: found abacus
00000604ba82485d494fbc5fd8365509f36ee259
:::: found abalone
0000059e3099495023c7f4c15223e146e3fb6fdd
:::: found abandon
00000d0fec22d3282d0e70911e563402b8429cfc
:::: found abandoned
00000906d39b74998716738fbb2b6fa3620079f2
:::: found abased
0000059e4c7fcba827f22a25fe506baa6d011737
:::: found abattoir
000006e2be4ada6c7ce5b76554311a3330855949
:::: found abbacy
000001e35b00e6675efeef5d813dbf1ce62300cd
:::: found abbasi
0000021312a4ec34d96bce4eca98a879c684878a
:::: found abbess
[...]
...which lends a little weight to the theory that the file primarily contains hashes which some script kiddie could not crack with basic tools, and hence makes us wonder what he's done with all the ones which he did crack - and how much of the LinkedIn corpus that would represent?
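A footnote on the cracker itself: it rescans the whole dump once per dictionary word, which is fine for a quick test but slow for a big wordlist. One possible alternative, sketched here under assumptions, is to hash the wordlist once and join it against the dump's tails in a single pass. `match_wordlist` is a hypothetical name, and the sketch assumes `sha1sum`, `sort`, `join` and `mktemp` are available and that the dump holds one 40-character hex hash per line.

```shell
#!/bin/sh
# Sketch only; match_wordlist is a hypothetical helper, not the original script.
# Assumes one 40-character hex hash per line in the dump file given as $1.

match_wordlist() {
    dump=$1
    suffixes=$(mktemp)

    # Unique 34-character tails of every hash in the dump, sorted for join.
    cut -c7-40 "$dump" | LC_ALL=C sort -u > "$suffixes"

    # Emit "tail word" for each candidate on stdin, sort, then join on the tail
    # and keep only the word.
    while IFS= read -r word
    do
        rest=$(printf '%s' "$word" | sha1sum | cut -c7-40)
        printf '%s %s\n' "$rest" "$word"
    done | LC_ALL=C sort | LC_ALL=C join "$suffixes" - | cut -d' ' -f2-

    rm -f "$suffixes"
}
```

Usage would be along the lines of `match_wordlist combo_not.txt < /usr/share/dict/words`, printing each matching word (ordered by hash tail rather than alphabetically).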