Last week I wrote about Passwords Evolved: Authentication Guidance for the Modern Era with the aim of helping those building services which require authentication to move into the modern era of how we think about protecting accounts. In that post, I talked about NIST's Digital Identity Guidelines which were recently released. Of particular interest to me was the section advising organisations to block subscribers from using passwords that have previously appeared in a data breach. Here's the full excerpt from the authentication & lifecycle management doc (CSP is "Credential Service Provider"):
NIST isn't mincing words here, in fact they're quite clearly saying that you shouldn't be allowing people to use a password that's been breached before, among other types of passwords they shouldn't be using. The reasons for this should be obvious but just in case you're not fully aware of the risks, have a read of my recent post on password reuse, credential stuffing and another billion records in Have I been pwned (HIBP). As I read NIST's guidance, I realised I was in a unique position to help do something about the problem they're trying to address due to the volume of data I've obtained in running HIBP. Others picked up on this too:
This blog post introduces a new service I call "Pwned Passwords", gives you guidance on how to use it and ultimately, provides you with 306 million passwords you can download for free and use to protect your own systems. If you're impatient you can go and play with it right now, otherwise let me explain what I've created.
Before I go any further, I've always been pretty clear about not redistributing data from breaches and this doesn't change that one little bit. I'll get into the nuances of that shortly but I wanted to make it crystal clear up front: I'm providing this data in a way that will not disadvantage those who used the passwords I'm providing. As such, they're not in clear text and whilst I appreciate that will mean some use cases aren't feasible, protecting the individuals still using these passwords is the first priority.
I've aggregated these passwords from a variety of different sources, starting with the massive combo lists I wrote about in May. These contain all the sorts of terrible passwords you'd expect from real world examples and you can read an analysis in BinaryEdge's post on how users are choosing their passwords on the internet. I began with the Exploit.in list which has 805,499,391 rows of email address and plain text password pairs. That actually "only" had 593,427,119 unique email addresses in it so what we're seeing here is a heap of email accounts with more than one password. This is the reality of these combo lists: they're often providing multiple different alternate passwords which could be used to break into the one account.
I grabbed the passwords from the Exploit.in list which gave me 197,602,390 unique values. Think about this for a moment: 75% of the passwords in that one data set had been used more than once. This is really important as it starts to put shape around the scale of the problem we're facing.
I moved on to the Anti Public list which contained 562,077,488 rows with 457,962,538 unique email addresses. This gave me a further 96,684,629 unique passwords not already in the Exploit.in data. Looking at it the other way, 83% of the passwords in that set had already been seen before. This is entirely expected: as more data is added, a smaller proportion of the passwords are previously unseen.
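As a quick sanity check on those percentages, the arithmetic can be reproduced from the row and unique password counts quoted above. This is only a sketch: the function name is made up for illustration, and it treats "seen before" as every row that didn't contribute a new unique password.

```python
def share_already_seen(total_rows: int, new_unique_passwords: int) -> float:
    """Percentage of rows that did not contribute a new unique password."""
    return 100 * (1 - new_unique_passwords / total_rows)


# Row counts and unique password counts as quoted in the text
exploit_in = share_already_seen(805_499_391, 197_602_390)
anti_public = share_already_seen(562_077_488, 96_684_629)

print(f"Exploit.in rows not adding a new password: {exploit_in:.1f}%")
print(f"Anti Public rows not adding a new password: {anti_public:.1f}%")
```

Rounded to whole numbers, these line up with the 75% and 83% figures.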
From there, I moved through a variety of other data sources adding more and more passwords albeit with a steadily decreasing rate of new ones appearing. I was adding sources with tens of millions of passwords and finding "only" a 6-figure number of new ones. Whilst you could say that the data I'm providing is largely comprised of those two combo lists, you could also say that once you have hundreds of millions of passwords, new data breaches are simply not turning up too much stuff we haven't already seen. (Keep that last point in mind for when I later talk about updates.)
Edit: And then I added another 13,675,934 the following day to bring the total to 319,935,446 (let's just call it 320 million). Whilst this increase is only 4%, it's important because the initial processing I performed caused only one version of multiple passwords with different cases to be loaded. For example, "p@55w0rd" was loaded but not "P@55w0rd" with a capital "P". I'll explain these concepts in full shortly, but the online system is now properly case sensitive and the downloadable passwords have their first incremental update so you'll see both the initial 306 million plus "Update 1".
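To see why case matters once passwords are stored as SHA1 hashes, here's a minimal sketch using Python's standard `hashlib`. The UTF-8 encoding and upper-case hex output are assumptions made for the illustration, not something the service is confirmed to require.

```python
import hashlib

# Two passwords that differ only in the case of the first letter
variants = ["p@55w0rd", "P@55w0rd"]

# Assumption: passwords are UTF-8 encoded before hashing
digests = {
    password: hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    for password in variants
}

for password, digest in digests.items():
    print(password, digest)

# The digests are unrelated, so a case-sensitive list needs an entry for each
print(digests["p@55w0rd"] != digests["P@55w0rd"])
```

Because the two digests share nothing in common, a list that only holds one variant can't match the other.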
For quite some time now, I've had suggestions along the lines of that earlier tweet saying "you should build a service for websites to check passwords against when customers sign up". I want to explain why this is a bad idea, why I've done it anyway and why that's not how you should use the service.
To the first point, there is now a link on the nav of HIBP titled Passwords. On that page, there's a search box where you can enter a password and it will tell you if it exists on the service. For example, if you test the password "p@55w0rd":
It goes without saying (although I say it anyway on that page), but don't enter a password you currently use into any third-party service like this! I don't explicitly log them and I'm a trustworthy guy but yeah, don't. The point of the web-based service is so that people who have been guilty of using sloppy passwords have a means of independent verification that it's not one they should be using any more. Mind you, someone could actually have an exceptionally good password but if the website stored it in plain text then leaked it, that password has still been "burned".
My hope is that an easily accessible online service like this also partially addresses the age-old request I've had to provide email address and password pairs; if the password alone comes back with a hit on this service, that's a very good reason to no longer use it regardless of whose account it originally appeared against.
As well as people checking passwords they themselves may have used, I'm envisaging more tech-savvy people using this service to demonstrate a point to friends, relatives and co-workers: "you see, this password has been breached before, don't use it!" If there's one thing I've learned over the years of running this service, it's that nothing hits home like seeing your own data pwned.
The service auto-detects SHA1 hashes in the web UI so if your actual password was a SHA1 hash, that's not going to work for you. This is where you need the API which, like the existing APIs on the service, is fully documented. Using this you can perform a search as follows:
That will actually return a 404 as nobody used the hash of "p@55w0rd" as their actual password (at least if they did, it hasn't appeared in plain text or was readily crackable). There's no response body when hitting the API, just 404 when the password isn't found and 200 when it is, for example when just searching for "p@55w0rd" via its hash:
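As a rough sketch of a client for the behaviour just described (200 when the password is found, 404 when it isn't, no response body), something like the following could work. Treat it as illustrative only: the `endpoint` argument is a placeholder for whatever URL the API documentation specifies, appending the hash as the final path segment is an assumption, and the "40 hex characters" rule in `looks_like_sha1` is a guess at how SHA1 detection might work rather than the service's actual logic.

```python
import hashlib
import re
import urllib.error
import urllib.request
from typing import Callable

SHA1_PATTERN = re.compile(r"[0-9a-fA-F]{40}")


def looks_like_sha1(value: str) -> bool:
    """True when the value is exactly 40 hex characters (assumed rule)."""
    return SHA1_PATTERN.fullmatch(value) is not None


def http_status(url: str) -> int:
    """Issue a GET and return only the status code."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "pwned-passwords-sketch"}
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx, so a 404 arrives here
        return error.code


def is_pwned(
    password: str,
    endpoint: str,
    get_status: Callable[[str], int] = http_status,
) -> bool:
    """Send the SHA1 of the password to `endpoint` (a placeholder base URL)."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest()
    status = get_status(endpoint.rstrip("/") + "/" + digest)
    if status == 200:
        return True   # the password is in the data set
    if status == 404:
        return False  # no hit, which is not proof it was never exposed
    raise RuntimeError(f"Unexpected status code: {status}")
```

Passing `get_status` in as a parameter keeps the status handling testable without touching the network, and anything other than 200 or 404 is raised rather than silently treated as "not found".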
Just like the other APIs on HIBP, the Pwned Passwords service fully supports CORS so if you really did want to integrate it into a web front end somewhere, you can (I suggest sending only a SHA1 hash if you want to do that, at least it's some additional protection). Also like the other APIs, it's rate limited to one request every 1,500ms per IP address. This is heaps for legitimate web-based use cases.
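If you're scripting against the API, one way to stay inside that limit is to space requests out on the client side. The class below is a minimal sketch of that idea, not anything the service provides: the `Throttle` name and its injectable `clock` and `sleep` parameters are my own, and the 1.5 second default simply mirrors the 1,500ms figure above.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Spaces successive calls at least `interval` seconds apart."""

    def __init__(
        self,
        interval: float = 1.5,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep if needed before the next request; returns seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `wait()` immediately before each request means the first call goes straight through and later ones pause only for whatever remains of the 1.5 seconds.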
One quick caveat on the search feature: absence of evidence is not evidence of absence or in other words, just because a password doesn't return a hit doesn't mean it hasn't been previously exposed. For example, the password I used on Dropbox is out there as a bcrypt hash and given it's a randomly generated string out of 1Password, it's simply not getting cracked. I say this because some people will inevitably say "I was in the XX breach and used YY password but your service doesn't say it was pwned". Now you know why!
The entire collection of 306 million hashed passwords can be directly downloaded from the Pwned Passwords page. It's a single 7-Zip file that's 5.3GB which you can then download and extract into whatever data structure you want to work with (it's 11.9GB once expanded). This allows you to use the passwords in whatever fashion you see fit and I'll give you a few sample scenarios in a moment.
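As one example of "whatever data structure you want", here's a minimal sketch that reads hashes into a Python set and checks a candidate password against it. It assumes the extracted file is plain text with one hex SHA1 hash per line, and the file name in the usage note is hypothetical. It's also only practical for a subset of the data: holding all 306 million hashes in an in-memory set would likely need far more RAM than the 11.9GB file size, so a database or a search over a sorted file is a better fit for the full list.

```python
import hashlib
from pathlib import Path
from typing import Set


def load_hashes(path: Path) -> Set[str]:
    """Read one hex SHA1 per line (assumed layout) into an upper-cased set."""
    with path.open("r", encoding="ascii") as handle:
        return {line.strip().upper() for line in handle if line.strip()}


def is_banned(password: str, hashes: Set[str]) -> bool:
    """Hash the candidate password and test membership in the loaded set."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest in hashes
```

Usage would look something like `is_banned(candidate, load_hashes(Path("pwned-passwords.txt")))` at registration or password change time, rejecting the candidate when it returns `True`.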
Providing data in this fashion wasn't easy, primarily due to the size of the zip file. Actually, let me rephrase that: it wouldn't be easy if I wanted to do it without spending a heap for other people to download the data! I asked for some advice on this whilst preparing the service:
There were lots of well-intentioned suggestions which wouldn't fly. For example, Dropbox and OneDrive aren't intended for sharing files with a large audience and they'll pull your ability to do so if you try (believe me). Hosting models which require me to administer a server are also out as that's a bunch of other responsibility I'm unwilling to take on. Lots of people pointed to file hosting models where the storage was cheap but then the bandwidth stung so those were out too. Backblaze's B2 was the most cost effective but at 2c a GB for downloads, I could easily see myself paying north of a thousand dollars over time. Amazon has got a neat Requester Pays feature but as soon as there's a cost - any cost - there's a barrier to entry. In fact, both this model and torrenting it were out because they make access to data harder; many organisations block torrents (for obvious reasons) and I know, for example, that either of these options would have posed insurmountable hurdles at my previous employment. (Actually, I probably would have ended up just paying for it myself due to the procurement challenges of even a single-digit dollar amount, but let's not get me started on that!)
After that tweet, I got several offers of support which was awesome given it wasn't even clear what I was doing! One of those offers came from Cloudflare who I've written about many times before. I'm a big supporter of what they do for all the sorts of reasons mentioned in those posts, plus their offer of support would mean the data would be aggressively cached in their 115 edge nodes around the world. What this means over and above simple hosting of the files itself is that downloads should be super fast for everyone because it's always being served from somewhere very close to them. The source file actually sits in Azure blob storage but regardless of how many times you guys download it, I'll only see a few requests a month at most. So big thanks to Cloudflare for not just making this possible in the first place, but for making it a better experience for everyone.