Jacques Mattheij

Technology, Coding and Business

Free, Public Data Sets

I’m a data junkie, I have to confess to that.

Hi, my name is Jacques and I have a problem. Whenever I see a large chunk of structured (or even unstructured!) data pass by I just have to have a copy. It’s not that I’m a packrat, it’s just that large gobs of data are always inspirational in some way or other.

What could you do with that data? In what new ways could you slice and dice it to get new insights? How can you combine it with data that you already have to do new things? This is where, in my opinion, the real value of these datasets lies: the whole is much bigger than the sum of its parts.

Because of that I keep a list of public data sets handy, and a bunch of hard drives full of interesting stuff that I’ve collected over the years.

Some of it is boring at first glance and fascinating on second thought; other times it’s the other way around. Whatever comes out of it, you’ll always learn something and there are always little surprises.

Here are some pointers if you want to start your own collection. It’s nothing but a starting point, but be warned: before you know it you’ll be drowning in data and in ideas on what you could do with it ;)

Google has a number of public data sets. These are fairly US-centric but contain lots of interesting information:

http://www.google.com/publicdata/directory

Amazon has a few datasets: the annotated human genome and other bioinformatics data, some US census databases and a dump of Freebase. If you’re already on EC2 you should have easy access to this data.

http://aws.amazon.com/publicdatasets/

The Project Gutenberg DVD ISO file:

http://www.gutenberg.org/cdproject/pgdvd042010.iso.torrent

This contains all the books in the collection. They’re older books (out of copyright), so you won’t be training your spam filter on them, but it is a vast corpus of written text.

All of Wikipedia can be downloaded in bulk:

http://dumps.wikimedia.org/

This contains articles, metadata and so on.
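To give an idea of how you might get started with one of those dumps, here is a minimal sketch that streams page titles out of a dump file without loading the whole thing into memory. It assumes the dump is the usual bz2-compressed XML with `page` elements that each carry a `title` child; the function names and the file name in the usage comment are made up for illustration.

```python
import bz2
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(stream):
    """Yield the title of every <page> element found in an XML stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) == "page":
            for child in elem:
                if local(child.tag) == "title":
                    yield child.text
                    break
            # Drop the page's children so memory use stays modest.
            elem.clear()


# Usage (the file name is a placeholder, not a real dump name):
# with bz2.open("some-wiki-dump.xml.bz2", "rb") as f:
#     for title in iter_titles(f):
#         print(title)
```

The same loop works for pulling out article text or metadata; you would just look at different child elements of each page.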

MaxMind has a pretty useful GeoIP database available for download:

http://geolite.maxmind.com/download/geoip/database/GeoIPCountryCSV.zip
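As an example of what you can do with a file like this, here is a rough sketch of an IP-to-country lookup. It assumes each CSV row looks like start IP, end IP, start as integer, end as integer, country code, country name; check the file you actually downloaded, because that layout is my assumption and not something guaranteed here. The function names and the file name in the usage comment are hypothetical.

```python
import bisect
import csv
import ipaddress


def load_ranges(rows):
    """Build a sorted list of (start, end, country_code) from CSV rows.

    Assumed column order: start_ip, end_ip, start_num, end_num, code, name.
    """
    ranges = sorted((int(r[2]), int(r[3]), r[4]) for r in rows)
    starts = [r[0] for r in ranges]
    return starts, ranges


def lookup(ip, starts, ranges):
    """Return the country code for an IPv4 address, or None if unknown."""
    n = int(ipaddress.IPv4Address(ip))
    # Find the last range whose start is <= n, then check its end.
    i = bisect.bisect_right(starts, n) - 1
    if i >= 0 and ranges[i][0] <= n <= ranges[i][1]:
        return ranges[i][2]
    return None


# Usage (the path is a placeholder for wherever you unzipped the CSV):
# with open("geoip-country.csv", newline="", encoding="latin-1") as f:
#     starts, ranges = load_ranges(csv.reader(f))
# print(lookup("8.8.8.8", starts, ranges))
```

Sorting once and using a binary search keeps each lookup fast even with a large number of ranges.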

MusicBrainz has a huge database of music-related data:

http://musicbrainz.org/doc/Database_Download

It’s in the PostgreSQL ‘COPY TO’ format, so you’ll need PostgreSQL to use it. Of course, once you have the data imported you can convert it to any format that you want.
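If you just want a quick peek at a single table before setting up a database, the conversion can be sketched by hand. This is a minimal sketch, assuming the files use PostgreSQL's default COPY text format (tab-delimited fields, `\N` for NULL, backslash escapes); it skips the octal and hex escapes, and the function names are my own. Importing into PostgreSQL remains the safer route.

```python
import csv

# Single-character backslash escapes in the default COPY text format.
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
            "t": "\t", "v": "\v", "\\": "\\"}


def unescape(field):
    """Decode one COPY text field; the bare marker \\N means NULL."""
    if field == r"\N":
        return None
    out = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            nxt = field[i + 1]
            # Unknown escapes fall back to the escaped character itself.
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def copy_to_rows(lines):
    """Yield each line of COPY text output as a list of field values."""
    for line in lines:
        line = line.rstrip("\n")
        if line == r"\.":  # end-of-data marker, if present
            break
        yield [unescape(f) for f in line.split("\t")]


def copy_to_csv(lines, out):
    """Write COPY text lines to a CSV file object; NULL becomes empty."""
    writer = csv.writer(out)
    for row in copy_to_rows(lines):
        writer.writerow(["" if v is None else v for v in row])
```

Splitting on literal tabs and newlines is safe here because tabs and newlines inside a value are stored as escapes in this format.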

America Online released a set of files containing millions of search queries in 2006. It’s hard to find copies of this, but even though it’s dated it is still quite a goldmine if you’re trying to gain insight into how people search and what they search for. Technically speaking that’s not a ‘free, public dataset’, because AOL has done a lot of work to try to put the genie back into the bottle, but 5 minutes of googling will turn up a few copies.

There is a huge collection of links to datasets here:

http://www.datawrangling.com/some-datasets-available-on-the-web

Not all of them work but most do.

Some more pages with links to datasets:

http://www.kdnuggets.com/datasets/

http://www.guardian.co.uk/uk/datablog/2011/feb/02/ukcrime-data-store

http://www.guardian.co.uk/data

An interesting side effect of writing this is that there is now a much better place to look for even more data: http://news.ycombinator.com/item?id=2165497

And another one here: http://news.ycombinator.com/item?id=2414614 (tx jcr!)