A short while ago I was playing with the data from "1,000 songs to hear before you die" from the Guardian's Datablog. The idea was to find the artists they'd missed by crunching all the bands/artists through last.fm, looking for the similar artists that appeared most often but weren't already on the list.

I got as far as cleaning up the data (more on that later), adding the MusicBrainz IDs (mbid) for each artist and converting it to JSON, a format I tend to work with a lot. But it turns out that moving country actually takes up a surprising amount of time, so I put it to one side for later.

Guardian 1000 songs json objects

Now I see that they're running a competition to visualize some of their data. On the grounds that someone may find this useful, and because I wanted to give GitHub a try, I'm throwing out the data I cleaned up. Hopefully, if I followed the instructions correctly, it's over here ...

Guardians-1-000-songs-to-hear-before-you-die

The file you're specifically after is js/guardian_1000_songs.js which will give you these handy js object thingys;

guardian_1000_songs.data
The original data, cleaned up from the Guardian's spreadsheet
guardian_1000_songs.artists
All the artists, along with the number of times they're referenced, the MusicBrainz ID (mbid), whether last.fm knows the mbid, and the tracks
guardian_1000_songs.artists_a
An array to hold the key values for the artists object

Obviously this is geared towards JavaScript and AJAX API type calls, but it shouldn't take much effort to convert it back into spreadsheety formats.
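As a rough sketch of what that conversion might look like, here is one way to flatten an array of row objects into CSV text. The helper names (csvCell, toCsv) and the field names in the sample rows (artist, title) are assumptions for illustration; check the actual shape of guardian_1000_songs.data before relying on them.

```javascript
// Quote a value for CSV: wrap it in double quotes and double any embedded quotes.
function csvCell(value) {
  const text = value === undefined || value === null ? "" : String(value);
  return '"' + text.replace(/"/g, '""') + '"';
}

// Hypothetical helper: turn an array of flat objects into CSV text,
// using the given column names for both the header and the lookups.
function toCsv(rows, columns) {
  const header = columns.map(csvCell).join(",");
  const lines = rows.map((row) =>
    columns.map((col) => csvCell(row[col])).join(",")
  );
  return [header].concat(lines).join("\n");
}

// Made-up rows; the field names are assumptions, not the real data layout.
const sampleRows = [
  { artist: "Example Artist", title: "Example Track" },
  { artist: 'An "Example" Artist', title: "Another, Track" },
];

const csv = toCsv(sampleRows, ["artist", "title"]);
console.log(csv);
```

Quoting every cell is a little heavy-handed, but it means commas and quotes in artist or track names can't break the columns.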

I've also included an example file, datagrab.html, that should work in Firefox; it's untested on anything else.

A note on the data and what to do with it

Each artist has a MusicBrainz id. You can call the last.fm API, passing over the mbid, to get more information about that artist, including similar artists, like this ...

http://ws.audioscrobbler.com/2.0/?method=artist.getinfo
&mbid=66c662b6-6e2f-4930-8610-912e24c63ed1
&api_key=[api_key]
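If you'd rather build that URL in code, a minimal sketch might look like the following. The function name buildArtistInfoUrl and the "YOUR_API_KEY" placeholder are assumptions for illustration; the endpoint and parameter names are the ones from the URL above.

```javascript
// Base endpoint taken from the example URL above.
const LASTFM_ENDPOINT = "http://ws.audioscrobbler.com/2.0/";

// Hypothetical helper: build an artist.getinfo URL for a given mbid.
function buildArtistInfoUrl(mbid, apiKey) {
  const params = new URLSearchParams({
    method: "artist.getinfo",
    mbid: mbid,
    api_key: apiKey,
  });
  return LASTFM_ENDPOINT + "?" + params.toString();
}

// "YOUR_API_KEY" is a placeholder; substitute your own last.fm key.
const url = buildArtistInfoUrl(
  "66c662b6-6e2f-4930-8610-912e24c63ed1",
  "YOUR_API_KEY"
);
console.log(url);
```

Using URLSearchParams means any odd characters in the values get escaped for you.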

However, there are a few records that don't have mbids. These three don't have records at MusicBrainz:

  1. Crystal Mansion
  2. Grange Hill Cast
  3. Sheffield Socialist Choir

You can still find them via a normal artist search at last.fm. In the data they have an mbid of "0".

These six have ids at MusicBrainz, but last.fm doesn't know what they are. I've flagged this in the data as "mbid_known_by_lastfm":

  1. Bon Iver
  2. Glasvegas
  3. Lou Reed & John Cale
  4. Sam Mayo
  5. The Ting Tings
  6. William Blake

For the life of me I can't work out why last.fm doesn't have an ID for The Ting Tings but there you go.

So out of 994 music tracks, you now have 985 for which you have an artist's mbid that last.fm can look up.
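To make use of those two special cases, something like the sketch below could sort artists into "look up by mbid" and "fall back to a name search". It assumes the artists object is keyed by artist name, that each entry carries an mbid string ("0" when MusicBrainz has no record) and that mbid_known_by_lastfm is a boolean. The sample entries and their placeholder mbids are made up; check the real file for the exact layout.

```javascript
// Illustrative sample only: the real object lives in js/guardian_1000_songs.js.
// The field layout and the placeholder mbid values here are assumptions.
const sampleArtists = {
  "Crystal Mansion": { mbid: "0", mbid_known_by_lastfm: false },
  "The Ting Tings": { mbid: "placeholder-mbid-1", mbid_known_by_lastfm: false },
  "Kid Creole & The Coconuts": { mbid: "placeholder-mbid-2", mbid_known_by_lastfm: true },
};

// Hypothetical helper: sort artist names into three buckets.
function groupByLookup(artists) {
  const groups = { byMbid: [], byNameNoMbid: [], byNameUnknownToLastfm: [] };
  for (const name of Object.keys(artists)) {
    const artist = artists[name];
    if (artist.mbid === "0") {
      // No MusicBrainz record: only a name search will work.
      groups.byNameNoMbid.push(name);
    } else if (!artist.mbid_known_by_lastfm) {
      // MusicBrainz has an id, but last.fm doesn't recognise it.
      groups.byNameUnknownToLastfm.push(name);
    } else {
      // Safe to pass the mbid straight to last.fm.
      groups.byMbid.push(name);
    }
  }
  return groups;
}

const groups = groupByLookup(sampleArtists);
console.log(groups);
```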

Cleaning the data

A small aside about getting to this point.

The original Guardian data is over here and, frankly, a bit funky. I grabbed it as a CSV and did a rough conversion to JSON. I then ran through each row in turn, searching for the artist/band via the last.fm API.

This pulled up a number of records where there was either more than one match (where the first match was normally the correct one) or no matches at all. A closer look at the no-matches showed a number of places where the artist and track name were swapped over, the odd typo and so on.

Often it turned out that while The Guardian had "Kid Creole and the Coconuts", last.fm likes "Kid Creole & The Coconuts", so I fixed all those up and went over it again.
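For that particular "and the" versus "& The" pattern, a small helper along these lines could suggest the last.fm-style spelling to retry a search with. This is only a sketch of one possible approach, not necessarily how the fixes above were made, and the function name lastfmStyleName is made up.

```javascript
// Hypothetical helper: suggest a last.fm-style spelling for a Guardian artist name.
function lastfmStyleName(name) {
  return name
    .replace(/\s+and\s+/gi, " & ")      // standalone "and" becomes "&"
    .replace(/(&\s+)the\b/g, "$1The");  // capitalise "the" directly after "&"
}

console.log(lastfmStyleName("Kid Creole and the Coconuts"));
// → "Kid Creole & The Coconuts"
```

It only touches a standalone "and", so names that merely contain those letters are left alone. Anything it rewrites is still worth checking by eye.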

That left me with about 60 or so to go through by hand, which took a couple of evenings. I would convert it back to a Google Docs spreadsheet but don't actually have the time atm.

But if you find any mistakes or omissions, or use the data in any interesting ways, let me know in the comments or fix it up on GitHub :)