Download Goodreads Data

Cori Lenon

Aug 4, 2024, 5:47:26 PM

to serlisuce

My reading certainly slowed down during the summer months, mostly because I was doing other things during a beautiful Wisconsin summer, like playing golf and riding my bike. Between January and May I read an average of 39.6 pages per day; between June and September, about 14.8 pages per day; and from October through the end of the year, 31.3 pages per day.

For most of the year, I had a fairly consistent book-finishing pace. I think a lot of this can be explained by choosing shorter books in 2020: 70% of the books I read this year were fewer than 400 pages long.

There were a few clear outliers with respect to reading pace throughout the year. I read two novels (The Remains of the Day and Never Let Me Go, both by Kazuo Ishiguro) very slowly, taking 43 and 28 days to finish them, respectively. I also read two books very quickly (Red Queen and The Art of Solitude), at 76.6 and 66.7 pages per day, respectively.
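The pace figures above come down to simple arithmetic: pages divided by the number of days between starting and finishing a book. Here is a minimal Python sketch of that calculation. The function name and the sample numbers are made up for illustration and are not taken from the author's data.

```python
from datetime import date


def pages_per_day(pages: int, started: date, finished: date) -> float:
    """Average reading pace, rounded to one decimal place.

    Assumption: a book started and finished on the same day counts as one day.
    """
    days = max((finished - started).days, 1)
    return round(pages / days, 1)


# Hypothetical book: 383 pages, started March 1 and finished March 6 (5 days).
pace = pages_per_day(383, date(2020, 3, 1), date(2020, 3, 6))
print(pace)
```

Whether the first and last day should both be counted is a judgment call. This sketch counts the difference between the two dates.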

This was a fun way to look back on my year in books for 2020. There are a few aspects of this data that I could look into like the distribution of genres, the text summary of the book, and text reviews from other Goodreads users. That analysis will have to wait for another day!

I like the processes of other people: Katy Decorah uses GitHub Actions, Nienke uses her own What.pm, and Jeremy Keith tags notes with ISBN numbers. My process is still different from each of theirs, though, for no particular reason.

Goodreads provides reading data in CSV files, which are reasonably well structured. For the Eleventy site, I needed a folder full of Markdown files, one for each book, each carrying the metadata from the Goodreads export as YAML front matter.

I did this in Node using a slightly modified version of CSV to Markdown. The project is archived and I could not get the Noderize wrapper to work, but the provided script in index.js did the job for me.
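The conversion itself is small. What follows is a rough Python sketch of the idea, not the archived Node script the author used. It turns each CSV row into a Markdown file body with YAML front matter. The column names and the sample row are assumptions for illustration, and every value is written as a quoted string to keep the YAML simple.

```python
import csv
import io
import re


def slugify(text: str) -> str:
    """Lowercase the text and join runs of letters and digits with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def row_to_markdown(row: dict) -> str:
    """Render one CSV row as a Markdown file that contains only front matter."""
    lines = ["---"]
    for key, value in row.items():
        field = slugify(key).replace("-", "_")
        # Escape backslashes and double quotes so the quoted YAML stays valid.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{field}: "{escaped}"')
    lines.append("---")
    return "\n".join(lines) + "\n"


# Made-up sample export; a real file would be opened with open(path, newline="").
sample_csv = (
    "Title,Author,My Rating,Date Read\n"
    '"The Remains of the Day",Kazuo Ishiguro,4,2020/02/14\n'
)

files = {
    slugify(row["Title"]) + ".md": row_to_markdown(row)
    for row in csv.DictReader(io.StringIO(sample_csv))
}
print(files["the-remains-of-the-day.md"])
```

Writing each entry of `files` to disk gives the one-file-per-book folder that Eleventy can pick up as a collection.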

This site has a lot of images, and resizing or optimising would not be my definition of fun. I used the official Image plugin for Eleventy to generate correctly sized images from my source images, and the documented Nunjucks shortcode to output a picture element with webp and jpg versions in various sizes.

There was not enough complexity to warrant any form of CSS processing, so I just created a single CSS file and started adding styles. I used no methodology or framework. I did mostly avoid classes, because in my day to day work I overuse them and it seemed like a fun challenge.

So, my book site is available on books.hiddedevries.nl, and the source is on GitHub. This has been a fun weekend project, and to be honest, I am very much looking forward to expanding the existing data and adding new stuff.

Hidde de Vries (@h...@front-end.social) is a web enthusiast and accessibility specialist from Rotterdam (The Netherlands). He currently works with the NL Design System team and is a participant in the Open UI Community Group. Previously, he worked for W3C (WAI), Mozilla, the Dutch government and others as a freelancer. Hidde spoke at 68 events, most recently in Rotterdam, Netherlands.

I've been a Goodreads user for a few years now, and much like how I use other 'tracking' services, I'm not there for other folks' reading lists or recommendations, but instead, as a way to track what I've read. I especially like looking back over the past year and being reminded of the books I really enjoyed. Recently, others and I were talking on Mastodon about how to work with this kind of data, other services, and so forth. Unfortunately, Goodreads does not have an API (it used to, but shut it down), but they do let you export your data. I decided to take a look at this and see if (and how) it could be used in Eleventy. Here's what I found.

So, according to this web page, you can request a copy of your data at any time. I followed the directions there and was presented with a cheerful warning that it could take up to thirty days for my request to be processed.

Surprisingly, I requested my data on Sunday afternoon and it was ready by Monday. By no means should you assume that's a standard response rate, but it's probably closer to the normal response time than thirty days.

In case you don't feel like counting, that's thirty-nine different files. Honestly, I wasn't sure which file was the one I needed, but I found the relevant information in review.json. I don't typically write reviews for books on Goodreads. I'll do a quick star rating, but as I said, I use Goodreads more as a personal log and assume no one else but me gives a darn about what I've read.

After this is a long (well, for me; as I said, I've been using it for a while) list of books. My particular data set begins with a lot of books that I marked as previously read. I believe I did this when I first started. I didn't try to log every single book I've read, since that would be impossible, but I probably spent a few minutes adding the ones that came to mind. I only point this out because many of these records don't have data about when I read them.

For my first demo, I simply copied review.json to my project root and then added a new file, goodreads.js, to the _data directory. This file reads in the JSON and helps simplify it a bit for Eleventy.

The first thing I do is filter to items that are marked read or currently-reading. I had a few records in my data set for books I wanted to read and this clears that out. It also removes that first 'meta' item from the array.

Next, I rewrite the data to be a bit simpler. The original data uses (not provided) a lot for null values, so you can see where I check for that. I also go ahead and parse the dates. Finally, I rename book to title.
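The two steps just described, filtering and then simplifying, can be sketched as follows. This is a Python illustration of the logic, not the author's goodreads.js. The field names `read_status`, `rating`, and `read_at`, the date format, and the sample records are all assumptions. Only the `book` field, the `(not provided)` placeholder, and the two shelf names come from the text.

```python
import json
from datetime import datetime

MISSING = "(not provided)"  # placeholder the export uses for null values


def clean(value):
    """Map the export's placeholder (and empty strings) to None."""
    return None if value in (MISSING, "", None) else value


def parse_date(value):
    """Parse the leading YYYY-MM-DD of a timestamp (assumed format)."""
    value = clean(value)
    return datetime.strptime(value[:10], "%Y-%m-%d").date() if value else None


def simplify(records):
    """Keep read/currently-reading items and rename `book` to `title`."""
    keep = {"read", "currently-reading"}
    return [
        {
            "title": r["book"],
            "rating": clean(r.get("rating")),
            "status": r["read_status"],
            "read_at": parse_date(r.get("read_at")),
        }
        for r in records
        # The leading 'meta' item has no read status, so it is dropped here too.
        if r.get("read_status") in keep
    ]


# Made-up stand-in for review.json.
raw = json.loads("""[
  {"explanation": "meta item without book data"},
  {"book": "Ruin and Rising (The Shadow and Bone Trilogy, #3)",
   "read_status": "read", "rating": "(not provided)",
   "read_at": "2021-05-02 00:00:00 UTC"},
  {"book": "Some Unread Book", "read_status": "to-read",
   "rating": "(not provided)", "read_at": "(not provided)"}
]""")

books = simplify(raw)
print(books)
```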

As I said, the actual book data is limited to just the title. I thought it would be cool if I could get more information. Shockingly, there doesn't seem to be an Amazon API for this. I did find a "Product Advertising API", but it didn't feel right to me. Shockingly (yes I like using that word), Google actually has an API for this and it's free: Google Books API.

The Google Books API lets you search for books and returns detailed information about them. This includes cover images, and I thought that would be great to add to the display. I created an .eleventy.js file and built a shortcode.

To use the API, I request a "title" match to help ensure the right book comes back. I also set printType to books to exclude magazines and other publications. My code assumes the first result is right (more on that in a second) and returns an image pointing to the cover thumbnail.
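The request described above might look roughly like this. It is a Python sketch rather than the author's Eleventy shortcode. It builds a Google Books volumes search URL with an `intitle:` query and `printType=books`, then pulls the thumbnail from the first result. The helper names, the `maxResults` setting, and the sample response are assumptions, and no network call is made here.

```python
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_search_url(title: str) -> str:
    """Build a title search restricted to books, asking for one result."""
    params = {"q": f'intitle:"{title}"', "printType": "books", "maxResults": 1}
    return f"{API_URL}?{urlencode(params)}"


def first_thumbnail(response: dict):
    """Return the first result's cover thumbnail URL, or None if absent."""
    items = response.get("items") or []
    if not items:
        return None
    return items[0].get("volumeInfo", {}).get("imageLinks", {}).get("thumbnail")


url = build_search_url("Ruin and Rising")

# Made-up response trimmed to the fields this sketch reads.
sample_response = {
    "items": [
        {
            "volumeInfo": {
                "title": "Ruin and Rising",
                "imageLinks": {"thumbnail": "https://example.com/cover.jpg"},
            }
        }
    ]
}
thumbnail = first_thumbnail(sample_response)
print(url)
print(thumbnail)
```

Returning None when there are no items matters here, because a search with no match comes back without an `items` list at all.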

So how well did it work? Pretty badly! From what I can tell, the issue is that many of my books are part of a series. So for example, the book may be called "Ruin and Rising", but Goodreads records it as "Ruin and Rising (The Shadow and Bone Trilogy, #3)". This made most of my searches return nothing.
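One possible workaround, which may not be what the author ended up doing, is to strip the trailing series note before searching. Here is a minimal Python sketch. It assumes the series note is always a final parenthetical containing a `#`, as in the example above.

```python
import re

# Trailing "(Series Name, #3)" style suffix; assumes no nested parentheses.
SERIES_SUFFIX = re.compile(r"\s*\([^()]*#[^()]*\)\s*$")


def strip_series(title: str) -> str:
    """Remove a trailing series parenthetical from a Goodreads title."""
    return SERIES_SUFFIX.sub("", title)


cleaned = strip_series("Ruin and Rising (The Shadow and Bone Trilogy, #3)")
print(cleaned)
```

Titles without a `#` in the final parentheses are left alone, so a subtitle in parentheses is not stripped by mistake.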

I got something working, but honestly, I'm not sure how much I'd trust this in production. My hope is that maybe someone sees this code and does the work to make it a bit more stable. With that in mind, feel free to take the code from here: -demos/tree/master/goodreadstest

This gave me more insight when checking out a book on the platform. It also came in handy for my prediction model as it would guarantee that all books would be classified in at least one genre, but even better, in multiple genres. I therefore attempted to obtain that information for every book by using "web scraping".

So the question now becomes: what does this random number correspond to, and how can we find that number for each book in our library? Going back to the data we originally imported from Goodreads, we find that the first column in our dataset is the book's Goodreads ID. That could be our answer! After trying it out, we confirm that the "random number" is in fact the book's Goodreads ID.
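With that ID in hand, building each book's page address is a one-liner. Here is a small Python sketch that assumes the usual `goodreads.com/book/show/<id>` address pattern and uses a made-up CSV row. The ID is read from the first column by position, as described above.

```python
import csv
import io


def book_url(book_id: str) -> str:
    """Address of a book page, given its Goodreads ID."""
    return f"https://www.goodreads.com/book/show/{book_id}"


# Made-up export with the ID in the first column.
sample_csv = "Book Id,Title\n12345,Example Book\n"

reader = csv.reader(io.StringIO(sample_csv))
next(reader)  # skip the header row
urls = [book_url(row[0]) for row in reader]
print(urls)
```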

That's all the code we need! However, when we look at our results, we see that there are a lot of issues! While we obtained what we wanted for our first results, we seem to only obtain empty values for the later ones.

This seems to be a deliberate defense put in place by Goodreads developers to prevent people from scraping the site too heavily. Thus, while we got almost all of the results we needed, scraping is sadly not the appropriate solution to our issue.

I was aware that Goodreads had developed an API, but its website indicated that, although the API remained active, they had ceased to give out the access tokens required to use it. That is why I originally looked for alternative solutions such as the ones mentioned above. Luckily, I stumbled on an API request made by one of the users on a forum. The request included the access token they had used, so I was able to use that access token and make my own API requests, as seen in the code below!

The only non-binary variable in the model was the Average Rating, whose values range from 1 to 5. The idea is that every 1 point added to the Average Rating adds 0.522... points to the Predicted Rating.

To make it more accurate, I started using other models. One of them was a Logistic Regression. Check out this article if you want to learn more about the difference between Linear Regressions and Logistic Regressions. The basic idea is that while a Linear Regression constructs a linear equation describing the relationship between dependent and independent variables, a Logistic Regression classifies the elements of a set into two (or more) groups by calculating a probability for each element.
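To make the contrast concrete, here is a small Python sketch of the two ideas. The coefficient is the truncated 0.522 quoted earlier. The intercept and the input scores are hypothetical, and the author's real model has more variables than this.

```python
import math

AVG_RATING_COEF = 0.522  # truncated value quoted in the text
INTERCEPT = 1.0          # hypothetical, for illustration only


def linear_prediction(avg_rating: float) -> float:
    """Linear model: the output moves by the coefficient per rating point."""
    return INTERCEPT + AVG_RATING_COEF * avg_rating


def logistic_probability(score: float) -> float:
    """Logistic model: squash a linear score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# One extra point of Average Rating shifts the linear prediction by the coefficient.
step = linear_prediction(4.0) - linear_prediction(3.0)
print(step)

# A logistic model turns a score into a class probability instead.
print(logistic_probability(0.0), logistic_probability(2.0))
```

A classifier then picks a group by thresholding that probability, commonly at 0.5.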

Ultimately, the goal would be to create a mini-website where people can upload their Goodreads data (or rate a certain number of books) and be given personalized book recommendations. Another possibility could be creating a Chrome extension to make the user experience even better and simpler.

I recently realized that I didn't spend enough time reading. So last year I started challenging myself to read more. It's around that time that I came across Goodreads, a social network for books. Thanks to this platform, you can see what your friends are reading, have read, how they graded various books etc. You can also record all books you've read, save books you want to read and so much more.