Week 11 was quite busy. I didn't get a lot done with respect to the GSoC project, but it was still a very exciting week for MetaCPAN. It all started with implementing a diff endpoint for the API server. The missing diff feature came up several times, and some people said it was the only thing keeping them from using MetaCPAN. The implementation was quite straightforward: once I figured out a good way to invoke git diff, I wrote a simple parser that extracts information like the number of deletions and insertions. The parser also splits the diff of two tarballs into per-file segments. As an example, you can diff the latest release with its predecessor; you can also diff two files, or two releases by their full names. A raw diff can be requested by setting the appropriate content type.
ElasticSearch v0.17.2 finally introduced nested documents. Before that, nested structures in a document couldn't be queried as well as one would expect. For example, each author has a profile property, which consists of a list of (id, name) pairs. One would expect to be able to query for an author whose profile.id is "stackoverflow" and whose profile.name is "perler". But since ElasticSearch flattens the data structure, you would get authors where any of the profile.ids is "stackoverflow" and any of the profile.names is "perler". With nested documents, that flaw has been fixed and you can actually query for that user. However, this change requires reindexing all the data of the affected types (in this case "author"). With ElasticSearch, you first PUT the new mapping that includes the schema. This usually upgrades the schema if the two schemas are compatible; if not, it throws an error and you have to delete the old mapping (which also deletes the data) and recreate it.
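To make this concrete, here is a rough sketch of what the nested mapping and the matching query could look like. Field names follow the example above; the exact syntax is from memory and may differ slightly between ElasticSearch versions:

```json
{
  "author": {
    "properties": {
      "profile": {
        "type": "nested",
        "properties": {
          "id":   { "type": "string" },
          "name": { "type": "string" }
        }
      }
    }
  }
}
```

With that mapping in place, a nested query matches both conditions against the *same* profile entry instead of any combination of flattened values:

```json
{
  "nested": {
    "path": "profile",
    "query": {
      "bool": {
        "must": [
          { "term": { "profile.id": "stackoverflow" } },
          { "term": { "profile.name": "perler" } }
        ]
      }
    }
  }
}
```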
And this is where things went bad. I accidentally PUT the mapping for all types (i.e. file, release, etc.), which caused all data to be lost. This was on Wednesday, 3rd August. I immediately spoke to Clinton and Olaf about my mistake, but there was no way to recover the data. Since we are low on disk space, we didn't have any backups. The only option left was to reindex the CPAN. To speed things up, I first reindexed minicpan, which was done in less than two hours. Then I started to index the rest of CPAN and BackPAN. At some point the box ran out of memory and we had to contact our provider speedchilli.com to reset it. They reacted very fast and the box was back online within minutes. Now it was on me to investigate what had happened. Apparently the indexer had a memory leak. I restarted the indexer to see which release caused the leak, and I was quite surprised when I found it: the release had a circular directory structure (i.e. symlinks pointing to a parent directory), which caused Path::Class::Dir->recurse to loop indefinitely, filling up the RAM. A quick fix was to replace ->recurse with File::Find. I filed a ticket against Path::Class, but using File::Find seems more robust anyway.
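The guard against such symlink cycles boils down to tracking which directories have already been visited by their resolved (real) path. Here is a small Python sketch of the idea; the actual fix uses Perl's File::Find, which simply does not follow symlinks by default:

```python
import os

def safe_walk(root):
    """Recursively list files under root, skipping any directory whose
    resolved path was already visited -- this is what prevents a
    symlink pointing back at a parent from causing infinite recursion."""
    seen = set()
    files = []

    def recurse(directory):
        real = os.path.realpath(directory)
        if real in seen:
            # already visited: we hit a symlink cycle, stop here
            return
        seen.add(real)
        for entry in sorted(os.listdir(directory)):
            path = os.path.join(directory, entry)
            if os.path.isdir(path):
                recurse(path)
            else:
                files.append(path)

    recurse(root)
    return files
```

A naive recursion without the `seen` set would descend into the symlinked parent forever, exactly the behavior that exhausted the box's memory.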
While I had indexed minicpan at full speed (i.e. 5 child processes), I decided to use only 2 for indexing the rest of CPAN, so that page load times wouldn't be impacted. After a day, the data was back. However, we lost all user data, all +1s and the session data. I'm still very sorry about that loss. Naturally, I started writing a backup script, which now runs once an hour and backs up the user data, i.e. the data that we cannot restore from the CPAN. Furthermore, Clinton contacted speedchilli.com, who are now backing up our box regularly. A second pair of disks will be added sometime next week, so we will be able to do local backups before I roll out updates.
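The hourly schedule itself is just a cron entry; a hypothetical version (the script name and path here are made up for illustration, not the real setup) would look like:

```
# run the user-data backup at the top of every hour
0 * * * * /home/metacpan/bin/backup-user-data.sh
```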
With regard to my GSoC schedule: I'm still on track. I will finish the PageRank calculation this week and implement a scoring script for the front-end to improve the search results.