Dear All,
let's postpone to next week; I am unfortunately swamped with a paper deadline today and through the weekend.
But let's continue the discussion here. I like the suggestions below, with some exceptions.
In particular, I don't think we should use the dumps indicated there. If you look at the size, the latest one for enwiki is 700+ MB, and I am not sure what period the dumps cover.
Rather, I would use the API to fetch changes only for the set of pages that we observe. And I would fetch those pages as XML (i.e., wiki markup), not as HTML, since our algorithms work on markup internally.
This will enable us to grow in a more gradual way, as we can initially start with small sets of pages under observation.
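To make the idea concrete, here is a rough sketch of per-page fetching. It assumes the standard MediaWiki action API endpoint for enwiki and JSON output with formatversion=2; the function names, the User-Agent string, and the returned record layout are placeholders I made up for illustration, not a proposal for the final storage structure.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the standard MediaWiki action API endpoint for enwiki.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_revision_query(title, since=None, limit=50):
    """Build the query parameters for revisions of one observed page, as wiki markup."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|content",  # content = raw wiki markup
        "rvslots": "main",
        "rvdir": "newer",                         # oldest first
        "rvlimit": str(limit),
    }
    if since is not None:
        # Only revisions from this ISO 8601 timestamp onward.
        params["rvstart"] = since
    return params


def extract_revisions(response):
    """Flatten an API response (assumed formatversion=2 layout) into simple records."""
    records = []
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            records.append({
                "title": page.get("title"),
                "revid": rev.get("revid"),
                "timestamp": rev.get("timestamp"),
                "markup": rev.get("slots", {}).get("main", {}).get("content"),
            })
    return records


def fetch_revisions(title, since=None):
    """Fetch new revisions of one observed page (performs a network request)."""
    url = API_URL + "?" + urllib.parse.urlencode(build_revision_query(title, since))
    # Hypothetical User-Agent; a real client should identify itself properly.
    req = urllib.request.Request(url, headers={"User-Agent": "page-observer-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return extract_revisions(json.load(resp))
```

We would call fetch_revisions once per observed page, passing the timestamp of the last revision we already have, so the amount of data grows only with the number of pages under observation.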
Next week I would like to work on defining the storage structure; the best days for me are Monday and Tuesday. Is there a time that works for you? If you are too busy with exams, we can also do the week after, even though it's break.
Luca