Hi all,
The next time you look for an archived Twitter page and find it in an arbitrary non-English language, don't be surprised: this problem is more common than you might think. We found that about half of the Twitter captures are in 46 non-English languages, of which about half are in Kannada (a regional Indian language) alone. Language diversity in web archives is generally a good thing, but not in this case, as it is an unintended result of sticky cookies in crawlers.
We at the WS-DL Research Group have been investigating this issue for a long time. Last month, we published our findings on our blog. We also proposed some potential approaches to minimize the impact. Those who are implementing or running crawlers might find them useful.
Best,
--
Department of Computer Science
Old Dominion University
Norfolk VA 23529