Half of the Twitter mementos are non-English and this is bad

Sawood Alam

Apr 20, 2018, 10:46:08 AM
to IIPC Members, memento-dev
Hi all,

The next time you look for an archived Twitter page and find it in an arbitrary non-English language, don't be surprised: this problem is more common than you might think. We found that about half of the Twitter captures are in 46 non-English languages, and about half of those are in Kannada (a regional Indian language) alone. Language diversity in web archives is generally a good thing, but not in this case, because it is an unintended result of sticky cookies in crawlers.

We at the WS-DL Research Group have been investigating this issue for a long time. Last month, we published our findings on our blog. We also proposed some potential approaches to minimize the impact. Those who implement or run crawlers might find the post useful.
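
For anyone who wants a quick feel for the mechanism, here is a minimal toy sketch in Python. It is not Twitter's actual behavior, not any real crawler's code, and not necessarily one of the approaches from the blog post. Everything in it is an assumption made for illustration: the `toy_server` function, the `lang` cookie name, the `?lang=` URL parameter, and the example paths are all hypothetical. It only shows how a cookie set by one language-specific link can leak into every later capture when the crawler keeps its cookie jar, and how sending no cookies avoids that.

```python
from http.cookies import SimpleCookie


def toy_server(url, cookie_header=""):
    """Hypothetical site: a ?lang= link sets a language cookie,
    and later requests are served in the language stored in that cookie."""
    cookies = SimpleCookie(cookie_header)
    set_cookie = None
    if "?lang=" in url:
        # Explicit language link: serve that language and ask the client to remember it.
        lang = url.split("?lang=", 1)[1]
        set_cookie = f"lang={lang}"
    elif "lang" in cookies:
        # No explicit language: fall back to whatever the cookie says.
        lang = cookies["lang"].value
    else:
        lang = "en"
    return lang, set_cookie


def crawl(urls, sticky):
    """Toy crawler. With sticky=True it keeps and replays cookies between
    requests; with sticky=False every request is sent without cookies."""
    jar = ""
    captured = []
    for url in urls:
        lang, set_cookie = toy_server(url, jar if sticky else "")
        if sticky and set_cookie:
            jar = set_cookie
        captured.append((url, lang))
    return captured


# Hypothetical crawl frontier: one language-specific link in the middle.
urls = ["/nasa", "/nasa?lang=kn", "/nasa", "/bbc"]
sticky_result = crawl(urls, sticky=True)
fresh_result = crawl(urls, sticky=False)

print("sticky:", sticky_result)
print("fresh: ", fresh_result)
```

In this toy setup, the sticky crawler captures every page after the `?lang=kn` link in Kannada, while the cookie-free crawler gets Kannada only for the URL that explicitly asks for it.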


Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529
