Half of the Twitter mementos are non-English and this is bad

Sawood Alam

Apr 20, 2018, 10:46:08 AM

to IIPC Members, memento-dev

Hi all,

The next time you look for an archived Twitter page and find it in an arbitrary non-English language, don't be surprised: this problem is more common than you might think. We found that about half of the Twitter captures are in 46 non-English languages, and about half of those captures are in Kannada (a regional Indian language) alone. Language diversity in web archives is generally a good thing, but not in this case, because it is an unintended result of sticky cookies in crawlers.
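
As a rough illustration of how a sticky cookie can produce this effect, here is a toy simulation. It is only a sketch under assumptions: `fake_server`, `crawl`, the `lang` cookie name, and the example.com URLs are all hypothetical, and it does not reproduce any real crawler or Twitter's actual behavior. The assumed site stores the language from a `?lang=` link in a cookie and then serves every later request in that language.

```python
from typing import Dict, List


def fake_server(url: str, cookies: Dict[str, str]) -> str:
    """Hypothetical site: a ?lang= link sets a cookie; the cookie picks the language."""
    if "?lang=" in url:
        cookies["lang"] = url.split("?lang=", 1)[1]
    # With no language cookie, the assumed default is English.
    return cookies.get("lang", "en")


def crawl(urls: List[str], sticky: bool) -> List[str]:
    """Return the language served for each URL, in crawl order."""
    jar: Dict[str, str] = {}
    served = []
    for url in urls:
        if not sticky:
            jar = {}  # start every request with an empty cookie jar
        served.append(fake_server(url, jar))
    return served


urls = [
    "https://example.com/a",
    "https://example.com/a?lang=kn",  # one language-switch link in the frontier
    "https://example.com/b",
    "https://example.com/c",
]
print(crawl(urls, sticky=True))   # ['en', 'kn', 'kn', 'kn']
print(crawl(urls, sticky=False))  # ['en', 'kn', 'en', 'en']
```

In the sticky run, a single language-switch link changes the language of every page captured after it; clearing the jar per request keeps the effect confined to that one URL.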

We at the WS-DL Research Group have been investigating this issue for a long time. Last month, we published our findings on our blog, along with some potential approaches to minimize the impact. Those who implement or run crawlers might find them useful.

http://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html

Best,

Sawood Alam

Department of Computer Science

Old Dominion University

Norfolk VA 23529
