Sharing dirty data sets

Jerry Flatto

Mar 2, 2017, 10:08:28 AM

to OpenRefine

Hi all. I am teaching OpenRefine in the classroom to undergraduates in an analytics curriculum since data cleaning is such a big part of performing analytics. I am using the excellent “Using OpenRefine” book in the classroom but have run into an issue. Since I have limited dirty data sets, students in my class are using the same data sets and I am concerned that some students might be “borrowing” the work of other students.

I would like to see if folks could share any dirty data files they have with me, or send me links to files, so that I can use them next time I teach the class and prevent the possibility of collusion. If there are other academics on this Google list, or other folks who might want the data, I will be happy to share it on a Google Drive or similar.

Thanks.

Jerry

"No trees were harmed in the sending of this message; however, a large number
of electrons were slightly inconvenienced..."

Dr. Jerry Flatto, Professor, Information Systems Department - School of Business

University of Indianapolis, Indianapolis, Indiana, USA mailto:jfl...@uindy.edu

Owen Stephens

Mar 2, 2017, 11:39:30 AM

to OpenRefine

Hi Jerry,

I have a couple of data sets that I use in OpenRefine training:

Firstly - some bibliographic metadata describing books in the British Library:

http://www.meanboyfriend.com/overdue_ideas/wp-content/uploads/2015/02/BL-Flickr-Images-Book-subset.csv

Secondly - some bibliographic metadata describing articles in the Directory of Open Access Journals (DOAJ):

https://github.com/data-lessons/library-openrefine/raw/gh-pages/data/doaj-article-sample.csv

In both cases I've extracted the data in these files from a much larger data set:

British Library metadata on Github: https://github.com/BL-Labs/imagedirectory/blob/master/book_metadata.json

DOAJ: https://doaj.org

With some investment in a bit of code, you could generate random sets of data from these sources, each of which would differ.
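One way this could look is the minimal Python sketch below. It is an illustration only, not code from this thread: the function name `sample_subset`, the idea of seeding the random generator with a student ID, and the subset size are all assumptions. It draws a reproducible random subset of rows from CSV text, so each student is likely (though not guaranteed) to get a different file.

```python
import csv
import io
import random


def sample_subset(csv_text, student_id, k):
    """Return CSV text holding the header plus k randomly sampled rows.

    student_id is a hypothetical identifier used only as the random seed,
    so the same student always gets the same subset.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)

    # A dedicated generator per student keeps the samples reproducible.
    rng = random.Random(student_id)
    chosen = rng.sample(rows, min(k, len(rows)))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(chosen)
    return out.getvalue()
```

To use it, you would read one of the source files into a string (for example a local copy of the DOAJ sample CSV), call `sample_subset(text, "student-01", 500)` once per student, and write each result to its own file.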

Owen

Ettore Rizza

Mar 2, 2017, 3:30:47 PM

to OpenRefine

Hi Jerry.

If the students use very different datasets, is there not a risk that one student will get an easy file and another a puzzle?

A solution could be to automatically generate messy data sets. Dataiku has published an IPython notebook on GitHub that automates the operation.

With this method, each student would receive a different file, but one similar to those of their classmates.
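To make the idea concrete, here is a minimal Python sketch of generating a messy copy of a clean table. It is not the Dataiku notebook and does not reproduce its approach; the function names, the three kinds of corruption (stray whitespace, changed case, swapped adjacent characters) and the default corruption rate are all assumptions chosen for illustration.

```python
import random


def add_whitespace(value, rng):
    """Pad the value with 1 to 3 spaces on each side."""
    return " " * rng.randint(1, 3) + value + " " * rng.randint(1, 3)


def change_case(value, rng):
    """Switch the value to all upper case or all lower case."""
    return rng.choice([value.upper(), value.lower()])


def swap_adjacent(value, rng):
    """Simulate a typo by swapping two neighbouring characters."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


CORRUPTIONS = [add_whitespace, change_case, swap_adjacent]


def make_messy(rows, seed, rate=0.3):
    """Return a copy of rows (lists of strings) with some cells corrupted.

    seed is a hypothetical per-student value; rate is the probability
    that any given cell is corrupted.
    """
    rng = random.Random(seed)
    messy = []
    for row in rows:
        new_row = []
        for cell in row:
            if rng.random() < rate:
                cell = rng.choice(CORRUPTIONS)(cell, rng)
            new_row.append(cell)
        messy.append(new_row)
    return messy
```

Calling `make_messy(rows, seed)` with a different seed for each student starts from the same clean data and applies the same kinds of errors, so the files should be of comparable difficulty while differing in detail.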

Ettore Rizza

Mar 2, 2017, 3:46:49 PM

to OpenRefine

@Owen: in the librarians' world, the most rotten files I have seen so far are extracted from the Library Genesis database. You can download them here, for Libgen, or here for Libgen Fiction.

Owen Stephens

Mar 6, 2017, 4:50:58 AM

to OpenRefine

Thanks Ettore,

Access to Library Genesis is blocked in the UK (at least through my service provider) by a court order.

However - there is no shortage of dirty data in the world of bibliographic description!

Owen

Ettore Rizza

Mar 6, 2017, 5:35:04 AM

to OpenRefine

@Owen: I have a sample in my Dropbox: https://www.dropbox.com/s/xsenqyqa6f404f5/libgen_foreignfiction.rar?dl=0
