Sharing dirty data sets


Jerry Flatto

Mar 2, 2017, 10:08:28 AM
to OpenRefine

Hi all.  I am teaching OpenRefine in the classroom to undergraduates in an analytics curriculum since data cleaning is such a big part of performing analytics.  I am using the excellent “Using OpenRefine” book in the classroom but have run into an issue.  Since I have limited dirty data sets, students in my class are using the same data sets and I am concerned that some students might be “borrowing” the work of other students.

 

I would like to ask whether folks could share any dirty data files they have, or send me links to such files, so that I can use them the next time I teach the class and reduce the possibility of collusion.  If there are other academics on this Google list, or other folks who might want the data, I will be happy to share it on Google Drive or similar.

 

Thanks.

 

Jerry

 

"No trees were harmed in the sending of this message; however, a large number
of electrons were slightly inconvenienced..."


Dr. Jerry Flatto, Professor, Information Systems Department - School of Business

University of Indianapolis, Indianapolis, Indiana, USA mailto:jfl...@uindy.edu

Owen Stephens

Mar 2, 2017, 11:39:30 AM
to OpenRefine
Hi Jerry,

I have a couple of data sets that I use in OpenRefine training:

Firstly - some bibliographic metadata describing books in the British Library:
http://www.meanboyfriend.com/overdue_ideas/wp-content/uploads/2015/02/BL-Flickr-Images-Book-subset.csv

Secondly - some bibliographic metadata describing articles in the Directory of Open Access Journals (DOAJ)

In both cases I've extracted the data in these files from a much larger data set:

With some investment in a bit of code, you could generate random sets of data from these sources, all of which would differ.

Owen
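
For anyone wanting to try the random-subset idea, here is a minimal sketch in Python (standard library only). It is not code from Owen's training material; the function names, the `course_seed` value and the output file naming are all made up for illustration. It assumes the large source file is a CSV with a header row, and it seeds the sampling with the student ID so each student gets a different but repeatable subset.

```python
import csv
import random


def student_subset(rows, student_id, size, course_seed="refine-2017"):
    """Return a reproducible random sample of rows for one student.

    `course_seed` is a hypothetical label; change it each term to get
    fresh subsets for the same student IDs.
    """
    # Seeding with the student ID makes subsets differ between students
    # but stay the same for one student, which helps when grading.
    rng = random.Random(f"{course_seed}:{student_id}")
    return rng.sample(rows, min(size, len(rows)))


def write_subsets(source_csv, student_ids, size, out_prefix="subset_"):
    """Write one CSV per student, each keeping the original header row."""
    with open(source_csv, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    for sid in student_ids:
        out_name = f"{out_prefix}{sid}.csv"
        with open(out_name, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(student_subset(rows, sid, size))
```

Something like `write_subsets("big_file.csv", ["s001", "s002"], 500)` would then produce `subset_s001.csv` and `subset_s002.csv` (the file name and IDs here are placeholders). Because every subset is drawn from the same source, the kinds of problems students meet should be broadly similar even though the rows differ.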

Ettore Rizza

Mar 2, 2017, 3:30:47 PM
to OpenRefine
Hi Jerry. 

With very different datasets, is there not a risk that one student will get an easy file and another a puzzle?

A solution could be to automatically generate messy data sets. Dataiku has published on GitHub an IPython notebook that automates the operation.

With this method, each student would receive a different file, but one similar to those of their classmates.
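
To give a rough idea of what automatic "dirtying" can look like, here is a minimal sketch in Python. It is not the Dataiku notebook and does not reproduce its approach; the function names, the four kinds of damage and the default `rate` are assumptions chosen for illustration. It takes clean rows (lists of strings) and damages a random share of the cells, seeded per student so every file is different but of comparable difficulty.

```python
import random


def dirty_value(value, rng):
    """Apply one randomly chosen kind of damage to a string value."""
    choice = rng.randrange(5)
    if choice == 0:
        return "  " + value + " "          # stray whitespace
    if choice == 1:
        return value.upper()                # inconsistent case
    if choice == 2 and len(value) > 1:
        i = rng.randrange(len(value) - 1)   # swap two adjacent characters
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    if choice == 3:
        return ""                           # missing value
    return value                            # left untouched


def dirty_rows(rows, student_id, rate=0.3):
    """Return a copy of rows with roughly `rate` of the cells damaged.

    The input rows are not modified. Seeding with the student ID makes
    the damage repeatable for a given student.
    """
    rng = random.Random(f"dirty:{student_id}")
    return [
        [dirty_value(cell, rng) if rng.random() < rate else cell for cell in row]
        for row in rows
    ]
```

Since every student's file starts from the same clean data and the same `rate`, the amount of cleaning needed should be about the same for everyone, and the clean original doubles as an answer key.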

Ettore Rizza

Mar 2, 2017, 3:46:49 PM
to OpenRefine
@Owen: in the librarians' world, the most rotten files I have seen so far are those extracted from the Library Genesis database. You can download them here, for Libgen, or here for Libgen Fiction.

Owen Stephens

Mar 6, 2017, 4:50:58 AM
to OpenRefine
Thanks Ettore,

Access to Library Genesis is blocked in the UK (at least through my service provider) by a court order.

However - there is no shortage of dirty data in the world of bibliographic description!

Owen

Ettore Rizza

Mar 6, 2017, 5:35:04 AM
to OpenRefine