Tabular CORGIS


Austin Bart

May 28, 2019, 1:06:45 PM
to corgis-data...@googlegroups.com
Hi everyone, this community hasn't been very active, but I wanted to put this proposal out here before we truly move ahead with it.

TL;DR: What's your opinion of a v2.0 of the repository where the datasets are flattened into traditional tabular datasets?

Lately, we haven't done much CORGIS development, in part because of my new position at UD and in part because of the general difficulty of preparing CORGIS datasets. The model we set out with in our Intro to Computational Thinking course was "Layers of Abstraction": lists of nested dictionaries. For instance, note the dictionaries within dictionaries in this Weather data map:

[Image: image.png, a diagram of the nested Weather data map]


Students struggle with this data format, and we noted recently that even our best students walk away with a very shaky understanding. It almost requires a level of recursive thinking to be able to process this kind of data. Although I don't think difficulty or recursion is a bad thing, it ties up a lot of course time and has forced us to ask whether it's really worth it.

A lot of data out there doesn't really look like this diagram. It's usually just a list of dictionaries without any further nesting. This can be modeled easily in a CSV file, and represents the fundamental model of both SQL and real data science APIs like Pandas.
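As a minimal sketch of that point (the records and field names below are made up for illustration, not taken from an actual CORGIS dataset), a list of un-nested dictionaries round-trips through CSV using only Python's standard csv module:

```python
import csv
import io

# Hypothetical flat records: one dictionary per row, no nesting.
records = [
    {"city": "Blacksburg", "month": 5, "avg_temp": 64},
    {"city": "Newark", "month": 5, "avg_temp": 66},
]

# Write the records as CSV text; the dictionary keys become the header row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "month", "avg_temp"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Read the CSV back: each row is again a dictionary keyed by column name.
# CSV carries no type information, so every value comes back as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["city"])
```

One caveat this shows: numbers have to be converted back from strings after reading, which is a small cost compared to walking nested structures.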

We're considering a major update of the CORGIS project (a v2.0) where the datasets all become flattened. We wouldn't toss away any data, we would just find ways to restructure it all without the nesting (requiring some more complex column labels). In theory, this would simplify the code needed to access data while also making it less work to create and maintain datasets.
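To illustrate what "simplify the code needed to access data" might look like, here is a rough sketch. The field names and the dotted column labels are assumptions for the example only; the actual labels in a v2.0 could look quite different.

```python
# Hypothetical nested report, loosely in the style of the current format.
nested_report = {
    "station": {"city": "Blacksburg", "state": "Virginia"},
    "data": {"temperature": {"avg": 64, "max": 78}},
}

# The same information flattened, at the cost of longer column labels.
# The "." separator is an assumption, not a decided naming scheme.
flat_report = {
    "station.city": "Blacksburg",
    "station.state": "Virginia",
    "data.temperature.avg": 64,
    "data.temperature.max": 78,
}

# Nested access: one lookup per layer of the structure.
nested_avg = nested_report["data"]["temperature"]["avg"]

# Flat access: a single lookup with a longer key.
flat_avg = flat_report["data.temperature.avg"]

print(nested_avg == flat_avg)
```

No data is lost in the flat version; the nesting has simply moved into the label.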

In practice, we would maintain the old collection and website at a different URL (something like https://think.cs.vt.edu/old-corgis), but the existing URLs would point to our new versions. This could obviously affect anyone who links directly to our datasets or has designed lessons around the old structure. I think that this move could be very beneficial in the long term, but I was curious if anyone had feelings on this one way or the other.

William H. Hooper

Sep 1, 2019, 1:32:22 PM
to CORGIS Datasets Project
I love the idea!  As an occasional user of CORGIS datasets, I have always gone straight to the .CSV files.

Generating names for the hundreds of categories in the CORGIS datasets is painful, but a stop-gap would be to convert the pathnames into concatenated strings.  For example, the column listing the citation year for Medal of Honor winners could be labelled "awarded->date->year".  
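
A minimal sketch of that stop-gap, assuming each record is a dictionary of nested dictionaries (the `flatten` helper and the sample record are hypothetical, and the real Medal of Honor fields may differ):

```python
def flatten(record, separator="->", prefix=""):
    """Return a flat dict whose keys are the joined paths to each leaf value."""
    flat = {}
    for key, value in record.items():
        # Build the label from the path walked so far, e.g. "awarded->date".
        label = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the path along.
            flat.update(flatten(value, separator, label))
        else:
            flat[label] = value
    return flat


# Hypothetical record, for illustration only.
winner = {"name": "Example Recipient", "awarded": {"date": {"year": 1945, "month": 6}}}
flat_winner = flatten(winner)
print(flat_winner)
```

Lists inside a record are left as single values here; how to spread them across columns would need a separate decision.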