Ideas for sources of data


Leah Nicolich-Henkin

Apr 23, 2015, 3:55:48 PM
to dance...@googlegroups.com
I recently spent some time creating a small database of Reddit discussions. I think that with a little work it could be incorporated into this project. Is there any interest in using it? The corpus and the script for collecting new data can be found at https://github.com/leahrnh/reddit_corpus

-Leah

DANCEcollab

Apr 23, 2015, 4:06:15 PM
to dance...@googlegroups.com
That sounds interesting. I'm not very familiar with Reddit. Can you explain what is distinctive about the structure of discourse in that context?

Carolyn

Leah Nicolich-Henkin

Apr 23, 2015, 4:13:00 PM
to dance...@googlegroups.com
Reddit has a pretty straightforward structure. It's basically a discussion forum where people post links or questions, and other people reply to them. I like working with it because the replies are well structured (people reply to specific comments, rather than just posting another comment on the same thread), and you can get very deep reply chains. Each comment has a general topic, a thread title, an author, and of course the comment body. There's nothing particularly special about the discourse structure; it's just a very large and active source of discussion on a variety of subjects.
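
To make the structure above concrete, here is a minimal Python sketch of how comments with explicit parent links can be assembled into reply trees. The field names (comment_id, parent_id, and so on) and the sample data are made up for illustration; they are not the actual format of the reddit_corpus repository or of Reddit's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    # All field names here are hypothetical, chosen to mirror the description above.
    comment_id: str
    parent_id: Optional[str]  # None for a top-level comment
    topic: str
    thread_title: str
    author: str
    body: str
    replies: List["Comment"] = field(default_factory=list)


def build_tree(comments: List[Comment]) -> List[Comment]:
    """Attach each comment to its parent and return the top-level comments."""
    by_id = {c.comment_id: c for c in comments}
    roots = []
    for c in comments:
        parent = by_id.get(c.parent_id) if c.parent_id is not None else None
        if parent is None:
            # Top-level comment, or the parent is missing from the data.
            roots.append(c)
        else:
            parent.replies.append(c)
    return roots


def max_depth(comment: Comment) -> int:
    """Length of the longest reply chain starting at this comment."""
    return 1 + max((max_depth(r) for r in comment.replies), default=0)


# Made-up sample data: c3 replies to c2, which replies to c1.
comments = [
    Comment("c1", None, "example_topic", "Example thread", "alice", "Top-level comment"),
    Comment("c2", "c1", "example_topic", "Example thread", "bob", "Reply to c1"),
    Comment("c3", "c2", "example_topic", "Example thread", "alice", "Reply to c2"),
    Comment("c4", "c1", "example_topic", "Example thread", "carol", "Another reply to c1"),
]
roots = build_tree(comments)
print(max_depth(roots[0]))  # 3
```

Because every reply names its parent, no inference is needed to recover the tree; a single pass over the comments is enough.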

Carolyn Rose

Apr 23, 2015, 4:22:08 PM
to dance...@googlegroups.com
That's cool. I wish every discussion platform made that structure explicit! I noticed that within the representation Oliver pointed to for DiscourseDB, there is a distinction between replies that are explicit in the interface and ones that are recoverable through some analysis method. Since Reddit makes this structure explicit, it would be easier to represent it in this formalism than data from some other platforms, including major MOOC platforms, where the reply structure is implicit and needs to be inferred.
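
One way to picture that distinction is to tag each reply link with where it came from. The sketch below is only an illustration under assumed names (ReplySource, ReplyRelation, confidence); it is not the DiscourseDB schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ReplySource(Enum):
    EXPLICIT = "explicit"  # recorded by the platform interface, as on Reddit
    INFERRED = "inferred"  # recovered afterwards by some analysis method


@dataclass(frozen=True)
class ReplyRelation:
    # Hypothetical record type, not taken from DiscourseDB.
    child_id: str
    parent_id: str
    source: ReplySource
    confidence: float = 1.0  # assumed score; only meaningful for inferred links


def explicit_only(relations: List[ReplyRelation]) -> List[ReplyRelation]:
    """Keep only the reply links that the platform itself recorded."""
    return [r for r in relations if r.source is ReplySource.EXPLICIT]


# Made-up example: one explicit link and one inferred link.
relations = [
    ReplyRelation("c2", "c1", ReplySource.EXPLICIT),
    ReplyRelation("p7", "p3", ReplySource.INFERRED, confidence=0.8),
]
print(len(explicit_only(relations)))  # 1
```

Keeping both kinds of link in one record type would let downstream analysis decide whether to trust inferred replies or restrict itself to explicit ones.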

Leah Nicolich-Henkin

Apr 23, 2015, 4:35:23 PM
to dance...@googlegroups.com
I just remembered that some MOOCs have their own subreddit (a Reddit forum). Here's an example for Andrew Ng's Machine Learning class on Coursera: https://www.reddit.com/r/mlclass. So if you want to collect all the discussion surrounding a MOOC, Reddit is definitely a source to check out.

-Leah

kort...@gmail.com

Apr 24, 2015, 12:25:43 PM
to dance...@googlegroups.com
I've been working on an approach to identify reply structure in the Coursera platform, where threads do have explicit chronological and organizational structure, but reply metadata is incomplete or inaccurate. This relates to finding a uniform representation of thread structure, which is directly connected to the work described here for MOOCDB and DiscourseDB.

Carolyn Rose

Apr 30, 2015, 1:54:04 PM
to dance...@googlegroups.com
Cool! Do you have something you can share?