Taking the decision to make all P2PU data open


Stian Håklev

Feb 22, 2011, 9:26:36 AM
to p2pu-re...@googlegroups.com, Philipp Schmidt, Erin Knight, Jessy Cowan
We've always said that P2PU is interested in both sponsoring its own research projects and supporting outside researchers. In Barcelona, we talked about creating APIs or regular data dumps. One of the things that has been holding this work back a bit has been the indecision about how much data we can release, to whom, etc.

I'm hereby proposing that P2PU officially decides that all user interaction data are completely public.

This includes not only all the material that has been posted, but also when users access the websites, what they click on, etc. Basically everything that an administrator would have access to, except: e-mail addresses and other personal info that isn't publicly available at this time, as well as personal messages (these aren't currently stored in the system anyway). (Whether we want to make applications public is a separate decision. I would actually propose that for the next round we make the "successful" applications public, with one secret field, "Message to organizer" - but that's a different discussion.)

We would create a system that automatically generated a data dump of the entire system, or course by course, for example on a weekly basis. (In the new platform, there might even be an API). These dumps would be available to the world.

This would enable anyone to start analyzing the data, and create interesting visualizations etc. I believe there is substantial interest in this in the P2PU community already (including by myself, and some tech savvy course organizers), and I also think this would be a way of attracting a lot of attention by learning scientists and others by having a high-value, large data set that is public (possibly no need for ethics clearance etc). It would be wonderful to create a community around the interpretation of these data.

With a future API, I could also see people coming up with tools and visualizations that can be used during courses to support course organizers and students - similar to the function of the tool server for Wikipedia.

Possible concerns:
1) students have not been properly informed about this... I propose that very clear information about this in non-legalese be displayed and agreed to when students apply to courses. ("I realize that all my interactions with the P2PU site will be logged, and made publicly available. All my contributions will also be licensed under a CC license. Should I choose to later delete my user, data will be disassociated from the user id, but it will not be deleted.") Having this associated with course signup avoids the problem of informing all existing users, and anyway this wouldn't be implemented for the current cycle.

2) this isn't a good thing in general, it's not something P2PU should do, it would stop people from participating, it would negatively affect learning processes, it would give us a bad reputation ... This is the crucial point - I think there would be very substantial advantages to us doing this, but I need to hear from you if there are significant objections.

I would like to bring this to the entire community, but I wanted to get an opinion from the research group first.

Stian

--
http://reganmian.net/blog -- Random Stuff that Matters

Joe Corneli

Feb 22, 2011, 9:37:30 AM
to p2pu-re...@googlegroups.com
Arguably CC licensing already covers everything that users type into
the site, but I don't think that would apply directly to clickstreams.
For textual information, just follow the terms of the CC-By-SA
license and you should be good to go. For other information you might
want to ask a lawyer.

॥ स्वक्ष ॥

Feb 22, 2011, 10:05:03 AM
to p2pu-re...@googlegroups.com, Stian Håklev, Philipp Schmidt, Erin Knight, Jessy Cowan
Hello :)

On Tue, Feb 22, 2011 at 14:26, Stian Håklev <sha...@gmail.com> wrote:
> We've always said that P2PU is interested in both sponsoring its own
> research projects, and support outside researchers. In Barcelona, we talked
> about creating APIs or regular datadumps. One of the things that has been
> holding this work back a bit has been the indecision about how much data we
> can release, to whom etc.
>
> I'm hereby proposing that P2PU officially decides that all user interaction
> data are completely public.

Thanks for initiating the discussion Stian. I was wondering what
decision had been made regarding this after the discussion with the
lawyers and since I'm not a lawyer, I'm not going to comment on the
legalese, rather just listen to opinions :)
Again, thanks for bringing this discussion up !

--
vid ॥ http://svaksha.com

Dan Diebolt

Feb 22, 2011, 10:29:51 AM
to p2pu-re...@googlegroups.com
The abundance of indecision and delay on this matter might be a good reason to proceed cautiously before declaring all user interaction data completely public and creating highly visible APIs and dumps of that data. How do course organizers and course members as stakeholders have their opinions factored into this decision? What other organization has a policy similar to the one proposed? What pending or proposed research project would be curtailed if a more qualified policy on data access was initially adopted? In adopting a "completely public" policy, can you guarantee that no personal data could be inadvertently released due to an oversight or lack of a review process?

I would prioritize getting basic management level information out of the platform as simple reports on the status of schools and courses (enrollments, drops, activity, load etc) and review requests for access to back-end data by researchers as they come in.

To me a "completely public" policy would be as sensible as putting a big button on your homepage labeled "click here to read our server logs." I don't think anybody in government, business, academia, or non-profits does anything like this.

Stian Håklev

Feb 22, 2011, 10:37:26 AM
to p2pu-re...@googlegroups.com, Dan Diebolt
Hi Dan,

thanks for chiming in. You are right, nobody else does this, just like almost nobody makes all student interactions open to the public, or offer free classes, or share material using open licenses (well, there are more and more doing that, but initially this was not the case). That nobody else is doing it doesn't mean it's wrong (although it obviously doesn't mean it's right either). I think the dithering was partly because nobody really took ownership of this and drove it forwards, which is what I am trying to do. We would of course not implement anything like this without serious community consultation - which is what I was trying to get started.

I would be very surprised if any personal data were inadvertently released, given that everything on the P2PU site is currently publicly available (with the exceptions of personal messages, which are immediately sent by email and not stored anyway, personal e-mails, and course applications). This means that anyone can at anytime scrape our website (which you, I think, demonstrated). I actually think it's more responsible to tell users clearly and up-front - this is how it works, all your contributions will be publicly shared and available in bulk if you decide to participate - than to lull users into some kind of false sense of security.

At the University of Toronto, getting ethics approval for any kind of interview/data collection in a non-public space requires a lengthy (often several-month-long) ethics review process. These kinds of datasets would probably be exempt (this doesn't depend on us, of course, but on the individual ethics boards). For example, a class I am in is planning to do participant observation in several P2PU courses for a class project - the class ethics proposal does not allow us to observe in any password-protected setting, so out of all the e-learning settings out there, P2PU was one of the only ones that was viable to use.

Anyway, I'm not saying there are not potential problems with this, but my point with posting this was exactly to try to get those potential problems out and see if they are surmountable or not.

Stian

Dan Diebolt

Feb 22, 2011, 11:58:24 AM
to p2pu-re...@googlegroups.com
I would like to hear other opinions especially from course members who probably are not represented in this group.

I often hear people with a longer history with P2PU making foundational arguments that P2PU does not do <such-and-such> or that one of P2PU's core tenets is <placeholder>. But then I make observations such as the following: In the P2PU Participant Survey (LimeSurvey) dated December 2010 there is a question (#26) asking if participants would pay a fee upfront to participate in a course. So as recently as a few months ago this was an exploratory issue in the mind of the survey author and deemed worthy of asking potential course participants their opinion on the matter. [Unfortunately this survey never ran.]

So my point is that there is a very short history and a lot of fluidity concerning just about every aspect of P2PU. We need to find the "sweet spot" where online peer to peer learning works and I think the jury is out on many core issues including the one we are discussing. We need a lot more basic information collection and routine reports and surveys.

If you read Philipp's recent blog post concerning P2PU's road map for 2011, he listed four steps as follows:

First Quarter: Opportunity-driven
Second Quarter: +Prioritization
Third Quarter: =>Diversity & Excellence
Fourth Quarter: Full spectrum


We are definitely at the transition point between P2PU activities being advanced based on opportunistic factors and the prioritization of P2PU activities based on community, course and platform needs. To support that prioritization process we need to conduct surveys and trials and base decisions on the information that comes in.

So in conclusion I would like to hear what other people think about this "completely public" policy, including course organizers and course members as stakeholders.

Jessy Kate

Feb 22, 2011, 12:29:24 PM
to p2pu-re...@googlegroups.com, Dan Diebolt
agreed the biggest challenge here is the cycles needed to achieve the needed large scale effort. *assuming* we have a team that thinks this through systematically and carefully, i am absolutely 100% behind it. 
--
Jessy Cowan-Sharp
http://jessykate.com

Joe Corneli

Feb 22, 2011, 12:38:15 PM
to p2pu-re...@googlegroups.com, Stian Håklev, Dan Diebolt
On Tue, Feb 22, 2011 at 4:37 PM, Stian Håklev <sha...@gmail.com> wrote:
> Hi Dan,
>
> thanks for chiming in. You are right, nobody else does this, just like
> almost nobody makes all student interactions open to the public, or offer
> free classes, or share material using open licenses (well, there are more
> and more doing that, but initially this was not the case).

Stian:

"almost nobody else [...] share[s] material using open licenses" -- um, what?

As I tried to indicate, there's nothing particularly special or novel
about posting CC-By-SA content in a tarball. There isn't even a
question here. In my view this should just happen without discussion
or debate.

The question is about posting logs of interaction data. A lot of that
information *isn't* available by screen-scraping. Let's be clear that
that is what's at issue here. This shouldn't be a question either.

The question is what's at stake.

Joe

Stian Håklev

Feb 22, 2011, 2:32:41 PM
to p2pu-re...@googlegroups.com, Alexander Halavais
Hi Alex,

thanks for chiming in.

Yes, this will vary from IRB to IRB. If you did a search on Twitter for a certain hash tag, and then wrote an article based on the statistics you gathered, for example, would that require an ethics approval? But you are right, it would make sense to have in the sign up to courses "Your data will be shared, and they might be used by researchers", or some such statement.

Stian

On Tue, Feb 22, 2011 at 13:29, Alexander Halavais <hala...@gmail.com> wrote:
> There is a fairly active debate over what constitutes "public" from
> the perspective of human subjects boards in the US (the US tends to be
> more protective in this regard than other countries I have experience
> with, though our standards are becoming the norm elsewhere as well).
> Just because data in a conversation occurs in public does not
> automatically make it exempt, though--again--there is a lot of
> discussion over this. This is the reason, e.g., the AOL search data
> was fairly untouchable, despite being unquestionably public.
>
> An IRB will look at whether users knew that the material was not only
> transparent (that is, available to the public) but could envision its
> use in research. Again, this isn't the case at every IRB, some (e.g.
> University of California) are more restrictive than others. The OHRP
> is likely to release new guidelines for online research this year that
> may provide a bit more clarity in terms of expectations.
>
> That said, if you wanted to make it easier for researchers to clear
> the IRB hurdle, a signup that gave something more akin to real
> informed consent would do it: i.e., explicitly saying that the work
> can be used (among others) by researchers interested in online
> interaction, that you can opt out at any time, etc.
>
> Best,
>
> Alex

Alexander Halavais

Feb 22, 2011, 1:29:27 PM
to p2pu-re...@googlegroups.com

Stian Håklev

Feb 22, 2011, 1:26:01 PM
to Joe Corneli, p2pu-re...@googlegroups.com, Dan Diebolt
Joe,

thanks for clarifying.

So there is already useful information that could be gleaned from having easier access to the content, including posting dates etc, and I get the sense from you that this is not very controversial. I am not sure if Dan agrees though.

The other data is the interaction data - basically when people log in, what they read etc. It would be interesting if people could do some brainstorming about what might be possible issues with releasing this.

Stian

Philipp Schmidt

Feb 23, 2011, 7:03:51 AM
to p2pu-re...@googlegroups.com, Stian Håklev, Alexander Halavais
Great discussion.

One major issue that hasn't come up yet is the resources required to do this:

(1) think it through carefully and develop a strawman proposal,
(2) share with the full community for feedback, respond to comments,
update strawman, etc.
(3) put in place technology to track, analyse, share the data,
and I would like to see (4) spend time working with people who are
using the data to make sure some of the benefits of their work flows
back to improving P2PU.

In terms of priorities for paid staff - I can see a conflict for the
work of developers in (3) - and would argue that building the new site
trumps this at least for the next while.

I don't want to put a damper on this discussion - this is an important
topic - but I also want to be realistic, and make sure we understand
the trade off between this and other work.

P

Stian Håklev

Feb 23, 2011, 7:41:54 AM
to Philipp Schmidt, p2pu-re...@googlegroups.com, Alexander Halavais
In terms of technology, there is actually already a script that exports all user interactions for a given course, and there is another one that exports all users that have applied to a given course with personal info etc. So far these are totally restricted (which means that even for me, who has access, it's hard to write anything automatic that grabs content for all the courses etc). I am guessing that opening the first script to the world would be the flip of a switch, and opening the second script up would simply be removing a few fields (email addresses etc) that we don't want shared.
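
The field-removal step described above could look roughly like the following. This is only a minimal sketch under stated assumptions: the column names (`email`, `message_to_organizer`) and the helper names `scrub_rows` and `to_csv` are hypothetical, and the real export scripts may be structured quite differently.

```python
import csv
import io

# Hypothetical column names; the real export scripts may use different ones.
PRIVATE_FIELDS = {"email", "message_to_organizer"}


def scrub_rows(rows, private_fields=PRIVATE_FIELDS):
    """Return copies of the rows with the private fields removed."""
    return [
        {key: value for key, value in row.items() if key not in private_fields}
        for row in rows
    ]


def to_csv(rows):
    """Serialize the rows to CSV text, using the keys of the first row as header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    applicants = [
        {
            "username": "learner1",
            "course": "intro-python",
            "email": "learner1@example.org",
            "message_to_organizer": "private note",
        }
    ]
    print(to_csv(scrub_rows(applicants)), end="")
```

This removes a deny-list of fields, as the message suggests. Keeping only an explicit allow-list of public fields would be a more cautious variant, since a newly added private column would then stay out of the dump by default.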

There might be data that these scripts don't capture well, but at this point, I don't think that's our biggest concern, that's something that we can think about with the new platform anyway. I think it's much more of a principle.

So, after some feedback, maybe it's useful to split the debate into two:
- are there any concrete objections / possible negative effects from releasing data dumps of all _publicly available content_ already published at p2pu.org? (this is content that anyone could scrape anyway, and that is already released as CC BY).

- what objections or concerns might there be with releasing interaction data, click stream data (when logged in, when reading what, etc) which is currently not available to the public?

Dan has raised a third question, which is - should all student interactions be public. So far that is our principle (although it isn't always "policed" when things happen on external platforms). That is certainly an interesting debate too, but the two questions above only address the data that already exists publicly.

Stian

Dan Diebolt

Feb 23, 2011, 9:08:17 AM
to p2pu-re...@googlegroups.com
Technically it will be a lot more convenient and error-free to extract your data from the database tables for analysis, although you will have to scrub certain fields. Scraping is like alchemy if you do it large scale as you have to negotiate a lot of very weird scenarios. However you get your data, it can easily be processed by a Ruby or Python script to generate a report of some sort.
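
As a rough illustration of the "query the tables, then report" approach, here is a small Python sketch. The `interactions` table, its columns, and the sample rows are all made up for the example; the actual P2PU schema is not described in this thread.

```python
import sqlite3


def course_activity_report(conn):
    """Return (course, participants, actions) rows, busiest course first."""
    query = """
        SELECT course,
               COUNT(DISTINCT username) AS participants,
               COUNT(*) AS actions
        FROM interactions
        GROUP BY course
        ORDER BY actions DESC, course ASC
    """
    return conn.execute(query).fetchall()


def build_demo_database():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE interactions (course TEXT, username TEXT, action TEXT)"
    )
    conn.executemany(
        "INSERT INTO interactions VALUES (?, ?, ?)",
        [
            ("intro-python", "learner1", "post"),
            ("intro-python", "learner2", "comment"),
            ("intro-python", "learner1", "comment"),
            ("open-journalism", "learner3", "post"),
        ],
    )
    return conn


if __name__ == "__main__":
    report = course_activity_report(build_demo_database())
    for course, participants, actions in report:
        print(f"{course}: {participants} participants, {actions} actions")
```

The report only exposes per-course counts, which is close to the "basic management level information" mentioned earlier in the thread and does not require releasing any row-level data.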

One issue I don't think is completely understood is the extreme difference in value between aggregated and unaggregated data. Gold particles lying randomly in the earth are unaggregated and of no immediate value. Aggregated gold in ingots requires a lot of labor to produce and is quite valuable. Entire industries are created aggregating government and web harvested data and then selling it back to the public on a pay per view basis. My point is that making your data publicly available in aggregate form can have a lot of intentional and unintentional consequences.

Heather Ford

Feb 23, 2011, 11:12:05 AM
to p2pu-re...@googlegroups.com
I would caution against this - at least without very specific goals in
mind. What kinds of research would benefit from this kind of access?
How would it make learners feel to be told that everything they say on
a learning forum can be parsed and tracked and
analysed? My hunch from the research I've been doing is that, although
most people accept it and join the course, they protect themselves by
sharing very carefully, thus prohibiting the kind of social learning
p2pu wants to encourage.

When you want to build a system that emulates peer groups learning
together, try to think of what that would look like if your group was
surrounded by thousands of people with notepads watching you and
taking note of what you say. That's not peer learning anymore. It's
something different. In my perspective, it merely recreates the
problems that p2pu was trying to solve initially: that we treat
learning as access to learning objects when actually it's a lot more
about a group of people getting together to share a part of
themselves.

re. making 'public' information 'more public', Helen Nissenbaum has
some great pieces on 'contextual integrity'. She basically says that
privacy is violated when the context in which information is disclosed
is disrupted. For example, when you're sitting in a restaurant talking
to a friend, you're in a public place but you don't expect that others
will listen in. This means that making public data more public could
become a privacy violation.

Don't get me wrong: I think giving researchers access to data is
important - I just don't think it should be given away so readily.
p2pu has a trustworthy brand. But if you give everyone access to the
information that people feel like they're sharing with p2pu, there
could be some major problems (some of which you might not get pushback
on immediately since people generally don't recognise the effects of
unlimited access until something bad happens).

You could do some work anonymising the data, but I still think that
the challenge is going to be in giving people notice of this in a way
that still lets them feel safe. Data that knows
no bounds doesn't make me feel safe.

--------------------
Heather Ford
UC Berkeley School of Information
http://blogs.ischool.berkeley.edu/masks/

Maria Droujkova

Feb 23, 2011, 11:15:49 AM
to p2pu-re...@googlegroups.com
On Wed, Feb 23, 2011 at 11:12 AM, Heather Ford <hfo...@gmail.com> wrote:
> I would caution against this - at least without very specific goals in
> mind. What kinds of research would benefit from this kind of access?
> How would it make learners feel to be told that everything they say on
> a learning forum will have the ability to be parsed and tracked and
> analysed.

These are open forums already. Everything is already being tracked (try and search Google for a phrase from the forum) and can be analyzed if anyone cares.

Attention is a prize these days. In my experience, the fact that someone is paying attention is usually motivating for people.


Cheers,
Maria Droujkova

Make math your own, to make your own math.
 

Stian Håklev

Feb 23, 2011, 12:52:58 PM
to p2pu-re...@googlegroups.com
Dan, I really appreciate our discussion on this, but let's try to drive this conversation forward.


> My point is that making your data publicly available in aggregate form can have a lot of intentional and unintentional consequences.

Yes, I understand that. Now that we all agree that that is the case, can we start brainstorming those intentional and unintentional consequences? Let's get them out there, and discuss them.

Stian

Joe Corneli

Feb 23, 2011, 1:13:10 PM
to p2pu-re...@googlegroups.com
I would say that, quite arguably, any chilling effect from "publicity"
is already fully in effect. However, people are free to sign up with
a pseudonym, and most appear to do that. P2PU could offer a sign-up
service that offered to more fully "mask" users online, and I think
this would be a good idea. It could also offer a sign-up service that
found a way to associate real-world identity to someone (e.g. send
them a thumbdrive with a given cryptographic key on it) and I think
that could also be a good idea.

Moral: options are indeed good.

Personally I think the question shouldn't be "should the data be
open?", but rather, "what data, precisely, should be open, and why?".

I think this is not disconnected from the *other* concerns about user
experience that Dan has highlighted lately. At least in a first pass,
the most powerful "why's" would be: because sharing X data will help Y
researcher(s) address Z issue related to user experience. Other
concerns W are potentially very interesting, but I think at least at
the moment user experience seems key.

Heather Ford

Feb 23, 2011, 1:25:51 PM
to p2pu-re...@googlegroups.com
> Moral: options are indeed good.
>
> Personally I think the question shouldn't be "should the data be
> open?", but rather, "what data, precisely, should be open, and why?".

I agree. And perhaps further: which data should be closed and why,
which data should be open and why, and which data should be more open
and why. Information that is 'public' and under a cc license but
hidden in a forum somewhere that Google doesn't prioritize is very
different from information that is in a forum and tops search results
for a particular user (either their official name or pseudonym).

Dan Diebolt

Feb 23, 2011, 1:28:33 PM
to p2pu-re...@googlegroups.com
D>My point is that making your data publicly available in aggregate form can have a lot of intentional and unintentional consequences.

Just to be clear, in the context of what I was discussing I meant "aggregate" to mean taking scattered data and collecting it together or organizing it into one database or unified format rather than counting, averaging or summing it up in an aggregate statistical sense.
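
The two senses of "aggregate" distinguished above can be contrasted in a few lines of Python. This is purely illustrative: the record layout and the function names `collect` and `summarize` are invented for the example.

```python
from collections import Counter


def collect(*sources):
    """Aggregate in the 'collecting' sense: merge scattered records into one list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def summarize(records):
    """Aggregate in the statistical sense: count records per user, dropping detail."""
    return Counter(record["user"] for record in records)


if __name__ == "__main__":
    forum_posts = [{"user": "learner1", "text": "hello"}]
    comments = [
        {"user": "learner1", "text": "thanks"},
        {"user": "learner2", "text": "hi"},
    ]
    everything = collect(forum_posts, comments)
    print(len(everything), "records collected")
    print(dict(summarize(everything)))
```

`collect` keeps every individual record and makes it easier to find in one place, which is the sense at issue in the privacy discussion. `summarize` discards the individual records and keeps only counts.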

S>...  start brainstorming those intentional and unintentional consequences ...

The unintentional consequences I think largely involve social factors like privacy, trust, encouraging small group learning and formation, and simple risk mitigation. Heather gave a reference to "contextual integrity" research, which sounds similar to my statement that there is an expectation that information provided to a web site is used in the manner in which it appears or, if used for research, in conformity with some type of disclosure agreement.

In any event I think I am repeating myself and I would like to take a backseat to this conversation for a while and hear what others have to say - including other stakeholders outside this group.

Philipp Schmidt

Feb 24, 2011, 5:15:33 AM
to p2pu-re...@googlegroups.com, Joe Corneli
On 23 February 2011 20:13, Joe Corneli <holtze...@gmail.com> wrote:
> Personally I think the question shouldn't be "should the data be
> open?", but rather, "what data, precisely, should be open, and why?".

I think Stian's starting point on the open end of the spectrum is right.

At the same time, I find it hard to talk about this at the theoretical
level - it would be much easier to understand the implications if we
had a few concrete cases (not just hypothetical ones, but actual
commitments from people who want to do specific analysis).

Those examples would include the information Joe is looking for. Who
wants access to what data and for what? What will be the benefit, and
to whom?

As long as we are not sure how people are going to use it - we need to
consider the potential downsides even more carefully, including:
negative perceptions of users (justified or not), the fact that we
*might* be exposing data that will be used in ways we are not
comfortable with, and the time and effort to implement and support it.

P

Rebecca Kahn

Feb 24, 2011, 5:24:19 AM
to p2pu-re...@googlegroups.com, Philipp Schmidt, Joe Corneli
I've been following this discussion with a lot of interest, and like Philipp, I find it a bit hard to think in abstract terms about this.
And while I'm by no means a data expert, or a particularly rigorous researcher (hello humanities!) I was wondering, would it be possible to do a bit of a pilot experiment with this?

If, for example, we found one course or group or school who all agreed with making their data public, and we opened it up for x amount of time, invited people we know who have expressed interest in this kind of research to come and use it, and see how it went?
With any luck, this might generate
a) some evidence of how P2PU community members respond to a request for their data
b) some evidence of the kinds of research that people might like to do
c) some kind of lever that we could use in the future, either with the community - i.e. look, this is the kind of brilliant research that people do and this is why joining P2PU involves being part of these kinds of projects - OR with the researchers - i.e. this is how our community responds, so please think carefully about your strategy and what you're asking of them.

I know it's a bit of a backwards way of going about things, and it might take a while to set up but I think, in the P2PU spirit of "trying can't hurt" and "fly by the seat of our pants" it could be a possible option.

B
--
Rebecca Kahn
Skype: rebekahn

Niels Sprong

Mar 12, 2011, 12:41:36 PM
to p2pu-re...@googlegroups.com, Rebecca Kahn, Philipp Schmidt, Joe Corneli
Nice discussion.

I can give my own view but it seems that there are a lot of own views here already.

From the latest ideas on 'just like, try it, and see what happens and what people think', I gather that the question now actually seems to be: how do we get a feel for how the 'p2pu community' (and by this I mean people who use p2pu, not the list per se) thinks about this?

Can we find a way to ask a large number of users how they feel about this sometime this month or the next (e.g. in a participant survey)?