Re: [learningregistry] Updating or Deleting documents (doc_ID) on sandbox.learningregistry.org


Walt Grata

Aug 24, 2012, 9:19:10 AM
to learning...@googlegroups.com
Phil, documents in the LR are immutable, and there is currently no way to delete them. We're investigating a TTL (time-to-live) for documents so they will be automatically cleaned up by the system. We've been discussing it on the 4pm Eastern design calls if you want to join the discussion.

On Thu, Aug 23, 2012 at 8:32 PM, Phil <dune...@gmail.com> wrote:
I've tried several different ways to change documents associated with a "doc_ID" that I've uploaded to sandbox.learningregistry.org/publish.

First I tried updating the JSON file I published with the "doc_ID" key/value and then posting it to sandbox.learningregistry.org/publish again. It returned what looked like a success, but nothing changed in the record when I went to:
http://sandbox.learningregistry.org/harvest/getrecord?request_ID=305d0d4780344d9f9a94dc4ad650313a&by_doc_ID=true

I tried posting using both the LRSignature publish URL command-line arg and using curl. Both appear to succeed, but no change occurs.

Then I tried creating a new JSON file based on the specification outlined in "learning_registry_technical_specification_0.23.0.pdf" under the "Basic Delete Service" section. I was guessing that I should post the JSON file to sandbox.learningregistry.org/delete using curl.

I figured out how to post to sandbox.learningregistry.org/publish using curl, and tried it for /delete, but no luck.
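For reference, my publish call boils down to roughly the following (a minimal sketch in Python rather than curl; the envelope is abbreviated and unsigned, so the exact field set is just my reading of the 0.23.0 spec, not gospel):

import json
import urllib.request

# Abbreviated, unsigned resource_data envelope -- real envelopes carry more
# fields (TOS, identity, digital signature, etc.) per the 0.23.0 spec.
envelope = {
    "doc_type": "resource_data",
    "doc_version": "0.23.0",
    "resource_data_type": "metadata",
    "active": True,
    "resource_locator": "http://example.com/my-resource",
    "resource_data": "...the metadata payload...",
}

request = urllib.request.Request(
    "http://sandbox.learningregistry.org/publish",
    data=json.dumps({"documents": [envelope]}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(request).read().decode("utf-8"))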

I've searched in several places for a clue as to how to do this, but several posts I've read suggest it may not be possible. That seems to contradict what is in the learning_registry_technical_specification_0.23.0.pdf file.

Basically my question is, can I update or delete documents (files that I have a doc_ID for) on sandbox.learningregistry.org? And, if so, how?

I can't imagine there not being a feature to do this, but it would be nice to have a definitive word from someone in the community who is familiar with the project.

Thanks Much!
Phil


Joshua Marks

Aug 24, 2012, 10:53:13 AM
to learning...@googlegroups.com

This seems like a very serious limitation. There has to be a way to remove bad data, and a TTL doesn't seem like a great way to do it.

Joshua Marks

CTO

Curriki: The Global Education and Learning Community

jma...@curriki.org

www.curriki.org

US 831-685-3511

I welcome you to become a member of the Curriki community, to follow us on Twitter, and to say hello in our blog, Facebook, and LinkedIn communities.

Jim Klo

Aug 24, 2012, 11:31:02 AM
to <learningregistry@googlegroups.com>
Currently delete is left to the 'policy' of the node owner. 

The problem is that, as in any federated network, there is no way to ensure a delete happens everywhere.

Hence, really the only approach is update or TTL. Even an update isn't guaranteed to propagate everywhere, for the same reasons as delete. If people are okay with eventual consistency - which may never complete - that might be the best approach.

I believe the only 'delete' we've addressed is "take down & do not replicate", which is node-centric. Each node would have to handle it individually.

Other than that, TTL is really the only way to make any sort of delete guarantee that's been discussed thus far.

We're open to discussion and solutions for this. Please propose solutions - keep in mind the federation effect!

- Jim

Sent from my iPhone

Daniel Rehak

Aug 24, 2012, 1:13:02 PM
to learning...@googlegroups.com
LR was intentionally designed to be write-only.

As Jim notes, once data is published, it's in the wild: anyone can get it, it may be taken out of the network into a local copy, and someone else can reinject it if you delete it. So no matter how fast you try to delete it, there's no guarantee that someone else doesn't have it and can't reinject it.

Similarly, TTL is just a way to clean up a node, not to delete a document forever from the entire network.

If the data is bad and has to be updated, publish the new version along with a paradata statement that the old version has been superseded. If you want to delete it, just publish a paradata assertion that the current version is invalid.
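Hypothetically, such an assertion might look something like this (there is no standardized vocabulary for it yet, so the verb names and structure below are illustrative only):

# Hypothetical "supersedes" assertion -- no standard vocabulary exists for
# this, so the verb and structure here are illustrative, not spec-defined.
assertion_envelope = {
    "doc_type": "resource_data",
    "doc_version": "0.23.0",
    "resource_data_type": "paradata",
    "active": True,
    "resource_data": {
        "activity": {
            "actor": {"objectType": "agent", "description": ["original publisher"]},
            "verb": {"action": "supersedes"},  # or "invalidates" for a soft delete
            "object": "<doc_ID of the replacement version>",
            "related": ["305d0d4780344d9f9a94dc4ad650313a"],  # the old doc_ID
        }
    },
}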

    - Dan
Daniel R. Rehak, Ph.D.

Learning Registry Technical Architect
ADL Technical Advisor

Skype: drrehak
Email:  daniel...@learningregistry.org
            daniel.r...@adlnet.gov
Twitter: @danielrehak
Web:   learningregistry.org

Google Voice: +1 412 301 3040

Joshua Marks

Aug 24, 2012, 1:33:44 PM
to learning...@googlegroups.com

Daniel wrote:

"If you want to delete it, just publish a paradata assertion that the current version is invalid."

As long as all nodes in the federation update with this method, that is essentially a delete, or at least an update that can nullify bad data.

And:

"As Jim notes, once data is published, it's in the wild; anyone can get it, it may be taken out of the network into a local copy, someone else can reinject it if you delete it."

This leads to a very thorny copyright liability issue. If we are just talking about paradata describing a thing that lives at some public URL, and not the thing itself (e.g., an alignment or usage event for a learning resource hosted on Curriki), then the resource host (Curriki) can clearly take down an infringing resource when notified, and in the process issue paradata events to essentially remove the alignment event(s) from the federation. If an alignment lives on, it will point to an asset that no longer exists (a broken link). If, on the other hand, the asset itself is transferred from one system to another, the receiving system has a physical copy, and with it copyright liability requiring at least a safe harbor for takedown requests. I suggest a specific "take down" event of some sort as an alert to expunge all references to something that is being taken down.

Joshua Marks

CTO

Curriki: The Global Education and Learning Community

jma...@curriki.org

www.curriki.org

US 831-685-3511

I welcome you to become a member of the Curriki community, to follow us on Twitter, and to say hello in our blog, Facebook, and LinkedIn communities.

Jim Klo

Aug 24, 2012, 1:54:50 PM
to <learningregistry@googlegroups.com>
Adding to what Dan has mentioned... maybe a prudent approach is to formalize the assertion to indicate update and delete? This should be done outside the context of core LR in any case.

Steve and I discussed a two-part solution in the past that might be agreeable to most; I can highlight it a bit.

Consider the core LR as nothing more than a message stream, where the core services do nothing more than provide various ways of accessing that stream, like PubSubHubbub.
One could then construct an external index, built by parsing that stream, that interprets such formalized update/delete assertions and adjusts which docs are indexed appropriately (i.e., on update, the old document is removed and the new one is added to the index; on delete, a document is removed from the index). This makes the index inclusive/exclusive following the business rules defined by the assertion-parsing algorithm. That algorithm could vary based on how you trust the assertions (e.g., assertions by original publishers, or by some trusted third party like NSDL). This method requires no actual delete/update to be performed on the documents themselves - it's just manipulating an index external to LR, much like Data Services' extract interface! Data Services as it exists today is still an idempotent solution, so no external document can influence the inclusion/exclusion of another document in the index. But it's possible someone could extend the solution to build such an external index that can be manipulated via trusted assertions.
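A rough sketch of that assertion-driven index (the assertion structure, field names, and trust check here are assumptions for illustration, not anything in the LR spec or implementation):

# Sketch of an external index fed by the LR harvest stream; trusted
# update/delete assertions manipulate the index only -- the underlying
# LR documents are never touched.
index = {}  # doc_ID -> envelope

def apply_envelope(envelope, trusted_identities):
    data = envelope.get("resource_data")
    assertion = data.get("assertion") if isinstance(data, dict) else None
    submitter = envelope.get("identity", {}).get("submitter")
    if assertion and submitter in trusted_identities:
        # Remove the targeted doc; on update, index the replacement too.
        index.pop(assertion["target_doc_ID"], None)
        if assertion["action"] == "update":
            index[envelope["doc_ID"]] = envelope
    else:
        index[envelope["doc_ID"]] = envelope  # ordinary document: just index it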

- Jim


Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International
t. @nsomnac

Phil

Aug 24, 2012, 2:01:49 PM
to learning...@googlegroups.com
It sounds like the paradata method is the best option.

I'm looking through the demo JSON file in "learning_registry_20_minutes.pdf" and see that you can set the "resource_data_type" to "paradata".

Do I put the deletion or update message in the "resource_data" field, the sibling of "resource_data_type"?

I'm guessing there's an exact message I need to place into this field to trigger an update or delete by all nodes that support this feature.

Anybody by chance have a sample json snippet showing exactly what the paradata statement looks like for update and delete?

Thanks again,
Phil


Jim Klo

Aug 24, 2012, 2:06:54 PM
to <learningregistry@googlegroups.com>
On Aug 24, 2012, at 10:33 AM, Joshua Marks wrote:

"This leads to a very thorny copyright liability issue. If we are just talking about paradata describing a thing that lives at some public URL, and not the thing itself (e.g., an alignment or usage event for a learning resource hosted on Curriki), then the resource host (Curriki) can clearly take down an infringing resource when notified, and in the process issue paradata events to essentially remove the alignment event(s) from the federation. If an alignment lives on, it will point to an asset that no longer exists (a broken link). If, on the other hand, the asset itself is transferred from one system to another, the receiving system has a physical copy, and with it copyright liability requiring at least a safe harbor for takedown requests. I suggest a specific "take down" event of some sort as an alert to expunge all references to something that is being taken down."

One thing should be made very clear... LR is about metadata, in resource or activity forms. Resources should NEVER be injected into LR. Hence your first assessment is correct.

There is some implementation regarding safe harbor for takedown requests. AFAIK there's a do_not_replicate flag that's honored by the current implementation. A node operator would have to manually modify the document in question to add the field. The problem with making it an automated solution is: how do you verify the identity and authority of the takedown (as well as whom you honor)? In a public federation, any individual could post a takedown notice, whether they have the authority to assert copyright or not - that's not a process that can be 'automated' quite yet.
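For illustration, the operator's manual fix might look something like this (a sketch assuming direct access to the node's CouchDB backend; the database name and exact flag spelling are assumptions):

import json
import urllib.request

# Manually flag a stored envelope so it is no longer replicated.
COUCH_DB = "http://localhost:5984/resource_data"  # assumed database location
doc_id = "305d0d4780344d9f9a94dc4ad650313a"

with urllib.request.urlopen("%s/%s" % (COUCH_DB, doc_id)) as response:
    document = json.load(response)

document["do_not_replicate"] = True  # flag honored on distribution, per this thread

update = urllib.request.Request(
    "%s/%s" % (COUCH_DB, doc_id),
    data=json.dumps(document).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",  # CouchDB needs the current _rev, which the GET above preserved
)
urllib.request.urlopen(update)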

- Jim


Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International
t. @nsomnac


Jim Klo

Aug 24, 2012, 2:14:52 PM
to <learningregistry@googlegroups.com>
A doc type of paradata is fine (I think the spec allows that). The problem with LR Paradata 1.0, which is what is outlined in the 20-minute guide, is that no standard vocabulary is defined, and IMHO it has too loose a structure for handling this type of activity.

I might suggest an assertion conforming to Schema.org, where an explicit vocabulary could be defined externally and recycled for use by search indexes.

Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International
t. @nsomnac

Daniel Rehak

Aug 24, 2012, 2:54:24 PM
to learning...@googlegroups.com
Adding to Jim's comments.

Legality of takedowns and the like is something LR intentionally doesn't deal with. Nodes are operated by legal entities that may be in different jurisdictions, and what is "legal" for one node may not apply to all.

Anyone getting a takedown notice needs to respond properly and should not let their data propagate. We wanted to avoid legal entanglements between nodes, and thus propagate neither the LR data nor the takedown requests.
     - Dan

Damon Regan

Aug 28, 2012, 9:31:37 PM
to learning...@googlegroups.com
Phil,

Thank you for working with the Learning Registry and raising this issue.

LR Team,

I agree with Joshua -- this is a serious limitation. The expected
behavior when one publishes a document into a system is to be able to
update or delete it, especially when a new user is getting started, and
especially when the specification has statements like the following:

“If the resource data description document has an identifier and a
document with the same identifier exists in the resource data
description document collection, the new document SHALL be an update,
replacing the existing document in total” (p. 35).

“The basic delete service ‘deletes’ an instance of a resource data
description document (or a set of documents) directly from a node in a
resource distribution network” (p. 42).

I would like to propose that the LR nodes we maintain be modified to
implement update and delete services, and that they propagate those
messages to other nodes to carry out the corresponding updates and deletes.

Best Regards,
Damon

Jim Klo

Aug 29, 2012, 12:04:11 AM
to <learningregistry@googlegroups.com>, learning...@googlegroups.com
I'll have to go back through and peruse the delete features in the spec. The tricky part is validating who has the authority to delete in this manner. I don't recall offhand, without review, whether we determined that - and the way I believe it's worded would seem to indicate anyone could delete anything. That means NASA could delete Khan Academy's documents and vice versa. Enabling delete needs to happen in a verifiable way.

Delete, in any case, is still documented AFAIK as a policy that may or may not be honored.

One of the pillars of the project, as I recall, that was discussed and highly desired was a process that allows the community to moderate and curate - we just never resolved how that would happen.


- Jim

Sent from my iPhone

Jason Hoekstra

Aug 29, 2012, 12:36:25 AM
to learning...@googlegroups.com
I'm curious whether there is a way to rely on the digital signature of a metadata record to determine the authenticity and authority of a delete/update request. Thinking through a potential flow (this may be a rehash of talks from way back in the early design days):

* Publisher submits metadata record to the LR, digitally signs LR envelope
* Public key store lives on the 'net to answer authenticity of digital signatures (either LR public key store or publisher instance)
* Publisher submits a delete/update request message to the LR, also digitally signed
* LR checks with public key store for validation
* LR accepts or rejects the request depending on public key store's response

Could this work?
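If it could, the node-side check might be as simple as something like the following (a rough sketch using python-gnupg; the policy of requiring the request's signer to match the original envelope's signer is my assumption, not anything specified):

import gnupg  # python-gnupg; assumes publisher public keys are in the local keyring

gpg = gnupg.GPG()

def authorize(signed_request, original_signer_fingerprint):
    # Honor a delete/update request only if its signature verifies AND it was
    # made by the same key that signed the original envelope.
    verified = gpg.verify(signed_request)
    return verified.valid and verified.fingerprint == original_signer_fingerprint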

Jason

Martin, Marina

Aug 29, 2012, 1:36:53 AM
to learning...@googlegroups.com
Has there been discussion of node-specific registration? So National Geographic signs up with and publishes to a home node that authenticates it (and could, individually, handle lost keys/passwords), and the home node authenticates delete/update requests? Similar to an app.net / Status.net arrangement.

When I asked on the call last time (and I may have misunderstood, apologies if so), it sounded like there was no mechanism to handle lost keys right now. A home node could help with that, too.

Marina

Daniel Rehak

Aug 30, 2012, 10:35:41 AM
to learning...@googlegroups.com
I want to reiterate some key design concepts for LR which drove the decision to not allow deletes except for administrative actions, e.g., DMCA takedown notices.

This applies to the LR public network, and not private networks.

We did have multiple informal discussions with legal counsel to try to make sure the LR nodes fall under the DMCA Safe Harbor provisions.

So here goes:

We expect people are publishing only metadata and paradata, not actual resources.

We hope publishers have a QA process to make sure what is published is correct, including using test or private nodes not connected to the public network.

We expect all metadata and paradata to carry a license; for the LR public nodes this is CC0, and further, LR public nodes should not accept publication or distribution of documents that do not have CC0 terms. Using CC0 is essential for DMCA safe harbor in the US.

The LR network is patterned after mail and net news. It's a "fire and forget" model. It's like saying something in public -- you can't take it back and should have no expectation of doing so. Since the network is async, you have no idea who has received what you have published or what they are doing with it, so deleting it may be in the publisher's interest, but not the receivers'. The design is biased to favor the consumers.

The implication is that once you have published something under CC0 you lose control of what happens to the document. E.g., you publish, someone gets the document, you delete, they republish it. We anticipated this pattern would be used to re-inject "useful" documents into the network after their TTL has expired.

Administrative delete is by the "do not distribute" flag. We do not specify what happens except that the document is neither distributed nor available via access. In certain legal jurisdictions, a node operator may be required to retain the document to meet legal requirements.

By design there is no global control of LR and no single point of failure. Thus each node operator has to operate their nodes within their own legal framework, e.g., each node operator has to decide what is a valid takedown request.

There may be a technical limit on delete. Consider the following sequence (order is time-based):
1) Publish doc at node A
2) Distribute from node A to B and C: doc is now at A, B, C
3) Delete doc at node A: doc is now at B, C
4) Distribute from node C to A: doc is now back at A and C
5) Distribute from node A to B and C: doc is now at A, B, C
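A toy simulation of that sequence, modeling each node's store as a set of doc IDs and distribution as set union, shows the delete losing to any later distribution:

# Each node's store is a set of doc IDs; distribution is set union.
A, B, C = set(), set(), set()

A.add("doc")        # 1) publish at A
B |= A; C |= A      # 2) distribute A -> B, C
A.discard("doc")    # 3) delete at A
A |= C              # 4) distribute C -> A: the doc is back at A
B |= A; C |= A      # 5) distribute A -> B, C: the doc is everywhere again
assert "doc" in A and "doc" in B and "doc" in C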

LR works on eventual consistency, and since distribution is async, it may not be possible to delete everywhere.

Thus not being able to delete is considered an LR feature, not a bug, and paradata is the preferred mechanism for replacements, updates, etc., to metadata and paradata.

Jim Klo

Sep 7, 2012, 2:05:11 PM
to learning...@googlegroups.com, learnin...@googlegroups.com
Hi Scott,

Good use case. 

You're correct that the growth rate is not insignificant, but I'm not sure I totally agree with the math. I think it's on the right track; FWIW, all apps should be considered, and I'd argue that you're missing a multiplier for the average number of paradata records per app (there's no need to publish paradata for objects without change), which would most likely greatly decrease the rate for the average use case and in some rare cases increase it.

I think there's an assumption that everyone harvests everything directly from LR - and there are currently no publicly managed indexes on top of LR other than what's included in the default Data Services distribution. Several members I know of have massaged data within LR for their use case and built a managed index atop it. They continuously harvest and update their index according to business rules they've established - meaning they may delete or replace things in their index in whatever way is useful to them. For those building a custom index, yes, you would need to process every envelope at least once. But I'd argue that stores wishing to share similar paradata should be using a shared managed index.

Consider this: maybe store X allows anyone to rate a resource, qualified or unqualified, but publishes user qualifications with the ratings, while store Z only displays ratings performed by qualified educators. Store Z needs a different index than store X on top of LR to extract what's relevant; both have to process all records at least once, given the federated nature of the network.
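Store Z's business rule might be nothing more than a predicate applied while building its index (the field names and wording below are purely illustrative):

# Include a rating in store Z's index only if the actor is a qualified educator.
def store_z_accepts(paradata):
    actor = paradata.get("activity", {}).get("actor", {})
    return "qualified educator" in actor.get("description", [])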

Expiration dates do need to be dealt with - the existing implementation allows a TTL to be published, but it is not honored. This is relatively trivial and low-hanging fruit; anyone wishing to contribute a solution, please submit a pull request!
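A sweep that honors published TTLs might be as simple as the following (a sketch; the ttl_days field name and the timestamp format are assumptions, not the implemented schema):

from datetime import datetime, timedelta

# Expire an envelope once its assumed "ttl_days" has elapsed since publish.
def is_expired(envelope, now=None):
    now = now or datetime.utcnow()
    ttl_days = envelope.get("ttl_days")
    if ttl_days is None:
        return False  # no TTL published: keep the document indefinitely
    published = datetime.strptime(envelope["node_timestamp"],
                                  "%Y-%m-%dT%H:%M:%SZ")  # format assumed
    return now - published > timedelta(days=ttl_days)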

Delete and update are still the elephant in the room as far as how to deal with them in a federated manner. There are a lot of loaded issues here… yes, the spec supports them; the implementation does not, because not all the issues are resolved. E.g., who gets to delete/update a doc? Publisher? Owner? Curator? Someone else? None of these are required fields! So how do you trust the document modification? This points to the problem that update/delete may not be done consistently from node to node… if node A only trusts identities M, N, and O, and node B only trusts identities O, P, and Q, the only shared trusted identity for updates/deletes between the nodes is O! I think there's a mechanism for policy enforcement here that has yet to be defined before we can proceed with solving update/delete.

- Jim

Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International
t. @nsomnac

On Aug 31, 2012, at 4:42 AM, Scott Wilson <scott.brad...@gmail.com> wrote:

Here is my use case. 

I have an app store.

Users publish reviews, and the store records statistics (downloads, views, likes, embeds)

User reviews are published as paradata. Users typically don't update their own reviews much anyway. No problem.

For stats, to-date figures are published as paradata on a daily schedule.

The store aggregates stats from two partner stores.

Over time the total size of the paradata set grows as the product:

n * a * f * d * s

n: number of stores sharing paradata
a: number of apps in common between stores
f: frequency of paradata publication
d: age of the store (days)
s: mean size of paradata records

This means that with each day that passes, the total download size for syncing with LR will increase, and if the store is also growing, the rate of increase will itself increase :(

I can see the merit in retaining paradata over a longer period to support things like historical trends; however, on a practical level it's going to be a pain unless we can support updates or expiry dates, or be able to perform normalization at the node end.

Currently I perform normalization of updated records at the client end; this is going to get difficult in the longer term. Maybe not for big app store sites, but for individual devs wanting to create a badge showing "my app has x downloads!" it's not really going to work if they have to obtain around 1000 JSON objects in order to throw away the first 997 (and that's just after one year).
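As a sanity check on that figure, a toy calculation with my own numbers (three stores sharing stats for one app, publishing daily, for one year):

# Envelope count from the growth formula above, ignoring record size (s).
n, a, f, d = 3, 1, 1, 365   # stores, shared apps, publications/day, days
envelopes = n * a * f * d   # 1095 -- matches "around 1000 JSON objects"
print(envelopes)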

Steve Midgley

Sep 11, 2012, 2:02:34 AM
to learning...@googlegroups.com, learnin...@googlegroups.com
I'd add that I think the "update issue" is the biggest unsolved problem in this ecosystem of paradata and federation for us. Deletes are somewhat painful but relatively linear. Updates are complex and largely based on "opinions" (aka business rules).

In terms of log-scale growth of paradata, I think it's a problem I'd like to see us have. Not to brush it off of course -- I think the solution will be in several areas:

1) Shortened TTL on particular nodes handling paradata on this scale
2) Map/reduce jobs to convert noisy paradata into processed paradata (re-emitting on alternative keyword channels so that some groups can listen only to highly processed paradata, with just a few motivated nodes federating the raw paradata) - see the sketch after this list.
3) Filtered replication: to enable passing only certain kinds of envelopes among "willing" nodes that have the capacity and business motivation to share noisy paradata.
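For the second item, the reduction step could be as simple as collapsing raw events into per-day counts before republishing (a sketch with illustrative field names):

from collections import defaultdict

# Reduce raw per-event paradata to daily rollups before republishing.
def rollup(raw_events):
    counts = defaultdict(int)
    for event in raw_events:
        counts[(event["resource"], event["verb"], event["date"])] += 1
    return [{"resource": r, "verb": v, "date": d, "count": c}
            for (r, v, d), c in counts.items()]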

My thinking in the near-to-mid term is that it's probably better to collect noisy/log-scale paradata via Google Analytics, using custom domains to trap the data elements you're interested in, then do data dumps out of Google Analytics and push the processed paradata into LR for sharing. Google Analytics provides a ton of useful and free infrastructure to track web events. With a bit of careful planning, it's possible to use this infrastructure to do general data collection across the web.

Anyway those are my thoughts - I can't say I have it all figured out by any means - so I'd be curious what others are thinking along these lines.

Best,
Steve

Scott Wilson

Sep 17, 2012, 5:45:28 AM
to learning...@googlegroups.com, learnin...@googlegroups.com
Thanks Jim,

I think implementing TTL in the core LR node code would be the simplest way to address the issue without introducing new behaviours - even relatively long TTLs (e.g. one week) would significantly reduce load; in the example I used, it would mean a client processing around 7-10 JSON objects rather than 1000. 

I'd love to add TTLs to paradata published using SPAWS - any doc yet on how to do that? (OK, I know it's not implemented yet, but it's probably a good idea to start emitting data with TTL info as soon as possible.)

(I also agree with Steve that this is a "nice problem to have" :)

S