LR Developer Community call 01/06/2016


joe hobson

Jan 4, 2016, 3:14:53 PM1/4/16
to learnin...@googlegroups.com
The bi-weekly LR Developer Community call is back – Wednesday, January 6th at 1pm Pacific / 4pm Eastern. To connect: 

Optional dial-in number: (716) 293-9163  –  PIN: 05415
If you have items you would like included on the agenda, send them to me or the list. Thanks. ... .joe

-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:
joe hobson
   Navigation North Learning

Steve Midgley

Jan 4, 2016, 3:17:10 PM1/4/16
to learnin...@googlegroups.com
Hi,

I'm getting off a plane about 30 minutes prior to our call. I'll try to call in but if the plane is late, you'll know why I'm not there! Sorry I missed our last call - I'll try to be there for this one.

Steve





Michael Parsons

Jan 11, 2016, 2:56:36 PM1/11/16
to learnin...@googlegroups.com

During the developer call last week, the subject of using PostgreSQL (PG) as the data store for the LR was raised by Steve Midgley.

This would obviously be a big step if pursued. In the context of the Credentials Transparency project, we believe this would be very helpful, as the use of PG would, at a minimum (see the sketch after this list):

  • allow for relationships between records/documents
  • enable more robust searching
  • ease use of document types other than resources
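
To make this concrete, here is a minimal sketch of what such a schema could look like in PG: a JSONB envelope table plus a relationship table. All table and column names are illustrative assumptions, not a proposed design.

    import psycopg2

    # Hypothetical minimal schema: one JSONB table for LR envelopes plus a
    # join table for typed relationships between documents. A GIN index on
    # the envelope supports the "more robust searching" point above.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS lr_document (
        doc_id     uuid PRIMARY KEY,
        doc_type   text NOT NULL,      -- resource, paradata, etc.
        envelope   jsonb NOT NULL,     -- the full LR envelope
        published  timestamptz DEFAULT now()
    );
    CREATE TABLE IF NOT EXISTS lr_relationship (
        from_doc   uuid REFERENCES lr_document (doc_id),
        to_doc     uuid REFERENCES lr_document (doc_id),
        rel_type   text NOT NULL,      -- e.g. 'describes', 'aligns-to'
        PRIMARY KEY (from_doc, to_doc, rel_type)
    );
    CREATE INDEX lr_document_envelope_idx
        ON lr_document USING gin (envelope);
    """

    conn = psycopg2.connect("dbname=lr")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute(SCHEMA)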

I believe that Steve (and correct me if I am wrong, or please elaborate) had suggested that a starting point could be for the PG database to sit 'beside' the CouchDB store. This would mean there would be no disruption to the current site, as all current APIs would continue to function without change.

 

Ultimately, before considering PG as a full replacement, we would want to ensure that key functionality like the following is maintained. This is a short list, not meant to be comprehensive:

 

1. Network of nodes

The current registry includes multiple nodes. Anyone can spin up a new node in a relatively short period of time.

Relatedly, what would the implications be of spinning up a new node that has PostgreSQL as the datastore?

2. Replication of nodes

An important related topic is the ability to replicate data between the nodes. PostgreSQL does have replication capability, including streaming replication. Would the replication between PostgreSQL instances be fairly simple and hands off?
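
As a first pass at the "hands off" question, replication health can be watched from the primary via the standard pg_stat_replication view. A sketch (placeholder DSN; the xlog function names are the 9.4-era ones, worth verifying against the version in use):

    import psycopg2

    # Sketch: check streaming-replication health from the primary node.
    conn = psycopg2.connect("dbname=lr host=primary.example.org")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT application_name, state, sync_state,
                   pg_xlog_location_diff(pg_current_xlog_location(),
                                         replay_location) AS lag_bytes
            FROM pg_stat_replication;
        """)
        for name, state, sync, lag in cur.fetchall():
            print(name, state, sync, "replay lag:", lag, "bytes")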

3. Existing APIs

A preference might be to maintain the current API methods to minimize disruption for current users. For example the publishing API might remain the same, but could evolve over time.

A useful feature of the current LR is being able to run a quick getRecord to view an LR document. This type of query is very much worth maintaining.
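
As a sketch of how that could carry over, here is a getRecord-style lookup served straight from PG while keeping a familiar URL shape. The route and the lr_document table are illustrative assumptions from the schema sketched above, not the actual LR API:

    from flask import Flask, jsonify, abort
    import psycopg2

    app = Flask(__name__)

    @app.route("/obtain/<doc_id>")
    def get_record(doc_id):
        # Look the envelope up by its ID; psycopg2 returns jsonb as a dict.
        conn = psycopg2.connect("dbname=lr")
        with conn.cursor() as cur:
            cur.execute("SELECT envelope FROM lr_document WHERE doc_id = %s",
                        (doc_id,))
            row = cur.fetchone()
        if row is None:
            abort(404)
        return jsonify(row[0])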

 

Any change of this magnitude would not be done on a whim. I believe that the availability of a relational database like PostgreSQL would ultimately be very beneficial. However, the benefits must be considered significant before a project like this could be initiated.


LR Group - Thoughts?


Michael Parsons
Solution Architect

Southern Illinois University Carbondale



Pat Lockley (Pgogy)

Jan 11, 2016, 5:05:33 PM1/11/16
to learnin...@googlegroups.com, Michael Parsons
I like the idea of using different databases. I wonder if the logical end point is a consistent URL structure with whatever language / DB underneath?


--
Pgogy Webstuff
pgogywebstuff.com

Steve Midgley

Jan 11, 2016, 9:00:06 PM1/11/16
to learnin...@googlegroups.com, Michael Parsons
Pat's point is significant and I think it's compatible but different from Michael's postgres point. I think it overlaps with Michael's point about federation and distribution.

The best foundation for distribution of metadata is URLs, as Pat rightly points out (IMO).

But at a high level I see two distinct issues. We want to make learning resource metadata both:
  1. Easy to find and access
  2. Easy to distribute and share
Point #1 is where we need some kind of database tool. CouchDB is doing "ok" right now, but it was a bad choice from the start (speaking as the person who made the choice) and if we can dump it, we should.

Point #2 is more where Pat's point comes in (and Michael's about federation/distribution). I think that Archive.org's "collections" infrastructure is a good candidate for solving this issue. To give a feel of the power of using archive.org collections, consider that Archive.org itself is publishing all the web crawls that it has conducted into a single collection: https://archive.org/details/widecrawl?&sort=-downloads&page=2 - that's a lot of web crawl data.

BUT, collections are imperfectly organized, so unless you want to download a lot of data, and parse it, it's not a good choice. This is the dilemma between problem #1 and problem #2.

People with problem #1 want an API that makes it easy to find resources they want. People with problem #2 want something that is easy to publish to, and that is easy to download in bulk from.

The current LR system "splits the baby" between problems 1 and 2. 

My assertion is that if we want to work more on problem #1, then we should use postgres on the bottom end.

But I think problem #2 should be solved first, and then we build up a database and a set of APIs on top of it (if we even need it).

I hope you all push back on this agenda I'm shopping! I think it makes sense, but I don't have a great track record of making strong tech decisions for this project up to now!

Best,
Steve



Regan, Damon

Jan 11, 2016, 10:27:43 PM1/11/16
to learnin...@googlegroups.com, Michael Parsons
Great thread.  I don't like to push back on your ideas, Steve, as I think you have a great track record.

But here is my push back: I think solving problem #1 (easy to find and access) is most important.  Problem #1 seems directly related to publishing quality metadata.  CouchDB makes it difficult (impossible?) to query for metadata immediately after publishing it.  Perhaps this is related to enabling more robust searching, which Michael stated as a reason for considering a move to PostgreSQL.  Immediately finding metadata after publishing seems very important.  It doesn't sound like robust searching if I can't find what I just published :)
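
To make the contrast concrete: in PG a committed write is immediately visible to queries, with no view-index lag to wait out. A toy sketch against the hypothetical lr_document table from earlier in the thread (the envelope content is made up):

    import json
    import uuid

    import psycopg2

    conn = psycopg2.connect("dbname=lr")  # placeholder DSN
    doc_id = str(uuid.uuid4())
    with conn, conn.cursor() as cur:
        # Publish a (made-up) envelope...
        cur.execute(
            "INSERT INTO lr_document (doc_id, doc_type, envelope) "
            "VALUES (%s, %s, %s)",
            (doc_id, "resource_data",
             json.dumps({"resource_locator": "http://example.org/lesson1"})))
        # ...and it is queryable right away, even within the same transaction.
        cur.execute("SELECT envelope FROM lr_document WHERE doc_id = %s",
                    (doc_id,))
        print(cur.fetchone())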

Best Regards,
Damon


-----------------------------------------------------
Damon Regan, Ph.D., Contractor
Advanced Distributed Learning (ADL) Initiative
Personal Mobile: 407-924-1238
damon.r...@adlnet.gov
https://www.linkedin.com/in/damonregan

Jerome Grimmer

Jan 12, 2016, 10:06:54 AM1/12/16
to learnin...@googlegroups.com

I think distribution and sharing (problem #2) is largely solved by the existing infrastructure, IF you’re looking for these scenarios:

1. I just want to get my metadata out there for others to find.  Publishing is fairly straightforward.  Finding what I just published can be done by doc_ID (which is returned by the publish API), by URL (listrecords API), or identity (submitter/signer – use the slice API).

2. I want to know everything published for a given date range (for example, calendar year 2015).  The listrecords API will get you this.

3. I want all of Smithsonian’s (or NSDL’s, or whoever’s) resources (use the slice API).

 

Where it gets tricky is if you want to find resources that others have shared by almost any other criteria.  For example, take slice as I understand it (and maybe this has been updated since I last looked, it’s been a while!): if I’m looking for 8th grade English resources and I slice on 8th grade and English, I’m going to get every resource tagged 8th grade, plus every resource tagged English, provided that the publisher has put those terms in the keys array of the LR envelope.  IOER does not do this: if we publish an 8th grade English resource, “English” *might* go in keywords, but 8th grade most certainly will not – it will be in the payload, indicated as a grade level.  On top of this, a searcher is not expecting an OR between 8th grade and English; they’re expecting an AND.

 

This is where problem #1 comes in.  I would argue that searchers are less likely to ask “What has Smithsonian published?” and far more likely to be looking for resources they can use in their 8th grade English class.  This is where a database comes in (and also an index such as Elasticsearch, which is very good at searching, if the developers have done their homework).

 

From a developer’s perspective, it is not difficult to do any of the things I outlined in #1-3 above.  It is not difficult to do a dump of the entire LR to a set of JSON files (or XML for that matter).  A bit time consuming, perhaps, but if you know how to program web requests and parse JSON or XML, code can be written fairly easily.  Where it gets complicated is understanding the payloads once you’ve harvested them.  Speaking as the author of the IOER import, I can tell you that making sense of the payloads is more difficult than extracting the documents from the LR.
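
To give a feel for how small that harvest loop is, here is a rough sketch.  The endpoint name and resumption-token paging are from memory of the LR docs, so treat those details as assumptions and check them against your node:

    import requests

    NODE = "https://node.learningregistry.org"  # placeholder node URL

    def harvest_all():
        params = {}
        while True:
            page = requests.get(NODE + "/obtain", params=params).json()
            for record in page.get("documents", []):
                yield record          # one resource's envelope(s)
            token = page.get("resumption_token")
            if not token:
                break
            params = {"resumption_token": token}

    for doc in harvest_all():
        pass  # making sense of the payload goes here -- the hard part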

 

If I were a vendor who provides software to K-12 school districts to help them find resources, I would rather not have to harvest the entire LR in order to populate a database/index with relevant resources.  I’d much rather be able to call an API that could do an OR on a field, and an AND between fields (so that I could say (Grade 9 OR Grade 10 OR Grade 11 OR Grade 12) AND English to get back resources suitable for high school English students).  This is where PostgreSQL, possibly in conjunction with an index like Elasticsearch, would shine.
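
That vendor query maps naturally onto PG’s JSONB operators, assuming grade levels and subject were normalized into known payload fields.  The field names below are hypothetical:

    import psycopg2

    # (Grade 9 OR Grade 10 OR Grade 11 OR Grade 12) AND English.
    # ?| is jsonb's "does the array contain any of these strings" operator.
    QUERY = """
        SELECT doc_id, envelope -> 'payload' ->> 'title' AS title
        FROM lr_document
        WHERE envelope -> 'payload' -> 'gradeLevel'
                  ?| array['Grade 9', 'Grade 10', 'Grade 11', 'Grade 12']
          AND envelope -> 'payload' ->> 'subject' = 'English';
    """
    conn = psycopg2.connect("dbname=lr")
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for doc_id, title in cur.fetchall():
            print(doc_id, title)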

 

From a publisher’s and consumer’s perspective, problems 1 and 2 are different sides of the same coin, in my opinion.  It is pointless to share my metadata if nobody can find it.  If my metadata is easy to find, then the LR can drive traffic to my site.  And if I’m a commercial entity, more traffic means I have increased my chances to sell my users something.  If I’m a non-commercial entity such as a grant-funded non-profit, and I have a grant to make these resources available, I’ve likely improved my performance by making my resources more easily found and used by teachers and students.

 

Now, having said all that, I am not discounting the need to make sure replication happens between nodes.  That’s also an important piece of the puzzle that shouldn’t be left out.  Like Microsoft SQL Server, PostgreSQL supports replication, so one might be able to leverage that, or even the existing replication code, to replicate data between nodes.  You’d want to use a GUID as an ID rather than an int, but it definitely can be done with PostgreSQL.
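
On the GUID point, a sketch of giving the hypothetical lr_document table node-safe UUID defaults (gen_random_uuid() comes from the pgcrypto extension in 9.4+):

    import psycopg2

    conn = psycopg2.connect("dbname=lr")
    with conn, conn.cursor() as cur:
        # UUIDs generated independently on each node won't collide the way
        # integer sequences would under multi-master replication.
        cur.execute("CREATE EXTENSION IF NOT EXISTS pgcrypto")
        cur.execute("ALTER TABLE lr_document "
                    "ALTER COLUMN doc_id SET DEFAULT gen_random_uuid()")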

 

Jerome Grimmer

Applications Analyst

Southern Illinois University Carbondale

This email sent using 100% recycled electrons.

Pat Lockley (Pgogy)

Jan 12, 2016, 1:33:25 PM1/12/16
to learnin...@googlegroups.com
From my take on metadata searching, all you will *reliably* get is a
title, description and keywords. The rest is a nice thing to hope for.

As such you don't really need structured metadata, as you can hit all
three at the same time. Xpert used to match those three, as does
Solvonauts (both mine) and I think that works ok.

So perhaps I am rambling about what metadata we expect, and what we get?
If what we get is free text searchable, then...?
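
Then PG's built-in full-text search may be enough on its own: it can hit all
three reliable fields at once, with no agreed structure beyond title /
description / keys. A sketch, where the envelope paths are assumptions:

    import psycopg2

    QUERY = """
        SELECT doc_id
        FROM lr_document
        WHERE to_tsvector('english',
                  coalesce(envelope -> 'payload' ->> 'title', '') || ' ' ||
                  coalesce(envelope -> 'payload' ->> 'description', '') || ' ' ||
                  coalesce(envelope ->> 'keys', ''))
              @@ plainto_tsquery('english', %s);
    """
    conn = psycopg2.connect("dbname=lr")
    with conn.cursor() as cur:
        cur.execute(QUERY, ("photosynthesis",))
        print(cur.fetchall())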


--
Pgogy Webstuff
pgogywebstuff.com

Jason Hoekstra

Jan 14, 2016, 2:46:46 PM1/14/16
to learnin...@googlegroups.com
Hi all -

Looks like I missed a really good call on a topic I think is really important, and I'm looking forward to exploring the idea.  For the past few calls, I've had the calendar on the wrong week, and on the last one I sat on Speek for 5 minutes talking to myself.  I'll try to get it together and file the LR invites into the right calendar entries in the future.

In the past, I have been involved in initiatives both trying to set up an LR node and, more importantly, to make use of metadata, and have run into the limitations mentioned here in the thread.  I love the Postgres idea: easy to set up (apt-get install) and fast relational database tech as a back-end database.  Indexing and searching are built in (the nature of relational), and in recent years it has gained many functions to handle XML and JSON natively.  Of course, it takes adherence to a schema to act on that data, but it seems like the LR/DCMI and longer-standing Dublin Core and LOM schemas are stable enough to model?

From a recent implementation of a digital content search for K-12: grade levels and standards, along with free-text title/description, are key for end-users to target material.  It seems like a perpetual chicken-and-egg problem: without effective tech to provide the last mile of searches for many LMSs/apps, publishers don't put priority on feeding in additional metadata.  My hope is that solving problem #1 provides more incentive to publish high-quality metadata; otherwise the LR doesn't add much over where general search engines are today.  From that same implementation, I've noticed end-users start with a free-text search, then go straight to a left nav to filter on attributes, which is what I'm basing the desirability of this on.
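
That left-nav filtering pattern is cheap to feed from PG's native JSON functions.  A facet-count sketch over a hypothetical gradeLevel array in the payload:

    import psycopg2

    # Count resources per grade level to populate the filter facets.
    QUERY = """
        SELECT grade, count(*)
        FROM lr_document,
             jsonb_array_elements_text(envelope -> 'payload' -> 'gradeLevel')
                 AS grade
        GROUP BY grade
        ORDER BY count(*) DESC;
    """
    conn = psycopg2.connect("dbname=lr")
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for grade, n in cur.fetchall():
            print(grade, n)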

Problem #2 seems a bit harder, as relational databases don't have the same cross-institution/boundary replication that Couch appears to have, but perhaps that develops after taking a look back at the use cases for sharing between public nodes and individual organizations.  A few years back, a few open-source products seemed to tackle distributed DB sync; maybe BDR is one of them?  It seems to have much of the same "node"-based, "multi-master" language that fits with Couch / the LR.
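
For reference, node setup in BDR 0.9 looked like a couple of SQL calls per node, if I'm reading the docs right -- treat the function names and parameters below as assumptions to verify, and the DSNs as placeholders:

    import psycopg2

    # First node creates the BDR group; each additional node joins it.
    first = psycopg2.connect("dbname=lr host=node1.example.org")
    first.autocommit = True
    with first.cursor() as cur:
        cur.execute("""
            SELECT bdr.bdr_group_create(
                local_node_name   := 'node1',
                node_external_dsn := 'host=node1.example.org dbname=lr');
        """)

    joiner = psycopg2.connect("dbname=lr host=node2.example.org")
    joiner.autocommit = True
    with joiner.cursor() as cur:
        cur.execute("""
            SELECT bdr.bdr_group_join(
                local_node_name   := 'node2',
                node_external_dsn := 'host=node2.example.org dbname=lr',
                join_using_dsn    := 'host=node1.example.org dbname=lr');
        """)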

Hope everyone is well,

Jason



Michael Parsons

Jan 14, 2016, 6:38:09 PM1/14/16
to learnin...@googlegroups.com

It would appear there is general agreement that implementing Postgres would be a positive move.

What are the next steps?

Would it be useful to plan a tech session to rough out a plan?


Or maybe the initial session would be to ensure there are no barriers, and then a plan to plan.

The first barrier would be to ensure that the project sponsors approve this step.

Joe/Steve - who would need to be involved in making this decision?



Per Jason's note, I briefly reviewed the information on BDR, and the capability seems very promising.


Thoughts?


Michael Parsons
Solution Architect

Illinois workNet (TM)
Southern Illinois University Carbondale






Pat Lockley (Pgogy)

Jan 15, 2016, 6:54:27 AM1/15/16
to learnin...@googlegroups.com

I think the "better than google" is somewhat of a red herring (too
english a cliche?) as I am not sure how people search for new resources
when teaching?

The chicken-and-egg (hopefully an international cliché) problem is something PostgreSQL (and possibly MySQL) could solve. Speaking as a rather atypical (in some senses, stop sniggering) ed tech person in the UK, we're all pretty much PHP / MySQL people. Most of our ed tech is LAMP-stack sort of stuff (Moodle, Mahara, WordPress, Drupal, Xerte), so if you could drop an LR node onto the same tech the organisation already has, you'd get more nodes. Also, if you could bundle the two together, then all of a sudden a tonne of data exists (paradata, mostly) which Google doesn't have.

I started to write a Moodle web service which could report on all web links shared on the VLE. If you had an LR node nearby talking to the VLE, I think you'd get a lot of data.

Pat

Jason Hoekstra

Jan 17, 2016, 1:45:29 PM1/17/16
to learnin...@googlegroups.com
With that in mind, as we talk about a shift to improve the LR with relational DB tech, could it be possible to also directly LR-enable other relational-DB-backed LMS products with a similar strategy?  As we look at the most popular LR APIs ("/slice" for discovery?), could that same API endpoint be enabled via a plug-in or module for these popular products?

If we can figure out the LR replication / metadata sync strategy from node to node with new tech, it seems an LR-API-enabled product could make its metadata available just the same as a node would.  Instead of an LMS pushing its metadata to the LR, it seems the LR would be able to pull metadata from products if LR nodes knew of their existence.  This would be in the hopes of making it easier for existing platforms to announce/list their metadata without the need to set up an integration on their end....



Jason Hoekstra

Jan 17, 2016, 1:49:30 PM1/17/16
to learnin...@googlegroups.com
Potentially we can use this coming week's LR call (if I have it right...) to take this up as the main topic?  To keep compatibility with existing LR solutions, I'm guessing we have to look at the interfaces/APIs that are most in use today, then work backwards to figure out how the underlying tech could work.