During the developer call last week, Steve Midgley raised the subject of using PostgreSQL (PG) as the data store for the LR.
This would obviously be a big step if pursued. In the context of the Credentials Transparency project, we believe the use of PG would, at a minimum, be very helpful.
I believe Steve suggested (and correct me if I am wrong, or please elaborate) that a starting point could be for the PG database to sit 'beside' the CouchDB store. This would mean there would be no disruption to the current site, as all current APIs would continue to function without change.
Ultimately, before considering PG as a full replacement, we would want to ensure that key functionality like the following is still maintained. This is a short list, not meant to be comprehensive:
1. Network of nodes
The current registry includes multiple nodes. Anyone can spin up a new node in a relatively short period of time.
Relatedly, what would the implications be of 'spinning up' a new node that uses PostgreSQL as the datastore?
2. Replication of nodes
An important related topic is the ability to replicate data between nodes. PostgreSQL does have replication capability, including streaming replication. Would replication between PostgreSQL instances be fairly simple and hands-off?
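As a concrete point of reference, PostgreSQL's built-in streaming replication can be monitored from the primary via the standard pg_stat_replication system view. A minimal sketch in Python, assuming psycopg2 and a placeholder connection string:

    # Check streaming-replication status from the primary.
    # pg_stat_replication is a standard PostgreSQL system view;
    # the connection string below is a placeholder.
    import psycopg2

    conn = psycopg2.connect("dbname=lr host=primary.example.org user=lr_admin")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT application_name, client_addr, state, sync_state
            FROM pg_stat_replication;
        """)
        for name, addr, state, sync in cur.fetchall():
            print(f"standby {name} at {addr}: state={state}, sync={sync}")
    conn.close()

On the standby side, streaming replication is configured in the server configuration rather than in SQL, so once it is set up, day-to-day operation should be largely hands-off.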
3. Existing APIs
A preference might be to maintain the current API methods to minimize disruption for current users. For example the publishing API might remain the same, but could evolve over time.
A useful feature of the current LR is being able to do a quick getRecord to view an LR document. This type of query is a very useful feature to maintain.
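For reference, a getRecord today is a single HTTP GET. A rough sketch in Python, where the node URL and doc_ID are placeholders and the parameter names (by_doc_ID, request_ID) are my recollection, to be verified against the node's harvest documentation:

    # Fetch a single LR envelope by doc_ID via the harvest service.
    import json
    import urllib.parse
    import urllib.request

    node = "https://node.example.org"      # hypothetical LR node
    doc_id = "<doc_ID>"                    # placeholder document ID
    qs = urllib.parse.urlencode({"by_doc_ID": "true", "request_ID": doc_id})
    with urllib.request.urlopen(f"{node}/harvest/getrecord?{qs}") as resp:
        envelope = json.load(resp)
    print(json.dumps(envelope, indent=2))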
Any change of this magnitude would not be done on a whim. I believe that the availability of a relational database like PostgreSQL would ultimately be very beneficial. However, the benefits must be considered significant before a project like this could be initiated.
LR Group - Thoughts?
Michael Parsons
Solution Architect
Southern Illinois University Carbondale
I think distribution and sharing (problem #2) is largely solved by the existing infrastructure, IF you're looking for these scenarios (rough sketches of these calls follow the list):
1. I just want to get my metadata out there for others to find. Publishing is fairly straightforward. Finding what I just published can be done by doc_ID (which is returned by the publish API), by URL (listrecords API), or by identity (submitter/signer – use the slice API).
2. I want to know everything published for a given date range (for example, calendar year 2015). The listrecords API will get you this.
3. I want all of Smithsonian’s (or NSDL’s, or whoever’s) resources (use the slice API).
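Rough sketches of scenarios 2 and 3 (scenario 1 is the same pattern, hitting the getrecord endpoint with a doc_ID). The node URL is hypothetical, and the endpoint and parameter names are my best recollection of the LR API, to be checked against the docs:

    import json
    import urllib.parse
    import urllib.request

    NODE = "https://node.example.org"      # hypothetical LR node

    def lr_get(path, params):
        """GET a JSON response from an LR service endpoint."""
        url = f"{NODE}{path}?{urllib.parse.urlencode(params)}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # 2. Everything published in calendar year 2015 (listrecords).
    year_2015 = lr_get("/harvest/listrecords",
                       {"from": "2015-01-01", "until": "2015-12-31"})

    # 3. Everything from one publisher, by identity (slice).
    smithsonian = lr_get("/slice", {"identity": "Smithsonian"})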
Where it gets tricky is if you want to find resources that others have shared by almost any other criteria. For example, with slice as I understand it (and maybe this has been updated since I last looked at slice, it's been a while!): if I'm looking for 8th grade English resources and I slice on 8th grade and English, I'm going to get every resource tagged 8th grade plus every resource tagged English, provided the publisher has put those in the keys array of the LR envelope. IOER does not do this: if we publish an 8th grade English resource, "English" *might* go in keywords, but 8th grade most certainly will not – it will be in the payload, indicated as a grade level. On top of this, a searcher is not going to expect an OR between 8th grade and English; they're expecting an AND.
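To make the OR-vs-AND point concrete, the client-side workaround today looks roughly like this: slice on one tag, then filter the results locally for the rest. The keys array is the envelope field mentioned above, and as noted, this cannot see tags that live only in the payload:

    def and_filter(envelopes, required_tags):
        """Keep only envelopes whose keys array contains every tag."""
        required = {t.lower() for t in required_tags}
        for env in envelopes:
            keys = {k.lower() for k in env.get("keys", [])}
            if required <= keys:   # set containment: all tags present
                yield env

    # e.g. hits = list(and_filter(slice_results, ["8th grade", "english"]))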
This is where problem #1 comes in. I would argue that searchers are less likely to ask "What has Smithsonian published?" and far more likely to be looking for resources they can use in their 8th grade English class. This is where a database comes in (and also an index such as Elasticsearch, which is very good at searching, if the developers have done their homework).
From a developer's perspective, it is not difficult to do any of the things I outlined in #1-3 above. Nor is it difficult to dump the entire LR to a set of JSON files (or XML, for that matter). A bit time-consuming, perhaps, but if you know how to program web requests and parse JSON or XML, the code can be written fairly easily. Where it gets complicated is understanding the payloads once you've harvested them. Speaking as the author of the IOER import, I can tell you that making sense of the payloads is more difficult than extracting the documents from the LR.
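For the curious, the dump loop is roughly this shape – assuming a hypothetical node, and that listrecords pages with a resumption_token (the exact name should be checked against the harvest docs):

    import json
    import urllib.parse
    import urllib.request

    NODE = "https://node.example.org"      # hypothetical LR node

    def dump_all(prefix="lr_dump"):
        """Page through listrecords and write each batch to a JSON file."""
        token, page = None, 0
        while True:
            params = {"resumption_token": token} if token else {}
            url = f"{NODE}/harvest/listrecords?{urllib.parse.urlencode(params)}"
            with urllib.request.urlopen(url) as resp:
                batch = json.load(resp)
            with open(f"{prefix}_{page:05d}.json", "w") as out:
                json.dump(batch, out)
            token = batch.get("resumption_token")
            page += 1
            if not token:
                break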
If I were a vendor who provides software to K-12 school districts to help them find resources, I would rather not have to harvest the entire LR in order to populate a database/index with relevant resources. I'd much rather be able to call an API that could do an OR on a field, and an AND between fields (so that I could say (Grade 9 OR Grade 10 OR Grade 11 OR Grade 12) AND English to get back resources suitable for high school English students). This is where PostgreSQL, possibly in conjunction with an index like Elasticsearch, would shine.
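The query shape I have in mind, against a hypothetical relational schema (a resources table with grade_level and subject columns extracted from harvested payloads):

    import psycopg2

    SQL = """
        SELECT url, title
        FROM resources
        WHERE grade_level IN ('9', '10', '11', '12')  -- OR within a field
          AND subject = %s;                           -- AND between fields
    """

    conn = psycopg2.connect("dbname=lr_index")        # placeholder
    with conn, conn.cursor() as cur:
        cur.execute(SQL, ("English",))
        rows = cur.fetchall()
    conn.close()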
From a publisher’s and consumer’s perspective, problems 1 and 2 are different sides of the same coin, in my opinion. It is pointless to share my metadata if nobody can find it. If my metadata is easy to find, then the LR can drive traffic to my site. And if I’m a commercial entity, more traffic means I have increased my chances to sell my users something. If I’m a non-commercial entity such as a grant-funded non-profit, and I have a grant to make these resources available, I’ve likely improved my performance by making my resources more easily found and used by teachers and students.
Now, having said all that, I am not discounting the need to make sure replication happens between nodes. That’s also an important piece of the puzzle that shouldn’t be left out. Like Microsoft SQL Server, PostgreSQL supports replication, so one might be able to leverage that, or even the existing replication code, to replicate data between nodes. You’d want to use a GUID as an ID rather than an int, but it definitely can be done with PostgreSQL.
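As a sketch of what GUID-keyed storage might look like (the table and columns are illustrative; gen_random_uuid() comes from PostgreSQL's pgcrypto extension):

    import psycopg2

    DDL = """
        CREATE EXTENSION IF NOT EXISTS pgcrypto;
        CREATE TABLE IF NOT EXISTS envelopes (
            doc_id     uuid PRIMARY KEY DEFAULT gen_random_uuid(),
            node_name  text NOT NULL,         -- originating node
            envelope   jsonb NOT NULL,        -- the LR envelope itself
            created_at timestamptz NOT NULL DEFAULT now()
        );
    """

    conn = psycopg2.connect("dbname=lr")      # placeholder
    with conn, conn.cursor() as cur:
        cur.execute(DDL)
    conn.close()

UUID keys mean rows minted on different nodes can be merged without collision, which is exactly why a GUID beats an int here.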
Jerome Grimmer
Applications Analyst
Southern Illinois University Carbondale
This email sent using 100% recycled electrons.
It would appear there is general agreement that implementing Postgres would be a positive move.
What are the next steps?
Would it be useful to plan a tech session to rough out a plan?
Or maybe the initial session would be to ensure there are no barriers, and then a plan to plan.
The first barrier would be to ensure that the project sponsors approve this step.
Joe/Steve - who would need to be involved in making this decision?
Per Jason's note, I briefly reviewed the information on BDR (Bi-Directional Replication for PostgreSQL), and the capability seems very promising.
Thoughts?
Michael Parsons
Solution Architect
Illinois workNet (TM)
Southern Illinois University Carbondale