How many documents are there REALLY in the Learning Registry?


Jerome Grimmer

Sep 28, 2012, 12:23:01 PM
to learnin...@googlegroups.com

I’ve pulled data with dates from 10/26/2011 thru 11/7/2011 (a period of about two weeks) from the LR for importing into our SQL database using the Basic Harvest.  I’m doing it in date order to get as much info as I can.  However, for this approximately two week period, I’ve received nearly 400,000 records from the Learning Registry.  http://node01.public.learningregistry.net/status reports there are 437,094 documents in the learning registry.  I’m expecting that the amount of data submitted each week remains fairly high for a period of at least a few weeks (or months) beyond what I’ve already pulled.  If my assumption is correct, there are a lot more documents in the LR than are being reported by status. 

 

My database experience has been with relational databases, not CouchDB or other NoSQL style databases. 

- How many documents are there REALLY in the LR?

- At what time frame can I expect the number of documents submitted to LR to drop significantly?

I need to know this so I can give a fairly accurate estimate to my boss of when we will be caught up with importing data from the LR into our SQL database and can just retrieve new items.  If there are really 437,094 documents in the LR, then I’m in pretty good shape.  If, on the other hand, there are ten times more records than that, then that would definitely be useful information to share with The Powers That Be, as they are chomping at the bit for the import to be caught up.

 

If there’s a CouchDB query that I could run against our node that would give me this information, I’d be okay with that too.

 

 

Jerome Grimmer

Southern Illinois University Carbondale

2450 Foundation Drive Suite 100

Springfield, IL

Phone: 217-786-3010 ext. 5857

Toll-free: 1-800-252-4822 ext. 5857

NOTE: My E-mail address has changed

jgri...@siuccwd.com

"Your words have power.  Use them wisely." --Unknown.

 

Steve Midgley

Sep 28, 2012, 4:00:21 PM
to learnin...@googlegroups.com
Hi Jerome,

I believe that your counts are actually correct. The bulk of the LR data set is from the initial kick off around the dates you pulled (we made a concerted effort to put a lot of data in all at once from a number of sources). It sounds approximately right that there are 40k new records since 11/7/11. Hopefully you find these new records are relatively high value envelopes, such as BetterLesson's lessons, some new Smithsonian content, contributions from the Physics Experiment Toolkit etc!

So I'd theorize that your assumption about flow of data is wrong, not the count of data.

More generally, to count data elements in a Couch/LR database, you can rely on the Futon admin interface to tell you - I think it does this pretty easily, if you have your own node. The 100% certain way is to write a map reduce which simply counts every record in the reduction (and emits every record in the map). The map would be null in value (the doc id comes along with each emitted row anyway) and the reduce phase would simply count every element it saw.
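As a rough sketch of the "ask the node" route: CouchDB answers a plain `GET` on a database URL with a JSON info document that includes a `doc_count` field (the same number Futon displays). The base URL, the database name `resource_data`, and the `fetchImpl` parameter below are assumptions for illustration; substitute whatever your own node uses. Note that `doc_count` also includes design documents, so it can run slightly above the number of envelopes.

```javascript
// Sketch only: read doc_count from CouchDB's database-info response.
// Assumed/hypothetical: the node URL, the database name, and that a
// fetch-compatible function is passed in as `fetchImpl`.

// Pull the count out of an already-parsed info document.
function parseDocCount(info) {
  if (!info || typeof info.doc_count !== "number") {
    throw new Error("response has no numeric doc_count");
  }
  return info.doc_count;
}

// GET {baseUrl}/{dbName} and return its doc_count.
async function fetchDocCount(baseUrl, dbName, fetchImpl) {
  const response = await fetchImpl(`${baseUrl}/${encodeURIComponent(dbName)}`);
  if (!response.ok) {
    throw new Error(`CouchDB returned HTTP ${response.status}`);
  }
  return parseDocCount(await response.json());
}
```

Usage would look something like `fetchDocCount("http://localhost:5984", "resource_data", fetch).then(console.log)` against your own node.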

If you want examples of what that would look like let me know, but for the purposes of your question I think the data count values you are looking at are accurate.

We might see significant increases in data as more contributors come online, especially around paradata contributions, but for purposes of scaling over the next year, I wouldn't think we'd see more than a tripling of current content.

The main storage issue is the size of the slice interface indices. So if you disable slice and remove those indices, you'll free up a ton of storage space (and your storage requirements will then grow roughly linearly with incoming data).

Helps?
Steve

--
---
This message is posted from the Google Groups "Learning Registry Developers List" group.
To post: learnin...@googlegroups.com
To unsubscribe: learningreg-d...@googlegroups.com

Jerome Grimmer

Sep 28, 2012, 4:08:43 PM
to learnin...@googlegroups.com

Hi Steve,

That is very helpful.  It is also a bit of a relief on my end to know that there’s probably only about 40,000 records I haven’t pulled and not 4 million!  I would be interested in seeing a map reduce that would do what you showed (with source code please :) ) so I can learn from it.

 

Jerome Grimmer

"Everybody is a genius.  But if you judge a fish on its ability to climb a tree, it will live its whole life believing that it is stupid." – Albert Einstein

Steve Midgley

Sep 28, 2012, 4:28:48 PM
to learnin...@googlegroups.com
This is an excellent guide for figuring this stuff out in Couch: http://guide.couchdb.org/draft/views.html

The chapter listed is what I used for sample code below. Before you start, you have to get your views and reductions into a design document. That's beyond this email but is very important for development. I really like CouchApp which provides a command line framework and directory structure to allow you to edit text files for your views/maps and have them automatically escaped and uploaded to CouchDB. Trying to manually create a couch design document is *nuts* and I wouldn't recommend it. For simple stuff, upload via Futon, otherwise invest in learning CouchApp, which is quite simple I think.
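For a sense of what a design document actually is: it is an ordinary JSON document whose `_id` starts with `_design/` and whose `views` object holds each map and reduce function as a string of source code. A minimal sketch follows; the names `sample` and `everything` are just placeholders, and the built-in `_count` reduce is swapped in here as one simple way to count rows regardless of the emitted value.

```javascript
// Sketch only: build a design document as a plain object.
// Hypothetical names: design doc "sample", view "everything".
function buildDesignDoc(designName, viewName, mapSource, reduceSource) {
  return {
    _id: `_design/${designName}`,
    language: "javascript",
    views: {
      // Functions are stored as strings; CouchDB evaluates them server-side.
      [viewName]: { map: mapSource, reduce: reduceSource },
    },
  };
}

const designDoc = buildDesignDoc(
  "sample",
  "everything",
  "function (doc) { emit(null, null); }", // one row per document
  "_count"                                // built-in reduce: counts rows
);

// This JSON is what would be PUT to /<database>/_design/sample.
console.log(JSON.stringify(designDoc, null, 2));
```

Once uploaded, the view would be read with a `GET` on `/<database>/_design/sample/_view/everything`. Tools like CouchApp or Futon generate and escape this JSON for you, which is why hand-writing it is rarely worth it.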

The specific part of the guide you want is: http://guide.couchdb.org/draft/cookbook.html#aggregate

To answer your question, I think you'd create a design doc, let's say "sample" and a view called "everything" - in that view you'll create a "map" and a "reduce" element:

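map:
function(doc) {
  emit(null, null); // note emitting nulls means that you're emitting only the doc id which is all you need
}

reduce: 
function(keys, values, rereduce) {
  // values are all null, so count the rows; on rereduce, add up the partial counts
  return rereduce ? sum(values) : values.length;
}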

I'm pretty sure that's what you want. Some other articles on aggregates in couchdb:  http://barkingiguana.com/2009/01/28/counting-tags-with-couchdb-and-map-reduce/ 
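One detail worth checking before trusting a count: a reduce that only sums `values` gives the document count only if the map emits `1` for each doc. With null values, the reduce has to count rows on the first pass and sum partial counts on rereduce. The snippet below is a local simulation in plain JavaScript, not CouchDB's view engine; `runView` and the local `sum` helper are hypothetical stand-ins for illustration.

```javascript
// Local simulation only: mimics a single map pass plus one reduce call.
// `sum` stands in for the helper CouchDB provides inside reduce functions.
function sum(values) {
  return values.reduce((total, v) => total + v, 0);
}

// Run mapFn over docs, collect emitted rows, then reduce them once.
function runView(docs, mapFn, reduceFn) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  docs.forEach((doc) => mapFn(doc, emit));
  return reduceFn(rows.map((r) => r.key), rows.map((r) => r.value), false);
}

const docs = [{ _id: "a" }, { _id: "b" }, { _id: "c" }];
const emitNulls = (doc, emit) => emit(null, null);

// Summing null values: every null counts as 0.
const sumOnly = (keys, values, rereduce) => sum(values);
// Counting rows, then summing partial counts on rereduce.
const countRows = (keys, values, rereduce) =>
  rereduce ? sum(values) : values.length;

const summedNulls = runView(docs, emitNulls, sumOnly);
const countedRows = runView(docs, emitNulls, countRows);
// Rereduce step: combine partial counts from two batches (2 docs + 1 doc).
const combined = countRows(null, [2, 1], true);

console.log(summedNulls); // 0
console.log(countedRows); // 3
console.log(combined);    // 3
```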

Steve