Hi,
I am building a web app for GAE that gives teachers a way to enter student grades and then analyze them, so they can spot struggling students earlier than a dropping period average would reveal them. I want to keep the app free as long as possible (at least during the pilot phase at my wife's school), and the price of membership as low as possible when it goes public. Since the app is very data-centric, that means optimizing the data storage model up front to minimize writes.
So I basically have a bunch of student information, a Course model representing a single section of a class, and many activities per course. Each activity has information about how its grade contributes to the various possible averages for the course, and of course there is a grade for each student in the course, with an optional comment on the grade. Finally, each student has a list of dates they were absent from the course, each with a code representing the type of absence.
The primary view of the app (probably 60% of the page loads) is a spreadsheet-like interface with students down the side, assignments across the top, and grades in the cells, with cells color-coded to represent general absence reasons (present, tardy, excused, unexcused). The secondary view (probably 20% of the page loads) is an attendance view which is another spreadsheet-like interface with students down the side, dates across the top, and color-coded cells with full absence codes. Another 10% or so of the app will be various charts which look at all the grades in a class and display them visually with various statistics. The remaining 10% or so will be various single-student queries retrieving all their demographic information, or all the grades for that single student.
So, most of us could write out the SQL model of this data in our sleep... the core of which is a Score entity with a single grade and foreign keys pointing to the Student, Activity, and Course (if I'm generous; fully normalized would require looking up the Activity to get the related Course), and an Absence entity with foreign keys pointing to the Student and Course. But of course, we also know the datastore is not SQL... so I'm trying to figure out just how far I can push the NoSQL model and denormalize my data like crazy.
The one thing I have going for me is that I can safely assume class sizes will never be larger than 50 (this is targeted at the K-12 market), and that I can, if needed, segment data into 9-week grading periods consisting of fewer than 100 activities. So my "spreadsheets" are at most 50 rows x 100 columns, and are almost always written by a single person, with the odd attendance code entered by an office worker. (Worst case, the whole class is absent on a field trip, and the office worker enters 50 absence codes into a "spreadsheet" at the same time the teacher is entering grades, since the teacher has no students to teach.)
So, if it's not painfully obvious yet, I'm thinking about storing the entire spreadsheet for a single section of a course in a single datastore entity. The Course entity will be an Expando with unindexed ListProperties of Student references, Activity references, and quick-reference copies of data from the Student and Activity entities (e.g. a list of student names the same size as the list of Student references, so I don't have to fetch all the Student entities just to get their names). Then there are expando ListProperties holding each student's absence dates/codes and each student's grades and comments per activity (i.e. AbsenceDate0 is a list of dates on which the student in position 0 of the Student list was absent, AbsenceCode0 is the list of corresponding codes for student 0, Score0 is a list of scores for student 0 corresponding to entries in the Activity list, and so on).

With an incredibly large, busy class of 50 students doing 100 activities in 9 weeks, the scores could reach 50 students x 100 activity grades x 8-byte float = 40 KB, and if everyone were absent every day, absences would add 50 students x 45 absences x (8-byte date + 10-byte code) = 40.5 KB. That leaves roughly 920 KB of the 1 MB entity size limit for additional data about the students and activities, as well as comments... and I could always store a reference to a Comment entity in the list instead of a string if the comments get too long.
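As a sanity check, here is a plain-Python mock of that parallel-list layout and the size arithmetic. The dict keys and function names are hypothetical stand-ins for the eventual Expando properties, not real GAE API calls:

```python
def make_course(student_keys, activity_keys):
    """Build the denormalized 'spreadsheet' record for one course section.

    Every list is parallel: row i of `scores` belongs to students[i],
    and column j of each row belongs to activities[j].
    """
    return {
        "students": list(student_keys),      # score rows, in order
        "activities": list(activity_keys),   # score columns, in order
        "scores": [[None] * len(activity_keys) for _ in student_keys],
        "absence_dates": [[] for _ in student_keys],  # per-student dates
        "absence_codes": [[] for _ in student_keys],  # parallel to dates
    }

def set_score(course, student_idx, activity_idx, score):
    """The granular update unit: one cell in the spreadsheet."""
    course["scores"][student_idx][activity_idx] = score

# The worst-case size estimate from the text:
score_bytes = 50 * 100 * 8           # 40,000 bytes of grades
absence_bytes = 50 * 45 * (8 + 10)   # 40,500 bytes of absence data
```

A real implementation would hang these lists off a db.Expando as unindexed properties, but the index bookkeeping is the same.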
By far the most common read case will be to read an entire course with all its grade and attendance data, package that up into a JSON object and send it off to the client for any further processing to be done in the browser. Teachers will generally grab the 6-8 courses they teach all at once, then look at it rendered 10 different ways (but require no further interaction with the server), then send short update messages back to the server to modify attendance or grade data. Administrators will concurrently query for students that are not marked as present, physically go find them, then update their absence code. Perhaps I need to differentiate between "Today's absence data" stored with the student, and "historical absence data" stored with the course. Then I could write queries on absence codes to support the administrators, and do a nightly process to transfer that data to the course records where they will show up as color codes behind the grades the next day.
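The today's/historical split could be sketched like this (record shapes, field names, and the "P" code are all hypothetical, just to make the idea concrete):

```python
PRESENT = "P"  # assumed code for "marked present"

def students_not_present(students):
    """The administrator's query: who isn't marked present right now?
    Works against small per-student records, not the big course entity."""
    return [s for s in students if s["today_code"] != PRESENT]

def nightly_rollup(course, students, date):
    """The nightly cron step: fold today's codes into the course's
    historical parallel lists, then reset for the next school day."""
    for i, s in enumerate(students):
        if s["today_code"] != PRESENT:
            course["absence_dates"][i].append(date)
            course["absence_codes"][i].append(s["today_code"])
        s["today_code"] = PRESENT
```

The point of the split is that the administrator query only ever touches the small, indexed per-student records, while the big course entity is written once per night instead of once per absence update.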
So the one type of report this doesn't optimize for is a parent checking on their student's performance. The app would have to load the Student entity to get the list of Courses the student is taking, then load all the data for all the students in all those courses, and return only the key/value pairs for that one student. However, all of that data will be loaded in a single AJAX call and cached: primarily in JavaScript variables on the client, secondarily with cache headers to keep overzealous parents who refresh the page every 10 seconds from stressing the servers, and finally in memcache on the server side, before ever having to go back to the datastore. Couple that with the fact that even though parents outnumber teachers 50 to 1, teachers use the site all day while parents generally check once a day or less, and optimizing for the teacher is clearly the important part.
Write caching strategy:
With that said, one of the common write activities is entering a set of grades for an entire activity. Teachers prefer to enter their grades and have them saved as they type, without having to hit a save button or recalculate scores or anything along those lines. I can emulate a lot of that on the client side, but I'm not going to put an AJAX call in an unload hook or hope that a timeout saving the data every minute doesn't bite me (what if they enter a whole assignment and immediately close the window? I can't lose anything). So the normal write request is "change the score for student 12 on activity 27 from None to 97". Most of the time, I'll get 50 of these in a row in rapid succession, one per student on a particular activity. If that turns into 50 datastore writes over the course of a few minutes, I've defeated the whole purpose of storing this data together. However, if I just use memcache to store the data and write it to the datastore periodically, there is always the possibility that memcache is evicted before I have time to persist the write.

So it looks like I need to update memcache, but also put the message in a pull queue. Then, periodically in a cron task, pull all the messages from the queue; for any that refer to objects still in memcache, put those objects in the datastore directly and delete the tasks. For any that aren't in memcache, get the object from the datastore, re-apply all the messages, and put it back in both the datastore and memcache. I'd also have to follow this procedure for cache misses in general. So there is the opportunity for memcache contention to cause some issues (especially near the end of the 9 weeks, when the objects are largest and the number of teachers concurrently entering grades is highest), but at least the pull-queue methodology helps batch some writes.
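The cron-side replay step is simple enough to sketch in plain Python. The message shape ({"student": 12, "activity": 27, "score": 97.0}) is my assumption for what a queued update would carry; the GAE pull-queue lease/delete calls are omitted:

```python
def apply_messages(course, messages):
    """Re-apply queued updates to a course fetched from the datastore,
    in arrival order. One datastore put of the result replaces what
    would otherwise be len(messages) individual writes."""
    for m in messages:
        course["scores"][m["student"]][m["activity"]] = m["score"]
    return course
```

Because each message is a full "set cell to value" rather than a delta, replaying them in arrival order onto a stale copy converges to the same spreadsheet as applying them live to a memcached copy, which is what makes the cache-miss path safe.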
And I wouldn't have to do the whole get -> process pull queue -> put cycle on a cache miss during a grade change (the data is still cached on the client), only when another person happens to request the course data, so it will probably still be efficient. For that matter, maybe the "eventual consistency" model is all I need: since I'm storing my updates as granularly as possible, if two people were working on a memcached version of the "spreadsheet" at the same time, a get -> process pull queue -> put cycle gives the same results as editing the memcached version, as long as I process the pull queue in the same order the requests came in. But then again, if a teacher makes some changes and then refreshes the page, the changes need to be there... so the real question is whether memcaching buys enough, or whether cache misses are so rare that the cache is just a waste of memory compared to a single datastore get plus a fetch from the pull queue (with a 0- or 1-second lease time, not actually writing the data back until the cron runs, just reading the updates). I guess the only way to make that decision is to write the app and test it. I'm betting that memcache takes memory within my instance that could otherwise go to concurrent requests, so I may end up saving myself a few pennies of datastore ops only to cost myself several dollars of extra instances.
I've not written any large apps for GAE, so some or all of this might be completely wrong... Veterans, let me know if I'm on the right track... assuming your eyes aren't bleeding from the wall of text I just threw at you...
Thanks,
David