Hi,
I am building a web app for GAE that gives teachers a way to enter student grades and then analyze them, so they can spot struggling students earlier than a dropping period average would reveal them. I want to keep the app free as long as possible (at least during the pilot phase at my wife's school), and the price of membership as low as possible when it goes public. Since the app is very data-centric, that means optimizing the data storage model up front to minimize writes.
So I basically have a bunch of student information, a Course model representing a single section of a class, and many activities per course. Each activity has information about how its grade contributes to the various possible averages for the course, and of course there is a grade for each student in the course, with an optional comment on the grade. Finally, each student has a list of dates they were absent from the course, each with a code representing the type of absence.
The primary view of the app (probably 60% of the page loads) is a spreadsheet-like interface with students down the side, assignments across the top, and grades in the cells, with cells color-coded to represent general absence reasons (present, tardy, excused, unexcused). The secondary view (probably 20% of the page loads) is an attendance view which is another spreadsheet-like interface with students down the side, dates across the top, and color-coded cells with full absence codes. Another 10% or so of the app will be various charts which look at all the grades in a class and display them visually with various statistics. The remaining 10% or so will be various single-student queries retrieving all their demographic information, or all the grades for that single student.
So, most of us could write out the SQL model of this data in our sleep... the core of which is a Score entity with a single grade and foreign keys pointing to the Student, Activity, and Course (if I'm generous; fully normalized would require looking up the Activity to get the related Course), and an Absence entity with foreign keys pointing to the Student and Course. But of course, we also know the datastore is not SQL... so I'm trying to figure out just how far I can push the NoSQL model and denormalize my data like crazy.
The one thing I have going for me is that I can safely assume class sizes will never be larger than 50 (this is targeted at the K-12 market), and that I can, if needed, segment data into 9-week grading periods consisting of fewer than 100 activities. So my "spreadsheets" are at most 50 rows x 100 columns, and are almost always written by a single person, with the odd attendance code entered by an office worker. (Worst case, the whole class is absent on a field trip, and the office worker enters 50 absence codes into a "spreadsheet" at the same time the teacher is entering grades, since the teacher has no students to teach.)
So, if it's not painfully obvious yet, I'm thinking about storing the entire spreadsheet for a single section of a course in a single datastore entity. The Course entity will be an Expando with unindexed ListProperties of Student references, Activity references, and quick-reference copies of data from the Student and Activity entities (e.g. a list of student names the same size as the list of Student references, so I don't have to fetch all the Student entities just to get their names). Then there are expando ListProperties holding each student's absence dates/codes and each student's grades and comments per activity (i.e. AbsenceDate0 is a list of dates on which the student in position 0 of the Student list was absent, AbsenceCode0 is the list of corresponding codes for student 0, Score0 is a list of scores for student 0 corresponding to entries in the Activity list, and so on).

With an incredibly large, busy class of 50 students doing 100 activities in 9 weeks, the scores could reach 50 students x 100 activity grades x 8-byte float = 40 KB, and if everyone were absent every day, absences would add 50 students x 45 absences x (8-byte date + 10-byte code) = 40.5 KB. That leaves roughly 920 KB of the 1 MB entity size limit for additional data about the students and activities, as well as comments... and I could always store a reference to a Comment entity in the list instead of a string if the comments get too long.
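As a sanity check, here is a plain-Python mock of that parallel-list layout and the size arithmetic. The dict keys and function names are hypothetical stand-ins for the eventual Expando properties, not real GAE API calls:

```python
def make_course(student_keys, activity_keys):
    """Build the denormalized 'spreadsheet' record for one course section.

    Every list is parallel: row i of `scores` belongs to students[i],
    and column j of each row belongs to activities[j].
    """
    return {
        "students": list(student_keys),      # score rows, in order
        "activities": list(activity_keys),   # score columns, in order
        "scores": [[None] * len(activity_keys) for _ in student_keys],
        "absence_dates": [[] for _ in student_keys],  # per-student dates
        "absence_codes": [[] for _ in student_keys],  # parallel to dates
    }

def set_score(course, student_idx, activity_idx, score):
    """The granular update unit: one cell in the spreadsheet."""
    course["scores"][student_idx][activity_idx] = score

# The worst-case size estimate from the text:
score_bytes = 50 * 100 * 8           # 40,000 bytes of grades
absence_bytes = 50 * 45 * (8 + 10)   # 40,500 bytes of absence data
```

A real implementation would hang these lists off a db.Expando as unindexed properties, but the index bookkeeping is the same.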
By far the most common read case will be to read an entire course with all its grade and attendance data, package that up into a JSON object and send it off to the client for any further processing to be done in the browser. Teachers will generally grab the 6-8 courses they teach all at once, then look at it rendered 10 different ways (but require no further interaction with the server), then send short update messages back to the server to modify attendance or grade data. Administrators will concurrently query for students that are not marked as present, physically go find them, then update their absence code. Perhaps I need to differentiate between "Today's absence data" stored with the student, and "historical absence data" stored with the course. Then I could write queries on absence codes to support the administrators, and do a nightly process to transfer that data to the course records where they will show up as color codes behind the grades the next day.
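The today's/historical split could be sketched like this (record shapes, field names, and the "P" code are all hypothetical, just to make the idea concrete):

```python
PRESENT = "P"  # assumed code for "marked present"

def students_not_present(students):
    """The administrator's query: who isn't marked present right now?
    Works against small per-student records, not the big course entity."""
    return [s for s in students if s["today_code"] != PRESENT]

def nightly_rollup(course, students, date):
    """The nightly cron step: fold today's codes into the course's
    historical parallel lists, then reset for the next school day."""
    for i, s in enumerate(students):
        if s["today_code"] != PRESENT:
            course["absence_dates"][i].append(date)
            course["absence_codes"][i].append(s["today_code"])
        s["today_code"] = PRESENT
```

The point of the split is that the administrator query only ever touches the small, indexed per-student records, while the big course entity is written once per night instead of once per absence update.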
So the one type of report this doesn't optimize for is a parent checking on their student's performance. The app would have to load the Student entity to get the list of Courses the student is taking, then load all the data for all the students in all those courses, and return only the key/value pairs for that one student. However, all of that data will be loaded in a single AJAX call and cached: primarily in JavaScript variables on the client, secondarily with cache headers to keep overzealous parents who refresh the page every 10 seconds from stressing the servers, and finally in memcache on the server side, before ever having to go back to the datastore. Couple that with the fact that even though parents outnumber teachers 50 to 1, teachers use the site all day while parents generally check once a day or less, and optimizing for the teacher is clearly the important part.
Write caching strategy:
With that said, one of the common write activities is entering a set of grades for an entire activity. Teachers prefer to enter their grades and have them saved as they type, without having to hit a save button or recalculate scores or anything along those lines. I can emulate a lot of that on the client side, but I'm not going to put an AJAX call in an unload hook or hope that a timeout saving the data every minute doesn't bite me (what if they enter a whole assignment and immediately close the window? I can't lose anything). So the normal write request is "change the score for student 12 on activity 27 from None to 97". Most of the time, I'll get 50 of these in a row in rapid succession, one per student on a particular activity. If that turns into 50 datastore writes over the course of a few minutes, I've defeated the whole purpose of storing this data together. However, if I just use memcache to store the data and write it to the datastore periodically, there is always the possibility that memcache is evicted before I have time to persist the write.

So it looks like I need to update memcache, but also put the message in a pull queue. Then, periodically in a cron task, pull all the messages from the queue; for any that refer to objects still in memcache, put those objects in the datastore directly and delete the tasks. For any that aren't in memcache, get the object from the datastore, re-apply all the messages, and put it back in both the datastore and memcache. I'd also have to follow this procedure for cache misses in general. So there is the opportunity for memcache contention to cause some issues (especially near the end of the 9 weeks, when the objects are largest and the number of teachers concurrently entering grades is highest), but at least the pull-queue methodology helps batch some writes.
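The cron-side replay step is simple enough to sketch in plain Python. The message shape ({"student": 12, "activity": 27, "score": 97.0}) is my assumption for what a queued update would carry; the GAE pull-queue lease/delete calls are omitted:

```python
def apply_messages(course, messages):
    """Re-apply queued updates to a course fetched from the datastore,
    in arrival order. One datastore put of the result replaces what
    would otherwise be len(messages) individual writes."""
    for m in messages:
        course["scores"][m["student"]][m["activity"]] = m["score"]
    return course
```

Because each message is a full "set cell to value" rather than a delta, replaying them in arrival order onto a stale copy converges to the same spreadsheet as applying them live to a memcached copy, which is what makes the cache-miss path safe.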
And I wouldn't have to do the whole get -> process pull queue -> put cycle on a cache miss during a grade change (the data is still cached on the client), only when another person happens to request the course data, so it will probably still be efficient. For that matter, maybe the "eventual consistency" model is all I need: since I'm storing my updates as granularly as possible, if two people were working on a memcached version of the "spreadsheet" at the same time, a get -> process pull queue -> put cycle gives the same results as editing the memcached version, as long as I process the pull queue in the same order the requests came in. But then again, if a teacher makes some changes and then refreshes the page, the changes need to be there... so the real question is whether memcaching buys enough, or whether cache misses are so rare that the cache is just a waste of memory compared to a single datastore get plus a fetch from the pull queue (with a 0- or 1-second lease time, not actually writing the data back until the cron runs, just reading the updates). I guess the only way to make that decision is to write the app and test it. I'm betting that memcache takes memory within my instance that could otherwise go to concurrent requests, so I may end up saving myself a few pennies of datastore ops only to cost myself several dollars of extra instances.
I've not written any large apps for GAE, so some or all of this might be completely wrong... Veterans, let me know if I'm on the right track... assuming your eyes aren't bleeding from the wall of text I just threw at you...
Thanks,
David