What is the design behind the Ueber

99 views
Skip to first unread message

Tim Colson

unread,
May 23, 2012, 8:06:40 PM5/23/12
to etherpad-open-...@googlegroups.com
I just connected EtherpadLite to MySQL (https://github.com/Pita/etherpad-lite/issues/153 really helped me solve the puzzle of connection problems!) and had a look at the data schema: one "store" table with "key" and "value" fields. That's it. 

Pulling data out seems to be difficult, and I imagine quite slow if there were 10,000 pads with thousands of revisions each, would you agree?

I see on the notes that one of the created libraries "ueberDB" is a "abstraction layer for databases. It turns every database into a simple key value store" -- my question is why? 
Why key/value NoSQL embedded in MySQL? 
Why not a NoSQL DB like MongoDB which appears to have native NodeJS connection?
Why a key/value approach at all? 

Is there a doc or old thread that talks about the motivation for the current design to help me better understand?

Thanks!
Tim

Peter Martischka

unread,
May 23, 2012, 8:33:45 PM5/23/12
to etherpad-open-...@googlegroups.com
see my talk about this http://youtu.be/V10XcUHfXW4?t=22m37s

John McLear

unread,
May 23, 2012, 8:37:38 PM5/23/12
to etherpad-open-...@googlegroups.com
Yep go with Peter's talk, should really clear that up for you :) 

Tim Colson

unread,
May 25, 2012, 8:56:25 PM5/25/12
to etherpad-open-...@googlegroups.com
Lol... I just got a chance to listen to the presentation. The two minutes on DB architecture was informative, funny, and a tiny bit scary. :-)

Would this be a fair summary:

The goal for data storage was least common denominator (i.e. key+value), commonly available, easy to setup (i.e. MySQL), and "slightly crazy"; however, with "limited responsibility comes limited power" (i.e. reporting challenges and scalability concerns).

So imagine I had say 30K active users, creating ~3 pads per day, with about 1000 changesets each... that would be ~ 100 MILLION rows of data...daily. 

Hmmm... suddenly I'm more than a little scared about the idea of a "like" query on the "key" field to get a list of pads.

Some ideas come to mind...

1) triggers in the DB which extract the data as it's inserted to create some tables.
(Not so good, because still have millions of rows of data in the store table)

2)  Re-think the key/value lowest common denominator thing... then fork and code :-)
MongoDB directly? Or maybe H2 DB with tables? Or MySQL with tables?

I'm just thinking out loud... can't hurt to talk it out before forking. :-)

Cheers,
Tim


On Wednesday, May 23, 2012 5:37:38 PM UTC-7, John McLear wrote:
Yep go with Peter's talk, should really clear that up for you :) 

Tim Colson

unread,
May 29, 2012, 6:46:22 PM5/29/12
to etherpad-open-...@googlegroups.com
I just created a note pad for a meeting. Four people joined. I took most of the notes, but two other authors contributed for a grand total of 1,877 revisions. 

Ouch. Doing the math... gets to 180,000,000 million rows.

What if there was only one entry per pad for all of the revisions line by line? 

That would bring the size down to a fixed 30,000 * N pads/day -- 90,000 per day assuming N=3.

I read in the Ueber DB doc that there is some cache going on anyway, so any reason to not cache the entire list of changesets for "active" pads? If a pad isn't updated in X minutes, clear the in memory objects to save space. 

-Tim

John McLear

unread,
May 29, 2012, 9:12:49 PM5/29/12
to etherpad-open-...@googlegroups.com
What are you actually trying to fix here?  Are you under the impression that larger records are better than lots of rows?

Tim Colson

unread,
May 29, 2012, 10:05:35 PM5/29/12
to etherpad-open-...@googlegroups.com


On Tuesday, May 29, 2012 6:12:49 PM UTC-7, John McLear wrote:
What are you actually trying to fix here?   
Are you under the impression that larger records are better than lots of rows?

Problem: need to list all active pads, and pads associated with a specific authors userid. 

From what I've seen, there are two proposed solutions:
1) read *all* rows and then grep out the keys that match 
2) do a like query that does a rowscan

That may work with a few million rows, but in production I'm anticipating 30,000-90,000 new pads each day. After a few tests of taking meeting notes, I see 1000-2000+ changesets, so the rowcount would quickly reach the hundreds of millions in a few days, billions in a couple weeks. 

Getting a list of pads using the methods above would be extremely slow with billions of rows, no?

-T  

John McLear

unread,
May 29, 2012, 10:10:07 PM5/29/12
to etherpad-open-...@googlegroups.com
I see your predicament.

Surely if they are active pads they are held in memory already so no need to query database?

Not sure about pads associated with a specific authorID.  I think that's a slightly heavier job and might require a new table at scale.

Eric Mill

unread,
May 30, 2012, 12:50:21 AM5/30/12
to etherpad-open-...@googlegroups.com

Instead of refactoring Etherpad's main storage model, how about duplicating the pad id and associated author ids in a second table? Nothing ever gets deleted, only added, right? So keeping them "in sync" should just be a matter of adding in writes in the right places.

-- Eric

Tim Colson

unread,
May 30, 2012, 2:39:49 PM5/30/12
to etherpad-open-...@googlegroups.com
@Eric --yes, earlier in the thread I mentioned how a DB trigger could populate additional tables w/o any change to core code. The main table could still grow to billions of rows, so any operation on it would probably be slow, wouldn't you agree? For example, delete all pads and changesets for pads created > N days ago.  

In the preso, sounded like the key-value store in MySQL was an experiment, intended to make it easy to swap out DBs, but the scale needs may outweigh that. 

-T

Tim Colson

unread,
May 30, 2012, 3:21:25 PM5/30/12
to etherpad-open-...@googlegroups.com
John wrote:
> Surely if they are active pads they are held in memory already so no need to query database?

Hi John, that's a good point. I'm new to the codebase, so I don't know much about how it works... can you help?

What specifically is kept in memory? (i.e. pads? authors? Changesets?)
How long is the info kept in memory? What would happen if there were 90,000 pads -- would they all be in memory?
What happens when a Pad has not been updated in 30 days -- is there still a session for it in memory?
How does the server track the version numbers for a pad and generate a new version number for a changeset?
What happens to the active pads, cached stuff, and version numbers after the server restarts?

I threw out an idea about storing all pad changesets in one big row or a big document (i.e. thinking mongodb), and you asked if I had the perception that would be better than millions/billions of rows. Not a DB expert here -- just asked because anytime a DB has to do a rowscan over tens of millions of rows, things have gotten sloooow. ;-)

I've been looking how many rows various DBs can handle, and most of the benchmark articles dive into how to partition data to improve performance. Partitioning is essentially making smaller tables out of a huge table, so the BTREE indexes are smaller, but with the composite keys, how would I partition the data?

Purpose built document stores like MongoDB seem to have extra magics to make it easier/faster/possible to work with composite keys and values, but since MySQL does not have that, it feels like EtherpadLite may be forcing a square peg in a round hole. Said another way, if you know MySQL or some other SQL DB will be used, why *not* create more than one table? Or if pads are stored as "documents", then why not require something like MongoDB?

FYI - I'm not trying to be all smarty-pants and ask rhetorical questions. I am genuinely interested in better understanding the thoughts behind the storage design, so thanks for the feedback!

Cheers!
-Tim

Eric Mill

unread,
May 30, 2012, 6:35:52 PM5/30/12
to etherpad-open-...@googlegroups.com
I wasn't suggesting DB triggers, that would, as you describe, not escape the core issue. I'm suggesting actual code changes in Etherpad where, any time it writes a new pad to the table, it does another separate write to a new pads table that can be queried more quickly later on.

-- Eric

Tim Colson

unread,
May 30, 2012, 7:32:17 PM5/30/12
to etherpad-open-...@googlegroups.com
How would core changes be different from a trigger for creating a separate table?

Eric Mill

unread,
May 30, 2012, 11:20:07 PM5/30/12
to etherpad-open-...@googlegroups.com
Because it would require no read or write operations on the main key/value table. This would be a separately maintained table of pad IDs and associated authors. Whenever the core code adds pad IDs to the main key/value table, also add a row to the pad table. When you want to list the pads, you only query the pad table, and you leave the hojillion-row key/value table alone.

I'll freely admit I've never looked at the Etherpad source code! I've been heretofore a lurker (though a user and enthusiastic evangelist for Etherpad). So maybe what I'm suggesting is missing something, but if your primary concern is not running unoptimized operations (like LIKE queries) on a large table, *and* it's easy to keep two tables in sync because data is only added and never removed, this might be a solution.

Tim Colson

unread,
May 30, 2012, 11:55:04 PM5/30/12
to etherpad-open-...@googlegroups.com
Thanks, Eric. I think a trigger on insert would be about the same - no read from the main table necessary (http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html) since the data is already available to the trigger. That said, the billions of rows would still exist, so if any other actions need to do a rowscan (ex. looking up all author IDs) would still be a concern. 

Speaking of "only added, never removed" -- I also must have administration features to remove pads based on specific business rules of create date & last update. (And those seem difficult to pull from the current key/value store too.)

-Tim

John McLear

unread,
May 31, 2012, 7:01:17 PM5/31/12
to etherpad-open-...@googlegroups.com
TIL about triggers.  Thanks but be careful, Etherpad Lite supports other databases too so I'd recommend a nodejs level/powered solution.

Peter Martischka

unread,
Jun 1, 2012, 6:07:47 AM6/1/12
to etherpad-open-...@googlegroups.com
Maybe we should really use something that can store json by nature.
Like mongo or couch. Even though we might mess up with MySQL people

Tim Colson

unread,
Jun 1, 2012, 2:02:03 PM6/1/12
to etherpad-open-...@googlegroups.com
> TIL about triggers. Thanks but be careful, Etherpad Lite supports other databases too so I'd recommend a nodejs level/powered solution.
TIL about the TIL acronym. :-)

Triggers would be DB layer only, so EPL wouldn't have a clue about the new tables which would just be a bonus feature for MySQL users.
That said - I think a trigger would only temporarily stem the pain because of my scalability concern assuming billions of rows in the store table, and because making changes in the DB layer only would be a Bad Idea(tm).

Writing code in the core brings up two paths...
Path 1 - Go down the road of Many Tables which leads to traditional ORM models and SQL
Path 2 - Go further down the road of document storage via a NoSQL server

I'm a traditional IT dude, so you'd think my choice would be #1 with MySQL, but my tiny developer voice is saying, "Go NoSQL...MongoDB would rock for this, and the mental model isn't all that different from the existing key/value code."

-Tim
P.S. I'm still genuinely interested in answers to the other questions I sent.

Reply all
Reply to author
Forward
0 new messages