benchmarking script, blobs, etc

Mark Hammond

Aug 3, 2010, 4:03:51 AM
to raindr...@googlegroups.com
I've pushed a few changes and have 1 more ready to push. You will need
to delete your database...

There is now a benchmark-raindrop.py script which can be used to gather
some stats about raindrop's performance without needing to hit your imap
server. It can load messages from either the enron corpus or from a
'mailbox' file like that used by Thunderbird - ie, it should be capable
of loading any of your local Thunderbird folders, assuming you know
where they are stored locally :) Execute it with no arguments to get a
little help about the various options.
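For reference, Python's standard mailbox module reads the same mbox format Thunderbird stores folders in; a minimal sketch of loading messages that way (this is not the benchmark script itself, and the paths are made up):

```python
import mailbox
import os
import tempfile

# Build a tiny two-message mbox file so the sketch is self-contained;
# in practice you would point this at a real Thunderbird folder file,
# e.g. <profile>/Mail/Local Folders/INBOX
path = os.path.join(tempfile.mkdtemp(), "sample.mbox")
box = mailbox.mbox(path)
for subject in ("hello", "world"):
    msg = mailbox.mboxMessage()
    msg["From"] = "someone@example.com"
    msg["Subject"] = subject
    msg.set_payload("body text\n")
    box.add(msg)
box.flush()

# Iterating an mbox yields parsed email messages, which is all a
# benchmark loader needs in order to feed them through the pipeline.
subjects = [m["Subject"] for m in mailbox.mbox(path)]
print(subjects)  # ['hello', 'world']
```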

I've attached a patch for a functioning blob-store. I haven't checked
it in yet on the off-chance it breaks some things or causes problems.
Barring objections, I intend checking it in tomorrow morning so I can
deal with any issues it causes while we are all online. Notes on this
implementation:

* It is somewhat pluggable in that mogile, mogilelocal and sqlalchemy
blob stores are all supported with the default being mogilelocal storing
in ~/home/raindrop.blobs. You probably need to run setup.py again to
pick up this dependency (or just easy_install mogilelocal). The
sqlalchemy implementation uses a different DB than the main DB and may
get thrown away later.

* I failed to setup mogilefs on my linux box for testing, so some work
will need to be done before a "real" mogile works. I will probably end
up asking Gozer to setup a mogile install for me to test with.

* It is 20-50% slower than our old strategy of storing the blobs
directly in the same DB. Some of this is hopefully just Windows being
slow to open lots of small files - I expect other operating systems to
perform a little better. Note however that even the new sqlalchemy blob
store is slower than before, as we are no longer relying on the blob
store having transactional semantics, so the sqlalchemy version commits
and updates blobs more frequently. I think we probably need to move
ahead with this regardless, as there seems to be universal agreement
that storing the blobs in the core SQL DB isn't good.

* It is implemented by sub-classing a sqlalchemy 'Session' object -
thus, the same Session object we use for all object work now has 3
blob-related methods. All the magic "scoped session" stuff works as
expected.
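Roughly, the shape is something like this pure-Python sketch (the stand-in Session class and the three method names here are hypothetical; the actual patch subclasses sqlalchemy's Session):

```python
class Session:
    """Stand-in for sqlalchemy.orm.Session so this sketch runs
    without the real dependency."""
    pass

class BlobSession(Session):
    """A Session that also fronts the blob store. The three
    blob-related method names are made up for illustration."""
    def __init__(self, store=None):
        # In the patch the backing store would be mogile, mogilelocal
        # or sqlalchemy depending on the configured blob-store url.
        self._store = store if store is not None else {}

    def put_blob(self, key, data):
        self._store[key] = data

    def get_blob(self, key):
        return self._store.get(key)

    def delete_blob(self, key):
        self._store.pop(key, None)

session = BlobSession()
session.put_blob("markh/msg-1/raw", b"raw rfc822 bytes")
print(session.get_blob("markh/msg-1/raw"))  # b'raw rfc822 bytes'
```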

* Obviously this change now means the SQL database and the blob storage
must be treated by us as an "atomic" pair. Eg, where you previously
deleted 'raindrop.sqlite', you will now need to also remove the
directory tree 'raindrop.blobs'. Both the SQL DB and the blob storage
strategy and location are controlled by command-line params.
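Concretely, deleting an instance now means removing both halves; a sketch of what such a cleanup could look like (delete_instance is a hypothetical helper, not part of raindrop, and the paths follow the defaults mentioned here):

```python
import os
import shutil

# Hypothetical helper: destroy both halves of the "atomic" pair.
# Removing only one of them leaves orphaned state behind.
def delete_instance(db_path="raindrop.sqlite", blob_dir="raindrop.blobs"):
    if os.path.exists(db_path):
        os.remove(db_path)
    shutil.rmtree(blob_dir, ignore_errors=True)

# simulate an existing instance, then destroy it
open("raindrop.sqlite", "w").close()
os.makedirs("raindrop.blobs/ab/cd", exist_ok=True)
delete_instance()
print(os.path.exists("raindrop.sqlite"), os.path.exists("raindrop.blobs"))
# False False
```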

* Storm probably needs to grow a config option for the blob-store "url"
to use - although this probably isn't critical until storm starts
performing blob operations - until then, storm is just using the default
of "mogilelocal://dir=~/raindrop.blobs"

One more note for Shane:

* The guts of bootstrap.py have now been split into a helper function
poco.add_contactpoints_to_me() - bootstrap now just collects the
addresses and calls this function. I think I carried your recent
changes to bootstrap across (the tests all pass anyway ;)

That's all folks - let me know if there are any concerns, and I'll check
this in first thing tomorrow...

Cheers,

Mark

blob_store.patch

Philippe M. Chiasson

Aug 3, 2010, 11:37:28 AM
to raindr...@googlegroups.com, Mark Hammond
On 10-08-03 04:03 , Mark Hammond wrote:
> I've pushed a few changes and have 1 more ready to push. You will need
> to delete your database...
>
> There is now a benchmark-raindrop.py script which can be used to gather
> some stats about raindrop's performance without needing to hit your imap
> server. It can load messages from either the enron corpus or from a
> 'mailbox' file like that used by Thunderbird - ie, it should be capable of
> loading any of your local Thunderbird folders, assuming you know where
> they are stored locally :) Execute it with no arguments to get a little
> help about the various options.

I've run that, and it works, so excellent.

I'd like to start running this regularly, and tracking numbers. So for
now, the only question I have is whether we can decide on some generic
output format (or output file, really) that I can easily parse for the
metrics we should be tracking.

From:

Queue processing stats:
processed 3153 items in 198.0 seconds (62.791ms avg.)
extension statistics
raindrop.ext.msg.twitter_email_grouping - 654 calls in 0.9 seconds (1.36086ms avg.)
raindrop.ext.conv.summary - 654 calls in 4.9 seconds (7.55352ms avg.)
raindrop.ext.attach.bitly - 1191 calls in 0.1 seconds (0.0755668ms avg.)
raindrop.ext.msg.email_body - 654 calls in 2.5 seconds (3.83792ms avg.)
raindrop.ext.msg.links - 654 calls in 4.8 seconds (7.263ms avg.)
raindrop.ext.msg.email_grouping - 654 calls in 124.1 seconds (189.786ms avg.)
raindrop.ext.msg.email_envelope - 654 calls in 50.5 seconds (77.2018ms avg.)
raindrop.ext.msg.mailing_list - 654 calls in 2.2 seconds (3.33333ms avg.)
raindrop.ext.msg.email_conv - 654 calls in 4.3 seconds (6.6208ms avg.)
(renv)[gozer@huigui raindrop-reboot]$

To:

raindrop.items.processed.qty: 3153
raindrop.items.processed.time: 198s
raindrop.items.processed.avg: 62.791ms
raindrop.ext.conv.summary.calls: 654
raindrop.ext.conv.summary.time: 4.9s
[...]

Or some similar output format.


> I've attached a patch for a functioning blob-store. I haven't checked
> it in yet on the off-chance it breaks some things or causes problems.
> Barring objections, I intend checking it in tomorrow morning so I can
> deal with any issues it causes while we are all online. Notes on this
> implementation:

Sounds great.

> * It is somewhat pluggable in that mogile, mogilelocal and sqlalchemy
> blob stores are all supported with the default being mogilelocal storing
> in ~/home/raindrop.blobs. You probably need to run setup.py again to
> pick up this dependency (or just easy_install mogilelocal). The
> sqlalchemy implementation uses a different DB than the main DB and may
> get thrown away later.

How do I configure that?

> * I failed to setup mogilefs on my linux box for testing, so some work
> will need to be done before a "real" mogile works. I will probably end
> up asking Gozer to setup a mogile install for me to test with.

Sure thing; trying to figure out what would work best for you to test
against. Would some publicly accessible trackers you can point your
mogilefs client at be good enough? (I am building a VMware image with a
mogilefs server-in-a-box, but I'm not sure how long it will take or how
big it will be.)

> * It is 20-50% slower than our old strategy of storing the blobs
> directly in the same DB. Some of this is hopefully just Windows being
> slow to open lots of small files - I expect other operating-systems to
> perform a little better. Note however that even the new sqlalchemy blob
> store is slower than before as we are no longer relying on the blob
> store having transactional semantics, so the sqlalchemy version commits
> and updates blobs more frequently. I think we probably need to move
> ahead with this regardless, as there seems universal agreement that
> storing the blobs in the core SQL DB isn't good.
>
> * It is implemented by sub-classing a sqlalchemy 'Session' object -
> thus, the same Session object we use for all object work now has 3
> blob-related methods. All the magic "scoped session" stuff works as
> expected.
>
> * Obviously this change now means the SQL database and the blob storage
> must be treated by us as an "atomic" pair. Eg, where you previously
> deleted 'raindrop.sqlite', you will now need to also remove the
> directory tree 'raindrop.blobs'. Both the SQL DB and the blob storage
> strategy and location are controlled by command-line params.

I wonder if we need an admin tool to delete a database, one that not
only nukes the database, but iterates over the blobs and deletes them
too. Without it, destroying an instance will leave blobs hanging around
forever.

> * Storm probably needs to grow a config option for the blob-store "url"
> to use - although this probably isn't critical until storm starts
> performing blob operations - until then, storm is just using the default
> of "mogilelocal://dir=~/raindrop.blobs"

Indeed, and that configuration needs to live inside the database too, if
at all possible. Ideally in a way that is easy for someone like me to
change on the fly (via simple DB table access or an admin tool).



Mark Hammond

Aug 3, 2010, 10:13:05 PM
to raindr...@googlegroups.com
Saw the .torrent you created for mogile - thanks! Although it appears
you aren't seeding it :)

Cheers,

Mark

Mark Hammond

Aug 3, 2010, 10:15:08 PM
to raindr...@googlegroups.com, Philippe M. Chiasson
Oops - sorry about that - the message was intended just for Gozer...

Mark Hammond

Aug 4, 2010, 1:47:28 AM
to raindr...@googlegroups.com, Philippe M. Chiasson
On 4/08/2010 1:37 AM, Philippe M. Chiasson wrote:
> I'd like to start running this regularly, and tracking numbers. So for
> now, the only question I have is if we can decide on some generic output
> format (or output file, really) that I can easily parse for metrics we
> should be tracking.

Would JSON work for you? I'm thinking that later, when we have much
richer statistics, JSON offers both flexibility and the ability to
parse without formalizing a one-size-fits-all structure for stats.
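For instance, the numbers from the benchmark run earlier in the thread could be emitted as something like this (the field names are just a suggestion, not the script's actual output):

```python
import json

# One possible JSON shape for the existing queue-processing stats;
# every name here is a suggestion, not settled output format.
stats = {
    "queue": {"processed": 3153, "seconds": 198.0, "avg_ms": 62.791},
    "extensions": {
        "raindrop.ext.conv.summary": {
            "calls": 654, "seconds": 4.9, "avg_ms": 7.55352},
        "raindrop.ext.attach.bitly": {
            "calls": 1191, "seconds": 0.1, "avg_ms": 0.0755668},
    },
}
report = json.dumps(stats, indent=2)
print(report)
```

Anything that can json.loads() the file can then pick up new metrics without parser changes.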

>> * It is somewhat pluggable in that mogile, mogilelocal and sqlalchemy
>> blob stores are all supported with the default being mogilelocal storing
>> in ~/home/raindrop.blobs. You probably need to run setup.py again to
>> pick up this dependency (or just easy_install mogilelocal). The
>> sqlalchemy implementation uses a different DB than the main DB and may
>> get thrown away later.
>
> How do I configure that?

I'm not sure what you are asking here - I need to do a little more work
with a real mogile store before it will work and your VMware image
should help there. Once this is in place, you will configure the use of
a real mogile server by way of a command-line option
--blobs=mogilefs://domain=raindrop&trackers=http://server1/&trackers=http://server2/
etc.

> Wonder if we need an admin tool to delete a database, that not only
> nukes the database, but iterates over blobs and deletes them too.
> Without it, destroying an instance will leave blobs hanging around forever.

Yeah - we already have that largely in place (the test suite uses it),
although all it does currently is drop all tables etc. from the DB
without actually removing it. I expect that any such admin tool for
deleting an instance would want to delete the DB itself rather than
just the content? I'm not aware of a way to have sqlalchemy create or
delete the database itself (but I haven't really looked) - eg, at the
moment you need to use the mysql admin tool to create the DB itself
before raindrop can work with mysql.

>> * Storm probably needs to grow a config option for the blob-store "url"
>> to use - although this probably isn't critical until storm starts
>> performing blob operations - until then, storm is just using the default
>> of "mogilelocal://dir=~/raindrop.blobs"
>
> Indeed, and that configuration needs to live inside the database too, if
> at all possible. Ideally in a way that is easy for someone like me to
> change on the fly (via simple DB table access or admin tool)

I think this needs a little more thought. Specifically:

* We will already be relying on the middleware to set a
X-Raindrop-Database or similar header so the correct database is
operated on. At first glance it would seem reasonable that the blob
store configuration is passed using the same mechanism. It then becomes
the task of the middleware layer to have an admin option to change this
on the fly, just as it will need for the DB.

* On the other hand though, I think we decided a single mogile "domain"
would be used for all users - IOW, my attachments might be available via
http://mogileserver/markh while yours might be available via
http://mogileserver/gozer. Ignoring for the moment the security issues
involved with this, it means that the same mogile instance would be used
for all requests. Thus, the location of the mogile stores could still
be configured by way of a command-line option as it is independent of
the actual request. Changing the location of the store then means
restarting those processes, but that is probably better than needing to
perform a DB lookup on each request just to find the mogile server is
still the same as it was for every other request previously.

* If we do wind up with a mogile instance/domain per user, we are just
back at step 1 - the same tool which tells us the specific DB to use for
the request can also tell us the location of the blob store.

I hope I'm not misunderstanding something...

Thanks,

Mark

Philippe M. Chiasson

Aug 5, 2010, 11:19:01 AM
to Mark Hammond, raindr...@googlegroups.com
On 10-08-04 01:47 , Mark Hammond wrote:
> On 4/08/2010 1:37 AM, Philippe M. Chiasson wrote:
>> I'd like to start running this regularly, and tracking numbers. So for
>> now, the only question I have is if we can decide on some generic output
>> format (or output file, really) that I can easily parse for metrics we
>> should be tracking.
>
> Would JSON work for you? I'm thinking that later, when we have much
> richer statistics, JSON offers both flexibility and the ability to
> parse without formalizing a one-size-fits-all structure for stats.

Absolutely, I was just throwing out an example. The idea is that I'd
love to be able to just start tracking whatever metrics come out of
there without having to tweak my parser for them each time. JSON should
be just fine.

>>> * It is somewhat pluggable in that mogile, mogilelocal and sqlalchemy
>>> blob stores are all supported with the default being mogilelocal storing
>>> in ~/home/raindrop.blobs. You probably need to run setup.py again to
>>> pick up this dependency (or just easy_install mogilelocal). The
>>> sqlalchemy implementation uses a different DB than the main DB and may
>>> get thrown away later.
>>
>> How do I configure that?
>
> I'm not sure what you are asking here - I need to do a little more work
> with a real mogile store before it will work and your VMware image
> should help there. Once this is in place, you will configure the use of
> a real mogile server by way of a command-line option
> --blobs=mogilefs://domain=raindrop&trackers=http://server1/&trackers=http://server2/
> etc.

Hrm, the web interface/API stuff will also need to know where mogilefs
is for that account; wouldn't it make more sense to store that as part
of the account's configuration in the database?

>> Wonder if we need an admin tool to delete a database, that not only
>> nukes the database, but iterates over blobs and deletes them too.
>> Without it, destroying an instance will leave blobs hanging around
>> forever.
>
> Yeah - we already have that largely in place (the test suite uses it) -
> although all it does currently is drop all tables etc from the DB
> without actually removing it - I expect that any such admin tool for
> deleting an instance would want to delete the DB itself rather than just
> the content? I'm not aware of a way to have sqlalchemy create or delete
> the database itself (but I haven't really looked) - eg, at the moment
> you need to use the mysql admin tool to create the DB itself before
> raindrop can work with mysql.

Yeah, and there is certainly a permission issue with dropping a
database, but dropping all tables in it should be good enough?

>>> * Storm probably needs to grow a config option for the blob-store "url"
>>> to use - although this probably isn't critical until storm starts
>>> performing blob operations - until then, storm is just using the default
>>> of "mogilelocal://dir=~/raindrop.blobs"
>>
>> Indeed, and that configuration needs to live inside the database too, if
>> at all possible. Ideally in a way that is easy for someone like me to
>> change on the fly (via simple DB table access or admin tool)
>
> I think this needs a little more thought. Specifically:
>
> * We will already be relying on the middleware to set a
> X-Raindrop-Database or similar header so the correct database is
> operated on. At first glance it would seem reasonable that the blob
> store configuration is passed using the same mechanism. It then becomes
> the task of the middle-ware layer to have an admin option to change this
> on the fly, just as it will need for the DB.

Yeah, that's not a bad point. And after thinking about it for a bit
more, it's probably even easier to have a static, global configuration
for mogilefs trackers, and possibly look at an
X-Raindrop-Mogile-Trackers: header just in case.

> * On the other hand though, I think we decided a single mogile "domain"
> would be used for all users - IOW, my attachments might be available via
> http://mogileserver/markh while yours might be available via
> http://mogileserver/gozer. Ignoring for the moment the security issues
> involved with this, it means that the same mogile instance would be used
> for all requests. Thus, the location of the mogile stores could still
> be configured by way of a command-line option as it is independent of
> the actual request. Changing the location of the store then means
> restarting those processes, but that is probably better than needing to
> perform a DB lookup on each request just to find the mogile server is
> still the same as it was for every other request previously.

Yes, you are right. Single domain for all of hosted raindrop (or one
domain per-datacenter, but we are *so* not there yet...).

The security issue is not an issue: security is not part of mogile and
has to be implemented on top of it, so no biggie there. Either the web
front-end or my first layer will be handling that part.

> * If we do wind up with a mogile instance/domain per user, we are just
> back at step 1 - the same tool which tells us the specific DB to use for
> the request can also tell us the location of the blob store.

Yeah, I love the idea of a globally configured mogilefs tracker list,
and the possibility to alter that list with a http header.

Sounds excellent!

As for the mogilefs namespacing stuff, I suspect using the already
unique extension names as part of the key will be a good idea.

We still have to figure out how we will uniquely identify users. More
and more, it feels to me like we just need a unique integer id (or a
UUID) type of thing for users, but something that isn't derived from
any of the account's properties, since these might change.

However, for some stuff, the id of the source might make more sense.

Raw e-mails from go...@gmail.com could be stored with 'go...@gmail.com'
as part of the key (same for twitter), and the fact that raindrop
account 1234 has access to go...@gmail.com (via the OAuth dance) is
merely a temporary association.

Hope that makes sense.


Mark Hammond

Aug 5, 2010, 7:12:21 PM
to raindr...@googlegroups.com, Philippe M. Chiasson
On 6/08/2010 1:19 AM, Philippe M. Chiasson wrote:
...

> Hrm, the web interface/API stuff will also need to know where mogilefs
> is for that account; wouldn't it make more sense to store that as part
> of the account's configuration in the database?

At the moment, raindrop doesn't have any database records for the
'raindrop account'. It does store the list of imap/twitter/etc accounts
associated with the raindrop instance, but nothing about the raindrop
instance itself. I guess we could simply call it a "preference" though...

> Yeah, and there is certainly a permission issue with dropping a
> database, but dropping all tables in it should be good enough?

It is good enough for me, sure :) I was assuming you would not like
having many "stale" empty databases hanging around (ie, they would
always appear in your mysql admin tools), but if it is good enough for
you it certainly works for me.

> The security issue is not an issue: security is not part of mogile and
> has to be implemented on top of it, so no biggie there. Either the web
> front-end or my first layer will be handling that part.

I'm a little concerned about this. When the back-end itself is hitting
the blob store, the request will not go via this middleware. In
practice, this means that all users' emails will be one simple HTTP
request away from the back-end, with no security at all to stand in the
way.

However, I'm happy for you to own this part of the world, so now that
I've raised that concern I'll shut up about it ;)

> We still have to figure out how we will uniquely identify users. More
> and more, it feels to me like we just need a unique integer id (or a
> UUID) type of thing for users, but something that isn't derived from
> any of the account's properties, since these might change.
>
> However, for some stuff, the id of the source might make more sense.
>
> Raw e-mails from go...@gmail.com could be stored with 'go...@gmail.com'
> as part of the key (same for twitter), and the fact that raindrop
> account 1234 has access to go...@gmail.com (via the OAuth dance) is
> merely a temporary association.

The ID of the source doesn't really make sense to the rest of the
raindrop architecture - by the time we need the blobs in the back-end
we've completely lost any concept of where the message came from. While
we could query to determine that, it doesn't seem to buy much.

Also complicating things is that the same message-id could appear in
multiple accounts - raindrop will (by design) only have 1 copy of that
message.

That said though, I'm happy for the middleware to provide any user ID it
chooses, so long as that user ID is stable. So, along with the location
of the DB, the middleware should provide the opaque user ID, and this
user ID will be prefixed to the blob store keys. I guess we could
design that header-passing system to use the same user ID in both the
DB and the store - eg:

X-Raindrop-DB-Prefixes: mysql://somehost mysql://backuphost
X-Raindrop-Blob-Prefix: mogilefs://... mogilefs://...
X-Raindrop-UserID: markh

and we would piece together the DB name and full blob store key from
that info.
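A rough sketch of that piecing-together (the header names follow the example just above; the joining rules, the function, and the raindrop_<user> DB naming are all assumptions, not settled design):

```python
def resolve_request(headers, blob_key):
    """Derive the per-request DB url and full blob-store key from the
    middleware headers. Joining rules here are hypothetical."""
    user = headers["X-Raindrop-UserID"]
    db_prefix = headers["X-Raindrop-DB-Prefixes"].split()[0]
    blob_prefix = headers["X-Raindrop-Blob-Prefix"].split()[0]
    db_url = "%s/raindrop_%s" % (db_prefix, user)
    full_key = "%s/%s" % (user, blob_key)  # opaque user ID prefixes the key
    return db_url, blob_prefix, full_key

headers = {
    "X-Raindrop-DB-Prefixes": "mysql://somehost mysql://backuphost",
    "X-Raindrop-Blob-Prefix": "mogilefs://domain=raindrop",
    "X-Raindrop-UserID": "markh",
}
print(resolve_request(headers, "msg-1234/raw"))
```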

Sound reasonable?

Cheers,

Mark
