
Max Gibiansky

Aug 13, 2013, 4:39:21 AM
to council...@googlegroups.com
Just checking up on the progress. How are things going with the scraper? Is there anything I can help with?

Mike McCallister

Aug 14, 2013, 2:30:08 AM
to council...@googlegroups.com
Hey Max,

At this point, I've been doing work in the "vagrant" branch in my repository ( https://github.com/mikemccllstr/dominionstats/tree/vagrant). I've been trying to simplify the environment setup and application build process so that it's easier to test and deploy. After the commits I just made, I think I've now got it to the point where the server setup and build process are scripted and repeatable.

Now that I can do testing in a local/private environment without risking disruption to the production site or database, I expect to be able to make progress on updating the analysis jobs to handle the Goko-scraped games.


Mike



Max Gibiansky

Aug 21, 2013, 5:25:26 PM
to Mike McCallister, Rob Renaud, council...@googlegroups.com
That does sound like a better way of organizing cr data, and also a much larger piece of work.

Mainly I'm worried about the time it would take to do all of that, relative to the simpler path of updating the current scanner system. I'll definitely try to help out if that's the path you decide to take, though.


On Wed, Aug 21, 2013 at 12:26 AM, Mike McCallister <mi...@mccllstr.com> wrote:
Hey Max,

Here's the "big task" on my mind... Getting the scanner working is probably pretty doable and maybe even trivial if we come up with a new data field that is monotonically increasing, such as a "game start time" or "game end time".

But a big flaw with the current approach to analyzing the data is that there are several data elements that are calculated as the "sum of all data up to a given point in time". This means that the result stored in the DB depends on all ~20M games that have been played to date. This makes that result pretty precious, because it is so difficult and costly to recalculate. And it is hard to go back in time and fix missing data (e.g., fix a parser bug) or to produce summaries over different periods of time (e.g., what changed after Hinterlands was released).

What I would really like to see done differently is to summarize the analysis data at several different levels of granularity. Everything should probably first be calculated and stored for a given date. Then we can aggregate those date-level figures into weekly figures. And those weeks can be assembled into months, and so on, and so on.

In this new model, if we fix a date parse error for games played three weeks back, we only have to reparse those dates, recalculate the statistics for those date-buckets, and then reaggregate. We will never have to go back to 2010-10-15 and start over from scratch.
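To make the bucketed re-aggregation idea concrete, here's a minimal sketch in plain Python; the stat fields and bucket keys are illustrative assumptions, not the actual Councilroom schema:

```python
from collections import defaultdict
from datetime import date

def aggregate_daily(games):
    """Roll per-game records up into per-date buckets (illustrative stats)."""
    daily = defaultdict(lambda: {"games": 0, "turns": 0})
    for g in games:
        bucket = daily[g["date"]]
        bucket["games"] += 1
        bucket["turns"] += g["turns"]
    return dict(daily)

def aggregate_weekly(daily):
    """Combine date buckets into ISO-week buckets. Re-running this after
    fixing one date only touches the week containing that date, instead
    of recomputing over every game ever played."""
    weekly = defaultdict(lambda: {"games": 0, "turns": 0})
    for d, stats in daily.items():
        year, week, _ = d.isocalendar()
        bucket = weekly[(year, week)]
        bucket["games"] += stats["games"]
        bucket["turns"] += stats["turns"]
    return dict(weekly)

games = [
    {"date": date(2013, 8, 19), "turns": 18},
    {"date": date(2013, 8, 19), "turns": 22},
    {"date": date(2013, 8, 20), "turns": 15},
]
daily = aggregate_daily(games)
weekly = aggregate_weekly(daily)
```

Months would aggregate from weeks (or from dates) the same way, so a fix three weeks back ripples through a handful of buckets rather than the whole history.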

This is a much bigger piece of work, but this seems like the right time to do it. Any thoughts?


Mike



On 08/20/2013 10:14 PM, Max Gibiansky wrote:
I'll try running indexes.py. I'm pretty sure I have not done that. My laptop overheating issues are independent of councilroom, but maybe that'll help with handling larger datasets. 

I'll try out the vagrant branch and see whether I can get it working, take a look at how much code needs to be written to get the scanner working, and try it out on a mix of Iso and Goko data...


On Sun, Aug 18, 2013 at 9:13 AM, Mike McCallister <mi...@mccllstr.com> wrote:
I just pushed a new commit on the vagrant branch that makes a lot of progress towards getting it up and running fully within the Vagrant environment. There's still some more tidying up and documenting to do, but it's getting pretty close. Any feedback on the docs in the README.rst would be appreciated.

In terms of your laptop issues... one possibility that occurs to me is you might be missing some indexes on the MongoDB collections. Have you run the "indexes.py" script? If you do it now and it returns in seconds, then that wasn't the issue. If it takes a long time, then you were missing indexes. Since I've been doing most dev on the prod DB to date, and it already has the indexes, I wasn't hit by this problem until I started working in a new env.

Once the vagrant branch is working for local deployments, it should only take a little more work to get it working on Amazon. This would let you spin up an instance when you want to hack and then shut it down when you're done, so very little cost on an ongoing basis.


Mike


On 08/15/2013 08:53 PM, Max Gibiansky wrote:
Well, the idea would be that I can keep working on the stuff that needs to be rewritten for cr to get back online: the analysis code, so basically the scanner and everything that depends on it. Help out however I can.

My 'local resources' are sucky. I've got a laptop with overheating problems. It's workable as long as I don't try working on the full-size database and limit myself to a week or two's worth of data in the db. It's good enough for some hacking, though not a full-size test. I can try grabbing the vagrant branch and trying it out.

I don't think it's worth spending money to give me a proper dev environment right now; I really can't promise I'd spend enough time on this to make that worthwhile. I'm kind of swamped with work plus the job search. I'd definitely try to make time, 'cause I'd like to get this working, but I can't really make a commitment, so I don't want to impose a cost on you guys for it.


On Thu, Aug 15, 2013 at 5:27 PM, Mike McCallister <mi...@mccllstr.com> wrote:
There are several different ways to skin this cat. The hosting charges for July were about $90. Of that, about $45 is for keeping the server hosting the site running 24x7, about $30 is for disk space, and there are a handful of other dollars here and there for disk and network IO and related services. A development environment wouldn't need to keep the lights on 24x7, so it could be much cheaper to just turn it on when you want to hack and then turn it off later. You could also cut down on the disk space somewhat... there is about 20% unused, and I've got some storage hanging around for ad-hoc stuff. You also wouldn't have the network and disk IO that the main site does. So a near-perfect copy, used a few hours a day for development, would probably run around $35-$40/mo.
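The arithmetic behind that estimate, using the rough July figures from this message as inputs (the split of the "other dollars" and the hours-per-day are illustrative assumptions):

```python
# Rough July figures from the message, in USD/month (illustrative).
server_24x7 = 45.0   # keeping the site server running around the clock
disk = 30.0          # disk space
misc_io = 15.0       # "handful of other dollars" for IO and related services
july_total = server_24x7 + disk + misc_io   # the ~$90 July bill

# A dev box powered on ~4 hours/day pays roughly 4/24 of the server cost,
# keeps the disk cost, and skips the production site's IO charges.
dev_estimate = server_24x7 * (4 / 24) + disk   # 7.5 + 30 = 37.5
```

That lands inside the $35-$40/mo range quoted above; trimming the unused 20% of disk would pull it toward the low end.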

Another alternative, depending on what kind of development you want to do, is to simply hack directly on the production system itself. If you're just reading from existing collections or writing to new ones, it's easy enough to set up a new user, set up a new virtualenv with some separate ports, and just get to work. There's no isolation from the production DB, so we wouldn't want to use this approach to overhaul the on-disk format, but a lot of work could basically be done for free.

The thing I don't like about either of these is that the production server remains a "unique snowflake". It was somewhat hard to set up, I didn't keep good notes, and that means it is somewhat fragile and relatively difficult to recreate. I've been doing work on the Vagrant branch mentioned below to automate the provisioning of the environment and deployment of the app so it is as simple as 1) checkout the source, 2) run the scripts, 3) load as much data as you need, and 4) hack away in your own private sandbox. Take a look at https://github.com/mikemccllstr/dominionstats/blob/vagrant/README.rst#setup-for-local-development, it has been mostly updated.

The same scripts will deploy the app on a private sandbox in Vagrant or on a new production or testing site in AWS. This will reduce the friction for developers dramatically. It is pretty close to done and working if you are developing from Linux or Mac. It probably won't work right out of the box on a Windows PC, but we can figure that out if necessary.

There are certainly other possibilities, too, so let me know your thoughts. What is it you want to use this environment to accomplish? Is the need time-limited or ongoing? What resources does Max have locally?


Mike



On 08/15/2013 06:30 PM, Rob Renaud wrote:
How hard/costly would it be to get Max his own copy of the prod site for development purposes? E.g., something running on AWS with an identical setup to what hosts councilroom now, so that if the code runs well on his dev copy, it will likely run well for the prod version? I am willing to finance this up to a few hundred dollars if it's primarily a cost issue.

Michael McCallister

Aug 24, 2013, 12:42:54 PM
to Max Gibiansky, council...@googlegroups.com, Rob Renaud

I think we can, and probably should, do both.

The simplest thing that will get the old scanner system working again is to come up with a new timestamp field that can be used for the games. It should probably be taken from the log filename. Let's store this field in the parsed games and put an index on it. Then, adjust the scanner logic to use this new field instead of the game id.
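A sketch of what deriving that field might look like, assuming (hypothetically) that the log filename embeds a Unix timestamp as a dot-separated field; the real Goko filename format may differ, so the parsing would be adjusted once it's confirmed:

```python
from datetime import datetime, timezone

def game_time_from_filename(filename):
    """Pull a sortable 'game end time' out of a log filename.

    Assumes a hypothetical layout like "log.<gameid>.<unixtime>.txt";
    adjust once the real Goko filename format is confirmed.
    """
    parts = filename.rsplit(".", 2)   # ["log.<gameid>", "<unixtime>", "txt"]
    return datetime.fromtimestamp(int(parts[-2]), tz=timezone.utc)

# Timestamps parsed this way sort monotonically, which is what the
# scanner needs in place of the old game id.
t1 = game_time_from_filename("log.51f1a2b3.1376900000.txt")
t2 = game_time_from_filename("log.6a0c9d4e.1376986400.txt")
```

Stored as a real field in the parsed game documents with an index on it, this gives the scanner a monotonic cursor and also provides the date buckets for the new aggregation approach.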

We will probably need to reload and reparse the existing Iso games, as I think Mongo handles document growth in a suboptimal way for this situation.

The new approach will also benefit from this new timestamp field, so this is a step forward towards both goals.

Thoughts?

Max Gibiansky

Aug 29, 2013, 5:05:43 PM
to Michael McCallister, council...@googlegroups.com, Rob Renaud
I agree; that approach seems best.

I've been really busy and haven't had a chance to get the vagrant branch set up and try working on this. I'll try to make progress as soon as I have some time.