Goko Log Downloads


Michael McCallister

Jul 1, 2013, 1:19:26 AM
to council...@googlegroups.com
Ok, I'm incorporating Max's work into Councilroom.com, and I'm scratching my head a bit about downloading the Goko logs...

It looks like the code currently uses the multi_scrape.sh script to retrieve the logs. Goko seems to offer them from two sites: http://dominionlogs.goko.com/ and http://archive-dominionlogs.goko.com/. Several questions come to mind:
  • In terms of the specific directories, has anyone figured out the timezone they are defined in? In other words, the directory name "20130629" refers to a date, but is it in UTC or some other timezone?
  • Are the logs placed in a folder by the time they were started or finished?
  • Do partial logs ever appear in the directories? Asked another way, do the logs of in-progress games appear, or only completed games?
  • Roughly when does a directory stop getting updated with new games?
  • When does a directory move from the dominionlogs site to the archive-dominionlogs site? Does it ever appear in both places at the same time?
  • What do the components of the filenames mean? The first looks like a hash, maybe an ID for Player 1? The second looks like a timestamp; does anyone know of what?

Mike

Max Gibiansky

Jul 2, 2013, 2:01:50 PM
to council...@googlegroups.com
Oops, I sent my reply only to your personal email last time; forwarding to councilroom-dev.


Responses inline.


On Sun, Jun 30, 2013 at 10:19 PM, Michael McCallister <mi...@mccllstr.com> wrote:
Ok, I'm incorporating Max's work into Councilroom.com, and I'm scratching my head a bit about downloading the Goko logs...

It looks like the code currently uses the multi_scrape.sh script to retrieve the logs. Goko seems to offer them from two sites: http://dominionlogs.goko.com/ and http://archive-dominionlogs.goko.com/. Several questions come to mind:

That is correct. Archive-dominionlogs has old ones, dominionlogs has new ones.
 
  • In terms of the specific directories, has anyone figured out the timezone they are defined in? In other words, the directory name "20130629" refers to a date, but is it in UTC or some other timezone?

Pacific time. I just watched the 7/1 directory appear at midnight Pacific time. That's where Goko is located, on the U.S. West Coast.
  • Are the logs placed in a folder by the time they were started or finished?

Finish time, based on the n=1 game I just tried: I started it on 6/30, finished it on 7/1, and it appeared in the 7/1 folder.
  • Do partial logs ever appear in the directories? Asked another way, do the logs of in-progress games appear, or only completed games?

I think it's only completed games. I've yet to encounter an incomplete game, and I just checked on the game with the latest timestamp and it was a completed one.
 
  • Roughly when does a directory stop getting updated with new games?

Based on timestamps, 11:59 PM Pacific time on that day.
  • When does a directory move from the dominionlogs site to the archive-dominionlogs site? Does it ever appear in both places at the same time?
The cutoff used to be mid-March, but I just checked and dominionlogs now has only the last two days. So as far as I can tell the cutoff is arbitrary. There used to be overlap (directories that appeared in both places), but right now there isn't. No consistency that I can glean yet.
 
  • What do the components of the filenames mean? The first looks like a hash, maybe an ID for Player 1? The second looks like a timestamp, anyone know of what?


First part is a hash of an id for the person who hosted the game. Second part is a timestamp, I think of when the game ended.
 
You're right about the timezone issue, though: if the CR update starts running while Goko is still on the previous day, it'll miss some games... I didn't catch that...
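One way to guard against that is to pick the day to scrape from the Pacific-time clock rather than the server's clock. A minimal sketch, assuming directory names are Pacific calendar dates as observed above; the function name `latest_closed_day`, the one-hour `grace` margin, and the `America/Los_Angeles` zone are illustrative assumptions, not part of the existing scripts (and `zoneinfo` needs Python 3.9+, which is newer than what the worker runs):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: "Pacific time" means the America/Los_Angeles zone, DST included.
GOKO_TZ = ZoneInfo("America/Los_Angeles")


def latest_closed_day(now_utc, grace=timedelta(hours=1)):
    """Return the newest YYYYMMDD directory name that should no longer change.

    now_utc must be a timezone-aware datetime. The grace margin (hypothetical)
    delays the rollover a bit in case logs land shortly after Pacific midnight.
    """
    local = now_utc.astimezone(GOKO_TZ) - grace
    return (local.date() - timedelta(days=1)).strftime("%Y%m%d")


if __name__ == "__main__":
    # 08:30 UTC on July 1, 2013 is 01:30 PDT, so June 30 is treated as closed.
    print(latest_closed_day(datetime(2013, 7, 1, 8, 30, tzinfo=timezone.utc)))
```

A scrape job would then only fetch directories up to and including the returned name, and leave the current Pacific day for the next run.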

Mike McCallister

Jul 4, 2013, 1:30:45 AM
to council...@googlegroups.com
So you know where things stand, I've started downloading about 7.1M game logs from Goko. It is proceeding slowly (about 83K/hour), but it will finish eventually. After that, I'll compress them into day-size archives and put them in S3, where the jobs that parse and load them into the Councilroom database will pick them up to process.
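For anyone following along, the "day-size archive" step could look roughly like the sketch below. It assumes one local directory per day (named like the Goko directories, e.g. "20130629"); the function name `archive_day` and the paths are hypothetical, and the S3 upload is left out:

```python
import tarfile
from pathlib import Path


def archive_day(log_root, day, out_dir):
    """Pack log_root/<day>/ into out_dir/<day>.tar.bz2 and return the archive path.

    Hypothetical helper: the directory layout is an assumption, not the
    layout the existing scripts necessarily use. out_dir must already exist.
    """
    src = Path(log_root) / day
    out = Path(out_dir) / f"{day}.tar.bz2"
    # "w:bz2" writes a bzip2-compressed tar in one pass.
    with tarfile.open(str(out), "w:bz2") as tar:
        # arcname keeps paths inside the archive relative to the day directory.
        tar.add(str(src), arcname=day)
    return out
```

Keeping one archive per day means a parse job can fetch and unpack exactly the day it is about to process.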

I'll keep you informed of progress.

Once we're caught up, I think we can rework the jobs a bit so there's not so much lag time between the end of a day and the daily processing being kicked off.

Michael McCallister

Jul 7, 2013, 1:32:36 AM
to council...@googlegroups.com
Current status:
  • The Goko logs through July 2 are downloaded, tarred, bzipped, and stored in the councilroom S3 bucket.
  • I've made some minor tweaks to the scripts on my issue_48 branch (https://github.com/mikemccllstr/dominionstats/tree/issue_48).
  • I've taken a snapshot (backup) of the production database.
  • I've added some "source" attributes in the games and raw_games collections, and I've indexed these attributes.
  • I've modified the rawgame and game modules to use these new attributes.
  • I'm making some minor changes to the background tasks and related pieces so that they are ready to run. In particular, they need a couple of tweaks to be able to run for days on which both Goko logs and Iso logs are present in the same database.
More progress to come.


Mike

Michael McCallister

Jul 8, 2013, 1:22:47 AM
to council...@googlegroups.com
Updates from my EOD July 7:

I tried to parse August 5, 2012, the very first day that logs are available. The format is different enough that the games didn't parse. This is somewhat as you expected, so I'm not too concerned about it. However, it seems to die with tracebacks that kill the Celery task that is parsing the block of 100 games. I need to fix this, as a traceback should just cause the single game to not parse, not break the entire task.
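The fix I have in mind is roughly the shape below: catch exceptions per game so that one bad log is recorded and skipped rather than taking down the whole block. This is only a sketch; `parse_block`, the `parse_one` callable, and the `"_id"` key are stand-ins for illustration, not the actual task code:

```python
import logging

log = logging.getLogger(__name__)


def parse_block(raw_games, parse_one):
    """Parse each game independently and return (parsed, failures).

    raw_games is assumed to be a list of dicts with an "_id" key;
    parse_one is any callable that parses a single raw game.
    """
    parsed, failures = [], []
    for raw in raw_games:
        try:
            parsed.append(parse_one(raw))
        except Exception as exc:  # deliberately broad: isolate any parser bug
            log.exception("game %s failed to parse", raw.get("_id"))
            failures.append((raw.get("_id"), repr(exc)))
    return parsed, failures
```

The failures list can then be written to an error collection so the offending games can be pulled out later as test cases.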

I next tried to parse March 20, 2013, and it went much better. There were 24 tracebacks out of ~48K games, all like the one below. I'll dig into this to figure out what needs to be done to avoid the traceback. 

[2013-07-07 23:18:33,914: ERROR/MainProcess] Task background.tasks.parse_games[919bd5f2-75e4-4400-bc58-67242bcc6694] raised exception: TypeError("'NoneType' object has no attribute '__getitem__'",)
Traceback (most recent call last):
  File "/home/ubuntu/worker/.venv-worker/local/lib/python2.7/site-packages/celery/task/trace.py", line 224, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/ubuntu/worker/.venv-worker/local/lib/python2.7/site-packages/celery/task/trace.py", line 406, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/ubuntu/worker/background/tasks.py", line 47, in parse_games
    return parse_and_insert(log, raw_games, parsed_games_col, parse_error_col, day)
  File "/home/ubuntu/worker/parse_game.py", line 185, in parse_and_insert
    parsed_games = map(lambda x: parse_game_from_dict(log, parse_error_col, x), raw_games)
  File "/home/ubuntu/worker/parse_game.py", line 185, in <lambda>
    parsed_games = map(lambda x: parse_game_from_dict(log, parse_error_col, x), raw_games)
  File "/home/ubuntu/worker/parse_game.py", line 119, in parse_game_from_dict
    parsed = parse_game(contents, dubious_check = True)
  File "/home/ubuntu/worker/parse_game.py", line 83, in parse_game
    game_dict = parse_goko_game.parse_game(game_str, dubious_check)
  File "/home/ubuntu/worker/parse_goko_game.py", line 1015, in parse_game
    turns = parse_turns(log_lines, game_dict[PLAYERS], removed_from_supply)
  File "/home/ubuntu/worker/parse_goko_game.py", line 887, in parse_turns
    removed_from_supply, masq_targets, previous_name)
  File "/home/ubuntu/worker/parse_goko_game.py", line 753, in parse_turn
    if bom_choice[0] == dominioncards.Knights:
TypeError: 'NoneType' object has no attribute '__getitem__'

It looks like 43K games got parsed from March 20. Assuming another ~2400 get added to that count from fixing the above traceback (24 failed blocks of 100 games each), this would yield about a 94% parse rate. Is that in line with your expectations? Will that ratio improve over time as Goko fixes bugs?

It took about 21 minutes to load and parse March 20. Rough math suggests it will take 40 processing hours to parse the backlog of Goko logs from mid-March to present. Once I get a little more comfortable with the parsing results, we can spin up a dozen AWS instances to burn through that in a few hours.
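The rough math, spelled out. The figures come from this message; the assumptions are that each of the 24 tracebacks cost a full block of 100 games, that the backlog runs from March 20 through July 7 inclusive, and that every day takes about as long as March 20 did (day volumes actually vary):

```python
from datetime import date

# Parse rate for March 20, using the counts quoted in this message.
games_total = 48_000
games_parsed = 43_000
failed_blocks = 24
block_size = 100  # each traceback killed a task parsing 100 games

recovered = failed_blocks * block_size                  # 2400 games
parse_rate = (games_parsed + recovered) / games_total   # a little under 95%

# Backlog estimate. Assumption: March 20 through July 7, inclusive.
backlog_days = (date(2013, 7, 7) - date(2013, 3, 20)).days + 1
minutes_per_day = 21
backlog_hours = backlog_days * minutes_per_day / 60     # close to the ~40 h figure
hours_with_12_instances = backlog_hours / 12            # "a few hours"

if __name__ == "__main__":
    print(f"parse rate: {parse_rate:.1%}")
    print(f"backlog: {backlog_days} days, {backlog_hours:.1f} h on one instance")
    print(f"with 12 instances: {hours_with_12_instances:.1f} h")
```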


Mike

Max Gibiansky

Jul 8, 2013, 3:39:44 AM
to Michael McCallister, council...@googlegroups.com
A 94% parse rate is a bit low. There are a few Goko bugs that didn't get fixed until much later:

1) When trashing attacks were played, Goko would report "[attacker] trashes [card]" instead of the victim trashing their own card. They fixed that later.
2) Durations played on the last turn wouldn't get returned to the deck for scoring, leading to the sanity check failing even though we parsed the game correctly. Still not entirely fixed.
3) Sometimes with self-trashers (HoP, Knights, Feast), Goko reports the card being trashed too many times when throned/counterfeited.

And on my end:
Band of Misfits leads to endless edge cases, I think most of which Goko reports properly, but which I just can't get right for all cases. (Some really are ambiguous, some are doable but require an arbitrarily large amount of logic to determine what BoM-as-[something] is doing.)

It got up to 98-99% for me by the later days.

The specific bug: hmm, I don't think I've seen it. I'm not sure *why* I haven't, since it seems like it should have cropped up for me. It's trying to compare the card that was chosen for Band of Misfits to Knights when no card was chosen at all. A Band of Misfits bug; not surprised.

A straightforward patch that might work is indenting that clause (lines 753-754), so that it'll only check for BoM-as-Sir-Martin if it's doing BoM processing. I guess I didn't have anything besides Band of Misfits trigger that whole section, but something else is getting there... Do you know which logs cause that error? Do you have an example?

Or just put in a "bom_choice is not None" check; that would prevent the crash. It's a check for an obscure Band of Misfits edge case, so it doesn't actually do anything most of the time.
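In isolation, that guard would look something like the sketch below. The `KNIGHTS` constant and the `bom_is_knights` helper are stand-ins for illustration (the real comparison at line 753 is against dominioncards.Knights), and I haven't run this against the parser:

```python
# Stand-in for dominioncards.Knights so the sketch runs on its own.
KNIGHTS = "Knights"


def bom_is_knights(bom_choice):
    """True only when a Band of Misfits choice exists and its first entry is Knights.

    Hypothetical helper: the None check short-circuits, so bom_choice[0]
    is never evaluated when no card was chosen.
    """
    return bom_choice is not None and bom_choice[0] == KNIGHTS
```

In parse_turn itself this would just mean adding the None test in front of the existing comparison.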

Michael McCallister

Jul 10, 2013, 1:32:18 AM
to council...@googlegroups.com, Michael McCallister
Things are looking good now. I've parsed and loaded March 24 to April 20, about 1.1M games, and only ~1300 encountered parse errors due to exceptions. These are probably all instances of that Knights processing problem. I've got the game IDs... I'll pull a few out as test cases. With my recent changes, it now only fails to parse the single game instead of failing the block of 100.

Through 4/20, I'm seeing an average parse ratio of a little over 94%. This seems good to me, so I'm going to spin up an additional AWS instance and will let both of them crank on data from 4/20 to the beginning of July.