Re: Councilroom data


Rob Neuhaus

Nov 2, 2016, 10:04:36 AM
to Henk Broekhuizen, council...@googlegroups.com
I think councilroom has been unmaintained for years, sorry :(.

On Wed, Nov 2, 2016 at 9:23 AM, Henk Broekhuizen <h.broe...@x-is.eu> wrote:

Hello RRenaud,

 

For a hobby project I’m interested in running machine learning analyses on the councilroom database using R. This page tells me that if I just want to access the data, I should send you a message, so here I am ;). Is there a way to access the database through R that you know of, or is the data available as a file somewhere? Thanks in advance for your time.

 

Best wishes,

Henk

 

Henk Broekhuizen, MSc

PhD candidate  | University of Twente, dept. HTSR | E: h.broe...@utwente.nl | W: www.utwente.nl/bms/htsr

Data Scientist | X-IS | E: h.broe...@x-is.eu | W: www.x-is.eu

 


Mike McCallister

Nov 2, 2016, 11:31:30 PM
to council...@googlegroups.com, h.broe...@x-is.eu
Hi Henk,

While the Councilroom site is still running, I am getting close to winding it down. As Rob points out, it's been years since I've closely looked at the code. That said, I think the original game logs are all still available, and I could also dump the parsed data that's stored in the Mongo database. What are you interested in?


Mike
--
You received this message because you are subscribed to the Google Groups "Councilroom.com development" group.


Henk Broekhuizen

Nov 3, 2016, 4:26:52 AM
to Mike McCallister, council...@googlegroups.com

Hi Mike,

 

Thanks for your response. I’m interested in the game logs to run analyses on. How large a database (or collection of files) are we talking about? I’m not familiar with Mongo but a quick Google shows that it should be possible to import data into my R session(s). Let me know what’s most convenient for you.

 

Thanks again,

 

Henk

Mike McCallister

Nov 4, 2016, 11:26:18 AM
to Henk Broekhuizen, council...@googlegroups.com
Hi Henk,

I'll take a look at what's available this weekend. From rough recollection, the compressed files are in the tens of GBs, and that probably includes both the original game logs and the parsed results that are in a JSON-like format.


Mike

Mike McCallister

Nov 7, 2016, 5:40:41 PM
to Henk Broekhuizen, council...@googlegroups.com
So one way to get you the data is to dump the full Mongo DB (roughly 161GB, according to Mongo). Alternatively, I can dump one or more collections within the DB.

Here is the list of collections:

analysis
buys
card_supply
game_size
game_stats
games
goal_stats
goals
isotropic_tracker
leaderboard_history
month
optimal_card_ratios
optimal_card_ratios2
parse_error
raw_games
scanner
system.indexes
trueskill_openings

Most of the data comes from the game logs themselves (raw_games), and here are the stats for that collection:

> db.raw_games.stats()
{
    "ns" : "test.raw_games",
    "count" : 18027299,
    "size" : 32904914856,
    "avgObjSize" : 1825.282581489329,
    "storageSize" : 33382170464,
    "numExtents" : 46,
    "nindexes" : 3,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 2272592784,
    "indexSizes" : {
        "_id_" : 1291987872,
        "game_date_1" : 631089088,
        "src_1" : 349515824
    },
    "ok" : 1
}
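As a quick sanity check on those numbers, `avgObjSize` is just `size / count`, so the average compressed game log is a little under 2 KB:

```python
# Derive avgObjSize from the raw_games stats above.
count = 18027299            # documents in raw_games
size = 32904914856          # total document bytes, from db.raw_games.stats()

avg = size / count
print(round(avg, 2))        # ~1825.28, matching avgObjSize
print(round(size / 1e9, 1)) # ~32.9 decimal GB -- the "about 33GB" below
```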

If I recall correctly, the text of the game logs has been compressed (gzip, bzip2, maybe xz) before being inserted into the collection. There are 18M games, requiring about 33GB of space. Since they are already compressed, there's little to gain by recompressing them. I need to check my S3 buckets... these may already be available for download in month-sized chunks.
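Since the codec per log is uncertain (gzip, bzip2, maybe xz), one way to cope is to sniff the magic bytes of whatever blob you pull out of `raw_games`. A minimal sketch — the actual field layout of the documents isn't shown in this thread, so this only illustrates the detect-and-decompress step:

```python
import bz2
import gzip
import lzma

def decompress_log(blob: bytes) -> str:
    """Detect the compression codec from magic bytes and return the log text.

    The field holding the compressed bytes in raw_games isn't specified in
    this thread, so treat this as a sketch for whatever blob you extract.
    """
    if blob[:2] == b"\x1f\x8b":        # gzip magic
        return gzip.decompress(blob).decode("utf-8")
    if blob[:3] == b"BZh":             # bzip2 magic
        return bz2.decompress(blob).decode("utf-8")
    if blob[:6] == b"\xfd7zXZ\x00":    # xz magic
        return lzma.decompress(blob).decode("utf-8")
    return blob.decode("utf-8")        # fall back to plain text

# Round-trip demo with a fake game log under all three codecs:
log = "gooftroop plays a Village.\n"
for packed in (gzip.compress(log.encode()),
               bz2.compress(log.encode()),
               lzma.compress(log.encode())):
    assert decompress_log(packed) == log
```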

Alternatively, you might find it easier to work from the parsed games. These are in a JSON format, and they've been somewhat normalized/standardized. Here are the stats for that collection:

> db.games.stats()
{
    "ns" : "test.games",
    "count" : 16006304,
    "size" : 92226397240,
    "avgObjSize" : 5761.879646919114,
    "storageSize" : 93925428640,
    "numExtents" : 75,
    "nindexes" : 5,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 10893211840,
    "indexSizes" : {
        "_id_" : 1171408224,
        "game_date_1" : 560342160,
        "P_1" : 1449171472,
        "S_1" : 7413915040,
        "F_1" : 298374944
    },
    "ok" : 1
}

The parser filters out junk games, and it also fails to parse some, so there are only 16M games in this collection. These are stored uncompressed, so the raw storage requirement is higher, about 92GB (excluding indexes). That said, they will probably compress pretty well for transit.

Let me know if you have any questions. If you are handy at the Linux shell prompt, I'd be OK with setting you up with a user on the server so you could explore the data in place if you like.



Mike




Henk Broekhuizen

Nov 8, 2016, 3:31:39 AM
to Mike McCallister, council...@googlegroups.com

Hi Mike,

 

The parsed games sound like a good place to start for me, as they're already filtered and I think JSON can readily be read into R. In my job I work with SQL Server, but I've read online that converting from MongoDB to SQL is rather bothersome.

 

I'm not very handy in the Linux terminal beyond the basics, but I think I'll manage. So a login sounds good, and then I can figure out how to compress the data on the server and download the JSON myself. Of course, if you would volunteer to send me the data, that'd be greatly appreciated, but I don't want to impose on your time.

 

Thanks,

Mike McCallister

Nov 9, 2016, 10:53:00 PM
to Henk Broekhuizen, council...@googlegroups.com
Hi Henk,

To give you an idea of what the parsed games look like, I exported some into JSON. The pretty-printed version looks something like the blob below:

{
  "_id": "game-20101015-000153-1034d2ce.html",
  "D": [
    {
      "*": 44,
      "O": 1,
      "N": "gooftroop",
      "R": false,
      "T": [
        {
          "!": [
            25,
            25
          ],
          "b": [
            18
          ],
          "m": 2
        },
        {
          "!": [
            25,
            25,
            25,
            25,
            25
          ],
          "b": [
            100
          ],
          "m": 5
        },
        {
          "!": [
            18,
            100,
            25,
            25,
            25
          ],
          "b": [
            100
          ],
          "m": 5,
          "o": {
            "1": {
              "g": [
                31,
                25
              ]
            }
          }
        },
[snip]
        {
          "!": [
            161,
            25,
            25,
            133
          ],
          "b": [
            85
          ],
          "m": 4
        },
        {
          "!": [
            100,
            25,
            25,
            25,
            119
          ],
          "b": [
            57
          ],
          "m": 6,
          "o": {
            "0": {
              "g": [
                25
              ]
            }
          }
        }
      ],
      "W": 0,
      "V": 0,
      ":": {
        "25": 15,
        "18": 1,
        "57": 1,
        "35": 3,
        "41": 3,
        "133": 5,
        "131": 3,
        "31": 6,
        "103": 3,
        "100": 3,
        "119": 3,
        "161": 1,
        "85": 3
      }
    }
  ],
  "G": [
    35,
    31,
    103
  ],
  "P": [
    "gooftroop",
    "dl337"
  ],
  "S": [
    0,
    18,
    85,
    89,
    100,
    103,
    111,
    119,
    131,
    161
  ],
  "R": false,
  "game_date": "20101015",
  "X": {}
}

This structure encodes all the detail about the game, including the deck, the players, the turns they took (which cards were drawn and played, and with what results), and how they placed at the end of the game. To understand what it all means, you will probably have to spend a fair bit of time with the Python code that reads and writes it, as well as looking at the corresponding raw game logs.

Some of the keys are defined here: https://github.com/mikemccllstr/dominionstats/blob/master/keys.py

The cards are identified by their index within the card list from here: https://github.com/mikemccllstr/dominionstats/tree/master/card_info
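A minimal Python sketch of pulling a few fields out of a trimmed copy of that sample document. The interpretations in the comments — `":"` as final card counts and `"T"` as the per-player turn list — are guesses to verify against keys.py:

```python
import json

# A trimmed copy of the sample game above. ":" appears to map
# card index -> count, and "T" to be that player's list of turns
# (both guesses -- check keys.py before relying on them).
doc = json.loads("""
{
  "_id": "game-20101015-000153-1034d2ce.html",
  "D": [
    {"N": "gooftroop", "W": 0,
     "T": [{"!": [25, 25], "b": [18], "m": 2},
           {"!": [25, 25, 25, 25, 25], "b": [100], "m": 5}],
     ":": {"25": 15, "18": 1, "100": 3}}
  ],
  "P": ["gooftroop", "dl337"],
  "S": [0, 18, 85, 89, 100, 103, 111, 119, 131, 161],
  "game_date": "20101015"
}
""")

for deck in doc["D"]:
    turns = deck["T"]
    final_cards = sum(deck[":"].values())
    print(f'{deck["N"]}: {len(turns)} turns shown, '
          f'{final_cards} cards at game end')

print("players:", doc["P"])
print("supply card indices:", doc["S"])
```

Card indices (25, 100, ...) would then be resolved to names via the card_info list linked above.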

Does this seem like something you can use?

I've kicked off a mongoexport to dump all the games into an xz compressed JSON file. It's not very speedy on this AWS instance I'm using... current estimate is about 99 hours of run time. Let me know if I should let it run to completion, or you want to do something different.


Mike

Henk Broekhuizen

Nov 10, 2016, 2:01:00 AM
to Mike McCallister, council...@googlegroups.com

Hi Mike,

 

That looks quite readable, actually, given the translation key for the symbols. It definitely sounds like something I could use, so please let the dump complete.

 

Thanks again, and I look forward to hearing from you.

 

Best wishes,

Mike McCallister

Nov 10, 2016, 11:08:08 AM
to Henk Broekhuizen, council...@googlegroups.com
I'm glad to hear it will be useful to you. It's currently estimating about 92 hours to go. If you don't hear from me in a few days, feel free to ping me a reminder.

For your planning purposes, it looks like the xz compression is achieving a good ratio. I estimate the final file will be about 8.5GB compressed (92GB uncompressed). I can put it up on an Amazon S3 bucket, so you can retrieve it at your convenience.
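Since mongoexport writes one JSON document per line, a file like that can be streamed record-by-record straight from the .xz, without ever materializing the 92GB on disk. A sketch, assuming the default line-delimited mongoexport layout:

```python
import json
import lzma

def iter_games(path):
    """Stream games from an xz-compressed mongoexport file one at a time.

    Assumes the default mongoexport layout: one JSON document per line.
    lzma.open decompresses on the fly, so only one record is in memory.
    """
    with lzma.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Demo: write two fake records, then stream them back.
with lzma.open("games_sample.json.xz", "wt", encoding="utf-8") as fh:
    fh.write('{"_id": "game-1", "P": ["a", "b"]}\n')
    fh.write('{"_id": "game-2", "P": ["c"]}\n')

ids = [g["_id"] for g in iter_games("games_sample.json.xz")]
print(ids)  # ['game-1', 'game-2']
```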


Let me know if you have any questions.


Mike

Mike McCallister

Nov 14, 2016, 9:17:22 PM
to Henk Broekhuizen, council...@googlegroups.com
Hi Henk,

You should be able to retrieve the data from this URL:

https://s3.amazonaws.com/static.councilroom.mccllstr.com/mongo_export/games.json.xz

Please let me know once you've successfully retrieved it, as I might move it to a lower-cost storage tier.


Mike