File format training data?


anomander

Apr 22, 2018, 2:15:39 PM
to LCZero
I must be missing something obvious, but what is the file format of the training data files?

Alexander Lyashuk

Apr 22, 2018, 2:18:58 PM
to mnor...@gmail.com, LCZero
The format is the following structure https://github.com/glinscott/leela-chess/blob/master/training/tf/chunkparser.py#L115 repeated.
The whole file is gzip'ed.
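
A minimal sketch of reading such a file, assuming only what is stated above: the data is gzip-compressed and consists of one fixed-size structure repeated. The function name `parse_chunk` is made up for illustration, and `record_size` is deliberately left as a parameter because the exact structure size should be taken from the linked chunkparser.py rather than from this sketch.

```python
import gzip


def parse_chunk(compressed, record_size):
    """Decompress a gzip'ed training file and split it into fixed-size records.

    `record_size` is an assumption supplied by the caller; take the real value
    from the struct definition in chunkparser.py.
    """
    data = gzip.decompress(compressed)
    if len(data) % record_size != 0:
        raise ValueError(
            f"{len(data)} bytes is not a multiple of record size {record_size}"
        )
    # Each slice is one repetition of the structure.
    return [data[i:i + record_size] for i in range(0, len(data), record_size)]
```

Typical use would be `parse_chunk(open(path, "rb").read(), record_size)`. A file written in a different format version fails the length check instead of being silently misparsed.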

--
You received this message because you are subscribed to the Google Groups "LCZero" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lczero+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lczero/1dcd6b80-819d-4b31-8831-b975974bc31c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

anomander

Apr 22, 2018, 2:23:25 PM
to LCZero
Thanks! So, is there an easy way to extract PGN information from the unpacked files?

Alexander Lyashuk

Apr 22, 2018, 2:37:51 PM
to mnor...@gmail.com, LCZero
No, there is no easy way, but there is a hard way.

The first 12 of the "packed bit planes" are 64-bit bitboards, which represent:
  • Pawns of side to move
  • Knights of side to move
  • Bishops of side to move
  • Rooks of side to move
  • Queens of side to move
  • King of side to move
  • Pawns of other side
  • Knights of other side
  • Bishops of other side
  • Rooks of other side
  • Queens of other side
  • King of other side
These encode the board state. By comparing that board state with the board at the next move, one can find which move was played.
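
The comparison can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: it assumes bit i of a bitboard corresponds to square i, and that both positions have already been brought into the same orientation (the planes are stored from the side to move's point of view, so the following record lists the mover's pieces as "other side" and may need flipping first). The names `squares` and `infer_move` are made up.

```python
def squares(bitboard):
    """Return the set of square indices (0..63) whose bit is set."""
    return {sq for sq in range(64) if (bitboard >> sq) & 1}


def infer_move(before, after):
    """Infer (from_square, to_square) for the side that just moved.

    `before` and `after` each hold that side's six bitboards in the order
    pawns, knights, bishops, rooks, queens, king, in the same orientation.
    """
    occupied_before = set().union(*map(squares, before))
    occupied_after = set().union(*map(squares, after))
    vacated = occupied_before - occupied_after
    filled = occupied_after - occupied_before
    if len(vacated) == 1 and len(filled) == 1:
        return vacated.pop(), filled.pop()
    # Castling moves two pieces at once; report the king's move.
    king_from = squares(before[5]) - squares(after[5])
    king_to = squares(after[5]) - squares(before[5])
    if len(king_from) == 1 and len(king_to) == 1:
        return king_from.pop(), king_to.pop()
    raise ValueError("could not infer a single move from the two positions")
```

Only the mover's own pieces are compared, so captures, en passant and promotions all show up as one vacated and one newly filled square. For a promotion, the promoted piece can be read from whichever plane gained the new bit.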

anomander

Apr 22, 2018, 2:40:19 PM
to LCZero
Alright, thanks a lot for the information!

Trevor

Apr 22, 2018, 7:53:29 PM
to LCZero
I just went ahead and did this... This will convert training data to PGN. Just edit the filename/directory at the top. You need Python 3 and python-chess (pip install python-chess)... Annoyingly, the last move is not recoverable from the training data, so this does not output results in the PGN. However, it's probably possible to get W/L/D from the training data at the last node. I just haven't gotten that far.

Trevor

Apr 22, 2018, 9:30:57 PM
to LCZero
I updated this to include the result in the PGN as well. Still missing the final move though.

Jeremy Zucker

Apr 23, 2018, 12:30:40 AM
to Trevor, LCZero
Thanks Trevor!

Alexander Lyashuk

Apr 23, 2018, 3:11:16 AM
to trevor...@gmail.com, LCZero
Thanks!

There is actually also a way to recover the last move:

The 1858 probabilities are the probabilities of making each move. Here is the index->move mapping:
Just pick the move with the highest probability and that's (very likely) it.
For Black, the move should be flipped: tr/12345678/87654321/
Also, a pawn promotion to a knight looks the same as a normal move: e.g. e7d8, if e7 is where the pawn is, means that it captures on d8 and promotes to a knight.
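
As a rough sketch of the two steps above (pick the most probable index, then flip ranks for Black): the `tr/12345678/87654321/` substitution maps directly onto Python's `str.translate`. The index->move table itself is not reproduced in this thread, so `index_to_move` below is a hypothetical stand-in that the caller has to fill from the real mapping. The function names are made up as well.

```python
# Rank-flipping table, equivalent to tr/12345678/87654321/
FLIP_RANKS = str.maketrans("12345678", "87654321")


def flip_move(uci):
    """Mirror the ranks of a move in coordinate notation, e.g. for Black."""
    return uci.translate(FLIP_RANKS)


def most_likely_move(probs, index_to_move, black_to_move):
    """Return the highest-probability move, flipped if Black is to move.

    `index_to_move` is a hypothetical list standing in for the real
    index->move mapping.
    """
    best = max(range(len(probs)), key=probs.__getitem__)
    move = index_to_move[best]
    return flip_move(move) if black_to_move else move
```

File letters and any promotion suffix pass through `flip_move` unchanged, since only the digits 1 to 8 are translated.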

During the training games, the maximum-probability move is not necessarily the move made, due to the temperature=1 setting.
Because of that, it may also be useful to annotate in the PGN when the move played was not the best one, along with its relative probability (move probability / best move probability).
E.g. 10. Nf3 (45%, best move: Bxe7)
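
One possible way to produce that kind of annotation, as a small sketch. The function name `annotate` and its arguments are made up for illustration; the output format simply mirrors the "Nf3 (45%, best move: Bxe7)" example.

```python
def annotate(played_san, played_prob, best_san, best_prob):
    """Return the move text, annotated when the played move was not the best.

    The percentage is the relative probability:
    played move probability / best move probability.
    """
    if played_san == best_san:
        return played_san
    ratio = played_prob / best_prob
    return f"{played_san} ({ratio:.0%}, best move: {best_san})"
```

The move number is left to the PGN writer, so a call such as `annotate("Nf3", 0.18, "Bxe7", 0.40)` yields only the move plus its parenthesised note.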



Alexander Lyashuk

Apr 23, 2018, 3:35:22 AM
to trevor...@gmail.com, LCZero
Also, if you don't object, it'd be useful to have this script in the repository (in the scripts/ directory). In gists it may be quickly lost/forgotten.

Alexander Lyashuk

Apr 23, 2018, 4:12:40 AM
to trevor...@gmail.com, LCZero
As for win/loss/draw: it's stored in every node, and it's convenient to extract it from the first chunk (for Black's moves it's flipped, and the first one is White's).
It's stored in the last "result" byte, which is 1 if White won, -1 if Black won, and 0 for a draw.
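
A small sketch of turning that byte into a PGN result tag. It assumes the record passed in is the first one of the game (White to move) and that the final byte is a signed 8-bit integer, which is what the values 1 / -1 / 0 imply. The name `pgn_result` is made up.

```python
import struct

# Result byte from White's point of view -> PGN result string.
RESULTS = {1: "1-0", -1: "0-1", 0: "1/2-1/2"}


def pgn_result(first_record):
    """Read the trailing signed "result" byte of a game's first record."""
    (winner,) = struct.unpack("b", first_record[-1:])
    return RESULTS[winner]
```

Reading the byte as signed matters: as an unsigned value, -1 would show up as 255.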

Trevor

Apr 23, 2018, 10:00:48 AM
to LCZero
Ok cool, I was wondering where I could find that table (for move probabilities). I'll see if I can get around to adding that. And of course, no objection to adding to scripts... Usually the last move is a mate, so if it's a win or loss, not too difficult to figure it out (should be one of the highest probabilities as you said). But I have a feeling there are some non-optimal draws in there - those would likely not be one of the highest probabilities.

Anyway, one of the reasons I wanted to do this is that I'm curious about some things dealing with Leela's policy. Specifically, I have a feeling that if you list out all possible moves in their normalized (flipped for Black) form, along with how often each of those moves is legal, we'll find that Leela's policy is heavily biased against playing moves that are rarely legal.

Here's the type of data I mean: from 200 or so training games, here's how often some of the most rarely legal and most often legal moves showed up. Anyway, my hypothesis is that moves like G8F6, even when it's clearly the best move, will be difficult for Leela's policy to see (especially when there are a lot of other options). I'll have to actually run the NN through this PGN to figure it out. If I'm right about this, I'm curious if there's some way to weight the moves/training to help remove that effect... In either case, I'm putting together some code to convert PGN to pandas dataframes for this type of analysis, and will put that up somewhere as well.

norm_uci  legal_count  play_count     ratio
a6b4                4           0  0.000000
a6b8                4           0  0.000000
a6c5                4           2  0.500000
a6c7                4           0  0.000000
g8h6                5           0  0.000000
g8e7                6           2  0.333333
g8f6                6           3  0.500000
...
norm_uci  legal_count  play_count     ratio
b2b3             9254         128  0.013832
g2g3             9080         196  0.021586
h2h3             8818         221  0.025062
b2b4             8351         124  0.014849
a2a3             8126         227  0.027935
h2h4             8026         143  0.017817
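
Counts like these could be gathered along the following lines. This is only a sketch of the bookkeeping, not the script that produced the table: it assumes each position has already been reduced to a tuple of (legal moves, played move, Black-to-move flag) in coordinate notation, for instance with python-chess, and the name `move_stats` is made up.

```python
from collections import Counter

FLIP_RANKS = str.maketrans("12345678", "87654321")


def move_stats(positions):
    """Tally how often each normalized move was legal and how often played.

    `positions` yields (legal_moves, played_move, black_to_move) tuples.
    Returns {norm_uci: (legal_count, play_count, ratio)}.
    """
    legal, played = Counter(), Counter()
    for legal_moves, played_move, black_to_move in positions:
        def normalize(move):
            # Flip ranks for Black so both sides share one move vocabulary.
            return move.translate(FLIP_RANKS) if black_to_move else move

        legal.update(normalize(m) for m in legal_moves)
        played[normalize(played_move)] += 1
    return {m: (n, played[m], played[m] / n) for m, n in legal.items()}
```

The ratio column is play_count / legal_count, which matches the table (for example 128 / 9254 ≈ 0.013832 for b2b3).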

anomander

Apr 23, 2018, 12:28:44 PM
to LCZero
Awesome Trevor, the script works like a charm, thanks a lot!

anomander

Apr 24, 2018, 7:25:22 AM
to LCZero
By the way, has anyone collected all the match game data? I could scrape it myself, but that would be unnecessary if someone else has already done it.

Rudra C

Apr 24, 2018, 8:04:48 AM
to LCZero
I am in the process of collecting training games and match games. So far I have collected 3.5 million training games and all match games.
I can share those through Google Drive.

Thanks. 

anomander

Apr 24, 2018, 8:08:20 AM
to LCZero
That would be great, thanks!

Jesse Jordache

Apr 24, 2018, 8:48:34 AM
to LCZero
You do God's work.

Do you know if anyone has done something similar with match data? I'd like to organize it by openings over time, as in the A0 paper.

Alexander Lyashuk

Apr 24, 2018, 8:54:13 AM
to Jesse Jordache, LCZero
Yeah, I think it would be a very good idea to convert all games into some format that is easy to analyze (annotated PGN, for example), so that people could do lots of interesting stats!


Trevor

Apr 24, 2018, 12:59:31 PM
to LCZero
And visualizations too... Check out this site. It includes a link to GitHub with all the code, and scripts to preprocess PGN. With Leela’s data being a time series, somebody could put together all sorts of cool animations to show how/what it’s learned.

https://blog.ebemunk.com/a-visual-look-at-2-million-chess-games/

Alexander Lyashuk

Apr 24, 2018, 1:15:12 PM
to Trevor G, LCZero
Wow, that's beautiful!


Rudra ठाकुर

Apr 24, 2018, 11:26:50 PM
to LCZero
Below is the link to Leela match Games from 0 to 100000.

Sorry, it is taking time to upload to Google Drive.
I wrote a PowerShell script to download these match games. Let me know if anyone is interested in that script.

After the match games, I will post all the training games too.

Thanks



Rudra ठाकुर

Apr 25, 2018, 1:04:11 AM
to LCZero

Below is the link to Leela match games from 100000 to 141554. As of 04/25/2018, 141554 matches have been played.

Thanks.

Esher

Apr 25, 2018, 3:30:34 AM
to LCZero
I'm also collecting all training games. Tomorrow I will have all games up to 8 million.
There are some errors in the PGN notation, e.g. en passant is written as xc6 instead of dxc6 or bxc6. Also, some PGN strings contain multiple games. These things will be corrected. Only the first 2,000-3,000 games will still have errors, because their PGN notation sometimes has no whitespace between moves.

I will create a CBH database. The White name will be the game number and the Black name the network ID. The result and ECO code will also be set if possible.
When I'm ready with the first 8 million games, I will publish the database somewhere and post the link here.

anomander

Apr 25, 2018, 12:41:28 PM
to LCZero
Thanks a lot for the data!

Alex K

Apr 26, 2018, 12:33:45 AM
to LCZero
Thanks for this.

I tried using it, and on some of the earlier training files I get an assertion error from the check on line 31 that the data is divisible by the chunk size.
But it works fine for the more recent data.

Alexander Lyashuk

Apr 26, 2018, 3:56:27 PM
to gru...@gmail.com, LCZero
If someone is scraping files from the server, please stop. That makes the server unstable.

Rudra ठाकुर

May 6, 2018, 9:48:56 PM
to LCZero
Hello,
Below is the link to Leela match games from 100000 to 200000.
https://drive.google.com/file/d/1jbEwVCQVHNxm-5O5PAEf_CQvH5JtOMsd/view?usp=sharing

Thanks,


anomander

May 7, 2018, 8:39:12 AM
to LCZero
Great, thanks!



Rudra ठाकुर

May 22, 2018, 7:58:35 AM
to LCZero
Hello,
Below is the link to Leela match games from 200000 to 300000.


Thanks.

Albert Silver

May 22, 2018, 10:34:59 AM
to LCZero
Would a reverse conversion, meaning PGN to the training format, allow supervised training from a PGN database? If so, could we beg you to write a script for this?

Trevor G

May 22, 2018, 11:35:05 AM
to Albert Silver, LCZero
You could do this, and it wouldn't be terribly difficult (I could probably whip this out in a couple of hours using my Python code). But what's the end goal? I have a feeling that you won't be able to accomplish what you want to with PGNs...



Albert Silver

May 22, 2018, 12:14:06 PM
to LCZero
I had in mind using the games as supervised training data (game results, etc.). Wouldn't this work?

Trevor G

May 22, 2018, 12:36:31 PM
to Albert Silver, LCZero
It depends what you mean by “work.” Yes, you can use PGNs as a source for supervised training, but I think most likely it won’t be enough data to do what you want, not with Leela’s 15x192 size, which I think has well over 10M parameters. Or you could end up terribly overfitting the network to the training set. How many games do you want to use?

Trevor G

May 22, 2018, 12:41:52 PM
to Albert Silver, LCZero
Keep in mind also that this type of data would differ from Leela’s self-play data, in that the self-play data provides a whole probability distribution for the next move. With PGNs you’d only get one policy value per position. The effect is that you’d want more data from PGNs than from self-play. But then again, maybe the PGN play is qualitatively better than self-play, enough to make up for that. But I wouldn’t necessarily assume that stronger/less random play is in fact qualitatively better for the network to learn from.
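
To make the difference concrete, here is a sketch of what a PGN-derived policy target would look like: a one-hot vector with all the probability mass on the move that was played, where self-play data would carry a full distribution. The 1858 size comes from earlier in this thread. `move_to_index` is a hypothetical lookup standing in for the real move->index mapping, and `one_hot_policy` is a made-up name.

```python
POLICY_SIZE = 1858  # number of move probabilities per record, per this thread


def one_hot_policy(played_move, move_to_index):
    """Build a policy target that puts probability 1.0 on the played move.

    `move_to_index` is a hypothetical dict from normalized move to its
    position in the policy vector.
    """
    target = [0.0] * POLICY_SIZE
    target[move_to_index[played_move]] = 1.0
    return target
```

Every other entry stays at zero, so a reasonable alternative move in the same position gets no credit at all.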

Albert Silver

May 22, 2018, 1:32:49 PM
to Trevor G, LCZero
I'm not 100% sure. Offhand I would guess a million games or so, possibly fewer, since I had in mind GM-level games. I wish I could include notes as well. Is there any way to append computer evals to some (or all) moves and have it use them?
