File format training data?

anomander

Apr 22, 2018, 2:15:39 PM

to LCZero

I must be missing something obvious, but what is the file format of the training data files?

Alexander Lyashuk

Apr 22, 2018, 2:18:58 PM

to mnor...@gmail.com, LCZero

The format is the following structure https://github.com/glinscott/leela-chess/blob/master/training/tf/chunkparser.py#L115 repeated.

The whole file is gzip'ed.
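
A minimal sketch of reading such a file in Python, assuming the file really is nothing but fixed-size records back to back. The record size is passed in as a parameter because it is not stated here; it has to match the structure defined in chunkparser.py. The name read_records is hypothetical.

```python
import gzip


def read_records(path, record_size):
    """Return the fixed-size records stored in a gzip-compressed chunk file.

    record_size is an assumption supplied by the caller; it must equal the
    size of the structure defined in chunkparser.py.
    """
    with gzip.open(path, "rb") as f:
        data = f.read()
    # A length that is not a whole number of records means the wrong size
    # was given or the file uses a different format version.
    if len(data) % record_size != 0:
        raise ValueError(
            f"{len(data)} bytes is not a multiple of record size {record_size}"
        )
    return [data[i:i + record_size] for i in range(0, len(data), record_size)]
```

Each returned bytes object can then be unpacked with the struct module using the layout from chunkparser.py.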

--
You received this message because you are subscribed to the Google Groups "LCZero" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lczero+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lczero/1dcd6b80-819d-4b31-8831-b975974bc31c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

anomander

Apr 22, 2018, 2:23:25 PM

to LCZero

Thanks! So, is there an easy way to extract pgn information from the unpacked files?

Alexander Lyashuk

Apr 22, 2018, 2:37:51 PM

to mnor...@gmail.com, LCZero

No, there is no easy way, but there is a hard way.

The first 12 of the "packed bit planes" are 64-bit bitboards, which represent:

Pawns of side to move
Knights of side to move
Bishops of side to move
Rooks of side to move
Queens of side to move
King of side to move
Pawns of other side
Knights of other side
Bishops of other side
Rooks of other side
Queens of other side
King of other side

These encode the board state. By comparing that board state with the board at the next move, one can work out which move was played.
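
A rough sketch of that comparison in Python. Two things here are assumptions rather than facts about the format: that bit i of each bitboard corresponds to square i with a1 = 0 and h8 = 63, and that consecutive positions are each stored from the point of view of the side to move, so every second position has to be flipped back before comparing. All function names are hypothetical.

```python
PIECES = "PNBRQKpnbrqk"  # upper case: side to move, lower case: other side


def planes_to_board(planes):
    """Map 12 bitboards (ints) to {square_index: piece_letter}.

    Assumption: bit i corresponds to square i, with a1 = 0 and h8 = 63.
    The real bit order must be checked against the engine source.
    """
    board = {}
    for piece, bits in zip(PIECES, planes):
        for sq in range(64):
            if (bits >> sq) & 1:
                board[sq] = piece
    return board


def flip_perspective(board):
    """Mirror the ranks and swap sides, so that a position stored from the
    opponent's point of view lines up with the previous one."""
    return {sq ^ 56: piece.swapcase() for sq, piece in board.items()}


def changed_squares(before, after):
    """Squares whose contents differ between two boards in the same orientation."""
    return sorted(sq for sq in range(64) if before.get(sq) != after.get(sq))
```

An ordinary move changes exactly two squares (origin and destination). Castling and en passant change more, so they need special handling.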

anomander

Apr 22, 2018, 2:40:19 PM

to LCZero

Alright, thanks a lot for the information!

Trevor

Apr 22, 2018, 7:53:29 PM

to LCZero

I just went ahead and did this. The script below converts training data to PGN; just edit the filename/directory at the top. You need Python 3 and python-chess (pip install python-chess). Annoyingly, the last move is not recoverable from the training data, so the script does not output results in the PGN. However, it's probably possible to get W/L/D from the training data at the last node; I just haven't gotten that far.

https://gist.github.com/so-much-meta/8dbfd1b3e667b0cdcab594f744791327

Trevor

Apr 22, 2018, 9:30:57 PM

to LCZero

I updated this to include the result in the PGN as well. It is still missing the final move, though.

Jeremy Zucker

Apr 23, 2018, 12:30:40 AM

to Trevor, LCZero

Thanks Trevor!

Alexander Lyashuk

Apr 23, 2018, 3:11:16 AM

to trevor...@gmail.com, LCZero

Thanks!

There is actually also a way to recover the last move:

The 1858 probabilities are the probabilities of playing each possible move. Here is the index->move mapping:

https://github.com/glinscott/leela-chess/blob/master/lc0/src/chess/bitboard.cc#L26

Just pick the move with the highest probability, and that's (very likely) it.

For Black, the move should be flipped: tr/12345678/87654321/
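
The same rank flip can be sketched in Python with str.translate. The function name flip_move is hypothetical; the translation table is exactly the tr expression above.

```python
# Same substitution as tr/12345678/87654321/: digits are mirrored, letters untouched.
FLIP_RANKS = str.maketrans("12345678", "87654321")


def flip_move(uci):
    """Mirror the ranks of a UCI move string, e.g. for a move stored from Black's point of view."""
    return uci.translate(FLIP_RANKS)
```

For example, flip_move("g8f6") gives "g1f3". File letters and any promotion suffix are left unchanged, and applying the function twice returns the original move.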

Also, a pawn promotion to a knight looks the same as a normal move: e.g. e7d8, if a pawn is on e7, means that it captures on d8 and promotes to a knight.

During the training games, the move with the maximum probability is not necessarily the move made, due to the temperature=1 setting.

Because of that, it may also be useful to annotate in the PGN when the best move was not played, and what its relative probability (move probability / best move probability) was.

E.g. 10. Nf3 (45%, best move: Bxe7)
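
A small sketch of producing such an annotation in Python. The input format (a dict from SAN move to policy probability) and the function name annotate are assumptions made for illustration; the output follows the example format above.

```python
def annotate(played_san, probs):
    """Return the move text, with a note if the played move was not the top choice.

    probs maps SAN move -> policy probability (hypothetical input format).
    """
    best = max(probs, key=probs.get)
    if best == played_san:
        return played_san
    # Relative probability: played move's probability divided by the best move's.
    relative = probs[played_san] / probs[best]
    return f"{played_san} ({relative:.0%}, best move: {best})"
```

The move number prefix (the "10." above) is left to whatever code writes the PGN.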

Alexander Lyashuk

Apr 23, 2018, 3:35:22 AM

to trevor...@gmail.com, LCZero

Also, if you don't object, it'd be useful to have this script in the repository (in the scripts/ directory). In gists it may quickly be lost or forgotten.

Alexander Lyashuk

Apr 23, 2018, 4:12:40 AM

to trevor...@gmail.com, LCZero

As for win/loss/draw: it's stored in every node, and it's most convenient to extract it from the first chunk (for Black's moves it is flipped, and the first position is White's).

It's stored in the last "result" byte, which is 1 if white won, -1 if black, 0 if draw.
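
A minimal sketch of turning that byte into a PGN result tag in Python, assuming the record passed in is the game's first position (White to move) and that the result is a signed 8-bit integer in the record's final byte. The names are hypothetical.

```python
import struct

RESULT_TO_PGN = {1: "1-0", -1: "0-1", 0: "1/2-1/2"}


def pgn_result(first_record):
    """Read the trailing signed byte of the game's first record and map it to a PGN result."""
    # "b" is a signed char, so the byte 0xFF is read as -1.
    (result,) = struct.unpack("b", first_record[-1:])
    return RESULT_TO_PGN[result]
```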

Trevor

Apr 23, 2018, 10:00:48 AM

to LCZero

Ok cool, I was wondering where I could find that table (for move probabilities). I'll see if I can get around to adding that. And of course, no objection to adding to scripts... Usually the last move is a mate, so if it's a win or loss, not too difficult to figure it out (should be one of the highest probabilities as you said). But I have a feeling there are some non-optimal draws in there - those would likely not be one of the highest probabilities.

Anyway, one of the reasons I wanted to do this is that I'm curious about some aspects of Leela's policy. Specifically, I have a feeling that if you list all possible moves in their normalized form (flipped for Black), along with how often each move is legal, we'll find that Leela's policy is heavily biased against moves that are rarely legal.

Here's the type of data I mean: from 200 or so training games, here's how often some of the most rarely legal and most often legal moves showed up. My hypothesis is that moves like G8F6, even when clearly the best move, will be difficult for Leela's policy to see (especially when there are a lot of other options). I'll have to actually run the NN through this PGN to figure it out. If I'm right about this, I'm curious whether there's some way to weight the moves/training to help remove that effect. In either case, I'm putting together some code to convert PGN to pandas dataframes for this type of analysis, and will put that up somewhere as well.

norm_uci  legal_count  play_count     ratio
a6b4                4           0  0.000000
a6b8                4           0  0.000000
a6c5                4           2  0.500000
a6c7                4           0  0.000000
g8h6                5           0  0.000000
g8e7                6           2  0.333333
g8f6                6           3  0.500000

...

norm_uci  legal_count  play_count     ratio
b2b3             9254         128  0.013832
g2g3             9080         196  0.021586
h2h3             8818         221  0.025062
b2b4             8351         124  0.014849
a2a3             8126         227  0.027935
h2h4             8026         143  0.017817
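
The three columns can be tallied with a few lines of plain Python. This is only a sketch: it assumes each position has already been reduced to its list of legal moves plus the move played, both as normalized UCI strings, and the name move_stats is hypothetical.

```python
from collections import Counter


def move_stats(positions):
    """Tally how often each move was legal and how often it was played.

    positions: iterable of (legal_moves, played_move) pairs, with moves given
    as normalized UCI strings (hypothetical input format).
    Returns {move: (legal_count, play_count, ratio)}.
    """
    legal = Counter()
    played = Counter()
    for legal_moves, played_move in positions:
        legal.update(legal_moves)
        played[played_move] += 1
    # ratio = play_count / legal_count, as in the table above
    return {m: (n, played[m], played[m] / n) for m, n in legal.items()}
```

The legal move lists themselves could come from python-chess while replaying the PGN.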

anomander

Apr 23, 2018, 12:28:44 PM

to LCZero

Awesome Trevor, the script works like a charm, thanks a lot!

anomander

Apr 24, 2018, 7:25:22 AM

to LCZero

By the way, has anyone collected all the match game data? I could scrape it myself, but it would be unnecessary to do it if someone else has already done it.

Rudra C

Apr 24, 2018, 8:04:48 AM

to LCZero

I am in the process of collecting training games and matches. So far I have collected 3.5 million training games and all match games.

I can share those through Google Drive.

Thanks.

anomander

Apr 24, 2018, 8:08:20 AM

to LCZero

That would be great, thanks!

Jesse Jordache

Apr 24, 2018, 8:48:34 AM

to LCZero

You do God's work.

Do you know if anyone has done something similar with the match data? I'd like to organize it by openings over time, as in the A0 paper.

Alexander Lyashuk

Apr 24, 2018, 8:54:13 AM

to Jesse Jordache, LCZero

Yeah, I think it would be a very good idea to convert all games into some format that is easy to analyze (annotated PGN, for example), so that people could do lots of interesting stats!

Trevor

Apr 24, 2018, 12:59:31 PM

to LCZero

And visualizations too. Check out this site. It includes a link to GitHub with all the code, and scripts to preprocess PGN. With Leela’s data being a time series, somebody could put together all sorts of cool animations to show how and what it has learned.

https://blog.ebemunk.com/a-visual-look-at-2-million-chess-games/

Alexander Lyashuk

Apr 24, 2018, 1:15:12 PM

to Trevor G, LCZero

Wow, that's beautiful!

Rudra ठाकुर

Apr 24, 2018, 11:26:50 PM

to LCZero

Below is the link to Leela match Games from 0 to 100000.

https://drive.google.com/file/d/1NI6MFdEy5ijjqKfU-UBrAsCCuT8c9l79/view?usp=sharing

Sorry, it is taking time to upload to Google Drive.

I wrote a PowerShell script to download these match games. Let me know if anyone is interested in that script.

After match games, I will post all training games too.

Thanks

Rudra ठाकुर

Apr 25, 2018, 1:04:11 AM

to LCZero

Below is the link to Leela match Games from 100000 to 141554. As of 04/25/2018, 141554 matches have been played.

https://drive.google.com/file/d/1Jfc9pj9qtdcVND-_aNCxV28e9w-bjrby/view?usp=sharing

Thanks.

Esher

Apr 25, 2018, 3:30:34 AM

to LCZero

I'm also collecting all training games. Tomorrow I will have all games up to 8 million.

There are some errors in the PGN notation; e.g. en passant is written as xc6 instead of dxc6 or bxc6. Also, some PGN strings contain multiple games. These things will be fixed. Only the first 2-3000 games will still have errors, because the PGN notation there sometimes has no white space between moves.

I will create a CBH database. The White name will be the game number and the Black name the network ID. If possible, the result and ECO code will also be set.

When I'm ready with the first 8 million games, I will publish the database somewhere and post the link here.

anomander

Apr 25, 2018, 12:41:28 PM

to LCZero

Thanks a lot for the data!

Alex K

Apr 26, 2018, 12:33:45 AM

to LCZero

Thanks for this,

I tried using it, and on some of the earlier training files I get an assertion error from the check on line 31 that the data is divisible by the chunk size.

But it works fine for the more recent data.

Alexander Lyashuk

Apr 26, 2018, 3:56:27 PM

to gru...@gmail.com, LCZero

If someone is scraping files from the server, please stop. That makes the server unstable.

Rudra ठाकुर

May 6, 2018, 9:48:56 PM

to LCZero

Hello,

Below is the link to Leela match Games from 100000 to 200000.

https://drive.google.com/file/d/1jbEwVCQVHNxm-5O5PAEf_CQvH5JtOMsd/view?usp=sharing

Thanks,

anomander

May 7, 2018, 8:39:12 AM

to LCZero

Great, thanks!

Rudra ठाकुर

May 22, 2018, 7:58:35 AM

to LCZero

Hello,

Below is the link to Leela match Games from 200000 to 300000.

https://drive.google.com/file/d/1ka4lHWKQC7Bavn4wr_mTLNJF8332JxeG/view?usp=sharing

Thanks.

Albert Silver

May 22, 2018, 10:34:59 AM

to LCZero

Would a reverse conversion, meaning PGN to the Training format, allow supervised training from a PGN database? If so, could we beg you to do a script for this?

Trevor G

May 22, 2018, 11:35:05 AM

to Albert Silver, LCZero

You could do this, and it wouldn't be terribly difficult (I could probably whip this out in a couple of hours using my Python code). But what's the end goal? I have a feeling that you won't be able to accomplish what you want to with PGNs...

Albert Silver

May 22, 2018, 12:14:06 PM

to LCZero

I had in mind using the games for supervised training, with the game results and so on. Would this not work?

Trevor G

May 22, 2018, 12:36:31 PM

to Albert Silver, LCZero

It depends what you mean by “work.” Yes, you can use PGNs as a source for supervised training, but I think most likely it won’t be enough data to do what you want - not with Leela’s 15x192 size which I think has well over 10M parameters. Or you could end up just terribly overfitting the network to the training set. How many games do you want to use?

Trevor G

May 22, 2018, 12:41:52 PM

to Albert Silver, LCZero

Keep in mind also that this type of data would differ from Leela’s self-play data: the self-play data provides a whole probability distribution over the next move, whereas with PGNs you’d only get one policy target (the move played) per position. The effect is that you’d want more data from PGNs than from self-play. Then again, maybe the PGN play is qualitatively better than self-play, enough to make up for that. But I wouldn’t necessarily assume that stronger, less random play is in fact qualitatively better for the network to learn from.
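
To make the difference concrete, here is a sketch of the policy target a PGN position would give. It assumes the played move has already been converted to its index in the 1858-entry move table mentioned earlier in the thread; the function name is hypothetical.

```python
POLICY_SIZE = 1858  # number of move probabilities per position, per the thread


def one_hot_policy(move_index, size=POLICY_SIZE):
    """Policy target for a PGN position: all mass on the move that was played."""
    if not 0 <= move_index < size:
        raise IndexError(f"move index {move_index} outside 0..{size - 1}")
    target = [0.0] * size
    target[move_index] = 1.0
    return target
```

A self-play record would instead carry nonzero probability on several moves, which is the extra information a PGN cannot supply.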

Albert Silver

May 22, 2018, 1:32:49 PM

to Trevor G, LCZero

I'm not 100% sure. Offhand I would guess a million games or so, possibly less since I had in mind GM level games. Wish I could include notes as well. Is there any way to append computer evals to some moves (or all) and have it use them?
