Multiline input records (or maybe merging rows by condition, not by fixed count)

Mekk

unread,

Oct 12, 2011, 2:13:10 PM10/12/11

to Google Refine

I decided to try playing with Google Refine on chess PGN file (you can
take http://www.chesscenter.com/twic/zips/twic883g.zip as an example).
This is somewhat non-trivial format as one recognizes every new
"record" by complicated condition (new game is started by a line
starting with '[' unless previous line also started this way). Simple
example:

[Event "4th Final Masters"]

[Date "2011.10.06"]

[Round "6"]

[White "Vallejo Pons,F"]

[Black "Aronian,L"]

[Result "1/2-1/2"]

1. Nf3 d5 2. d4 c5 3. c4 dxc4 4. e4 cxd4 5. Qxd4 Bd7 6. Bxc4 Nc6 7.
Qe3 e6 8.

O-O Qb8 9. b3 Ne5 10. Bb2 Bd6 11. Nbd2 Nf6 12. Qg5 Ng6 13. Bxf6 h6 14.
Qa5 Bc7

15. Qc5 Bd6 16. Qa5 Bc7 17. Qh5 gxf6 18. Bb5 Bxb5 19. Qxb5+ Kf8 20.
Rad1 Kg7 21.

Nc4 h5 22. Rd7 a6 23. Qb4 b5 24. Nd6 Bxd6 25. Rxd6 Qc7 26. Rfd1 Rad8
27. g3 h4

28. Rxd8 Rxd8 29. Rxd8 Qxd8 30. Qd4 Qc8 31. Nxh4 Nxh4 32. gxh4 Qc2 33.
e5 fxe5

34. Qxe5+ Kh7 35. Qh5+ Kg7 36. Qg5+ Kh7 37. Qh5+ Kg7 38. Qg5+ Kh7 39.
Qh5+ 1/2-1/2

[Event "4th Final Masters"]

[Site "Sao Paulo/Bilbao BRA/ESP"]

[Date "2011.10.06"]

[Round "6"]

[White "Carlsen,M"]

[Black "Anand,V"]

[Result "1/2-1/2"]

[WhiteElo "2823"]

[BlackElo "2817"]

1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Nf3 b6 5. Qc2 Bb7 6. a3 Bxc3+ 7. Qxc3
O-O 8.

Bg5 d6 9. Nd2 Nbd7 10. f3 h6 11. Bh4 Rc8 12. e4 c5 13. Bd3 d5 14. exd5
exd5 15.

O-O dxc4 16. Bxc4 cxd4 17. Qxd4 Nc5 18. Bxf6 Qxf6 19. Qxf6 gxf6 20.
Rfd1 Na4 21.

Rab1 Rfd8 22. Bb5 Bc6 23. Be2 Bd5 24. Nf1 Be6 25. Ne3 f5 26. g3 Kg7
27. Kf2 Kf6

28. Ba6 Rb8 29. b4 Nc3 30. Rxd8 Rxd8 31. Rc1 Nd5 32. Ng2 Rd7 33. Rc2
Rc7 34. Ne3

Rxc2+ 35. Nxc2 f4 36. Nd4 Bd7 37. Ke2 fxg3 38. hxg3 Ne7 39. Ke3 Nf5+
40. Nxf5

Kxf5 41. Kd4 Ke6 42. Bc4+ Ke7 43. f4 f6 44. Bd5 Kd6 45. Bf3 Be6 46.
Ba8 Bf5 47.

Bf3 Be6 48. Ba8 Bf5 49. Bf3 Be6 1/2-1/2

I couldn't find a way to split it properly while importing, so I
imported it in the line mode, and without much problems I managed to
create columns like IsHeader (simple check for leading [), HeaderName
and HeaderValue (splitting and cleaning rows with IsHeader=true) and
even White, Black, Site, Date, Result etc (just clicking over all
values in HeaderName facets and creating new columns based on
HeaderValue). So I finally got sparse table with sth like

IsHeader White Black Result Site Moves
true Vallejo
true Aronian
true 1/2-1/2
false
1. Nf3 d5 2. d4 c5 3. c4 dxc4 (...)
false
O-O Qb8 9. b3 Ne5 10. Bb2 (...)
true Carlsen
true Anand
true 1/2-1/2
true Bilbao
false
1. d4 Nf6 2. c4 e6 (...)
false
Bg5 d6 9. Nd2 (...)

Now I would like to merge the records to obtain dense matrix but can
not find a way. There is a transpose function which may do what I
need, but it requires fixed row count. But what I need here is sth
like: "merge rows with isheader=true until some row with
isheader=false appears; merge rows with isheader=false concatenating
moves value until some row with isheader=true appears, then merge
every two rows"

Is there a way? Or am I wrong with my approach?

Of course I could preparse the data with awk or perl before importing,
but it would make less fun ;-)

Best regards and thanks for very interesting tool I recently discovered

David Huynh

unread,

Oct 12, 2011, 2:17:17 PM10/12/11

to google...@googlegroups.com

I'd suggest trying the "Blank Down" command on the isheader column.

David

Mekk

unread,

Oct 12, 2011, 5:33:32 PM10/12/11

to Google Refine

On Oct 12, 2:17 pm, David Huynh <dfhu...@gmail.com> wrote:
> I'd suggest trying the "Blank Down" command on the isheader column.
>

It properly made my table even sparser, now I have sth like (replacing
values by single letters for shorter presentation):

IsHeader A B C D M
true X
Y
Z
false xxxxx
yyyyyy
true P
Q
R
S
false tttttttt
mmmm

The question is how to merge those rows starting on every non-null
IsHeader value. I'd love to get sth like

IsHeader A B C D M
true X Y Z
false xxxxxyyyyyy
true P Q R S
false ttttttttmmmm

Mekk

unread,

Oct 12, 2011, 6:53:25 PM10/12/11

to Google Refine

I found this article: http://googlerefine.blogspot.com/2011/08/merg-records-spread-on-multiple-rows.html
it is probably what I need and I plan giving it a try, although if
there is a better way…

David Huynh

unread,

Oct 12, 2011, 7:29:23 PM10/12/11

to google...@googlegroups.com

Mekk,

I created a project from the sample data you gave (in 2.5 RC1) and then apply the following operations:

[
{
"op": "core/column-addition",
"description": "Create column isHeader at index 1 based on column Column 1 using expression grel:value.startsWith(\"[\")",

"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "isHeader",
"columnInsertIndex": 1,
"baseColumnName": "Column 1",
"expression": "grel:value.startsWith(\"[\")",
"onError": "set-to-blank"

},
{
"op": "core/column-move",
"description": "Move column isHeader to position 0",
"columnName": "isHeader",

"index": 0
},
{
"op": "core/blank-down",
"description": "Blank down cells in column isHeader",
"engineConfig": {

"facets": [],
"mode": "row-based"
},
"columnName": "isHeader"
},
{
"op": "core/text-transform",

"description": "Text transform on cells in column isHeader using expression null",
"engineConfig": {
"facets": [
{
"invert": false,
"expression": "value",
"selectError": false,
"omitError": false,
"selectBlank": false,

"name": "isHeader",
"omitBlank": false,
"columnName": "isHeader",
"type": "list",

"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
]
}
],
"mode": "row-based"
},
"columnName": "isHeader",

"expression": "null",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10
},
{
"op": "core/column-addition",
"description": "Create column Moves at index 2 based on column Column 1 using expression grel:forEach(filter(row.record.cells[\"Column 1\"], c, and(isNotNull(c), isNonBlank(c.value), not(c.value.startsWith(\"[\")))), c, c.value).join(\" | \")",

"engineConfig": {
"facets": [
{
"invert": false,
"expression": "value",
"selectError": false,

"omitError": false,
"selectBlank": false,
"name": "isHeader",
"omitBlank": false,
"columnName": "isHeader",

"type": "list",
"selection": [
{
"v": {
"v": true,
"l": "true"

}
}
]
}
],
"mode": "row-based"
},
"newColumnName": "Moves",

"columnInsertIndex": 2,
"baseColumnName": "Column 1",
"expression": "grel:forEach(filter(row.record.cells[\"Column 1\"], c, and(isNotNull(c), isNonBlank(c.value), not(c.value.startsWith(\"[\")))), c, c.value).join(\" | \")",

"onError": "set-to-blank"
},
{
"op": "core/row-removal",
"description": "Remove rows",
"engineConfig": {

"facets": [
{
"invert": false,
"expression": "grel:not(value.startsWith(\"[\"))",
"selectError": false,

"omitError": false,
"selectBlank": false,
"name": "Column 1",
"omitBlank": false,
"columnName": "Column 1",

"type": "list",
"selection": [
{
"v": {
"v": true,
"l": "true"

}
}
]
}
],
"mode": "row-based"
}
},
{
"op": "core/column-addition",

"description": "Create column Event at index 2 based on column Column 1 using expression grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[Event \"))[0].value",

"engineConfig": {
"facets": [
{
"invert": false,
"expression": "value",
"selectError": false,

"omitError": false,
"selectBlank": false,
"name": "isHeader",
"omitBlank": false,
"columnName": "isHeader",

"type": "list",
"selection": [
{
"v": {
"v": true,
"l": "true"

}
}
]
}
],
"mode": "row-based"
},
"newColumnName": "Event",

"columnInsertIndex": 2,
"baseColumnName": "Column 1",
"expression": "grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[Event \"))[0].value",

"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column Date at index 2 based on column Column 1 using expression grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[Date \"))[0].value",

"engineConfig": {
"facets": [
{
"invert": false,
"expression": "value",
"selectError": false,

"omitError": false,
"selectBlank": false,
"name": "isHeader",
"omitBlank": false,
"columnName": "isHeader",

"type": "list",
"selection": [
{
"v": {
"v": true,
"l": "true"

}
}
]
}
],
"mode": "row-based"
},
"newColumnName": "Date",

"columnInsertIndex": 2,
"baseColumnName": "Column 1",
"expression": "grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[Date \"))[0].value",

"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column White at index 2 based on column Column 1 using expression grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[White \"))[0].value",

"engineConfig": {
"facets": [
{
"invert": false,
"expression": "value",
"selectError": false,

"omitError": false,
"selectBlank": false,
"name": "isHeader",
"omitBlank": false,
"columnName": "isHeader",

"type": "list",
"selection": [
{
"v": {
"v": true,
"l": "true"

}
}
]
}
],
"mode": "row-based"
},
"newColumnName": "White",

"columnInsertIndex": 2,
"baseColumnName": "Column 1",
"expression": "grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[White \"))[0].value",

"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column Black at index 2 based on column Column 1 using expression grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[Black \"))[0].value",

"engineConfig": {
"facets": [
{
"invert": false,
"expression": "value",
"selectError": false,

"omitError": false,
"selectBlank": false,
"name": "isHeader",
"omitBlank": false,
"columnName": "isHeader",

"type": "list",
"selection": [
{
"v": {
"v": true,
"l": "true"

}
}
]
}
],
"mode": "row-based"
},
"newColumnName": "Black",

"columnInsertIndex": 2,
"baseColumnName": "Column 1",
"expression": "grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[Black \"))[0].value",

"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column Result at index 2 based on column Column 1 using expression grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[Result \"))[0].value",

"engineConfig": {
"facets": [
{
"invert": false,
"expression": "value",
"selectError": false,

"omitError": false,
"selectBlank": false,
"name": "isHeader",
"omitBlank": false,
"columnName": "isHeader",

"type": "list",
"selection": [
{
"v": {
"v": true,
"l": "true"

}
}
]
}
],
"mode": "row-based"
},
"newColumnName": "Result",

"columnInsertIndex": 2,
"baseColumnName": "Column 1",
"expression": "grel:filter(row.record.cells[\"Column 1\"], c, c.value.startsWith(\"[Result \"))[0].value",

"onError": "set-to-blank"
},
{
"op": "core/row-removal",
"description": "Remove rows",
"engineConfig": {

"facets": [
{
"invert": false,
"expression": "value",
"selectError": false,
"omitError": false,
"selectBlank": true,
"name": "isHeader",
"omitBlank": false,
"columnName": "isHeader",

"type": "list",
"selection": []
}
],
"mode": "row-based"
}
},

{
"op": "core/column-reorder",
"description": "Reorder columns",
"columnNames": [
"Result",

"Black",
"White",
"Date",
"Event",
"Moves"
]
}
]

The main trick is to move the isHeader column to the first column so that it defines records. Then in row-based mode, filter for rows with isHeader = true, and use an expression that access the whole record for each row--this gives you all "Column 1" cells across all rows within the record containing the current row

row.record.cells["Column 1"]

David

Mekk

unread,

Oct 12, 2011, 8:35:09 PM10/12/11

to Google Refine

Interesting approach, thank you. I had the feeling that "records" may
be an answer, but had no clue how to define and use them.

I tried applying this idea manually according to your description (on
2.1) and I am going towards proper results (I even managed to merge
move lines using join(row.record.cells["Moves"].value, " ")

I still feel that the operation of gathering data from multiple lines
with specific field delimiting (by being non-null) start of new record
is frequent enough that it may deserve some interface (maybe just let
one "flatten records to rows" - concatenating non null values in every
group)? But that's just an opinion.

Once more thank you for your help.

David Huynh

unread,

Oct 12, 2011, 11:24:16 PM10/12/11

to google...@googlegroups.com

That's a good idea. Though I wonder if that's a primitive operation or a macro of operations.

David

Reply all

Reply to author

Forward