Opening a really messy csv file


George

Sep 23, 2021, 1:03:49 PM
to OpenRefine
Hello data wranglers,

I tried to open a really messy CSV file (https://drive.google.com/file/d/1KI9DxR5biVC_fXcDL59j3jzSux8Ur1a3/view?usp=sharing) with OpenRefine, but without success. The file contains empty lines, so I tried removing them, first with the "Store blank rows" option unchecked, then with Notepad++'s "Remove empty lines", but without success. I have also tried Power Query and R's read.csv - the same result.

Can somebody help with opening this file?

Paul R Pival

Sep 23, 2021, 3:49:04 PM
to OpenRefine
Hi George, it seems to work for me, though it took a long time to load. I did uncheck the option to store blank rows, and on initial load OpenRefine seemed to think it was looking at a TSV instead of a CSV, so I manually selected CSV. At which step is it breaking for you, and what exactly is happening?

Paul
[Attachment: 2021-09-23_13-43-03.png]

George

Sep 24, 2021, 2:50:41 AM
to OpenRefine
Hi Paul, as I said, it doesn't work for me: I did exactly the same, but it opens a different number of rows and columns than the original, and the data is quite messed up. The original data has 11521 rows and 35 columns; when opened in OpenRefine it's 999+ columns and 5000+ rows.

jonathan...@gmail.com

Sep 24, 2021, 3:33:13 AM
to openr...@googlegroups.com

To add in another tester’s experience, I just tried it –

Refine opened 10061 rows (or records) for me – I had to manually switch from TSV to CSV to see it as a table. But it brought in a lot of empty rows. And if I untick the "Store blank rows" option it opens faster, but finds only 5030 rows. However, it sees 994 columns!

In R, using read_csv() from the readr package, I get 11520 records organised in 35 columns. Base R's read.csv() gets me 17425 rows, also with 35 columns.

Excel reads it in as 11520 records, but with an empty spacer row after each entry, making a total of 23040 rows.

Interesting… I haven’t had such varying results from three different tools before.

Jonathan Stoneman


Owen Stephens

Sep 24, 2021, 3:57:27 AM
to OpenRefine
The main issue I'm seeing with the file (aside from the blank lines, which we can just ignore as you say) is that commas are used as the column separators, but there are also unquoted commas inside fields. For example we have the row:

1042213,Master Cyl Kit - R - MSR213,Brake - Repair Kits,Brake Systems,Master Cyl & Caliper Repair Kits,Master Cylinder Repair Kits,1,TOURMAX,TOURMAX,1,1042213.jpg,photo.epcloud.bickers-online.co.uk/SKU/1042213.jpg,https://data.bickersplc.com/images/zoom/1042213.jpg.jpg,,,,,MSR213,,3,,20,27.23,27.23,31.58680000000,31,1,Rear brake master cylinder repair kit,<p>Gives new life to old master cylinders. Contains all parts needed for overhaul. Cheaper alternative to replacing the cylinder.</p><p><strong><u>Fits:</u></strong><br/><small>For full fitment details in a searchable list please check the <b>FITMENTS</b> tab.</small><ul><li title="Yamaha Compatible Models"><b>Yamaha:</b> DT 125R, DT 125R/RE, DT 125RE, DT 125RR, DT 125RRF, DT 125X, DT 200WR, MT-03 660, TDR 125R, TT 660R, VM-X 1200USA, VM-X 1200V-Max, VMX 1700V-Max, WR 125R, WR 125X, WR 200D, WR 250ZB, WR 250ZD, WR 250ZE, WR 250ZF, WR 250ZG, WR 250ZH, XT 660R, XT 660X (1D2/10S), XT-Z 660Tenere, XV 1700Road Star Warrior, XV 1900C Raider, XV 1900Midnight Star, XVZ 1300Royal Star Venture, XVZ 1300Royal Star/Tour Classic, XVZ 1300Wild Star, YZ 125A1, YZ 125B1, YZ 125D1, YZ 125E1, YZ 125F1, YZ 125G1, YZ 125R, YZ 125S, YZ 250A, YZ 250B, YZ 250D, YZ 250E, YZ 250F, YZ 250G, YZ 80F, YZ 80G, YZ 80H, YZ 80J, YZ 80K, YZ 80L, YZ 80M, YZ 80N, YZ 80P, YZ 80R, YZ 80X, YZ 85A, YZ 85B, YZ 85P, YZ 85R, YZ 85S, YZ 85T, YZ 85V, YZ 85W, YZ 85X, YZ 85Y, YZ 85Z</li></ul></p>,,,1042213 MASTER CYLIND KIT REAR MSR213 Rear brake master cylinder repair kit Master Cyl Kit R MSR213 SKU1042213 Master Cyl Kit R MSR213 TOURMAX Master Cyl & Caliper Repair Kits Master Cylinder Repair Kits,B2C,Master Cyl Kit - R - MSR213,TOURMAX.jpg

Looking at the "LongDescription" field, it is:
<p>Gives new life to old master cylinders. Contains all parts needed for overhaul. Cheaper alternative to replacing the cylinder.</p><p><strong><u>Fits:</u></strong><br/><small>For full fitment details in a searchable list please check the <b>FITMENTS</b> tab.</small><ul><li title="Yamaha Compatible Models"><b>Yamaha:</b> DT 125R, DT 125R/RE, DT 125RE, DT 125RR, DT 125RRF, DT 125X, DT 200WR, MT-03 660, TDR 125R, TT 660R, VM-X 1200USA, VM-X 1200V-Max, VMX 1700V-Max, WR 125R, WR 125X, WR 200D, WR 250ZB, WR 250ZD, WR 250ZE, WR 250ZF, WR 250ZG, WR 250ZH, XT 660R, XT 660X (1D2/10S), XT-Z 660Tenere, XV 1700Road Star Warrior, XV 1900C Raider, XV 1900Midnight Star, XVZ 1300Royal Star Venture, XVZ 1300Royal Star/Tour Classic, XVZ 1300Wild Star, YZ 125A1, YZ 125B1, YZ 125D1, YZ 125E1, YZ 125F1, YZ 125G1, YZ 125R, YZ 125S, YZ 250A, YZ 250B, YZ 250D, YZ 250E, YZ 250F, YZ 250G, YZ 80F, YZ 80G, YZ 80H, YZ 80J, YZ 80K, YZ 80L, YZ 80M, YZ 80N, YZ 80P, YZ 80R, YZ 80X, YZ 85A, YZ 85B, YZ 85P, YZ 85R, YZ 85S, YZ 85T, YZ 85V, YZ 85W, YZ 85X, YZ 85Y, YZ 85Z</li></ul></p>

All the commas in the list of compatible models (DT 125R, DT 125R/RE, DT 125RE, etc.) mean that when OpenRefine splits the field on commas (to get the data into columns), it has no way of telling the difference between the commas between fields and the commas within fields - so in this particular case the record ends up with around 60 more columns than expected, one for each compatible model in the list.

In theory, comma-separated files should put quotation marks around field content in these cases so that software, including OpenRefine, knows to ignore commas inside a quoted field - but unfortunately that hasn't happened here.
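
To make that concrete, here is a small Python sketch (the field values are made up, in the spirit of the row above) showing how a standard CSV parser behaves with and without quoting:

    # Made-up example row: the third field contains an HTML fragment with commas.
    import csv
    import io

    unquoted = 'SKU1,Part A,<p>Fits: DT 125R, DT 125RE, DT 125RR</p>\n'
    quoted   = 'SKU1,Part A,"<p>Fits: DT 125R, DT 125RE, DT 125RR</p>"\n'

    print(next(csv.reader(io.StringIO(unquoted))))  # 5 fields - the HTML fragment is split apart
    print(next(csv.reader(io.StringIO(quoted))))    # 3 fields - the quoted field stays intact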

The first thing I'd ask is: is it possible to get the data as a correct CSV in the first place (with fields quoted)? If so, that's definitely going to be your easiest solution.
However, if that's not possible, then you are left with trying to find a way to identify the places where additional commas are present and somehow avoid them when splitting into columns - which is not straightforward. (Seeing Jonathan's message, I wonder if readr's read_csv() is doing a better job, as it ends up with 35 columns - but I'm guessing it might actually just be taking the header row as the "max" number of fields and stopping splitting after it reaches 35 - it is definitely worth checking.)
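
One quick way to check how inconsistent the raw file actually is would be to count fields per line - a rough Python sketch (it naively counts every comma, which is the point here, since nothing in this file is quoted; "Products.csv" is the assumed local filename):

    # Tally comma-separated field counts per line to see how far lines
    # drift from the 35 columns promised by the header.
    from collections import Counter

    counts = Counter()
    with open("Products.csv", encoding="utf-8") as f:
        for line in f:
            if line.strip():                  # skip the blank lines
                counts[line.count(",") + 1] += 1

    for n_fields, n_lines in sorted(counts.items()):
        print(f"{n_lines} lines have {n_fields} fields")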

The one ray of light I see is that the fields containing commas generally seem to be HTML fragments, which offers a pattern to work with. I've been playing around and I think I may have found an approach that can isolate the fields with commas, which would mean these could be parsed separately - but I'm going to have to take a bit more time to look at it. When I have time (hopefully during today) I'll see if my approach works and post an update here.

Owen

George

Sep 24, 2021, 2:29:24 PM
to OpenRefine
Thank you all for the answers. The problem is that the machines have no previous knowledge of the data. Yes, I have tried all the tools mentioned in this thread, including the base R read.csv function (also with blank.lines.skip = TRUE), but with no success. A colleague has already sent me a cleaner version of the data, but it is dated. Owen's approach (seeking a pattern in the data with regular expressions) seems promising in this case. I also intend to try Python's CleverCSV library (GitHub: alan-turing-institute/CleverCSV).
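
For reference, a minimal sketch of what that CleverCSV attempt might look like - CleverCSV mirrors the standard csv module's interface but detects the dialect statistically rather than trusting the delimiter (the filename is assumed; check the project README for the current API):

    # Sketch only - sniff the dialect from a sample, then read with it.
    import clevercsv

    with open("Products.csv", newline="", encoding="utf-8") as f:
        dialect = clevercsv.Sniffer().sniff(f.read(100000))
        f.seek(0)
        rows = [row for row in clevercsv.reader(f, dialect) if row]

    print(len(rows), "non-empty rows;", len(rows[0]), "fields in the first row")

That said, even a smart dialect detector can't recover fields whose embedded commas were never quoted, so it may well hit the same wall as the other tools.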

Owen Stephens

Sep 24, 2021, 7:33:09 PM
to OpenRefine
I'd be very interested in knowing how CleverCSV dealt with it - let us know!

In the meantime - after a bit more work, it looks to me like the vast majority (but not all) of the problems are with the four "Description" columns. One approach you can take is to nibble at either side of these columns, using GREL's match() with greedy and lazy regular expressions to isolate the first or last field.
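
For anyone less familiar with greedy vs lazy matching, here is the same idea expressed with Python's re module (the GREL match() expressions in the steps below work the same way, except GREL indexes its capture groups from 0 where re uses group(1) and group(2)):

    # Greedy (.*) grabs as much as it can, so "(.*),(.*?)" anchors on the
    # LAST comma; lazy (.*?) grabs as little as it can, so "(.*?),(.*)"
    # anchors on the FIRST comma.
    import re

    row = "1042213,Master Cyl Kit - R - MSR213,Brake - Repair Kits,TOURMAX.jpg"

    last = re.fullmatch(r"(.*),(.*?)", row)
    print(last.group(2))   # "TOURMAX.jpg" - everything after the final comma
    print(last.group(1))   # everything before the final comma

    first = re.fullmatch(r"(.*?),(.*)", row)
    print(first.group(1))  # "1042213" - the first field
    print(first.group(2))  # everything after the first comma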

So a suggested approach:
  • Import Products.csv as a Line Based file, so all the text from each row goes into a single column to start with. Uncheck the "Store blank rows" box to get rid of the blank rows
  • Do "Add column based on this column" with the GREL
    • value.match(/(.*),(.*?)/)[1]
  • This will extract everything following the final comma in the data - which should be the final field - the BrandCode. So label the column "BrandCode" and add it
  • Use one or more facets to check the content of this column - does it all look good? (e.g. in this case - are they all short names, ending in ".jpg"?)
  • If you are happy, go to your data column (that you imported) and do a transformation on this column:
    • value.match(/(.*),(.*?)/)[0]
  • This removes that field data you've just extracted into a new column from your data column
  • If you can, repeat this approach until you've extracted / removed the columns: Keywords,Division,PhotoDesc,BrandCode
  • Once you've done this, start from the other end (this allows us to skip the Description fields for now). To start from the other end, use a similar approach but with the regular expression reversed:
  • Do "Add column based on this column" with the GREL
    • value.match(/(.*?),(.*)/)[0]
  • This will extract the first field - the SKU. So label the column "SKU" and add it
  • Use one or more facets to check the content of this column - does it all look good? (e.g. in this case - are they all short strings starting with a 7 digit number?)
  • If you are happy, go to your data column (that you imported) and do a transformation on this column:
    • value.match(/(.*?),(.*)/)[1]
  • Repeat this until you've extracted the first 27 columns (up to and including Matrix)
If you hit problems with rows at any point, I suggest you isolate them with Stars/Flags and then carry on with the rest.
You'll be left with just the concatenated description fields in the original column, and everything else pretty much where it should be (a scripted version of this nibbling is sketched below).

It's a bit tedious, but my experimentation suggests this will get you about 10.5k good lines, with somewhere around 1000 lines left needing more careful splitting.
This still leaves the Description fields a mess - but if you don't need to split these out individually you can keep them as a block, or, if you do need to split them, you can look for patterns in there - although they are pretty messy. At least your problems are then confined to 4 columns.
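
If repeating those match/transform steps 30-odd times in the UI gets too tedious, the same nibbling logic can be scripted outside OpenRefine - a rough Python sketch under the same assumptions (27 fields before the Description block, 4 after; the filename is assumed):

    # Peel leading fields off the left (lazy match) and trailing fields off
    # the right (greedy match), leaving the messy Description block whole.
    import re

    N_LEFT, N_RIGHT = 27, 4

    def peel(line, n_left, n_right):
        rest, left, right = line, [], []
        for _ in range(n_left):                      # first fields: lazy match
            m = re.fullmatch(r"(.*?),(.*)", rest, re.S)
            if m is None:
                return None                          # problem row - set it aside
            left.append(m.group(1)); rest = m.group(2)
        for _ in range(n_right):                     # last fields: greedy match
            m = re.fullmatch(r"(.*),(.*?)", rest, re.S)
            if m is None:
                return None
            right.insert(0, m.group(2)); rest = m.group(1)
        return left + [rest] + right                 # Description blob stays whole

    good, bad = [], []
    with open("Products.csv", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue                             # skip the blank lines
            row = peel(line.rstrip("\n"), N_LEFT, N_RIGHT)
            (good if row else bad).append(row if row else line)

The bad list plays the role of the starred/flagged rows above - those are the lines that need more careful hand-splitting.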

Hope this is helpful - my attempt is in this OpenRefine project https://drive.google.com/file/d/1C3t7zyOKDgQG0FADy2YRseHF6OJrzzYt/view?usp=sharing - the approach I took there was basically what I describe above, but with a few shortcuts I later regretted. If you look at the starred lines you can see these all have problems. There may be more with problems as well - I haven't done a full analysis, but I'm going to stop now :)

Best wishes

Owen
