Memory needed

42 views
Skip to first unread message

okn...@gmail.com

unread,
May 3, 2018, 4:15:27 PM5/3/18
to OpenRefine
Hi, I am trying to clean a csv file that has 300K rows. I have config 16G in my server allocated to Open Refine. I'm able to upload the file but it won't show any columns. How much memory it's needed for a file this size in rows? Thanks for the help.

Ettore Rizza

unread,
May 4, 2018, 6:38:14 AM5/4/18
to OpenRefine
Hi,

It depends on the number of columns, but 16Gb is certainly more than enough. I am working on a file of 3 million rows with 12Gb. The problem is probably elsewhere.

Martin Magdinier

unread,
May 4, 2018, 8:49:47 AM5/4/18
to openrefine
Otto,

I agree with Ettore, 16GB of RAM should be enough. 
What is the format of the import (csv, xls) ?  (I got issue loading poorly formatted csv previously)
What is the total size of your file? How many columns does it has?




--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

okn...@gmail.com

unread,
May 4, 2018, 9:03:20 AM5/4/18
to OpenRefine
Hi Martin and Ettore,
 I have upgraded to Open Refine version 2.8 and still, the issue remains. It is taking forever to load the columns. my file is cvs type and yes it is poorly formatted.-Drefine.memory=31744M Any idea?
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.

Martin Magdinier

unread,
May 4, 2018, 9:16:53 AM5/4/18
to openrefine
Otto, 

Some check you can do:
  • encoding of your csv 
  • delimiters are consistent through the file (comma tab or something else). 
  • cell enclosing with quote are consistent through the file

I will suggest doing that with the command line or a text editor that will not crash when you load the file. 
If you have access to the system that generated the data, I will see if you can get a dump in another format



Martin

To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.

Ettore RIZZA

unread,
May 4, 2018, 9:24:25 AM5/4/18
to openrefine
(Maybe stupid question) are you sure you have a 64-bits Java installed, as indicated in the documentation? 

The 32-bits version cannot go beyond 4GB RAM, even if you specify 36Gb in -Xmx.

Otto Knoke

unread,
May 4, 2018, 9:57:59 AM5/4/18
to openr...@googlegroups.com

Martin and Ettore, yes I have 64-bits Java. Which text editor you recommend for that large file? Yes, it’s comma separated.

Ettore Rizza

unread,
May 4, 2018, 10:30:31 AM5/4/18
to OpenRefine
If there is no privacy concerns, could you send me the zipped file at ettorerizza[at]gmail.com ? (note : I'm not Cambridge Analytica ^^ )

Thad Guidry

unread,
May 4, 2018, 10:59:05 AM5/4/18
to openr...@googlegroups.com
Use line based importer and then manually "Add new column based on" and split on comma char and as you do create each column make facets to spot outlying problem rows.

If not utf8 needed (potentially doubles memory needed) then use ascii as encoding. Although later this might be regrettable if you did really have utf8 chars. Just check for utf8 chars in file first with other tools or use our special facet on your line based rows before splitting and it shows counts of utf8 chars...then just import again with utf8 encoding if that is the case.

Lots of other tips but try those first.

Ettore we should make a wiki page for this info !

-Thad

okn...@gmail.com

unread,
May 4, 2018, 11:16:56 AM5/4/18
to OpenRefine
It took 5 hours to display the columns. Now it's impossible to work on Open Refine. Temporarily my sysadmin has allocated 32G of memory to Open Refine.

Thad Guidry

unread,
May 4, 2018, 11:23:23 AM5/4/18
to openr...@googlegroups.com
If you could privately share this file with us on Google drive (use my email address) that would be most welcome. We could see where potential issues are and perhaps make improvements in the future for what we find are the reasons causing you difficulty. 

Thanks,
-Thad

Ettore Rizza

unread,
May 4, 2018, 11:56:40 AM5/4/18
to OpenRefine
@thad : I was able to reproduce the issue. Indeed, it comes from commas inside text fields. Simplest solution: open the file with LibreOffice (which has a better CSV parser) and export it in xlsx. 

Ettore Rizza

unread,
May 4, 2018, 12:00:54 PM5/4/18
to OpenRefine
Mmh, looks like the parsing is not perfect even with LibreOffice.

Otto Knoke

unread,
May 4, 2018, 12:15:48 PM5/4/18
to openr...@googlegroups.com

Tks Ettore.

Ettore RIZZA

unread,
May 4, 2018, 12:21:56 PM5/4/18
to openrefine
Not sure the result is good enough. I will try something else this evening.

To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.


For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.

Otto Knoke

unread,
May 4, 2018, 2:20:20 PM5/4/18
to openr...@googlegroups.com

Thanks to Thad, Ettore and Martin,

 

I did as instructed by Ettore to open the CSV file and save it as xlms. That solved my problem of loading and seeing the columns in Open Refine. It’s a parsing issue from the source when the CSV file it was generated. Thanks a million for aiding me.

okn...@gmail.com

unread,
May 4, 2018, 5:56:31 PM5/4/18
to OpenRefine
I found the issue in my file. Plese see the below line
221802223","Apr/30/2018","96689917","96689917","CHEVROLET"," SPARK",2015,"BISAGR.INF.PUER.DL.I","LOWER HINGE","24,119.20","","","","","Not Awarded","Not Awarded","0.00","0.00","","0.00","","","Taller: Tres Erres SA Suc Tibas Telefono: 22429090 Direccion: 250m N y 75m E del cruce de Llorente de Tibas frente al edificio Alfredo Sasso, San Juan, Tibas, San","KL1CM6CD6FC796724"
 
Then Year does not have Quotes (") on both sides. Now everything is in one column. I have 262,404 rows with this issue. How can I fix it? Then I will do a split on every ","? You guys have been great!

Ettore Rizza

unread,
May 4, 2018, 7:27:46 PM5/4/18
to OpenRefine
Re Otto,

Your file is badly formatted, it will not be easy to parse as csv. Here is a regex that should do the trick for 99% of the rows.

I suggest using it to replace commas with something else, for example |||. This little Jython script should help you (use "Transform", then select "Pyhon / Jython"):

import re

regex
= re.compile(r'("(?:[^"]|"")*"|[^,"\n\r]*)(,|\r?\n|\r)')

return re.sub(regex, r"\1|||", value)

The result will be something like this :

1.
 "221802223"|||"Apr/30/2018"|||"KNPL0"|||"KNPL0"|||"CHEVROLET"|||" SPARK"|||2015|||"PORTAPLACA DEL."|||""|||""|||""|||""|||""|||""|||"Not Awarded"|||"Not Awarded"|||"0.00"|||"0.00"|||""|||"0.00"|||""|||""|||"Taller: Tres Erres SA Suc Tibas Telefono: 22429090 Direccion: 250m N y 75m E del cruce de Llorente de Tibas frente al edificio Alfredo Sasso, San Juan, Tibas, San"|||"KL1CM6CD6FC796724"

You can then use the separator (|||) to split your unique column. If the result is right for you, you can delete the leading and trailing quotation marks by going into "All", then "Transform", then using this GREL formula:

value.replace(/(^"|"$)/, '').trim()


(Don't be scared if the screen goes grey or black during the operation)

Good luck !

Otto Knoke

unread,
May 5, 2018, 7:27:35 AM5/5/18
to openr...@googlegroups.com

Hi Ettore,

 

Your solution worked like a charm. Now the file is 99.5% clean. You are the man!

Reply all
Reply to author
Forward
0 new messages