Re: Creating project extremely slow on 168mb csv file


china...@gmail.com

May 28, 2013, 1:13:44 PM
to openr...@googlegroups.com
I did another test: I moved the Refine folder and the data import folder to the 2TB SATA drive that is attached directly to this VM. It ran for 6 minutes at 95% CPU, then dropped down to 6%.

Why is Refine dropping CPU usage when it obviously still has work to do?

Thad Guidry

May 28, 2013, 1:29:09 PM
to openr...@googlegroups.com

Try removing all options for the CSV import; I bet there is some issue with your CSV. Also try the line-based importer and see if that works. We are here to help. Do not get discouraged.

On May 28, 2013 11:57 AM, <china...@gmail.com> wrote:

I am a bit dismayed with OpenRefine at the moment. I am hoping it is something I am doing, or not doing, that will speed things up.

First, the computer specs:
HP ProLiant DL160, 2x Xeon 2.4 GHz, 32 GB RAM, Citrix Enterprise.


Anyway, I was trying to import a CSV file that is 168 MB, with 1,767,334 rows and 7 columns. When I clicked Create Project it started just fine, then the estimate quickly went from 2 minutes to over 70 minutes. I edited the ini file and increased the heap to 4 GB: almost the same result. I changed it again, this time to 16 GB: same result. Finally I shut down ALL the VMs running on the server, assigned 16 CPUs, gave the VM 31 GB of RAM, and inside Refine gave the JVM 28 GB.

From 11:28 until 11:38 it used 2 CPUs at around 95%, full throttle, then dropped like a rock; at 12:43 it was still showing only around 1% to 2% of one CPU. Every minute that passed, the time remaining got longer and longer.

240 minutes remaining Heap usage: 26724/26724MB


I have the x64 JRE installed so I can use more than 4 GB of RAM.


The VM OS is Windows 7 Ultimate with nothing running except windows core processes, java, and Refine.


Here is my ini file


# Launch4j runtime config

# initial memory heap size
-Xms2048M

# max memory heap size
-Xmx28672M

# Use system defined HTTP proxies
-Djava.net.useSystemProxies=true

#-XX:+UseLargePages
#-Dsomevar="%SOMEVAR%"

***************************************************************

I have Refine running from the VM's C: drive, which has 9 GB free; this VM also has a 2TB drive attached directly as removable SATA.

The Refine default install folder is around 50 MB; when I checked it while it was running, it had grown to about 215 MB (50 + 168 ≈ 215).

Inside the refine command prompt window it simply says

[refine] POST /command/core/get-importing-job-status (1014ms)

That line simply repeats about every second.


Some questions:

Does Refine really not use more than one CPU?

Why would Refine go from 95% down to 1% when it is still processing the file?

Is there anything I can do to speed this up?

If I throw in two SSD drives, will Refine work any better?

Am I missing something here? Does a 168 MB CSV really take this long?

************
Citrix is telling me it is writing data to disk at 122.8kbps

Regards,
Bill 

--
You received this message because you are subscribed to the Google Groups "Open Refine" group.

Tom Morris

May 28, 2013, 1:46:09 PM
to openr...@googlegroups.com
25+ GB of heap to import a 168 MB file isn't reasonable, so something else is going on.  When the heap is reported as "26724/26724MB", that means it is completely used up (the next release will report this as a percentage; it would read 100% here, and is shown in red for values above 90%).

I would try changing the CSV quote option before import to see if that helps.  If you can make your file available, I'll take a look to see what's going on.

Tom

china...@gmail.com

May 28, 2013, 2:39:27 PM
to openr...@googlegroups.com
Tom, I sent you the file info via direct email.

I tried on another system: an i7 920 with 20 GB RAM running Server 2012. I am getting basically the same results, even running Refine from a 120 GB SSD with 16 GB of RAM assigned. So it must be my file; I am not sure why, though.

In any event, here's hoping Tom has the solution, since I have several thousand files to go!


Thad Guidry

May 28, 2013, 2:54:19 PM
to openr...@googlegroups.com

If you have several thousand files, you might want to look at ETL software, depending on your needs.  I use Pentaho Data Integration, but you might also look at CloverETL or Talend Open Studio.

OpenRefine does not aim to replace existing ETL tools.  We focus primarily on cleanup and analysis, with scripting support.

We also mention those tools and related software on the wiki under Related Software.

Tom Morris

May 28, 2013, 4:23:43 PM
to openr...@googlegroups.com
The good news is that the 1.7M rows fit easily in "only" 2.5 GB of memory.  The bad news is that there is a bug in Refine's escaped-quote processing for CSVs which prevents the file from loading in both Refine 2.5 and the current development version.  Fortunately, it is an easily fixable bug.
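As an illustration of the kind of escaped-quote mismatch described here: a quote character inside a quoted CSV field is usually written either doubled (per RFC 4180) or backslash-escaped, and a parser expecting one convention will mangle the other. The thread never shows the actual file, so the strings below are invented; this just demonstrates the two conventions with Python's stock csv module.

```python
import csv
import io

# Two common ways a double quote gets embedded inside a quoted CSV field.
rfc4180 = 'id,comment\n1,"she said ""hi"""\n'        # RFC 4180: doubled quote
backslashed = 'id,comment\n1,"she said \\"hi\\""\n'  # backslash escape

# Python's csv module handles the RFC 4180 form with its defaults ...
print(list(csv.reader(io.StringIO(rfc4180)))[1])
# -> ['1', 'she said "hi"']

# ... but the backslash form parses correctly only when the dialect
# is spelled out explicitly.
print(list(csv.reader(io.StringIO(backslashed),
                      doublequote=False, escapechar='\\'))[1])
# -> ['1', 'she said "hi"']
```

A parser that assumes the wrong convention typically closes the field at the first escaped quote and drifts out of sync from there, which is consistent with an importer churning without progress.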

While Refine can process 1.7M rows, that is well above its design center, so things will be pretty sluggish unless you have a pretty beefy machine.

Tom


china...@gmail.com

May 28, 2013, 6:15:18 PM
to openr...@googlegroups.com
I was able to work around this: I converted the CSV to TSV, and it imported in under 2 minutes without an issue. I did hit some problems once it was imported, though: for some reason two rows went from 7 to 8 columns. Quick fix.

Exporting to RDF went really smoothly too, even though it generated a 1 GB file. The real trouble came when I tried to import the RDF as a reconciliation service: apparently I had some non-Unicode characters that kept causing errors. Fixing those is a PITA when we're talking about a 1 GB RDF file!

Just a thought, but could Refine warn about, or better, simply show, the rows that have illegal characters in them when you go to export to RDF? I saw something about Unicode in the facets and will read up more on it. But I am wondering whether that could be an automatic check before exporting to RDF, etc.
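In the meantime, a pre-flight scan outside Refine could flag the offending rows before export. `find_suspect_lines` is an illustrative name, not a Refine feature, and it is only a rough check: "non-Unicode characters" here most likely means bytes that are invalid in the assumed encoding, which surface as U+FFFD replacement characters when decoded leniently.

```python
def find_suspect_lines(path, encoding='utf-8'):
    """Return (line_number, line) pairs that contain non-ASCII
    characters or bytes that failed to decode in the given encoding
    (those show up as U+FFFD replacement characters)."""
    suspects = []
    with open(path, encoding=encoding, errors='replace') as f:
        for lineno, line in enumerate(f, start=1):
            if '\ufffd' in line or any(ord(ch) > 127 for ch in line):
                suspects.append((lineno, line.rstrip('\n')))
    return suspects
```

Running this over the source CSV (rather than the 1 GB RDF output) would point at the rows to fix while the data is still small and columnar.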

Uhm, 1.7M rows is pushing it for Refine? Guess that means the 120M rows I want to do are out of the question then, huh? Oh well, where there's a will, there's always a way!

Thanks for the effort and assistance.


Thad, I will be replying to your comment(s) after I get some sleep!





Martin Magdinier

May 31, 2013, 6:58:46 PM
to openrefine
First, thanks for the feedback while you push Refine's capacity.

To automate your process on a 120M-row data set, you might want to try the Python or Ruby client library to drive Refine from code (see Known Client Libraries for Refine). But as Thad mentions, you might be better off with an ETL tool if you need to repeat this on a regular basis.
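Whichever client library does the per-file work, a several-thousand-file run wants a driver that collects failures instead of stopping at the first bad file. This is only a sketch of that shape; `run_batch` and `process_one` are illustrative names, and the actual Refine calls would live inside the callable you pass in.

```python
from pathlib import Path

def run_batch(folder, process_one):
    """Apply process_one to every .csv file under folder.

    Failures are collected as (filename, error message) pairs rather
    than raised, so one malformed file cannot sink the whole run.
    """
    failures = []
    for path in sorted(Path(folder).glob('*.csv')):
        try:
            process_one(path)
        except Exception as exc:
            failures.append((path.name, str(exc)))
    return failures
```

At the end you get a list of exactly which files need manual attention, which matters far more at 120M rows across thousands of files than any single import does.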

Regarding the charset validation before exporting to RDF: the best route would be to make that request to the extension developer (see the GitHub project). I assume you used that extension to generate your RDF.

Thanks,
Martin


