Help me understand performance problems please


will...@gmail.com

Jan 26, 2021, 10:53:14 PM
to OpenRefine

Hello,

I use OpenRefine a lot for projects up to 40,000 lines with many columns. Works great.

Today I'm trying a 90,000 line, 3 column project and it's painfully slow. I tried increasing memory, to no avail. I see elsewhere that this size of file should not pose a problem.

The 8.2 MB tsv file I'm trying to use is here: https://raw.githubusercontent.com/whanley/b-g/master/bslc-members/bslc-members.tsv. Any advice would be appreciated.

Running 3.4.1 on OSX.



Will

Owen Stephens

Jan 27, 2021, 5:52:45 AM
to OpenRefine
Hi Will,

That file imports for me reasonably quickly (without having to increase the memory from the default settings).

One issue I did see initially was multiple lines being merged into a single cell - I had to uncheck the option `Use character " to enclose cells containing column separators` to avoid this issue (because some of the data ended with a single inverted comma). You could try that and see if this improves the import speed for you.
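To illustrate the quoting issue described above, here is a minimal sketch in Python. It uses the standard `csv` module, not OpenRefine's own importer, and the sample data is made up rather than taken from the actual file. Python's parser only treats a quote as opening a quoted cell when it appears at the start of the cell, so the sample places the stray quote there; OpenRefine's importer may differ in the details.

```python
import csv
import io

# Hypothetical sample: the cell "Smith has an opening quote that is never closed.
data = 'id\tname\n1\t"Smith\n2\tJones\n'

# Quote handling enabled (the csv module's default): the unclosed quote makes
# the parser swallow the following line into the same cell.
merged = list(csv.reader(io.StringIO(data), delimiter="\t"))

# Quote handling disabled: every physical line becomes its own row,
# and the quote character is kept as ordinary text.
clean = list(csv.reader(io.StringIO(data), delimiter="\t", quoting=csv.QUOTE_NONE))

print(len(merged), merged)
print(len(clean), clean)
```

With quote handling on, the rows after the stray quote collapse into one large cell, which is the same kind of symptom as multiple lines being merged on import.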

I'm working on OS X, but I use the Linux version rather than the native Mac version. It's possible that makes a difference to the memory usage, although I've never checked. I'm happy to share how I do this if it helps, but basically it's just a matter of downloading the Linux version and running it just as you would on Linux.

Best wishes

Owen

will...@gmail.com

Jan 27, 2021, 1:52:17 PM
to OpenRefine
Thanks Owen. Import speed was fine (as you say, I unchecked the use character option). The problem is faceting speed.

A file of similar size and length works quickly and well on my system, but this file is very slow. It makes me think there's something in the text that is creating this problem.

I will look at using Linux, that's interesting. But I am puzzled by such different speeds on two similarly-sized files using my existing installation.

Will

will...@gmail.com

Jan 27, 2021, 2:06:34 PM
to OpenRefine

Following up with a comparison:

I have two projects, each about 5 MB in size and about 35,000-40,000 lines long.

This dataset works normally: https://raw.githubusercontent.com/whanley/egypt-data/main/exp-manifests-rough(1).tsv

This dataset works slowly: https://raw.githubusercontent.com/whanley/b-g/master/bslc-members/bslc-members-to-1900-tsv.tsv

I notice the slow speed when faceting. For example, faceting "Surname" by count is very slow.
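One way to compare the two files outside OpenRefine is to profile the faceted column. Below is a minimal sketch, assuming a local tab-separated copy of the data with a header row. The helper name `profile_column` and the inline sample are hypothetical, and the numbers it reports (distinct values, longest cell) are only candidate explanations to check, not a confirmed cause of the slowdown.

```python
import csv
import io
from collections import Counter

def profile_column(tsv_text, column):
    """Summarise one column of tab-separated text (hypothetical helper)."""
    # QUOTE_NONE mirrors unchecking the quote option on import.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    counts = Counter()
    longest = 0
    rows = 0
    for row in reader:
        value = row.get(column) or ""  # short rows yield None for missing cells
        counts[value] += 1
        longest = max(longest, len(value))
        rows += 1
    return {
        "rows": rows,
        "distinct": len(counts),       # number of choices a text facet would list
        "longest_cell": longest,       # very long cells can hint at merged lines
        "top": counts.most_common(3),
    }

# Made-up sample; for a real file, read its contents and pass them in instead.
sample = "Surname\tYear\nSmith\t1890\nJones\t1891\nSmith\t1892\n"
profile = profile_column(sample, "Surname")
print(profile)
```

Running this on both files and comparing `distinct` and `longest_cell` for "Surname" would show whether the slow file differs in the number of facet choices or contains unexpectedly large cells.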

Thanks for any insights,
Will

Tom Morris

Jan 27, 2021, 7:12:32 PM
to openr...@googlegroups.com
Please don't cross post the same question to multiple forums. If you feel you must, please at least mention that fact and link them together.


Tom



will...@gmail.com

Jan 27, 2021, 9:55:10 PM
to OpenRefine
You're right--makes sense that the two communities overlap. Sorry!