DataFrames' readtable very slow compared to R's read.csv when loading ~7.6M csv rows


verylu...@gmail.com

May 31, 2015, 12:50:03 PM
to julia...@googlegroups.com
Facebook's Kaggle competition has a dataset with ~7.6e6 rows and 9 columns (mostly strings): https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data

Loading the dataset in R using read.csv takes 5 minutes, and the resulting data frame takes 0.6GB (RStudio takes a total of 1.6GB of memory on my machine):

> t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
user   system elapsed 
332.295   4.154 343.332 
> object.size(a)
601496056 bytes #(0.6 GB)

Loading the same dataset using DataFrames' readtable takes about 30 minutes on the same machine (it varies a bit; the lowest was 25 minutes), and the resulting Julia process (REPL in Terminal) takes 6GB of memory.

(I added a couple of calls to the @time macro inside the readtable function to see what's taking the time; the output of those calls is also shown below.)

julia> @time DataFrames.readtable("bids.csv");
WARNING: Begin readnrows call
elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
WARNING: End readnrows call
WARNING: Begin builddf call
elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
WARNING: End builddf call
elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time) #total time for loading


Can you please suggest how I can improve load time and memory usage in DataFrames for datasets this big and bigger?

Thank you!

Tim Holy

May 31, 2015, 1:36:32 PM
to julia...@googlegroups.com
If you're using julia 0.3, you might want to try current master and/or
possibly the "ob/gctune" branch.

https://github.com/JuliaLang/julia/issues/10428

Best,
--Tim

Jiahao Chen

May 31, 2015, 9:25:14 PM
to julia...@googlegroups.com
Not ideal, but for now you can try turning off garbage collection while reading in the DataFrame.

gc_disable()
df = DataFrames.readtable("bids.csv")
gc_enable()
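
A slightly safer variant of the same idea (just a sketch, using the gc_disable/gc_enable calls from the suggestion above) wraps the read in try/finally so collection is re-enabled even if readtable throws:

gc_disable()                     # pause garbage collection during the read
df = try
    DataFrames.readtable("bids.csv")
finally
    gc_enable()                  # always re-enable, even if readtable errors
end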


Thanks,

Jiahao Chen
Research Scientist
MIT CSAIL

verylu...@gmail.com

May 31, 2015, 11:42:08 PM
to julia...@googlegroups.com
Thank you Tim and Jiahao for your responses. Sorry, I did not mention in my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) Commit 80dd75c* (1 day old release-0.3).

I tried other releases as Tim suggested:

On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 days old master),
the same command takes 14 minutes: half of what it took on release-0.3, but still about 3x what R's read.csv takes (5 min). More importantly, the Julia process takes up 8GB of memory (the R session takes 1.6GB).
The output of the command `@time DataFrames.readtable("bids.csv");` is
857.120 seconds      (352 M allocations: 16601 MB, 71.59% gc time) # gc time reduced from 85% to 71%

For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC) ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the macro is:
elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc time in 33 pauses with 31 full sweep)
This process also takes up 8GB of memory, more than on release-0.3, and my machine slows down significantly, so the increase in memory relative to release-0.3 seems significant.

On disabling gc, my machine (a 4GB laptop) goes soul searching, so it's not an option for now.

Is this the best one can expect for now? I read the discussion on issue #10428 but I did not understand it well :-(

Thank you!

Jacob Quinn

Jun 1, 2015, 1:06:50 PM
to julia...@googlegroups.com
I've been meaning to clean some things up and properly release the functionality, but I have a new way to read in CSV files that beats anything else out there that I know of. To get the functionality, you'll need to be running 0.4 master, then do

Pkg.add("SQLite")
Pkg.checkout("SQLite","jq/updates")

I then ran the following on the bids.csv file

using SQLite, CSV

db = SQLite.SQLiteDB()

ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")

@time lines = SQLite.create(db, ff,"temp2")

It took 18 seconds on my newish MBP. `fread` from the R data.table package is the other fastest CSV reader I know of, and it took 34 seconds on my machine. I'm actually pretty surprised by that, since in other tests I've done it was on par with SQLite+CSV or sometimes slightly faster.

Now, you're not necessarily getting a Julia structure in this case; it's loading the data into an SQLite table, on which you can then run SQLite.query(db, sql_string) to do manipulations and such.
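
For example (a hypothetical query; "temp2" is the table name created above, and bidder_id is one of the columns in bids.csv):

res = SQLite.query(db, "SELECT bidder_id, COUNT(*) FROM temp2 GROUP BY bidder_id")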

-Jacob

Johan Sigfrids

Jun 1, 2015, 1:12:27 PM
to julia...@googlegroups.com, quinn....@gmail.com
Hmm. Makes you wonder what it would be like to use SQLite as a backend for DataFrames.

verylu...@gmail.com

Jun 1, 2015, 1:40:04 PM
to julia...@googlegroups.com
Great, thank you Jacob, I will try it out! 

Do you have a writeup on the differences between the way you read CSV files and the way it is currently done in Julia? I would love to know more!

Perhaps obvious, but for completeness: reading the data using readcsv or readdlm does not improve the metrics I reported much, suggesting that the overhead from DataFrames itself is small.

Thank you again!

Jacob Quinn

Jun 1, 2015, 1:48:16 PM
to julia...@googlegroups.com
The biggest single advantage SQLite has is the ability to mmap a file and just tell SQLite which pointer addresses start strings and how long they are, all without copying. The huge bottleneck in most implementations is not identifying where a string starts and how long it is, but allocating "in-program" memory and copying the string into it. With SQLite, we can use an in-memory database, mmap the file, and tell SQLite where each string for a column lives by giving it the starting pointer address and the length. I've been looking into how to solve this problem over the last month or so (apart from Oscar's gc wizardry), and it just occurred to me last week that using SQLite may be the best way; so far, the results are promising!
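
In Julia terms, the idea looks roughly like this (a sketch only, using the 0.4-era Mmap API; the field offsets here are made up for illustration):

buf = Mmap.mmap(open("bids.csv"))      # map the whole file as a Vector{UInt8}; no copy
# A "string" is then just (pointer, length) into the mapped bytes:
start, len = 42, 7                     # hypothetical offsets for one field
ptr = pointer(buf, start)              # pointer into the mmapped region
# sqlite3_bind_text can be handed (ptr, len) with the no-copy flag
# (SQLITE_STATIC), so the bytes never get copied into Julia-owned memory.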

-Jacob

verylucky Man

Jun 1, 2015, 3:01:59 PM
to julia...@googlegroups.com
Thank you Jacob for the detailed explanation!
Why can't one do something similar with Julia structures (instead of SQLite)? Sorry for asking what may be very basic questions.

Thank you!

Scott Jones

Jun 1, 2015, 3:04:06 PM
to julia...@googlegroups.com
I agree, I would think this could be done nicely in pure Julia, just using those techniques that make SQLite fast...

David Gold

Jun 1, 2015, 6:25:37 PM
to julia...@googlegroups.com, quinn....@gmail.com
@Jacob I'm just developing a working understanding of these issues. Would you please help me to get a better handle on your solution?

My understanding thus far: reading a (local) .csv file into a DataFrame using DataFrames.readtable involves reading the file into an IOStream and then parsing that stream into a form amenable to DataFrames.builddf, which builds the DataFrame object returned by readtable. Getting the contents of the .csv file into memory in a form that Julia functions can manipulate is work-intensive in this approach. With SQLite, however, the entire file can just be thrown into memory wholesale, along with some metadata (maybe not the right term?) that delineates the tabular properties of the data.

What I am curious about, then (if this understanding is not too misguided), is how SQLite returns, say, a column of data that doesn't include a bunch of delimiters. That is, what sort of parsing *does* SQLite do, and when?

Jacob Quinn

Jun 6, 2015, 1:17:34 AM
to julia...@googlegroups.com
@David,

Sorry for the slow response. It's been a busy week :)

Here's a quick rundown of the approach:

- In the still-yet-to-be-officially-published https://github.com/quinnj/CSV.jl package, the bulk of the code goes into creating a `CSV.File` type in which the structure/metadata of the file is parsed/detected/saved (e.g. header, delimiter, newline, # of columns, detected column types, etc.)
- `SQLite.create` and now `CSV.read` both take a `CSV.File` as input and follow a similar process in parsing:
  - The actual file contents are mmapped; i.e. the entire file is loaded into memory at once
  - There are currently three `readfield` methods (Int, Float64, String) that take an open `CSV.Stream` type (which holds the mmapped data and the current "position" of parsing) and read a single field according to what the type of that column is supposed to be
      - for example, readfield(io::CSV.Stream, ::Type{Float64}, row, col) will start reading at the current position of the `CSV.Stream` until it hits the next delimiter, newline, or end of the file, and then interpret the contents as a Float64, returning `val, isnull` (a rough sketch follows below)
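
To make that concrete, here's a minimal sketch of such a field reader (hypothetical code, not the actual CSV.jl implementation; it works on a plain byte buffer rather than a CSV.Stream, using 0.4-era APIs):

# Hypothetical sketch of a typed field reader over an mmapped byte buffer.
# `buf` holds the file bytes; `pos` is the current parse position.
function readfield(buf::Vector{UInt8}, pos::Int, ::Type{Float64}, delim::UInt8)
    start = pos
    while pos <= length(buf) && buf[pos] != delim && buf[pos] != UInt8('\n')
        pos += 1
    end
    str = bytestring(buf[start:pos-1])        # copy only for the parse itself
    isnull = isempty(str)
    val = isnull ? 0.0 : parse(Float64, str)
    return val, isnull, pos + 1               # value, null flag, next position
end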

That's pretty much it. One of the most critical performance keys for both SQLite and CSV.read is not copying strings once the file has been mmapped. For SQLite, the sqlite3_bind_text library method has a flag to indicate whether the text should be copied or not, so we're able to pass the pointer to the position in the mmapped array directly. For the CSV.read method, which returns a Vector of the columns (as typed arrays), I've rolled a quick-and-dirty CString type that looks like

immutable CString
    ptr::Ptr{UInt8}  # pointer into the mmapped file contents
    len::Int         # length of the field in bytes
end

With a few extra method definitions, this type looks very close to a real string type, but we can construct it by pointing directly into the mmapped region (which currently isn't possible for native Julia string types). See https://github.com/quinnj/Strings.jl for more brainstorming around this alternative string implementation. You can convert a CString to a Julia string by calling string(x::CString), or map(string, column) for an Array of CSV.CStrings.
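
For illustration, those extra definitions might look something like this (hypothetical; bytestring is the 0.4-era conversion from a pointer plus a length):

Base.string(s::CString) = bytestring(s.ptr, s.len)    # copies into a real String
Base.show(io::IO, s::CString) = print(io, string(s))
Base.sizeof(s::CString) = s.len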

As an update on the performance on the Facebook Kaggle competition bids.csv file:

- readcsv: 45 seconds, 33% gc time
- CSV.read: 19 seconds, 3% gc time
- SQLite.create: 25 seconds, 3.25% gc time

Anyway, hopefully I'll get around to cleaning up CSV.jl to be released officially, but it's that last 10-20% that's always the hardest to finish up :)

-Jacob


David Gold

Jun 6, 2015, 8:41:36 AM
to julia...@googlegroups.com, quinn....@gmail.com
@Jacob,

Thank you very much for your explanation! I expect having such a blueprint will make delving into the actual code more tractable for me. I'll be curious to see how your solution here and your proposal for string handling end up playing with the current Julia data ecosystem.

bernhard

Oct 7, 2015, 2:33:41 AM
to julia-users, quinn....@gmail.com
Is there any update on this? Or maybe a timeline/roadmap?
I would love to see a faster CSV reader. 

I tried to take a look at Jacob's CSV.jl, but I seem to be missing https://github.com/lindahua/DataStreams.jl
I have no idea where to find the DataStreams package.
Does it still exist?

Is there any (experimental) way to make CSV.jl work?

Jacob Quinn

Oct 7, 2015, 9:07:44 AM
to julia...@googlegroups.com
Haha, nice timing. I just pushed a big CSV.jl overhaul for 0.4 yesterday afternoon. I also just pushed the DataStreams.jl package, so you can find it at https://github.com/quinnj/DataStreams.jl; you'll have to Pkg.clone it. Everything should work at that point.
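
That is, something like:

Pkg.clone("https://github.com/quinnj/DataStreams.jl")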

I'm still cleaning up some other related packages, so things aren't documented/registered/tagged quite yet; the interface may evolve slightly, though probably more in the low-level machinery. `stream!(::CSV.Source, ::DataStream)` should stay the same.

I've already got a bit of a writeup started for once everything's done, so if you'd rather wait another couple of days or a week, I should have something ready by then.

-Jacob

Jonathan Malmaud

Oct 7, 2015, 9:08:10 AM
to julia-users, quinn....@gmail.com
Independent of Jacob's (excellent) work, I've begun wrapping the SFrames library (https://github.com/dato-code/SFrame), which is used internally by Graphlab Create (https://dato.com/products/create/). One of its features is a fast and robust CSV reader. I think I'll have something to preview in the next week or so.

David Gold

Oct 7, 2015, 10:36:52 AM
to julia-users, quinn....@gmail.com
Yaas. Very excited to see this.

bernhard

Oct 8, 2015, 1:54:21 AM
to julia-users, quinn....@gmail.com
Thank you Quinn

Things do not work (for me) though.

Is it possible you are missing a comma after "col" in lines 24 and 33 of Sink.jl?
function writefield(io::Sink, val::AbstractString, col N)

Jacob Quinn

Oct 8, 2015, 8:05:24 AM
to julia...@googlegroups.com
Pushed some fixes. Thanks for trying it out.

-Jacob

bernhard

Oct 8, 2015, 8:54:57 AM
to julia-users, quinn....@gmail.com
Thank you, it is working now.
And it is blazing fast: a factor of 8 on a 1GB file compared to readcsv or readtable, and memory allocation is at 17MB. I love it.

Now I only need to modify the data. Is there any fast way to get an Array or a DataFrame of the imported table? (Or would this defeat the purpose?)
Eventually I don't need a df but a custom data structure I created; I will do the conversion when I find time to do so.

Incidentally, I hit an InexactError in mmap (line 126) with a 3GB file. I don't have time to find the line which caused it right now though.

Bernhard

Jacob Quinn

Oct 8, 2015, 9:15:54 AM
to julia...@googlegroups.com
So you could convert to a DataFrame without copying by doing:

DataFrame(convert(Vector{Any}, ds.data), Symbol[symbol(x) for x in ds.schema.header])

where `ds` is the `DataStream` that you read your CSV file into. The result ends up looking a little awkward, mainly because I believe DataFrames is calling `show` on individual elements instead of tapping into the NullableArrays `show` code, but it's not a huge deal.

The other way is to work directly with `ds.data`, which is a `Vector{NullableVector{T}}` by default. I'll eventually provide better functionality for specifying your own `DataStream` type to be whatever you want (even a DataFrame if you want), but for now, we have fairly sensible defaults.
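
For example (hypothetical column access; indexing a NullableVector yields a Nullable, which get() unwraps with a default):

col = ds.data[1]       # first column, a NullableVector
x = get(col[1], 0)     # unwrap the first value, defaulting to 0 when null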

-Jacob

p.s. on the mmap issue, are you running on a 32-bit machine? I believe it's known that 32-bit can have trouble with large mmaps, but I'm not sure what the collective wisdom is on dealing with that. If you're on 64-bit and you could share your system details + a sample file, that'd be really helpful in tracking down what's going on.