@David,
Sorry for the slow response. It's been a busy week :)
Here's a quick rundown of the approach:
- In the still-yet-to-be-officially-published https://github.com/quinnj/CSV.jl package, the bulk of the code goes into creating a `CSV.File` type, which parses, detects, and stores the structure/metadata of the file (e.g. header, delimiter, newline, # of columns, detected column types)
- `SQLite.create` and now `CSV.read` both take a `CSV.File` as input and follow a similar process in parsing:
  - The actual file contents are mmapped; i.e. the file is mapped into memory and accessed as one big byte array, with the OS paging the data in as it's touched
  - There are currently three `readfield` methods (Int, Float64, String) that take an open `CSV.Stream` type (which holds the mmapped data and the current "position" of parsing) and read a single field according to what the type of that column is supposed to be
    - For example, `readfield(io::CSV.Stream, ::Type{Float64}, row, col)` will start reading at the current position of the `CSV.Stream` until it hits the next delimiter, newline, or end of the file, and then interpret the contents as a Float64, returning `val, isnull`
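To make the `readfield` idea concrete, here is a minimal sketch; it is not the actual CSV.jl code. `FieldStream` and `nextfield!` are hypothetical names standing in for `CSV.Stream` and its internals, the `row`/`col` arguments are left out, the delimiter is assumed to be a comma, and quoting is ignored. It is written in current Julia syntax (`struct` rather than `immutable`).

```julia
# Hypothetical stand-in for CSV.Stream: the file bytes plus the current parse position.
# For a real file, `data` could come from `Mmap.mmap(path)` (stdlib Mmap) instead.
mutable struct FieldStream
    data::Vector{UInt8}
    pos::Int              # index of the next unread byte
end

const DELIM   = UInt8(',')    # assumed delimiter
const NEWLINE = UInt8('\n')

# Scan from the current position to the next delimiter, newline, or end of data.
# Returns the byte range of the field and moves `pos` past the terminator.
function nextfield!(io::FieldStream)
    start = io.pos
    stop = start
    while stop <= length(io.data) && io.data[stop] != DELIM && io.data[stop] != NEWLINE
        stop += 1
    end
    io.pos = stop + 1
    return start:(stop - 1)
end

# Each method interprets the field bytes as the requested type and returns (val, isnull).
function readfield(io::FieldStream, ::Type{Int})
    r = nextfield!(io)
    isempty(r) && return (0, true)
    val = tryparse(Int, String(io.data[r]))
    return val === nothing ? (0, true) : (val, false)
end

function readfield(io::FieldStream, ::Type{Float64})
    r = nextfield!(io)
    isempty(r) && return (NaN, true)
    val = tryparse(Float64, String(io.data[r]))
    return val === nothing ? (NaN, true) : (val, false)
end

function readfield(io::FieldStream, ::Type{String})
    r = nextfield!(io)
    return (String(io.data[r]), isempty(r))   # this sketch copies; see the CString discussion
end

io = FieldStream(Vector{UInt8}("1,2.5,abc\n7,,xyz"), 1)
readfield(io, Int)      # (1, false)
readfield(io, Float64)  # (2.5, false)
readfield(io, String)   # ("abc", false)
```

Dispatching on the column type keeps the per-field loop free of type checks: the caller already knows each column's detected type and picks the method once per column.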
That's pretty much it. One of the most critical performance keys for both `SQLite.create` and `CSV.read` is avoiding string copies once the file has been mmapped. For SQLite, the `sqlite3_bind_text` library function actually takes an argument (`SQLITE_STATIC` vs. `SQLITE_TRANSIENT`) indicating whether the text should be copied or not, so we're able to pass a pointer to the position in the mmapped array directly. For `CSV.read`, which returns a Vector of the columns (as typed arrays), I've actually rolled a quick and dirty CString type that looks like
    immutable CString
        ptr::Ptr{UInt8}
        len::Int
    end
With a few extra method definitions, this type looks very close to a real string type, but we can construct it by pointing directly into the mmapped region (which currently isn't possible for native Julia string types). See https://github.com/quinnj/Strings.jl for more brainstorming around this alternative string implementation. You can convert a CString to a Julia string by calling `string(x::CString)`, or `map(string, column)` for an Array of `CSV.CString`s.
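As a rough illustration of what "a few extra method definitions" could mean, here is a minimal sketch in current Julia syntax (`struct` rather than `immutable`). The particular methods chosen (`string`, `sizeof`, `==`) and the `demo_fields` helper are assumptions made for illustration, not the definitions in CSV.jl, and a plain byte vector stands in for the mmapped file.

```julia
struct CString
    ptr::Ptr{UInt8}   # start of the field inside the underlying buffer
    len::Int          # number of bytes in the field
end

# Materialize a native Julia String; this is the only place bytes get copied.
Base.string(x::CString) = unsafe_string(x.ptr, x.len)

Base.sizeof(x::CString) = x.len

# Byte-wise comparison without creating any Julia strings.
function Base.:(==)(a::CString, b::CString)
    a.len == b.len || return false
    for i in 1:a.len
        unsafe_load(a.ptr, i) == unsafe_load(b.ptr, i) || return false
    end
    return true
end

# Hypothetical demo: build two views into a buffer and convert them.
function demo_fields()
    buf = Vector{UInt8}("alice,bob\n")       # stand-in for the mmapped file
    # The raw pointers are only valid while `buf` is alive, so keep it rooted.
    GC.@preserve buf begin
        a = CString(pointer(buf, 1), 5)      # bytes 1-5: "alice"
        b = CString(pointer(buf, 7), 3)      # bytes 7-9: "bob"
        (map(string, [a, b]), a == b, sizeof(b))
    end
end

demo_fields()   # (["alice", "bob"], false, 3)
```

The trade-off is that a CString is only valid for as long as the buffer it points into stays alive and unchanged, so anything that has to outlive the mmapped file should be converted with `string` first.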
As an update on the performance on the Facebook Kaggle competition bids.csv file:
- `readcsv`: 45 seconds, 33% gc time
- `CSV.read`: 19 seconds, 3% gc time
- `SQLite.create`: 25 seconds, 3.25% gc time
Anyway, hopefully I'll get around to cleaning up CSV.jl to be released officially, but it's that last 10-20% that's always the hardest to finish up :)
-Jacob