Reading space-separated data

Simon Kornblith

Sep 11, 2013, 6:21:17 PM
to julia...@googlegroups.com
I'm trying to read a bunch of files that look like https://gist.github.com/simonster/6530489/raw/0ca55eb5a049d2538de6350eb3d7ec66f765193d/gistfile1.txt. This format seems like it should be as simple as it gets, but I can't get readdlm to do what I want. The number of spaces between values differs from line to line, which means that if I set the delimiter to ' ', readdlm throws a BoundsError() (with a truncated backtrace due to #3469, groan) presumably because it interprets every space as a column delimiter. This seems like it's probably a common text format, so before I write my own function, I figured I'd ask if someone has already written code to read it, or if there's a way to get readdlm to do what I want that I'm overlooking.

Simon

John Myles White

Sep 11, 2013, 6:25:52 PM
to julia...@googlegroups.com
I know that DataFrames readtable() hasn't been set up to do this yet. It shouldn't be too hard: it's just ugly. Adding a switch that keeps looping over multiple whitespace characters should be easy for both readtable() and readdlm().

 -- John

Adrian Cuthbertson

Sep 11, 2013, 10:20:40 PM
to julia...@googlegroups.com
Here's a low-level solution:

f = open("fpath_to_data")
line = readline(f)
while length(line) > 0
   rec=map(float64,split(line))
   # ... use rec
   line = readline(f)
end
close(f)

-- Regards, Adrian.
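
For readers on current Julia, here is a rough sketch of the same line-by-line approach. It assumes Julia 1.x, where strings are converted with parse(Float64, s) rather than float64. The name read_rows is hypothetical and chosen only for illustration. Unlike the loop above, it skips blank lines instead of stopping at the first one.

```julia
# Hypothetical helper: read whitespace-separated numbers, one row per line.
function read_rows(path::AbstractString)
    rows = Vector{Vector{Float64}}()
    open(path) do io
        for line in eachline(io)
            # split with no delimiter argument breaks on runs of whitespace
            fields = split(line)
            # skip blank lines rather than treating them as end of data
            isempty(fields) && continue
            push!(rows, [parse(Float64, s) for s in fields])
        end
    end
    return rows
end
```

Each element of the result is one record, so rows[1] plays the role of rec for the first line.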

RecentConvert

Dec 11, 2013, 4:16:29 AM
to julia...@googlegroups.com
Whenever I try this on single-space-delimited data, it gives me a BoundsError().

04/28/2013 13:47:04 49624.230 3450001624.245 12:47:04 SPEC:N2O,N2O,CO,CO,H2O
3450001623.890860 3.24297e2 3.25e2 1.79025e2 2.21956e3 6.90787e6
3450001623.990910 3.24163e2 3.25e2 1.79056e2 2.22118e3 6.92437e6
3450001624.090950 3.24244e2 3.25e2 1.78798e2 2.24119e3 6.94525e6
3450001624.191000 3.24314e2 3.25e2 1.78959e2 2.23028e3 6.978e6
3450001624.291050 3.24259e2 3.25e2 1.78645e2 2.22066e3 7.05206e6
3450001624.391100 3.243e2 3.25e2 1.78802e2 2.22303e3 7.01239e6
3450001624.491150 3.24294e2 3.25e2 1.78935e2 2.2222e3 7.02175e6
3450001624.591190 3.24172e2 3.25e2 1.7854e2 2.22271e3 7.0164e6
3450001624.691240 3.24257e2 3.25e2 1.79067e2 2.22521e3 6.99538e6
3450001624.791290 3.24277e2 3.25e2 1.78358e2 2.2416e3 7.06199e6
3450001624.891340 3.24382e2 3.25e2 1.7869e2 2.22428e3 7.05238e6
3450001624.991390 3.24035e2 3.25e2 1.78968e2 2.23876e3 6.97643e6
3450001625.091430 3.24257e2 3.25e2 1.79076e2 2.22578e3 7.03076e6
3450001625.191480 3.2415e2 3.25e2 1.78592e2 2.23124e3 6.95091e6
3450001625.291530 3.24096e2 3.25e2 1.78835e2 2.21803e3 6.95247e6
3450001625.391580 3.24197e2 3.25e2 1.78331e2 2.23744e3 6.93555e6
3450001625.491630 3.24094e2 3.25e2 1.78812e2 2.22798e3 6.93854e6
3450001625.591670 3.24315e2 3.25e2 1.78494e2 2.22471e3 6.99256e6
3450001625.691720 3.24108e2 3.25e2 1.78534e2 2.23321e3 6.93832e6
3450001625.791770 3.24202e2 3.25e2 1.78795e2 2.21372e3 6.93131e6

fid = open("file.dat","r")
(D,H) = readdlm(fid,' ',has_header=true)

ERROR: BoundsError()
  in getindex at ascii.jl:11
  in dlm_fill at datafmt.jl:116


Am I doing something wrong?

David van Leeuwen

Dec 11, 2013, 9:11:37 AM
to julia...@googlegroups.com
Hi, 


On Thursday, September 12, 2013 12:25:52 AM UTC+2, John Myles White wrote:
I know that DataFrames readtable() hasn't been set up to do this yet. It shouldn't be too hard: it's just ugly. Adding a switch that keeps looping over multiple whitespace characters should be easy for both readtable() and readdlm().

I noticed that readtable() also treats each space as its own delimiter, so I have to use a perl one-liner to deal with this type of input. It would be great if readtable() would interpret separator ' ' as "\s+" in perl re notation, so that it is more similar to R's read.table().

Cheers, 

---david

John Myles White

Dec 11, 2013, 10:54:18 AM
to julia...@googlegroups.com
We’d be happy to have a patch for this. I don’t see any way for us to efficiently support actual regexes, so I’d prefer that we just provide a mechanism for allowing multiple whitespaces to be treated as one delimiter.

With all that said, I think you would do a great service to humanity by scrubbing any file with that formatting and using something like tabs instead.

— John
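
For anyone finding this thread later: the behaviour discussed here (runs of whitespace treated as one delimiter) can be sketched as follows. This assumes Julia 1.x with the DelimitedFiles standard library, where readdlm called without a delimiter argument separates columns on one or more whitespace characters, and the header keyword is spelled header rather than has_header. The sample text is made up for illustration and stands in for a file on disk.

```julia
using DelimitedFiles  # standard library in Julia 1.x

# Made-up sample with irregular spacing between columns.
text = "x y z\n1.0   2.0 3.0\n4.0 5.0      6.0\n"

# No delimiter argument: columns are separated by one or more whitespace characters.
# header=true returns the first row separately from the data.
data, header = readdlm(IOBuffer(text), header=true)
```

Passing ' ' explicitly as the delimiter asks for a split on every single space, which is the situation described at the start of the thread.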

David van Leeuwen

Dec 12, 2013, 2:58:13 AM
to julia...@googlegroups.com
Hello John,


On Wednesday, December 11, 2013 4:54:18 PM UTC+1, John Myles White wrote:
We’d be happy to have a patch for this. I don’t see any way for us to efficiently support actual regexes, so I’d prefer that we just provide a mechanism for allowing multiple whitespaces to be treated as one delimiter.

I just submitted a pull request that does this. I added some lines to test/io.jl, but I don't know how to run the tests.
 
With all that said, I think you would do a great service to humanity by scrubbing any file with that formatting and using something like tabs instead.

I don't agree here. One would provide a great service to computer file exchange in general if everything were consistently converted to CSV or, while we're at it, XML or something fancier.

But in reality there are a lot of cases where we have somewhat pretty-printed text in table form, which for the larger part of humanity is easier to interpret than, e.g., CSV or XML. It is simply easier if such cases can be read straight into readtable() without either filtering or, as you suggest, altering the source table. The source table might be read-only, for whatever reason.

There is also the case for compatibility with R, and standard interpretation of unix tools like awk, perl split " ", etc. 
 
Cheers, 

---david