Using ReadList to read a string

Donald DuBois

Nov 30, 2007, 6:00:22 AM
Hello,

I am trying to get ReadList to read a string in a text file (filename.txt).

I would like NOT to have to use Import because it is MUCH slower at reading
a text file than ReadList is. For example:

(1) a file with 50,000 records can be created
(2) Exported to disk
(3) read by ReadList[...] and
(4) read by Import[...]


dataFile1 = Table[{2001, "nameA", "symbolA", 15.5}, {50000}];
Export["out1.txt", dataFile1, "Table"];

AbsoluteTiming[
out1ReadList = ReadList["out1.txt", {Number, Word, Word, Number}];]

AbsoluteTiming[out1Import = Import["out1.txt", "Table"];]

{0.1718750, Null}

{2.4375000, Null}

Import takes 14 times longer to read in the same file as compared to ReadList.
So, naturally, I would like to use ReadList whenever I have a .txt file to be read in from disk.

However, the file to be read is slightly more complicated than the one above (out1.txt).
There is a string that is added to the file as the second element of a record.
The first few records of the file (EWZ2.TXT below) look like the following with each record
consisting of eight elements: a number, string, word followed by five integer numbers for each record.
Each record is on a separate line.

EWZ2.TXT:

20000714 "iShares MSCI Brazil Index" EWZ 250 1627 1637 1627 1637
20000717 "iShares MSCI Brazil Index" EWZ 100 1730 1735 1730 1735
20000718 "iShares MSCI Brazil Index" EWZ 100 1730 1730 1730 1730
20000719 "iShares MSCI Brazil Index" EWZ 100 1686 1686 1686 1686
20000720 "iShares MSCI Brazil Index" EWZ 50 1724 1724 1724 1724

The format of the above file is: {Number, String, Word, Number, Number, Number, Number, Number}

If this file on disk is named "EWZ2.TXT" I am not able to use ReadList to read it.
I use two format specifications within ReadList
and neither of them works:

{Number, String, Word, Number, Number, Number, Number, Number}
and {Number, Word, Word, Number, Number, Number, Number, Number}.

ReadList["EWZ2.TXT", {Number, String, Word, Number, Number, Number,
Number, Number}]
ReadList["EWZ2.TXT", {Number, Word, Word, Number, Number, Number,
Number, Number}]

{{20000714,
" \"iShares MSCI Brazil Index\" EWZ 250 1627 \
1637 1627 1637", "20000717", $Failed, EndOfFile,
EndOfFile, EndOfFile, EndOfFile}}

{{20000714, "\"iShares", "MSCI", $Failed, EndOfFile, EndOfFile,
EndOfFile, EndOfFile}}


Using "String" for the format of the second element seems to have more success than "Word" but, when
read, none of the elements is separated by a comma as happened when using ReadList to read
out1.txt above.

"iShares MSCI Brazil Index" should be the second element of a sublist within
the entire list (Table) and EWZ (with or without quotes) should be the third element
within a sublist.

The definition of a String in the function description for ReadList is
"string terminated by a newline", which does not describe the above file
(EWZ2.TXT). If the string is moved in the file so that it is the last item
in any record, such as

20000714 EWZ 250 1627 1637 1627 1637 "iShares MSCI Brazil Index"

then a format of {Number, Word, Number, Number, Number, Number, Number, String}
in ReadList DOES work to read the file correctly.

But is there any way to get ReadList to read the above file (EWZ2.TXT) with the string
as the second item of a record so that the speed advantage of ReadList over Import
can be retained?

Or is there some other function I should be using other than Import and/or ReadList?

Thank you in advance for any help you can give me.
Don

Thomas Dowling

Dec 1, 2007, 5:40:19 AM
Hello,

I am interested in your problem, but unfortunately I do not have a solution,
only a similar experience. I would be very interested to know, however, what
I am doing wrong. The following are my own observations:

1. If the data is tab-delimited (that is, a tab between each element and a
paragraph mark indicating end of record), there is not a problem.

For example,

list22 = ReadList[
"/EWZ22.txt", {Number, Word, Word, Number, Number, Number, Number,
Number}]

gives the following output:

{{20000714, "\"iSharesMSCIBrazilIndex\"", "EWZ", 250, 1627, 1637,
1627, 1637}, {20000717, "\"iSharesMSCIBrazilIndex\"", "EWZ", 100,
1730, 1735, 1730, 1735}, {20000718, "\"iSharesMSCIBrazilIndex\"",
"EWZ", 100, 1730, 1730, 1730, 1730}, {20000719,
"\"iSharesMSCIBrazilIndex\"", "EWZ", 100, 1686, 1686, 1686,
1686}, {20000720, "\"iSharesMSCIBrazilIndex\"", "EWZ", 50, 1724,
1724, 1724, 1724}}

where 'EWZ22.txt' is tab-delimited (and saved as a text file)

and

Map[Head, list22, {2}]

gives the following:


{{Integer, String, String, Integer, Integer, Integer, Integer,
Integer}, {Integer, String, String, Integer, Integer, Integer,
Integer, Integer}, {Integer, String, String, Integer, Integer,
Integer, Integer, Integer}, {Integer, String, String, Integer,
Integer, Integer, Integer, Integer}, {Integer, String, String,
Integer, Integer, Integer, Integer, Integer}}


All is well.


However, with comma-delimited text (EWZ2.txt) and the following command,

ReadList["/EWZ2.txt", {Number, Word, Word, Number, Number, Number,
Number, Number}]

I get the following output:

Read::readn: Invalid real number found when reading from /EWZ2.txt. >>


{{20000714, ",\"iSharesMSCIBrazilIndex\",EWZ,250,1627,1637,1627,1637",
"20000717,\"iSharesMSCIBrazilIndex\",EWZ,100,1730,1735,1730,1735",
20000718, $Failed, EndOfFile, EndOfFile, EndOfFile}}


You can, of course, read the file as a string,

list3 = ReadList["/EWZ2.txt", String ]

but this is not what is desired:

Map[Head, list3]

{String, String, String, String, String}

The reason I am interested is that the same problem seems to occur with, say,
{x, time} data from a recording device where x and time are Numbers.

Reading a tab-delimited text file (datatab.txt) with the following command

ReadList["/datatab.txt", {Number, Number}]

gives the following output

{{1.24, 0.00161925}, {1.25, 0.00162431}, {1.26, 0.00161994}, {1.27,
0.00161719}, {1.28, 0.00161219}, {1.29, 0.00160894}, {1.3,
0.00161663}, {1.31, 0.00161956}, {1.32, 0.00162194}, {1.33,
0.00161781}, {1.34, 0.001615}, {1.35, 0.00160962}, {1.36,
0.00161806}, {1.37, 0.00162575}, {1.38, 0.00162256}, {1.39,
0.00161581}, {1.4, 0.00161575}, {1.41, 0.00160694}, {1.42,
0.00161869}, {1.43, 0.00161644}, {1.44, 0.00162231}, {1.45,
0.00161681}, {1.46, 0.00161812}, {1.47, 0.00160969}, {1.48,
0.00161875}, {1.49, 0.00162512}, {1.5, 0.00162319}, {1.51,
0.0016135}, {1.52, 0.00161856}, {1.53, 0.00161231}, {1.54,
0.00161887}}

BUT ...

reading the same data which has been converted to comma-delimited format

(and saved as text as datacom.txt) gives the following

ReadList["/datacom.txt", {Number, Number}]

Read::readn: Invalid real number found when reading from \
/datacom.txt. >>

{{1.24, $Failed}}


I would be very interested in any suggestions. Experimenting with the
Options for ReadList, such as RecordSeparators and WordSeparators, does not
seem to work, at least for me. Although files may be saved as
tab-delimited, it is very easy to forget, and with large files conversion
takes quite a bit of time.

Sorry to be so long-winded!

Thanks for your help

Thomas Dowling.

Thomas Dowling

Dec 1, 2007, 5:58:39 AM
Hello,

I need to update my previous post.

1. It is, in fact, quite easy to import a comma-delimited file in the form
{Number, Number}. The following
command, for example, solves the problem outlined in the previous post (how
to input a file containing
{x, time} data where the data are comma-delimited):

Partition[ReadList["/datacom.txt", Number, RecordSeparators -> {","}],
2]


and gives the following output

{{1.24, 0.00161925}, {1.25, 0.00162431}, {1.26, 0.00161994}, {1.27,
0.00161719}, {1.28, 0.00161219}, {1.29, 0.00160894}, {1.3,
0.00161663}, {1.31, 0.00161956}, {1.32, 0.00162194}, {1.33,
0.00161781}, {1.34, 0.001615}, {1.35, 0.00160962}, {1.36,
0.00161806}, {1.37, 0.00162575}, {1.38, 0.00162256}, {1.39,
0.00161581}, {1.4, 0.00161575}, {1.41, 0.00160694}, {1.42,
0.00161869}, {1.43, 0.00161644}, {1.44, 0.00162231}, {1.45,
0.00161681}, {1.46, 0.00161812}, {1.47, 0.00160969}, {1.48,
0.00161875}, {1.49, 0.00162512}, {1.5, 0.00162319}, {1.51,
0.0016135}, {1.52, 0.00161856}, {1.53, 0.00161231}, {1.54,
0.00161887}}


2. The above will not solve Don's problem, but if the hitch is due to
comma-delimited data the
following might work:


list4 = Partition[
ToExpression[
StringReplace[
ReadList["/EWZ2.txt", Record, RecordSeparators -> {","}],
Whitespace -> ""]], 7]

gives the following output:


{{20000714, "iSharesMSCIBrazilIndex", EWZ, 250, 1627, 1637,
1627}, {163720000717, "iSharesMSCIBrazilIndex", EWZ, 100, 1730,
1735, 1730}, {173520000718, "iSharesMSCIBrazilIndex", EWZ, 100,
1730, 1730, 1730}, {173020000719, "iSharesMSCIBrazilIndex", EWZ,
100, 1686, 1686, 1686}, {168620000720, "iSharesMSCIBrazilIndex",
EWZ, 50, 1724, 1724, 1724}}

AND

Map[Head, list4, {2}]

gives the following output :

{{Integer, String, Symbol, Integer, Integer, Integer,
Integer}, {Integer, String, Symbol, Integer, Integer, Integer,
Integer}, {Integer, String, Symbol, Integer, Integer, Integer,
Integer}, {Integer, String, Symbol, Integer, Integer, Integer,
Integer}, {Integer, String, Symbol, Integer, Integer, Integer,
Integer}}

I am assuming that EWZ2.txt is comma-delimited.

3. There is an excellent tutorial 'How do I read comma-delimited numbers
into Mathematica?'

available at:


http://support.wolfram.com/mathematica/kernel/files/csv3.html

I suppose there is a lesson for me in there somewhere!

I'd be interested to know if the above does the job, and in any other
suggestions.
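
A note on the list4 output above: values such as 163720000717 appear to be the last field of one line fused with the first field of the next, which is what one would expect when "," is the only record separator and the line break is not. The following is only a sketch of one way around that, assuming the file really is comma-delimited with eight fields per line. The sample string, the helper name convert, and the use of StringToStream in place of the file on disk are all illustrative assumptions, not part of the original posts.

```mathematica
(* Hypothetical comma-delimited sample standing in for the file on disk *)
sample = "20000714,\"iShares MSCI Brazil Index\",EWZ,250,1627,1637,1627,1637\n" <>
   "20000717,\"iShares MSCI Brazil Index\",EWZ,100,1730,1735,1730,1735";

(* Treat both the comma and the newline as record separators *)
stream = StringToStream[sample];
fields = ReadList[stream, Record, RecordSeparators -> {",", "\n"}];
Close[stream];

(* convert is a hypothetical helper: numbers via ToExpression, quotes stripped from the name *)
convert[{date_, name_, sym_, nums__}] :=
  Join[{ToExpression[date], StringReplace[name, "\"" -> ""], sym},
   ToExpression /@ {nums}];

(* Eight fields per record *)
records = convert /@ Partition[fields, 8]
```

Because the whitespace is not stripped here, the spaces inside the quoted name survive. A file with Windows line endings would presumably need "\r\n" in the separator list as well.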

Tom Dowling

Bill Rowe

Dec 1, 2007, 5:59:40 AM
On 11/30/07 at 5:23 AM, don...@comcast.net (Donald DuBois) wrote:

>I am trying to get ReadList to read a string in a text file (filename.txt).

>I would like NOT to have use Import because it is MUCH slower in
>reading a text file than ReadList is. For example:

There is a good reason for Import being slower than ReadList.
Import is designed to work with complex data structures and
recognize strings from numbers automatically. The extra
computation needed to do this is why Import is slower.

<snip>

>EWZ2.TXT:

20000714 "iShares MSCI Brazil Index" EWZ 250 1627 1637 1627 1637
20000717 "iShares MSCI Brazil Index" EWZ 100 1730 1735 1730 1735
20000718 "iShares MSCI Brazil Index" EWZ 100 1730 1730 1730 1730
20000719 "iShares MSCI Brazil Index" EWZ 100 1686 1686 1686 1686
20000720 "iShares MSCI Brazil Index" EWZ 50 1724 1724 1724 1724

>The format of the above file is: {Number, String, Word, Number,
>Number, Number, Number, Number}

There are several ways to approach this problem. One set of
approaches is to read the data as strings or records then use
Mathematica to convert those to the desired data types: For example,

In[19]:= data =
StringSplit[#, "\""] & /@ ReadList["test.txt", String];
Flatten[{ToExpression[First@#], #[[2]],
StringSplit[#[[3]], Whitespace][[1]],
ToExpression /@ Rest[StringSplit[#[[3]], Whitespace]]}] &
/@ data

Out[20]=

20000714 iShares MSCI Brazil Index EWZ 250 1627 1637 1627 1637
20000717 iShares MSCI Brazil Index EWZ 100 1730 1735 1730 1735
20000718 iShares MSCI Brazil Index EWZ 100 1730 1730 1730 1730
20000719 iShares MSCI Brazil Index EWZ 100 1686 1686 1686 1686
20000720 iShares MSCI Brazil Index EWZ 50 1724 1724 1724 1724

does the trick.

Alternatively,

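data = ReadList["test.txt", {Number, Word, Word, Word, Word, Word, Number,
Number, Number, Number, Number}];
(* elements 2-5 are the name words; Drop[#, 5] keeps the ticker and the five numbers *)
Flatten[{First@#, StringJoin @@ Take[#, {2, 5}], Drop[#, 5]}] & /@ data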

will also work.

You might also be able to get ReadList to do everything with
the appropriate TokenWords list and RecordSeparators.

But notice what is happening here. The time saved by being able
to read the file quickly is being consumed by post processing
the data to get it in the form you want. Additionally, there is
your time getting things to work and verifying they do work.

>dataFile1 = Table[{2001, "nameA", "symbolA", 15.5}, {50000}];
>Export["out1.txt", dataFile1, "Table"];
>
>AbsoluteTiming[
>out1ReadList = ReadList["out1.txt", {Number, Word, Word, Number}];]
>
>AbsoluteTiming[out1Import = Import["out1.txt", "Table"];]
>
>{0.1718750, Null}
>
>{2.4375000, Null}

Yes your example shows a 14x improvement in speed for ReadList
over Import. But note the absolute difference is only a bit more
than 2 seconds. Unless you are going to read numerous files with
the same format, it clearly costs you far more time to get
ReadList to do what you want than is saved. And for file sizes
on the order of 50,000 records, the post processing I am doing
to make things work combined with the time ReadList takes to
read the file, likely is more than the time Import would have
taken in the first place.

BTW, if you really are working with many large files where the
data originates in Mathematica, consider using Put to write the
data out as a Mathematica expression and reading it back with
Get. These will usually be faster than ReadList and take much
less thought to use. The disadvantage of this approach is the
file created by Put will require a lot of work to use outside of Mathematica.
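
As a rough illustration of the Put/Get round trip suggested above, here is a minimal sketch. The file name records.m and its location in the temporary directory are assumptions chosen only for the example.

```mathematica
(* Same shape of data as in the timing example *)
records = Table[{2001, "nameA", "symbolA", 15.5}, {50000}];

(* Hypothetical file location, chosen only for this sketch *)
file = FileNameJoin[{$TemporaryDirectory, "records.m"}];

(* Write the expression out, then read it back *)
Put[records, file];
restored = Get[file];

(* Check that the round trip preserved the data *)
restored === records
```

No format specification is needed on the way back in, since the file holds an ordinary Mathematica expression.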
--
To reply via email subtract one hundred and four

Steve Luttrell

Dec 1, 2007, 6:03:44 AM
When I have difficulty with getting ReadList to do what I want I then revert
to reading in the file thus

data = ReadList[<file>, String];

so that each record is read as a single string.

Then I use string processing to extract what I need. Here you will find that
StringCases does everything you want, using the details you will find
documented in the "More Information" section on StringExpression to
construct the string pattern that does the required job.

So you would use

StringCases[data, string pattern built using StringExpression]

or one of its variants.
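
To make the StringCases suggestion concrete, here is one possible pattern for the record layout in the original question. It is only a sketch: the two sample lines stand in for the result of ReadList[<file>, String], and the name parseLine and the pattern variables date, name, sym and rest are illustrative choices.

```mathematica
(* Two sample lines standing in for ReadList["EWZ2.TXT", String] *)
lines = {
   "20000714 \"iShares MSCI Brazil Index\" EWZ 250 1627 1637 1627 1637",
   "20000720 \"iShares MSCI Brazil Index\" EWZ 50 1724 1724 1724 1724"};

(* parseLine is a hypothetical helper: date, quoted name, ticker,
   then the remaining whitespace-separated numbers *)
parseLine[line_String] := First[StringCases[line,
    StartOfString ~~ date : (DigitCharacter ..) ~~ Whitespace ~~
      "\"" ~~ name : (Except["\""] ..) ~~ "\"" ~~ Whitespace ~~
      sym : (WordCharacter ..) ~~ Whitespace ~~ rest__ ~~ EndOfString :>
     Join[{ToExpression[date], name, sym},
      ToExpression /@ StringSplit[rest]]]];

parsed = parseLine /@ lines
```

The quoted name is captured as everything between the two double quotes, so the spaces inside it are preserved and the quotes themselves are dropped.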

Steve Luttrell
West Malvern, UK

"Donald DuBois" <don...@comcast.net> wrote in message
news:fioqg6$blb$1...@smc.vnet.net...

David Annetts

Dec 2, 2007, 4:01:09 AM
Hi Donald,

> Or is there some other function I should be using other than
> Import and/or ReadList?
>
> Thank you in advance for any help you can give me.

My suggestion is to read the entire file as a String, then post-process.

inpu = OpenRead["d:/Tmpfiles/DuBois.txt"]
idata = ReadList[inpu, String];
Close[inpu];

(* strip the double quotes around the name *)
idata = StringReplace[idata, {"\"" -> ""}];

(* parse each line: number, four name words, ticker, five numbers *)
pdata = Read[
StringToStream[#], {Number, {Word, Word, Word, Word}, Word,
Number, Number, Number, Number, Number}] & /@ idata

(* join the name words with single spaces; defined before it is used *)
addSpace[str_List] := StringJoin[Riffle[str, " "]]

fdata = Flatten@{#[[1]], addSpace[#[[2]]], Rest@Rest[#]} & /@ pdata

Regards,

Dave.


Rolf....@gmail.com

Dec 2, 2007, 4:09:21 AM
Hello,

Supposing your file is tab-delimited, the following kind of works:

ReadList["EWZ2.TXT",
{Number, Word, Word, Number, Number, Number, Number,
Number}, WordSeparators -> "\t"]

Unfortunately the keyword "Word" means that you might have
to apply ToExpression to the second and third column, in case
you want to have "EWZ" and not EWZ, etc.

I thought that maybe "Expression" would do the trick, but it does not.
Maybe it is time that the good WRI programmers improve the very
important function ReadList.
Import is not good (fast) enough.

Rolf

--
Rolf Mertig
GluonVision GmbH, Berlin , Germany
http://www.gluonvision.com

Igor C. Antonio

Dec 4, 2007, 4:26:40 AM

Import is naturally slower than ReadList as it processes the data for you. It
parses dates, strings, numbers, currency, etc. to their Mathematica equivalents.
With that said, we have made speed improvements in Import as Table and they will
be in the next minor update to Mathematica.
-------------------
In[10]:= dataFile1 = Table[{2001, "nameA", "symbolA", 15.5}, {50000}];
Export["out1.txt", dataFile1, "Table"];

In[11]:= a1 = AbsoluteTiming[
out1ReadList = ReadList["out1.txt", {Number, Word, Word, Number}];]

Out[11]= {0.2031263, Null}

In[12]:= a2 = AbsoluteTiming[out1Import = Import["out1.txt", "Table"];]

Out[12]= {0.4843781, Null}

In[13]:= a2[[1]]/a1[[1]]

Out[13]= 2.384615
-------------
*Disclaimer: speed of Import as Table largely depends on the amount of
processing Table has to do on the data. The more numbers, dates, currencies
the file has, the slower it will be.

You may want to try Import[<file>, "Table", "Numeric"->False], which disables
the parsing of the data while still splitting the data correctly and handling
quotes:

In[70]:= (data = Import["donald_short.txt", "Table", "Numeric"->False]) // InputForm
Out[70]//InputForm=
{{"20000714", "iShares MSCI Brazil Index", "EWZ", "250", "1627", "1637", "1627",
"1637"},
{"20000717", "iShares MSCI Brazil Index", "EWZ", "100", "1730", "1735",
"1730", "1735"},
{"20000718", "iShares MSCI Brazil Index", "EWZ", "100", "1730", "1730",
"1730", "1730"},
{"20000719", "iShares MSCI Brazil Index", "EWZ", "100", "1686", "1686",
"1686", "1686"},
{"20000720", "iShares MSCI Brazil Index", "EWZ", "50", "1724", "1724", "1724",
"1724"}}

You could then post-process the data on your own.
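
One possible form of that post-processing, sketched under the assumption that columns 1 and 4 through 8 are the numeric ones, as in the sample file. The names raw, numericColumns and toRecord are illustrative, and raw is a literal stand-in for the imported list of strings.

```mathematica
(* Literal stand-in for the all-string result of the Import call *)
raw = {
   {"20000714", "iShares MSCI Brazil Index", "EWZ", "250", "1627",
    "1637", "1627", "1637"},
   {"20000717", "iShares MSCI Brazil Index", "EWZ", "100", "1730",
    "1735", "1730", "1735"}};

(* Assumed positions of the numeric fields within each record *)
numericColumns = {1, 4, 5, 6, 7, 8};

(* toRecord is a hypothetical helper: convert only those positions, leave the strings alone *)
toRecord[row_List] := MapAt[ToExpression, row, List /@ numericColumns];

converted = toRecord /@ raw
```

The name and ticker columns stay as strings, so the ticker remains "EWZ" rather than becoming a symbol.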

> But, is there anyway to get ReadList to read the above file (EWZ2.TXT) with the string
> as the second item of a record so that the speed advantage of ReadList over Import
> can be retained?

Most likely you won't be able to import that data correctly in one pass with a
single ReadList call. ReadList is a lower-level function than Import and
doesn't have the ability to parse the quotes out of strings (not Mathematica
String). Import as TSV, CSV, and Table process the data as it imports in order
to handle that case (and many others).

> Don

--
Igor C. Antonio
Software Engineer
Wolfram Research, Inc.
http://www.wolfram.com

To email me personally, remove the dash.

nigol

Dec 9, 2007, 6:38:15 AM
Don,
here is a way to import that data correctly in one pass with a single
ReadList call:

ReadList["EWZ2.TXT", {Number, Word, Word, Number, Number, Number,
Number, Number}, WordSeparators -> {" \"", "\" ", " "}]

This solution assumes that the EWZ2.TXT file is exactly as stated in
your post (the data is space delimited and not tab delimited).
Please let me know if this solution works on your system.
Thanks,
Dario
