
working with large text files


Kirby Turner

Apr 10, 2007, 11:45:43 AM
I'm writing a program that must read and write large text files.
Originally I was using the native Delphi approach, TextFile with the
routines AssignFile, ReadLn, and so on.

Due to a bug in my parser code I decided to switch to TFileStream for
data read and writes. An added benefit to this change is that the
code now works with other stream types. However, I find the code is
not able to read through larger text files as quickly using
TFileStream as it could using the native Delphi file handler.

To improve performance I am using a buffered stream reader, which
helps a lot, but it's still not as fast as the original code.
Additionally, I found this note in the Delphi 2007 help file for the
SysUtils.FileOpen routine:

"Use of the non-native Delphi language file handlers such as FileOpen
is not encouraged. These routines map to system routines and return OS
file handles, not normal Delphi file variables. These are low-level
file access routines. For normal file operations use AssignFile,
Rewrite, and Reset instead."

TFileStream uses the non-native file handlers. This makes me wonder
why non-native file handlers are not encouraged.

So my question is, which approach is the best for working with large
text files, native Delphi file handlers or TFileStream?

My thought at the moment is native Delphi file handlers will perform
better but the slower performance I am experiencing could be the
buffered stream reader I am using.
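
A buffered reader of the kind described might be sketched like this (a hypothetical illustration, not the poster's actual code; `ProcessLine` stands in for whatever per-line work the parser does):

```pascal
procedure ReadLinesBuffered(Stream: TStream);
const
  BufSize = 64 * 1024; // read 64 KB at a time instead of byte-by-byte
var
  Buf: array[0..BufSize - 1] of AnsiChar;
  Pending, Chunk, Line: string;
  BytesRead, P: Integer;
begin
  Pending := '';
  repeat
    BytesRead := Stream.Read(Buf, BufSize);
    SetString(Chunk, PAnsiChar(@Buf[0]), BytesRead);
    Pending := Pending + Chunk;
    P := Pos(#10, Pending);
    while P > 0 do
    begin
      Line := Copy(Pending, 1, P - 1);
      if (Line <> '') and (Line[Length(Line)] = #13) then
        SetLength(Line, Length(Line) - 1); // strip the CR of a CRLF pair
      ProcessLine(Line);       // hypothetical per-line callback
      Delete(Pending, 1, P);   // drop the consumed line plus its #10
      P := Pos(#10, Pending);
    end;
  until BytesRead = 0;
  if Pending <> '' then
    ProcessLine(Pending);      // final line with no trailing newline
end;
```

The point of the buffering is that the file system is hit once per 64 KB chunk, not once per character; lines that straddle a chunk boundary are carried over in `Pending`.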

Thanks,
-KIRBY

Peter Below (TeamB)

Apr 10, 2007, 3:24:06 PM
Kirby Turner wrote:

> I'm writing a program that must read and write large text files.
> Originally I was using the native Delphi approach, TextFile with the
> routines AssignFile, ReadLn, and so on.
>
> Due to a bug in my parser code I decided to switch to TFileStream for
> data read and writes. An added benefit to this change is that the
> code now works with other stream types. However, I find the code is
> not able to read through larger text files as quickly using
> TFileStream as it could using the native Delphi file handler.
>
> To improve performance I am using a buffered stream reader, which
> helps a lot, but it's still not as fast as the original code.
> Additionally, I found this note in the Delphi 2007 help file for the
> SysUtils.FileOpen routine:
>
> "Use of the non-native Delphi language file handlers such as FileOpen
> is not encouraged. These routines map to system routines and return OS
> file handles, not normal Delphi file variables. These are low-level
> file access routines. For normal file operations use AssignFile,
> Rewrite, and Reset instead."
>
> TFileStream uses the non-native file handlers. This makes me wonder
> why non-native file handlers are not encouraged.

You could just as well use the corresponding API functions directly. *Every*
file access boils down to them eventually, even the good old Textfile
I/O routines. The main advantage of using the higher level wrappers is
that they are easier to use and will raise exceptions when something
goes wrong instead of requiring you to check cryptic return codes for
every operation.

> So my question is, which approach is the best for working with large
> text files, native Delphi file handlers or TFileStream?

I would stay with the TextFile type and its support routines.
Especially if you assign a larger buffer (see the SetTextBuf procedure)
they will beat the pants off naked TFileStream access. And you get
parsing into lines for free. The key is efficient buffering, so your
code has to go to the file system only to read large chunks of data,
not every single character.
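
For reference, enlarging the text buffer is a two-line change; a minimal sketch (the file name and the 64 KB size are placeholders):

```pascal
var
  F: TextFile;
  Buf: array[0..65535] of Byte; // 64 KB instead of the default 128-byte buffer
  Line: string;
begin
  AssignFile(F, 'big.txt');        // placeholder file name
  SetTextBuf(F, Buf, SizeOf(Buf)); // attach the larger buffer before any I/O on F
  Reset(F);
  try
    while not Eof(F) do
    begin
      ReadLn(F, Line);
      // parse Line here
    end;
  finally
    CloseFile(F);
  end;
end;
```

Note that Buf must stay in scope for as long as the file is open, since the RTL reads through it.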


--
Peter Below (TeamB)
Don't be a vampire (http://slash7.com/pages/vampires),
use the newsgroup archives :
http://www.tamaracka.com/search.htm
http://groups.google.com
http://www.prolix.be

danny heijl

Apr 10, 2007, 3:00:47 PM
Kirby Turner wrote:


> To improve performance I am using a buffered stream reader, which
> helps a lot, but it's still not as fast as the original code.
> Additionally, I found this note in the Delphi 2007 help file for the
> SysUtils.FileOpen routine:

A good implementation of a buffered stream reader should be roughly as
fast as the "old" SysUtils routines. A lot depends on an efficient
implementation of the character and line access methods.

If a buffered stream reader is still too slow you could switch to a
memory mapped file wrapper instead.

I use both (depending on size, for multi-gigabyte files I use buffered
file streams).

Danny
---

Kirby Turner

Apr 10, 2007, 5:05:10 PM
On 10 Apr 2007 11:24:06 -0800, "Peter Below (TeamB)" <none> wrote:

>I would stay with the TextFile type and its support routines.
>Especially if you assign a larger buffer (see SetTextBuf procedure)
>they will beat the pants off naked TFilestream access. And you get
>parsing into lines for free.


Parsing into lines for free is why I originally used TextFile. But
the bug I encountered that led me to switch to TFileStream was an
embedded new line in values from a CSV text file. The embedded new
lines caused the records in the CSV file to split when calling ReadLn.
Switching to TFileStream was an easy fix but resulted in poorer
performance.

I think at this point I need to apply a bug fix for the embedded new
line using the TextFile approach and compare the performance to that
of the buffered stream reader approach. My gut tells me the TextFile
approach will be faster but code for the stream approach is easier to
write.

Thanks to all for the responses.

-KIRBY

John Herbster

Apr 11, 2007, 6:53:26 AM

"Kirby Turner" <ki...@whitepeaksoftware.com> wrote
> ...

> Parsing into lines for free is why I originally used
> TextFile. But the bug I encountered that led me to
> switch to TFileStream was an embedded new line
> in values from a CSV text file. The embedded new
> lines caused the records in the CSV file to split
> when calling ReadLn. Switching to TFileStream was
> an easy fix but resulted in poorer performance.

Kirby, if performance is important and you can spend
the programming time, then I recommend using
the basic BlockRead() or direct OS API type of calls
and coding yourself the parsing of the lines and other
info with PChar pointers. I wrote a GEDCOM parser
years ago doing this. If after first coding this in Delphi,
you would like even more performance, then you can
use a profiler on the running code and convert the
bottlenecks over to asm code. Regards, JohnH
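
A rough sketch of the BlockRead-plus-PChar approach being suggested (hypothetical code, not John's parser; for brevity it does not carry over lines that straddle a buffer boundary, which real code must handle):

```pascal
var
  F: file;                          // untyped file for raw block I/O
  Buf: array[0..65535] of AnsiChar;
  BytesRead: Integer;
  P, LineStart, BufEnd: PAnsiChar;
begin
  AssignFile(F, 'big.txt');         // placeholder file name
  Reset(F, 1);                      // record size 1 = read raw bytes
  try
    repeat
      BlockRead(F, Buf, SizeOf(Buf), BytesRead);
      P := @Buf[0];
      BufEnd := P + BytesRead;
      LineStart := P;
      while P < BufEnd do
      begin
        if P^ = #10 then
        begin
          // [LineStart, P) is one line, possibly with a trailing #13
          ProcessLine(LineStart, P - LineStart); // hypothetical callback
          LineStart := P + 1;
        end;
        Inc(P);
      end;
      // NOTE: a real parser must save [LineStart, BufEnd) for the next pass
    until BytesRead = 0;
  finally
    CloseFile(F);
  end;
end;
```

Scanning with pointers avoids the per-line string allocations that ReadLn and stream wrappers do, which is where this approach picks up its speed.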

Chris Morgan

Apr 11, 2007, 9:50:33 AM
> Parsing into lines for free is why I originally used TextFile. But
> the bug I encountered that led me to switch to TFileStream was an
> embedded new line in values from a CSV text file. The embedded new
> lines caused the records in the CSV file to split when calling ReadLn.
> Switching to TFileStream was an easy fix but resulted in poorer
> performance.

That's not a bug, it's a horribly formatted CSV file!
CSV format does not have much of a definition, but
quoted strings should certainly not span lines.

But if that is the data that you have to work with, then
you have to find some way around it.
You should still be able to use ReadLn though, just
don't assume that each line you read is a complete record.
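
One way to keep ReadLn while coping with embedded newlines is to accumulate physical lines until the double-quote count is even, i.e. no quoted field has been left open. A sketch (it assumes quotes inside fields are escaped by doubling, which keeps the parity test valid):

```pascal
// Reads one logical CSV record, which may span several physical lines.
function ReadCsvRecord(var F: TextFile): string;
var
  Line: string;
  i, Quotes: Integer;
begin
  Result := '';
  Quotes := 0;
  repeat
    ReadLn(F, Line);
    if Result <> '' then
      Result := Result + sLineBreak; // re-insert the newline ReadLn consumed
    Result := Result + Line;
    for i := 1 to Length(Line) do
      if Line[i] = '"' then
        Inc(Quotes);
  until (Quotes mod 2 = 0) or Eof(F); // even quote count = record complete
end;
```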

Cheers,

Chris


Franz-Leo Chomse

Apr 11, 2007, 11:20:01 AM

>That's not a bug, it's a horribly formatted CSV file!
>CSV format does not have much of a definition, but
>quoted strings should certainly not span lines.

Quoted strings can contain any character except an unescaped
quote character, thus certainly end-of-line sequences.

How else would you transfer the contents of memo fields?

Regards from Germany

Franz-Leo

Chris Morgan

Apr 11, 2007, 12:48:32 PM

You shouldn't use CSV files for this IMHO.

But CSV is an ad-hoc format with no formal definition,
so it is very difficult to create a universal CSV-reader.
Microsoft Excel can't even read its own CSV files if
they are created in a different locale (comma vs semi-colon
separator).

Cheers,

Chris


Igor Savkic

Apr 11, 2007, 5:02:32 PM
> So my question is, which approach is the best for working with large
> text files, native Delphi file handlers or TFileStream?

I would recommend trying memory-mapped files; in my experience
they give the best performance of all these techniques.
There is a good implementation of MMF in TJclMappedTextReader from JEDI
library. TJclMappedTextReader also has ReadLn, so you wouldn't have to
change much in your application.
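
Usage would be roughly like this (a sketch from memory; check the exact unit name and constructor overloads against your JCL version, and `ProcessLine` is a placeholder for your own per-line handling):

```pascal
uses
  JclFileUtils; // unit assumed; TJclMappedTextReader lives in the JCL

var
  Reader: TJclMappedTextReader;
begin
  Reader := TJclMappedTextReader.Create('big.txt'); // placeholder file name
  try
    while not Reader.Eof do
      ProcessLine(Reader.ReadLn); // same line-by-line shape as TextFile code
  finally
    Reader.Free;
  end;
end;
```

Because the file is memory-mapped, the OS pages it in on demand and no explicit buffering code is needed at all.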


Iman L Crawford

Apr 11, 2007, 11:32:30 PM
"Igor Savkic" <no_spam_i...@gmail.com> wrote in news:461d4cf3
@newsgroups.borland.com:

> There is a good implementation of MMF in TJclMappedTextReader from JEDI
> library. TJclMappedTextReader also has ReadLn, so you wouldn't have to
> change much in your application.

Deborah Pate has a TMemoryMappedStream. It's working pretty well for me.

Iman

Kirby Turner

Apr 14, 2007, 12:41:21 PM
On Wed, 11 Apr 2007 05:53:26 -0500, "John Herbster"
<herb-sci1_AT_sbcglobal.net> wrote:

>Kirby, If performance is important and you can spend
>the programming time, then I recommend using
>the basic BlockRead() or direct OS API type of calls
>and coding yourself the parsing of the lines and other
>info with PChar pointers.


You are right. Performance is important for my application so I spent
the extra time programming and testing different scenarios.

I found performance using TFileStream with a buffered stream reader to
be on par with TextFile and ReadLn. My initial performance problems with
the buffered stream reader were related to an event fired by my code
when a new line is read from the stream.

I'm now happy with the speed at which my application is able to load
large files. And I have the added benefit of supporting streams in
the code which gives me more options for reuse in the future.

-KIRBY

Kirby Turner

Apr 14, 2007, 12:46:19 PM
On Wed, 11 Apr 2007 17:48:32 +0100, "Chris Morgan" <chris.nospam at
lynxinfo dot co dot uk> wrote:

>You shouldn't use CSV files for this IMHO.

I agree but sometimes you have no choice. For instance, a customer of
mine receives CSV data files from one of its partners. The partner
does not support XML or other formats so CSV files it is.

As much as I would love to see my customer move away from CSV files, I
don't see it happening anytime soon.

-KIRBY

Kirby Turner

Apr 14, 2007, 12:50:08 PM
On Wed, 11 Apr 2007 23:02:32 +0200, "Igor Savkic"
<no_spam_i...@gmail.com> wrote:

>I would recommend trying memory-mapped files; in my experience
>they give the best performance of all these techniques.

Thanks for the suggestion. I'll look into memory mapped files in the
near future.

For the time being, I'm able to use a TFileStream with a buffered
stream reader and performance is on par with TextFile and ReadLn. At
first I thought using a stream was slower, but I figured out I had a bug
in my code. An event fired by my code during the reads was causing
the performance issue when the code used a stream.

-KIRBY

Iman L Crawford

Apr 15, 2007, 11:19:29 PM
Kirby Turner <ki...@whitepeaksoftware.com> wrote in
news:7b122310ej777hftf...@4ax.com:

> For the time being, I'm able to use a TFileStream with a buffered
> stream reader and performance is on par with TextFile and ReadLn

Deborah Pate has a TStream for memory mapped files. You should be able to
drop it in as a replacement for TFileStream.

--

Iman
