Due to a bug in my parser code I decided to switch to TFileStream for
data read and writes. An added benefit to this change is that the
code now works with other stream types. However, I find the code is
not able to read through larger text files as quickly using
TFileStream as it could using the native Delphi file handler.
To improve performance I am using a buffered stream reader, which
helps a lot, but it's still not as fast as the original code.
Additionally, I found this note in the Delphi 2007 help file for the
SysUtils.FileOpen routine:
"Use of the non-native Delphi language file handlers such as FileOpen
is not encouraged. These routines map to system routines and return OS
file handles, not normal Delphi file variables. These are low-level
file access routines. For normal file operations use AssignFile,
Rewrite, and Reset instead."
TFileStream uses the non-native file handlers. This makes me wonder
why non-native file handlers are not encouraged.
So my question is, which approach is the best for working with large
text files, native Delphi file handlers or TFileStream?
My thought at the moment is that native Delphi file handlers will perform
better but the slower performance I am experiencing could be the
buffered stream reader I am using.
Thanks,
-KIRBY
> I'm writing a program that must read and write large text files.
> Originally I was using the native Delphi approach, TextFile with the
> routines AssignFile, ReadLn, and so on.
>
> Due to a bug in my parser code I decided to switch to TFileStream for
> data read and writes. An added benefit to this change is that the
> code now works with other stream types. However, I find the code is
> not able to read through larger text files as quickly using
> TFileStream as it could using the native Delphi file handler.
>
> To improve performance I am using a buffered stream reader, which
> helps a lot, but it's still not as fast as the original code.
> Additionally, I found this note in the Delphi 2007 help file for the
> SysUtils.FileOpen routine:
>
> "Use of the non-native Delphi language file handlers such as FileOpen
> is not encouraged. These routines map to system routines and return OS
> file handles, not normal Delphi file variables. These are low-level
> file access routines. For normal file operations use AssignFile,
> Rewrite, and Reset instead."
>
> TFileStream uses the non-native file handlers. This makes me wonder
> why non-native file handlers are not encouraged.
You can just as well use the corresponding API functions directly. *Every*
file access boils down to them eventually, even the good old Textfile
I/O routines. The main advantage of using the higher level wrappers is
that they are easier to use and will raise exceptions when something
goes wrong instead of requiring you to check cryptic return codes for
every operation.
> So my question is, which approach is the best for working with large
> text files, native Delphi file handlers or TFileStream?
I would stay with the Textfile type and its support routines.
Especially if you assign a larger buffer (see the SetTextBuf procedure)
they will beat the pants off naked TFileStream access. And you get
parsing into lines for free. The key is efficient buffering, so your
code has to go to the file system only to read large chunks of data,
not every single character.
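
For the archives, the SetTextBuf idea looks roughly like this. This is
only a sketch; the file name and the 64 KB buffer size are placeholders,
and the default buffer it replaces is a mere 128 bytes:

```pascal
var
  F: TextFile;
  Buf: array[0..65535] of Byte; // 64 KB instead of the default 128 bytes
  Line: string;
begin
  AssignFile(F, 'bigfile.txt');            // placeholder file name
  SetTextBuf(F, Buf, SizeOf(Buf));         // must be called before Reset
  Reset(F);
  try
    while not Eof(F) do
    begin
      ReadLn(F, Line); // parsing into lines comes for free
      // process Line here
    end;
  finally
    CloseFile(F);
  end;
end;
```

The buffer must stay in scope for as long as the file is open, which is
why it is declared alongside the file variable.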
--
Peter Below (TeamB)
Don't be a vampire (http://slash7.com/pages/vampires),
use the newsgroup archives :
http://www.tamaracka.com/search.htm
http://groups.google.com
http://www.prolix.be
> To improve performance I am using a buffered stream reader, which
> helps a lot but it's not still as fast as the original code.
> Additional I found this note in the Delphi 2007 help file for the
> SysUtils.FileOpen routine:
A good implementation of a buffered stream reader should be roughly as
fast as the "old" SysUtils routines. A lot depends on an efficient
implementation of the character and line access methods.
If a buffered stream reader is still too slow you could switch to a
memory mapped file wrapper instead.
I use both (depending on size, for multi-gigabyte files I use buffered
file streams).
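
To illustrate what I mean by an efficient implementation, here is a
minimal sketch of the buffering idea over any TStream. The class name
TBufferedReader is made up; a real reader also needs line access,
Unicode handling and error checking:

```pascal
uses
  Classes;

type
  // Sketch: refill a large buffer with one Stream.Read call, then hand
  // out characters from memory instead of hitting the file system.
  TBufferedReader = class
  private
    FStream: TStream;
    FBuf: array of Byte;
    FPos, FCount: Integer;
  public
    constructor Create(AStream: TStream; ABufSize: Integer = 65536);
    function ReadChar(out C: AnsiChar): Boolean; // False at end of stream
  end;

constructor TBufferedReader.Create(AStream: TStream; ABufSize: Integer);
begin
  inherited Create;
  FStream := AStream;
  SetLength(FBuf, ABufSize);
end;

function TBufferedReader.ReadChar(out C: AnsiChar): Boolean;
begin
  if FPos >= FCount then
  begin
    FCount := FStream.Read(FBuf[0], Length(FBuf)); // one big read
    FPos := 0;
  end;
  Result := FCount > 0;
  if Result then
  begin
    C := AnsiChar(FBuf[FPos]);
    Inc(FPos);
  end;
end;
```

The per-character work is then just an index increment, which is the
whole point of the exercise.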
Danny
---
>I would stay with the Textfile type and its support routines.
>Especially if you assign a larger buffer (see the SetTextBuf procedure)
>they will beat the pants off naked TFileStream access. And you get
>parsing into lines for free.
Parsing into lines for free is why I originally used TextFile. But
the bug I encountered that led me to switch to TFileStream was an
embedded new line in values from a CSV text file. The embedded new
lines caused the records in the CSV file to split when calling ReadLn.
Switching to TFileStream was an easy fix but resulted in poorer
performance.
I think at this point I need to apply a bug fix for the embedded new
line using the TextFile approach and compare the performance to that
of the buffered stream reader approach. My gut tells me the TextFile
approach will be faster, but the code for the stream approach is easier
to write.
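
The fix I have in mind is a wrapper around ReadLn that keeps appending
physical lines while a quoted field is still open. A rough sketch
(ReadCsvRecord is a made-up name; it counts quote characters, so
escaped "" pairs keep the parity correct):

```pascal
// Sketch: read one logical CSV record, joining physical lines as long
// as the total number of quote characters seen so far is odd, i.e. a
// quoted field is still open across a line break.
function ReadCsvRecord(var F: TextFile): string;
var
  Line: string;
  I, Quotes: Integer;
begin
  Result := '';
  Quotes := 0;
  repeat
    ReadLn(F, Line);
    if Result <> '' then
      Result := Result + sLineBreak; // keep the embedded new line
    Result := Result + Line;
    for I := 1 to Length(Line) do
      if Line[I] = '"' then
        Inc(Quotes);
  until Eof(F) or (Quotes mod 2 = 0);
end;
```
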
Thanks to all for the responses.
-KIRBY
Kirby, If performance is important and you can spend
the programming time, then I recommend using
the basic BlockRead() or direct OS API type of calls
and coding the parsing of the lines and other
info yourself with PChar pointers. I wrote a GEDCOM parser
years ago doing this. If after first coding this in Delphi,
you would like even more performance, then you can
use a profiler on the running code and convert the
bottlenecks over to asm code. Regards, JohnH
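
As a starting point, the BlockRead approach might be sketched like
this. The file name is a placeholder, and the handling of a line split
across two chunks is deliberately left out:

```pascal
const
  BufSize = 64 * 1024;
var
  F: file;                               // untyped file for BlockRead
  Buf: array[0..BufSize - 1] of AnsiChar;
  BytesRead: Integer;
  P, LineStart, BufEnd: PAnsiChar;
begin
  AssignFile(F, 'bigfile.txt');          // placeholder file name
  Reset(F, 1);                           // record size 1 = byte-oriented
  try
    repeat
      BlockRead(F, Buf, BufSize, BytesRead);
      P := @Buf[0];
      BufEnd := P + BytesRead;
      LineStart := P;
      while P < BufEnd do
      begin
        if P^ = #10 then
        begin
          // one line spans LineStart..P-1 (minus a trailing #13, if any)
          LineStart := P + 1;
        end;
        Inc(P);
      end;
      // NOTE: a real parser must carry the partial line at the end of
      // this buffer over into the next chunk.
    until BytesRead = 0;
  finally
    CloseFile(F);
  end;
end;
```
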
That's not a bug, it's a horribly formatted CSV file!
CSV format does not have much of a definition, but
quoted strings should certainly not span lines.
But if that is the data that you have to work with, then
you have to find some way around it.
You should still be able to use ReadLn though, just
don't assume that each string you read is a complete record.
Cheers,
Chris
Quoted strings can contain any character except
an unescaped quote character, and thus certainly
end-of-line sequences.
How else would you transfer the contents of memo fields?
Regards from Germany
Franz-Leo
You shouldn't use CSV files for this IMHO.
But CSV is an ad-hoc format with no formal definition,
so it is very difficult to create a universal CSV-reader.
Microsoft Excel can't even read its own CSV files if
they are created in a different locale (comma vs. semicolon
separator).
Cheers,
Chris
I would recommend you try memory-mapped files; in my experience
they give the best performance of all these techniques.
There is a good implementation of MMF in TJclMappedTextReader from JEDI
library. TJclMappedTextReader also has ReadLn, so you wouldn't have to
change much in your application.
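
Usage should be close to this sketch. I am assuming the constructor
takes a file name and that Eof and ReadLn are exposed as shown; check
JclFileUtils in your JCL version for the exact signatures. ProcessLine
stands in for your own code:

```pascal
uses
  JclFileUtils; // JEDI Code Library

var
  Reader: TJclMappedTextReader;
begin
  // Assumed constructor form; verify against your JCL version.
  Reader := TJclMappedTextReader.Create('bigfile.csv');
  try
    while not Reader.Eof do
      ProcessLine(Reader.ReadLn); // ProcessLine is your own routine
  finally
    Reader.Free;
  end;
end;
```
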
Iman
>Kirby, If performance is important and you can spend
>the programming time, then I recommend using
>the basic BlockRead() or direct OS API type of calls
>and coding yourself the parsing of the lines and other
>info with PChar pointers.
You are right. Performance is important for my application so I spent
the extra time programming and testing different scenarios.
I found performance using TFileStream with a buffered stream reader to
be on par with TextFile and ReadLn. My initial performance problems with
the buffered stream reader were related to an event fired by my code
when a new line is read from the stream.
I'm now happy with the speed at which my application is able to load
large files. And I have the added benefit of supporting streams in
the code which gives me more options for reuse in the future.
-KIRBY
>You shouldn't use CSV files for this IMHO.
I agree but sometimes you have no choice. For instance, a customer of
mine receives CSV data files from one of its partners. The partner
does not support XML or other formats so CSV files it is.
As much as I would love to see my customer move away from CSV files, I
don't see it happening anytime soon.
-KIRBY
>I would recommend you to try memory mapped files, from my experience
>they give the best performance to all other techniques.
Thanks for the suggestion. I'll look into memory mapped files in the
near future.
For the time being, I'm able to use a TFileStream with a buffered
stream reader and performance is on par with TextFile and ReadLn. At
first I thought using a stream was slower, but I figured out I had a bug
in my code. An event fired by my code during the reads was causing
the performance issue when the code used a stream.
-KIRBY
Deborah Pate has a TStream for memory mapped files. You should be able to
drop it in as a replacement for TFileStream.
--
Iman