On 10.09.15 00.15, MikeCopeland wrote:
> Given files of several thousand records, each up to 1400 characters,
> scanning every character, deleting many and replacing some with other
> data characters. Is it better (more efficient/faster) to use a string
> iterator or the ".at(pos)" function to do this type of work? TIA
Erasing characters one at a time from a string or vector ends up
with a complexity of O(n²), because every erase shifts all following
characters. That is evil for large files.
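For illustration, a minimal sketch of the naive in-place pattern
that triggers the quadratic behavior (is_unwanted is a placeholder
for whatever rule you actually use):

  #include <string>

  // Placeholder predicate -- substitute your own rule.
  bool is_unwanted(char c) { return c == '\r'; }

  void strip_naive(std::string& s)
  {
      for (std::string::size_type i = 0; i < s.size(); )
      {
          if (is_unwanted(s[i]))
              s.erase(i, 1);  // each erase shifts the tail: O(n),
          else                // so the whole loop is O(n^2)
              ++i;
      }
  }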
You could reduce this to O(n log n) by using a segmented container
(e.g. a rope). But that is still in the order of sorting the entire
file content.
If you are looking for a /fast/ solution, you need to implement a
stream processor. I.e. a buffered source stream reads blocks of the
original file, and a buffered output stream writes to the
destination file. The stream processor in the middle forwards all
unchanged characters from the source stream to the destination
stream, skips the characters to delete, and replaces the others.
This is an O(n) operation as long as the context required to decide
which characters to keep, delete or replace has a bounded, constant
size (probably at most 1400 characters, i.e. one record, in your
case).
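A minimal sketch of such a stream processor, assuming a context of a
single character is enough to decide (should_delete, should_replace
and replacement are placeholders for your own logic):

  #include <fstream>

  // Placeholder rules -- substitute your own logic.
  bool should_delete(char c)  { return c == '\r'; }
  bool should_replace(char c) { return c == '\t'; }
  char replacement(char)      { return ' '; }

  int main(int argc, char* argv[])
  {
      if (argc != 3)
          return 1;
      std::ifstream in(argv[1], std::ios::binary);
      std::ofstream out(argv[2], std::ios::binary);
      char c;
      while (in.get(c))
      {
          if (should_delete(c))
              continue;            // skip
          if (should_replace(c))
              c = replacement(c);  // replace
          out.put(c);              // forward unchanged or replaced
      }
  }

get() and put() go through the streams' internal buffers, so despite
the character-wise loop this stays a single O(n) pass over the file.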
This implementation never keeps the whole file in memory, which may
be an important property for files of some GB in size. I.e. it is
O(1) with respect to memory usage.
If the latter (memory) does not count, you could read the entire
source file into a buffer and process it in the same way into the
destination buffer, which might even be the same buffer since you
did not mention inserts. However, this is likely to be slightly
slower on real-life hardware, because the larger working set of your
process will probably hurt memory cache efficiency.
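A sketch of that in-place variant, reusing the placeholder rules
from above (no inserts means the result never grows, so a read
cursor and a write cursor over the same buffer suffice):

  #include <fstream>
  #include <iterator>
  #include <string>

  void process_in_place(std::string& buf)
  {
      std::string::size_type w = 0;  // write cursor
      for (std::string::size_type r = 0; r < buf.size(); ++r)
      {
          char c = buf[r];
          if (should_delete(c))
              continue;
          if (should_replace(c))
              c = replacement(c);
          buf[w++] = c;  // compact while scanning: one O(n) pass
      }
      buf.resize(w);
  }

  // Usage: slurp, process, write back out.
  // std::ifstream in("input.dat", std::ios::binary);
  // std::string buf((std::istreambuf_iterator<char>(in)),
  //                 std::istreambuf_iterator<char>());
  // process_in_place(buf);
  // std::ofstream("output.dat", std::ios::binary)
  //     .write(buf.data(), static_cast<std::streamsize>(buf.size()));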
Another question is which stream buffers to use. Although the
standard buffered iostreams probably fit your needs, I have seen
many really slow iostream implementations. You have to test whether
your target platform is affected here. If the standard
implementation is not too bad and the algorithm for choosing the
characters to skip or replace is not too complex, you will likely be
I/O bound.
Make the stream buffers large enough to keep the I/O subsystem
efficient. A few MB are a good choice to avoid latency problems even on
rotational disks.
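One way to enlarge the buffer of a standard stream is pubsetbuf;
note that it must be called before the file is opened and that its
effect is implementation-defined, so verify it on your platform:

  #include <fstream>
  #include <vector>

  int main()
  {
      std::vector<char> bufmem(4 * 1024 * 1024);  // 4 MB buffer
      std::ifstream in;
      // Must precede open(); effect is implementation-defined.
      in.rdbuf()->pubsetbuf(bufmem.data(), bufmem.size());
      in.open("input.dat", std::ios::binary);
      // ... process as above ...
  }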
Marcel