
Advantage or Not?


MikeCopeland

Sep 9, 2015, 6:15:56 PM
Given files of several thousand records, each up to 1400 characters,
scanning every character, deleting many and replacing some with other
data characters. Is it better (more efficient/faster) to use a string
iterator or the ".at(pos)" function to do this type of work? TIA


Mr Flibble

Sep 9, 2015, 6:23:53 PM
On 09/09/2015 23:15, MikeCopeland wrote:
> Given files of several thousand records, each up to 1400 characters,
> scanning every character, deleting many and replacing some with other
> data characters. Is it better (more efficient/faster) to use a string
> iterator or the ".at(pos)" function to do this type of work? TIA

The problem is not iterating/indexing but the complexity of
delete/insert operations, which for std::string are O(n). You might want
to look at my container neolib::segmented_array instead.
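
To illustrate the point (neolib::segmented_array itself is not shown
here), a minimal sketch of the difference between deleting characters
one at a time from a std::string and doing one linear pass; the '\r'
rule is just a placeholder for whatever characters actually get deleted:

#include <algorithm>
#include <string>

// Quadratic: every erase() shifts the whole tail of the string.
void strip_slow(std::string& s)
{
    for (std::size_t i = 0; i < s.size(); )
    {
        if (s[i] == '\r')      // example character to delete
            s.erase(i, 1);     // O(n) shift per deletion
        else
            ++i;
    }
}

// Linear: one pass with the erase-remove idiom.
void strip_fast(std::string& s)
{
    s.erase(std::remove(s.begin(), s.end(), '\r'), s.end());
}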

/Flibble

Christopher Pisz

Sep 9, 2015, 6:31:07 PM
Without any knowledge of what said records look like, how they can be
broken up, etc., I doubt much advice could be given.

Given a string of n characters, iterating from begin to character x is
exactly the same complexity as calling std::string::at(x) afaik.

Deleting and replacing can get a little more hairy as to whether you
want to use an index or an iterator, but you must be careful with both,
because they will no longer point to where you think they do after a delete.
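
A minimal sketch of the safe pattern when erasing while iterating:
always continue from the iterator that erase() returns. The punctuation
rule here is only an example, not the OP's actual criterion.

#include <cctype>
#include <string>

void drop_punctuation(std::string& s)
{
    for (auto it = s.begin(); it != s.end(); )
    {
        if (std::ispunct(static_cast<unsigned char>(*it)))
            it = s.erase(it);   // erase() returns the next valid position
        else
            ++it;
    }
}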

I would wonder though, in your description "given files of several
thousand records" why you would be playing with string at all. Storing
files in records is a common enough task that there are tons of third
party libraries out there for it that will probably do a better job of
it than one could do on their own. XML, JSON, binary serializers,
they're all out there.

--
I have chosen to troll filter/ignore all subthreads containing the
words: "Rick C. Hodgins", "Flibble", and "Islam"
So, I won't be able to see or respond to any such messages
---

Jorgen Grahn

Sep 9, 2015, 6:34:26 PM
On Wed, 2015-09-09, MikeCopeland wrote:
> Given files of several thousand records, each up to 1400 characters,
> scanning every character, deleting many and replacing some with other
> data characters. Is it better (more efficient/faster)

Your first focus should be on clarity and maintainability. You may
find that the work takes a few milliseconds no matter how you do it.

> to use a string
> iterator or the ".at(pos)" function to do this type of work? TIA

Depends entirely on what you're doing to the strings. Perhaps some
third technique is the best, e.g. regular expressions.

Of the two you mention I prefer iterators (or const char* used the
same way) because I'm so used to that idiom, and because they mix well
with the standard algorithms.
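
For illustration, a sketch of both styles; the '\r' and '\t' rules and
the squeeze_blanks name are placeholders, not the OP's real requirements:

#include <algorithm>
#include <regex>
#include <string>

// Iterator style: mixes well with the standard algorithms.
void normalize(std::string& s)
{
    s.erase(std::remove(s.begin(), s.end(), '\r'), s.end());  // delete
    std::replace(s.begin(), s.end(), '\t', ' ');              // replace
}

// Regex style: collapse runs of blanks to a single space.
std::string squeeze_blanks(const std::string& s)
{
    static const std::regex ws("[ \t]+");
    return std::regex_replace(s, ws, " ");
}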

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

bartekltg

Sep 9, 2015, 7:09:57 PM
On 10.09.2015 00:23, Mr Flibble wrote:
> On 09/09/2015 23:15, MikeCopeland wrote:
>> Given files of several thousand records, each up to 1400 characters,
>> scanning every character, deleting many and replacing some with other
>> data characters. Is it better (more efficient/faster) to use a string
>> iterator or the ".at(pos)" function to do this type of work? TIA

I agree with the others; to give a reasonable answer we need
to know more.

> The problem is not iterating/indexing but the complexity of
> delete/insert operations which for std::string is O(n). You might want
> to look at my container neolib::segmented_array instead.

Isn't rope from the 'almost standard' SGI extensions a similar container?
#include <ext/rope>
https://www.sgi.com/tech/stl/Rope.html
It is probably already installed on the OP's computer.
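
An untested sketch, only for compilers that still ship the SGI
extensions (e.g. GCC's <ext/rope>); the positions and inserted text are
made up:

#include <ext/rope>

int main()
{
    __gnu_cxx::crope r("a fairly long record ...");
    r.erase(2, 7);        // delete 7 characters at position 2
    r.insert(2, "XX");    // insert without shifting the whole tail
    const char* flat = r.c_str();
    (void)flat;
}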

bartekltg

MikeCopeland

Sep 9, 2015, 8:09:37 PM
In article <msqbrh$tvt$1...@dont-email.me>, nos...@notanaddress.com says...
>
> > Given files of several thousand records, each up to 1400 characters,
> > scanning every character, deleting many and replacing some with other
> > data characters. Is it better (more efficient/faster) to use a string
> > iterator or the ".at(pos)" function to do this type of work? TIA
>
> Without any knowledge of what said records look like, how they can be
> broken up, etc. I doubt much advice could be given.
>
> Deleting and replacing can get a little more hairy as to whether you
> want to use an index or an iterator, but you must be careful with both,
> because they will no longer point to where you think they do after a delete.

Understood.

> I would wonder though, in your description "given files of several
> thousand records" why you would be playing with string at all. Storing
> files in records is a common enough task that there are tons of third
> party libraries out there for it that will probably do a better job of
> it then one could do on their own. XML, Json, Binary serializers,
> they're all out there.

These are text data files I'm given (to manipulate and gather data
from). However, I must normalize the data record for parsing and
converting activities, and I can't control the input formatting. 8<{{

Öö Tiib

Sep 10, 2015, 12:27:42 AM
On Thursday, 10 September 2015 01:15:56 UTC+3, MikeCopeland wrote:
> Given files of several thousand records, each up to 1400 characters,
> scanning every character, deleting many and replacing some with other
> data characters. Is it better (more efficient/faster) to use a string
> iterator or the ".at(pos)" function to do this type of work? TIA

If the result of your scan is always either shorter than or the same
length as the input, and you need to scan only once, then a destructive
parse might be fastest. Basically you scan by iterating over the input
buffer with one iterator while at the same time building the result in
the same buffer with another iterator.
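
A minimal sketch of that two-iterator, in-place pass; the
control-character and tab rules are placeholders for the real ones:

#include <cctype>
#include <string>

void compact(std::string& s)
{
    auto out = s.begin();
    for (auto in = s.begin(); in != s.end(); ++in)
    {
        unsigned char c = static_cast<unsigned char>(*in);
        if (std::iscntrl(c))           // delete: skip the character
            continue;
        if (c == '\t')                 // replace
            c = ' ';
        *out++ = static_cast<char>(c);
    }
    s.erase(out, s.end());             // chop off the unused tail
}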

Marcel Mueller

Sep 10, 2015, 5:25:08 AM
On 10.09.15 00.15, MikeCopeland wrote:
> Given files of several thousand records, each up to 1400 characters,
> scanning every character, deleting many and replacing some with other
> data characters. Is it better (more efficient/faster) to use a string
> iterator or the ".at(pos)" function to do this type of work? TIA

Any string or vector will end up with a complexity of O(n²), which is
evil for large files.

You could reduce this to O(n log n) by using segmented containers. But
that is still on the order of sorting the entire file content.

If you are looking for a /fast/ solution you need to implement a stream
processor.
I.e. a buffered source stream reads blocks of the original file, and a
buffered output stream writes to the destination file. The stream
processor in the middle forwards all unchanged characters from the
source stream to the destination stream, skips the characters to delete
and replaces the others. This is an O(n) operation as long as the
context required to decide which characters to keep, delete or replace
has a limited, constant size (probably 1400 in your case).
This implementation never keeps the large file in memory at all, which
might be an important property for files several GB in size. I.e. it is
O(1) with respect to memory usage.
If memory does not matter, you could read the entire source file into a
buffer and process it in the same way into a destination buffer, which
might even be the same buffer since you did not mention inserts.
However, this is likely to be slightly slower on real-life hardware,
because the larger working set of your process probably impacts memory
cache efficiency.
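
A minimal sketch of such a stream processor, assuming the
keep/delete/replace decision needs no context beyond the current
character; the file names and the '\r'/'\t' rules are placeholders:

#include <fstream>
#include <vector>

int main()
{
    std::ifstream in("input.txt", std::ios::binary);
    std::ofstream out("output.txt", std::ios::binary);
    std::vector<char> buf(4 << 20);               // 4 MB block buffer

    while (in)
    {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in.gcount();
        std::streamsize kept = 0;
        for (std::streamsize i = 0; i < got; ++i)
        {
            char c = buf[i];
            if (c == '\r') continue;              // delete
            if (c == '\t') c = ' ';               // replace
            buf[kept++] = c;
        }
        out.write(buf.data(), kept);
    }
}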

Another question is which stream buffers to use. Although the standard
buffered iostreams probably fit your needs, I have seen many really slow
iostream implementations; you have to test whether your target platform
is affected here. If the standard implementation is not too bad and the
algorithm for choosing characters to skip or replace is not too complex,
you will likely be I/O bound.
Make the stream buffers large enough to keep the I/O subsystem
efficient. A few MB are a good choice to avoid latency problems even on
rotational disks.


Marcel

Jorgen Grahn

Sep 10, 2015, 7:54:01 AM
On Thu, 2015-09-10, MikeCopeland wrote:
> In article <msqbrh$tvt$1...@dont-email.me>, nos...@notanaddress.com says...
...

>> I would wonder though, in your description "given files of several
>> thousand records" why you would be playing with string at all. Storing
>> files in records is a common enough task that there are tons of third
>> party libraries out there for it that will probably do a better job of
>> it then one could do on their own. XML, Json, Binary serializers,
>> they're all out there.
>
> These are text data files I'm given (to manipulate and gather data
> from). However, I must normalize the data record for parsing and
> converting activities, and I can't control the input formatting. 8<{{

Yes -- and it's not just you; it's a common situation.

Also, given a choice, some of us want to stay away from XML and JSON
(not to mention binary formats). If it's possible to define the
language so that it's easily parsed by Awk or Perl, and so that it fits
well in a Unix pipeline[0], that's what I do.

(On the other hand, that often makes Perl a better choice than C++ for
manipulating the data.)

/Jorgen

[0] Things like
zcat foo.gz | perl -pe '...' | uniq -c | sort -nr

mark

Sep 10, 2015, 8:14:15 AM
On 2015-09-10 00:15, MikeCopeland wrote:
> Given files of several thousand records, each up to 1400 characters,
> scanning every character, deleting many and replacing some with other
> data characters. Is it better (more efficient/faster) to use a string
> iterator or the ".at(pos)" function to do this type of work? TIA

The iterator will usually be a bit faster. There is typically an extra
pointer indirection when at() is used.

E.g., disassembly for loop incrementing each character in the string:

Visual C++ 2015 x64

--- Iterator -----------------------------------------------------
<+0x1830> inc byte ptr [rcx]
<+0x1832> lea rcx,[rcx+1]
<+0x1836> inc rbx
<+0x1839> cmp rbx,rdx
<+0x183c> jne 0x1830
------------------------------------------------------------------

--- .at() --------------------------------------------------------
<+0x17a0> cmp qword ptr [rsp+38h],10h
<+0x17a6> lea rax,[rsp+20h]
<+0x17ab> cmovae rax,qword ptr [rsp+20h]
<+0x17b1> inc byte ptr [rax+rbx]
<+0x17b4> inc rbx
<+0x17b7> cmp rbx,rcx
<+0x17ba> jb 0x17a0
------------------------------------------------------------------

GCC 5.2 x64

--- Iterator -----------------------------------------------------
<+0x1880> add byte ptr [rax],1
<+0x1883> add rax,1
<+0x1887> cmp rcx,rax
<+0x188a> jne 0x1880
------------------------------------------------------------------

--- .at() --------------------------------------------------------
<+0x1871> mov rdx,rax
<+0x1874> add rdx,qword ptr [rsp+30h]
<+0x1879> add rax,1
<+0x187d> add byte ptr [rdx],1
<+0x1880> cmp rcx,rax
<+0x1883> ja 0x1871
------------------------------------------------------------------

Don't do anything that causes data blocks in the string to be moved
around (e.g. by deleting characters). Either do destructive parsing and
keep an output pointer/iterator to the current string or append to a new
output string (preallocate/reserve the output string).
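
A sketch of the second variant: build a fresh, reserved output string
instead of editing in place. The '\r'/'\t' rules are placeholders.

#include <string>

std::string filter_copy(const std::string& in)
{
    std::string out;
    out.reserve(in.size());        // result is at most as long as the input
    for (char c : in)
    {
        if (c == '\r') continue;   // delete
        if (c == '\t') c = ' ';    // replace
        out += c;
    }
    return out;
}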

Louis Krupp

Sep 10, 2015, 9:54:56 AM
On Thu, 10 Sep 2015 11:24:48 +0200, Marcel Mueller
<news.5...@spamgourmet.org> wrote:

<snip>
>If you are looking for a /fast/ solution you need to implement a stream
>processor.
>I.e. a buffered source stream reads blocks of the original stream, a
>buffered output stream writes to the destination file
<snip>

A convenient approach would be to use mmap() to map a block of
virtual memory to the file and then step through the file as one would
step through an array. That makes the coding easier; has anyone had
any experience comparing performance with reading or writing files?
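
A minimal POSIX-only sketch of that approach (error handling trimmed,
file name and the line-counting scan made up for illustration):

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("input.txt", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    const char* p = static_cast<const char*>(addr);

    long lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)   // step through it like an array
        if (p[i] == '\n') ++lines;

    std::printf("%ld lines\n", lines);
    munmap(addr, st.st_size);
    close(fd);
}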

Louis

mark

Sep 10, 2015, 10:27:33 AM
On 2015-09-10 15:54, Louis Krupp wrote:
> A convenient approach would be to use mmap() to map a block of
> virtual memory to the file and then step through the file as one would
> step through an array. That makes the coding easier; has anyone had
> any experience comparing performance with reading or writing files?

If you do things properly, there is normally little difference. But this
is platform-dependent and you need to set the right flags for your
access pattern (fadvise/madvise).

On Windows, mmap is slower for sequential access (the madvise equivalent
is missing unless you have >= Win8).

On 32-bit systems, contiguous address space is a limited resource and
mmap isn't going to be simpler for large files.
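
For reference, the POSIX hints look roughly like this; behaviour is
platform-dependent and the function name is just for illustration:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>

void hint_sequential(int fd, void* mapped, std::size_t length)
{
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);   // for read()-style I/O
    madvise(mapped, length, MADV_SEQUENTIAL);         // for an mmap()ed region
}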

Christian Gollwitzer

Sep 10, 2015, 1:17:23 PM
Am 10.09.15 um 02:09 schrieb MikeCopeland:
> In article <msqbrh$tvt$1...@dont-email.me>, nos...@notanaddress.com says...
> I would wonder though, in your description "given files of several
>> thousand records" why you would be playing with string at all. Storing
>> files in records is a common enough task that there are tons of third
>> party libraries out there for it that will probably do a better job of
>> it then one could do on their own. XML, Json, Binary serializers,
>> they're all out there.
>
> These are text data files I'm given (to manipulate and gather data
> from). However, I must normalize the data record for parsing and
> converting activities, and I can't control the input formatting. 8<{{

Understood. Still my advice would be to suck it into a database engine,
like sqlite, and then execute queries instead of lengthy programs. This
has the additional advantage that you can keep the db file around for
ultrafast loading/querying if you happen to work on the same data again.
Many questions you asked in the past would be trivial to answer from a
relational database (joins, multiple indexes, etc.).
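
A minimal sqlite3 C API sketch of that workflow; the database, table and
column names, and the LIKE query, are invented for illustration:

#include <cstdio>
#include <sqlite3.h>

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("records.db", &db) != SQLITE_OK)
        return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS rec(id INTEGER PRIMARY KEY, body TEXT);",
        nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT count(*) FROM rec WHERE body LIKE '%foo%';",
        -1, &stmt, nullptr);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        std::printf("%d matching records\n", sqlite3_column_int(stmt, 0));

    sqlite3_finalize(stmt);
    sqlite3_close(db);
}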

Christian

Jorgen Grahn

Sep 11, 2015, 7:03:25 AM
Plus you cannot operate on streams which can't be mmap()ed. No
pipelines, no reading from stdin[0], no string streams for unit
testing ... I've considered using mmap() many times, but the drawbacks
have always put me off.

/Jorgen

[0] As a Unix guy, I really hate using software which cannot be
used in pipelines. It's a major usability problem.

Jorgen Grahn

Sep 11, 2015, 10:06:19 AM
On Thu, 2015-09-10, Christian Gollwitzer wrote:
> Am 10.09.15 um 02:09 schrieb MikeCopeland:
>> In article <msqbrh$tvt$1...@dont-email.me>, nos...@notanaddress.com says...
>> I would wonder though, in your description "given files of several
>>> thousand records" why you would be playing with string at all. Storing
>>> files in records is a common enough task that there are tons of third
>>> party libraries out there for it that will probably do a better job of
>>> it then one could do on their own. XML, Json, Binary serializers,
>>> they're all out there.
>>
>> These are text data files I'm given (to manipulate and gather data
>> from). However, I must normalize the data record for parsing and
>> converting activities, and I can't control the input formatting. 8<{{
>
> Understood. Still my advice would be to suck it into a database engine,
> like sqlite, and then execute queries instead of lengthy programs.

This is, I think, an area where there's a difference between "database
people" and "text file people". There's a lot of data processing you
can do very quickly and conveniently without SQL, even for large data
sets.

But yes, I acknowledge that someone can be very productive with
database queries. I haven't /met/ anyone like that, but that's
probably because they are rare in my immediate environment.

What's right for MikeCopeland, I cannot tell. I don't recall what
kinds of problems he's trying to solve, or if he has access to Unix
tools for text processing, and so on.

/Jorgen

mark

Sep 11, 2015, 10:19:28 AM
On 2015-09-10 19:17, Christian Gollwitzer wrote:
>
> Understood. Still my advice would be to suck it into a database engine,
> like sqlite, and then execute queries instead of lengthy programs. This
> has the additional advantage that you can keep the db file around for
> ultrafast loading/querying if you happen to work on the same data again.
> Many questions you asked in te past would be trivial to do from a
> relational database (joins, multiple index etc.)

If he can afford the performance hit from stuffing the data into a
database, his initial question was utterly pointless.

As long as the data fits into RAM, it's extremely easy to be several
times faster than a database engine like sqlite.

Öö Tiib

Sep 11, 2015, 12:16:42 PM
On Friday, 11 September 2015 17:06:19 UTC+3, Jorgen Grahn wrote:
> On Thu, 2015-09-10, Christian Gollwitzer wrote:
> > Am 10.09.15 um 02:09 schrieb MikeCopeland:
> >> In article <msqbrh$tvt$1...@dont-email.me>, nos...@notanaddress.com says...
> >> I would wonder though, in your description "given files of several
> >>> thousand records" why you would be playing with string at all. Storing
> >>> files in records is a common enough task that there are tons of third
> >>> party libraries out there for it that will probably do a better job of
> >>> it then one could do on their own. XML, Json, Binary serializers,
> >>> they're all out there.
> >>
> >> These are text data files I'm given (to manipulate and gather data
> >> from). However, I must normalize the data record for parsing and
> >> converting activities, and I can't control the input formatting. 8<{{
> >
> > Understood. Still my advice would be to suck it into a database engine,
> > like sqlite, and then execute queries instead of lengthy programs.
>
> This is, I think, an area where there's a difference between "database
> people" and "text file people". There's a lot of data processing you
> can do very quickly and conveniently without SQL, even for large data
> sets.

Perl is typically the tool of the text-file people. One can win a
particular query with C or C++. If the complexity of the queries we need
to run seems to grow gradually, then it is a good time to filter the
data into some database in a single pass; suddenly the flexibility of
what can be done with it grows enormously.

>
> But yes, I acknowledge that someone can be very productive with
> database queries. I haven't /met/ anyone like that, but that's
> probably because they are rare in my immediate environment.

There can be differences of several orders of magnitude. I have seen a
Python script run queries of the described complexity against an SQL
database in 0.05 seconds, so a web page serviced by that Python script
gives the appearance of real-time processing.


> What's right for MikeCopeland, I cannot tell. I don't recall what
> kinds of problems he's trying to solve, or if he has access to Unix
> tools for text processing, and so on.

Mike has always somehow managed to make his programs slow, and he
resists using a profiler. He uses Visual Studio on Windows.
See his "Building a Large Container"
http://www.thecodingforums.com/threads/building-a-large-container.966741/