CString to CStringA conversion and other questions :)


Oto BREZINA

unread,
Aug 19, 2012, 10:19:25 AM8/19/12
to d...@tortoisesvn.tigris.org
Finally I have some spare time, so I'm starting on some work on T-Merge.

1.
Do you know if "CStringA sLine = CStringA(sLineT)" internally uses WideCharToMultiByte?
Which conversion is used for "CStringA(sLineT)" here?

2.
I don't use the STL much. What is your preferred container for a BYTE array? According to some websites, the candidates are:
CStringA - has a count, operator [], copy-on-write, but may be misleading to use
vector - has a count and []
unique_ptr - has [], but lacks a count

3.
Throughout the code I have seen a lot of *ptr++ based algorithms (sometimes in ways that feel unnatural to me). Are those quicker than ptr[]-based ones?
According to a quite old optimization guide, ptr[] could be faster because of fewer increment instructions and simpler cache management, but that guide was from around '97.
Do you have any real performance tests/data? I tried to run my own, but I was unable to start the performance tests as I'm not an admin ... will try again later.
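For what it's worth, this kind of question is easy to measure directly. A minimal sketch (function names are mine, not from the TMerge sources) that walks a byte buffer both ways and times it with std::chrono:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Sum a buffer with pointer increments.
std::uint64_t sum_ptr(const unsigned char* p, std::size_t n)
{
    std::uint64_t s = 0;
    const unsigned char* end = p + n;
    while (p != end)
        s += *p++;          // pointer-increment style
    return s;
}

// Same sum with indexing; both should produce identical results.
std::uint64_t sum_index(const unsigned char* p, std::size_t n)
{
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += p[i];          // index style
    return s;
}

// Wall-clock timing helper; on a modern optimizer both loops often
// compile to near-identical code, so differences tend to be noise.
template <typename F>
double time_ms(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Run each variant over a large buffer several times and compare the medians; a single run is mostly scheduling noise.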

4.
I would like to write filter classes for ASCII, UTF8, UTF16BE, UTF16LE and add UTF32 reads/writes. Do you have/know any preferred interface/template for that job?

5.
Do you have any specific reason not to support UTF32, or are there just too few use cases?
Do you have any reason not to support other EOLs? According to http://en.wikipedia.org/wiki/Newline only NEL seems questionable.
--
Oto ot(ik) BREZINA - 오토

Stefan Küng

unread,
Aug 19, 2012, 2:33:03 PM8/19/12
to d...@tortoisesvn.tigris.org
On 19.08.2012 16:19, Oto BREZINA wrote:
> Finally I have some spare time so I start with some work on T-Merge
>
> 1.
> Do you know if "CStringA sLine = CStringA(sLineT)" is internally using
> *WideCharToMultiByte*
> <http://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx>?
> What is for "CStringA(sLineT)" conversion here?

Yes, it does, via some C-runtime functions.
But that conversion is utf16 to ansi, not utf8.
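To illustrate the difference (a portable sketch, not the actual CStringA internals): the same character produces different bytes under an ANSI codepage such as Latin-1 than under utf8.

```cpp
#include <cassert>
#include <string>

// U+00E9 ('é') is one byte 0xE9 in Latin-1/Windows-1252 ("ansi" on
// many Western systems), but two bytes 0xC3 0xA9 in utf8.
std::string to_latin1(char32_t cp)
{
    // Only valid for code points <= 0xFF.
    return std::string(1, static_cast<char>(cp));
}

std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {  // enough for the BMP examples here
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

So a file loaded and saved through two different narrow conversions will not round-trip byte-for-byte once it contains non-ASCII characters.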

why do you ask?

> 2.
> I don't use STL much what is your preferred container for BYTE array,
> according some webs candidates are:
> CStringA - have count, operator [], copy on write, but may be
> missleading of use
> vector - have count and []
> unique_ptr - have [], but lack count

It depends on your use case.
For example, if you don't need [] access but only iteration, use a deque instead of a vector - especially for big arrays.
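A hedged sketch of that trade-off (illustrative only): both containers carry their size and support operator[], but only vector guarantees one contiguous block, which matters whenever a raw BYTE* is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

using byte = unsigned char;

// vector<byte>: contiguous, so v.data() yields a raw byte* for C APIs.
std::size_t fill_vector(std::vector<byte>& v)
{
    v.assign(1024, 0xFF);
    return v.size();
}

// deque<byte>: chunked storage, cheap growth at both ends, but no
// single contiguous block (and therefore no data() member).
std::size_t fill_deque(std::deque<byte>& d)
{
    d.assign(1024, 0xFF);
    return d.size();
}
```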

> 3.
> Thru code I have seen lot of *ptr++ based algos (sometimes in unnatural
> way for me), are those quicker then based on ptr[]?
> According some quite old optimalisation guide ptr[] could be faster
> becouse of less increment instructions, and simpler cache management,
> however it was in about '97.
> Do you have any real performance tests/data - I tried to run my own, but
> I was unable to start performance tests as I'm not admin ... will try
> again later.

Pointer increments can be faster than index-based access, usually when using std containers.

> 4.
> I would like to write filters classes for ASCII, UTF8, UTF16BE, UTF16LE
> and add UTF32s reads/writes. Do you have/know any preferred
> interface/template for that job?

Why? Such filter classes would be good for streams, but we don't use streams in TMerge; we load the files completely in one go.

> 5.
> Have you any specific reason to not support UTF32, or just too small use
> cases.

Is there even a tool/app/whatever that writes such files?
I've never seen such a file myself.
So why implement something that won't be used?

> Have you any reason to not support other EOLs? According
> http://en.wikipedia.org/wiki/Newline only NEL seems be questionable.

actually, yes: the svn diff library doesn't support them, so supporting
them in TMerge makes no sense. We could split the lines there, but the
diffing engine would treat those as one line and so the diff would be
shown wrong.

Stefan

--
___
oo // \\ "De Chelonian Mobile"
(_,\/ \_/ \ TortoiseSVN
\ \_/_\_/> The coolest Interface to (Sub)Version Control
/_/ \_\ http://tortoisesvn.net

------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999453


Oto BREZINA

unread,
Aug 19, 2012, 3:15:24 PM8/19/12
to d...@tortoisesvn.tigris.org
Thanks for the prompt answers ...

On 2012-08-19 20:33, Stefan Küng wrote:
> On 19.08.2012 16:19, Oto BREZINA wrote:
>> Finally I have some spare time so I start with some work on T-Merge
>>
>> 1.
>> Do you know if "CStringA sLine = CStringA(sLineT)" is internally using
>> *WideCharToMultiByte*
>> <http://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx>?
>> What is for "CStringA(sLineT)" conversion here?
> yes, it does, via some c-runtime functions.
> But this conversion is utf16 to ansi, not utf8.
>
> why do you ask?
I just want to be sure that Load and Save are paired and that a loaded and then saved file has the same content. I'll try some test cases to make sure.
For load this is used:
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, (LPCSTR)pFileBuf,
dwReadBytes, pWideBuf, ret)
And for save:
CStringA sLine = sLineT;
These do not look the same ...
Which one is better? I guess with some persistent buffer (as is used for the save in UTF16BE), WideCharToMultiByte should be better/faster ...
>
>> 2.
>> I don't use STL much what is your preferred container for BYTE array,
>> according some webs candidates are:
>> CStringA - have count, operator [], copy on write, but may be
>> missleading of use
>> vector - have count and []
>> unique_ptr - have [], but lack count
> it depends on your use case.
> for example, if you don't need [] access but only iteration, use a deque
> instead of a vector - especially for big arrays.
I need [] or iteration, AND .get() or &[0], AND a count. I'll check deque.
>> 3.
>> Thru code I have seen lot of *ptr++ based algos (sometimes in unnatural
>> way for me), are those quicker then based on ptr[]?
>> According some quite old optimalisation guide ptr[] could be faster
>> becouse of less increment instructions, and simpler cache management,
>> however it was in about '97.
>> Do you have any real performance tests/data - I tried to run my own, but
>> I was unable to start performance tests as I'm not admin ... will try
>> again later.
> ptr incrementations can be faster than index based access, usually when
> using std containers.
What about char * and wchar_t * ?
Like in CheckUnicodeType, or the "fill in the lines into the array" part of Load. The data can be quite big.
The most important factors are the access instruction (*ptr vs ptr[i]) and the number of cache misses. But the easiest way is to check that with big data.
>> 4.
>> I would like to write filters classes for ASCII, UTF8, UTF16BE, UTF16LE
>> and add UTF32s reads/writes. Do you have/know any preferred
>> interface/template for that job?
> why? such filter classes would be good for streams, but we don't use
> streams in TMerge but load the files completely in one go.
If you check my last few commits, there are four save encodings which share the same pattern/algorithm but differ only in the encoding of one "line".
At first I thought about a functor, but classes seem more readable.

The most important part here will be the buffer (question 2), which will be part of the object. This will allow reducing allocations.
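One possible shape for such filter classes (all names here are invented for illustration; this is not the TMerge code): a base class owns the reusable buffer, and each encoding only overrides the per-line step.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical base: owns a reusable output buffer so that repeated
// Encode() calls reuse its capacity instead of reallocating per line.
class CBaseFilter
{
public:
    virtual ~CBaseFilter() = default;

    const std::vector<unsigned char>& Encode(const std::wstring& line)
    {
        m_buf.clear();          // keeps capacity, drops contents
        DoEncode(line, m_buf);
        return m_buf;
    }

protected:
    // Each encoding (ASCII, UTF8, UTF16LE/BE, UTF32...) overrides this.
    virtual void DoEncode(const std::wstring& line,
                          std::vector<unsigned char>& out) = 0;

private:
    std::vector<unsigned char> m_buf;
};

// Trivial example subclass: ASCII (just truncates each wchar_t).
class CAsciiFilter : public CBaseFilter
{
protected:
    void DoEncode(const std::wstring& line,
                  std::vector<unsigned char>& out) override
    {
        for (wchar_t c : line)
            out.push_back(static_cast<unsigned char>(c & 0x7F));
    }
};
```

The save loop then holds one filter object for the whole file, so the buffer's allocation cost is paid once rather than per line.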
>> 5.
>> Have you any specific reason to not support UTF32, or just too small use
>> cases.
> Is there even a tool/app/whatever that writes such files?
> I've never seen such a file myself.
> So why implement something that won't be used?
I quite agree. If this is the only reason, I would still implement it. I guess this format is really rare, and if used at all, then on Linux. But it makes me feel that the application is unfinished whenever I read that UTF32 is treated as binary ...
Adding support for UTF32 means writing a load filter and a write filter (x2 for BE, LE) ...
should be easy ...
Could be a good new feature for 1.8.
>> Have you any reason to not support other EOLs? According
>> http://en.wikipedia.org/wiki/Newline only NEL seems be questionable.
> actually, yes: the svn diff library doesn't support them, so supporting
> them in TMerge makes no sense. We could split the lines there, but the
> diffing engine would treat those as one line and so the diff would be
> shown wrong.
OK, sounds reasonable.

There is as much use for those EOLs as for UTF32, so no big deal, but:
if I understand correctly, the diff is made on temp files in UTF8 format. In case other EOLs are used, we could convert them to EOL_AUTOLINE and then make the diff.

This leads me to another question:
6.
When saving with enforced UTF8, the BOM was not saved for anything but UTF8BOM - was this intentional? So the BOM can be missing for all UTF8-enforced files. Correct? This was implemented in r23192.

7.
If I read the code correctly, in CFileTextLines::CheckLineEndings you detect EOL_LFCR, but in CFileTextLines::Load this one is decoded as EOL_LF and EOL_CR, or as EOL_CRLF.
Is this intentional?
I guess EOL_LFCR is too rare to be really wanted, but why detect it in Check then? This makes all EOL_AUTOLINE EOL_LFCR.
>
> Stefan
>


--
Oto ot(ik) BREZINA - 오토

------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999456

Stefan Küng

unread,
Aug 20, 2012, 12:44:48 PM8/20/12
to d...@tortoisesvn.tigris.org
On 19.08.2012 21:15, Oto BREZINA wrote:
> Thanks for prompt answers ...
>
> On 2012-08-19 20:33, Stefan Küng wrote:
>> On 19.08.2012 16:19, Oto BREZINA wrote:
>>> Finally I have some spare time so I start with some work on T-Merge
>>>
>>> 1.
>>> Do you know if "CStringA sLine = CStringA(sLineT)" is internally using
>>> *WideCharToMultiByte*
>>> <http://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx>?
>>> What is for "CStringA(sLineT)" conversion here?
>> yes, it does, via some c-runtime functions.
>> But this conversion is utf16 to ansi, not utf8.
>>
>> why do you ask?
> Just want be sure Load and Save are in pair and loaded and saved file
> have same content. I'll try some test cases to make sure.
> For load is used:
> MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, (LPCSTR)pFileBuf,
> dwReadBytes, pWideBuf, ret)
> And for save:
> CStringA sLine = sLineT;
> Does not look same. ...
> Which one is better? I guess with some persistant buffer(as is for save
> in UTF16BE) WideCharToMultiByte should be better/faster ...

We use the CStringA conversion when saving because we save line-by-line.
We use MultiByteToWideChar when loading because we're converting the whole file content in one go (using CString here would mean first creating a copy of the content).
I'm wondering here: why do you want to change that part of the code?
Does it not work? Is it too slow?

>>> 4.
>>> I would like to write filters classes for ASCII, UTF8, UTF16BE, UTF16LE
>>> and add UTF32s reads/writes. Do you have/know any preferred
>>> interface/template for that job?
>> why? such filter classes would be good for streams, but we don't use
>> streams in TMerge but load the files completely in one go.
> If you check my last few commits there are four save encodings, which
> share same pattern/algo, but differ in one "line" encoding itself.
> First I thought about functor (?), but classes seems be more readable.
>
> Most important part here will be buffer (question 2) which will be part
> of object. This will allow reduce allocations.




>>> 5.
>>> Have you any specific reason to not support UTF32, or just too small use
>>> cases.
>> Is there even a tool/app/whatever that writes such files?
>> I've never seen such a file myself.
>> So why implement something that won't be used?
> Quite agree. If this is only reason I would implemented that. I guess
> this format is really rare. And if used then on Linux. But it makes me
> feel, that application is unfinished whenever I read, that UTF32 is
> thread as binary ...
> To add support for UTF32 is write load and write filter (x2 BE, LE) ...
> should be easy ...
> Can be good new feature for 1.8.

Not really: while the svn diff lib doesn't even support utf16, it doesn't break on those files either, because it skips over the null bytes. But utf32 wouldn't work - too many null bytes in normal text.
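A quick way to see the problem (my sketch): every ASCII character in UTF-32 occupies four bytes, three of them zero, so a byte-oriented diff engine sees mostly null bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Count the zero bytes produced when ASCII text is encoded as UTF-32.
// ASCII maps 1:1 to code points, so each character becomes a 32-bit
// word with exactly three zero bytes (regardless of endianness).
std::size_t null_bytes_in_utf32(const std::string& ascii)
{
    std::size_t nulls = 0;
    for (unsigned char c : ascii) {
        std::uint32_t cp = c;
        const unsigned char* b = reinterpret_cast<const unsigned char*>(&cp);
        for (int i = 0; i < 4; ++i)
            if (b[i] == 0)
                ++nulls;
    }
    return nulls;
}
```

For UTF-16 only one byte in two is zero, which a null-skipping diff can tolerate; at three in four the text is effectively binary.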

>>> Have you any reason to not support other EOLs? According
>>> http://en.wikipedia.org/wiki/Newline only NEL seems be questionable.
>> actually, yes: the svn diff library doesn't support them, so supporting
>> them in TMerge makes no sense. We could split the lines there, but the
>> diffing engine would treat those as one line and so the diff would be
>> shown wrong.
> Ok, sounds reasonable.
>
> There is as much use for thoses EOLS as for UTF32, so no big deal, but:
> If I get that correctly, diff is made on temp files in UTF8 format. In
> case other EOLs is used we can convert them to EOL_AUTOLINE, and make diff.

That would work for *showing* the diff. But when saving edited content,
you would save the converted EOLs.

> This lead me to other question:
> 6.
> When saving to enforced UTF8 for all but UTF8BOM BOM iwas not saved -
> was this intentional? Can be BOM missing for all UTF8 enforced files.
> Correct? This was implemented in r23192.

There's an option in the settings to save files as utf8 even if they're detected as ANSI.
You're now saving those with a BOM, which isn't what we did before. In that case you should write the file without a BOM (always without a BOM if possible; only if the file had one when loading do we write the BOM too).
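That rule can be stated as a tiny predicate (helper and field names are mine, for illustration): whether to write a BOM follows the loaded file, not the forced-utf8 setting.

```cpp
#include <cassert>
#include <string>

// Sketch of the save-time BOM decision described above.
struct SaveOptions
{
    bool hadBomOnLoad;   // detected while loading the file
    bool forceUtf8;      // the "save as utf8 even if ANSI" setting
};

std::string file_prefix(const SaveOptions& opt)
{
    // Forcing utf8 alone must not introduce a BOM; only a file that
    // already had one gets it written back.
    if (opt.hadBomOnLoad)
        return std::string("\xEF\xBB\xBF");
    return std::string();
}
```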


>
> 7.
> If I read code correctly On CFileTextLines::CheckLineEndings you detect
> EOL_LFCR, but in CFileTextLines::Load this is one is decoded as EOL_LF
> and EOL_CR or EOL_CRLF.
> Is this intentional?
> I guess EOL_LFCR is too rare to be really wanted, but why to detect it
> in Check then. This makes all EOL_AUTOLINE EOL_LFCR.

I don't understand what you mean here.
In Load(), the line endings are checked by calling CheckLineEndings(),
there's no separate detection.

Stefan


------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999662

Oto BREZINA

unread,
Aug 20, 2012, 2:40:49 PM8/20/12
to d...@tortoisesvn.tigris.org

On 2012-08-20 18:44, Stefan Küng wrote:
>>>> 3.
>>>> Thru code I have seen lot of *ptr++ based algos (sometimes in unnatural
>>>> way for me), are those quicker then based on ptr[]?
>>>> According some quite old optimalisation guide ptr[] could be faster
>>>> becouse of less increment instructions, and simpler cache management,
>>>> however it was in about '97.
>>>> Do you have any real performance tests/data - I tried to run my own, but
>>>> I was unable to start performance tests as I'm not admin ... will try
>>>> again later.
>>> ptr incrementations can be faster than index based access, usually when
>>> using std containers.
>> What about char * and wchar_t * ?
>> like in CheckUnicodeType, of Load "fill in the lines into the array"
>> part. Data can be quite big.
>> Most important part is access instruction *ptr vs ptr[i] and number of
>> cache miss. But easiest way is to check that for big data.
> I'm wondering here: why do you want to change that part of the code?
> Does it not work? Is it too slow?
I'm NOT about to rewrite CheckUnicodeType; I was just wondering, when I read it, whether *ptr++ is faster.
In fact I was a little curious - I wrote a simple UTF8 validator some time ago, and seeing your implementation with lots of nested ifs, ++ etc., I was wondering about the pros and cons compared with my more state-based implementation.
Everything stopped at the Performance Analysis tool.

In CheckUnicodeType you go through the array twice just to verify whether it is UTF8.
And it could be enhanced with UTF16BE/LE detection - statistically based, so not 100% accurate, but it would cost some processing. For now you need a BOM for the UTF16s.
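A sketch of such a statistical check (my own heuristic, with arbitrarily chosen thresholds): mostly-ASCII text stored as UTF-16LE has zero bytes at odd offsets, UTF-16BE at even offsets.

```cpp
#include <cassert>
#include <cstddef>

enum class Guess { Unknown, Utf16LE, Utf16BE };

// Heuristic, not a proof: count where the zero bytes fall. If most
// 16-bit units have a zero high byte on one side and almost none on
// the other, guess that endianness; otherwise stay undecided.
Guess guess_utf16(const unsigned char* p, std::size_t n)
{
    std::size_t zeroEven = 0, zeroOdd = 0;
    for (std::size_t i = 0; i + 1 < n; i += 2) {
        if (p[i] == 0)     ++zeroEven;
        if (p[i + 1] == 0) ++zeroOdd;
    }
    std::size_t units = n / 2;
    if (units == 0)
        return Guess::Unknown;
    if (zeroOdd * 2 > units && zeroEven * 10 < units)
        return Guess::Utf16LE;
    if (zeroEven * 2 > units && zeroOdd * 10 < units)
        return Guess::Utf16BE;
    return Guess::Unknown;
}
```

It fails on text that is mostly non-Latin (few zero high bytes), which is exactly why it can only supplement, not replace, the BOM check.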

It started with a simple task:
the last line of a file may have no newline on it. In the current implementation you keep that information in the attribute m_bReturnAtEnd. This makes some editing at the end of the file quite hard. For example, you can add/remove the newline on the last line using Ctrl+Enter, but this is usually not applied to the edited file. Keeping this attribute up to date seems to be a hard task with lots of possible bugs. But even when I was able to remove this attribute and add the last line (when needed), it did not appear in the views...
It seems that you use line data from the diff, where the empty last line is missing. I'll come back to this later.

At that time I found that there is quite a lot of code duplication in Load and Save, and the duplicates are not the same... making the code harder to read and maintain. Plus the missing UTF32 encoding.

Another motivation was to do something simple before starting on code editing in multiple views.



>>>> 5.
>>>> Have you any specific reason to not support UTF32, or just too small use
>>>> cases.
>>> Is there even a tool/app/whatever that writes such files?
>>> I've never seen such a file myself.
>>> So why implement something that won't be used?
>> Quite agree. If this is only reason I would implemented that. I guess
>> this format is really rare. And if used then on Linux. But it makes me
>> feel, that application is unfinished whenever I read, that UTF32 is
>> thread as binary ...
>> To add support for UTF32 is write load and write filter (x2 BE, LE) ...
>> should be easy ...
>> Can be good new feature for 1.8.
> not really: while the svn diff lib doesn't support even utf16, it
> doesn't break either for those because it skips over the null bytes.
> But utf32 wouldn't work - too many null bytes in a normal text.
So does T-Merge do the diff directly on UTF16 files?
Is a utf8 temp file created only if the encodings differ, like UTF16 vs UTF8, or never?
Could that simply be enforced for UTF32 files?

From CDiffData::Load it seems that UTF16 files are saved as UTF8 temp files for diff purposes. Am I right?



>>>> Have you any reason to not support other EOLs? According
>>>> http://en.wikipedia.org/wiki/Newline only NEL seems be questionable.
>>> actually, yes: the svn diff library doesn't support them, so supporting
>>> them in TMerge makes no sense. We could split the lines there, but the
>>> diffing engine would treat those as one line and so the diff would be
>>> shown wrong.
>> Ok, sounds reasonable.
>>
>> There is as much use for thoses EOLS as for UTF32, so no big deal, but:
>> If I get that correctly, diff is made on temp files in UTF8 format. In
>> case other EOLs is used we can convert them to EOL_AUTOLINE, and make diff.
> That would work for *showing* the diff. But when saving edited content,
> you would save the converted EOLs.
Of course the ones we loaded. We only need to let the upper layer, e.g. CDiffData::Load, know that the diff needs enforced UTF8, while on an enforced-UTF8 save all non-standard EOLs can be converted to AUTO. We'll lose a little bit of the diff this way, though.

Making a UTF8 temp file would be needed for UTF16/UTF32, exotic EOLs, and differing encodings such as ASCII vs UTF8 ...


>> This lead me to other question:
>> 6.
>> When saving to enforced UTF8 for all but UTF8BOM BOM iwas not saved -
>> was this intentional? Can be BOM missing for all UTF8 enforced files.
>> Correct? This was implemented in r23192.
> There's an option in the settings to save files as utf8 even if they're
> detected as ANSI.
> You're now saving those with a BOM, which isn't what we did before. In
> that case you should write the file without a BOM (always without BOM if
> possible, only if the file had one when loading, then we write the BOM too).
r23193 is like: ((!bSaveAsUTF8) && (m_UnicodeType == CFileTextLines::UTF8BOM))
which should mean: if NOT SaveAsUtf8 and ... then save the BOM.

Is bSaveAsUTF8 only for user requests, or for diff purposes too?


>> 7.
>> If I read code correctly On CFileTextLines::CheckLineEndings you detect
>> EOL_LFCR, but in CFileTextLines::Load this is one is decoded as EOL_LF
>> and EOL_CR or EOL_CRLF.
>> Is this intentional?
>> I guess EOL_LFCR is too rare to be really wanted, but why to detect it
>> in Check then. This makes all EOL_AUTOLINE EOL_LFCR.
> I don't understand what you mean here.
> In Load(), the line endings are checked by calling CheckLineEndings(),
> there's no separate detection.
CheckLineEndings can detect EOL_LFCR, but Load cannot. Is this what you want?


> Stefan



--
Oto ot(ik) BREZINA - 오토, mob: +421 903 653 470
Printflow s.r.o, tel +421 2 4488 1086, Bratislava, Slovakia, EU

If I toppost I do it because:
  • I don't have time to edit out irrelevant context and signatures
  • I expect you to remember the context for my email messages
  • I want you do the work to figure out what I said
  • My time is more important than your time

Stefan Küng

unread,
Aug 20, 2012, 3:03:43 PM8/20/12
to d...@tortoisesvn.tigris.org
Looks like a good idea.
If you want to do some profiling:
have a look at the src\utils\profiling.h file. It's a simple time-measuring tool that helps find out whether a change has an effect on performance or not.

Simply add
PROFILE_BLOCK;
somewhere inside a function, then make a release build, run the app and that function, and exit the app. A file will be written to the same folder as the exe, showing all the time measurements.

To profile a single line, use
PROFILE_LINE(call_function());

But I prefer
PROFILE_BLOCK;

in case you need to profile only part of a function, create a 'block' using brackets:

function(...)
{
    ...
    ...
    {
        PROFILE_BLOCK;
        // code to profile
        ...
    }
}
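I haven't checked the actual profiling.h, but a PROFILE_BLOCK-style macro is typically an RAII object; a guessed sketch of the mechanism (the real implementation may differ):

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// RAII timer: records the time on construction and reports the elapsed
// time when the enclosing scope ends, whichever way it is left.
class ScopedProfile
{
public:
    static double lastMs;  // exposed for testing; a real tool would log to file

    explicit ScopedProfile(const char* name)
        : m_name(name), m_start(std::chrono::steady_clock::now())
    {
    }

    ~ScopedProfile()
    {
        auto end = std::chrono::steady_clock::now();
        double ms =
            std::chrono::duration<double, std::milli>(end - m_start).count();
        lastMs = ms;
        std::printf("%s: %.3f ms\n", m_name, ms);
    }

private:
    const char* m_name;
    std::chrono::steady_clock::time_point m_start;
};

double ScopedProfile::lastMs = -1.0;

// Placing one of these at the top of a scope profiles the whole scope,
// which is why an extra pair of brackets limits what gets measured.
#define PROFILE_BLOCK_SKETCH ScopedProfile profile_block_(__func__)
```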

> So T-Merge do a diff directly on UTF16 files?
> Utf8 temp file is created only if encoding differ like UTF16 and UTF8,
> or never ?
> Can that be simply enforced for UTF32 files?
>
> From CDiffData::Load it seems that UTF16 files are saved as UTF8 in
> temp, for diff purposes, Am I right?

Hmm, right. We could just always save those files as utf8 and then run the svn diff lib functions on those. In that case, even utf32 files could be supported by TMerge.

>> That would work for *showing* the diff. But when saving edited content,
>> you would save the converted EOLs.
> Of course one, we load(ed). We only need to get know to upper layer e.g
> CDiffData::Load, that diff needs enforce UTF8, while on enforced UTF8
> save all non standard EOLS can be converted to AUTO. We'll lose little
> bit of diff this way through.
>
> Making temp UTF8 needed in UTF16,32 Exotic EOLS, different Encodings
> ASCII and UTF8 ...

Not much of a problem, I think. Unless of course the files are huge; then converting them first to utf8 will take a few seconds. But I guess that's better than not working with such files at all.

>>> This lead me to other question:
>>> 6.
>>> When saving to enforced UTF8 for all but UTF8BOM BOM iwas not saved -
>>> was this intentional? Can be BOM missing for all UTF8 enforced files.
>>> Correct? This was implemented in r23192.
>> There's an option in the settings to save files as utf8 even if they're
>> detected as ANSI.
>> You're now saving those with a BOM, which isn't what we did before. In
>> that case you should write the file without a BOM (always without BOM if
>> possible, only if the file had one when loading, then we write the BOM too).
> r23193 is like: ((!bSaveAsUTF8)&&(m_UnicodeType ==
> CFileTextLines::UTF8BOM))
> Should mean If NOT SaveAsUtf8 and ... then save BOM.

We don't need the BOM for the diff. We only want to save the BOM if it
was there when loading the file.

> bSaveAsUTF8 is only for user requests, or for diff purposes too?

for both.

>
>>> 7.
>>> If I read code correctly On CFileTextLines::CheckLineEndings you detect
>>> EOL_LFCR, but in CFileTextLines::Load this is one is decoded as EOL_LF
>>> and EOL_CR or EOL_CRLF.
>>> Is this intentional?
>>> I guess EOL_LFCR is too rare to be really wanted, but why to detect it
>>> in Check then. This makes all EOL_AUTOLINE EOL_LFCR.
>> I don't understand what you mean here.
>> In Load(), the line endings are checked by calling CheckLineEndings(),
>> there's no separate detection.
> CheckLineEndings can detect EOL_LFCR, Load not. Is this what you want?

Load() calls CheckLineEndings(), which can detect EOL_LFCR.
Or am I missing something here?

Stefan


------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999683

Oto BREZINA

unread,
Aug 20, 2012, 3:19:41 PM8/20/12
to d...@tortoisesvn.tigris.org
You are right - I missed this point, so I was not clear enough:

Load() at the beginning calls CheckLineEndings() to select the right AUTO EOL - it supports four EOLs.
But when analyzing the file text it does not recognize EOL_LFCR:

for (int i = 0; i < nReadChars; ++i)
'\r' + '\n' -> EOL_CRLF
'\r' -> EOL_CR
'\n' -> EOL_LF

'\n' + '\r' is not recognized as EOL_LFCR here.
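A fix could look like this (a sketch with illustrative enum values, not the actual CFileTextLines code): after seeing '\n', peek at the next character before deciding, just as the CR case already does.

```cpp
#include <cassert>
#include <cstddef>

enum class Eol { None, CRLF, LFCR, CR, LF };

// Detect the first line ending in a buffer, treating both two-byte
// pairs (CR+LF and LF+CR) as single endings.
Eol detect_first_eol(const char* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] == '\r')
            return (i + 1 < n && p[i + 1] == '\n') ? Eol::CRLF : Eol::CR;
        if (p[i] == '\n')
            return (i + 1 < n && p[i + 1] == '\r') ? Eol::LFCR : Eol::LF;
    }
    return Eol::None;
}
```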

> Stefan
>


--
Oto ot(ik) BREZINA - 오토

------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999686

Stefan Küng

unread,
Aug 20, 2012, 4:13:28 PM8/20/12
to d...@tortoisesvn.tigris.org
in that case, that would be a bug.

Stefan


------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999700

Oto BREZINA

unread,
Aug 21, 2012, 9:57:33 AM8/21/12
to d...@tortoisesvn.tigris.org
And what is the resolution? Not detecting it in Check, or adding support to Load and to the EOL change (Ctrl+Enter) and drawing it, or not AUTO-detecting it at all? In fact it is quite rare, and in edge cases it may lead to other issues.
Let's say you mix Linux (LF) and Windows (CRLF) EOLs - not good, but a common case - then LF + CRLF would become LFCR + LF, which may be fine until you edit the second line ...
> Stefan
>

--
Oto ot(ik) BREZINA - 오토

------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999903

Oto BREZINA

unread,
Aug 21, 2012, 10:07:18 AM8/21/12
to d...@tortoisesvn.tigris.org
On 2012-08-20 21:03, Stefan Küng wrote:
> On 20.08.2012 20:40, Oto BREZINA wrote:
>
>> I'm NOT about to rewrite CheckUnicodeType, just was wonder when I read
>> it if *ptr++ is faster.
>> In fact I was little bit about - I wrote simple UTF8 validator some time
>> ago, and seeing your implementation with lot of nested ifs, ++ etc, I
>> was wondering what are cons and pros to compare with my - more state
>> based implementation.
>> Everything stopped on Performance Analysis tool.
>>
>> In CheckUnicodeType you get thru array twice just to verify if it is UTF8.
>> And it can be enhanced to UTF16BE/LE detection - statisticaly based, so
>> not 100% accurate, but it would cost some processing. For now you need
>> BOM for UTF16s.
>>
>>
> Looks like a good idea.
> If you want to do some profiling:
> have a look at the src\utils\profiling.h file. It's a simple time
> measuring tool that helps finding out if a change has an effect on
> performance or not.
>
> Simply add
> PROFILE_BLOCK;
> somewhere inside a function, then make a release build, run the app and
> that function, exit the app. A file will be written to the same folder
> as the exe is where all the time measurements are shown.
>
> to profile a simple line, use
> PROFILE_LINE(call_function());
>
> But I prefer the
> PROFILE_BLOCK;
>
> in case you need to only profile part of a function, create a 'block'
> using brackets:
>
> function(...)
> {
> ...
> ...
> {
> PROFILE_BLOCK;
> // code to profile
> ...
> }
> }
Thanks for this hint. I was trying to get the VS analysis tools working, but with no success yet; this one is great... so I'm giving up on VS.
The only downside is that Load and Save work with files, and my computer is in use all the time with a lot of scheduling, so measuring wall time is not the most informative way to check algorithm efficiency.

I'll have a closer look at this tool later.

Result:
*ptr++ vs ptr[] is case-by-case, even when it runs on a simple char*.

for (int nDword = 0; nDword < nDwords; nDword++)
{
    p32[nDword] = DwordSwapBytes(p32[nDword]);
}

vs. something like

for (int nDword = nDwords; nDword; --nDword, ++p32)
{
    *p32 = DwordSwapBytes(*p32);
}

which was faster by about 3:4 time units on a 100MB file. Other parts were +/- equal or slower.

The UTF8 validity check got twice as fast by using bit tests instead of byte tests.
Not sure about *ptr++ vs ptr[] there.
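For comparison, a structural UTF-8 check built on such bit tests (my own sketch, not the validator discussed above; it does not reject overlong forms or surrogates):

```cpp
#include <cassert>
#include <cstddef>

// Structural UTF-8 check using bit tests on the lead byte: the top
// bits select how many 10xxxxxx continuation bytes must follow.
bool is_structurally_valid_utf8(const unsigned char* p, std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        std::size_t follow;
        if ((c & 0x80) == 0x00)      follow = 0;  // 0xxxxxxx: ASCII
        else if ((c & 0xE0) == 0xC0) follow = 1;  // 110xxxxx
        else if ((c & 0xF0) == 0xE0) follow = 2;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) follow = 3;  // 11110xxx
        else return false;            // stray continuation or invalid lead
        if (i + 1 + follow > n)
            return false;             // sequence truncated at buffer end
        for (std::size_t k = 1; k <= follow; ++k)
            if ((p[i + k] & 0xC0) != 0x80)
                return false;         // expected a continuation byte
        i += 1 + follow;
    }
    return true;
}
```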

>
> Stefan
>


--
Oto ot(ik) BREZINA - 오토

------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999905