simple question? byte count of unicodestring

A

Sep 10, 2011, 9:36:17 AM
I guess the answer is simple, but how do I get the byte count of a
UnicodeString?

UnicodeString.Length won't do, as it reports the number of characters.

I need access to the raw buffer so I can checksum it with CRC32, which of
course takes a void * pointer (I'm mixing a bit of Delphi and C++ code).


A

Sep 10, 2011, 9:44:48 AM
By the way, I know Delphi uses UTF-16, so I can use UnicodeString.data() to
get the raw data pointer and multiply Length by 2 to account for widechar
instead of char.

However, what I am uncertain of is whether UTF-16 can, under some
circumstances, use 4 bytes per character, in which case the multiply-by-2
calculation would fail?


Dr Engelbert Buxbaum

Sep 10, 2011, 6:11:47 PM
In article <j4fp4q$jr8$1...@gregory.bnet.hr>, a@a.a says...
SizeOf()

A

Sep 10, 2011, 6:27:16 PM
> SizeOf()

Yes, I could multiply by size of widechar, but that would fail to convert
codepoints that have 4 bytes -
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
Take a look at codepoints of UTF-16 010000 - 03FFFF and 040000 - 10FFFF that
require 4 bytes in UTF-16.

Nevermind, I solved it by assigning UnicodeString to UTF8String which does
the conversion and then doing checksum of UTF8String buffer. It seems that
UTF8String.Length reports number of bytes of the buffer (which may be larger
than number of characters) rather than number of Unicode characters - and
that was exactly what I needed so UTF8 is just fine for this purpose.
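
A rough sketch of why the UTF-8 route gives a byte count directly: after
conversion, each element is one byte, so the length is the buffer size.
This uses standard C++ std::u16string as a stand-in for UnicodeString and a
hand-written converter named utf16_to_utf8 (a hypothetical helper, not the
Delphi RTL conversion). Replacing unpaired surrogates with U+FFFD is an
assumption made for this sketch only.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Convert UTF-16 code units to UTF-8 bytes. Unpaired surrogates are
// replaced with U+FFFD; that policy is a choice made for this sketch.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high (leading) surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the pair into one code point above U+FFFF.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  // lone low surrogate
            cp = 0xFFFD;
        }
        if (cp < 0x80) {  // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {  // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {  // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, utf16_to_utf8(u"AB\U00020000C").size() is 7: one byte each for
A, B and C, plus four bytes for U+20000. Note that a checksum of the UTF-8
bytes differs from a checksum of the UTF-16 buffer for the same text, so both
sides of a comparison have to use the same encoding.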



Arivald

Sep 12, 2011, 2:59:25 AM
On 2011-09-11 00:27, A wrote:
>> SizeOf()
>


The formula is simple, and it works in all versions of Delphi.

ByteSize := Length(S) * SizeOf(Char);


Both "string" and "Char" are just aliases for real types.

In older Delphi versions, where S is an AnsiString, this translates to
ByteSize := Length(S) * SizeOf(AnsiChar);

In newer, Unicode-enabled Delphi versions, where S is a UnicodeString, it
translates to
ByteSize := Length(S) * SizeOf(WideChar);


Delphi guarantees that "string" is the default string type for the VCL.
"Char" is the type of character used in "string". To get the byte size of
a "string", multiply its length by the size of a character. You do not
need to know the real size of a character; it may be one, two or four
bytes. As long as you use SizeOf(Char), your code will work correctly.
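
The same idea can be sketched in standard C++, assuming std::basic_string as
a stand-in for the Delphi string types. The helper name byte_length is
hypothetical. The point is only that multiplying by the size of the element
type keeps the calculation correct whatever the element width is.

```cpp
#include <cstddef>
#include <string>

// Byte size of the character data (terminator excluded): number of
// elements times the size of one element, whatever the element type is.
template <typename CharT>
std::size_t byte_length(const std::basic_string<CharT>& s) {
    return s.size() * sizeof(CharT);
}
```

On platforms where char16_t is 2 bytes and char32_t is 4 bytes, the
three-letter string "abc" gives 3 as std::string, 6 as std::u16string and
12 as std::u32string, with no change to the calling code.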


> Yes, I could multiply by size of widechar, but that would fail to convert
> codepoints that have 4 bytes -
> http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
> Take a look at codepoints of UTF-16 010000 - 03FFFF and
> 040000 - 10FFFF that require 4 bytes in UTF-16.

No, it will not fail. UnicodeString.Length() does not count characters.
It counts code units (a character may be one unit or two units).
Basically, Length() returns the number of bytes of string data divided by
the size of one element, for all types of strings.

The reason for this is simple: the length of a string is stored in memory
a few bytes before the string data. This value is returned by the Length()
call, and it is also used when reallocating memory. You can retrieve the
string length manually: call PChar() to get a pointer to the data, move it
back by some offset (I do not remember how many bytes), cast it to a
pointer to integer, and read the integer it points to.

Note that UTF8String works in exactly the same way. Indeed, it is implemented
using the same structures and functions (all string types are)! The only
difference is the value in the "encoding" field of the string record.


Characters composed of two code units are indeed a problem, but not in
this case.


> Nevermind, I solved it by assigning UnicodeString to UTF8String which does
> the conversion and then doing checksum of UTF8String buffer. It seems that
> UTF8String.Length reports number of bytes of the buffer (which may be larger
> than number of characters) rather than number of Unicode characters - and
> that was exactly what I needed so UTF8 is just fine for this purpose.

Yes, because counting characters in UTF-8 is complicated, while Delphi
guarantees that Length() of a string type executes in constant time. But
your method may be slow because of the Unicode to UTF-8 conversion.

--
Arivald

A

Sep 12, 2011, 7:52:28 AM
> > Yes, I could multiply by size of widechar, but that would fail to convert
> > codepoints that have 4 bytes -
> > http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
> > Take a look at codepoints of UTF-16 010000 - 03FFFF and
> > 040000 - 10FFFF that require 4 bytes in UTF-16.
> No, it will not fail. UnicodeString.Length() does not count characters. It
> counts code units (a character may be one unit or two units).
> Basically, Length() returns the number of bytes of string data divided by
> the size of one element, for all types of strings.

So what would Length report in this case:

2 bytes | 2 bytes | 4 bytes        | 2 bytes
A       | B       | chinese letter | C

Would Length report 4 or 5? As the size of a normal UTF-16 char is 2 bytes,
it should report 5, right?


Arivald

Sep 12, 2011, 8:29:40 AM
On 2011-09-12 13:52, A wrote:

Yes.

Here is an article about strings:
http://www.justphukit.com/handbook_2009/the_internal_structure_of_strings.html

As you can see, in Delphi 2009+ you can get the size of a single string
element from the string itself.


By the way, newer Delphi versions have a type called RawByteString. From the help:

> RawByteString enables the passing of string data of any code
> page without doing any code page conversions

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/System_RawByteString.html

It may be good for your CRC routines to take a string of this type and
work on the raw bytes.

--
Arivald

A

Sep 12, 2011, 11:48:20 AM
> Yes.

Well then, yes, getting the length of the UnicodeString buffer is not a
problem for CRC purposes. The 4-byte letters only become a problem if I need
to examine the letters themselves (checking for surrogate pairs etc.). I get
it now :)
I don't really need to mess around with the internal structure of
UnicodeString. A pointer to the buffer (c_str() or data()) and
Length * sizeof(char) is more than enough for me.
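
A minimal sketch of that approach in standard C++, assuming std::u16string in
place of UnicodeString. The function names crc32_bytes and crc32_of_utf16 are
hypothetical, and the bitwise CRC-32 below (reflected polynomial 0xEDB88320)
is only a stand-in for whatever CRC32 routine is actually in use.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32 over an arbitrary byte buffer (reflected polynomial
// 0xEDB88320, initial value and final XOR of 0xFFFFFFFF).
std::uint32_t crc32_bytes(const void* data, std::size_t byte_count) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < byte_count; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial if the dropped bit was 1.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Checksum the raw UTF-16 buffer: pointer to the data plus
// (number of code units) * (size of one code unit).
std::uint32_t crc32_of_utf16(const std::u16string& s) {
    return crc32_bytes(s.data(), s.size() * sizeof(char16_t));
}
```

Two caveats. In C++ itself sizeof(char) is always 1, so on the C++ side the
multiplier has to be the size of the wide element type (char16_t here), not
char. And a checksum of the raw UTF-16 buffer depends on byte order, so it
only matches between machines that store the code units the same way.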

> By the way, in new Delphi there is type RawByteString. From help:
> It may be good for Your CRC routines to get string of this type, and work
> on raw bytes.

I don't think it is good for that. From everything I have read, it is meant
for one purpose alone: to be passed as a function parameter so that no
code page conversion takes place, i.e. just to pass the data buffer as it
is... which is almost useless in most cases I can think of. I only need to
use the data buffer, so a pointer to the buffer is more than enough.


A

Sep 12, 2011, 11:56:28 AM
It might be more future-proof to use this:

SysUtils.ByteLength
http://docwiki.embarcadero.com/VCL/2010/en/SysUtils.ByteLength

Internally it has the same calculation:

function ByteLength(const S: string): Integer;
begin
  Result := Length(S) * SizeOf(Char);
end;

But if they decide to modify this, they will modify ByteLength too and
therefore you might not have to update your code...


Rudy Velthuis

Sep 27, 2011, 8:51:49 PM
SizeOf(what) exactly?

The byte count is Length(Str)*SizeOf(Char).

--
Rudy Velthuis

"Glory is fleeting, but obscurity is forever."
-- Napoleon Bonaparte (1769-1821)

Rudy Velthuis

Sep 27, 2011, 8:54:56 PM
A wrote:

> So what would Length report in this case:
>
> 2 bytes | 2 bytes | 4 bytes        | 2 bytes
> A       | B       | chinese letter | C
>
> would Length report 4 or 5?

The Chinese character is made up of a surrogate pair, which are TWO
elements of the string. There are 5 elements, so the length is 5.
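
To make the distinction concrete, here is a small sketch in standard C++,
again assuming std::u16string as a stand-in for UnicodeString. The helper
count_code_points is hypothetical and assumes well-formed UTF-16. It counts
characters (code points) by skipping the trailing half of each surrogate
pair, while size() keeps counting elements.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-16 string by skipping the low
// (trailing) half of each surrogate pair. Assumes well-formed UTF-16.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {  // not a low surrogate
            ++n;
        }
    }
    return n;
}
```

For u"AB\U00020000C" (U+20000 is a CJK ideograph outside the Basic
Multilingual Plane), size() is 5, count_code_points() is 4, and the buffer
holds 5 * sizeof(char16_t) bytes of character data.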
--
Rudy Velthuis

"The process of preparing programs for a digital computer is
especially attractive, not only because it can be economically
and scientifically rewarding, but also because it can be an
aesthetic experience much like composing poetry or music."
-- Donald Knuth