Why is a newline (LF, '\n', 0x0A) not in the class print?

MartinLemburg@Siemens-PLM

Nov 30, 2012, 8:42:39 AM

Hi,

since ...

% string is space "\n"
1

..., then why ...

% string is print "\n"
0

I took a look at the 8.6b3 sources: in tclUtf.c, Tcl_UniCharIsSpace gives special treatment to code points lower than 128, which are tested with "isspace".

Tcl_UniCharIsPrint has no such special treatment for code points lower than 128, so the newline character is suddenly not a print character.

Should I file a bug?

I ask because it would even be correct for "string is space" to return false for a newline, if it considered only the Unicode "general category" of the newline character.
If it considers the Bidi category, which is "Paragraph Separator", then the newline must be a "space" character.

The function Tcl_UniCharIsPrint checks whether the graphical or space bits are set in the category bitmask of the Unicode character. By that test alone, the newline is simply a control character and not a paragraph separator!
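
To illustrate the kind of bitmask test described here, a minimal C sketch follows. The category numbers, the DEMO_* masks and demo_get_category are hypothetical stand-ins and not the values from Tcl's generated tables; only the shape of the test, (bits >> category) & 1, follows the Tcl function.

```c
/* Hypothetical category numbers, for illustration only. The real ones
 * come from Tcl's generated Unicode tables. */
enum demo_category {
    DEMO_CONTROL = 0,          /* e.g. '\n' (general category Cc) */
    DEMO_SPACE_SEPARATOR = 1,  /* e.g. ' '  (general category Zs) */
    DEMO_GRAPHIC = 2           /* letters, digits, punctuation, ... */
};

/* One bit per category that counts as "space" or "graphic". */
#define DEMO_SPACE_BITS (1u << DEMO_SPACE_SEPARATOR)
#define DEMO_GRAPH_BITS (1u << DEMO_GRAPHIC)

/* Toy stand-in for a table lookup; only meaningful for 0x00..0x7F. */
int demo_get_category(int ch)
{
    if (ch < 0x20 || ch == 0x7F) {
        return DEMO_CONTROL;
    }
    if (ch == 0x20) {
        return DEMO_SPACE_SEPARATOR;
    }
    return DEMO_GRAPHIC;
}

/* Same shape as the print test: shift the combined mask right by the
 * category number and look at the lowest bit. */
int demo_is_print(int ch)
{
    unsigned int mask = DEMO_GRAPH_BITS | DEMO_SPACE_BITS;
    return (int) ((mask >> demo_get_category(ch)) & 1u);
}
```

With these toy tables, demo_is_print('\n') yields 0 because the control category has no bit in either mask, while ' ' and 'a' yield 1.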

So the question is: are the tables in tclUniData.c wrong to classify a newline as a control character, or must every function that classifies a newline (or other, similar characters) use the ISO C function "isspace" to classify it correctly?

If "isspace" is to be used, then Tcl_UniCharIsPrint has to be corrected, doesn't it?

Best regards,

Martin

Les Cargill

Nov 30, 2012, 1:39:18 PM

MartinLemburg@Siemens-PLM wrote:
<snip>

SFAIsuspect, Tcl simply uses the character classes contained
in ctype.h, of the 'C' language library header file
constellation.

--- BEGIN EXCERPT
/*
 * The following flags are used to tell iswctype and _isctype what
 * character types you are looking for.
 */
#define _UPPER 0x0001
#define _LOWER 0x0002
#define _DIGIT 0x0004
#define _SPACE 0x0008 /* HT LF VT FF CR SP */
#define _PUNCT 0x0010
#define _CONTROL 0x0020
/* _BLANK is set for SP and non-ASCII horizontal space chars (eg,
"no-break space", 0xA0, in CP1250) but not for HT. */
#define _BLANK 0x0040
#define _HEX 0x0080
#define _LEADBYTE 0x8000

#define _ALPHA 0x0103

--- END EXCERPT
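
The _SPACE comment in the excerpt lists LF among the space characters. As a quick check of how the C library itself classifies such characters, here is a small sketch. The helper name is hypothetical, and the result assumes the default "C" locale, in which isprint() is true only for 0x20 through 0x7E.

```c
#include <ctype.h>

/* Hypothetical helper (not part of Tcl or the C library): returns 1 if
 * the C library counts ch as white space but not as printable. */
int is_space_but_not_print(int ch)
{
    /* ctype functions require a value representable as unsigned char. */
    unsigned char c = (unsigned char) ch;

    return (isspace(c) && !isprint(c)) ? 1 : 0;
}
```

Under that assumption the helper returns 1 for '\n' and 0 for ' ' and 'a', so the C library also treats the newline as space but not as print.
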
--
Les Cargill

MartinLemburg@Siemens-PLM

Dec 3, 2012, 4:15:12 AM

Hi,

Here are the sources from Tcl:

int
Tcl_UniCharIsSpace(
    int ch)                     /* Unicode character to test. */
{
    /*
     * If the character is within the first 127 characters, just use the
     * standard C function, otherwise consult the Unicode table.
     */

    if (((Tcl_UniChar) ch) < ((Tcl_UniChar) 0x80)) {
        return isspace(UCHAR(ch)); /* INTL: ISO space */
    } else {
        return ((SPACE_BITS >> GetCategory(ch)) & 1);
    }
}

int
Tcl_UniCharIsPrint(
    int ch)                     /* Unicode character to test. */
{
    return (((GRAPH_BITS|SPACE_BITS) >> GetCategory(ch)) & 1);
}

As you can see, for '\n' the function Tcl_UniCharIsSpace asks the native C library and uses its character classes, so you are right!

But the function Tcl_UniCharIsPrint doesn't use the native C library for code points below 0x80!
If spaces are part of the printable characters, then IMHO the function Tcl_UniCharIsPrint should behave like Tcl_UniCharIsSpace in detecting spaces:

int
Tcl_UniCharIsPrint(
    int ch)                     /* Unicode character to test. */
{
    if (Tcl_UniCharIsSpace(ch)) {
        return 1;
    }

    return ((GRAPH_BITS >> GetCategory(ch)) & 1);
}
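
To get a feeling for what this change would affect in the ASCII range, here is a rough, self-contained C sketch. It does not use Tcl's tables: table_is_graph and table_is_space are hypothetical stand-ins that assume isgraph() matches the graphic categories for ASCII and that U+0020 is the only ASCII character in a Unicode space-separator category. It also assumes the default "C" locale.

```c
#include <ctype.h>

/* Stand-ins for the table lookups, ASCII only (assumptions, not Tcl code). */
static int table_is_graph(int ch) { return isgraph((unsigned char) ch) != 0; }
static int table_is_space(int ch) { return ch == ' '; }

/* Models the existing test: graphic or space bits from the table only. */
int is_print_current(int ch)
{
    return table_is_graph(ch) || table_is_space(ch);
}

/* Models the proposed test: ask isspace() first, then the graphic bits. */
int is_print_proposed(int ch)
{
    if (isspace((unsigned char) ch)) {
        return 1;
    }
    return table_is_graph(ch);
}

/* Counts the code points below 0x80 whose result differs. */
int count_changed_ascii(void)
{
    int ch, n = 0;

    for (ch = 0; ch < 0x80; ch++) {
        if (is_print_current(ch) != is_print_proposed(ch)) {
            n++;
        }
    }
    return n;
}
```

Under these assumptions only HT, LF, VT, FF and CR change, from "not print" to "print", so count_changed_ascii() returns 5.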

Wouldn't this be better? Or am I missing something Unicode-specific?

Any Tcl core team member around?

Best regards,

Martin