AbtXmlDOMParser, XML entities and DBString


Joachim Tuchel

Nov 6, 2023, 5:46:47 AM
to VAST Community Forum
I am having trouble with an RSS feed that we parse in our application. If a tag contains an XML/HTML entity, the DOM parser returns the resulting string as a DBString, and a DBString does not understand #utf8 (which GRVASTUtf8Codec sends to it in order to render the text on a web page).

So here is a little snippet to demonstrate my problem:

| test parser dom |
test := '<?xml version="1.0" encoding="UTF-8"?><test>Text with an xml entity: &#8211;</test>'.
parser := AbtXmlDOMParser newNonValidatingParser.
dom := parser parse: test.
(dom getElementsByTagName: 'test') first contents first abrAsString


The problem is that the contents of the test tag are returned as a DBString and cannot be processed any further. Sending asSBString returns nil (which is logical), and it seems there is nothing I can do with a DBString to convert it to anything else.

So how can I proceed in order to render the contents of my RSS feed onto a web page?
In the debugger, you can see that the parser decides to create a DBString. I'd rather have it return a UTF-8 string, which I can process further. I guess the best solution here would be to force the parser to create UTF-8 instead of a DBString. I cannot find such an option, however... What am I doing wrong or missing?


Joachim


As a side note: abrAsString is deprecated, and I thought I could simply replace it with asString. This is the first place I have found where that's not the case: abrAsString returns a DBString, while asString returns nil. So asString is not a full replacement for abrAsString, at least not in the case of DBStrings.






Richard Sargent

Nov 6, 2023, 5:35:06 PM
to VAST Community Forum
Joachim,

Does Utf8View on: aDBString work?

Alternatively, collect the codepoints from aDBString and transform them as eachCodePoint asGrapheme utf8.

Joachim Tuchel

Nov 7, 2023, 3:30:33 AM
to VAST Community Forum
Richard,

thanks for your suggestions and time.

|test parser dom dbString |

test := '<?xml version="1.0" encoding="UTF-8"?><test>Text with an xml entity: &#8211;</test>'.
parser := AbtXmlDOMParser newNonValidatingParser decodingEnabled: false.

dom := parser parse: test.
dbString := (dom getElementsByTagName: 'test') first contents first abrAsString.
(Utf8View on: dbString)


gives me a "primitive failed" error on VAST 12.0. So your first idea doesn't work.

I also played with your second idea, which sounds extremely logical, but so far I haven't succeeded.
This is where I currently am:

| test parser dom dbString charArray utf8String |

test := '<?xml version="1.0" encoding="UTF-8"?><test>Text with an xml entity: &#8211;</test>'.
parser := AbtXmlDOMParser newNonValidatingParser decodingEnabled: false.

dom := parser parse: test.
dbString := (dom getElementsByTagName: 'test') first contents first abrAsString.
charArray := dbString asArray.

Utf8View on: charArray. "--> Primitive failed in: UnicodeView>>#primInitializeOn:immutable:"
charArray := dbString asUtf8. "--> DBString does not understand asUtf8"
utf8String := charArray collect: [:dbChar | dbChar asGrapheme]. "--> Index out of range in Character>>asGrapheme"



So, unfortunately, to me it seems like AbtXmlDOMParser leads to a dead end because a DBString is a dumb dead end...?
I think it shouldn't give me a DBString at all, but a String in the UTF-8 code page (or a real Utf8String). But I guess I am just too stupid to use it properly.

Joachim

Joachim Tuchel

Nov 7, 2023, 4:16:12 AM
to VAST Community Forum
Okay, I found something that at least doesn't throw an error and seems to be a workable starting point (the dash becomes an ascii dash):

|test parser dom dbString charArray outStr |

test := '<?xml version="1.0" encoding="UTF-8"?><test>Text with an xml entity: &#8211;</test>'.
parser := AbtXmlDOMParser newNonValidatingParser decodingEnabled: false.
dom := parser parse: test.
dbString := (dom getElementsByTagName: 'test') first contents first abrAsString.

outStr := WriteStream on: String new.
(TextConverter newForEncoding:  'utf-8' ) nextPutAll: dbString toStream: outStr.
outStr contents. "'Text with an xml entity: –'"
outStr contents convertFromCodePage: 'utf-8'


The code looks a bit clumsy, but leads to a result that should be okay in my scenario.
Is this really the way to go, or am I just missing some configuration of AbtXmlDOMParser that would return either the UTF-8 encoded String or (even better) a String converted to my code page?
What would be the right way of doing this with the new Unicode classes?

Joachim

Joachim Tuchel

Nov 7, 2023, 4:22:26 AM
to VAST Community Forum
Oh, and one more question remains: now that #abrAsString is deprecated, my "solution" won't work any more with #asString, because it returns nil...

Joachim Tuchel

Nov 7, 2023, 4:23:48 AM
to VAST Community Forum
...and my solution requires Grease to be loaded, because TextConverter is in GreaseVASTCore... So I definitely am doing something wrong.

Richard Sargent

Nov 7, 2023, 9:21:07 AM
to VAST Community Forum
See below.

On Tuesday, November 7, 2023 at 12:30:33 AM UTC-8 Joachim Tuchel wrote:
Richard,

thanks for your suggestions and time.

|test parser dom dbString |

test := '<?xml version="1.0" encoding="UTF-8"?><test>Text with an xml entity: &#8211;</test>'.
parser := AbtXmlDOMParser newNonValidatingParser decodingEnabled: false.

dom := parser parse: test.
dbString := (dom getElementsByTagName: 'test') first contents first abrAsString.
(Utf8View on: dbString)


gives me a "primitive failed" error on VAST 12.0. So your first idea doesn't work.

I also played with your second idea, which sounds extremely logical, but so far I haven't succeeded.
This is where I currently am:

|test parser dom dbString charArray |

test := '<?xml version="1.0" encoding="UTF-8"?><test>Text with an xml entity: &#8211;</test>'.
parser := AbtXmlDOMParser newNonValidatingParser decodingEnabled: false.

dom := parser parse: test.
dbString := (dom getElementsByTagName: 'test') first contents first abrAsString.
charArray := dbString  asArray.

At this point, use Array>>#asUnicodeString. Then, send #utf8 to that result.
e.g.

dbString asArray asUnicodeString utf8

If that works, it forms the basis of an extension method on DBString (and if you don't want to risk collision with Instantiations' future work, one on String, too).



Utf8View on: charArray. "--> Primitive failed in: UnicodeView>>#primInitializeOn:immutable:"
charArray := dbString asUtf8. "--> DBString does not understand asUtf8"
utf8String := charArray collect: [:dbChar | dbChar asGrapheme]. "--> Index out of range in Character>>asGrapheme"



So, unfortunately, to me it seems like AbtXmlDOMParser leads to a dead end because a DBString is a dumb dead end...?

I don't think that's quite it. Although, DBString is an unusual beast. (I suspect it derived from IBM requirements for string handling. EBCDIC, anyone?)

I think the reality is that AbtXmlDOMParser is (or should be) somewhere on the list of things to update because of Instantiations' efforts to support Unicode.

Joachim Tuchel

Nov 7, 2023, 11:07:42 AM
to VAST Community Forum
Richard,

thanks for your ideas.

Unfortunately

|test parser dom dbString charArray outStr |
test := '<?xml version="1.0" encoding="UTF-8"?><test>Text with an xml entity: &#8211;</test>'.

parser := AbtXmlDOMParser newNonValidatingParser decodingEnabled: false .
dom := parser parse: test.
dbString := (dom getElementsByTagName: 'test') first contents first abrAsString.

charArray := dbString asArray.
charArray asUnicodeString utf8

ends in a Primitive Failed in UnicodeString class>>#value due to an invalid class in the argument, the argument being the Array ($T $e $x $t $  $w $i $t $h $  $a $n $  $x $m $l $  $e $n $t $i $t $y $: $  $[DC3])
dbString asUnicodeString doesn't work either: self shouldNotImplement  ---> dead end. I guess somebody once thought that all that Unicode stuff would one day be a subclass of DBString, which hasn't happened, for a reason, I guess.

I didn't want to say AbtXmlDOMParser is useless or doomed. It just fails in my situation and I see no good reason for it to return a DBString, especially not in the future when VAST supports Unicode. DBString just feels like a completely useless container to hold a bunch of Characters that cannot do anything.

Joachim

Richard Sargent

Nov 7, 2023, 11:23:54 AM
to va-sma...@googlegroups.com
My apologies, Joachim! You need to collect the code-points from the DBString. Array>>#asUnicodeString requires an array of integers.
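
The integers-first idea can be illustrated outside VAST with a short Python sketch. This is only an illustration of the principle (code points in, UTF-8 bytes out); it does not model Array>>#asUnicodeString or any other Smalltalk selector from this thread, and the sample list of integers is an assumption standing in for the values collected from the DBString's characters.

```python
# Code points as they might be collected from the parsed text:
# "entity: " followed by the character referenced by &#8211;.
code_points = [ord(c) for c in "entity: "] + [8211]

# Build a Unicode string from the integers, then encode it as UTF-8.
text = "".join(chr(cp) for cp in code_points)
utf8_bytes = text.encode("utf-8")

print(text)
print(utf8_bytes)
```

The point of the sketch is that the conversion starts from plain integers, not from character objects tied to a particular code page.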

--
You received this message because you are subscribed to a topic in the Google Groups "VAST Community Forum" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/va-smalltalk/GaWt5SemNYY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to va-smalltalk...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/8e509626-dc6f-4525-bfbe-ed6a6c262b9dn%40googlegroups.com.

Joachim Tuchel

Nov 8, 2023, 2:37:55 AM
to VAST Community Forum
Hi Richard,

I have to apologize, I should have read your comments more carefully. You had already mentioned the fact that it needs to be an Array of numbers.

So here is what I ended up with, thanks to your help:

|test parser dom dbString charArray outStr |
test := '<?xml version="1.0" encoding="UTF-8"?><test>Text with an xml entity: &#8211;</test>'.
parser := AbtXmlDOMParser newNonValidatingParser decodingEnabled: false .
dom := parser parse: test.
dbString := (dom getElementsByTagName: 'test') first contents first abrAsString.


charArray := dbString asArray collect: [:e| e value].
charArray asUnicodeString asSBString


You may wonder why I want the incoming UTF-8 to be converted to an SBString. That is because I render my Seaside components in ISO-8859-15, and convert the end result to UTF-8. One of the wonders I hope to get rid of with one of the upcoming VAST releases ;-)
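
A rough Python sketch of such a "render in ISO-8859-15, transcode at the end" pipeline is shown here purely as an illustration; it is not Seaside or VAST code, and the sample strings are made up. It also shows what happens to a character that the single-byte code page cannot hold.

```python
# Page content is first held in the single-byte code page...
page_latin9 = "Preis: 5 \u20ac".encode("iso-8859-15")

# ...and the finished output is transcoded to UTF-8 at the very end.
page_utf8 = page_latin9.decode("iso-8859-15").encode("utf-8")

print(len(page_latin9), len(page_utf8))

# Characters outside the code page are lost before the final step;
# here Python's 'replace' error handler substitutes "?".
lossy = "a \u2013 b".encode("iso-8859-15", errors="replace")
print(lossy)
```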

Again, thanks for your help!

Joachim

Esteban A. Maringolo

Nov 8, 2023, 8:31:07 AM
to va-sma...@googlegroups.com
Hi Joachim,

We've analyzed this particular case. Your error seemed to be related to a change in GRVASTUtf8Codec we introduced some time ago, but after a second check we realized it is related to something that has probably been there for a long time. In fact, prior to this change in GRVASTUtf8Codec, the output from the code was worse than this.

The AbtXmlDOMParser parses this XML entity as a character with a value above 255 (that's the purpose of using such entities), but at some point that character is converted to a string, which becomes a DBString because of its value. But DBStrings are meant to be used with double-byte code pages, not to hold Unicode code points, so this could be considered a misuse of DBString. And by the time it gets to the Grease UTF-8 codec, it's too late.
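
To make the numbers concrete, the following Python snippet (a language-neutral illustration using only Python's standard library, not VAST code) shows what the reference &#8211; from the sample document denotes and why it cannot live in a single-byte string. Nothing here reflects how AbtXmlDOMParser is implemented.

```python
import html
import unicodedata

# Resolve the numeric character reference from the sample document.
char = html.unescape("&#8211;")
code_point = ord(char)

print(code_point)                  # decimal value taken from the reference
print(unicodedata.name(char))      # official Unicode character name
print(char.encode("utf-8").hex())  # bytes a UTF-8 renderer has to emit

# A single-byte code page such as ISO-8859-15 has no slot for this character.
try:
    char.encode("iso-8859-15")
    fits_single_byte = True
except UnicodeEncodeError:
    fits_single_byte = False
print(fits_single_byte)
```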

So the current workaround might be something along the lines of what Richard suggested, or passing it to the codec:
((dom getElementsByTagName: 'test') first contents first abrAsString asArray collect: [:char | char value asUnicodeScalar]) asUnicodeString

To provide proper Unicode support for the parser, we're thinking of ways to enable Unicode (that is, to output UnicodeStrings) without disrupting existing users, as we did with other frameworks.

We'll keep you updated and thanks for reporting your finding,

Best regards,

Esteban Maringolo

Senior Software Developer

 emari...@instantiations.com
 @emaringolo
 /emaringolo
 instantiations.com



Adriaan van Os

Nov 13, 2023, 6:06:59 AM
to VAST Community Forum
Hi Joachim and others,

For the frameworks that are not yet updated for Unicode, we convert a Unicode code point to a local code page character through small modifications in those frameworks. We do so in all implementors of #parseCharacterHex. One can do the same for all relevant implementors of #characterReference:.

Cheers,
Adriaan.



Esteban A. Maringolo

Nov 13, 2023, 10:07:11 AM
to va-sma...@googlegroups.com
Adriaan,

What do you do when an XML entity cannot be mapped to the target codepage? Use the replacement character?

Regards,

Esteban Maringolo


Adriaan van Os

Nov 13, 2023, 10:11:31 AM
to VAST Community Forum
Yep, we just use $? as a replacement character.

Adriaan van Os

Nov 13, 2023, 12:42:29 PM
to VAST Community Forum
Esteban, I guess I can elaborate some more. We try to get the best possible character.

First we try the fast and strict version: (UnicodeString value: anInteger) asSBString firstOrNil.
If that gives nil we try the non-strict version: (((UnicodeString value: anInteger) convertToUtf8: String) convertFromCodePage: EsAbstractCodePageConverter utf8CodePage) firstOrNil.

If we don't like the result from above, we sometimes add a custom conversion. And when there is no suitable mapping, we use $?.

Because #parseCharacterHex works with Characters, we do use #firstOrNil. For #characterReference:, one could use the whole string. I imagine there are use-cases where you perhaps just want to maintain the entity ('&...;')?
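
The fallback order described above (strict conversion, then a custom conversion, then a replacement character) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the VAST implementation: the function name, the `CUSTOM_MAP` table and its en dash entry are hypothetical, and the intermediate "non-strict" code page conversion is not modeled.

```python
# Hypothetical per-application overrides; the en dash to hyphen mapping is
# only an example and is not taken from the thread.
CUSTOM_MAP = {0x2013: "-"}


def to_code_page_char(code_point, encoding="iso-8859-15",
                      custom=CUSTOM_MAP, replacement="?"):
    """Return the best single character the target code page can hold."""
    char = chr(code_point)
    try:
        # Strict attempt: succeeds only if the code page contains the character.
        char.encode(encoding)
        return char
    except UnicodeEncodeError:
        pass
    # Custom conversion next, then the replacement character as a last resort.
    return custom.get(code_point, replacement)


print(to_code_page_char(0x20AC))  # euro sign exists in ISO-8859-15
print(to_code_page_char(0x2013))  # en dash is handled by the custom map
print(to_code_page_char(0x4E2D))  # no mapping at all
```

Keeping the original entity text instead of a replacement character would be a small variation: return the '&#...;' string where the sketch currently returns `replacement`.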

Cheers,
Adriaan.

vcost...@gmail.com

Nov 14, 2023, 10:21:39 AM
to VAST Community Forum
It seems to me that the VAST XML parser is a subclass, if you will, of SGML. The correct fix is to generate a new parser that works with the Unicode changes but I doubt an ST parser generator exists to do that. That means we're left with trying to wrestle with the current parser and beat it into submission, if possible.

Adriaan van Os

Nov 19, 2023, 7:04:08 AM
to VAST Community Forum
BTW, Seaside/Grease in recent versions of the VAST Platform renders utf-8 output for UnicodeStrings just fine, thanks to a customer/community contribution. ;-)

Cheers,
Adriaan.

Adriaan van Os

Nov 19, 2023, 7:19:14 AM
to VAST Community Forum
That was a bit too optimistic. Besides a recent version of the VAST Platform, you need some or all of the attached methods.

Cheers,
Adriaan.
greaseInteger.st
seasideMimeDocument.st
encodeOn.st
renderOn.st

Esteban A. Maringolo

Nov 29, 2023, 2:57:04 PM
to va-sma...@googlegroups.com
Hi Adriaan, 

This customer contribution will likely go into the next version of VAST as well.

Do you have test cases for the methods you provided?

Thanks!

Esteban Maringolo


Adriaan van Os

Nov 29, 2023, 3:06:58 PM
to VAST Community Forum
Hi Esteban,

I provided a functional test with Case INST72197.

And yes, we are looking forward to the Unicode integration for the XML parser as well. :)

Cheers,
Adriaan.