Inlining raw XML


mgri...@gmail.com

Jun 30, 2023, 9:50:39 AM
to Woodstox User Mailing List
The code snippets were compiled and run with the following JDK and libraries:
"java.vm.vendor" -> "Oracle Corporation"
"java.specification.version" -> "17"
"java.runtime.version" -> "17.0.1+12-39"
Woodstox-core:6.4.0
jaxb-impl:2.3.3

While going through woodstox-core I found that com.ctc.wstx.sw.SimpleNsStreamWriter (produced via WstxOutputFactory) seems to be able to write "raw text" via writeRaw(String), which does not perform the escaping normally required for XML output. How do I consistently trigger this method with woodstox? With xerces (provided via the com.sun.xml.bind:jaxb-impl package) I can consistently inline XML by setting "escapeCharacters" to false, as follows:

import javax.xml.bind.*;
import javax.xml.bind.annotation.*;
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import java.io.ByteArrayOutputStream;

class Scratch {
    public static void main(String[] args) throws JAXBException, XMLStreamException {
        JAXBContext context = JAXBContext.newInstance(StringCont.class);
        StringCont content = new StringCont();
        content.it = "<xml>content</xml>";
        QName q = new QName("", "strwrap");
        JAXBElement<StringCont> jaxbElement = new JAXBElement<>(q, StringCont.class, content);
        Marshaller m = context.createMarshaller();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        outputFactory.setProperty("escapeCharacters", false);
        XMLStreamWriter delegate = outputFactory.createXMLStreamWriter(baos);
        m.marshal(jaxbElement, delegate);
        System.out.printf("%s%n", baos);
    }

    @XmlType(name = "strwrap")
    static class StringCont {
        @XmlElement
        String it;
    }
}

Output with xerces:

<?xml version='1.0' encoding='UTF-8'?>
<strwrap>
  <it><xml>content</xml></it>
</strwrap>

So what would be the equivalent of the escapeCharacters option in woodstox?

Woodstox itself does not recognize the property and fails fast when it is set.
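
For reference, the effect of the property can be reproduced without JAXB. The sketch below assumes that "escapeCharacters" is handled by the JDK's bundled StAX writer (the implementation XMLOutputFactory falls back to when no other StAX provider is on the classpath). The class and method names are made up for illustration, and newDefaultFactory() needs Java 9 or later.

```java
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringWriter;

class EscapeCharactersDemo {
    // Writes <it>text</it> with the JDK's bundled StAX writer.
    // "escapeCharacters" is an implementation-specific property, not part of the StAX API.
    static String write(String text, boolean escape) throws XMLStreamException {
        // newDefaultFactory() skips provider lookup, so Woodstox is never selected here.
        XMLOutputFactory factory = XMLOutputFactory.newDefaultFactory();
        if (!escape) {
            factory.setProperty("escapeCharacters", false);
        }
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = factory.createXMLStreamWriter(out);
        writer.writeStartElement("it");
        writer.writeCharacters(text);
        writer.writeEndElement();
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write("<xml>content</xml>", true));   // markup characters escaped
        System.out.println(write("<xml>content</xml>", false));  // text written as-is
    }
}
```

Because the property is set on the factory, every writer created from it afterwards inherits the setting.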


In addition, I've noticed an interesting difference. With escaping left enabled, xerces and woodstox produce different output. Using the previous Java snippet (sans the escapeCharacters option), xerces outputs the following:

<?xml version="1.0" ?>
<strwrap>
  <it>&lt;xml&gt;content&lt;/xml&gt;</it>
</strwrap>

Meanwhile, woodstox does not seem to escape the > character:

<?xml version='1.0' encoding='UTF-8'?>
<strwrap>
  <it>&lt;xml>content&lt;/xml></it>
</strwrap>

I suppose it is valid XML; I'm just a bit confused by the result.
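
On the > question: the XML specification only requires > to be escaped in character data when it appears as part of the ]]> sequence, so both outputs are well-formed and should carry the same text. One way to check is to parse both forms and compare the character data. This is a minimal sketch using the JDK's own StAX parser; the class and method names are illustrative.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

class GtRoundTrip {
    // Parses the document and returns all of its character data concatenated.
    static String textOf(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newDefaultFactory();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            // Text may arrive in several CHARACTERS events, so append each one.
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                text.append(reader.getText());
            }
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String bothEscaped = "<it>&lt;xml&gt;content&lt;/xml&gt;</it>";
        String ltOnly = "<it>&lt;xml>content&lt;/xml></it>";
        System.out.println(textOf(bothEscaped));
        System.out.println(textOf(ltOnly));
        System.out.println(textOf(bothEscaped).equals(textOf(ltOnly)));
    }
}
```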

mgri...@gmail.com

Jul 3, 2023, 5:26:19 AM
to Woodstox User Mailing List
Well, it seems that I can set my own text escaper factory, but sadly this is a global option on the output factory. As a result I will need to run two factories: one with the proper escaping behavior, and another that leaves certain fields unescaped.

        // Requires: com.ctc.wstx.api.WriterConfig, com.ctc.wstx.stax.WstxOutputFactory,
        // org.codehaus.stax2.io.EscapingWriterFactory, and java.io.*.
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        //outputFactory.setProperty("escapeCharacters", false);
        if (outputFactory instanceof WstxOutputFactory) {
            WriterConfig config = ((WstxOutputFactory) outputFactory).getConfig();
            // Replace the text escaper with pass-through writers that do no escaping.
            // This applies to every writer created from this factory.
            config.setTextEscaperFactory(new EscapingWriterFactory() {
                @Override
                public Writer createEscapingWriterFor(Writer w, String enc) throws UnsupportedEncodingException {
                    return new BufferedWriter(w) {
                        @Override
                        public void write(char[] cbuf, int off, int len) throws IOException {
                            super.write(cbuf, off, len);
                            flush(); // push the text through immediately
                        }
                    };
                }

                @Override
                public Writer createEscapingWriterFor(OutputStream out, String enc) throws UnsupportedEncodingException {
                    return new OutputStreamWriter(out, enc) {
                        @Override
                        public void write(char[] cbuf, int off, int len) throws IOException {
                            super.write(cbuf, off, len);
                            flush(); // push the text through immediately
                        }
                    };
                }
            });
        }

But even then, if I were to use an XmlAdapter to bridge the two, woodstox would post-process my unescaped XML and escape it again. Blegh.

Tatu Saloranta

Jul 10, 2023, 8:36:15 PM
to mgri...@gmail.com, Woodstox User Mailing List
Correct: Woodstox has no way to force unsafe writing of all String
values: the caller absolutely must use a specific call to bypass the
escaping required by the XML specification.
This is by design and not something I would want to change.

I am also rather curious as to what would be the need for such a
feature? It is not immediately obvious to me why one would even use
XML writer if escaping is to be applied manually.

-+ Tatu +-

mgri...@gmail.com

Jul 11, 2023, 5:55:06 AM
to Woodstox User Mailing List
> This is by design and not something I would want to change.

That's fair.

The application I'm migrating sort of "stitches" new XML documents out of existing ones. Suppose we have the following two documents:

<?xml>
<BucketList>
  <Goal id="wealth">Earn a million</Goal>
  <Goal id="health">Live to 100</Goal>
</BucketList>

<?xml>
<BucketList>
  <Goal id="travel">Visit Canada</Goal>
  <Goal id="nature">Build a park</Goal>
</BucketList>

The application receives a list of IDs that it should match against the existing documents to produce a new one. A process is already in place that extracts the "stitchable" elements, so the application just queries that. Suppose the input IDs are "travel" and "health"; the resulting document should then be as follows:

<?xml>
<BucketList>
  <Goal id="travel">Visit Canada</Goal>
  <Goal id="health">Live to 100</Goal>
</BucketList>


These are very simplified documents that I'm using as an example. Yes, for this particular example I could use JAXB to parse each goal into an object, build a new collection, and serialize that.
In reality the structure is unstable: the equivalent of the Goal element can have either text content or multiple other elements under it, much like HTML. This is mostly due to poor versioning practices and never having defined a schema for the documents. In addition, there are multiple versions of the structure out in the wild that we are expected to support at the same time. A closer example of the first document would be as follows:

<?xml>
<BucketList>
  <Goal id="wealth">Earn a million</Goal>
  <Goal id="health">
    <LifeExpectency>100</LifeExpectency>
  </Goal>
</BucketList>

And the final result should be

<?xml>
<BucketList>
  <Goal id="travel">Visit Canada</Goal>
  <Goal id="health">
    <LifeExpectency>100</LifeExpectency>
  </Goal>
</BucketList>

As a result, I've opted to query the existing system that holds all the "stitchable" parts to get them as strings, and to define only the BucketList element, with a text content field, via JAXB. I join the records using the Collectors#joining(String, String, String) collector, and voilà: I've replicated the behavior of the original system.

From your response I understand that I've misdirected you. The original intention was to use only one output factory (this functionality is only part of the application; the other parts that build XML documents are *fairly* consistent, so I could define equivalent JAXB classes for them). But as per my earlier comment, I found that the writer config is global to the factory and would affect all serializations, which isn't ideal as some containers expect escaped XML. My next idea was to somehow bridge the two output factories using the XmlAdapter feature provided by JAXB. Sadly, as per my earlier comment, I couldn't figure out how to bridge the two, and instead opted to choose one of the two factories depending on the annotations of the class I need to serialize.

    void serialize(Object target, Class<?> type, OutputStream outputStream) throws JAXBException, XMLStreamException {
        Marshaller marshaller = jaxbContext.createMarshaller();
        // Containers annotated with @InlineRawXml expect unescaped XML in the output,
        // so they get the unsafe factory; everything else uses the safe one.
        XMLStreamWriter writer;
        if (type.getAnnotation(InlineRawXml.class) == null) {
            writer = outputFactory.createXMLStreamWriter(outputStream);
        } else {
            writer = unescapedOutputFactory.createXMLStreamWriter(outputStream);
        }
        String qNameKey = type.getName();
        // Raw type: drops the generic information to keep the compiler from complaining.
        Class adapted = type;
        // Look up the element name in the prebuilt QName cache.
        QName qName = this.qnameConfigs.get(qNameKey);
        // Wrap the object in a JAXBElement because there can be multiple roots.
        JAXBElement<Object> jaxbElement = new JAXBElement<>(qName, adapted, target);
        marshaller.marshal(jaxbElement, writer);
    }

Of course, the final function hand-waves a lot of detail, such as how the JAXBContext, the QName cache, and so on are built.
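
One detail the function depends on: type.getAnnotation(...) only sees annotations declared with RUNTIME retention. The default retention is CLASS, in which case the lookup returns null for every type and the safe factory is always chosen. A minimal sketch, assuming InlineRawXml is a plain marker annotation (its real definition is not shown in the thread, and the two container classes are hypothetical):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class InlineRawXmlCheck {
    // Assumed definition: RUNTIME retention makes the marker visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface InlineRawXml {}

    // Hypothetical container whose content is pre-built XML.
    @InlineRawXml
    static class StitchedBucketList {}

    // Hypothetical container that needs normal escaping.
    static class RegularContainer {}

    // True when the type should be serialized with the unescaped factory.
    static boolean needsUnescapedFactory(Class<?> type) {
        return type.isAnnotationPresent(InlineRawXml.class);
    }

    public static void main(String[] args) {
        System.out.println(needsUnescapedFactory(StitchedBucketList.class));
        System.out.println(needsUnescapedFactory(RegularContainer.class));
    }
}
```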

I am aware this is a stupid approach, but it was my attempt to bring some structural stability to a process that used to manipulate XML strings directly on the output stream.
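
If each stitchable part is a well-formed element, one possible alternative that stays within a single, escaping factory is to copy the parts as StAX events instead of as raw strings. This is not what the code above does. It is only a sketch using the JDK's StAX event API, with made-up class and method names, and it assumes each fragment parses on its own as a single-rooted document.

```java
import javax.xml.XMLConstants;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;

class FragmentStitcher {
    // Wraps the given fragments in a root element by replaying their parse events.
    static String stitch(String rootName, List<String> fragments) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newDefaultFactory();
        XMLOutputFactory outputFactory = XMLOutputFactory.newDefaultFactory();
        XMLEventFactory events = XMLEventFactory.newDefaultFactory();

        StringWriter out = new StringWriter();
        XMLEventWriter writer = outputFactory.createXMLEventWriter(out);
        writer.add(events.createStartElement(
                XMLConstants.DEFAULT_NS_PREFIX, XMLConstants.NULL_NS_URI, rootName));
        for (String fragment : fragments) {
            // A malformed fragment fails here instead of corrupting the output.
            XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(fragment));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                // Skip the per-fragment document markers; copy everything else.
                if (!event.isStartDocument() && !event.isEndDocument()) {
                    writer.add(event);
                }
            }
            reader.close();
        }
        writer.add(events.createEndElement(
                XMLConstants.DEFAULT_NS_PREFIX, XMLConstants.NULL_NS_URI, rootName));
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(stitch("BucketList", List.of(
                "<Goal id=\"travel\">Visit Canada</Goal>",
                "<Goal id=\"health\"><LifeExpectency>100</LifeExpectency></Goal>")));
    }
}
```

The trade-off is that every fragment is parsed once more, and the sketch adds no indentation, but text content keeps going through the writer's normal escaping.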
