Jackson and XHTML

st...@heliossoftware.com

Mar 13, 2017, 1:25:34 PM
to jackson-user
Is it possible to use Jackson to parse XHTML? I am trying to parse this fragment, and the inline <i></i> and <b></b> tags are giving me some problems.

     <div xmlns="http://www.w3.org/1999/xhtml">
     <p>
       This is an <i>example</i> with some <b>xhtml</b> formatting.
     </p>
     </div>

I'm getting an exception:
java.io.IOException: Expected END_ELEMENT, got event of type 1

Is there a way to configure Jackson to make this work?

Thanks,
Steve

Tatu Saloranta

Mar 13, 2017, 1:33:20 PM
to jackson-user
The Jackson XML backend does not really support mixed content, that is, a content model that has both non-whitespace text and elements. This is difficult to represent with data binding, and is mostly handled with XML-centric models like DOM.

There has been some talk about exposing this in some form or fashion, and I think there's an open issue or two.
But I am not aware of a particularly clean design for exposing this; it seems fundamentally at odds with typical POJOs, which do not cater for the XML infoset.

-+ Tatu +-
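
To make "mixed content" concrete, here is a minimal sketch that uses the JDK's DOM parser (not Jackson) on the fragment from the question. The class and method names are made up for illustration. It lists the direct children of <p>, where text nodes and elements are interleaved; that interleaving is the shape a typical POJO has no natural slot for.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo class; nothing here is part of Jackson.
final class MixedContentDemo {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    static final String XHTML =
        "<div xmlns=\"" + XHTML_NS + "\">"
        + "<p>This is an <i>example</i> with some <b>xhtml</b> formatting.</p>"
        + "</div>";

    /** Describes each direct child of the first <p> as "text:..." or "element:...". */
    static List<String> describeParagraphChildren(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Node p = doc.getElementsByTagNameNS(XHTML_NS, "p").item(0);
            List<String> out = new ArrayList<>();
            NodeList children = p.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    out.add("text:" + child.getNodeValue());
                } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                    out.add("element:" + child.getLocalName());
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        describeParagraphChildren(XHTML).forEach(System.out::println);
    }
}
```

For the fragment above this reports five children: text, <i>, text, <b>, text.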


--
You received this message because you are subscribed to the Google Groups "jackson-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jackson-user+unsubscribe@googlegroups.com.
To post to this group, send email to jackso...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

st...@heliossoftware.com

Mar 13, 2017, 1:38:39 PM
to jackson-user
Thank you, Tatu. If I were to try to override this behavior, where should I look in Jackson? For example, if I wanted Jackson to skip over or ignore certain tags like <i>, </i>, <b> and </b>, where should I start?



Tatu Saloranta

Mar 13, 2017, 4:27:47 PM
to jackson-user
To be completely honest, I don't think you can easily modify components
to do that, since some pieces (such as FromXmlParser) are constructed by others.
Your best bet may be to pre-process the content. But beyond that, how
would, and should, the data be mapped?
JsonNode does not work that well with XML content (it is not
officially supported, although it does work for some cases), so ideally
the result would be a POJO. But how would separate text (CDATA) segments
be bound?

-+ Tatu +-
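
One way to read "pre-process content" is to wrap the inside of the XHTML <div> in a CDATA section before any parser sees it, so the markup arrives as a single text value. The sketch below does this on a String and checks the result with the JDK's StAX reader rather than Jackson. The class and method names are hypothetical, and it assumes the XHTML <div> is the outermost <div> in the input and that the content does not contain "]]>". As an aside, START_ELEMENT is event type 1 in StAX, which suggests the exception in the original question came from hitting <i> where only text was expected.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; a String-level stand-in for pre-processing the input.
final class CdataPreprocessDemo {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /**
     * Wraps everything between the end of the opening tag that declares the
     * XHTML namespace and the last </div> in a CDATA section.
     * Returns the input unchanged if either marker is missing.
     */
    static String wrapXhtmlDivInCdata(String xml) {
        int ns = xml.indexOf(XHTML_NS);
        if (ns < 0) {
            return xml;
        }
        int open = xml.indexOf('>', ns);           // end of the opening <div ...> tag
        int close = xml.lastIndexOf("</div>");     // assumed to close that same <div>
        if (open < 0 || close <= open) {
            return xml;
        }
        return xml.substring(0, open + 1)
            + "<![CDATA[" + xml.substring(open + 1, close) + "]]>"
            + xml.substring(close);
    }

    /** Reads the root element as text-only content; throws if it has child elements. */
    static String rootElementText(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.next() != XMLStreamConstants.START_ELEMENT) {
                // skip anything before the root element
            }
            return reader.getElementText();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("not a text-only element", e);
        }
    }
}
```

After wrapping, rootElementText returns the paragraph markup as one String, which is the shape a String property on a POJO can hold.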

st...@heliossoftware.com

Mar 13, 2017, 4:38:30 PM
to jackson-user
Thanks Tatu.  I was looking through the code, and noticed InputDecorator.  I'm going to try to decorate the formatting tags differently and see how that goes.  I really just want the contents of that <div> as a String in my POJO anyway, so I might be able to decorate the formatting tags away before parsing, then re-insert them later.  I'll let you know how it works out.
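
A rough sketch of what "decorating the formatting tags away" could look like, under the assumption that only attribute-less <i> and <b> tags need handling: escape them to entities before parsing, so the parser treats them as text. The class name and tag list are hypothetical, and the check uses the JDK's DOM parser rather than Jackson. With this particular encoding the parser decodes the entities itself, so the tags reappear in the text without a separate re-insert step.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper illustrating the escape-before-parse idea.
final class InlineTagEscapeDemo {

    // Matches <i>, </i>, <b>, </b> only; tags with attributes are not covered.
    private static final Pattern INLINE_TAG = Pattern.compile("<(/?)(i|b)>");

    /** Rewrites the inline tags as entities, e.g. <i> becomes &lt;i&gt;. */
    static String escapeInlineTags(String xml) {
        return INLINE_TAG.matcher(xml).replaceAll("&lt;$1$2&gt;");
    }

    /** Returns the concatenated text content of the document's root element. */
    static String textOfRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```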

Tatu Saloranta

Mar 15, 2017, 7:34:31 PM
to jackson-user
That does sound like a possible path, as
InputDecorator/OutputDecorator allow wrapping of parser/generator
using delegation.
An implementation of such a wrapper can extend
JsonParserDelegate/JsonGeneratorDelegate (or the sub-classes
FilteringParserDelegate/FilteringGeneratorDelegate), and those are
designed to allow efficient, if not convenient, removal/addition of
low-level tokens/events.

-+ Tatu +-

Steve Munini

Mar 15, 2017, 7:36:19 PM
to jackso...@googlegroups.com
Hi Tatu,

Thank you so much for your help. It worked! I implemented an InputDecorator, which appears to be working now. Thank you!

Steve Munini 





Tatu Saloranta

Mar 15, 2017, 8:24:21 PM
to jackson-user
I am very happy to have my pessimism proven unwarranted :)
Great that it works for your use case.

-+ Tatu +-

P.S. I don't think anyone has written about such an approach, so if you
wanted to write a blog post or article about it, that would
probably be well received.
Alek Modi

Jan 17, 2020, 5:31:49 PM
to jackson-user
Hi Steve, glad that you found a workaround for this issue.

I have the exact same issue: I am expecting simple text, but my client is sending text with HTML tags in the string value, which causes Jackson to fail.

Could you please share the snippet of code that worked for you?

Thanks, Alek.


Steve Munini

Jan 21, 2020, 8:57:25 AM
to jackso...@googlegroups.com
Hi Alek, here you go...

package com.heliossoftware.fhir.utils;

import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.LinkedList;

import com.fasterxml.jackson.core.io.IOContext;
import com.fasterxml.jackson.core.io.InputDecorator;

/**
 * Wraps the content of an element that declares the XHTML namespace in a CDATA
 * section, so the XML parser reports it as plain text rather than mixed content.
 * Limitation: content that itself contains "]]>" would end the CDATA section early.
 */
public class XmlInputDecorator extends InputDecorator {

    private static final long serialVersionUID = -408865289583579353L;

    @Override
    public InputStream decorate(IOContext ctxt, InputStream in) throws IOException {
        return new ReplacingInputStream(in);
    }

    @Override
    public InputStream decorate(IOContext ctxt, byte[] src, int offset, int length) throws IOException {
        // Run byte[] sources through the same filter as streams.
        return new ReplacingInputStream(new ByteArrayInputStream(src, offset, length));
    }

    @Override
    public Reader decorate(IOContext ctxt, Reader r) throws IOException {
        // Character sources are passed through unchanged.
        return r;
    }

    static class ReplacingInputStream extends FilterInputStream {

        // Look-ahead bytes read from the source but not yet examined.
        private final LinkedList<Integer> inQueue = new LinkedList<Integer>();
        // Bytes (and possibly a trailing -1) ready to be handed to the caller.
        private final LinkedList<Integer> outQueue = new LinkedList<Integer>();

        private final byte[] search = "http://www.w3.org/1999/xhtml".getBytes(StandardCharsets.US_ASCII);
        private final byte[] slashDiv = "</div>".getBytes(StandardCharsets.US_ASCII);
        private final byte[] cdataStart = "<![CDATA[".getBytes(StandardCharsets.US_ASCII);
        private final byte[] cdataEnd = "]]>".getBytes(StandardCharsets.US_ASCII);

        protected ReplacingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // This is the variant that gets called; it fills the buffer via read().
            if (len == 0) {
                return 0;
            }
            int numBytesRead = 0;
            while (numBytesRead < len) {
                int next = read();
                if (next == -1) {
                    break;
                }
                b[off + numBytesRead] = (byte) next;
                numBytesRead++;
            }
            // Report end of stream as -1, not as a zero-length read.
            return (numBytesRead == 0) ? -1 : numBytesRead;
        }

        @Override
        public int read() throws IOException {
            if (outQueue.isEmpty()) {
                readAhead(search.length);
                if (lookAheadMatches(search)) {
                    // Emit the namespace URI, then the rest of the opening tag up to '>'.
                    for (int i = 0; i < search.length; i++) {
                        outQueue.add(inQueue.remove());
                    }
                    int next;
                    do {
                        next = super.read();
                        outQueue.offer(next);
                    } while (next != -1 && next != '>');

                    if (next != -1) {
                        enqueue(cdataStart);
                        closeCdataBeforeMatchingDiv();
                    }
                } else {
                    outQueue.add(inQueue.remove());
                }
            }
            return outQueue.remove();
        }

        // Copies bytes to the output until the </div> that balances the nested
        // <div>s seen so far, and inserts "]]>" just before it.
        private void closeCdataBeforeMatchingDiv() throws IOException {
            while (true) {
                readAhead(slashDiv.length);
                if (lookAheadMatches(slashDiv) && countDivs() == countSlashDivs()) {
                    enqueue(cdataEnd);
                    outQueue.addAll(inQueue);
                    inQueue.clear();
                    return;
                }
                int next = inQueue.remove();
                outQueue.add(next);
                if (next == -1) {
                    // Input ended without a matching </div>; stop rather than loop forever.
                    return;
                }
            }
        }

        // Fills the look-ahead queue up to the given size, or until end of input.
        private void readAhead(int size) throws IOException {
            while (inQueue.size() < size && (inQueue.isEmpty() || inQueue.getLast() != -1)) {
                inQueue.offer(super.read());
            }
        }

        private boolean lookAheadMatches(byte[] pattern) {
            Iterator<Integer> inIter = inQueue.iterator();
            for (byte expected : pattern) {
                if (!inIter.hasNext() || expected != inIter.next()) {
                    return false;
                }
            }
            return true;
        }

        private void enqueue(byte[] bytes) {
            for (byte b : bytes) {
                outQueue.offer((int) b);
            }
        }

        // Nested <div> openings queued so far; a self-closing <div/> has no effect.
        private int countDivs() {
            return countInOutput("<div") - countInOutput("<div/>");
        }

        private int countSlashDivs() {
            return countInOutput("</div>");
        }

        // Counts non-overlapping occurrences of an ASCII pattern in the output queue.
        private int countInOutput(String pattern) {
            StringBuilder queued = new StringBuilder(outQueue.size());
            for (int b : outQueue) {
                queued.append((char) b);
            }
            int count = 0;
            int at = queued.indexOf(pattern);
            while (at != -1) {
                count++;
                at = queued.indexOf(pattern, at + pattern.length());
            }
            return count;
        }
    }
}





To view this discussion on the web visit https://groups.google.com/d/msgid/jackson-user/2cb756fa-e6aa-4da3-8a19-d39b1d87a910%40googlegroups.com.