org.w3c.dom.DOMException: NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist.

Byte Array

Aug 8, 2013, 1:22:54 PM

to tagsoup...@googlegroups.com

Hello!

I run into a problem when I convert HTML held in a byte[] into a String and pass it to the TagSoup parser.

Oddly, when I store that same HTML in a file and read it with FileReader, it works.

Is it an encoding problem? How could I work around it?

...

String htmlContent = new String( ((Content) value).getContent() );

tagsoupParser.parse(new InputSource(new StringReader(htmlContent)));

...

results in:

org.w3c.dom.DOMException: NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist.

Thanks for any suggestions,

Regards

John Cowan

Aug 8, 2013, 2:11:08 PM

to Byte Array, tagsoup...@googlegroups.com

Byte Array scripsit:

> I have problems when converting HTML in byte [] form into string and
> passing it to Tagsoup Parser. Oddly, When I store that same HTML
> into file and read it with FileReader it works. Is it an encoding
> problem? How could I work around it?

It may well be an encoding problem: new String(byte[]) converts the
bytes using the platform's default encoding, typically 8859-1 or Windows-1252.
However, without seeing the bytes I have no hope of saying for sure.
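
To make the point about the default encoding concrete, here is a minimal sketch that decodes the bytes with an explicit charset instead of whatever the JVM defaults to. The class and method names are illustrative, not from the original code, and the sketch assumes the content really is UTF-8, which has not been established at this point in the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    /** Decodes bytes with an explicit charset instead of the platform default. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample markup with one non-ASCII character, encoded as UTF-8.
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println("default charset: " + Charset.defaultCharset());
        // This result depends on the machine the code runs on.
        System.out.println("platform decode: " + new String(bytes));
        // This result is the same everywhere.
        System.out.println("explicit UTF-8:  " + decodeUtf8(bytes));
    }
}
```

If the two decoded lines differ on a given machine, the default charset there is not UTF-8, and new String(byte[]) is a likely suspect.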

--
John Cowan co...@ccil.org http://ccil.org/~cowan
[R]eversing the apostolic precept to be all things to all men, I usually [before
Darwin] defended the tenability of the received doctrines, when I had to do
with the [evolution]ists; and stood up for the possibility of [evolution] among
the orthodox --thereby, no doubt, increasing an already current, but quite
undeserved, reputation for needless combativeness. --T. H. Huxley

Byte Array

Aug 8, 2013, 2:34:11 PM

to tagsoup...@googlegroups.com, Byte Array, co...@mercury.ccil.org

Thanks,

new String(content, "UTF-8") failed,

new String(content, "UTF-16") produces no exceptions but the parser still fails to find elements.
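
One plausible reading of the UTF-16 result (an assumption, since the bytes had not been posted yet): if the content is single-byte ASCII or UTF-8, decoding it as UTF-16 fuses each pair of bytes into one unrelated character. new String(...) does not throw on such input, so there is no exception, but no '<' survives for the parser to find. A small sketch with a hypothetical class name:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Misdecode {
    /** Decodes the bytes as UTF-16; malformed input is replaced, not reported. */
    static String asUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // Six ASCII bytes: 3c 68 74 6d 6c 3e
        byte[] ascii = "<html>".getBytes(StandardCharsets.US_ASCII);
        String wrong = asUtf16(ascii);
        // Each pair of bytes becomes a single 16-bit character.
        System.out.println("chars after UTF-16 decode: " + wrong.length());
        System.out.println("contains '<': " + (wrong.indexOf('<') >= 0));
    }
}
```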

Regards

John Cowan

Aug 8, 2013, 2:42:38 PM

to Byte Array, tagsoup...@googlegroups.com

Byte Array scripsit:

> new String(content, "UTF-8") failed,
> new String(content, "UTF-16") produces no exceptions but the parser still
> fails to find elements.

I have no clue.

--
John Cowan co...@ccil.org http://ccil.org/~cowan

The penguin geeks is happy / As under the waves they lark
The closed-source geeks ain't happy / They sad cause they in the dark
But geeks in the dark is lucky / They in for a worser treat
One day when the Borg go belly-up / Guess who wind up on the street.

Byte Array

Aug 9, 2013, 3:04:25 AM

to tagsoup...@googlegroups.com, Byte Array, co...@mercury.ccil.org

Hello,

I tried to dump some bytes of the HTML in the following manner (hopefully without conversion):

BufferedOutputStream bos = new BufferedOutputStream(fs.create(new Path(bytearrayDump), true));

bos.write(con.getContent());

bos.flush();

bos.close();

Part of the byte array dump is:

0000000: 3c 21 44 4f 43 54 59 50 45 20 48 54 4d 4c 20 50 <!DOCTYPE HTML P

0000010: 55 42 4c 49 43 20 22 2d 2f 2f 57 33 43 2f 2f 44 UBLIC "-//W3C//D

0000020: 54 44 20 48 54 4d 4c 20 34 2e 30 31 2f 2f 45 4e TD HTML 4.01//EN

0000030: 22 20 22 68 74 74 70 3a 2f 2f 77 77 77 2e 77 33 " "http://www.w3

0000040: 2e 6f 72 67 2f 54 52 2f 68 74 6d 6c 34 2f 73 74 .org/TR/html4/st

0000050: 72 69 63 74 2e 64 74 64 22 3e 0a 3c 68 74 6d 6c rict.dtd">.<html

0000060: 20 78 6d 6c 6e 73 3a 66 62 3d 22 68 74 74 70 3a xmlns:fb="http:

0000070: 2f 2f 77 77 77 2e 66 61 63 65 62 6f 6f 6b 2e 63 //www.facebook.c

0000080: 6f 6d 2f 32 30 30 38 2f 66 62 6d 6c 22 20 78 6d om/2008/fbml" xm

0000090: 6c 6e 73 3a 6f 67 3d 22 68 74 74 70 3a 2f 2f 6f lns:og="http://o

00000a0: 70 65 6e 67 72 61 70 68 70 72 6f 74 6f 63 6f 6c pengraphprotocol

00000b0: 2e 6f 72 67 2f 73 63 68 65 6d 61 2f 22 3e 0a 3c .org/schema/">.<

00000c0: 68 65 61 64 3e 0a 3c 6d 65 74 61 20 68 74 74 70 head>.<meta http

00000d0: 2d 65 71 75 69 76 3d 22 63 6f 6e 74 65 6e 74 2d -equiv="content-

00000e0: 74 79 70 65 22 20 63 6f 6e 74 65 6e 74 3d 22 74 type" content="t

00000f0: 65 78 74 2f 68 74 6d 6c 3b 20 63 68 61 72 73 65 ext/html; charse

0000100: 74 3d 75 74 66 2d 38 22 2f 3e 0a 3c 6c 69 6e 6b t=utf-8"/>.<link

...

I hope it tells you something.
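
The dump is plain ASCII so far and the meta tag declares charset=utf-8, so one option is to skip the String conversion altogether and hand the parser the raw bytes with the encoding stated explicitly. The sketch below uses only the standard org.xml.sax.InputSource API; the class and method names are hypothetical, and it assumes the parser honors the encoding set on the InputSource.

```java
import java.io.ByteArrayInputStream;
import org.xml.sax.InputSource;

public class ByteSource {
    /** Wraps raw bytes in an InputSource and labels them with the given encoding. */
    static InputSource fromBytes(byte[] content, String encoding) {
        InputSource source = new InputSource(new ByteArrayInputStream(content));
        source.setEncoding(encoding);
        return source;
    }

    public static void main(String[] args) throws Exception {
        // First 16 bytes of the dump above.
        byte[] head = {0x3c, 0x21, 0x44, 0x4f, 0x43, 0x54, 0x59, 0x50,
                       0x45, 0x20, 0x48, 0x54, 0x4d, 0x4c, 0x20, 0x50};
        System.out.println(new String(head, "UTF-8"));
        InputSource source = fromBytes(head, "UTF-8");
        System.out.println(source.getEncoding());
        // tagsoupParser.parse(source);  // the parser object from the original post
    }
}
```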

Thanks a lot,

Regards

John Cowan

Aug 9, 2013, 9:37:50 AM

to Byte Array, tagsoup...@googlegroups.com

Byte Array scripsit:

> I tried to dump some bytes of the HTML in the following manner (hopefully
> without conversion):
> BufferedOutputStream bos = new BufferedOutputStream(fs.create(new
> Path(bytearrayDump), true));
> bos.write(con.getContent());
> bos.flush();
> bos.close();

Well, your error is coming from the DOM, and TagSoup doesn't use the DOM
in any way, so that must be in your application. Try a stripped-down case.
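
A stripped-down case along these lines could look like the sketch below: it drives a SAX reader with a bare counting handler and no DOM or transformer in the chain, so any remaining failure would have to come from the parser itself. The class and method names are hypothetical. To keep the sketch runnable without extra JARs, main uses the JDK's own SAX parser as a stand-in; passing a TagSoup Parser instead would exercise TagSoup.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StrippedDownCase {
    /** Parses the markup with the given SAX reader and counts start-element events. */
    static int countElements(XMLReader reader, String markup) throws Exception {
        final int[] count = {0};
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK; swap in
        // new org.ccil.cowan.tagsoup.Parser() to test TagSoup itself.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println(countElements(reader, "<html><body><p>hi</p></body></html>"));
    }
}
```

If this works with the real content but the full application still fails, the DOM-building step is the place to look.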

--
John Cowan co...@ccil.org http://www.ccil.org/~cowan
As we all know, civil libertarians are not the friskiest group around
--comes from forever being on the qui vive for the sound of jack-booted
fascism coming down the pike. --Molly Ivins

Fuad Efendi

Sep 25, 2013, 12:06:09 AM

to tagsoup...@googlegroups.com

Hello,

RE: Caused by: org.w3c.dom.DOMException: NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist.

TagSoup 1.2.1

I had the same issue with Java 7. In my case the byte array was decoded correctly, so decoding was not the problem.

The problem was caused by "xercesImpl.jar" on my classpath, which was a dependency of the latest and "greatest" Apache Tika 1.4.

After I added it to the "exclusions" section of my Maven POM, the problem disappeared.
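
For anyone who wants to check whether a stray XML implementation is being picked up, here is a rough diagnostic sketch that prints which classes JAXP selects and where they were loaded from. The class and method names are hypothetical, and the sketch only reports what is loaded; it does not fix anything.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;

public class XmlImplProbe {
    /** Describes the object's class and, when available, where it was loaded from. */
    static String describe(Object impl) {
        Class<?> type = impl.getClass();
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes bundled with the JDK typically report no code source.
        String origin = (source == null || source.getLocation() == null)
                ? "JDK (no code source)"
                : source.getLocation().toString();
        return type.getName() + " from " + origin;
    }

    public static void main(String[] args) {
        System.out.println(describe(DocumentBuilderFactory.newInstance()));
        System.out.println(describe(TransformerFactory.newInstance()));
    }
}
```

If the DOM factory line names a class and a JAR outside the JDK, that JAR is a candidate for the exclusion described above.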

NOTE: I am going to avoid depending on Apache Tika in the future. It pulls in too much unnecessary functionality, which is not the best approach.

This is what I use:

package org.tokenizer.core.parser;

import java.io.ByteArrayOutputStream;

import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;

import org.apache.commons.lang.StringUtils;
import org.ccil.cowan.tagsoup.HTMLSchema;
import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.Schema;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.w3c.dom.ls.LSSerializerFilter;
import org.w3c.dom.traversal.NodeFilter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

public class HtmlParser {

    private static final LSSerializerFilter defaultLSSerializerFilter = new OutputFilter();
    private static final Logger LOG = LoggerFactory.getLogger(HtmlParser.class);
    /**
    * HTML schema singleton used to amortise the heavy instantiation time.
    */
    private static final Schema HTML_SCHEMA = new HTMLSchema();

    public static Document parse(final InputSource inputSource) {
        try {
            Parser parser = new Parser();
            parser.setFeature(Parser.namespacesFeature, false);
            parser.setFeature(Parser.namespacePrefixesFeature, false);
            parser.setProperty(Parser.schemaProperty, HTML_SCHEMA);
            parser.setFeature(Parser.ignoreBogonsFeature, true);

            SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory
                    .newInstance();

            TransformerHandler transformerHandler = stf.newTransformerHandler();

            DOMResult domResult = new DOMResult();
            transformerHandler.setResult(domResult);
            parser.setContentHandler(transformerHandler);
            parser.parse(inputSource);

            return (Document) domResult.getNode();

        } catch (SAXNotRecognizedException e) {
            LOG.error(StringUtils.EMPTY, e);
        } catch (SAXNotSupportedException e) {
            LOG.error(StringUtils.EMPTY, e);
        } catch (TransformerConfigurationException e) {
            LOG.error(StringUtils.EMPTY, e);
        } catch (TransformerFactoryConfigurationError e) {
            LOG.error(StringUtils.EMPTY, e);
        } catch (Throwable t) {
            LOG.error("", t);
        }

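        // Every failure is logged above; callers must be prepared for a null result.
        return null;
    }

    // The rest of the class, including the OutputFilter referenced above, was not included in the post.
}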

Thanks!

http://www.tokenizer.ca

Fuad Efendi
