java.io.IOException: Resetting to invalid mark in BOMNewlineEncodingDetector.detectBomInternal

Vincenzo Turco

Nov 4, 2013, 6:14:47 AM

to okapi...@googlegroups.com

Hi all,

I'm trying to re-use the net.sf.okapi.common.BOMNewlineEncodingDetector to detect the encoding of files.

So, very simply I create an instance of the object and then try to detect the encoding like this:

BOMNewlineEncodingDetector detector = new BOMNewlineEncodingDetector(fileInputStream);

detector.detectBom();

String encoding = detector.getEncoding();

Upon invoking detectBom I get an exception like:

java.io.IOException: Resetting to invalid mark

at java.io.BufferedInputStream.reset(Unknown Source)

at net.sf.okapi.common.BOMNewlineEncodingDetector.detectBomInternal(BOMNewlineEncodingDetector.java:594)

at net.sf.okapi.common.BOMNewlineEncodingDetector.detectBom(BOMNewlineEncodingDetector.java:349)

I've passed in 7 different files with different encodings, and all get the exception.

However, the detection algorithm in detectBom() still runs to completion and produces a result.

Please note that hasBom() returns false for every file in the set.

Can anyone please shed some light on why this is happening and whether we can safely ignore the exception?

Thanks, regards

Vincenzo

Yves Savourel

Nov 4, 2013, 7:20:31 AM

to okapi...@googlegroups.com

Hi Vincenzo,

I think your issue is related to the input fileInputStream: somehow it’s not able to be reset. Maybe the underlying file is closed, or not open.

...Or it’s related to not calling mark():

Jim: I don’t see any call to mark() in the BOM detector itself. I see it set in the RawDocument.createStream() method.

But at the same time the unit tests for the BOM detector don’t use RawDocument and they pass. But maybe it’s because they use only ByteArrayInputStream and that type of stream allows reset without mark?

-ys
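On the question of whether ByteArrayInputStream allows reset() without mark(): the small sketch below suggests it does, because its mark defaults to the start of the array, whereas a BufferedInputStream with no mark set throws. The class and method names are hypothetical and only serve the illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ResetWithoutMarkDemo {

    /** Sentinel returned when reset() throws; it cannot collide with a byte value or EOF. */
    public static final int RESET_FAILED = Integer.MIN_VALUE;

    /** Reads two bytes, calls reset() with no prior mark(), then returns the next byte read. */
    public static int nextByteAfterUnmarkedReset(InputStream in) {
        try {
            in.read();
            in.read();
            in.reset();
            return in.read();
        } catch (IOException e) {
            return RESET_FAILED;
        }
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30, 40};
        // ByteArrayInputStream: the mark defaults to the start of the array, so reset() rewinds.
        System.out.println(nextByteAfterUnmarkedReset(new ByteArrayInputStream(data)));
        // BufferedInputStream: no mark has been set, so reset() throws an IOException.
        System.out.println(nextByteAfterUnmarkedReset(
                new BufferedInputStream(new ByteArrayInputStream(data))) == RESET_FAILED);
    }
}
```

This would explain why unit tests built only on ByteArrayInputStream can pass while a wrapped file stream fails.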

--
You received this message because you are subscribed to the Google Groups "okapi-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-devel...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Vincenzo Turco

Nov 4, 2013, 9:12:32 AM

to okapi...@googlegroups.com

Hi Yves,

thanks for your prompt feedback.

In my environment different files get different encodings, so I guess the algorithm is executed even if there's an exception in the finally block (the Eclipse debugger confirms that, too).

As a consequence, I guess that one can safely use the output of the processing (i.e. the detected encoding). Is that correct?

Thanks, regards

Vincenzo

Yves Savourel

Nov 4, 2013, 9:56:05 AM

to okapi...@googlegroups.com

The result of the guess is likely correct.

But you will likely have problems at some point if you ignore the exception. First, the exception may be caused by something other than reset() not working. Second, it means you'll start processing your input after some bytes have already been read.

A better workaround would probably be to set the mark before calling the BOM detector, so the exception doesn't occur.

After we get Jim’s opinion we may also have to change the way the BOM detector works, or at the minimum document that a mark should be set before calling it.
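A minimal sketch of that workaround, assuming the detector only reads ahead and then calls reset(): wrap the FileInputStream in a BufferedInputStream and set the mark on that wrapper before handing it over. The peekUtf8Bom helper is a hypothetical stand-in for the detector, not the Okapi API, and the lookahead value is an assumption.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MarkBeforeDetectDemo {

    /** Assumed lookahead budget; the thread mentions 8192, 1024 and 512 as candidates. */
    static final int LOOKAHEAD = 1024;

    /** Hypothetical stand-in for the detector: reads up to three bytes, then calls reset(). */
    static boolean peekUtf8Bom(InputStream in) throws IOException {
        byte[] head = new byte[3];
        int total = 0;
        while (total < head.length) {
            int n = in.read(head, total, head.length - total);
            if (n < 0) break;
            total += n;
        }
        in.reset(); // throws if the caller never set a valid mark
        return total == 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
    }

    /** Wraps the file stream, marks it, runs the peek, and returns the first byte read afterwards. */
    public static int firstByteAfterPeek(Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file.toFile()))) {
            in.mark(LOOKAHEAD); // set the mark on the stream the detector will actually read
            peekUtf8Bom(in);
            return in.read();   // the stream is back at byte 0
        }
    }

    /** Runs the demo against a temporary file that starts with a UTF-8 BOM. */
    public static int demo() {
        try {
            Path tmp = Files.createTempFile("bom-demo", ".txt");
            try {
                Files.write(tmp, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'});
                return firstByteAfterPeek(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(demo()));
    }
}
```

With the mark in place, processing after detection starts at the first byte again rather than somewhere past the bytes the detector consumed.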

Jim Hargrave

Nov 4, 2013, 10:23:49 AM

to okapi...@googlegroups.com

We set the mark here (and convert to a mark supported stream if needed) - but perhaps the mark position is too large? 
I'd say we should reduce it to 1024 or even 512 - that should be enough bytes to detect the info we need.

public BOMNewlineEncodingDetector(final InputStream inputStream) {
    if (inputStream.markSupported()) {
        this.inputStream = inputStream;
    } else {
        this.inputStream = new BufferedInputStream(inputStream);
    }

    inputStream.mark(8192);
    autodetected = false;
    bomSize = 0;
}

Jim Hargrave

Nov 4, 2013, 10:29:06 AM

to okapi...@googlegroups.com

Sorry, too early :-) The 8192 is the maximum number of bytes that can be read before the mark becomes invalid. So the problem is probably that we are reading too far ahead and then trying a reset. We should have logic to stop us reading past 8192.

Jim

Jim Hargrave

Nov 4, 2013, 10:36:58 AM

to okapi...@googlegroups.com

If you send me a sample file that fails I will debug this.
Jim


Yves Savourel

Nov 4, 2013, 10:41:29 AM

to okapi...@googlegroups.com

Thanks Jim: I missed that line.

-ys


Chase Tingley

Nov 4, 2013, 2:20:06 PM

to okapi...@googlegroups.com

FWIW, this file has been the source of a bunch of headaches for me in my ongoing, Quixotic quest to clean up RawDocument. (I was at it again over the weekend.) But I won't have any concrete advice on what I think could be done better until I sort out some other things first.

Jim Hargrave

Nov 4, 2013, 3:19:11 PM

to okapi...@googlegroups.com

It looks like it's possible to read too many bytes while looking for a newline; that's when the reset will throw an exception. We can add some code to prevent reading past a predefined number of bytes, at least as a monkey patch for this issue.

But I'm only guessing this is the issue - I'd need a test file to be sure.

Jim
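One way the proposed guard could look, as a hedged sketch rather than the actual Okapi change: count the bytes consumed while searching for a newline and stop once the lookahead budget is used up, so the reset() in the finally block stays within the mark's read limit. The class name, method names and return strings are hypothetical, and the scan assumes an ASCII-compatible encoding.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BoundedNewlineScan {

    /**
     * Looks for the first newline within maxLookahead bytes, then rewinds.
     * Assumes an ASCII-compatible encoding and a stream that supports mark/reset.
     */
    public static String detectNewline(InputStream in, int maxLookahead) throws IOException {
        in.mark(maxLookahead);
        try {
            int consumed = 0;
            while (consumed < maxLookahead) {
                int b = in.read();
                consumed++;
                if (b < 0) break;                 // end of stream
                if (b == '\n') return "LF";
                if (b == '\r') {
                    // Peek at the next byte only if the budget allows it;
                    // a CR sitting exactly on the boundary is reported as "CR".
                    if (consumed < maxLookahead && in.read() == '\n') return "CRLF";
                    return "CR";
                }
            }
            return "UNKNOWN";                     // no newline within the budget
        } finally {
            in.reset();                           // never more than maxLookahead bytes past the mark
        }
    }

    /** Convenience wrapper: scans a string and checks that the stream was rewound. */
    public static String scan(String text, int maxLookahead) {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII)));
        try {
            String result = detectNewline(in, maxLookahead);
            if (!text.isEmpty() && in.read() != text.charAt(0)) {
                throw new IllegalStateException("stream was not rewound");
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(scan("ab\r\ncd", 1024));
        StringBuilder longLine = new StringBuilder();
        for (int i = 0; i < 5000; i++) longLine.append('a');
        System.out.println(scan(longLine.append('\n').toString(), 1024));
    }
}
```

The trade-off is that a file whose first line is longer than the budget yields no newline type instead of an exception.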


vinc....@gmail.com

Nov 4, 2013, 3:52:52 PM

to okapi...@googlegroups.com

Hi all,
Thank you all for your kind support.
I can provide the test files where the exception happens.
This will be tomorrow (Tuesday) as I'm out of office at the moment (I'm on CET).
Thanks again
Regards
Vincenzo


Vincenzo Turco

Nov 5, 2013, 6:05:31 AM

to okapi...@googlegroups.com

Hi all,

as promised, please find attached two examples of files which generate the error on my machine (Win 7, 64 bits, Java 6)

Thanks, regards

Vincenzo


strings.xml

Test.resx

Jim Hargrave

Nov 5, 2013, 11:11:03 AM

to okapi...@googlegroups.com

Thanks - I'll add this as a unit test and track down the bug. Are the test files private? Do you mind if we add them to our unit tests?

Jim

Vincenzo Turco

Nov 21, 2013, 1:59:59 PM

to okapi...@googlegroups.com

Hi Jim,

sorry for the delay in answering, no problem, please go ahead.

Thanks for your continued support

Regards

Vincenzo


Jim Hargrave

Nov 22, 2013, 2:01:04 PM

to okapi...@googlegroups.com

With this unit test I see no problem - is there something else I can do in the code?

@Test
public void resettingInvalidMarkException() throws IOException {
    BOMNewlineEncodingDetector detector = new BOMNewlineEncodingDetector(
            getClass().getResourceAsStream("/strings.xml"));
    detector.detectBom();
    assertFalse(detector.hasBom());
    assertEquals(BOMNewlineEncodingDetector.NewlineType.CRLF, detector.getNewlineType());
    assertEquals("ISO-8859-1", detector.getEncoding());

    detector = new BOMNewlineEncodingDetector(getClass().getResourceAsStream("/Test.resx"));
    detector.detectBom();
    assertTrue(detector.hasBom());
    assertEquals(BOMNewlineEncodingDetector.NewlineType.CRLF, detector.getNewlineType());
    assertEquals("UTF-8", detector.getEncoding());
}


Jim Hargrave

Nov 22, 2013, 2:06:56 PM

to okapi...@googlegroups.com

It may be a particular pipeline: passing down RawDocument and trying to call the same code twice. Can you send me a Rainbow pipeline config file?

Jim

Vincenzo Turco

Nov 25, 2013, 6:24:39 AM

to okapi...@googlegroups.com

Hi Jim, thanks for testing this one out.

The code I'm getting exceptions with is the following:

File input = ....
FileInputStream fileInputStream = new FileInputStream(input);
BOMNewlineEncodingDetector detector = new BOMNewlineEncodingDetector(fileInputStream);
detector.detectBom(); // <<< exception is generated here

The encoding detector is invoked on its own, with no pipeline involved.

I'm using a FileInputStream, while Class.getResourceAsStream() might return a different subclass of the abstract class InputStream.

Maybe that could make the difference?

Thanks a lot

Regards

Vincenzo
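A possible explanation for the difference, sketched under assumptions: FileInputStream does not support mark/reset, so the constructor quoted earlier in the thread would wrap it in a BufferedInputStream. If mark() is then called on the original parameter rather than on the wrapper, as that quoted snippet appears to do, the wrapper never receives a mark and reset() on it fails. The class name, helper names and temporary file below are hypothetical; this is not the Okapi code.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WrapperMarkDemo {

    /** Reads one byte and reports whether reset() succeeds afterwards. */
    static boolean canReset(InputStream in) {
        try {
            in.read();
            in.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /**
     * Returns { raw.markSupported(), reset works after mark on raw, reset works after mark on wrapper }.
     */
    public static boolean[] demo() {
        try {
            Path tmp = Files.createTempFile("mark-demo", ".txt");
            try {
                Files.write(tmp, "hello".getBytes(StandardCharsets.US_ASCII));
                boolean rawSupportsMark;
                boolean resetAfterMarkOnRaw;
                boolean resetAfterMarkOnWrapper;
                try (InputStream raw = new FileInputStream(tmp.toFile())) {
                    rawSupportsMark = raw.markSupported();
                    InputStream wrapped = new BufferedInputStream(raw);
                    raw.mark(8192);     // no-op on FileInputStream; the wrapper has no mark
                    resetAfterMarkOnRaw = canReset(wrapped);
                }
                try (InputStream raw = new FileInputStream(tmp.toFile())) {
                    InputStream wrapped = new BufferedInputStream(raw);
                    wrapped.mark(8192); // mark the stream that is actually read
                    resetAfterMarkOnWrapper = canReset(wrapped);
                }
                return new boolean[] {rawSupportsMark, resetAfterMarkOnRaw, resetAfterMarkOnWrapper};
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```

A stream that already supports mark/reset would skip the wrapping branch entirely, which might be why a test using a different InputStream subclass does not hit the exception.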


Jim Hargrave

Nov 25, 2013, 11:12:02 AM

to okapi...@googlegroups.com

On 11/25/2013 04:24 AM, Vincenzo Turco wrote:
> while the Class.getResourceAsStream()

That could be the difference - I'll give it a try

Jim

Vincenzo Turco

Nov 25, 2013, 1:57:04 PM

to okapi...@googlegroups.com

Thanks a lot

Vincenzo


Jim Hargrave

Nov 26, 2013, 5:57:37 PM

to okapi...@googlegroups.com

Recreated the issue and fixed the bug. It seems that for BufferedInputStream, if you set the mark larger than the buffer, the markpos is invalidated. This seems to happen even if you don't read past the marked position. Strange. I reduced our max lookahead to be smaller than the default buffer and it works.

The fix will show up in the next snapshot

Jim
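For reference, a small sketch of how the mark's read limit behaves on a BufferedInputStream. It illustrates the readlimit contract only and does not reproduce the "invalidated without reading past the mark" observation above; the tiny buffer size, the class name and the method name are assumptions chosen to keep the demo short, and the over-limit result reflects the OpenJDK implementation.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkLimitDemo {

    /** Marks the stream, reads bytesToRead single bytes, and reports whether reset() still works. */
    public static boolean resetSurvives(int bufferSize, int readLimit, int bytesToRead) {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[64]), bufferSize);
        try {
            in.mark(readLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            in.reset();
            return true;
        } catch (IOException e) {
            return false; // "Resetting to invalid mark"
        }
    }

    public static void main(String[] args) {
        // Lookahead stays within the read limit: the mark remains valid.
        System.out.println(resetSurvives(8, 8, 4));
        // Lookahead runs past the read limit and the buffer: the mark is dropped.
        System.out.println(resetSurvives(8, 8, 9));
        // Read limit larger than the buffer, lookahead within the limit: the buffer grows.
        System.out.println(resetSurvives(8, 16, 12));
    }
}
```

Whatever the exact trigger was in the reported case, keeping the lookahead below the read limit passed to mark() is the condition under which reset() is expected to work.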


Vincenzo Turco

Nov 26, 2013, 6:00:18 PM

to okapi...@googlegroups.com

Great news, thanks a lot

Regards

Vincenzo
