Hi Vincenzo,
I think your issue is related to the input fileInputStream: somehow it’s not able to be reset. Maybe the underlying file is closed, or not open.
...Or it’s related to not calling mark():
Jim: I don’t see any call to mark() in the BOM detector itself. I see it set in the RawDocument.createStream() method.
At the same time, the unit tests for the BOM detector don’t use RawDocument, and they pass. Maybe that’s because they use only ByteArrayInputStream, and that type of stream allows reset without mark?
-ys
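A small sketch of the difference Yves is guessing at, using only JDK classes (the class and method names here are made up for illustration, and nothing in it touches Okapi code). A ByteArrayInputStream rewinds to its start when reset() is called without a prior mark(), while a BufferedInputStream refuses:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ResetWithoutMark {

    /** Reads two bytes, calls reset() with no prior mark(), and reports what happened. */
    static String readThenReset(InputStream in) {
        try {
            in.read();
            in.read();
            in.reset(); // no mark() was ever called on this stream
            return "reset ok, next byte = " + in.read();
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30, 40};
        // ByteArrayInputStream starts with an implicit mark at its first byte.
        System.out.println(readThenReset(new ByteArrayInputStream(data)));
        // BufferedInputStream has no mark until mark() is called, so reset() throws.
        System.out.println(readThenReset(new BufferedInputStream(new ByteArrayInputStream(data))));
    }
}
```

If this holds for the detector's unit tests too, tests built only on ByteArrayInputStream would pass even when no mark is ever set.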
--
You received this message because you are subscribed to the Google Groups "okapi-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-devel...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
The result of the guess is likely correct.
But you will likely run into problems at some point if you ignore the exception. First, the exception may be caused by something other than reset() not working. Second, it means you’ll start processing your input after some bytes have already been read.
A better workaround would probably be to set the mark before calling the BOM detector, so the exception doesn’t occur.
Once we get Jim’s opinion, we may also have to change the way the BOM detector works, or at a minimum document that a mark should be set before calling it.
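To illustrate both points, here is a rough sketch with plain JDK streams. The peekAndReset() method is a hypothetical stand-in for a detector that reads ahead and then rewinds; it is not the Okapi API, and the 1024-byte limit is an arbitrary assumption. It shows where the stream ends up when the failed reset is swallowed, and how marking a mark-capable stream beforehand avoids the problem:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class MarkBeforeDetect {

    /** Hypothetical stand-in for a detector: peeks at up to 4 bytes, then tries to rewind. */
    static void peekAndReset(InputStream in) throws IOException {
        in.read(new byte[4]);
        in.reset();
    }

    /** Returns the first byte the caller sees after "detection", swallowing a failed reset. */
    static int firstByteAfterDetection(InputStream in, boolean markFirst) {
        try {
            if (markFirst) {
                in.mark(1024); // assumed read-ahead budget, set by the caller
            }
            try {
                peekAndReset(in);
            } catch (IOException ignored) {
                // Ignoring the failure leaves the stream wherever the peek stopped.
            }
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A six-byte, mark-capable sample stream. */
    static InputStream sample() {
        return new BufferedInputStream(new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5, 6}));
    }

    public static void main(String[] args) {
        System.out.println(firstByteAfterDetection(sample(), false)); // bytes 1..4 are lost
        System.out.println(firstByteAfterDetection(sample(), true));  // stream rewound to byte 1
    }
}
```

Without the mark, the caller resumes at the fifth byte, which is the silent data loss described above.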
We set the mark here (and convert to a mark supported stream if needed) - but perhaps the mark position is too large? I'd say we should reduce it to 1024 or even 512 - that should be enough bytes to detect the info we need.
public BOMNewlineEncodingDetector(final InputStream inputStream) {
    if (inputStream.markSupported()) {
        this.inputStream = inputStream;
    } else {
        this.inputStream = new BufferedInputStream(inputStream);
    }

    inputStream.mark(8192);
    autodetected = false;
    bomSize = 0;
}
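One detail in the snippet above may be worth checking: mark(8192) is called on the inputStream parameter, not on this.inputStream. When the parameter does not support marking (a FileInputStream does not), the call goes to the raw stream, where the default InputStream.mark() does nothing, and the BufferedInputStream wrapper never gets a mark. That is only a reading of the quoted code, not a confirmed diagnosis. The sketch below reproduces the pattern with JDK classes alone; unmarkable() is a made-up helper standing in for a FileInputStream:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MarkTheWrapper {

    /** A stream without mark support, standing in for FileInputStream. */
    static InputStream unmarkable(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readlimit) { /* no-op, like InputStream.mark() */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Wraps as in the quoted constructor, then marks either the raw stream or the wrapper. */
    static boolean resetWorks(boolean markWrapper) {
        InputStream raw = unmarkable(new byte[] {1, 2, 3, 4});
        InputStream wrapped = raw.markSupported() ? raw : new BufferedInputStream(raw);
        if (markWrapper) {
            wrapped.mark(8192); // mark the stream that will later be reset
        } else {
            raw.mark(8192);     // same shape as the quoted snippet: the wrapper stays unmarked
        }
        try {
            wrapped.read();
            wrapped.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(resetWorks(false)); // reset fails
        System.out.println(resetWorks(true));  // reset succeeds
    }
}
```

If the reading is right, callers who pass in an already mark-capable stream would not see the exception, since then the parameter and the field are the same object.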
Thanks Jim: I missed that line.
-ys
FWIW, this file has been the source of a bunch of headaches for me in my ongoing, Quixotic quest to clean up RawDocument. (I was at it again over the weekend.) But I won't have any concrete advice on what I think could be done better until I sort out some other things first.
From: okapi...@googlegroups.com [mailto:okapi...@googlegroups.com] On Behalf Of Jim Hargrave
Sent: Monday, November 4, 2013 8:29 AM
To: okapi...@googlegroups.com
Subject: Re: [okapi-devel] java.io.IOException: Resetting to invalid mark in BOMNewlineEncodingDetector.detectBomInternal
Sorry, too early :-) The 8192 is the maximum number of bytes that can be read before the mark becomes invalid. So the problem is probably that we are reading too far ahead and then trying a reset. We should have logic to stop us from reading past 8192.
Jim
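The read-limit behavior can be seen in isolation with a short sketch (JDK classes only; the tiny 8-byte buffer and 4-byte limit are assumptions chosen to keep the example small, not values from Okapi). Reading within the limit keeps the mark; reading well past it, in the OpenJDK implementation, drops the mark once the internal buffer has to be refilled:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadLimitDemo {

    /** Marks with the given limit, reads bytesToRead single bytes, then reports whether reset() worked. */
    static boolean canResetAfter(int readLimit, int bytesToRead) {
        // 8-byte internal buffer so the effect shows up on a small input.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[64]), 8);
        in.mark(readLimit);
        try {
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            in.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canResetAfter(4, 3));  // within the limit: mark still valid
        System.out.println(canResetAfter(4, 20)); // far past limit and buffer: mark invalidated
    }
}
```

Note that a BufferedInputStream may keep the mark alive somewhat beyond the stated limit while the data still fits in its buffer, so only reads within the limit are guaranteed to be resettable.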
From: okapi...@googlegroups.com [mailto:okapi...@googlegroups.com] On Behalf Of Vincenzo Turco
Sent: Monday, November 4, 2013 4:15 AM
To: okapi...@googlegroups.com
Subject: [okapi-devel] java.io.IOException: Resetting to invalid mark in BOMNewlineEncodingDetector.detectBomInternal
Hi all,
I'm trying to re-use the net.sf.okapi.common.BOMNewlineEncodingDetector to detect the encoding of files.
So, very simply I create an instance of the object and then try to detect the encoding like this:
BOMNewlineEncodingDetector detector = new BOMNewlineEncodingDetector(fileInputStream);
detector.detectBom();
String encoding = detector.getEncoding();
Upon invoking detectBom() I get an exception like:
java.io.IOException: Resetting to invalid mark
    at java.io.BufferedInputStream.reset(Unknown Source)
    at net.sf.okapi.common.BOMNewlineEncodingDetector.detectBomInternal(BOMNewlineEncodingDetector.java:594)
    at net.sf.okapi.common.BOMNewlineEncodingDetector.detectBom(BOMNewlineEncodingDetector.java:349)
I've passed in 7 different files with different encodings, and all of them trigger the exception.
However, the detection algorithm in detectBom() runs all the same and provides an output.
Please note that hasBOM() returns false for each file of the set.
Can anyone please shed some light on why this is happening and whether we can safely ignore the exception?
Thanks, regards
Vincenzo
Thanks a lot
Vincenzo
On 11/25/2013 04:24 AM, Vincenzo Turco wrote:
while the Class.getResourceAsStream()
That could be the difference - I'll give it a try
Jim
--
Vincenzo Turco