EOF character(s) to be ignored or specified within BeanIO

jsapar.d...@gmail.com

Jun 18, 2013, 8:41:44 AM
to bea...@googlegroups.com
Hi Kevin,

Is there a chance that BeanIO supports (or will support) EOF characters to be predefined at the <stream> element? 
My "problem" is that I get an org.beanio.UnidentifiedRecordException (unidentified record at line xx) error when the BeanReader hits the EOF character within the file (EOF character is the ASCII 'SUB' character).
BeanIO now sees this as a new line that doesn't match any defined record within the stream. I also couldn't find any properties settings in the reference guide (8.1. Settings) to define such EOF character.

I can't strip off the EOF character because the files are read-only. As a workaround, I would define another record (outside of the record group, but within the same stream) with a single field that is marked as a record-id field and contains the literal ASCII 'SUB' value. BeanIO would then detect the line as an EOF record and simply do nothing with it. I hope this would bypass the exception? The library shouldn't be throwing an exception because it reached EOF :(

Example (assuming this could work, but probably not, since XML 1.0 does not allow the SUB character even as a character reference):

<record name="eoftoken">
<!-- SUB character used as EOF marker (Unicode U+001A, XML &#x1A;) -->
<field name="eof" literal="&#x1A;" ignore="true" rid="true" required="false" />
</record>

Is there a good way to deal with these old EOF characters within (readonly) files?

Maybe I have overlooked something in the reference guide (v2.1) ??

Cheers,
JD 

Kevin

Jun 18, 2013, 9:34:21 PM
to bea...@googlegroups.com
Hi JD,

I have never heard of using SUB to terminate a stream, so no, that will not be supported.  But it should be fairly easy to create your own reader (extending from FilterReader) that recognizes SUB as the end of the stream, and then simply wrap your FileReader or other kind of Reader in that before passing it to BeanIO.

Thanks,
Kevin

jsapar.d...@gmail.com

Jun 19, 2013, 9:47:38 AM
to bea...@googlegroups.com
Hi Kevin,

Thanks for your quick response. See my comments marked with %%.


On Wednesday, June 19, 2013 3:34:21 AM UTC+2, Kevin wrote:
Hi JD,

I have never heard of using SUB to terminate a stream,

%% Please read: http://en.wikipedia.org/wiki/End-of-file , you will see that the SUB character was used (or should I say abused?) a lot in the old days (MS-DOS, CP/M, DEC) to represent an EOF mark.
 
so no, that will not be supported.  

%% Hold that thought for a minute. I think differently about this, because it is a de facto standard and is part of the files that are being read. I can't change history. Read on below.
 
But it should be fairly easy to create your own reader (extending from FilterReader) that recognizes SUB as the end of the stream, and then simply wrap your FileReader or other kind of Reader in that before passing it to BeanIO.


%% I see your point of view here, but it disappoints me somewhat, because I now need more than one Reader to accomplish this. It CAN be done the way you describe, but then again: why should I read the input twice when there is no need to? Your solution solves the problem, but also costs speed: reading the lines twice just to figure out whether the EOF was reached. Why not let BeanIO handle that through a configurable EOF character setting?
%% The same applies when BeanIO encounters lines in the stream that are not defined in the mapping: it throws an UnidentifiedRecordException.
%% Now you would think: well, that is logically correct. True, BUT what if I know which kinds of records occur in the file/stream and simply don't want to read some of them? Normally BeanIO will throw an UnidentifiedRecordException. But maybe it isn't an unidentified record, and maybe it isn't an error after all. Maybe I just want BeanIO to skip these kinds of records/lines and carry on reading the rest of the file/stream.
The same can be said about the EOF character, which can be interpreted as a special record that BeanIO needs to ignore. That way BeanIO doesn't throw an UnidentifiedRecordException and I don't need more than one Reader class to read the files/streams. That saves me a lot of extra work and is faster (FYI: reading 10,000 files twice causes an unnecessary delay).

%% So I was thinking about these options:
1. possibility for the user to define special characters in BeanIO, like an EOF character/line,
2. possibility for the user to define records in a stream that can be detected by BeanIO but are fully skipped/ignored by BeanIO.
3. catch an UnidentifiedRecordException by the user and ignore the record/line.
4. read all lines and filter out the lines that don't belong in the stream before feeding the remaining lines into BeanIO, so that no UnidentifiedRecordException is thrown. This is the solution you described here.

%% To be honest, options 1 and 2 are IMHO better-looking solutions than options 3 and 4. When you look more closely at option 4, you will see that the user has to write their own mini-parser in order to suppress BeanIO exceptions.
I think the power of BeanIO should be to give the user complete control over how the file/stream is parsed, without the user needing to write their own mini-parser as a (sort of) pass-through filter. That seems like overkill to me.
%% I think BeanIO should handle the parsing completely, rather than a combination of BeanIO plus some handwritten glue code :-(

%% BTW: There is no option in BeanIO to fully ignore a defined record. This could be handy in case records need to be skipped and to let BeanIO NOT throw an exception. I don't need to handle an exception when I know beforehand that these records exist within the stream and that I know I want to ignore them.

%% I hope that I have convinced you to implement these features in BeanIO? They will be helpful for a lot of people I think.

%% Anyways, thanks for your help and quick responses!
 
%% Cheers,
%% JD

jsapar.d...@gmail.com

Jun 19, 2013, 10:09:04 AM
to bea...@googlegroups.com
Hi Kevin,

A small update from my point-of-view here...

On Wednesday, June 19, 2013 3:34:21 AM UTC+2, Kevin wrote:
Hi JD,

I have never heard of using SUB to terminate a stream,

%% Please read: http://en.wikipedia.org/wiki/End-of-file , you will see that the SUB character was used (or should I say abused?) a lot in the old days (MS-DOS, CP/M, DEC) to represent an EOF mark.
 
so no, that will not be supported.  

Please read the following standard: http://mastpoint.curzonnassau.com/csv-1203/csv-1203.pdf (page 15, The Logical End-Of-File Marker)

I will quote it here: 
[quote]
6 The Logical End-of-File Marker
Today, few producer applications write a logical end-of-file (EOF) marker (SUB - ASCII 0x1A) to a file to indicate its end. To permit backward compatibility, a CSV consumer application must handle this marker correctly.

The presence of an EOF must be tested for if the producer application intends to append data records to an existing CSV file. In these cases, the EOF marker must be overwritten so that it does not cause a consumer application to prematurely stop while reading the file. Preserving the presence of an EOF is at the discretion of the producer application's developers.
[unquote]
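
The producer-side rule quoted above (overwrite an existing EOF marker before appending) could look roughly like the following. This is only a sketch under stated assumptions: the class name CsvAppender and the method appendRecord are made up for illustration, the file uses a single-byte ASCII-compatible encoding with CRLF record terminators, and the marker, if present, is the very last byte. It does not write a new marker after the appended record, which the quoted text leaves to the producer's discretion.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of BeanIO or of the quoted standard.
class CsvAppender {
    private static final int SUB = 0x1A; // logical EOF marker (ASCII 26)

    /** Appends one record, overwriting a trailing SUB marker if the file ends with one. */
    static void appendRecord(Path file, String record) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            long end = raf.length();
            if (end > 0) {
                raf.seek(end - 1);
                if (raf.read() == SUB) {
                    end--; // start writing on top of the marker
                }
            }
            raf.seek(end);
            raf.write((record + "\r\n").getBytes(StandardCharsets.US_ASCII));
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".csv");
        Files.write(file, new byte[] {'a', ',', 'b', '\r', '\n', SUB});
        appendRecord(file, "c,d");
        // The file now holds two records and no SUB between them.
        System.out.print(new String(Files.readAllBytes(file), StandardCharsets.US_ASCII));
        Files.delete(file);
    }
}
```

Without the overwrite, the marker would end up between the old and the new records, and a consumer that honours it would stop before the appended data.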

I rest my case ;-)

Cheers,
JD 

 

Kevin

Jun 19, 2013, 11:14:38 PM
to bea...@googlegroups.com
Hi JD,

I certainly did not suggest reading the file twice.

And you can ignore unidentified records by setting ignoreUnidentifiedRecords="true" on the stream definition.

But the best solution which I am recommending is to wrap your FileReader like this:

streamFactory.createReader("stream", new LogicalEOFReader(new FileReader("file.csv"), 26));


Where LogicalEOFReader is something like this:

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

public class LogicalEOFReader extends FilterReader {

    private final int eofMarker;
    private boolean eof = false;

    public LogicalEOFReader(Reader in, int eofMarker) {
        super(in);
        this.eofMarker = eofMarker;
    }

    @Override
    public int read() throws IOException {
        if (eof) {
            return -1;
        }

        int n = super.read();
        if (n == -1 || n == eofMarker) {
            eof = true;
            return -1;
        }
        return n;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (eof) {
            return -1;
        }

        int n = super.read(cbuf, off, len);
        if (n == -1) {
            eof = true;
            return -1;
        }
        // Scan only the characters just read, which start at 'off'.
        for (int i = 0; i < n; i++) {
            if (cbuf[off + i] == eofMarker) {
                eof = true;
                // Report end of stream rather than 0 if the marker came first.
                return i == 0 ? -1 : i;
            }
        }
        return n;
    }
}
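
To see the effect without BeanIO on the classpath, here is a small standalone sketch. The names SubTerminatedReader and SubEofDemo are made up for this example; SubTerminatedReader is just a condensed copy of the reader above with the marker fixed to SUB, so that the snippet compiles on its own. A StringReader stands in for the FileReader, and a BufferedReader stands in for whatever consumes the stream.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical condensed variant of the reader above; SUB (0x1A) ends the stream.
class SubTerminatedReader extends FilterReader {
    private static final char SUB = 0x1A;
    private boolean eof = false;

    SubTerminatedReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (eof) {
            return -1;
        }
        int c = super.read();
        if (c == -1 || c == SUB) {
            eof = true;
            return -1;
        }
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (eof) {
            return -1;
        }
        int n = super.read(cbuf, off, len);
        if (n == -1) {
            eof = true;
            return -1;
        }
        for (int i = 0; i < n; i++) {
            if (cbuf[off + i] == SUB) {
                eof = true;
                return i == 0 ? -1 : i; // nothing before the marker means end of stream
            }
        }
        return n;
    }
}

class SubEofDemo {
    /** Reads every line through the filter and returns them joined with '\n'. */
    static String readLines(String content) throws IOException {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new SubTerminatedReader(new StringReader(content)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        char sub = 0x1A;
        // Prints the two records; nothing at or after SUB reaches the consumer.
        System.out.print(readLines("1,alpha\n2,beta\n" + sub + "ignored"));
    }
}
```

The wrapper only decorates the underlying reader: each character is pulled from the source once and passed along, so the input is not read a second time.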


Thanks,
Kevin
