CSVReader - char after last field delimiter

21 views
Skip to first unread message

jwcolby

unread,
Apr 14, 2008, 1:28:17 PM4/14/08
to CSV...@googlegroups.com
Bruce,

I have a situation where files sent to me have a character after the
last field delimiter. I am getting back an error message:

Bulk Copy FAILED: Maximum column length of 100,000 exceeded in column 39
in record 2,477. Set the SafetySwitch property to false if you're
expecting column lengths greater than 100,000 characters to avoid this
error.

I am not exactly sure why the error message reads the way it does but I
can tell you this:

1) There are 153 field delimiters ( I use the pipe symbol | )
2) The longest line in the entire file is only 987 characters total,
including the pipe symbol field delimiters.
3) The line with the error only has 783 columns in it total including
delimiters (as counted by UltraEdit)
4) Every single line has the same number of delimiter characters in it,
so it is not that one has too few or two many.

I am importing over 50 of these files, with millions of records, most
import, but a small handful fail with this kind of error.

I had this error long ago and contacted you but never really resolved
the problem. Because of the size of these files it is difficult to
really view them with ultraedit (gigabytes of text).

This time I wrote some code to split the file apart and write just a
chunk containing the error line as reported by CSVReader and sure
enough, when I go to the line indicated there is (in this case) a single
0 at the end of the line, before the crlf.

I tried feeding just this smaller "chunk" back through my import program
and it failed with the same error message (different line number of course).

Anyway, I have the "smaller" file (7.9 megs uncompressed) if you would
like to look at it.

In the meantime I need to know if there is an easy way to get at the
line to strip anything off after the last delimiter so that I can
proceed to import my data. I already sink CSVData_ReadRecord but I
think you said that event never even fired because your code didn't
consider it a line.

Anything you can do to help me figure this out would be much appreciated.

--
John W. Colby
www.ColbyConsulting.com

shriop

unread,
Apr 15, 2008, 12:57:16 AM4/15/08
to CSVChat
I recommend Large Text File Viewer for looking at a large file,
http://www.swiftgear.com/ .

My best guess without the file itself is that the file switches format
part way through and the parse can no longer recognize column
delimiters or end of line markers. I recently ran into several files,
in different client systems, that were like this and after a certain
point in the file, the file went crazy and switched to binary. The
error that you're running into is, as the name implies, a safety
switch built in to prevent the parser from hogging all the memory in
possibly a production server as it tries to buffer the contents of a
record until it runs into either a column delimiter or an end of line
marker. The message generally implies that the settings for parsing
the file does not match the file itself. If you get this message on
column 1 of record 1, it can be a problem with file encoding not being
set correctly. If you hit it in the middle of a file, it can also mean
that the parser ran into a text qualifier at the start of a column and
then did not run into another non-escaped text qualifier within the
next 100,000 characters. It's part of the csv spec that anything can
exist in a csv field as long as it is text qualified. If you're not
seeing a text qualifier starting at that position, then look for the
next pipe or end of line marker that occurs from that point on. The
error message shows the position where the parser started a column and
could not finish from there. Feel free to email me the file, along
with the sample code that you're using to set the column delimiter and
such, and I will track down the specifics.

Bruce Dunwiddie

jwcolby

unread,
Apr 15, 2008, 5:00:00 AM4/15/08
to CSV...@googlegroups.com
Bruce,

I tracked it down. I use a | character as the field delimiter. I had
set "Use text qualifier" as true so as to strip out the quotes around a
field if any were encountered. Sure enough there is an "opening" text
qualifier but no "closing" text qualifier.

Is this a common problem? It certainly seems to be a common problem
with the data files from this one company I am getting data from, and it
seems that it could be encountered often in "hand entered" data where a
person could simply type the " (text qualifier) by mistake in the middle
of real data.

I am wondering if you can put in a "fix dirty data" property to tell
your code to strip the text qualifier if a field delimiter is
encountered before a close text qualifier.

IOW I chose a pipe because I have never seen it used in normal text
(non-programming language text). I need to strip ALL text qualifiers
(") out of the data but would have to pre-process every file in order to
do this.

Question, does CSVReader read an entire line (to the crlf for example)
and then process that? If so could you read the line and raise a
"preprocess" event that would hand the event handler the line of data by
reference? I could then do a simple replace(LineData,'"','') on every
line every time.

If I can't do something like that how do you recommend handling this
data. I do have other data files where the field delimiter is the more
normal comma and so would need the normal processing of text qualifiers.

IOW is there some standard programmatic way of handling this issue or am
I stuck with manually finding and fixing bad data?

Thanks for your help on this,

John W. Colby

--
John W. Colby
www.ColbyConsulting.com

shriop

unread,
Apr 15, 2008, 9:02:41 AM4/15/08
to CSVChat
It's a somewhat common problem. No, the parser does not read until the
first end of line marker and then process the line. Standard csv rules
allows for carriage returns and line feeds to exist in a text
qualified field. That's part of the purpose of the text qualifier, not
just to escape the delimiter. What you're saying would totally break
all the rules of csv. There is an easy way to do what you're saying
though. Just turn off text qualifiers and call .Trim('\"') on every
field in the ReadRecord event, if that's what you want. The question
is then how to handle escaped text qualifiers, and hope you don't run
into pipes in the data.

Bruce Dunwiddie

On Apr 15, 4:00 am, jwcolby <jwco...@colbyconsulting.com> wrote:
> Bruce,
>
> John W. Colbywww.ColbyConsulting.com- Hide quoted text -
>
> - Show quoted text -
Reply all
Reply to author
Forward
0 new messages