ASCII Streams with mixed Line-End conventions


Joachim Tuchel

Sep 21, 2020, 4:02:17 AM
to VA Smalltalk

Hi there,

We're importing some data from CSV files. There are a lot of good reasons not to do it, but anyway... ;-)

Just yesterday I got a complaint from a user who couldn't figure out how to import their data. It turns out their source system exports data with mixed CrLf and Lf line-ends.

I know there are quite a few things we could do here, like pre-processing the file: look for the first occurrence of Cr, CrLf or Lf and unify the line ends before starting the CSV import. But maybe someone has an even easier way to do it, maybe even with on-board tools of VAST?

Joachim

Louis LaBrunda

Sep 21, 2020, 9:02:47 AM
to VA Smalltalk
Hi Joachim,

The #nextLine method of #CfsReadFileStream has the following comment:

nextLine
"Answer the elements between the current position and the next lineDelimiter. 
 
If #shouldSearchForAllStandardDelimiters answers false 
(likely, the user has specified an explicit line delimiter via #lineDelimiter:)
then we ONLY search for that delimiter. However, if it answers true
(likely the stream current delimiter is the default one), then we try to look
for any of the standard delimiters: Cr, Lf, and CrLf. This is useful when we are
reading streams that could have been written on different platforms.
Answers:
<ByteArray | String>"

It seems to me that covers your case. I haven't checked, but I guess other streams may have similar methods.
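
A minimal sketch of the behavior the comment describes, assuming VAST 9.1 or later and assuming that an in-memory ReadStream understands #nextLine the same way as CfsReadFileStream does (the thread's conclusion suggests so, but it is not checked here). The sample text and the variable names are made up for illustration. The important part is what is missing: no #lineDelimiter: is sent to the stream.

```smalltalk
"Sketch only. No explicit line delimiter is set, so #nextLine is
 expected to accept Cr, Lf and CrLf alike."
| cr lf text stream lines |
cr := String with: Character cr.
lf := String with: Character lf.

"Three lines that end in CrLf, Lf and CrLf respectively."
text := 'id,name', cr, lf, '1,Anna', lf, '2,Bert', cr, lf.

stream := ReadStream on: text.
lines := OrderedCollection new.
[stream atEnd] whileFalse: [lines add: stream nextLine].
lines
```

The same loop should apply to a file stream once it has been opened. As soon as an explicit delimiter is set on the stream, the comment above says only that delimiter is searched for.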

Lou

Richard Sargent

Sep 21, 2020, 12:38:05 PM
to VA Smalltalk
On Monday, September 21, 2020 at 1:02:17 AM UTC-7, Joachim Tuchel wrote:

Just yesterday I got a complaint from a user who couldn't figure out how to import their data. It turns out their source system exports data with mixed CrLf and Lf line-ends.

Given that a quoted field may contain line breaks, you really need to parse the end of line as part of parsing the fields. That is, splitting the file into lines first will give you bad results.

e.g. pseudocode

parseLine
    [self parseField] whileTrue.
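
One way to flesh out the pseudocode above is a reader that consumes one character at a time and treats a line end as a record terminator only outside quotes. This is a sketch under stated assumptions: comma separators, double-quote quoting with doubled quotes as escapes, and hypothetical names (readRecord, input, records). It is not NeoCSV's implementation.

```smalltalk
"Field-level CSV record reader (sketch). Cr, Lf and CrLf all end a
 record, but only when they appear outside a quoted field."
| cr lf readRecord input records |
cr := Character cr.
lf := Character lf.

readRecord := [:stream | | fields field inQuotes done ch |
	fields := OrderedCollection new.
	field := WriteStream on: String new.
	inQuotes := false.
	done := false.
	[done or: [stream atEnd]] whileFalse: [
		ch := stream next.
		inQuotes
			ifTrue: [
				ch = $"
					ifTrue: [
						"A doubled quote is a literal quote; a single one closes the field."
						stream peek = $"
							ifTrue: [stream next. field nextPut: ch]
							ifFalse: [inQuotes := false]]
					ifFalse: [field nextPut: ch]]
			ifFalse: [
				ch = $"
					ifTrue: [inQuotes := true]
					ifFalse: [
						ch = $,
							ifTrue: [
								fields add: field contents.
								field := WriteStream on: String new]
							ifFalse: [
								(ch = cr or: [ch = lf])
									ifTrue: [
										"Swallow the Lf of a CrLf pair, then end the record."
										(ch = cr and: [stream peek = lf]) ifTrue: [stream next].
										done := true]
									ifFalse: [field nextPut: ch]]]]].
	fields add: field contents.
	fields].

"Two records: the first has an Lf inside a quoted field and ends in CrLf,
 the second ends in a bare Lf."
input := ReadStream on:
	'a,"x', (String with: lf), 'y",b', (String with: cr with: lf), 'c,d', (String with: lf).
records := OrderedCollection new.
[input atEnd] whileFalse: [records add: (readRecord value: input)].
records
```

Because the line end is handled in the same loop as the separators, the embedded Lf stays inside the second field of the first record instead of splitting it.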

Joachim Tuchel

Sep 22, 2020, 2:51:12 AM
to VA Smalltalk
Thanks Lou,

I was somewhat assuming that if I set #lineDelimiter (a variable that our users can modify in the application's GUI), the Stream will only look for this one. The comment suggests I can set (or override) #shouldSearchForAllStandardDelimiters and combine the two somehow.

I will have to play with this....

BUT of course, Richard is bringing up a very valid point. We're using NeoCSVReader to parse CSV fields from these files. So the challenge here will be to find the right way to fiddle the line-end detection into that.... In the worst of cases, we'll have to do the "line reading" outside of NeoCSVReader and only feed each single line into the Reader instead of using #upToEnd....

Sigh. So much trouble for a single exception to a simple rule....

Anyways: Thanks Lou and Richard for your input. You showed me both a possible solution to take a closer look at as well as a very important caveat...

Joachim

Louis LaBrunda

Sep 22, 2020, 9:48:39 AM
to VA Smalltalk
Hi Joachim,

Perhaps you can modify #nextLine or extend the class with a new method that handles the embedded line-end character. Perhaps something as simple as counting the double quotes and continuing to read if there is an odd number.
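
The quote-counting idea can be sketched outside of any stream class, so no base code is touched. The collection physicalLines stands for lines already read with #nextLine; it and the other names are hypothetical. The sketch assumes doubled quotes as the escape convention, so an escaped quote adds two to the count and does not disturb the parity.

```smalltalk
"Join physical lines into logical CSV records: keep appending lines
 while the number of double quotes seen so far is odd."
| physicalLines records pending quotes |
physicalLines := OrderedCollection new.
physicalLines
	add: 'id,comment';
	add: '1,"first half';
	add: 'second half"';
	add: '2,plain'.

records := OrderedCollection new.
pending := nil.
physicalLines do: [:line |
	pending := pending isNil
		ifTrue: [line]
		ifFalse: [pending, (String with: Character lf), line].
	quotes := pending occurrencesOf: $".
	quotes even ifTrue: [
		records add: pending.
		pending := nil]].

"An unterminated quoted field at the end of input is kept as is."
pending isNil ifFalse: [records add: pending].
records
```

Note that the embedded line end is rejoined with Lf here, so the original delimiter inside a quoted field is not preserved.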

Lou

Mariano Martinez Peck

Sep 24, 2020, 4:00:01 PM
to VA Smalltalk
Hi guys, 

Joachim, yes, that behavior of #nextLine and #shouldSearchForAllStandardDelimiters is fairly new (9.1 or 9.2) and I hope it can meet your requirements. If not, I would like to hear about it. I'm not sure I understand the question about NeoCSVReader. If that's still an issue, could you clarify?

Louis: no, please, don't modify base apps. 

Best,

Mariano



--
You received this message because you are subscribed to the Google Groups "VA Smalltalk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to va-smalltalk...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/d6244a91-501c-461e-aa08-7a585b3d488dn%40googlegroups.com.


--
Mariano Martinez Peck
Software Engineer, Instantiations Inc.

Louis LaBrunda

Sep 24, 2020, 5:40:16 PM
to VA Smalltalk
Mariano: I wasn't suggesting the base be modified, just that #nextLine could be copied to a new name and modified to do what Joachim needs in order to address Richard's warning.

Joachim Tuchel

Sep 25, 2020, 2:36:11 AM
to VA Smalltalk
Mariano,

Thanks. First of all, I can confirm #nextLine works as expected on the example file that contained lines ending in both <cr><lf> and <lf>. So the key is not to set a lineDelimiter on the Stream.

To my surprise, NeoCSV also scans for cr/crlf/lf and does everything in the way it should.

The problem in this case is that I had implemented some clever code that first reads the first line of the file, tries to determine the lineDelimiter and then sets it on the input Stream. From then on, the Stream didn't handle any other lineDelimiter than the one I had told it. So I was knocking out the cleverness of the Streams. I thought I was more clever. I learned something about this illusion today ;-)

Anyways: thank you Lou, Richard, Mariano for joining me in this journey and pushing me in the right direction.

Conclusion: PositionableStream, Cfs*FileStream and NeoCSVReader all handle line endings perfectly. If you expect Windows or Unix line endings and even a mix of these, just let them do their work and don't interfere. You're in the way if you try to be more clever ;-)

Joachim

Richard Sargent

Sep 25, 2020, 3:23:13 AM
to VA Smalltalk
On Thu, Sep 24, 2020, 23:36 Joachim Tuchel <jtu...@objektfabrik.de> wrote:
The problem in this case is that I had implemented some clever code that first reads the first line of the file, tries to determine the lineDelimiter and then sets it on the input Stream. From then on, the Stream didn't handle any other lineDelimiter than the one I had told it. So I was knocking out the cleverness of the Streams. I thought I was more clever. I learned something about this illusion today ;-)


One of the greatest failings of programmers is when they try to be clever. There are myriad counter-examples. I'll tell you my favourite, if you're interested.


Mariano Martinez Peck

Sep 25, 2020, 9:09:53 AM
to VA Smalltalk
Hi Joachim, 

Happy to read everything is working. I think you did indeed need to be smarter before, and that's exactly why we improved this in 9.1. Just for the record, the developer case was:

"63547: Improve #nextLine to work with all known delimiters (Cr, Lf and CrLf)"

Cheers,



Joachim Tuchel

Sep 25, 2020, 3:15:22 PM
to VA Smalltalk
Richard

I love to hear stories from and about other developers that feed my illusion that I am not alone in my imperfection. The cleverest ideas I have often turn out to be the worst, but it takes time and sweat to find out ;-)
Is there something like the Anonymous Embarrassed Programmers? I'd like to join.

Joachim

Joachim Tuchel

Sep 25, 2020, 3:18:12 PM
to VA Smalltalk
Mariano,

I guess, or at least like the idea of thinking, that this was once a clever choice to overcome some limitation. I just recently migrated to 9.2, so thank you for opening this back door for me ;-)

Anyways: removing my clever code was a satisfying act today, as well as seeing how fine VAST handles even mixed line endings in files and streams. These are the small but powerful things that I like about your work at Instantiations.

Joachim

Mariano Martinez Peck

Sep 25, 2020, 3:45:36 PM
to VA Smalltalk
On Fri, Sep 25, 2020 at 4:18 PM Joachim Tuchel <jtu...@objektfabrik.de> wrote:
Anyways: removing my clever code was a satisfying act today, as well as seeing how fine VAST handles even mixed line endings in files and streams. These are the small but powerful things that I like about your work at Instantiations.


How did you know I implemented that? hahahah ;)
Now seriously, I am still not convinced about the decision to do that automatically or not, that is, via #shouldSearchForAllStandardDelimiters.
I considered, and still have in mind, using an explicit new boolean instVar to control that... but so far, I am still not convinced.

BTW, to implement that functionality and make it performant, we also implemented new primitives, which allowed speedups in other areas too.
Below are some notes from our internal bug tracker:

"- New methods in SequenceableCollection

indexOfAny: aSequenceableCollection
indexOfAny: aSequenceableCollection ifAbsent: exceptionBlock
indexOfAny: aSequenceableCollection startingAt: start
indexOfAny: aSequenceableCollection startingAt: start ifAbsent: exceptionHandler
- These 4 methods give symmetry with indexOfSubCollection...
- indexOfAny will answer the index of the first element to be in the argument aSequenceableCollection
- Implemented VMprStringIndexOfAny prim
This will provide String/DBString specific prim-assist for character searches using indexOfAny.... It's about 3-4x faster than using the more generic version in SequenceableCollection.  

 These also went on all the other streams that have the new delimiter handlers.

skipToAny: aSequentialCollection
upToAny: aSequentialCollection
"
Create a generic/reusable and FAST #indexOfAny:*, #skipToAny: and #upToAny:


So basically, all of these (#indexOfAny:*, #skipToAny: and #upToAny:) are now primitive-assisted and fast.
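
To make the description of #indexOfAny: concrete, here is a portable sketch of the result it is described as answering: the index of the first element of the receiver that also occurs in the argument. It deliberately does not call the new methods, so it also runs where they are absent. Answering 0 when nothing matches is an assumption made by analogy with #indexOfSubCollection:, not something stated in the notes.

```smalltalk
"Portable stand-in for the described semantics of #indexOfAny:.
 On VAST 9.1 or later, 'text indexOfAny: delimiters' is expected
 to give the same answer with primitive support."
| text delimiters index |
text := 'abc', (String with: Character lf), 'def'.
delimiters := String with: Character cr with: Character lf.

"Find the first position whose character is one of the delimiters;
 answer 0 if there is none (assumed convention)."
index := (1 to: text size)
	detect: [:i | delimiters includes: (text at: i)]
	ifNone: [0].
index
```

Here the Lf sits at position 4, so that is the index the search answers. A delimiter-aware #nextLine can be built on this kind of search: find the first Cr or Lf, then check whether a Cr is followed by an Lf.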

Richard Sargent

Sep 25, 2020, 5:08:24 PM
to VA Smalltalk
On Friday, September 25, 2020 at 12:15:22 PM UTC-7, Joachim Tuchel wrote:
I love to hear stories from and about other developers that feed my illusion that I am not alone in my imperfection. The cleverest ideas I have often turn out to be the worst, but it takes time and sweat to find out ;-)
Is there something like the Anonymous Embarrassed Programmers? I'd like to join.

Most of my stories/complaints are more in the "I would like to Slinky the programmer responsible for this!" style

My favourite and most extreme example involved buying something on eBay when I live in Zurich. I had navigated through numerous pages to order the product and had navigated through a number of pages to set up the payment. As soon as I entered my credit card's billing address, my web browser presented everything in German. Clever programmer (TM) had decided that, living in a German-speaking city, I must speak and read German and that he (somehow I am sure it was a male) was doing me a great favour by switching to German.