Carriage return or new line

Jeff Shrum

Oct 14, 2013, 11:04:42 AM
to flex...@googlegroups.com

I have been importing several old Toolbox databases into Flex, but I am having some problems using regular expressions with Notepad++.  I needed to add “\ps v” to all entries that began with “*”.  The result looked good in Notepad++, but then Flex did not see that “\ps v” was on a new line and merged the information with the “\x” field.  When I went back to Notepad++ and went to “View → show symbols → end of line”, most lines had “CR” “LF”.  The problem records do not have “CR”, only “LF”.  What are the correct regular expression symbols for carriage return?  I need to improve my expressions for the next data set to be imported.

 

I want to alert others that Notepad++, while free and good in many ways, does not display Unicode text correctly in all cases.

 

Jeff Shrum
Language Technology Consultant
SIL Southern Africa
+258 82 300 8461

+258 86 682 0862

In Malawi: +265 99 373 3153

 

 

jim_al...@wycliffe.org

Oct 14, 2013, 12:02:34 PM
to flex...@googlegroups.com

\r\n  matches crlf

 

Jim Albright

704 843-0582

JAARS, Speeding Bible Translation

Wycliffe, Partners in Bible Translation

--
You are subscribed to the publicly accessible group "FLEx list".
Only members can post but anyone can view messages on the website.
---
You received this message because you are subscribed to the Google Groups "FLEx list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flex-list+...@googlegroups.com.
To post to this group, send email to flex...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flex-list/000001cec8ee%24b9e5cd60%242db16820%24%40org.
For more options, visit https://groups.google.com/groups/opt_out.

Beth Bryson

Oct 14, 2013, 12:30:32 PM
to flex...@googlegroups.com
The first key is to find out how your text was encoded in the first place--are the line endings for Windows, Mac, or Unix?

One thing I *like* about Notepad++ (in contrast to Notepad or WordPad) is that it is easy to ask it to save the file with "DOS line endings" or "Mac line endings" or "Unix line endings".

I don't remember right now how to find out which one your file already is, but that is the key, and then when you do your regular expression search, you need to use whatever your file is already using.  (Personally, all the scripts I have for processing Toolbox files in preparation for import to FLEx work much better with Unix line endings than Windows ones, maybe because only one character is used to mark the end of a line, not two.)

Whatever you do, don't end up with a mixture--that is a mess!

Like Jim, I use \n or \r or \r\n in regular expressions, depending on how my file is encoded.
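
For finding out which line endings a file already uses, one option is to count them outside the editor. The following is a minimal Python sketch, not something from this thread: `count_line_endings` and the sample bytes are hypothetical names and data made up for illustration. It works on raw bytes because opening a file in Python's text mode can translate line endings before you get to see them.

```python
import re

def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw file bytes."""
    crlf = len(re.findall(rb"\r\n", data))
    bare_lf = len(re.findall(rb"(?<!\r)\n", data))   # LF not preceded by CR
    bare_cr = len(re.findall(rb"\r(?!\n)", data))    # CR not followed by LF
    return {"CRLF": crlf, "LF": bare_lf, "CR": bare_cr}

# Hypothetical sample; for a real file use open(path, "rb").read() instead.
sample = b"\\lx kula\r\n\\ps v\n\\ge eat\r\n"
print(count_line_endings(sample))  # {'CRLF': 2, 'LF': 1, 'CR': 0}
```

If more than one of the three counts is non-zero, the file has the kind of mixture warned about above.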

-Beth

Jeff Shrum

Oct 14, 2013, 6:13:03 PM
to flex...@googlegroups.com

Beth,

 

Yes, the file I have is a mess.  It seems to have lots of hidden tab characters and LF’s without CR’s.  I think I made it worse without knowing it.  The file I was given is actually a .doc file that has had who knows what done to it.  I have learned to scrutinize files more before I start working on them.  Things are not always what they seem.  I think I can find all of the LF’s without CR’s and add the CR’s with a regular expression now that I know what the problem is.  One of the interesting things that happens on importing a \lx field without a CR on the end is that Flex marks even single-word lexemes as phrases.  Next time I see this, I will know what the underlying problem with the SFM file is.

 

Jeff S.

Jeff Shrum

Oct 14, 2013, 6:13:03 PM
to flex...@googlegroups.com

Thanks Jim.  These look like the strings that I was taught to use, but when the result looked correct in Notepad++ I did not add them to the expression that I used.

 

Jeff S.

Jeff Shrum

Oct 15, 2013, 12:46:06 AM
to flex...@googlegroups.com

In this data there are lines with only LF and no CR, but I cannot seem to write an expression that will capture only those cases.  Whatever I have tried captures both the lines with LF and the lines with CRLF.  Anyone know how to do this?
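
One possible way to isolate those cases, assuming the regex engine supports negative lookbehind, is the pattern (?<!\r)\n, which matches an LF only when the character before it is not a CR. The sketch below tries it in Python; `lines_with_bare_lf` and the sample text are hypothetical, made up for illustration, and the line counting assumes the file has no CR-only line endings.

```python
import re

# Matches an LF that is NOT immediately preceded by a CR.
BARE_LF = re.compile(r"(?<!\r)\n")

def lines_with_bare_lf(text: str) -> list:
    """Return 1-based numbers of the lines that end in LF without CR."""
    hits = []
    for m in BARE_LF.finditer(text):
        # Each earlier line ending contains exactly one "\n", so counting
        # them gives the number of complete lines before this match.
        hits.append(text.count("\n", 0, m.start()) + 1)
    return hits

# Hypothetical SFM-like sample with two bare-LF lines (lines 2 and 4).
sample = "\\lx kula\r\n\\ps v\n\\ge eat\r\n\\lx lala\n"
print(lines_with_bare_lf(sample))  # [2, 4]
```

The same compiled pattern can do the repair in one pass with `BARE_LF.sub("\r\n", sample)`, which leaves the existing CRLF endings untouched.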

 

Jeff S.

 


Robert Hedinger

Oct 15, 2013, 4:57:00 AM
to flex...@googlegroups.com
One simple way of doing it is to add a CR before every LF and then make a second change that reduces the unwanted CRCRLF to CRLF.
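
The two steps can be sketched in Python with plain string replacement. This is only an illustration under stated assumptions: `add_missing_cr` and the sample text are hypothetical, and the second step assumes the file contains no genuine CRCRLF sequences that need to be kept.

```python
def add_missing_cr(text: str) -> str:
    """Two-step fix: turn every LF into CRLF, then collapse doubled CRs."""
    # Step 1: bare LF becomes CRLF, but existing CRLF becomes CRCRLF.
    step1 = text.replace("\n", "\r\n")
    # Step 2: reduce the unwanted CRCRLF back to CRLF.
    step2 = step1.replace("\r\r\n", "\r\n")
    return step2

# Hypothetical sample: the middle line ends in LF only.
sample = "\\lx kula\r\n\\ps v\n\\ge eat\r\n"
fixed = add_missing_cr(sample)
print(repr(fixed))
```

Running the function on an already clean CRLF file returns it unchanged, so it is safe to apply more than once.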
 
Robert

Jeff Shrum

Oct 15, 2013, 4:23:09 PM
to flex...@googlegroups.com

Robert,

 

Yes, I see that a two-step process could work.  Thank you.  I am just surprised that something that has the symbol “\n” cannot be treated in isolation.  Probably dates back to a weakness in DOS or CP/M that no one ever bothered to correct.

 

Jeff S.

J V C

Oct 23, 2013, 2:33:20 AM
to flex...@googlegroups.com
In Notepad++, there's a "P" (paragraph mark) button you can click to display nonprinting characters, much like in MS Word (but clearer, though uglier). You can also simply use the Edit, EOL Conversion feature and let it do all the work. I tend to use \r\n (CR LF) since I'm on Windows, but it's less work to use \n as long as you remember that your regexes are then less general-purpose, and you'll want to switch back to CRLF when saving a file to be shared with Windows users.

I do seem to remember NP++ having trouble being precise with these characters, but I think they fixed that over a year ago when they greatly improved their regex support.

FWIW, here's something I posted on LTS a while back...

I use \r\n all the time in regexes too, since I'm mostly working with text files created on Windows with its annoying newlines. But if you also need to handle Linux or Mac (e.g. if you're sharing regexes with a mixed audience), here are the codes:
\n (Linux, Unix, Mac)
\r (old Mac, prior to OS X)
\r\n (Windows)
So, for a cross-platform regex matching one or more newlines, this works:
[\n\r]+
To match exactly one newline, this is more precise. It means "\n, or else \r followed by zero or one \n":
\n|\r\n?
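
To see the difference between the two patterns, here is a small Python sketch that splits the same text with each of them. The names `ANY_NEWLINES` and `ONE_NEWLINE` and the sample text are hypothetical, chosen only for this illustration; the sample mixes all three line-ending styles and contains one blank line.

```python
import re

ANY_NEWLINES = re.compile(r"[\n\r]+")   # one or more newline characters
ONE_NEWLINE = re.compile(r"\n|\r\n?")   # exactly one newline of any style

# Mixed endings: CRLF, LF, CR, then a blank line (CRLF CRLF).
text = "one\r\ntwo\nthree\rfour\r\n\r\nfive"

# The first pattern swallows the blank line along with the line break.
print(ANY_NEWLINES.split(text))  # ['one', 'two', 'three', 'four', 'five']

# The second pattern keeps the blank line as an empty string.
print(ONE_NEWLINE.split(text))   # ['one', 'two', 'three', 'four', '', 'five']
```

So the first pattern is convenient when blank lines do not matter, and the second is the safer choice when they do.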

Jon