Trim on non-breaking spaces

Dan Schlegel

Jan 8, 2015, 11:54:10 PM
to cloju...@googlegroups.com
Hi all, 

I believe a similar issue to this has come up in the past, but I don't believe this specific case was considered.

I was parsing some text today and was surprised to find that clojure.string/trim was behaving unexpectedly:

=> (clojure.string/trim "   Headache ")
"   Headache"

It turns out that the first character is a non-breaking space (code 160): 

=> (int (first "   Headache "))
160

The trim function uses Java's Character/isWhitespace method, which has a very specific set of accepted whitespace characters (and doesn't include this one). See: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(char).

Perhaps the more appropriate function to use is Character/isSpaceChar, which "... returns true if the character's general category type is any of the following:
SPACE_SEPARATOR
LINE_SEPARATOR
PARAGRAPH_SEPARATOR"
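
To make the difference between the two predicates concrete, here is a small REPL sketch that runs both on a few characters. The helper name `ws-report` is made up for this illustration and is not part of Clojure or Java.

```clojure
;; Compare Java's two whitespace predicates on a single character.
;; `ws-report` is a hypothetical helper used only for this illustration.
(defn ws-report [ch]
  (let [c (char ch)]
    {:whitespace? (Character/isWhitespace c)
     :space-char? (Character/isSpaceChar c)}))

(ws-report \u00A0)   ; non-breaking space, code 160
;; => {:whitespace? false, :space-char? true}

(ws-report \space)
;; => {:whitespace? true, :space-char? true}

(ws-report \tab)     ; a control character, not a Unicode space separator
;; => {:whitespace? true, :space-char? false}
```

Neither predicate is a superset of the other, so swapping one for the other changes behavior in both directions.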

Looking back through JIRA, I see the decision to use isWhitespace consistently across trim, triml, and trimr was made in CLJ-935 (http://dev.clojure.org/jira/browse/CLJ-935), but I'm not sure that isSpaceChar was considered.

Unfortunately, Java's String.trim method does not catch code 160 as whitespace either. Perhaps there's some rationale there that I'm unfamiliar with.
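
For reference, a short sketch of the String.trim behavior. As documented, String.trim strips only leading and trailing characters whose code is 32 (U+0020) or lower, so code 160 falls outside that range. The var name `sample` is just for illustration.

```clojure
;; A string that starts with a non-breaking space (code 160)
;; and ends with an ordinary space (code 32).
(def sample "\u00A0Headache ")

;; String.trim removes the trailing space but leaves the leading
;; non-breaking space in place.
(int (first (.trim ^String sample)))
;; => 160

(count (.trim ^String sample))
;; => 9
```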

I was wondering if you thought this was a bug, or expected behavior. 

Thanks!

-Dan

Alex Miller

Jan 9, 2015, 12:15:05 AM
to cloju...@googlegroups.com
The decision was made at the time of CLJ-935 to consistently use isWhitespace for the trim functions. There is, admittedly, more than one way to define "space" in the context of trimming (which was the source of the inconsistency in that ticket in the first place). I don't know that it's possible for there to be one universally right answer here. 

Right now, I would say this is expected behavior (with the acknowledgement that it may not be *your* expected behavior :).  If you need behavior different from the clojure.string trim functions, you will need to roll your own. With a sufficiently compelling argument, you could probably convince me that Clojure's functions should change, but doing so would potentially break expectations for other users, which raises the bar.

Alex

--
You received this message because you are subscribed to the Google Groups "Clojure Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure-dev...@googlegroups.com.
To post to this group, send email to cloju...@googlegroups.com.
Visit this group at http://groups.google.com/group/clojure-dev.
For more options, visit https://groups.google.com/d/optout.

Andy Fingerhut

Jan 9, 2015, 5:03:31 PM
to cloju...@googlegroups.com
Dan:

I don't have anything Clojure-specific to add over what Alex did.

I do have some extra details to toss in, which are more than just about anyone wants to know about Unicode and which of that full character set is considered white space.  You can run these commands to get a table of which Unicode code points are considered whitespace, according to all of the different criteria I could find in Java other than the deprecated isSpace method:

% cd text.unicode
% lein test :whitespace

That will write a file whitespace.txt with a table of hex code points considered whitespace according to at least one of the criteria.  If yours is similar to mine, you'll note that isSpaceChar does not treat characters like tab, carriage return, and line feed as white space, but isWhitespace does.  If you want to write your own custom whitespace trimming function, you might want to take that table into account, perhaps considering a character as whitespace if either isWhitespace or isSpaceChar returns true.
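
A minimal sketch of that "either predicate" idea, assuming you only care about chars in the Basic Multilingual Plane. The names `blank-char?` and `trim-any` are hypothetical, not part of clojure.string.

```clojure
;; Hypothetical predicate: true if either Java method treats the
;; character as white space.
(defn blank-char? [ch]
  (let [c (char ch)]
    (or (Character/isWhitespace c)
        (Character/isSpaceChar c))))

;; Hypothetical trim built on that predicate. It scans inward from both
;; ends and returns the substring between the first and last non-blank
;; characters.
(defn trim-any [^CharSequence s]
  (let [n     (.length s)
        start (loop [i 0]
                (if (and (< i n) (blank-char? (.charAt s i)))
                  (recur (inc i))
                  i))
        end   (loop [i n]
                (if (and (> i start) (blank-char? (.charAt s (dec i))))
                  (recur (dec i))
                  i))]
    (.toString (.subSequence s start end))))

(trim-any "\u00A0 \tHeadache \u00A0\n")
;; => "Headache"
```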

Perl, as a reference point for a language whose maintainers think about Unicode details _a lot_, even has locale-specific rules for which characters are considered whitespace: http://perldoc.perl.org/perlrecharclass.html#Whitespace

Andy

Dan Schlegel

Jan 9, 2015, 5:40:57 PM
to cloju...@googlegroups.com
Alex, Andy: 

Thank you both. I'll admit I'm not a character set expert, and there may be many competing uses for something like the trim function. Both of your responses were very helpful. For now, I have just copied the existing trim function into my code and changed its calls to isWhitespace into calls to isSpaceChar.

I wonder how often functions like trim meet or fail to meet expectations. It's one of those functions that might benefit from another argument specifying what to trim (either a list of characters or a function to call). Just a thought anyway :).
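
As a rough sketch of what such an argument could look like (the name `trim-with` and its argument order are assumptions for illustration, not an existing or proposed clojure.string function):

```clojure
;; Hypothetical trim that takes the "what to trim" decision as an argument.
;; `pred` is called on each candidate character; a truthy result means
;; the character gets trimmed.
(defn trim-with [pred ^CharSequence s]
  (let [n     (.length s)
        start (loop [i 0]
                (if (and (< i n) (pred (.charAt s i)))
                  (recur (inc i))
                  i))
        end   (loop [i n]
                (if (and (> i start) (pred (.charAt s (dec i))))
                  (recur (dec i))
                  i))]
    (.toString (.subSequence s start end))))

;; Passing a function: the Java static method is wrapped in a Clojure fn.
(trim-with #(Character/isSpaceChar (char %)) "\u00A0 Headache ")
;; => "Headache"

;; Passing a collection: Clojure sets are callable, so a set of
;; characters works as the predicate too.
(trim-with #{\u00A0 \space \-} "-- \u00A0Headache --")
;; => "Headache"
```

Because sets can be called like functions, a single predicate argument covers both the "list" and the "function to call" cases.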

Thanks again,

-Dan