Iterating over a String

Roedy Green

Nov 14, 2009, 12:54:14 PM

It has bugged me that the for:each syntax would not let me write code
of the form:

String categories = "amq";
...
for ( char category: categories )

However, you can write this:

String categories = "amq";

...

final char[] cats = categories.toCharArray();
for ( char category : cats )

What is your opinion? Would you prefer it, or an indexing loop with
charAt?

The indexing loop lets you look back and forward, which the for:each
does not.
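
For comparison, here is a minimal sketch of the two styles side by side. The class and method names (CharLoops, countWithForEach, countAdjacentRepeats) are made up for illustration; only toCharArray() and charAt() come from the post itself.

```java
class CharLoops {
    /** for-each over a copy of the chars: concise, but no access to neighbours. */
    static int countWithForEach(String s, char wanted) {
        int count = 0;
        for (char c : s.toCharArray()) {   // toCharArray() copies the string's chars
            if (c == wanted) {
                count++;
            }
        }
        return count;
    }

    /** Indexed loop: charAt(i - 1) lets the code look back at the previous char. */
    static int countAdjacentRepeats(String s) {
        int repeats = 0;
        for (int i = 1; i < s.length(); i++) {
            if (s.charAt(i) == s.charAt(i - 1)) {
                repeats++;
            }
        }
        return repeats;
    }

    public static void main(String[] args) {
        System.out.println(countWithForEach("amq", 'm'));
        System.out.println(countAdjacentRepeats("xyzzy"));
    }
}
```

The second method is the kind of task the for:each form cannot express directly, since the loop variable carries no position.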
--
Roedy Green Canadian Mind Products
http://mindprod.com

Without deviation from the norm, progress is not possible.
~ Frank Zappa (born: 1940-12-21 died: 1993-12-04 at age: 52)

Eric Sosman

Nov 14, 2009, 1:18:16 PM

Roedy Green wrote:
> It has bugged me that the for:each syntax would not let me write code
> of the form:
>
> String categories = "amq";
> ...
> for ( char category: categories )

Bugs me, too. It seems so obvious to have String (more
generally, CharSequence) implement Iterable, but ...
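
A note for readers on later Java versions: Java 8 added CharSequence.chars(), which returns an IntStream of the char values. It is not the for:each syntax wished for here, but it works on any CharSequence without an explicit copy. A small sketch, with an illustrative class and method name:

```java
class CharsStreamDemo {
    /** Upper-cases any CharSequence by walking its chars() stream (Java 8+). */
    static String upperCase(CharSequence text) {
        StringBuilder out = new StringBuilder();
        text.chars()   // IntStream of 16-bit char values, not code points
            .forEach(c -> out.append(Character.toUpperCase((char) c)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCase("amq"));
        System.out.println(upperCase(new StringBuilder("xyzzy")));
    }
}
```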

--
Eric Sosman
eso...@ieee-dot-org.invalid

markspace

Nov 14, 2009, 1:45:31 PM

Roedy Green wrote:
> It has bugged me that the for:each syntax would not let me write code
> of the form:
>
> String categories = "amq";
> ...
> for ( char category: categories )

I've opined here that I'd like a shorter form for iterating with a
simple integer.

for( int i : categories.length() ) {
    char c = categories.charAt(i);
    ....
}

Not exactly what you are asking for but I thought I'd toss my two bits in.
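
Standard Java has no for( int i : n ) form. From Java 8 on, IntStream.range comes reasonably close. A sketch under that assumption; the class and method names are illustrative only:

```java
import java.util.stream.IntStream;

class IndexRangeDemo {
    /** Visits indices 0..length-1 and builds output such as "0:a 1:m 2:q". */
    static String describe(String categories) {
        StringBuilder out = new StringBuilder();
        IntStream.range(0, categories.length()).forEach(i -> {
            char c = categories.charAt(i);
            if (i > 0) {
                out.append(' ');
            }
            out.append(i).append(':').append(c);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("amq"));
    }
}
```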

Patricia Shanahan

Nov 14, 2009, 4:14:02 PM

Roedy Green wrote:
> It has bugged me that the for:each syntax would not let me write code
> of the form:
>
> String categories = "amq";
> ...
> for ( char category: categories )
>
>
> However, you can write this:
>
>
> String categories = "amq";
>
> ...
>
> final char[] cats = categories.toCharArray();
> for ( char category : cats )
>
> What is your opinion? Would you prefer it, or an indexing loop with
> charAt?
>
> The indexing loop lets you look back and forward, which the for:each
> does not.

Here's a utility class that makes it easy to apply the for each loop to
a String, if that is what you want to do. I use the for loop whenever it
works without stretching. See the main method at the end for an example
of using the class.

import java.util.Iterator;
import java.util.NoSuchElementException;

public class IterableString implements Iterable<Character> {
    private String data;

    /**
     * Create an Iterable for the specified String
     *
     * @param data
     *            The String to iterate.
     */
    public IterableString(String data) {
        super();
        this.data = data;
    }

    @Override
    public Iterator<Character> iterator() {
        return new StringIterator(data);
    }

    private static class StringIterator implements Iterator<Character> {

        private String data;
        private int index = 0;

        public StringIterator(String data) {
            this.data = data;
        }

        @Override
        public boolean hasNext() {
            return index < data.length();
        }

        @Override
        public Character next() {
            if (index < data.length()) {
                Character result = data.charAt(index);
                index++;
                return result;
            } else {
                throw new NoSuchElementException();
            }
        }

        @Override
        public void remove() {
            throw new UnsupportedOperationException("No remove from String");
        }

    }

    /** Demonstration method */
    public static void main(String[] args) {
        String testData = "xyzzy";
        for (char c : new IterableString(testData)) {
            System.out.println(c);
        }
    }

}

Daniel Pitts

Nov 16, 2009, 3:42:43 PM

Roedy Green wrote:
> It has bugged me that the for:each syntax would not let me write code
> of the form:
>
> String categories = "amq";
> ...
> for ( char category: categories )
>
>
> However, you can write this:
>
>
> String categories = "amq";
>
> ...
>
> final char[] cats = categories.toCharArray();
> for ( char category : cats )
>
> What is your opinion? Would you prefer it, or an indexing loop with
> charAt?
>
> The indexing loop lets you look back and forward, which the for:each
> does not.

What about 64bit codepoints? Wouldn't you rather iterate over codepoints
than characters?

Patricia gave a good wrapper class for doing what you requested, but I
suggest adapting it to support Integer codepoints.
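
One way such an adaptation might look, as a sketch rather than a finished utility. The class name IterableCodePoints and the helper count() are invented here; codePointAt() and Character.charCount() are the standard library calls that do the real work.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class IterableCodePoints implements Iterable<Integer> {
    private final String data;

    IterableCodePoints(String data) {
        this.data = data;
    }

    @Override
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int index = 0;   // position in chars, not in code points

            @Override
            public boolean hasNext() {
                return index < data.length();
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int codePoint = data.codePointAt(index);
                // advance by 1 char for a BMP character, 2 for a surrogate pair
                index += Character.charCount(codePoint);
                return codePoint;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException("No remove from String");
            }
        };
    }

    /** Hypothetical helper: number of code points seen by the iterator. */
    static int count(String s) {
        int n = 0;
        for (int ignored : new IterableCodePoints(s)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        for (int cp : new IterableCodePoints("\uD834\uDD1E G clef")) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

An unpaired surrogate is returned as its own value, which is what codePointAt() does; whether that is acceptable depends on the application.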

Roedy Green

Nov 16, 2009, 8:29:14 PM

On Mon, 16 Nov 2009 12:42:43 -0800, Daniel Pitts
<newsgroup....@virtualinfinity.net> wrote, quoted or indirectly
quoted someone who said :

>What about 64bit codepoints? Wouldn't you rather iterate over codepoints
>than characters?

If it were either/or I would say no. I don't have any application for
32-bit Unicode yet and don't foresee it in my lifetime.

RedGrittyBrick

Nov 17, 2009, 4:31:25 AM

Roedy Green wrote:
> On Mon, 16 Nov 2009 12:42:43 -0800, Daniel Pitts
> <newsgroup....@virtualinfinity.net> wrote, quoted or indirectly
> quoted someone who said :
>
>> What about 64bit codepoints? Wouldn't you rather iterate over codepoints
>> than characters?
>
> if it were either/or I would say no. I don't have any application for
> 32-bit Unicode yet and don't foresee it in my lifetime.

Gadzooks! Do you mean ...

* ASCII is enough.
* ISO 8859-1 is enough.
* Unicode Base Multilingual Plane is enough.
* Something else?

Unicode isn't a 32 bit character set, it's a 21 bit character set[1],
though one *encoding* of Unicode is 32-bits - UTF-32.

[1] http://unicode.org/faq/utf_bom.html#gen0
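
The 21-bit figure can be checked from Java itself: the highest code point is Character.MAX_CODE_POINT (U+10FFFF). A small sketch; bitsNeeded() is a helper made up for this illustration.

```java
class CodePointBits {
    /** Number of bits needed to represent a non-negative int value. */
    static int bitsNeeded(int value) {
        return 32 - Integer.numberOfLeadingZeros(value);
    }

    public static void main(String[] args) {
        System.out.printf("max code point U+%X needs %d bits%n",
                Character.MAX_CODE_POINT, bitsNeeded(Character.MAX_CODE_POINT));
        System.out.printf("max char U+%X needs %d bits%n",
                (int) Character.MAX_VALUE, bitsNeeded(Character.MAX_VALUE));
    }
}
```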
--
RGB

markspace

Nov 17, 2009, 10:05:01 AM

RedGrittyBrick wrote:
>
> Roedy Green wrote:
>> On Mon, 16 Nov 2009 12:42:43 -0800, Daniel Pitts
>> <newsgroup....@virtualinfinity.net> wrote, quoted or indirectly
>> quoted someone who said :
>>
>>> What about 64bit codepoints? Wouldn't you rather iterate over
>>> codepoints than characters?
>>
>> if it were either/or I would say no. I don't have any application for
>> 32-bit Unicode yet and don't foresee it in my lifetime.
>
> Gadzooks! Do you mean ...

> * Something else?

This I think.

If you read the quotes above, you'll notice that Daniel wrote "64 bit
codepoints." I think that's roughly twice as many bits as even the
Unicode Consortium has dreamed of using, and more than twice the
21 bits currently required for the whole shebang, as you point
out.

Daniel Pitts

Nov 17, 2009, 7:19:00 PM

I made a mistake, in a state of cold-medicine-induced delirium ;-) I
meant to say 32bit codepoints, as opposed to 16bit chars.

It doesn't matter if *you* think you need to support it, your clients
will need you to support it one day, randomly, out of the blue. When
your program crashes, or does the wrong thing, it will look bad. Even
if you are able to repair it quickly. It is better to not have to
repair it at all.

Roedy Green

Nov 17, 2009, 7:40:22 PM

On Tue, 17 Nov 2009 09:31:25 +0000, RedGrittyBrick
<RedGrit...@spamweary.invalid> wrote, quoted or indirectly quoted
someone who said :

>Unicode isn't a 32 bit character set, it's a 21 bit character set[1],

Inside Java, with codepoints, you treat it as 32-bit.

What are the codepoints above the 16 bit point?

Aegean numbers, Mormon Deseret, Cuneiform, Shavian, Osmanya
(Somalian), Byzantine music symbols, extended Chinese.

These are not the sorts of symbols used in business. I am not likely
to ever use these. These are more for anthropologists.

The only one plausible is the Alphabetic Mathematical, which really
have no business being codepoints. They could just as easily be fonts.
http://www.unicode.org/charts/PDF/U1D400.pdf
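
To see which block a given code point falls in, Character.UnicodeBlock.of() and Character.isSupplementaryCodePoint() can be used. A sketch; the class name and the sample code points are chosen here purely for illustration.

```java
class SupplementaryBlocks {
    /** Returns the Unicode block of a code point, or null if it is unassigned. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0x10400, 0x1D400 };
        for (int cp : samples) {
            System.out.printf("U+%04X above the 16-bit range: %b, block: %s%n",
                    cp, Character.isSupplementaryCodePoint(cp), blockOf(cp));
        }
    }
}
```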

RedGrittyBrick

Nov 18, 2009, 5:06:28 AM

Roedy Green wrote:
> On Tue, 17 Nov 2009 09:31:25 +0000, RedGrittyBrick
> <RedGrit...@spamweary.invalid> wrote, quoted or indirectly quoted
> someone who said :
>
>> Unicode isn't a 32 bit character set, it's a 21 bit character set[1],
>
> inside java, with codepoints you treat it as 32-bit.
>
> What are the codepoints above the 16 bit point?
>
> Aegean numbers, Mormon Deseret, Cuneiform, Shavian, Osmanya
> (Somalian), Byzantine music symbols, extended Chinese.
>
> These are not the sorts of symbols used in business.

These are not *commonly* used in business.

Amazon is a business, it sells books on those subjects. Businesses like
Amazon sometimes display extracts of the books they sell. The publishers
of those books are also businesses. Well known businesses sometimes
index huge numbers of books and make those indexes accessible to the
public over the web[1].

> I am not likely to ever use these. These are more for anthropologists.

It may be true that you will never do business with anthropologists,
booksellers or people whose job, hobbies or interests involve the
writing systems you listed.

I'd prefer my use of Java not to limit my opportunities, no matter how
unlikely they might seem to me today.

However, like you I think, I'm reluctant to jump through special hoops
to achieve this.

[1]
http://www.archive.org/search.php?query=subject%3A%22Cuneiform%20inscriptions%22
--
RGB

Mayeul

Nov 18, 2009, 7:43:14 AM

My point would rather be that the moment you expose a text input field
to end users, is the moment you must support (or at least reject instead
of misleadingly accept) the entire Unicode range.

Users will install the cool cuneiform font and copy/paste cuneiform
characters because they'll think it's cool or it helps them organize
whatever they're inputting. Short version: because they can.

If I had to trust my clients on it, a lot of people have ancient capital
Greek letters in their last name or phone number shortcode. Especially
Capital Pi and Sigma.
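
The "reject instead of misleadingly accept" option could be sketched as below. This is only one possible policy, and the class and method names are hypothetical; the Character.isSurrogate family of methods is standard (isSurrogate needs Java 7 or later).

```java
class InputCheck {
    /** True if the input contains no surrogate chars, i.e. only BMP characters. */
    static boolean isBmpOnly(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (Character.isSurrogate(input.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** True if every surrogate in the input is part of a proper high/low pair. */
    static boolean isWellFormed(String input) {
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= input.length()
                        || !Character.isLowSurrogate(input.charAt(i + 1))) {
                    return false;   // high surrogate with no low half after it
                }
                i++;                // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;       // low surrogate with no high half before it
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E G clef";
        System.out.println(isBmpOnly(clef));
        System.out.println(isWellFormed(clef));
    }
}
```

A field that cannot handle supplementary characters would reject input failing isBmpOnly(); a field that can handle them would still want isWellFormed() to hold.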

--
Mayeul

RedGrittyBrick

Nov 18, 2009, 10:03:17 AM

Mayeul wrote:
> My point would rather be that the moment you expose a text input field
> to end users,

It's a good point but it doesn't apply to Roedy's original example of
String categories = "amq";

> is the moment you must support (or at least reject instead
> of misleadingly accept) the entire Unicode range.

You are right. This is where Java lets us down.

> Users will install the cool cuneiform font and copy/paste cuneiform
> characters because they'll think it's cool or it helps them organize
> whatever they're inputting. Short version: because they can.

I'm not familiar with handling surrogate pairs, presumably you write
code like

for (int i=0; i<userInput.codePointCount(0,userInput.length()); i++) {
    int codePoint = userInput.codePointAt(i);
    char[] chars = Character.toChars(codePoint);
    if (chars.length == 2) {
        // we have a surrogate pair for a character > \uFFFF
    } else {
        // we have a BMP character
    }
}

--
RGB

RedGrittyBrick

Nov 18, 2009, 10:27:21 AM

Oops, silly me, the parameter for codePointAt() is in terms of chars
not, err, characters.

----------------------------------8<------------------------------------

String userInput = "\uD834\uDD1E" /* U+1D11E */ + " G clef";
System.out.printf("String '%s' has length %d %n",
        userInput, userInput.length());

for (int i=0; i<userInput.length(); i++) {
    int codePoint = userInput.codePointAt(i);
    char[] chars = Character.toChars(codePoint);
    if (chars.length == 2) {
        // it's a surrogate pair
        System.out.printf("%d: pair '%s' code-point %X %n",
                i, new String(chars), codePoint);
        i++; // skip the second char of the pair
    } else {
        // it's a BMP character
        System.out.printf("%d: character '%s' code-point %X %n",
                i, chars[0], codePoint);
    }
}
----------------------------------8<------------------------------------
String '? G clef' has length 9
0: pair '?' code-point 1D11E
2: character ' ' code-point 20
3: character 'G' code-point 47
4: character ' ' code-point 20
5: character 'c' code-point 63
6: character 'l' code-point 6C
7: character 'e' code-point 65
8: character 'f' code-point 66

--
RGB

Mayeul

Nov 18, 2009, 11:33:37 AM

RedGrittyBrick wrote:
> I'm not familiar with handling surrogate pairs, presumably you write
> code like
>
> for (int i=0; i<userInput.codePointCount(0,userInput.length()); i++) {
> int codePoint = userInput.codePointAt(i);
> char[] chars = Character.toChars(codePoint);
> if (chars.length == 2) {
> // we have a surrogate pair for a character > \uFFFF
> } else {
> // we have a BMP character
> }
> }
>

I didn't mean to imply I don't know how to handle it.

The problem is more that in some circumstances where you apply
character-by-character logic, you might need to /know/ you have to
handle it.

Besides, and taking your following correction into account, you probably
want to call Character.charCount() instead of Character.toChars()

-----------------------------------------------------
int codePoint;
for (int i=0; i<userInput.length(); i += Character.charCount(codePoint)) {
    codePoint = userInput.codePointAt(i);

    // Really, character-by-character code should be made
    // to work with codePoint and Character methods from here.
}
--------------------------------------------------------

--
Mayeul

Roedy Green

Nov 19, 2009, 9:31:11 PM

On Wed, 18 Nov 2009 10:06:28 +0000, RedGrittyBrick
<RedGrit...@spamweary.invalid> wrote, quoted or indirectly quoted
someone who said :

>I'd prefer my use of Java not to limit my opportunities, no matter how
>unlikely they might seem to me today.

If you check back to how this strand got going, it was about the need
for 16, 32 or 64 bit String Iterators. I stated, if I could only have
one, it would be 16 bit. I then justified my choice. I am not
preaching deliberate ignorance on codepoints.

I am probably one of the few people who have poked around in that part of
the codespace, motivated by nothing more than raw curiosity.

Something I have noticed in my meanderings is that alphabets are NOT
designed to make the letters as visually distinct as possible. They
are primarily designed to visually blend together in an aesthetically
pleasing way.

Many alphabets have letters almost identical. I am curious to learn
how that came about. It seems like an odd thing to do in designing an
alphabet.

--
Roedy Green Canadian Mind Products
http://mindprod.com

Finding a bug is a sign you were asleep at the switch when coding. Stop debugging, and go back over your code line by line.

Roedy Green

Nov 19, 2009, 9:31:11 PM

On Wed, 18 Nov 2009 15:03:17 +0000, RedGrittyBrick
<RedGrit...@spamweary.invalid> wrote, quoted or indirectly quoted
someone who said :

>> is the moment you must support (or at least reject instead
>> of misleadingly accept) the entire Unicode range.
>
>You are right. This is where Java lets us down.

In Abundance, my own language, fields come with filters that describe
what chars they accept when keyed. If you allow lower case letters
only, the upper case get translated to lower case as they are keyed,
and accented letters get their accents stripped. Invalid chars keyed
do nothing other than cause a short distinctive invalid char noise.
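
For illustration only, a lower-case-letters-only filter of that kind might be approximated in Java as below. This is not how Abundance does it; the class and method names are invented, and the sketch filters a whole string at once rather than key by key. java.text.Normalizer is the standard class used to split accents from their base letters.

```java
import java.text.Normalizer;

class LowerCaseFilter {
    /** Folds case, strips accents, and drops anything that is not a-z. */
    static String filter(String keyed) {
        // NFD decomposes an accented letter into base letter + combining marks
        String decomposed = Normalizer.normalize(keyed, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); i++) {
            char c = Character.toLowerCase(decomposed.charAt(i));
            if (c >= 'a' && c <= 'z') {
                out.append(c);
            }
            // Combining marks and all other chars are silently dropped here;
            // a real data entry field would signal the invalid keystroke.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("Caf\u00E9 Ol\u00E9!"));
    }
}
```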

The type information is used to automatically generate prompt
information about what is acceptable.

Java is still downright backward when it comes to data entry. Even
KEYPUNCHES were more programmer and user friendly.

Abundance (circa 1980) did data entry, enforcing low/high bounds,
valid phone numbers, postal codes, zips, provinces, countries, states,
currency, dates, optional/mandatory, check digits, credit card
numbers, fields, ... without any programming other than specifying
the field type and bounds.

The world seems to have lost interest in high speed data entry.

--
Roedy Green Canadian Mind Products
http://mindprod.com

RedGrittyBrick

Nov 21, 2009, 9:22:44 AM

Roedy Green wrote:
> On Wed, 18 Nov 2009 10:06:28 +0000, RedGrittyBrick
> <RedGrit...@spamweary.invalid> wrote, quoted or indirectly quoted
> someone who said :
>
>> I'd prefer my use of Java not to limit my opportunities, no matter how
>> unlikely they might seem to me today.
>
> If you check back to how this strand got going, it was about the need
> for 16, 32 or 64 bit String Iterators. I stated, if I could only have
> one, it would be 16 bit. I then justified my choice. I am not
> preaching deliberate ignorance on codepoints.

If I could only have one, I would have one that hides from me the number
of bits and the encoding used by the underlying representation.

e.g.

for (UnicodeCharacter char: astring) ...

UnicodeCharacter char = String.UnicodeCharacterAtOrdinal(7);

Where 7 is the eighth Unicode Character in the string regardless of how
many 8-bit or 16-bit units precede it in the underlying representation.

--
RGB

RedGrittyBrick

Nov 21, 2009, 9:55:17 AM

Roedy Green wrote:
>
> Something I have noticed in my meanderings is that alphabets are NOT
> designed to make the letters as visually distinct as possible. They
> are primarily designed to visually blend together in an aesthetically
> pleasing way.
>
> Many alphabets have letters almost identical. I am curious to learn
> how that came about. It seems like an odd thing to do in designing an
> alphabet.

We are using the word alphabet rather loosely to mean the basic letter
shapes used in modern alphabetic writing systems used for human
communications. I imagine most commonly used alphabets don't get
designed, they evolve, split and coalesce in an unplanned way over
millennia.

If they were designed, it was for a different purpose (tallying,
accounting, administration) than that for which we mostly use them.

I like http://www.textism.com/writing/?id=1
but Wikipedia is probably more enlightening:
http://en.wikipedia.org/wiki/History_of_the_alphabet
or
http://whyfiles.org/079writing/index.html

I imagine that similar looking letters can arise in the same way that
homonyms arise. The shapes of letters and the pronunciation and spelling
of distinct words change over hundreds of years. Eventually we have
visual representations that now look the same (or very similar) but have
differing meanings.

AIUI, we don't read letter-shapes. We mostly read word-shapes.

--
RGB

Lew

Nov 21, 2009, 12:55:59 PM

RedGrittyBrick wrote:
> If I could only have one, I would have one that hides from me the number
> of bits and the encoding used by the underlying representation.
>
> e.g.
>
> for (UnicodeCharacter char: astring) ...
>
> UnicodeCharacter char = String.UnicodeCharacterAtOrdinal(7);
>
> Where 7 is the eighth Unicode Character in the string regardless of how
> many 8-bit or 16-bit units precede it in the underlying representation.

String#codePointAt( int index )
String#codePointCount( int beginIndex, int endIndex )

provide crude tools to sort of do that.

--
Lew

Lew

Nov 21, 2009, 12:59:53 PM

RedGrittyBrick wrote:
>> If I could only have one, I would have one that hides from me the
>> number of bits and the encoding used by the underlying representation.
>>
>> e.g.
>>
>> for (UnicodeCharacter char: astring) ...
>>
>> UnicodeCharacter char = String.UnicodeCharacterAtOrdinal(7);
>>
>> Where 7 is the eighth Unicode Character in the string regardless of
>> how many 8-bit or 16-bit units precede it in the underlying
>> representation.

Lew wrote:
> String#codePointAt( int index )
> String#codePointCount( int beginIndex, int endIndex )

and String#offsetByCodePoints(int index, int codePointOffset)
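
Put together, those three methods give something close to the wished-for "character at ordinal" lookup. A sketch; the class and method names are made up, and each lookup has to walk the string from the start, so it is not a constant-time operation.

```java
class CodePointOrdinal {
    /** Code point at a 0-based ordinal counted in code points, not chars. */
    static int codePointAtOrdinal(String s, int ordinal) {
        // Translate "n-th code point" into a char index; throws
        // IndexOutOfBoundsException if the ordinal is out of range.
        int charIndex = s.offsetByCodePoints(0, ordinal);
        return s.codePointAt(charIndex);
    }

    /** Length of the string in code points. */
    static int lengthInCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "\uD834\uDD1E G clef";
        System.out.println(text.length());
        System.out.println(lengthInCodePoints(text));
        System.out.printf("U+%04X%n", codePointAtOrdinal(text, 2));
    }
}
```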