Question about the implementation of WHITESPACE CharMatcher

118 views
Skip to first unread message

Qing Shu

unread,
Feb 8, 2014, 2:04:18 AM2/8/14
to guava-...@googlegroups.com
I'm writing a presentation for sharing guava libraries to my colleague , when i looked up the implementation of CharMatcher and notice a field WHITESPACE_MULTIPLIER=1682554634

i try to change this value to 1582554634 , running the testcase CharMatcherTest#testWhitespaceBreakingWhitespaceSubset , of course it failed.

after that i changed testWhitespaceBreakingWhitespaceSubset only invoke WHITESPACE.apply((char)c) without asset,   print the index in the method of WHITESPACE.matches 

(index=(WHITESPACE_MULTIPLIER * c) >>> WHITESPACE_SHIFT)

finally find out that index collided  after changed the WHITESPACE_MULTIPLIER from 1682554634 to  1582554634

No doubt, 1682554634 is well designed ,  my question is how can i infer this value ?
 
       sincerely 

Martin Grajcar

unread,
Feb 8, 2014, 12:37:40 PM2/8/14
to Qing Shu, guava-discuss
You can't infer it, but you can find it by systematic or pseudo-random search as I did. You need to find a formula generating unique indexes for 22 chars. Using a table of size 32 and a multiplication with a downshift is pretty fast and distributes the indexes well.

It's fast enough for a systematic search, with negative values omitted due to symmetry, you only have to try 2 billion values. Looping from 0 to MAX is not the smartest way as changing the value by 1 seldom changes the outcome, so something like

for (int i=0; i>=0; ++i) {
   int multiplier = 123456789 * i;
   if (hasNoConflict(multiplier)) return multiplier;
}

makes more sense. Assuming random distribution (which isn't true), you're done after about 20k tries.


--
--
guava-...@googlegroups.com
Project site: http://guava-libraries.googlecode.com
This group: http://groups.google.com/group/guava-discuss
 
This list is for general discussion.
To report an issue: http://code.google.com/p/guava-libraries/issues/entry
To get help: http://stackoverflow.com/questions/ask (use the tag "guava")
---
You received this message because you are subscribed to the Google Groups "guava-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to guava-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/guava-discuss/5d430f35-be91-457b-bcf4-56f7981d9a77%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Qing Shu

unread,
Feb 9, 2014, 9:25:18 PM2/9/14
to guava-...@googlegroups.com, Qing Shu
Hi Martin :
I try to write multiper generator  like this and worked :

char[] charsReq = WHITESPACE_TABLE.toCharArray();
Arrays.sort(charsReq);

for (int WHITESPACE_MULTIPLIER_WANTTED = 1682553701; WHITESPACE_MULTIPLIER_WANTTED <= 1682554834; WHITESPACE_MULTIPLIER_WANTTED++) {
            int i = 0;
            for (int c = 0; c <= Character.MAX_VALUE; c++) {
                if (Arrays.binarySearch(charsReq, (char) c) >= 0) {
                    if (WHITESPACE_TABLE.charAt((WHITESPACE_MULTIPLIER_WANTTED * c) >>> WHITESPACE_SHIFT) == c) {
                        i++;
                    } else {
                        break;
                    }
                } else {
                    if (WHITESPACE_TABLE.charAt((WHITESPACE_MULTIPLIER_WANTTED * c) >>> WHITESPACE_SHIFT) != c) {
                        i++;
                    } else {
                        break;
                    }
                }
            }

            if ((i - 1) == (int) (Character.MAX_VALUE)) {
                System.out.println(WHITESPACE_MULTIPLIER_WANTTED);
            }
        }

another question is WHITESPACE_TABLE ,   WHITESPACE_TABLE has lots of repeated character ,  the uniq count is 25 , the actual count is 32.  sequece of character and repeated \u3000 for padding are programming trick for performance gains 
is this assumption right ?

Martin Grajcar

unread,
Feb 9, 2014, 10:05:57 PM2/9/14
to Qing Shu, guava-discuss
On Mon, Feb 10, 2014 at 3:25 AM, Qing Shu <magic...@gmail.com> wrote:
another question is WHITESPACE_TABLE ,   WHITESPACE_TABLE has lots of repeated character ,  the uniq count is 25 , the actual count is 32.  sequece of character and repeated \u3000 for padding are programming trick for performance gains 
is this assumption right ?

In general such questions are better placed on http://stackoverflow.com with the tag guava. The table size must be power of two in order for the downshift to work, that's why 32 gets used. For the content of the unused entries you can use any character c, even a non-matching one, but then you'd need TABLE[index(c)] != c or a special condition like in the SmallCharMatcher. Using a matching character as a filler is the simplest choice.
Reply all
Reply to author
Forward
0 new messages