Arbitrary instance for Strings


rm

May 15, 2011, 12:53:25 PM
to scalacheck
I noticed today that the standard Arbitrary[String] will never
generate strings with characters that lie outside the Basic
Multilingual Plane (because it ultimately delegates to arbChar, which
never selects surrogate characters). I wrote this to get full
Unicode strings; perhaps it would be useful (and hopefully Google
Groups won't mangle a code-paste):

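import org.scalacheck.{Arbitrary, Gen}

val arbFullUnicodeString = Arbitrary[String] {
  // Any low (trailing) surrogate, U+DC00..U+DFFF.
  val lowSurrogate = Gen.choose(Character.MIN_LOW_SURROGATE,
                                Character.MAX_LOW_SURROGATE).map(_.toChar)

  // Any Char except a low surrogate; the two ranges are weighted by
  // their sizes so that every such Char is equally likely.
  val notLowSurrogate = Gen.frequency(
    (Character.MIN_LOW_SURROGATE - Char.MinValue,
      Gen.choose(Char.MinValue, Character.MIN_LOW_SURROGATE - 1)),
    (Char.MaxValue - Character.MAX_LOW_SURROGATE,
      Gen.choose(Character.MAX_LOW_SURROGATE + 1, Char.MaxValue))
  ).map(_.toChar)

  // A high surrogate is always followed by a low one, so every
  // generated element is exactly one valid code point.
  val validCodePoint = notLowSurrogate flatMap { a =>
    if (a.isHighSurrogate) lowSurrogate map { b => new String(Array(a, b)) }
    else a.toString
  }

  Gen.containerOf[List, String](validCodePoint) map (_.mkString)
}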

I suppose it'd also be possible to add a "suchThat" filter to ensure the
first character isn't a combining character, though I didn't
because I couldn't find anything in the Unicode standard that would
forbid it, even though it's pretty nonsensical.
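
For anyone who does want that filter, here is a minimal sketch of the predicate it would need, using only java.lang.Character. The object and method names are hypothetical, and treating the general categories Mn, Mc and Me as "combining characters" is an assumption about what the filter should reject.

```scala
object CombiningMarks {
  // Assumption: "combining character" means general category Mn, Mc or Me.
  private val markTypes: Set[Int] = Set(
    Character.NON_SPACING_MARK.toInt,
    Character.COMBINING_SPACING_MARK.toInt,
    Character.ENCLOSING_MARK.toInt
  )

  /** True if the first code point of `s` is a combining mark.
    * The empty string has no first code point, so it yields false.
    */
  def startsWithCombiningMark(s: String): Boolean =
    s.nonEmpty && markTypes.contains(Character.getType(s.codePointAt(0)))
}
```

It could then be applied along the lines of arbFullUnicodeString.arbitrary.suchThat(s => !CombiningMarks.startsWithCombiningMark(s)), at the cost of discarding some generated strings.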

Yuvi Masory

May 16, 2011, 8:20:18 AM
to scala...@googlegroups.com
> I noticed today that the standard Arbitrary[String] will never
> generate strings with characters that lie outside the Basic
> Multilingual Plane

The Scala spec states that only characters in the basic multilingual plane are supported, so that's probably a good default for ScalaCheck. But what you've written could be a valuable addition for when you're testing Java or doing something else with Strings. I've been writing a bunch of generators for Unicode if you're interested: https://github.com/quala/qualac/blob/master/src/main/scala/lex/Characters.scala

Yuvi


rm

May 16, 2011, 10:36:17 AM
to scalacheck
On May 16, 5:20 am, Yuvi Masory <ymas...@gmail.com> wrote:
> > I noticed today that the standard Arbitrary[String] will never
> > generate strings with characters that lie outside the Basic
> > Multilingual Plane
>
> The Scala spec states that only characters in the basic multilingual plane
> are supported, so that's probably a good default for ScalaCheck.

Actually, it says that "Scala programs are written using the Unicode
Basic Multilingual Plane (BMP) character set"; Char, to the extent
it's specified at all, is merely an unsigned 16-bit integer type, and
String of course is ("usually" says the spec, but in practice always)
the underlying platform's string class, and therefore UTF-16. Anyway,
since there are no non-BMP characters in the actual source code, it's
fine according to my reading.
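
To make the UTF-16 point concrete, here is a small sketch assuming the JVM, where String is java.lang.String. The object name is hypothetical, and U+1D11E (MUSICAL SYMBOL G CLEF) is just an arbitrary code point outside the BMP.

```scala
object Utf16Demo {
  // Encode one supplementary code point as a UTF-16 string.
  val gClef: String = new String(Character.toChars(0x1D11E))

  // Number of 16-bit Chars (UTF-16 code units) in the string.
  val charCount: Int = gClef.length

  // Number of Unicode code points in the string.
  val codePointCount: Int = gClef.codePointCount(0, gClef.length)

  // The two Chars form a high/low surrogate pair.
  val isSurrogatePair: Boolean =
    Character.isHighSurrogate(gClef.charAt(0)) &&
      Character.isLowSurrogate(gClef.charAt(1))
}
```

A single non-BMP character therefore shows up as two Chars, which is why a generator built purely on arbChar without surrogates can never produce one.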

> But what
> you've written could be a valuable addition for when you're testing Java or
> doing something else with Strings. I've been writing a bunch of generators
> for Unicode if you're interested: https://github.com/quala/qualac/blob/master/src/main/scala/lex/Charac...

...and that's actually very useful indeed. Dealing with Unicode on the
code point level instead of the UTF-16 stuff the JVM and CLR force on
us is... well, still not exactly a walk in the park, but better.
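
As a rough illustration of what working on the code point level can look like on the JVM, here is a sketch that converts between a String and its code points. The object and method names are hypothetical and this is not taken from the linked generators.

```scala
object CodePoints {
  /** Decodes a UTF-16 string into its sequence of Unicode code points. */
  def toCodePoints(s: String): List[Int] = {
    val out = List.newBuilder[Int]
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)
      out += cp
      // Advance by 1 Char for a BMP code point, 2 for a supplementary one.
      i += Character.charCount(cp)
    }
    out.result()
  }

  /** Encodes code points back into a UTF-16 string.
    * Throws IllegalArgumentException if a value is not a valid code point.
    */
  def fromCodePoints(cps: Seq[Int]): String = {
    val arr: Array[Int] = cps.toArray
    new String(arr, 0, arr.length)
  }
}
```

With helpers like these, a generator can produce plain Ints in the code point range and leave the surrogate handling to the encoding step.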

Ingvar Bogdahn

Jul 27, 2011, 7:13:13 AM
to scala...@googlegroups.com
Hi,

I have problems with Strings when testing Java. I'd like to use your code, but the link is offline. Could you please rehost it? And, if it isn't evident from the code, I'd be grateful for a quick hint on how to generate Java-compatible strings.

thanks
Ingvar