Scala primitives performance

Greg Zoller

unread,

Feb 17, 2015, 10:13:25 AM2/17/15

to scala...@googlegroups.com

I am working with a JSON tokenizer and comparing an algorithm in Scala with a very similar approach in Java. The Scala version is a full 5x slower.

This particular algorithm mostly works with large quantities of primitive types char, int, and byte. I read in Scala these are objects where Java uses unwrapped primitives. Would this account for a significant difference in speed, i.e. is Scala spending time creating lots of wrapper objects for these primitives?

Oliver Ruebenacker

unread,

Feb 17, 2015, 10:28:26 AM2/17/15

to Greg Zoller, scala-user

Hello,

Ideally, Scala only boxes primitives if it has to, so that should not be a performance penalty. However, you can easily inadvertently write code that forces Scala to box things, for example, type parameters need to be boxed, such as Seq[Int], unless you use specialization.

Best, Oliver

On Tue, Feb 17, 2015 at 10:13 AM, Greg Zoller <gzo...@gmail.com> wrote:

I am working with a JSON tokenizer and comparing an algorithm in Scala with a very similar approach in Java. The Scala version is a full 5x slower.
This particular algorithm mostly works with large quantities of primitive types char, int, and byte. I read in Scala these are objects where Java uses unwrapped primitives. Would this account for a significant difference in speed, i.e. is Scala spending time creating lots of wrapper objects for these primitives?

--
You received this message because you are subscribed to the Google Groups "scala-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Oliver Ruebenacker

Solutions Architect at Altisource Labs

Be always grateful, but never satisfied.

Paul Keeble

unread,

Feb 17, 2015, 10:33:56 AM2/17/15

to scala-user

If the performance is an issue then go ahead and profile it and this will tell you why the Scala version is slower. Boxing could certainly be part of the problem but it not might not be the only thing. For example when I was writing Freezer I found doing a pattern match for the left most element of an array caused the entire array to be converted to a Seq first, and since I was basically walking through a binary message start to end this was done effectively N times. I was a little surprised but it was easy to fix once I profiled it and saw the cause of the problem and I learnt something about how the collections worked.

So rather than guessing as to the cause profile it, just the free java included one will probably tell you what you need to know.

PK

Greg Zoller

unread,

Feb 17, 2015, 7:30:45 PM2/17/15

to scala...@googlegroups.com

Profiling was a huge help. Cut time solidly in half after finding some things like calling functions passing large arrays. Got my performance to on-par with Jackson's parser. Most of the remaining time according to the profiler seems to be accessing the large arrays by index, i.e. myArr(i). There's something in the profiler about time spent somewhere in mutable Array in Scala (not using mutable Arrays, directly anyway). This might be a minor bit of wrapping to pimp the java Arrays with the Scala Arrays, but that may be what's costing me a bit.

The amazing/depressing thing is I was trying to get close to a tokenizer designed by jenkov in Java. Where mine and Jackson's would tokenize a 5m file in about 75ms, jenkov's consistently parsed less than half that. Truly an impressive piece of work.

Vlad Ureche

unread,

Feb 17, 2015, 8:29:02 PM2/17/15

to Greg Zoller, scala-user

2015-02-18 1:30 GMT+01:00 Greg Zoller <gzo...@gmail.com>:

Profiling was a huge help. Cut time solidly in half after finding some things like calling functions passing large arrays. Got my performance to on-par with Jackson's parser. Most of the remaining time according to the profiler seems to be accessing the large arrays by index, i.e. myArr(i). There's something in the profiler about time spent somewhere in mutable Array in Scala (not using mutable Arrays, directly anyway). This might be a minor bit of wrapping to pimp the java Arrays with the Scala Arrays, but that may be what's costing me a bit.

The amazing/depressing thing is I was trying to get close to a tokenizer designed by jenkov in Java. Where mine and Jackson's would tokenize a 5m file in about 75ms, jenkov's consistently parsed less than half that. Truly an impressive piece of work.

This may be a bit more work, but if you’re after primitive performance, the miniboxing plugin can provide some useful warnings:

$ git clone https://github.com/miniboxing/miniboxing-example.git
Cloning into 'miniboxing-example'...

$ cd miniboxing-example/
$ sed -i 's/Nil/"-P:minibox:warn-all"::Nil/' project/Build.scala # get it to display warnings for the library as well
$ sbt console
...
scala> 3::Nil
<console>:8: warning: The method scala.collection.immutable.List.:: would benefit from miniboxing type parameter B, since it is instantiated by a primitive type.
              3::Nil
               ^
res0: List[Int] = List(3)

Miniboxing is something like @specialized but more efficient and user-friendly, and you’ll get these warnings whenever the code is going to box.
The website for the plugin is scala-miniboxing.org and there’s a presentation there explaining these warnings.

HTH,
Vlad

Greg Zoller

unread,

Feb 18, 2015, 12:44:08 AM2/18/15

to scala...@googlegroups.com, gzo...@gmail.com

This is some fine work, and I hope it will soon be a part of Scala's standard compiler. Unfortunately when I tried miniboxing it gave me warnings--about all the Scala standard collections! (My code doesn't define any new parameterized anything--just uses Scala collections.) I think I'd have to rebuild Scala itself with the miniboxed plugin to see how much improvement I'd get. I do believe this is probably where the last bit of "lost" performance is (vs Java).

Primitives are currently a "hole" in Scala performance and while I understand why we don't want them in the language, the minibox adaptation is a great optimization.

Sonnenschein

unread,

Feb 18, 2015, 3:14:34 AM2/18/15

to scala...@googlegroups.com, gzo...@gmail.com

But collections must box alike. So what are the concrete collections you are using on Java respectively Scala side?
Btw, you can use special collections or just primitive arrays to deal with primitive types both in Java and Scala.
Peter

Sonnenschein

unread,

Feb 18, 2015, 3:15:44 AM2/18/15

to scala...@googlegroups.com, gzo...@gmail.com

I mean: Java collections must box alike.

Vlad Ureche

unread,

Feb 18, 2015, 3:56:03 AM2/18/15

to Greg Zoller, scala-user

2015-02-18 6:44 GMT+01:00 Greg Zoller <gzo...@gmail.com>:

I think I'd have to rebuild Scala itself with the miniboxed plugin to see how much improvement I'd get. I do believe this is probably where the last bit of "lost" performance is (vs Java).

Yes, it would be really nice to have miniboxing in Scala and all the collections optimized. And it's getting there, slowly.

Primitives are currently a "hole" in Scala performance and while I understand why we don't want them in the language, the minibox adaptation is a great optimization.

Have you tried the scala-blitz optimizer? It's capable of optimizing and fusing collection operations out of the box, so you also get rid of the boxing (at least in the optimized scopes).

HTH,

Vlad

Vlad Ureche

unread,

Feb 18, 2015, 4:02:12 AM2/18/15

to Sonnenschein, scala-user, Greg Zoller

2015-02-18 9:14 GMT+01:00 Sonnenschein <peter...@arcor.de>:

But collections must box alike. So what are the concrete collections you are using on Java respectively Scala side?
Btw, you can use special collections or just primitive arrays to deal with primitive types both in Java and Scala.

When using generic Array[T], both element access and update cause boxing:

$ cat array.scala 
import scala.reflect.ClassTag

class Test[T: ClassTag](t: T) {
  val array = new Array[T](3)
  array(0)
  array(0) = t
}
$ scalac array.scala -Xprint:erasure
[[syntax trees at end of                   erasure]] // array.scala
package <empty> {
  class Test extends Object {
    ...
    runtime.this.ScalaRunTime.array_apply(Test.this.array(), 0);
    runtime.this.ScalaRunTime.array_update(Test.this.array(), 0, Test.this.t)
  }
}

And you can see array_apply and array_update in the Scala repo: they return and receive Any, which is erased to Object.
Not going into why this is done, but it’s the only correct way to do it.

Vlad

Sonnenschein

unread,

Feb 18, 2015, 5:52:06 AM2/18/15

to scala...@googlegroups.com, peter...@arcor.de, gzo...@gmail.com

So are you telling that even having an array of primitives, for instance of jvm ints with the class name of [I, accessing its elements causes boxing?

val a = Array(1)
a(0) // ???

Dennis Haupt

unread,

Feb 18, 2015, 6:51:57 AM2/18/15

to Sonnenschein, scala-user, gzo...@gmail.com

i think/hope he means that if you generically create an array (Array[T]), every element has to be boxed, but if you create a primitive array directly (Array[Int]), it does not have to happen.

--

Vlad Ureche

unread,

Feb 18, 2015, 9:20:16 AM2/18/15

to Dennis Haupt, Sonnenschein, scala-user, Greg Zoller

2015-02-18 12:51 GMT+01:00 Dennis Haupt <d.ha...@gmail.com>:

i think/hope he means that if you generically create an array (Array[T]), every element has to be boxed, but if you create a primitive array directly (Array[Int]), it does not have to happen.

Yes, but it’s a bit more subtle. When you create an Array[T], you need a ClassTag[T], which allows the compiler to create the correct unboxed array type (int[], long[], …). So values are stored unboxed in the array. Yet, when you access this array as a generic Array[T], the method doing the access (array_apply/array_update in ScalaRuntime) will return/accept a boxed value. For example:

def firstElement[T](arr: Array[T]): T = arr(0) // bytecode: ScalaRuntime.array_apply(arr, 0), which returns an Object
val array = new Array[Int](100)                // bytecode: classTag[Int].newArray(...), which creates an int[]
val x: Int = array(0)                          // bytecode: access using iaload bytecode instruction, returning an int
val y: Int = firstElement(array)               // bytecode: calling firstElement, which returns an object

HTH,
Vlad

Vlad Ureche

unread,

Feb 18, 2015, 9:27:30 AM2/18/15

to Dennis Haupt, Sonnenschein, scala-user, Greg Zoller

To summarize:

creating a generic Array[T] requires a ClassTag[T] in order to instantiate the unboxed variant of the array (e.g. int[], long[])
reading/updating an array of the monomorphic type (e.g. Array[Int]) is done without boxing via bytecode instructions (e.g. iaload, laload etc)
reading/updating an array of the polymorphic type (e.g. Array[T]) is done by calling ScalaRuntime, which pays for the dispatch and for boxing the result (or unboxing the argument)

These are the three golden rules of arrays. Then, collections always take the 3rd path, since they’re not specialized.
I hope this clarifies what I meant…

Vlad

Greg Zoller

unread,

Feb 18, 2015, 10:25:01 AM2/18/15

to scala...@googlegroups.com, d.ha...@gmail.com, peter...@arcor.de, gzo...@gmail.com

Ok, guys... the problem may be a little deeper than just collections. I've uploaded a short sample here: https://github.com/gzoller/whiplash

You can see how I'm using arrays in the Whiplash class.

These are 2 trivial JSON tokenizers, one written in Scala by me and one in Java by Jenkov. Just to make the issue obvious, I've commented out all the "guts" of both algorithms, so all they're doing is just looping over a 5mb test JSON file (also included in the repo). The Scala version takes ~15ms to run. The Java version needs just 2-3ms. Wow. You can do 'sbt run' to see yourself (takes several seconds for the data file to load).

Un-commenting Jenkov's code so it fully works moves the run for the Java to about 36ms. So the fully-operational Java tokenizer takes only twice what the Scala version needs to just loop with no logic at all!

BTW, the reason I show this approach (commenting out) is to factor out any inefficiencies of my algorithm. FWIW uncommenting my code requires ~100ms to run. Not bad, really, but still about 3x slower than the Java code, and as you can see my code is pretty primitive.

Rex Kerr

unread,

Feb 18, 2015, 3:02:44 PM2/18/15

to Greg Zoller, scala-user, Dennis Haupt, Peter Empen

You didn't perform an exact translation of the Java code into Scala. What happens if you perform an exact translation of your Scala code into Java?

For example, let's look at the difference in scanString. In your Scala code you create a function which is one extra level of indirection; then, in the loop you check the end variable once every time; and you parse every single \\ unlike the Java code which only checks the last one (and AFAICT has a bug on strings like "blah\\"). You would _think_ your version would be at least as fast, and it's more obviously correct, but having really simple code in loops is a huge help, so it wouldn't surprise me if this makes a measurable difference.

It's entirely possible (easy, even) to write Java code like you wrote the Scala code. But little choices like these matter in tight loops. Even in your reduced code I can't follow the logic easily enough to come up with an obvious suspect for the bulk of the difference, but I do notice differences that do matter (like the above).

--Rex

--

Rex Kerr

unread,

Feb 18, 2015, 3:19:17 PM2/18/15

to Greg Zoller, scala-user, Dennis Haupt, Peter Empen

I should say--the Java code does _notice_ when there are '\'s but it's the simplest possible logic. Your version actually handles the looping differently, which means the loop variable i doesn't just go up one by one, which is something that the JIT compiler looks for to optimize array dereferences without needing to check bounds very often.

--Rex

Vlad Ureche

unread,

Feb 19, 2015, 3:59:22 PM2/19/15

to Sonnenschein, scala-user, Greg Zoller

2015-02-18 11:52 GMT+01:00 Sonnenschein <peter...@arcor.de>:

No, an array of primitive types (e.g.Array[Int]) is always represented as Java’s unboxed array (e.g.int[]) and reads/updates are on par with Java operations.
However, if you do:

def firstElement[T](arr: Array[T]): T = arr(0)

the body of firstElement will go through the generic dispatch in ScalaRuntime, since unboxed arrays (e.g.int[],long[], etc) have no common ancestor other than Object, which does not expose methods for reading/updating.

By the way, all operations mixing collections and arrays are generic, so they go through ScalaRuntime. :(

Vlad

Greg Zoller

unread,

Feb 19, 2015, 4:00:50 PM2/19/15

to scala...@googlegroups.com, gzo...@gmail.com, d.ha...@gmail.com, peter...@arcor.de

Many thanks to everyone who posted on this thread!! Great tips.

Here's where I finally landed... after profiling and taking your input into consideration, I moved my original code, which started at >200ms to 40ms parsing the same data set. I'm very happy with that. It's 1/3 faster than Jackson and while still about 10ms slower than Jenkov's algorithm I think my approach is more comprehensive in how it parses the JSON. (Also since the data set was a 5mb JSON file--which is a pretty big file--I'm not going to quibble over that last 10ms. Much bigger JSON than that and you probably want a streaming parser anyway.)

This code will either be next-gen implementation for my ScalaJack library.

Simon Ochsenreither

unread,

Feb 19, 2015, 7:10:52 PM2/19/15

to scala...@googlegroups.com, peter...@arcor.de, gzo...@gmail.com

Btw, I benchmarked "generic" array access and array length in Scala vs. Java, and the results seem to be pretty weird:

[info] Benchmark                      Mode Cnt Score   Error Units
[info] ArrayOpBench.javaArrayApply    avgt   10 0,071 ± 0,004 us/op
[info] ArrayOpBench.javaArrayLength   avgt   10 0,006 ± 0,000 us/op
[info] ArrayOpBench.scalaArrayApply   avgt   10 0,008 ± 0,001 us/op
[info] ArrayOpBench.scalaArrayLength avgt   10 0,006 ± 0,000 us/op

So accessing an array element is 10 times slower in Java despite Java and Scala having the same implementation, while accessing the length of an array is pretty much identical, despite the Java version being an intrinsic in the JVM.
(Maybe I did only mess up the benchmark...)

Details: https://gist.github.com/soc/51a8f0fd01d676b4799c

Dennis Haupt

unread,

Feb 19, 2015, 7:17:45 PM2/19/15

to Simon Ochsenreither, scala-user, Peter Empen, Greg Zoller

magic jvm optimization?

magic scala compiler optimization?

--

Rex Kerr

unread,

Feb 19, 2015, 8:43:59 PM2/19/15

to Simon Ochsenreither, scala-user, Peter Empen, Greg Zoller

It's atrociously slow in my tests also. I'm not sure why. The C++ source doesn't contain any clues, but I haven't found source that precisely matches up with what I expect to see. Maybe something to do with clobbering registers? If you have to push and pop the full set of registers on the stack for each access (like you do with JNI) then you could possibly end up with the 70 ns/call that you see. It shouldn't be that bad on average; 5 branches cached in L2 should be ~20 ns if branch prediction is wrong and less if prediction is right.

Also, with non-native code the JVM might unroll the loop and notice that there's a pattern. What happens if you permute the array?

--Rex

--

Jason Zaugg

unread,

Feb 19, 2015, 9:33:53 PM2/19/15

to Rex Kerr, Simon Ochsenreither, scala-user, Peter Empen, Greg Zoller

On Fri, Feb 20, 2015 at 11:43 AM, Rex Kerr <ich...@gmail.com> wrote:

It's atrociously slow in my tests also. I'm not sure why. The C++ source doesn't contain any clues, but I haven't found source that precisely matches up with what I expect to see. Maybe something to do with clobbering registers? If you have to push and pop the full set of registers on the stack for each access (like you do with JNI) then you could possibly end up with the 70 ns/call that you see. It shouldn't be that bad on average; 5 branches cached in L2 should be ~20 ns if branch prediction is wrong and less if prediction is right.

Also, with non-native code the JVM might unroll the loop and notice that there's a pattern. What happens if you permute the array?

Looks like we noticed this around Scala 2.8. Ismael Juma reported it to the OpenJDK mailing list, but I don't find any followup discussion.

One difference I notice is that the intrinsic Reflection::array_get in OpenJDK performs the bounds checking before switching on the element type, whereas our version switches first on the element type (bounds checking will happen under the hoods when we access the element.)

-jason

Reply all

Reply to author

Forward