Scala array for Spark GraphX use: "Method <init>'s code too large!"


bertlhf

Apr 8, 2015, 10:45:42 AM
to scala...@googlegroups.com
Hi,

I'm using Scala as a tool for developing and using a rail network in Apache Spark GraphX. GraphX uses Scala 2.10.4.

I initialize Scala vertex and edge arrays representing a rail network in preparation for use in Apache Spark GraphX's `sc.parallelize` method. When adding a new station, passing from 489 to 490 vertices, I got an "exceeds JVM code size limits" message.

The REPL transcript:

scala> :load ../../scala/alt-graphx-1.scala
Loading ../../scala/alt-graphx-1.scala...
altVertexArray: Array[(Long, (String, List[String], Map[String,String]))] = Array((1,(wien-nordbf,List(Wien Nordbf),Map(wien-krak-s103 -> ab 21:35, wien-ber-w1958 -> ab 19:58, prat-hut-okb3 -> ||, ber-wien-w1911 -> an 08:04, lund-wien-okb29 -> km 83,1, ber-wien-w2327 -> an 15:00, ber-wien-w1616 -> an 06:00, ber-wien-w0813 -> an 21:32, wien-ber-w0728 -> ab 07:28, wien-lund-okb29 -> km 0,0, wien-ber-w2225 -> ab 22:25, hut-prat-okb3 -> ||, wien-ber-w1600 -> ab 16:00))), (2,(floridsdorf,List(Floridsdorf),Map(lund-wien-okb29 -> km 78,0, wien-krak-s103 -> ab 21:43, wien-lund-okb29 -> km 5,1))), (3,(leopoldau,List(Leopoldau),Map(lund-wien-okb29 -> km 76,1, wien-lund-okb29 -> km 7,0))), (4,(sussenbrunn,List(Süßenbrunn),Map(lund-wien-okb29 -> km 71,3, wien-lund-okb29 -> km 11,8))), (5,(deutsch-w...

res7: Int = 489 // load OK, I have now 489 vertices

scala> :load ../../scala/alt-graphx-1.scala
Loading ../../scala/alt-graphx-1.scala...
<console>:18: error: Could not write class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC because it exceeds JVM code size limits. Method <init>'s code too large!
class $iwC extends Serializable {
      ^
res8: Int = 489 // load failed, I still have the "old" 489 vertices instead of the hoped for 490 vertices


What can be done about this?

Oliver Ruebenacker

Apr 8, 2015, 12:16:14 PM
to bertlhf, scala-user

     Hello,

  The JVM does not allow a method to have more than 64 KB of bytecode, so you cannot stuff this much data into a single method. In Scala, the initialization code of a class is compiled into one method, the constructor, which the JVM calls `<init>`.

  You could break it up into multiple methods. But much better: put it in a data file or database.

     Best, Oliver

--
You received this message because you are subscribed to the Google Groups "scala-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Oliver Ruebenacker
Solutions Architect at Altisource Labs
Be always grateful, but never satisfied.

bertlhf

Apr 8, 2015, 1:02:23 PM
to scala...@googlegroups.com
Hi Oliver,

Thanks for the quick answer.

I have no idea how to use a data file to get a (large) Array of type (Long, (String, List[String], Map[String,String])) which I need to work with my GraphX application. Can you give an example?

Or the other possibility you mention: how can I use multiple methods to get one (large) Array of the above type?

As you may have suspected, I’m not really a Scala programmer but just an end user of the Scala REPL interface to GraphX.

— Bert

Oliver Ruebenacker

Apr 8, 2015, 1:48:35 PM
to bertlhf, scala-user

     Hello,

  Unfortunately, I don't have the time to write an example, but I can outline some ideas.

  You could store the data in a file as XML, JSON or YAML, and use something like Jackson to read it. Or you could store the data in MongoDB. Or you could use a relational database like MySQL, and access it either by sending SQL queries over JDBC or through an object-relational mapper like Hibernate or Slick.

  To exploit the graph structure of your data, you might want to look at Neo4j, Titan or BlazeGraph.

  The data can also be expressed in RDF, in which case you can keep it in Turtle files and use Sesame to read such files, as a database in its own right, or to access Neo4j, Titan or BlazeGraph.

  In the unlikely case that the above databases don't support the required throughput even when running on a cluster, you may want to look at high-throughput databases such as Cassandra, HBase, Riak or Voldemort.

     Best, Oliver
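
A minimal sketch of the data-file route, assuming a plain tab-separated text file rather than any of the libraries named above. The file layout, the object name VertexFile and the function names are hypothetical and not from the thread; the separators ";" and "=" were picked only because they do not occur in the sample values shown in the transcript (which do contain ",", ":" and "|"). It reads the file into the same array type that the REPL printed for altVertexArray:

```scala
import scala.io.Source

object VertexFile {
  // Same element type as altVertexArray in the transcript.
  type Vertex = (Long, (String, List[String], Map[String, String]))

  // Hypothetical line layout, one vertex per line, fields separated by tabs:
  //   id <TAB> key <TAB> name;name;... <TAB> route=value;route=value;...
  def parseVertex(line: String): Vertex = {
    val fields = line.split("\t", -1)
    require(fields.length == 4, s"expected 4 tab-separated fields: $line")
    val names = fields(2).split(";").map(_.trim).filter(_.nonEmpty).toList
    val attrs = fields(3).split(";").filter(_.nonEmpty).map { pair =>
      // Split at the first '=' only, so values may contain further text.
      val Array(k, v) = pair.split("=", 2)
      k.trim -> v.trim
    }.toMap
    (fields(0).trim.toLong, (fields(1).trim, names, attrs))
  }

  // Reads every non-blank line of the file and parses it into a vertex.
  def loadVertices(path: String): Array[Vertex] = {
    val source = Source.fromFile(path, "UTF-8")
    try source.getLines().filter(_.trim.nonEmpty).map(parseVertex).toArray
    finally source.close()
  }
}
```

The array returned by VertexFile.loadVertices could then be handed to `sc.parallelize` in the same way as a hand-written one, and the file can grow without touching any method size limit.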
  



Naftoli Gugenheim

Apr 8, 2015, 3:25:45 PM
to bertlhf, scala...@googlegroups.com

You can ++ two arrays together
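
A minimal sketch of the ++ approach, assuming the large literal is split into smaller pieces that are each built in their own method. The names RailChunks, chunk1 and chunk2 are hypothetical, and the sample entries are shortened from the transcript above. Each method is compiled separately, so each one gets its own bytecode size limit:

```scala
object RailChunks {
  type Vertex = (Long, (String, List[String], Map[String, String]))

  // First piece of the vertex data, initialized in its own method.
  def chunk1: Array[Vertex] = Array(
    (1L, ("wien-nordbf", List("Wien Nordbf"), Map("wien-lund-okb29" -> "km 0,0"))),
    (2L, ("floridsdorf", List("Floridsdorf"), Map("wien-lund-okb29" -> "km 5,1")))
  )

  // Second piece, again in a separate method.
  def chunk2: Array[Vertex] = Array(
    (3L, ("leopoldau", List("Leopoldau"), Map("wien-lund-okb29" -> "km 7,0"))),
    (4L, ("sussenbrunn", List("Süßenbrunn"), Map("wien-lund-okb29" -> "km 11,8")))
  )

  // ++ concatenates the pieces into one array of the original type.
  val altVertexArray: Array[Vertex] = chunk1 ++ chunk2
}
```

With many pieces, `Array.concat(chunk1, chunk2, ...)` joins them in a single call instead of a long chain of ++.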


bertlhf

Apr 8, 2015, 4:00:22 PM
to scala...@googlegroups.com

Thanks, Naftoli, for the ++ tip. It works. I got my (by now enlarged) rail network running again.
I think that in the final network configuration (about 8000 vertices and about twice that many edges) I will be obliged to chain a lot of arrays with ++.

-- Bert



Oliver Ruebenacker

Apr 8, 2015, 5:08:31 PM
to bertlhf, scala-user

     Hello,

  Placing such an amount of data as literals into the code is very unusual and has a number of drawbacks.

  Tools used to develop code are not designed to handle such large files and will slow down, or show unexpected limitations or glitches (such as hitting the maximum size of a method, which most developers are probably not aware of).

  I don't know how you are getting the data into the code - if you write it by hand, you will definitely spend more time doing this than it would take to learn to use a database system, and you will make many errors.

  Finally, if you have a large amount of data, most likely you will want to verify, modify and query that data, and use it for different purposes - all things that are easy in a database system and hard if the data is hard-coded.

     Best, Oliver
  


bertlhf

Apr 9, 2015, 9:45:28 AM
to scala...@googlegroups.com, be...@analytag.com


On Wednesday, April 8, 2015 at 11:08:31 PM UTC+2, Oliver Ruebenacker wrote:
> I don't know how you are getting the data into the code

Hi Oliver,

The Scala arrays are the data, they are not (part of) the code. These arrays are fed into GraphX which generates a graph from these arrays. This GraphX graph can in turn be accessed, queried, processed etc. by Scala methods in the REPL. 

For example, I have a small Scala program that prints a railway line (with kilometre indications), or that prints a train route (with arrival/departure time info). Another small program, given a station (vertexId), prints all arriving and departing trains and in- and outgoing railway lines of that station.

At https://spark.apache.org/docs/latest/graphx-programming-guide.html, in the paragraph "Example Property Graph", you find a small graph that, I think, makes the role of the Scala arrays as input for GraphX clear.

I am completely happy editing my data (i.e. the Scala vertex and edge arrays) in my text editor of choice. I only had the problem of the "Method <init>'s code too large!". Since that is solved by chaining smaller arrays with ++ into one big one, I can continue expanding my rail network.

-- Bert

Naftoli Gugenheim

Apr 9, 2015, 12:57:34 PM
to bertlhf, scala...@googlegroups.com

Are you modeling a real railway? From where do you get the information that you put in the code?
If it's just for fun then do whatever you're comfortable with, but ultimately it's well worth it to learn how to use a database.

bertlhf

Apr 9, 2015, 2:10:14 PM
to scala...@googlegroups.com, be...@analytag.com


On Thursday, April 9, 2015 at 6:57:34 PM UTC+2, nafg wrote:

Hi Naftoli,

> Are you modeling a real railway?

I am modelling a historic railway network: the railway network as it was in Europe in 1914, before World War 1. At that time you had on the continent, apart from a much more extensive railway network than today, 3 empires and 3 emperors: the Russian Czar, the German Kaiser and Franz-Joseph, the Austro-Hungarian emperor. By the way, there was one point where the 3 empires touched each other: the famous "Dreikaiserreichecke".

> From where do you get the information that you put in the code?

I have several historic timetables, but the most important (and complete) ones are the Reichskursbuch 1914 and the Österreichisches Kursbuch 1914.

> If it's just for fun then do whatever you're comfortable with,

It is indeed just a hobby, and indeed I'm comfortable editing arrays in plain text files. The type of the vertex and edge array represents the complete logic of my network and it gives me all the structure I need.

> but ultimately it's well worth it to learn how to use a database.

How could I use a database, given that the input GraphX needs to construct a graph is a vertex array and an edge array? By converting the database to an array in some way? And what would it bring me?
I'm not familiar with databases and I'm a little bit reluctant to learn something additional which I don't strictly need for my purpose.

-- Bert

Naftoli Gugenheim

Apr 9, 2015, 2:18:18 PM
to bertlhf, scala...@googlegroups.com


On Thu, Apr 9, 2015, 2:10 PM bertlhf <be...@analytag.com> wrote:

> How could I use a database given that the input GraphX needs to construct a graph is a vertex array and an edge array? Converting the database to an array in some way?

Correct

> And what does it bring me?

See Oliver's last paragraph :)

> I'm not familiar with databases and I'm a little bit reluctant to learn something additional which I don't strictly need for my purpose.

OK, but I predict you will sooner or later :)

Oliver Ruebenacker

Apr 10, 2015, 8:28:52 AM
to bertlhf, scala-user

     Hello,

  Just to clarify, when I say "Scala code", I mean anything that is written in Scala syntax (and therefore is usually handled with a text editor and needs to be compiled and run on a JVM). That includes the arrays you hard-code as Scala arrays which contain your data.

  Scala is designed to express algorithms. It can be used to hard-code large amounts of data, but other solutions are more efficient in handling large amounts of data.

  Apache Spark is designed to perform massively parallel computations. It can be used to perform simple queries, but other solutions are more efficient at that.

  I don't have the time to write examples, but the links I provided lead to lots of examples.

     Best, Oliver

Patrick Roemer

Apr 10, 2015, 9:53:52 AM
to scala...@googlegroups.com
Responding to bertlhf:
> At https://spark.apache.org/docs/latest/graphx-programming-guide.html, in
> the paragraph "Example Property Graph", you find a small graph that, I
> think, makes the role of the Scala arrays as input for GraphX clear.

You can obtain an RDD from other sources: from a text file (local or
HDFS) through SparkContext#textFile(), from JDBC, JSON, etc. via
Spark SQL, and so on. It looks like GraphX provides support for custom
graph file formats as well, through GraphLoader.

Spark is all about "big data". Getting source input from handcrafted
arrays is rather rare in real-world applications. I'd be surprised if
this were different for GraphX.

Best regards,
Patrick
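
A rough sketch of the text-file route mentioned above, assuming it runs in the Spark shell or with Spark and GraphX on the classpath. The object name RailGraphFiles, the function names, the tab-separated file layouts and the simplified String attributes are all hypothetical; only `SparkContext#textFile`, `Edge` and `Graph` are actual Spark/GraphX API. It has not been run against a cluster here:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object RailGraphFiles {
  // Hypothetical vertex line layout: id <TAB> station name
  def parseVertexLine(line: String): (Long, String) = {
    val Array(id, name) = line.split("\t", 2)
    (id.trim.toLong, name.trim)
  }

  // Hypothetical edge line layout: sourceId <TAB> targetId <TAB> label
  def parseEdgeLine(line: String): Edge[String] = {
    val Array(src, dst, label) = line.split("\t", 3)
    Edge(src.trim.toLong, dst.trim.toLong, label.trim)
  }

  // Builds the graph from two text files instead of hand-written arrays.
  def loadGraph(sc: SparkContext, vertexPath: String, edgePath: String): Graph[String, String] = {
    val vertices: RDD[(Long, String)] = sc.textFile(vertexPath).map(parseVertexLine)
    val edges: RDD[Edge[String]] = sc.textFile(edgePath).map(parseEdgeLine)
    Graph(vertices, edges)
  }
}
```

In the shell this would be called as RailGraphFiles.loadGraph(sc, ...) with the paths of the two files. For edges that carry no attributes, `GraphLoader.edgeListFile(sc, path)` reads a file of "sourceId targetId" pairs directly.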

