java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: requirement failed: inv

alepb...@gmail.com

Jan 24, 2017, 6:31:48 PM
to BigDL User Group
Hi everyone, 

I'm struggling with a simple app; hopefully someone can help me here.

What I'm trying to do is vectorize some short sentences (3-4 words) through a word2vec model and feed the resulting vectors into a NN to perform a classification task.

My input data consists of an array of doubles of size 100 representing the w2v vectorization of the sentence, plus the label for the categorization: (Array[Double], Double).

Ideally I'd like to use those 100-dimensional vectors to perform the classification through softmax().

The error, as stated in the title, is the following (full stack trace):

17/01/24 17:09:47 ERROR Executor: Exception in task 0.0 in stage 22.0 (TID 17)
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: requirement failed: invalid size
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$7.apply(DistriOptimizer.scala:176)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$7.apply(DistriOptimizer.scala:176)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:176)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:125)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)



This is my code so far:

import com.intel.analytics.bigdl.utils.{Engine, T}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.log4j.PropertyConfigurator
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.StringIndexer
import com.intel.analytics.bigdl.nn.{ClassNLLCriterion, _}
import com.intel.analytics.bigdl.example.textclassification._
import com.intel.analytics.bigdl.optim._
import org.apache.spark.rdd._
import com.intel.analytics.bigdl.dataset._
import com.intel.analytics.bigdl.dataset.{MiniBatch, Transformer}
import org.apache.spark.ml.linalg.DenseVector
import com.intel.analytics.bigdl.numeric.NumericDouble

PropertyConfigurator.configure("/PATH/log4j.properties")

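// Converts (features, label) tuples into BigDL Sample objects of shape (nRows, nCols).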
object ToSample {
  def apply(nRows: Int, nCols: Int)
  : ToSample =
    new ToSample(nRows, nCols)
}

class ToSample(nRows: Int, nCols: Int)
  extends Transformer[ (Array[Double], Double) , Sample[Double]] {

  private val buffer = new Sample[Double]()
  private var featureBuffer: Array[Double] = null
  private var labelBuffer: Array[Double] = null

  override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[Sample[Double]] = {

    prev.map(x => {

      if (featureBuffer == null || featureBuffer.length < nRows * nCols) {
        featureBuffer = new Array[Double](nRows * nCols)
      }
      if (labelBuffer == null || labelBuffer.length < nRows) {
        labelBuffer = new Array[Double](nRows)
      }

      var i = 0
      while (i < nRows) {
        Array.copy(x._1, 0, featureBuffer, i * nCols, nCols)
        labelBuffer(i) = x._2
        i += 1
      }

      buffer.copy(featureBuffer, labelBuffer,
        Array(nRows, nCols), Array(nRows))
    })
  }
}

def wrapper(s: String): Seq[String] = s.split(" ").toSeq

def extractor(s: String): String = s.split("~")(1)

val coder: (String => String) = (arg: String) => {extractor(arg)}
val coder2: (String => Seq[String]) = (arg: String) => {wrapper(arg)}
val sqlfunc  = udf(coder)
val sqlfunc2 = udf(coder2)

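// Initialize the BigDL engine and a local Spark context (1 node, 1 core).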
val nodeNum = 1
val coreNum = 1
val sc = new SparkContext(Engine.init(nodeNum, coreNum, true).get.setMaster("local[*]").setAppName("W2V").set("spark.task.maxFailures", "1"))
val sqlContext = new SQLContext(sc)

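// Load the tab-separated data, tokenize each phrase, extract the top-level category, and index the labels.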
val path = "PATH_TSV.tsv"
val data = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "\t").load(path)

val data_vectorized = data.withColumn("phraseArray", sqlfunc2(col("phrase"))).withColumn("level1", sqlfunc(col("cat")))

val indexer = new StringIndexer().setInputCol("level1").setOutputCol("level1_labels")
val labeled = indexer.fit(data_vectorized).transform(data_vectorized)

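// Fit a word2vec model on the tokenized phrases and vectorize each sentence.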
val embeddingDim = 300
val word2Vec = new Word2Vec().setSeed(32).setInputCol("phraseArray").setOutputCol("vectors").setVectorSize(embeddingDim).setMinCount(0).setWindowSize(5)
val model_w2v = word2Vec.fit(labeled)
val result = model_w2v.transform(labeled)
val trainingSplit: Double = 0.8
val numClass = 34


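// Pair each sentence vector with its numeric label and split into training/validation sets.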
val vectorizedRdd: RDD[(Array[Double], Double)] = result.select("level1_labels", "vectors").rdd.map(r => (r(1).asInstanceOf[DenseVector].toArray, r(0).asInstanceOf[Double]))
val Array(trainingRDD, valRDD) = vectorizedRdd.randomSplit(Array(trainingSplit, 1 - trainingSplit))

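// Convert the (features, label) tuples into Samples and group them into mini-batches.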
val batchSize = 10
val trainSet = DataSet.rdd(trainingRDD) -> ToSample(1, 2) -> SampleToBatch(batchSize)
val valSet = DataSet.rdd(valRDD) -> ToSample(1, 2)  -> SampleToBatch(batchSize)

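// A simple linear classifier: reshape -> linear -> log-softmax.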
val classNum = 32
val model_N = Sequential()
  .add(Reshape(Array(2)))
  .add(Linear(2, classNum))
  .add(LogSoftMax())

val state = T("learningRate" -> 0.01, "learningRateDecay" -> 0.0002)

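// Optimize with Adagrad and ClassNLLCriterion, computing validation loss every epoch.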
val optimizer = Optimizer(
  model = model_N,
  dataset = trainSet,
  criterion = new ClassNLLCriterion[Double]()
)

optimizer
  .setState(state)
  .setValidation(Trigger.everyEpoch, valSet, Array(new Loss[Double]))
  .setOptimMethod(new Adagrad[Double]())
  .optimize()



Jason Dai

Jan 24, 2017, 9:23:31 PM
to alepb...@gmail.com, BigDL User Group
We'll look into this.

Thanks,
-Jason


Zhang, Cherry

Jan 24, 2017, 10:22:20 PM
to alepb...@gmail.com, BigDL User Group

Hi, I'm sorry, but could you give me your input data file? I can't reproduce the problem.

Thanks,

Cherry


alepb...@gmail.com

Jan 24, 2017, 11:21:07 PM
to BigDL User Group, alepb...@gmail.com
A sample of the data would be:

String = "the lazy fox jumped over the brown dog" Label = "Comedy"

which vectorized would be a 300-dimensional vector (generated from word2vec) plus a Double representing its label:

[0.12,0.45, ... ], 2.0

Does this make sense?


Jason Dai

Jan 25, 2017, 9:30:02 AM
to Alessandro Panebianco, BigDL User Group

Hi,

Two changes are needed to fix the issue:

  1. The target (label) used in ClassNLLCriterion is 1-based, while the StringIndexer in Spark is 0-based. You may need to change your labels as follows:

    val vectorizedRdd :RDD[(Array[Double], Double)] = result.select("level1_labels","vectors").rdd.map(r => (r(1).asInstanceOf[DenseVector].toArray,r(0).asInstanceOf[Double] + 1.0))


  2. The output of SampleToBatch was not compatible with ClassNLLCriterion (previously it was only used with CrossEntropyCriterion). We just fixed that in https://github.com/intel-analytics/BigDL/pull/411; you may need to pull the latest BigDL jar or code for the fix. 

Thanks,

-Jason



alepb...@gmail.com

Jan 25, 2017, 6:32:52 PM
to BigDL User Group, alepb...@gmail.com
Thanks Jason, your fixes solved my problem.

What I'm trying to do now is slightly modify my input data.

So far I have used word2vec to generate ONE single fixed-length vector for a sequence of words. To go back to my previous example:

if I have this input:
String = "the lazy fox jumped over the brown dog" Label = "Comedy" 
My output would be of the format (Array[Double], Double):
[0.12,0.45, ... ], 2.0

Now I'd like to produce a vector for each word of my sentence. Following the above example I'd have:

the  = [0.12,0.45, ... ]
lazy = [0.42,0.35, ... ]
...
dog  = [0.62,0.75, ... ]

and still 2.0 representing the label.

So from (Array[Double], Double) I'd turn it into an (Array[Array[Double]], Double).
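
Roughly, I picture building it like this (just a sketch of my intent; the lookup table comes from the fitted Spark model's getVectors, and zeroVec / perWordRdd are names I made up):

// Build a word -> vector table from the fitted word2vec model
val lookup: Map[String, Array[Double]] = model_w2v.getVectors.rdd
  .map(r => (r.getString(0), r.getAs[org.apache.spark.ml.linalg.Vector](1).toArray))
  .collect().toMap

val zeroVec = Array.fill(embeddingDim)(0.0) // fallback for out-of-vocabulary words

// One vector per word, plus the (1-based) label, as per Jason's earlier fix
val perWordRdd: RDD[(Array[Array[Double]], Double)] =
  result.select("level1_labels", "phraseArray").rdd.map { r =>
    val words = r.getSeq[String](1)
    (words.map(w => lookup.getOrElse(w, zeroVec)).toArray, r.getDouble(0) + 1.0)
  }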

What is not clear to me is how I would process the iterators to batch my input. Could you please help me with this?

I'm not sure I can use the one given in the TextClassifier example:
val batching = Batching(param.batchSize, Array(param.maxSequenceLength, param.embeddingDim))
val trainingDataSet = DataSet.rdd(trainingRDD) -> batching

because maxSequenceLength in my case would be the maximum length of my sentences (which is going to be fixed anyway; let's say 5, as my inputs are short).

Any input would be highly valuable. Thanks,

Alessandro

Li, Zhichao

Jan 25, 2017, 8:35:39 PM
to alepb...@gmail.com, BigDL User Group

I think you can try to use the Batching class for now. The second parameter is the shape of your input, i.e. Array(5, vector_dim).
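
For instance (an untested sketch reusing your batchSize and embeddingDim, where perWordRdd stands for the RDD[(Array[Array[Double]], Double)] you described and Batching is the class from the textclassification example):

// Batch the per-word vectors into tensors of shape (batchSize, 5, embeddingDim)
val batching = Batching(batchSize, Array(5, embeddingDim))
val trainingDataSet = DataSet.rdd(perWordRdd) -> batching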

And of course, converting the input to Sample (take a look at Sample.copy(...)) and then applying SampleToBatch is also a valid solution; something like the sketch below.
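
A rough, untested sketch along the lines of your ToSample class (the SeqToSample name, the zero-padding for short sentences, and the perWordRdd input are assumptions for illustration):

// Flatten the seqLen x dim word vectors into one feature buffer and
// copy it into a Sample of shape (seqLen, dim); shorter sentences are zero-padded.
class SeqToSample(seqLen: Int, dim: Int)
  extends Transformer[(Array[Array[Double]], Double), Sample[Double]] {

  private val buffer = new Sample[Double]()
  private val featureBuffer = new Array[Double](seqLen * dim)
  private val labelBuffer = new Array[Double](1)

  override def apply(prev: Iterator[(Array[Array[Double]], Double)])
  : Iterator[Sample[Double]] = prev.map { case (words, label) =>
    java.util.Arrays.fill(featureBuffer, 0.0) // clear leftovers from the previous sample
    var i = 0
    while (i < math.min(words.length, seqLen)) {
      Array.copy(words(i), 0, featureBuffer, i * dim, dim)
      i += 1
    }
    labelBuffer(0) = label
    buffer.copy(featureBuffer, labelBuffer, Array(seqLen, dim), Array(1))
  }
}

val trainSet = DataSet.rdd(perWordRdd) -> new SeqToSample(5, embeddingDim) -> SampleToBatch(batchSize)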

We are in the process of simplifying the input API; the batching and sampling logic will be transparent to the user shortly.