I have the following simple NN training code (just getting my feet wet), modeled after the example / tutorials here https://github.com/intel-analytics/BigDL/wiki/Getting-Started.
All is well until I hit the actual optimize() call – see below – when this error occurs:
17/01/13 17:56:15 ERROR ThreadPool$: Error: java.lang.IllegalArgumentException: requirement failed: input must be vector or matrix
at scala.Predef$.require(Predef.scala:233)
at com.intel.analytics.bigdl.nn.Linear.updateOutput(Linear.scala:66)
at com.intel.analytics.bigdl.nn.Linear.updateOutput(Linear.scala:29)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:33)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(LocalOptimizer.scala:116)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$invokeAndWait$1$$anonfun$apply$2.apply(Engine.scala:103)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Here is my code:
*************************
val sc = new SparkContext(
  Engine.init(1, 1, true).get
    .setAppName("Sample_NN")
    .set("spark.akka.frameSize", 64.toString)
    .set("spark.task.maxFailures", "1")
)

// make up some data
val data = (0 to 100).collect {
  case i if i > 75 || i < 25 =>
    (0 to 100).collect {
      case j if j > 75 || j < 25 =>
        val res =
          if (i > 75 && j < 25) 23.0
          else if (i < 25 && j > 75) -45
          else 0
        (Array(i / 100.0 + 1, j / 100.0 + 2), res)
    }
}.flatMap(x => x)
val batchSize = 4
val trainSet = DataSet.array(data.toArray).transform(ToSample(1,2)).transform(SampleToBatch(batchSize))
val validationSet = trainSet
val layer1 = Linear[Double](2,4)
val layer2 = ReLU[Double]()
val output = Sum[Double]()
val model = Sequential[Double]().
add(layer1).
add(layer2).
add(output)
val state = T(
  "learningRate" -> 0.01,
  "weightDecay" -> 0.0005,
  "momentum" -> 0.9,
  "dampening" -> 0.0
)
val optimizer = Optimizer(
model = model,
dataset = trainSet,
criterion = new MSECriterion[Double]()
)
optimizer.
  setState(state).
  // setValidation(Trigger.everyEpoch, validationSet, Array(new Loss[Double])).
  setOptimMethod(new Adagrad[Double]()).
  optimize()
**************************
SampleToBatch is the one in BigDL here: https://github.com/intel-analytics/BigDL/blob/master/dl/src/main/scala/com/intel/analytics/bigdl/dataset/Transformer.scala
as is Sample, here: https://github.com/intel-analytics/BigDL/blob/master/dl/src/main/scala/com/intel/analytics/bigdl/dataset/Types.scala
ToSample is my own, defined as follows:
******************
object ToSample {
  def apply(nRows: Int, nCols: Int): ToSample =
    new ToSample(nRows, nCols)
}

class ToSample(nRows: Int, nCols: Int)
  extends Transformer[(Array[Double], Double), Sample[Double]] {

  private val buffer = new Sample[Double]()
  private var featureBuffer: Array[Double] = null
  private var labelBuffer: Array[Double] = null

  override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[Sample[Double]] = {
    prev.map(x => {
      if (featureBuffer == null || featureBuffer.length < nRows * nCols) {
        featureBuffer = new Array[Double](nRows * nCols)
      }
      if (labelBuffer == null || labelBuffer.length < nRows) {
        labelBuffer = new Array[Double](nRows)
      }
      var i = 0
      while (i < nRows) {
        Array.copy(x._1, 0, featureBuffer, i * nCols, nCols)
        labelBuffer(i) = x._2
        i += 1
      }
      buffer.copy(featureBuffer, labelBuffer, Array(nRows, nCols), Array(nRows))
    })
  }
}
********************
Inspecting the MiniBatch, all seems fine:
scala> val q = trainSet.toLocal.data(false)
q: Iterator[com.intel.analytics.bigdl.dataset.MiniBatch[Double]] = non-empty iterator
scala> val z = q.next()
z: com.intel.analytics.bigdl.dataset.MiniBatch[Double] =
MiniBatch((1,.,.) =
1.0 2.0
(2,.,.) =
1.0 2.01
(3,.,.) =
1.0 2.02
(4,.,.) =
1.0 2.03
[com.intel.analytics.bigdl.tensor.DenseTensor of size 4x1x2],0.0
0.0
0.0
0.0
[com.intel.analytics.bigdl.tensor.DenseTensor of size 4x1])
so I can’t understand why I am getting the above error.
Any help would be appreciated.
Thanks.
Vadim.
Thanks Yiheng! That worked.
Vadim.
Spoke too soon :(
It seems the solution works, but only if batchSize equals the number of outputs in the Linear layer (??), i.e.:
Works OK:
val batchSize = 6
val sampleShape = Array(1,2)
val trainSet = DataSet.array(data.toArray).transform(ToSample(1,2)).transform(SampleToBatch(batchSize))
val validationSet = trainSet
val layer1 = Linear[Double](2,6)
val layer2 = ReLU[Double]()
val output = Sum[Double]()
val model = Sequential[Double]().
add(Reshape(Array(2))).
…etc.
Fails – see exception below:
val batchSize = 10
val sampleShape = Array(1,2)
val trainSet = DataSet.array(data.toArray).transform(ToSample(1,2)).transform(SampleToBatch(batchSize))
val validationSet = trainSet
val layer1 = Linear[Double](2,6)
val layer2 = ReLU[Double]()
val output = Sum[Double]()
val model = Sequential[Double]().
add(Reshape(Array(2))).
…etc.
17/01/16 19:51:22 ERROR ThreadPool$: Error: java.lang.IllegalArgumentException: requirement failed: inconsistent tensor size
at scala.Predef$.require(Predef.scala:233)
at com.intel.analytics.bigdl.tensor.DenseTensorApply$.apply2(DenseTensorApply.scala:63)
at com.intel.analytics.bigdl.tensor.DenseTensor.map(DenseTensor.scala:396)
at com.intel.analytics.bigdl.nn.MSECriterion.updateOutput(MSECriterion.scala:33)
at com.intel.analytics.bigdl.nn.MSECriterion.updateOutput(MSECriterion.scala:27)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractCriterion.forward(AbstractCriterion.scala:43)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(LocalOptimizer.scala:117)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$invokeAndWait$1$$anonfun$apply$2.apply(Engine.scala:103)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Date: Friday, January 13, 2017 at 9:59 PM
To: Yan Wan <yan...@intel.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>, "Brzeski, Vadim" <vbrz...@ebay.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
.add(Reshape(Array(batchSize*1, 2))).
Hi,
Modules that support batching and can take a multi-dimensional input tensor (with no restriction on the number of input dimensions) usually have an argument named `nInputDims`.
This argument is "the number of dimensions of a single input". Because the input tensor can have any number of dimensions, the module cannot automatically infer whether the user is passing a batch, or what the actual input dimensions are once the batch dimension is excluded. Hence `nInputDims` gives the number of dimensions of one input, excluding the batch dimension. In your case, the `nInputDims` of the Sum layer should be 1, because the input size is batchSize * 6 (one dimension excluding the batch dimension).
The sum layer should be `val output = Sum(nInputDims = 1)`.
So your code should be something like:
“””
import com.intel.analytics.bigdl.numeric.NumericDouble
val batchSize = 10
val trainSet = DataSet.array(data.toArray) -> ToSample(1, 2) -> SampleToBatch(batchSize)
val layer1 = Linear(2, 6)
val layer2 = ReLU()
val output = Sum(nInputDims = 1)
val model = Sequential()
.add(Reshape(Array(2)))
.add(layer1)
.add(layer2)
.add(output)
“””
As for why `batchSize = 6` runs without error: the Sum layer receives a 6 x 6 tensor and sums along the first dimension, so the criterion gets a 1-dimensional tensor of size 6. Because that size happens to equal the number of labels (= batchSize), no error is raised, but the logic is still not correct. With `batchSize = 10`, the criterion gets a 1-dimensional tensor of size 6 (the number of Linear outputs), which cannot match the 10 labels, so the run fails. (See the shape check after the full code below.)
Hope that helps.
Best regards
Yao
Full worked code attached:
“””
object ToSample {
def apply(nRows: Int, nCols: Int)
: ToSample =
new ToSample(nRows, nCols)
}
class ToSample(nRows: Int, nCols: Int)
extends Transformer[(Array[Double], Double), Sample[Double]] {
private val buffer = new Sample[Double]()
private var featureBuffer: Array[Double] = null
private var labelBuffer: Array[Double] = null
override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[Sample[Double]] = {
prev.map(x => {
if (featureBuffer == null || featureBuffer.length < nRows * nCols) {
featureBuffer = new Array[Double](nRows * nCols)
}
if (labelBuffer == null || labelBuffer.length < nRows) {
labelBuffer = new Array[Double](nRows)
}
var i = 0
while (i < nRows) {
Array.copy(x._1, 0, featureBuffer, i * nCols, nCols)
labelBuffer(i) = x._2
i += 1
}
buffer.copy(featureBuffer, labelBuffer,
Array(nRows, nCols), Array(nRows))
})
}
}
// make up some data
val data = (0 to 100).collect {
case i if i > 75 || i < 25 ⇒
(0 to 100).collect {
case j if j > 75 || j < 25 ⇒
val res =
if (i > 75 && j < 25) 23.0
else if (i < 25 && j > 75) -45
else 0
(Array(i / 100.0 + 1, j / 100.0 + 2), res)
}
}.flatten
val sc = new SparkContext(
Engine.init(1, 1, true).get
.setAppName("Sample_NN")
.set("spark.akka.frameSize", 64.toString)
.set("spark.task.maxFailures", "1")
.setMaster("local[4]")
)
import com.intel.analytics.bigdl.numeric.NumericDouble
val batchSize = 10
val trainSet = DataSet.array(data.toArray) -> ToSample(1, 2) -> SampleToBatch(batchSize)
val layer1 = Linear(2, 6)
val layer2 = ReLU()
val output = Sum(nInputDims = 1)
val model = Sequential()
.add(Reshape(Array(2)))
.add(layer1)
.add(layer2)
.add(output)
val state =
T(
"learningRate" -> 0.01,
"weightDecay" -> 0.0005,
"momentum" -> 0.9,
"dampening" -> 0.0
)
val optimizer = Optimizer(
model = model,
dataset = trainSet,
criterion = new MSECriterion[Double]()
)
optimizer.
setState(state).
// setValidation(Trigger.everyEpoch, validationSet, Array(new Loss[Double])).
setOptimMethod(new Adagrad[Double]()).
optimize()
“””
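To make the shape mismatch described above concrete, here is a small check of my own (a sketch, not from the thread; it assumes the BigDL Tensor/Sum API used elsewhere in this message):

import com.intel.analytics.bigdl.nn.Sum
import com.intel.analytics.bigdl.tensor.Tensor

val input = Tensor[Double](10, 6).rand()  // batchSize = 10 rows, 6 Linear outputs per row

// Default Sum collapses dimension 1, which here is the batch dimension, so the
// criterion would end up comparing 6 summed values against 10 labels.
println(Sum[Double]().forward(input).size().mkString("x"))

// With nInputDims = 1, dimension 1 is treated as the batch and the feature
// dimension is summed instead, giving one value per sample (10 of them).
println(Sum[Double](nInputDims = 1).forward(input).size().mkString("x"))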
Hi –
The above suggestions worked (thanks Yao), but now I have a different problem.
When I train locally using the code (the one in the above thread and the working example you provided), all is fine. I get reasonable model weights (model.getParameters returns non-zero values for weights and biases), and a good fit in training.
But when I try the same on RDD, I get all biases = 0 and a bad fit.
Again, the network:
val layer1 = Linear[Double](dimInput,nHidden)
val layer2 = ReLU[Double]()
val layer3 = Linear[Double](nHidden,nHidden)
val layer4 = ReLU[Double]()
val output = Linear[Double](nHidden,1) //Sum[Double](nInputDims = 1)
val model = Sequential[Double]().
add(Reshape(Array(dimInput))).
add(layer1).
add(layer2).
add(layer3).
add(layer4).
add(output)
Local:
val trainSet = DataSet.array(data.toArray).transform(ToSample(1,dimInput)).transform(SampleToBatch(batchSize))
leads to this:
scala> println(model.getParameters())
(0.6438069051434786
-0.5983624270225454
-1.222042899057738
1.2080124683211286
0.09588958746823727
…..
….
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67],-0.029515928487775192
-0.08776935671924736
0.05927649370933306
0.18495866162243968
-0.003225973822120612
-0.008541260257896508
….
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67])
and a good fit.
But with RDD:
val sampleShape = Array(1,dimInput)
val batching = OOMBatching(batchSize, sampleShape)
val trainSetRDD = sc.makeRDD(data).coalesce(numExecutors*numCores, true).coalesce(numExecutors)
val trainSet = DataSet.rdd(trainSetRDD) -> batching
where OOMBatching:
object OOMBatching {
  def apply(batchSize: Int, sampleShape: Array[Int]): OOMBatching =
    new OOMBatching(batchSize, sampleShape)
}

/**
 * Batching samples into mini-batch
 * @param batchSize The desired mini-batch size.
 * @param sampleShape Shape of the training sample
 */
class OOMBatching(batchSize: Int, sampleShape: Array[Int]) extends
  Transformer[(Array[Double], Double), MiniBatch[Double]] {

  override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[MiniBatch[Double]] = {
    new Iterator[MiniBatch[Double]] {
      private val featureTensor: Tensor[Double] = Tensor[Double]()
      private val labelTensor: Tensor[Double] = Tensor[Double]()
      private var featureData: Array[Double] = null
      private var labelData: Array[Double] = null
      private val featureLength = sampleShape.product
      private val labelLength = 1

      override def hasNext: Boolean = prev.hasNext

      override def next(): MiniBatch[Double] = {
        if (prev.hasNext) {
          var i = 0
          while (i < batchSize && prev.hasNext) {
            val sample = prev.next()
            if (featureData == null || featureData.length < batchSize * featureLength) {
              featureData = new Array[Double](batchSize * featureLength)
            }
            if (labelData == null || labelData.length < batchSize * labelLength) {
              labelData = new Array[Double](batchSize * labelLength)
            }
            Array.copy(sample._1, 0, featureData, i * featureLength, featureLength)
            labelData(i) = sample._2
            i += 1
          }
          featureTensor.set(Storage[Double](featureData), storageOffset = 1,
            sizes = Array(i) ++ sampleShape)
          labelTensor.set(Storage[Double](labelData), storageOffset = 1,
            sizes = Array(i, 1))
          MiniBatch(featureTensor, labelTensor)
        } else {
          null
        }
      }
    }
  }
}
leads to this: all bias parameters exactly 0:
scala> print(model.getParameters)
(0.5835011785523165
-0.29177034775166744
-0.19938245877956728
0.17784682057321344
0.5514315980564399
….
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67],0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
….
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67])
Thanks again!
V.
I guess this is due to an inconsistent batch size.
The "batchSize" within OOMBatching is the batch per node, i.e. if your cluster size is 4, then the total batch size is batchSize * 4. How about specifying the batchSize within OOMBatching as "batchSize used in local" / cluster_size?
Thanks,
Zhichao
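To spell out that arithmetic, a back-of-the-envelope sketch (the variable names are mine, purely illustrative):

// Each executor builds its own MiniBatch, so the batchSize passed to
// OOMBatching is a per-node value, not the global one.
val clusterSize = 4          // number of executors (illustrative)
val localBatchSize = 100     // the batch size that worked in local mode
val perNodeBatchSize = localBatchSize / clusterSize   // what OOMBatching should use
val effectiveGlobalBatch = perNodeBatchSize * clusterSize
// effectiveGlobalBatch == 100, i.e. the same global batch as the local run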
Hi Jason, Zhichao –
Thanks, but I don’t think that’s it.
1) When I tried it, I had numExecutors = numCores = 1, and batchSize = 100.
2) Then I tried numExecutors = 4, numCores = 1, and with Utils.getBatchSize(totalBatch), this time with batchSize = 400 → same result.
Here's my Engine conf and data generation mechanism below, in case you want to try it. Like I said, it works great in local mode.
(BTW: where I'm headed with this: my real dataset (for a regression problem) has 1B+ rows and 70 columns. Am I delusional trying something like this with Big-DL?)
V.
val sc = new SparkContext(
Engine.init(numExecutors, numCores, true).get
.setAppName("Sample_NN")
.set("spark.akka.frameSize", 64.toString)
.set("spark.task.maxFailures", "1")
.set("spark.scheduler.minRegisteredResourcesRatio", "1")
)
sc.setLogLevel("ERROR")
val dimInput = 2
val data = (0 to 100).collect {
case i if i > 75 || i < 25 =>
(0 to 100).collect {
case j if j > 75 || j < 25 =>
val res =
if (i > 75 && j < 25) 2.0
else if (i < 25 && j > 75) -4.0
else 0
(Array(i / 100.0, j / 100.0), res)
}
}.flatten
val sampleShape = Array(1,dimInput)
From: Jason Dai <jaso...@gmail.com>
Date: Tuesday, January 17, 2017 at 9:14 PM
To: "Li, Zhichao" <zhich...@intel.com>
Cc: "Brzeski, Vadim" <vbrz...@ebay.com>, "Zhang, Yao" <yao....@intel.com>, Yiheng Wang <yih...@gmail.com>, "Wan, Yan" <yan...@intel.com>, BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim,
Hi guys –
Thanks for your help so far and your offer to help in future.
Yiheng (and all) – here is what I found:
First, I had to do “Engine.init(1, 1, true).get” instead of “Engine.init(1, 1, true).get” because sometimes I get this error: requirement failed: Detect multi-task run on one Executor/Container. Currently not support this.
I also set nHidden = 6, num iterations = 2000, learning rate = 0.005
Anyway, when I do this:
val trainSetRDD = sc.makeRDD(data)
I get this fit:
17/01/18 15:30:15 INFO DistriOptimizer$: [Epoch 80 2400/2500][Iteration 2000][Wall Clock 96.279817182s] Train 100 in 0.032721897seconds. Throughput is 3056.057538473396 records/second. Loss is 0.4063624393630257.
But when I do this:
val numExecutors = 1 // args(3).toInt
val numCores = 1 // args(4).toInt
val trainSetRDD = sc.makeRDD(data).coalesce(numExecutors*numCores, true).coalesce(numExecutors)
I get a different, worse fit:
17/01/18 15:26:54 INFO DistriOptimizer$: [Epoch 80 2400/2500][Iteration 2000][Wall Clock 102.645296308s] Train 100 in 0.03926751seconds. Throughput is 2546.634609630201 records/second. Loss is 0.7909712745636441.
I do the coalesce steps because of what I read here: https://github.com/intel-analytics/BigDL/pull/353 regarding the error above.
Is the coalesce advised? And any solution for the above error ?
Thanks
V.
From: Jason Dai <jaso...@gmail.com>
Date: Wednesday, January 18, 2017 at 4:04 AM
To: Yiheng Wang <yih...@gmail.com>
Cc: "Brzeski, Vadim" <vbrz...@ebay.com>, "Li, Zhichao" <zhich...@intel.com>, "Zhang, Yao" <yao....@intel.com>, "Wan, Yan" <yan...@intel.com>, BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
BTW, Optimizer.optimize() will return a trained model after it's done, while the original model passed to Optimizer is not updated when running on Spark.
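A minimal sketch of what that means in practice (variable names reused from the code above; the key point is to read parameters from the returned module):

// On Spark (DistriOptimizer), the trained weights live in the copy that
// optimize() returns, not in the original `model` object.
val trainedModel = optimizer.
  setState(state).
  setOptimMethod(new Adagrad[Double]()).
  optimize()

println(trainedModel.getParameters())  // trained weights and biases
println(model.getParameters())         // may still show the untrained values on Spark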
Set spark.shuffle.reduceLocality.enabled to true.
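Presumably that flag goes on the SparkConf built from Engine.init (the same pattern used earlier in this thread), or equivalently as a --conf on spark-submit; a sketch, reusing the numExecutors/numCores values from the code below:

val conf = Engine.init(numExecutors, numCores, true).get
  .setAppName("Sample_NN")
  .set("spark.shuffle.reduceLocality.enabled", "true")
val sc = new SparkContext(conf)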
Hi Yiheng –
I did as you suggested, and things seem OK on my small sample dataset. I am now running into the “Detect multi-task run on one Executor/Container. Currently not support this” issue on my real large dataset.
Right before calling optimize(), I do this re-partition as you suggested:
val trainSetRDD =
  if (doRepartition) {
    xFit.repartition(numExecutors * numCores)
  } else {
    xFit
  }
log.info("trainSetRDD num partitions = "+trainSetRDD.getNumPartitions)
log.info("trainSetRDD count = "+trainSetRDD.count)
val validationSetRDD =
if (doRepartition) {
xVal.repartition(numExecutors * numCores)
} else {
xVal
}
log.info("validationSetRDD num partitions = "+validationSetRDD.getNumPartitions)
log.info("validationSetRDD count = "+validationSetRDD.count)
val batchingT = com.ebay.mktgscience.oom.bigDL.OOMBatching(batchSize, sampleShape)
val trainSet = DataSet.rdd(trainSetRDD) -> batchingT
log.info("trainSet.size = "+trainSet.size())
val batchingV = com.ebay.mktgscience.oom.bigDL.OOMBatching(batchSize, sampleShape)
val validationSet = DataSet.rdd(validationSetRDD) -> batchingV
log.info("validationSet.size = "+validationSet.size())
And I get this result:
17/01/19 10:57:06 INFO OOM_Train_NN$: trainSetRDD num partitions = 500
17/01/19 10:59:16 INFO OOM_Train_NN$: trainSetRDD count = 1085223099
17/01/19 10:59:16 INFO OOM_Train_NN$: validationSetRDD num partitions = 500
17/01/19 10:59:39 INFO OOM_Train_NN$: validationSetRDD count = 120595645
17/01/19 11:00:51 INFO OOM_Train_NN$: trainSet.size = 1085223099
17/01/19 11:01:48 INFO OOM_Train_NN$: validationSet.size = 120595645
17/01/19 11:01:48 INFO DistriOptimizer$: Cache thread models...
17/01/19 11:01:49 ERROR TaskSetManager: Task 469 in stage 91.0 failed 1 times; aborting job
17/01/19 11:01:49 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 469 in stage 91.0 failed 1 times, most recent failure: Lost task 469.0 in stage 91.0 (TID 11391, hdc9-phx04-0160-0117-032.stratus.phx.ebay.com): java.lang.IllegalArgumentException: requirement failed: Detect multi-task run on one Executor/Container. Currently not support this
at scala.Predef$.require(Predef.scala:233)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$9.apply(DistriOptimizer.scala:366)
Here is my entire spark-submit, spark version 1.6.1:
***********
export OMP_NUM_THREADS=1
export KMP_BLOCKTIME=0
export OMP_WAIT_POLICY=passive
export DL_ENGINE_TYPE=mklblas
HIVEVER=1.2.1000.2.4.2.0-258
HIVEDIR=hive-$HIVEVER
APACHEVER=2.7.1.2.4.2.0-258
MODELNAME=nn_bigDL
N_ITER=100
N_BATCH=500000 # inside OOMBatching, I do: private val batchSize = Utils.getBatchSize(totalBatch)
LEARNING_RATE=0.01
N_HIDDEN=100
N_EXECUTORS=500
N_CORES=1
DO_REPARTITION=true
PRIORMODEL=lm_3
/apache/spark/bin/spark-submit \
--name "OOM_Model_NN2" \
--master "yarn" \
--num-executors $N_EXECUTORS \
--executor-cores $N_CORES \
--deploy-mode "cluster" \
--driver-memory 12G \
--executor-memory 32G \
--conf "spark.rpc.askTimeout=240s" \
--conf "spark.yarn.executor.memoryOverhead=6000" \
--conf "spark.network.timeout=2000" \
--conf "spark.executor.heartbeatInterval=60s" \
--conf "spark.sql.shuffle.partitions=500" \
--conf "spark.executor.extraJavaOptions=-server -XX:MaxPermSize=1024m -XX:+UseG1GC" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.kryoserializer.buffer=256m" \
--conf "spark.kryoserializer.buffer.max=1024m" \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.yarn.maxAppAttempts=1" \
--conf "spark.shuffle.reduceLocality.enabled=true" \
--driver-java-options "-XX:MaxPermSize=2G -XX:+UseG1GC"\
--driver-library-path "/apache/hadoop/lib/native:/apache/hadoop/lib/native/Linux-amd64-64" \
--driver-class-path "/apache/hadoop/share/hadoop/common/hadoop-common-$APACHEVER.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.2.4.2.0-258.jar:/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-$APACHEVER.jar" \
--jars "/apache/hadoop/share/hadoop/common/hadoop-common-$APACHEVER.jar,/apache/hadoop/lib/hadoop-lzo-0.6.0.2.4.2.0-258.jar,/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-$APACHEVER.jar,/apache/hive/lib/hive-metastore-$HIVEVER.jar,/apache/hive/lib/hive-common-$HIVEVER.jar,/apache/spark/lib/datanucleus-api-jdo-3.2.6.jar,/apache/spark/lib/datanucleus-core-3.2.10.jar,/apache/spark/lib/datanucleus-rdbms-3.2.9.jar" \
--verbose \
--queue "hddq-exprce-orgmrktg" \
--files /apache/hive/conf/hive-site.xml \
--class "com.ebay.mktgscience.oom.bigDL.OOM_Train_NN" \
./jars/oom-model-1.0-SNAPSHOT.jar $MODELNAME $N_ITER $N_BATCH $LEARNING_RATE $N_HIDDEN $N_EXECUTORS $N_CORES $DO_REPARTITION $PRIORMODEL
************
Thanks.
V.
From: <bigdl-us...@googlegroups.com> on behalf of Yiheng Wang <yih...@gmail.com>
Date: Wednesday, January 18, 2017 at 9:13 PM
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: Jason Dai <jaso...@gmail.com>, "Li, Zhichao" <zhich...@intel.com>, "Zhang, Yao" <yao....@intel.com>, "Wan, Yan" <yan...@intel.com>, BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim
One thing I want to clarify:
"First, I had to do “Engine.init(1, 1, true).get” instead of “Engine.init(1, 1, true).get”
Do you mean use Engine.init(1, 1, true) instead of Engine.init(4, 1, true)?
Thanks for the advice. Already in progress :) with:
N_EXECUTORS=200
N_CORES=1
(and only doing xFit.repartition(numExecutors * numCores) as before, no coalesce)
Seems like it is working! It's already done multiple rounds of
count at DistriOptimizer.scala:399
reduce at DistriOptimizer.scala:220
Keep fingers crossed!
BTW, on another topic: any sample code for computing predictions given a fitted model? I tried it myself, writing a class like Batching, but I got this exception:
scala> val predictor = OOMPredict(batchSize, sampleShape, modelBroadcast.asInstanceOf[Broadcast[Module[Double]]])
predictor: OOMPredict = $iwC$$iwC$OOMPredict@67023824
scala> val predictions = DataSet.rdd(trainSetRDD) -> predictor
java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@6902e2bf)
Thanks.
V.
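Regarding the prediction question above: the NotSerializableException usually means the closure captured the SparkContext itself. A minimal sketch of one way around it, broadcasting the trained model and scoring inside mapPartitions (my own sketch, not BigDL sample code; the valueAt call and the 1-D input shape are assumptions for this toy data):

import com.intel.analytics.bigdl.tensor.{Storage, Tensor}

// trainSetRDD: RDD[(Array[Double], Double)] as built earlier in the thread;
// trainedModel is the module returned by optimizer.optimize().
val bcModel = sc.broadcast(trainedModel)

val predictions = trainSetRDD.mapPartitions { iter =>
  val localModel = bcModel.value   // only the broadcast is captured, never sc
  val input = Tensor[Double]()
  iter.map { case (features, _) =>
    input.set(Storage[Double](features), storageOffset = 1, sizes = Array(features.length))
    // copy the scalar out, since forward() reuses its output buffer
    localModel.forward(input).asInstanceOf[Tensor[Double]].valueAt(1)
  }
}
predictions.take(5).foreach(println)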
Hi Xin –
Thanks. I tried following the /loadmodel example.
I tried saving / loading the model to HDFS, but was not successful. It seems like the saveModel() and Module.load() methods only operate on the local filesystem (true?). How do I save / load models when I run a spark-submit job on a cluster?
Thanks.
V.
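On the save/load question: I'm not sure whether this BigDL version accepts HDFS paths directly, so a conservative workaround (a sketch under that assumption, with illustrative paths) is to save/load on the driver's local disk and shuttle the file to/from HDFS with Hadoop's FileSystem API:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// 1) Save locally on the driver with the BigDL call you already use
//    (e.g. the saveModel() mentioned above), writing to /tmp/oom_model.bigdl.
// 2) Push the file to HDFS.
val fs = FileSystem.get(new URI("hdfs:///"), new Configuration())
fs.copyFromLocalFile(new Path("/tmp/oom_model.bigdl"),
  new Path("/user/vadim/models/oom_model.bigdl"))

// Later run: pull it back to local disk, then Module.load from there.
fs.copyToLocalFile(new Path("/user/vadim/models/oom_model.bigdl"),
  new Path("/tmp/oom_model.bigdl"))
// val restored = Module.load[Double]("/tmp/oom_model.bigdl")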
Thanks for the advice!
Jason –
On your last comment – executors w/ many cores: I am limited to 8 cores in my environment (per policy).
Anyway, I have been able to successfully run a few times with 160 executors and 6 cores each (8 cores gave me out-of-memory problems).
However, sometimes I get a job failed with this error (happens after about 100 or so iterations; all of a sudden it dies):
17/01/22 14:19:18 ERROR YarnClusterScheduler: Lost executor 38 on phxaishdc9dn0781.phx.ebay.com: Container marked as failed: container_e152_1483654296013_217381_02_000486 on host: phxaishdc9dn0781.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
17/01/22 14:19:18 ERROR TaskSetManager: Task 30 in stage 2458.0 failed 1 times; aborting job
17/01/22 14:19:18 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 30 in stage 2458.0 failed 1 times, most recent failure: Lost task 30.0 in stage 2458.0 (TID 59323, phxaishdc9dn0781.phx.ebay.com): ExecutorLostFailure (executor 38 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_217381_02_000486 on host: phxaishdc9dn0781.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 30 in stage 2458.0 failed 1 times, most recent failure: Lost task 30.0 in stage 2458.0 (TID 59323, phxaishdc9dn0781.phx.ebay.com): ExecutorLostFailure (executor 38 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_217381_02_000486 on host: phxaishdc9dn0781.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1855)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1881)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD.count(RDD.scala:1164)
at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:236)
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:532)
at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN$.main(OOM_Train_NN.scala:306)
at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN.main(OOM_Train_NN.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
This last one happened with 200 executors and 6 cores. (I have been trying bigger and bigger batch sizes too, and thus the 200 number of executors; this last one was 18e6 total batch = 90K per executor so I wouldn’t run out of memory. How large can I make the batch per executor w/ 40GB executor memory and 6 cores per executor?)
V.
From: Jason Dai <jaso...@gmail.com>
Date: Friday, January 20, 2017 at 3:38 AM
To: 邱鑫 <qiuxin...@gmail.com>
Indeed – I did see such heartbeat timeout messages. That's probably it then.
Thanks!
V.
From: Jason Dai <jaso...@gmail.com>
Date: Monday, January 23, 2017 at 4:00 AM
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Another possibility is that there are too many GCs, and the executors are removed by the master because the heartbeat times out. You can check the Spark logs to see if there are such error messages.
Thanks,
-Jason
Hi Jason (and all)
So I have trained the 1B+ set a few times, but am not getting the results I want, so I am playing around with different parameters, number of hiddens, layers, etc. Here’s the thing:
1. Go with 160 exec, 6 cores each. 40GB per exec, with 8GB overhead. Total batch 30M, batch per exec 187500. 2 ReLU layers, 100 hiddens each. Finishes ok.
2. Make my network a bit more complex: 3 ReLU layers, 100 hiddens each. Same job config. Get this: Container killed by YARN for exceeding memory limits. 48.0 GB of 48 GB physical memory used.
3. OK, so then I go with 200 exec, 5 cores each. Then I get the dreaded: requirement failed: Detect multi-task run on one Executor/Container. Currently not support this
Any ideas?
V.
Thanks. Cut down my batch size, trying again. Sometimes I also get this one:
17/01/26 19:57:37 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 715 in stage 25.0 failed 1 times, most recent failure: Lost task 715.0 in stage 25.0 (TID 9276, phxdpehdc9dn2398.stratus.phx.ebay.com): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_279325_01_2122733 on host: phxdpehdc9dn2398.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 715 in stage 25.0 failed 1 times, most recent failure: Lost task 715.0 in stage 25.0 (TID 9276, phxdpehdc9dn2398.stratus.phx.ebay.com): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_279325_01_2122733 on host: phxdpehdc9dn2398.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1855)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1881)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD.count(RDD.scala:1164)
at com.intel.analytics.bigdl.dataset.DistributedDataSet$class.transform(DataSet.scala:180)
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet.transform(DataSet.scala:208)
at com.intel.analytics.bigdl.dataset.AbstractDataSet$class.$minus$greater(DataSet.scala:91)
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet.$minus$greater(DataSet.scala:208)
….
During the batching transformation. Not sure if this is something on your end or on Spark/Hadoop.
V.
This error –
Detect multi-task run on one Executor/Container. Currently not support this at scala.Predef$.require(Predef.scala:219)
– is really killing me now, after decreasing the batch size. I even upped the executor cores to 10 (100 executors), with spark.shuffle.reduceLocality.enabled=false.
I can’t get a successful run anymore. About to give up on this until this is fixed – not possible to do any real work in this situation.
V.
From: Jason Dai <jaso...@gmail.com>
Date: Thursday, January 26, 2017 at 8:32 PM
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Looks like Spark is cancelling the job as the executor (YARN container) is lost - maybe due to YARN killing it? Are there any error messages from the Spark tasks?
-Jason
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim,
First I think your batch size seems too large for the learning rate; try lowering the batch size (for instance we used a batch size of 1K~2K when training the Inception model in our example), and tuning the hyper-parameters for better accuracy.
And to address the OOM problem, you can also try reducing the batch size (while keeping the core#) for now. We'll try to reproduce the "multi-task run" issues on our side.
Thanks,
-Jason
val optimizer = Optimizer(
model = model,
dataset = trainSet,
criterion = new MSECriterion[Double]()
).asInstanceOf[DistriOptimizer[Double]].disableCheckSingleton()
Thanks!! Will give it a shot.
BTW: after many trials, I went back to my 160 executors / 6 cores setup and it seems to be running on an older cluster we have, which limits us to 8 cores per exec. I was having some memory issues with a 3-hidden-layer network and a very large batch on this cluster, but now since I have lowered my batch, it seems to be running. This older cluster is running Spark 1.6.1 w/ Scala 2.10.
We also have a newer cluster w/ Spark 1.6.2 Scala 2.11, with higher mem nodes, where you can have 10-12 cores per exec. This is where I kept running into the aforementioned bug _all the time_, i.e. no successful runs on this new cluster. This is where I will try your patch mentioned below. I will let you know.
Again, thanks for your support.
Vadim.
Hi Jason –
The fix is working well, I can now start running jobs OK, but I am running into that "lost" container error. Here is the trace (400 execs x 4 cores, 40 GB per exec, 8 GB overhead). Here's the network:
17/01/29 01:38:19 INFO OOM_Train_NN$: DeepLearning Network =
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]
(1): nn.Reshape(69)
(2): nn.Linear(69 -> 100)
(3): nn.ReLU
(4): nn.Linear(100 -> 100)
(5): nn.ReLU
(6): nn.Linear(100 -> 100)
(7): nn.ReLU
(8): nn.Linear(100 -> 1)
}
17/01/29 01:38:20 INFO DistriOptimizer$: Cache thread models...
17/01/29 01:38:27 INFO DistriOptimizer$: Cache thread models... done
17/01/29 01:38:27 INFO DistriOptimizer$: config {
learningRate: 0.01
maxDropPercentage: 0.0
momentum: 0.9
warmupIterationNum: 200
learningRateDecay: 0.002
dampening: 0.0
dropPercentage: 0.0
comupteThresholdbatchSize: 100
}
7/01/29 10:12:30 INFO DistriOptimizer$: [Epoch 4 28800000/1085245782][Iteration 688][Wall Clock 21328.862921253s] Train 4800000 in 51.694058478seconds. Throughput is 92853.99795109505 records/second. Loss is 11173.878006300865.
17/01/29 10:13:19 INFO DistriOptimizer$: [Epoch 4 33600000/1085245782][Iteration 689][Wall Clock 21380.556979731s] Train 4800000 in 49.042573606seconds. Throughput is 97874.14580976956 records/second. Loss is 10773.208485912715.
17/01/29 10:13:38 ERROR YarnClusterScheduler: Lost executor 33 on phxdpehdc9dn2651.stratus.phx.ebay.com: Container marked as failed: container_e152_1483654296013_307643_02_000062 on host: phxdpehdc9dn2651.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
17/01/29 10:13:38 ERROR TaskSetManager: Task 193 in stage 15229.0 failed 1 times; aborting job
17/01/29 10:13:38 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 15229.0 failed 1 times, most recent failure: Lost task 193.0 in stage 15229.0 (TID 570819, phxdpehdc9dn2651.stratus.phx.ebay.com): ExecutorLostFailure (executor 33 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_307643_02_000062 on host: phxdpehdc9dn2651.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 15229.0 failed 1 times, most recent failure: Lost task 193.0 in stage 15229.0 (TID 570819, phxdpehdc9dn2651.stratus.phx.ebay.com): ExecutorLostFailure (executor 33 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_307643_02_000062 on host: phxdpehdc9dn2651.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1855)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1975)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1032)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1014)
at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:220)
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:527)
at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN$.main(OOM_Train_NN.scala:322)
at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN.main(OOM_Train_NN.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
From: Jason Dai <jaso...@gmail.com>
Date: Friday, January 27, 2017 at 5:16 PM
To: "dingdi...@gmail.com" <dingdi...@gmail.com>
One more thing – that exception below is also preceded by this:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 71.0 failed 1 times, most recent failure: Lost task 193.0 in stage 71.0 (TID 13715, spades-0270-1003663.lvs02.eaz.ebayc3.com): java.util.concurrent.ExecutionException: java.util.NoSuchElementException: None.get
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.intel.analytics.bigdl.parameters.FutureResult$$anonfun$waitResult$1.apply(AllReduceParameter.scala:220)
at com.intel.analytics.bigdl.parameters.FutureResult$$anonfun$waitResult$1.apply(AllReduceParameter.scala:220)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.intel.analytics.bigdl.parameters.FutureResult.waitResult(AllReduceParameter.scala:220)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:145)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:125)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2$$anonfun$4.apply(AllReduceParameter.scala:139)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2$$anonfun$4.apply(AllReduceParameter.scala:139)
at scala.Option.getOrElse(Option.scala:121)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2.call(AllReduceParameter.scala:139)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2.call(AllReduceParameter.scala:135)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Driver stacktrace:
From: <bigdl-us...@googlegroups.com> on behalf of "Brzeski, Vadim" <vbrz...@ebay.com>
Date: Sunday, January 29, 2017 at 9:57 AM
To: Jason Dai <jaso...@gmail.com>
Thanks. BTW – as I said, I am trying to run this on a newer cluster (Spark 1.6.2, Scala 2.11). So far no luck, and I think something is messed up with our cluster (we're looking into it), but here's a new error – I've never seen this one before :)
Caused by: java.lang.Exception: Please initialize AllReduceParameter first!!
at com.intel.analytics.bigdl.parameters.AllReduceParameter.readGradientPartition(AllReduceParameter.scala:93)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition$lzycompute(AllReduceParameter.scala:61)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition(AllReduceParameter.scala:61)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.aggregrateGradientParition(AllReduceParameter.scala:185)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$optimize$1.apply(DistriOptimizer.scala:227)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$optimize$1.apply(DistriOptimizer.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
From: <bigdl-us...@googlegroups.com> on behalf of Jason Dai <jaso...@gmail.com>
Date: Monday, January 30, 2017 at 2:36 AM
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim,
These exceptions seem to be caused by losing some Spark executors – probably the same executor heartbeat timeout problem discussed before? Do you see any such error messages in the executor logs? And can you check whether there are a lot of GC pauses? We'll look into the memory consumption problems in our environment.
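For reference, a minimal sketch of the standard Spark settings to check/raise when chasing heartbeat timeouts, plus GC logging on the executors (the property names are standard Spark ones; the values here are only illustrative):
******************
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only: raise the executor heartbeat / network timeouts to
// rule out heartbeat-timeout-driven executor loss, and turn on GC logging.
val conf = new SparkConf()
  .setAppName("Sample_NN")
  .set("spark.executor.heartbeatInterval", "60s")   // default is 10s
  .set("spark.network.timeout", "600s")             // default is 120s
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
val sc = new SparkContext(conf)
******************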
Thanks,
-Jason
From: "Brzeski, Vadim" <vbrz...@ebay.com>
Are we sure this AllReduceParameter error is because of losing some executors?
I see this in the log (it is clean – no error messages until that one). I am also now training with only 10% of my data, and a batch size per executor of 3000.
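Concretely, the sampling and batching amount to roughly this (a sketch only – trainRDD stands in for my full RDD of (features, label) pairs; sample() is the standard Spark RDD API):
******************
// Sketch: take a 10% sample of the training data and use a smaller batch size.
// Assumption: trainRDD is an existing RDD[(Array[Double], Double)].
val sampledRDD = trainRDD.sample(withReplacement = false, fraction = 0.1, seed = 42L)

// 3000 records per executor per mini-batch; the sampled RDD then goes through
// the same ToSample / SampleToBatch transformers as before.
val batchSizePerExecutor = 3000
******************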
2017-02-01 10:28:55 INFO OOM_Train_NN$:295 - DeepLearning Network =
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]
(1): nn.Reshape(69)
(2): nn.Linear(69 -> 70)
(3): nn.ReLU
(4): nn.Linear(70 -> 70)
(5): nn.ReLU
(6): nn.Linear(70 -> 70)
(7): nn.ReLU
(8): nn.Linear(70 -> 1)
}
2017-02-01 10:28:55 INFO DistriOptimizer$:400 - Cache thread models...
2017-02-01 10:28:58 INFO DistriOptimizer$:402 - Cache thread models... done
2017-02-01 10:28:58 INFO DistriOptimizer$:89 - config {
learningRate: 0.01
maxDropPercentage: 0.0
momentum: 0.9
warmupIterationNum: 200
learningRateDecay: 0.002
dampening: 0.0
dropPercentage: 0.0
comupteThresholdbatchSize: 100
}
2017-02-01 10:28:58 INFO DistriOptimizer$:90 - Shuffle data
2017-02-01 10:28:58 INFO DistriOptimizer$:93 - Shuffle data complete. Takes 0.032758708s
2017-02-01 10:29:18 INFO DistriOptimizer$:241 - [Epoch 1 0/108496693][Iteration 1][Wall Clock 0.0s] Train 1200000 in 15.646894009seconds. Throughput is 76692.53714569595 records/second. Loss is 10817.976028867404.
2017-02-01 10:29:37 INFO DistriOptimizer$:241 - [Epoch 1 1200000/108496693][Iteration 2][Wall Clock 15.646894009s] Train 1200000 in 19.144417971seconds. Throughput is 62681.456381581425 records/second. Loss is 14067.692619235619.
2017-02-01 10:29:54 INFO DistriOptimizer$:241 - [Epoch 1 2400000/108496693][Iteration 3][Wall Clock 34.79131198s] Train 1200000 in 16.798147873seconds. Throughput is 71436.4469864433 records/second. Loss is 9435.186635609934.
2017-02-01 10:30:17 ERROR TaskSetManager:74 - Task 297 in stage 148.0 failed 1 times; aborting job
2017-02-01 10:30:17 ERROR ApplicationMaster:95 - User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 297 in stage 148.0 failed 1 times, most recent failure: Lost task 297.0 in stage 148.0 (TID 22742, spades-0334-1027666.lvs02.eaz.ebayc3.com): java.lang.Exception: Please initialize AllReduceParameter first!!
at com.intel.analytics.bigdl.parameters.AllReduceParameter.readGradientPartition(AllReduceParameter.scala:93)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition$lzycompute(AllReduceParameter.scala:61)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition(AllReduceParameter.scala:61)
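For reference, the nn.Sequential dump above corresponds to roughly this model definition (a sketch reconstructed from the printed layer list; Double tensors and the usual BigDL imports/implicits assumed):
******************
import com.intel.analytics.bigdl.nn.{Sequential, Reshape, Linear, ReLU}

// Sketch of the network in the dump: Reshape(69) -> three Linear+ReLU blocks
// of width 70 -> Linear(70 -> 1) producing a single regression output.
val model = Sequential[Double]()
  .add(Reshape[Double](Array(69)))
  .add(Linear[Double](69, 70))
  .add(ReLU[Double]())
  .add(Linear[Double](70, 70))
  .add(ReLU[Double]())
  .add(Linear[Double](70, 70))
  .add(ReLU[Double]())
  .add(Linear[Double](70, 1))
******************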
From: "dingdi...@gmail.com" <dingdi...@gmail.com>
Date: Monday, January 30, 2017 at 11:43 AM
To: BigDL User Group <bigdl-us...@googlegroups.com>
Hi Jason –
Looks like things have stabilized – I am now able to run jobs consistently. I had already done (1) and (2) below. I think what has made the difference are the following (see the sketch after this list):
a) Your latest patch
b) Running with a much smaller batch size (2K – 3K per executor)
c) 150 nodes, 6 cores per node
d) training on a fraction of my 1.09B records; right now I have done a few runs with up to 30% of the data (~300M records), and continue pushing it further…..
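For (b) and (c), a sketch of what that looks like on the driver side (this follows the Engine.init(nodes, coresPerNode, onSpark) pattern from my earlier snippets; the values are the ones listed above):
******************
import org.apache.spark.SparkContext
import com.intel.analytics.bigdl.utils.Engine

// Sketch: 150 nodes, 6 cores per node, running on Spark; Engine.init returns
// an Option[SparkConf] in the BigDL version I'm on, hence the .get.
val sc = new SparkContext(
  Engine.init(150, 6, true).get
    .setAppName("OOM_Train_NN")
)

// Batch size kept in the 2K-3K range per executor.
val batchSizePerExecutor = 3000
******************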
Thanks for all your help. If I am ultimately successful with this, I will definitely advertise BigDL and your support here at eBay.
V.
From: <bigdl-us...@googlegroups.com> on behalf of Jason Dai <jaso...@gmail.com>
Date: Thursday, February 2, 2017 at 3:57 AM
To: "dingdi...@gmail.com" <dingdi...@gmail.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>, "Brzeski, Vadim" <vbrz...@ebay.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim,