trying simple NN, but getting error: requirement failed: input must be vector or matrix


vbrz...@ebay.com

Jan 13, 2017, 11:26:29 PM
to BigDL User Group

 

I have the following simple NN training code (just getting my feet wet), modeled after the examples/tutorials here: https://github.com/intel-analytics/BigDL/wiki/Getting-Started.

 

All is well, until I hit the actual “optimize” – see below – when this error occurs:

 

17/01/13 17:56:15 ERROR ThreadPool$: Error: java.lang.IllegalArgumentException: requirement failed: input must be vector or matrix
                at scala.Predef$.require(Predef.scala:233)
                at com.intel.analytics.bigdl.nn.Linear.updateOutput(Linear.scala:66)
                at com.intel.analytics.bigdl.nn.Linear.updateOutput(Linear.scala:29)
                at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
                at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:33)
                at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
                at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(LocalOptimizer.scala:116)
                at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
                at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
                at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$invokeAndWait$1$$anonfun$apply$2.apply(Engine.scala:103)
                at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
                at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)

 

 

Here is my code:

 

*************************

 

val sc = new SparkContext(
  Engine.init(1, 1, true).get
    .setAppName("Sample_NN")
    .set("spark.akka.frameSize", 64.toString)
    .set("spark.task.maxFailures", "1")
    )


// make up some data
val data = (0 to 100).collect {
  case i if i > 75 || i < 25 =>
    (0 to 100).collect {
      case j if j > 75 || j < 25 =>
        val res =
          if (i > 75 && j < 25) 23.0
          else if (i < 25 && j > 75) -45
          else 0
        (Array(i / 100.0 + 1, j / 100.0 + 2), res)
    }
}.flatMap(x => x)

val batchSize = 4

val trainSet = DataSet.array(data.toArray).transform(ToSample(1,2)).transform(SampleToBatch(batchSize))
val validationSet = trainSet

val layer1 = Linear[Double](2,4)
val layer2 = ReLU[Double]()
val output = Sum[Double]()

val model = Sequential[Double]().
  add(layer1).
  add(layer2).
  add(output)

val state =
  T(
    "learningRate" -> 0.01,
    "weightDecay" -> 0.0005,
    "momentum" -> 0.9,
    "dampening" -> 0.0
  )

val optimizer = Optimizer(
  model = model,
  dataset = trainSet,
  criterion = new MSECriterion[Double]()
)

optimizer.
  setState(state).
  // setValidation(Trigger.everyEpoch, validationSet, Array(new Loss[Double])).
  setOptimMethod(new Adagrad[Double]()).
  optimize()

 

**************************

 

SampleToBatch is the one in BigDL here:  https://github.com/intel-analytics/BigDL/blob/master/dl/src/main/scala/com/intel/analytics/bigdl/dataset/Transformer.scala

as is Sample in here: https://github.com/intel-analytics/BigDL/blob/master/dl/src/main/scala/com/intel/analytics/bigdl/dataset/Types.scala

 

ToSample is my own, defined as follows:

 

******************

object ToSample {
  def apply(nRows: Int, nCols: Int)
  : ToSample =
    new ToSample(nRows, nCols)
}

class ToSample(nRows: Int, nCols: Int)
  extends Transformer[ (Array[Double], Double) , Sample[Double]] {

  private val buffer = new Sample[Double]()
  private var featureBuffer: Array[Double] = null
  private var labelBuffer: Array[Double] = null

  override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[Sample[Double]] = {

    prev.map(x => {

      if (featureBuffer == null || featureBuffer.length < nRows * nCols) {
        featureBuffer = new Array[Double](nRows * nCols)
      }
      if (labelBuffer == null || labelBuffer.length < nRows) {
        labelBuffer = new Array[Double](nRows)
      }

      var i = 0
      while (i < nRows) {
        Array.copy(x._1, 0, featureBuffer, i * nCols, nCols)
        labelBuffer(i) = x._2
        i += 1
      }

      buffer.copy(featureBuffer, labelBuffer,
        Array(nRows, nCols), Array(nRows))
    })
  }
}

 

********************

 

Inspecting the MiniBatch, all seems fine:

 

scala> val q = trainSet.toLocal.data(false)
q: Iterator[com.intel.analytics.bigdl.dataset.MiniBatch[Double]] = non-empty iterator

scala> val z = q.next()
z: com.intel.analytics.bigdl.dataset.MiniBatch[Double] =
MiniBatch((1,.,.) =
1.0          2.0

(2,.,.) =
1.0          2.01

(3,.,.) =
1.0          2.02

(4,.,.) =
1.0          2.03

[com.intel.analytics.bigdl.tensor.DenseTensor of size 4x1x2],0.0
0.0
0.0
0.0
[com.intel.analytics.bigdl.tensor.DenseTensor of size 4x1])

 

so I can’t understand why I am getting the above error. 

 

Any help would be appreciated.

 

Thanks.

 

Vadim.

Yan Wan

Jan 14, 2017, 12:33:17 AM
to BigDL User Group, vbrz...@ebay.com
Hi,
I think the problem is that the Linear layer requires a 1D or 2D input, which is a vector or a matrix, whereas the minibatch input is a 3D tensor.
The SampleToBatch transformer produces input tensors of size [4, 1, 2], while the valid input size for the Linear layer here is [4, 2].

One possible solution is this:

In your model definition:

val model = Sequential[Double]().
  add(layer1).
  add(layer2).
  add(output)

 just add a Reshape layer to reform your data input format:
=>

val model = Sequential[Double]()
    .add(Reshape(Array(batchSize*1, 2))).
   add(layer1).
  add(layer2).
  add(output)

Then the model will receive input in the correct format.


Hope this will help you.

Yiheng Wang

Jan 14, 2017, 12:59:46 AM
to Yan Wan, BigDL User Group, vbrz...@ebay.com
@Yan Wan. The model definition should not include batch size.

Hi Vadim

It should be
val model = Sequential[Double]()
    .add(Reshape(Array(2))).
   add(layer1).
  add(layer2).
  add(output)

The Reshape(Array(2)) will reshape a 4 * 1 * 2 tensor into a 4 * 2 tensor.
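For instance, a quick sanity check in the shell (a minimal sketch; the 4 x 1 x 2 shape is taken from the MiniBatch printed earlier in this thread):

import com.intel.analytics.bigdl.nn.Reshape
import com.intel.analytics.bigdl.tensor.Tensor

// A 4 x 1 x 2 tensor, like the MiniBatch above.
val t = Tensor[Double](4, 1, 2).rand()
// Reshape(Array(2)) keeps the batch dimension and flattens each
// 1 x 2 record into a length-2 vector, giving a 4 x 2 output.
val r = Reshape[Double](Array(2)).forward(t)
println(r.size().mkString("x")) // 4x2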

Another solution is to modify ToSample to

buffer.copy(featureBuffer, labelBuffer,
        Array(nRows * nCols), Array(nRows))

If your model doesn't contain any spatial layers, you needn't generate a 2D tensor for a sample.

Regards,
Yiheng




--
Yiheng Wang
SSG STO Big Data Technology
Intel Asia-Pacific Research & Development Ltd.
No. 880 Zi Xing Road
Shanghai, PRC, 200241
Phone: (86-21) 61166094


yih...@gmail.com
yih...@hotmail.com

von Brzeski, Vadim

Jan 16, 2017, 12:19:26 AM
to Yiheng Wang, Yan Wan, BigDL User Group

Thanks Yiheng!  That worked.

 

Vadim.


von Brzeski, Vadim

Jan 16, 2017, 9:59:46 PM
to Yiheng Wang, Yan Wan, BigDL User Group

Spoke too soon :(

 

Seems like the solution works, but only if batchSize == the number of outputs in the Linear layer (??). i.e.:

 

Works OK:

 

val batchSize = 6
val sampleShape = Array(1,2)

val trainSet = DataSet.array(data.toArray).transform(ToSample(1,2)).transform(SampleToBatch(batchSize))
val validationSet = trainSet

val layer1 = Linear[Double](2,6)
val layer2 = ReLU[Double]()
val output = Sum[Double]()

val model = Sequential[Double]().
  add(Reshape(Array(2))).
…etc.

 

Fails – see exception below:

 

val batchSize = 10
val sampleShape = Array(1,2)

val trainSet = DataSet.array(data.toArray).transform(ToSample(1,2)).transform(SampleToBatch(batchSize))
val validationSet = trainSet

val layer1 = Linear[Double](2,6)
val layer2 = ReLU[Double]()
val output = Sum[Double]()

val model = Sequential[Double]().
  add(Reshape(Array(2))).
…etc.

 

17/01/16 19:51:22 ERROR ThreadPool$: Error: java.lang.IllegalArgumentException: requirement failed: inconsistent tensor size
                at scala.Predef$.require(Predef.scala:233)
                at com.intel.analytics.bigdl.tensor.DenseTensorApply$.apply2(DenseTensorApply.scala:63)
                at com.intel.analytics.bigdl.tensor.DenseTensor.map(DenseTensor.scala:396)
                at com.intel.analytics.bigdl.nn.MSECriterion.updateOutput(MSECriterion.scala:33)
                at com.intel.analytics.bigdl.nn.MSECriterion.updateOutput(MSECriterion.scala:27)
                at com.intel.analytics.bigdl.nn.abstractnn.AbstractCriterion.forward(AbstractCriterion.scala:43)
                at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(LocalOptimizer.scala:117)
                at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
                at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
                at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$invokeAndWait$1$$anonfun$apply$2.apply(Engine.scala:103)
                at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
                at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)

 

V.

 


Zhang, Yao

Jan 17, 2017, 2:21:15 AM
to von Brzeski, Vadim, Yiheng Wang, Wan, Yan, BigDL User Group

Hi,

 

For modules that support batching and whose input can be a multi-dimensional tensor (there is no restriction on the number of dimensions of the input tensor), there is usually an argument named `nInputDims`.

This argument means "the number of dimensions of each input record". Because the input tensor can have an arbitrary number of dimensions, the module cannot automatically infer whether the input is batched, or what the actual dimensions of the input are once the batch dimension is excluded. Hence `nInputDims` gives the number of dimensions of the input excluding the batch. In your case, the `nInputDims` of the Sum layer should be 1, because the input size is batchSize * 6 (one dimension excluding the batch dimension).

The Sum layer should be `val output = Sum(nInputDims = 1)`.

 

So your code should be something like:

"""
import com.intel.analytics.bigdl.numeric.NumericDouble

val batchSize = 10
val trainSet = DataSet.array(data.toArray) -> ToSample(1, 2) -> SampleToBatch(batchSize)

val layer1 = Linear(2, 6)
val layer2 = ReLU()
val output = Sum(nInputDims = 1)

val model = Sequential()
  .add(Reshape(Array(2)))
  .add(layer1)
  .add(layer2)
  .add(output)
"""

 

As for why `batchSize = 6` runs without error: the Sum layer receives a 6 * 6 tensor and sums along the first dimension, so the criterion obtains a 1-D tensor of size 6. Because that size happens to equal the number of labels (= batchSize), there is no error, but the logic is still not correct. With `batchSize = 10`, the criterion obtains a size-6 prediction but 10 labels, so the sizes cannot match and the runtime error occurs.
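To make this concrete, here is a small sketch (tensor contents are random; the shapes follow the sizes discussed above):

import com.intel.analytics.bigdl.nn.Sum
import com.intel.analytics.bigdl.tensor.Tensor

// Output of Linear(2, 6) + ReLU for a batch of 10: a 10 x 6 tensor.
val out = Tensor[Double](10, 6).rand()

// Without nInputDims, Sum sums over dim 1, i.e. across the batch:
// the result has size 6, which cannot match the 10 labels.
println(Sum[Double]().forward(out).size().mkString("x")) // 6

// With nInputDims = 1, dim 1 is treated as the batch dimension and the
// sum runs over each record's 6 values, giving one prediction per record.
println(Sum[Double](nInputDims = 1).forward(out).size().mkString("x")) // 10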

 

Hope that helps.

 

Best regards

Yao

 

Full worked code attached:

"""
object ToSample {
  def apply(nRows: Int, nCols: Int): ToSample =
    new ToSample(nRows, nCols)
}

class ToSample(nRows: Int, nCols: Int)
  extends Transformer[(Array[Double], Double), Sample[Double]] {
  private val buffer = new Sample[Double]()
  private var featureBuffer: Array[Double] = null
  private var labelBuffer: Array[Double] = null

  override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[Sample[Double]] = {
    prev.map(x => {
      if (featureBuffer == null || featureBuffer.length < nRows * nCols) {
        featureBuffer = new Array[Double](nRows * nCols)
      }
      if (labelBuffer == null || labelBuffer.length < nRows) {
        labelBuffer = new Array[Double](nRows)
      }

      var i = 0
      while (i < nRows) {
        Array.copy(x._1, 0, featureBuffer, i * nCols, nCols)
        labelBuffer(i) = x._2
        i += 1
      }

      buffer.copy(featureBuffer, labelBuffer,
        Array(nRows, nCols), Array(nRows))
    })
  }
}

// make up some data
val data = (0 to 100).collect {
  case i if i > 75 || i < 25 =>
    (0 to 100).collect {
      case j if j > 75 || j < 25 =>
        val res =
          if (i > 75 && j < 25) 23.0
          else if (i < 25 && j > 75) -45
          else 0
        (Array(i / 100.0 + 1, j / 100.0 + 2), res)
    }
}.flatten

val sc = new SparkContext(
  Engine.init(1, 1, true).get
    .setAppName("Sample_NN")
    .set("spark.akka.frameSize", 64.toString)
    .set("spark.task.maxFailures", "1")
    .setMaster("local[4]")
)

import com.intel.analytics.bigdl.numeric.NumericDouble

val batchSize = 10
val trainSet = DataSet.array(data.toArray) -> ToSample(1, 2) -> SampleToBatch(batchSize)

val layer1 = Linear(2, 6)
val layer2 = ReLU()
val output = Sum(nInputDims = 1)

val model = Sequential()
  .add(Reshape(Array(2)))
  .add(layer1)
  .add(layer2)
  .add(output)

val state =
  T(
    "learningRate" -> 0.01,
    "weightDecay" -> 0.0005,
    "momentum" -> 0.9,
    "dampening" -> 0.0
  )

val optimizer = Optimizer(
  model = model,
  dataset = trainSet,
  criterion = new MSECriterion[Double]()
)

optimizer.
  setState(state).
  // setValidation(Trigger.everyEpoch, validationSet, Array(new Loss[Double])).
  setOptimMethod(new Adagrad[Double]()).
  optimize()
"""


Jason Dai

Jan 17, 2017, 7:12:16 AM
to Zhang, Yao, von Brzeski, Vadim, Yiheng Wang, Wan, Yan, BigDL User Group
I think the point here is that one needs to specify the number of dimensions for each input record (i.e., nInputDims) to Sum; in this case, each input record to Sum is a 1-dimensional tensor (Tensor[Double](6)), and we should specify Sum(nInputDims = 1) here. (See https://github.com/torch/nn/blob/master/doc/simple.md#sum for more details.)

As mini-batch SGD is used in BigDL training, I think we can actually have the Optimizer infer the batch size automatically, so that each Module can just refer to it during training; I have opened an issue (https://github.com/intel-analytics/BigDL/issues/382) for this.

Thanks,
-Jason


von Brzeski, Vadim

Jan 17, 2017, 10:58:06 PM
to Jason Dai, Zhang, Yao, Yiheng Wang, Wan, Yan, BigDL User Group

Hi –

 

The above suggestions worked (thanks Yao), but now I have a different problem. 

 

When I train locally using the code (the one in the above thread and the working example you provided), all is fine. I get reasonable model weights (model.getParameters returns non-zero values for weights and biases) and a good fit in training.

 

But when I try the same on RDD, I get all biases = 0 and a bad fit. 

 

Again, the network:

 

val layer1 = Linear[Double](dimInput,nHidden)
val layer2 = ReLU[Double]()
val layer3 = Linear[Double](nHidden,nHidden)
val layer4 = ReLU[Double]()
val output = Linear[Double](nHidden,1) //Sum[Double](nInputDims = 1)

val model = Sequential[Double]().
  add(Reshape(Array(dimInput))).
  add(layer1).
  add(layer2).
  add(layer3).
  add(layer4).
  add(output)

 

Local:

 

val trainSet = DataSet.array(data.toArray).transform(ToSample(1,dimInput)).transform(SampleToBatch(batchSize))

 

leads to this:

 

scala> println(model.getParameters())
(0.6438069051434786
-0.5983624270225454
-1.222042899057738
1.2080124683211286
0.09588958746823727
…
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67],-0.029515928487775192
-0.08776935671924736
0.05927649370933306
0.18495866162243968
-0.003225973822120612
-0.008541260257896508
…
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67])

 

and a good fit.

 

But with RDD:

 

val sampleShape = Array(1,dimInput)

val batching = OOMBatching(batchSize, sampleShape)
val trainSetRDD = sc.makeRDD(data).coalesce(numExecutors*numCores, true).coalesce(numExecutors)
val trainSet = DataSet.rdd(trainSetRDD) -> batching

 

where OOMBatching:

 

object OOMBatching {
  def apply(batchSize: Int, sampleShape: Array[Int]): OOMBatching =
    new OOMBatching(batchSize, sampleShape)
}

/**
 * Batching samples into mini-batch
 *
 * @param batchSize The desired mini-batch size.
 * @param sampleShape Shape of the training sample
 */
class OOMBatching(batchSize: Int, sampleShape: Array[Int]) extends Transformer[(Array[Double], Double), MiniBatch[Double]] {
  override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[MiniBatch[Double]] = {
    new Iterator[MiniBatch[Double]] {
      private val featureTensor: Tensor[Double] = Tensor[Double]()
      private val labelTensor: Tensor[Double] = Tensor[Double]()
      private var featureData: Array[Double] = null
      private var labelData: Array[Double] = null
      private val featureLength = sampleShape.product
      private val labelLength = 1

      override def hasNext: Boolean = prev.hasNext

      override def next(): MiniBatch[Double] = {
        if (prev.hasNext) {
          var i = 0
          while (i < batchSize && prev.hasNext) {
            val sample = prev.next()
            if (featureData == null || featureData.length < batchSize * featureLength) {
              featureData = new Array[Double](batchSize * featureLength)
            }
            if (labelData == null || labelData.length < batchSize * labelLength) {
              labelData = new Array[Double](batchSize * labelLength)
            }
            Array.copy(sample._1, 0, featureData, i * featureLength, featureLength)
            labelData(i) = sample._2
            i += 1
          }
          featureTensor.set(Storage[Double](featureData), storageOffset = 1, sizes = Array(i) ++ sampleShape)
          labelTensor.set(Storage[Double](labelData), storageOffset = 1, sizes = Array(i, 1))
          MiniBatch(featureTensor, labelTensor)
        }
        else {
          null
        }
      }
    }
  }
}

 

leads to this: all bias parameters exactly 0:

 

scala> print(model.getParameters)
(0.5835011785523165
-0.29177034775166744
-0.19938245877956728
0.17784682057321344
0.5514315980564399
…
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67],0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
…
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67])

 

Thanks again!

 

V.

 

 

 

Li, Zhichao

Jan 17, 2017, 11:32:23 PM
to von Brzeski, Vadim, Jason Dai, Zhang, Yao, Yiheng Wang, Wan, Yan, BigDL User Group

I guess this is due to an inconsistent batch size.

The "batchSize" within OOMBatching is the batch per node, i.e. if your cluster size is 4, then the total batch size is batchSize * 4. How about trying to specify the batchSize within OOMBatching as "batchSize used in local" / cluster_size?
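In other words, as a quick sketch of the arithmetic (variable names are made up):

val clusterSize    = 4
val localBatchSize = 100                           // the batch size you used locally
val perNodeBatch   = localBatchSize / clusterSize  // 25: pass this to OOMBatching
// effective global batch per iteration = perNodeBatch * clusterSize = 100, same as local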

 

Thanks,

Zhichao

 

 


Jason Dai

Jan 18, 2017, 12:14:11 AM
to Li, Zhichao, von Brzeski, Vadim, Zhang, Yao, Yiheng Wang, Wan, Yan, BigDL User Group
Hi Vadim,

As Zhichao mentioned above, the batch size used in your local training is "batchSize"; but the batch size used in your distributed training is "batchSize * node_num", which is probably too large and needs to be scaled down in distributed training.

Alternatively, you may use "batchSize = Utils.getBatchSize(totalBatch)" to calculate the "batchSize" given the "totalBatch" (e.g., see https://github.com/intel-analytics/BigDL/blob/master/dl/src/main/scala/com/intel/analytics/bigdl/dataset/Transformer.scala#L91).
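Something like the following sketch (Utils here is the helper used in the linked Transformer.scala; please verify the import path against your BigDL version):

val totalBatch = 400                             // the global batch size across the cluster
val batchSize  = Utils.getBatchSize(totalBatch)  // the per-node share, as SampleToBatch computes it
val batching   = OOMBatching(batchSize, sampleShape)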

Thanks,
-Jason


von Brzeski, Vadim

Jan 18, 2017, 2:21:34 AM
to Jason Dai, Li, Zhichao, Zhang, Yao, Yiheng Wang, Wan, Yan, BigDL User Group

Hi Jason, Zhichao –

 

Thanks, but I don’t think that’s it. 

1) When I tried it, I had numExecutors = numCores = 1, and batchSize = 100.

2) Then I tried numExecutors = 4, numCores = 1, and with Utils.getBatchSize(totalBatch), this time with batchSize = 400 → same result.

 

Here’s my Engine conf and data generation mechanism below, in case you want to try it. Like I said, it works great in local mode.

 

(BTW: where I’m headed with this: my real dataset (for a regression problem) has 1B+ rows and 70 columns. Am I delusional trying something like this with BigDL?)

 

V.

 

val sc = new SparkContext(
  Engine.init(numExecutors, numCores, true).get
    .setAppName("Sample_NN")
    .set("spark.akka.frameSize", 64.toString)
    .set("spark.task.maxFailures", "1")
    .set("spark.scheduler.minRegisteredResourcesRatio", "1")
    )

sc.setLogLevel("ERROR")

val dimInput = 2

val data = (0 to 100).collect {
  case i if i > 75 || i < 25 =>
    (0 to 100).collect {
      case j if j > 75 || j < 25 =>
        val res =
          if (i > 75 && j < 25) 2.0
          else if (i < 25 && j > 75) -4.0
          else 0
        (Array(i / 100.0, j / 100.0), res)
    }
}.flatten

val sampleShape = Array(1,dimInput)

 

 


Yiheng Wang

Jan 18, 2017, 5:05:27 AM
to von Brzeski, Vadim, Jason Dai, Li, Zhichao, Zhang, Yao, Wan, Yan, BigDL User Group
Hi Vadim

First, the second element returned by getParameters is the gradient, not the bias... We model after torch, and the key APIs (e.g. getParameters) have the same signatures (https://github.com/torch/nn/blob/master/doc/module.md#flatparameters-flatgradparameters-getparameters)... The weights and biases are combined into the first tensor.
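A quick way to see this in the shell (a sketch; model is the network from your code):

// getParameters() returns (all weights and biases flattened into one 1-D
// tensor, all of their gradients flattened into a second 1-D tensor).
val (weights, gradients) = model.getParameters()
println(weights.size().mkString("x"))   // 67 in your network above
println(gradients.size().mkString("x")) // also 67; zeros until a backward pass runs on this instance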

I don't quite understand why you say the distributed training gives a bad fit. I tried your code. Here are some results:
1. Not on Spark, batchSize = 100, maxIteration = 1000, loss can converge to 0.5

2. On Spark cluster, 4 executors, each use 1 core, total batchSize = 100(so each node has a batch 25), maxIteration = 1000, loss can converge to 0.5

3. On Spark cluster, 4 executors, each use 1 core, total batchSize = 400(so each node has a batch 100), maxIteration = 1000, loss can converge to 0.2
 
A little trick: you can see the loss in the log directly if you set the log level correctly; see https://github.com/intel-analytics/BigDL/wiki/Programming-Guide#logging

Here's the code (I just modified the batch size/logger/end condition): https://gist.github.com/yiheng/6c94cdb137627ff16474f44e2f7b3200

BTW, as the batch is small and the model is small, distributed training may be slower than local training due to communication cost.

We'd like to help you enable deep learning on your big dataset, so please let us know if you're blocked by any problem.

Regards,
Yiheng







Jason Dai

Jan 18, 2017, 7:04:13 AM
to Yiheng Wang, von Brzeski, Vadim, Li, Zhichao, Zhang, Yao, Wan, Yan, BigDL User Group
BTW, Optimizer.optimize() will return a trained model after it's done, while the original model passed to Optimizer is not updated when running on Spark.
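In code, that means (a minimal sketch):

// optimize() returns the trained model; on Spark the "model" instance you
// passed to Optimizer keeps its initial weights and zero gradients, which
// is what the all-zero second tensor shown earlier reflects.
val trainedModel = optimizer.optimize()
val (trainedWeights, _) = trainedModel.getParameters()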

As for your data scale, while we have not tried 1B+ rows, we have tested on ImageNet data (>1.2M samples, where each sample is a 224*224*3 tensor) using deep convolutional neural networks (Inception) on up to 32 nodes, which is actually comparable to your data size. We'd be glad to help out if you run into any problems when using BigDL on your data.

Thanks,
-Jason

von Brzeski, Vadim

Jan 18, 2017, 5:32:53 PM
to Jason Dai, Yiheng Wang, Li, Zhichao, Zhang, Yao, Wan, Yan, BigDL User Group

Hi guys –

 

Thanks for your help so far and your offer to help in future.

 

Yiheng (and all) – here is what I found:

 

First, I had to  do “Engine.init(1, 1, true).get” instead of “Engine.init(1, 1, true).get” because sometimes I get this error:  requirement failed: Detect multi-task run on one Executor/Container. Currently not support this

I also set nHidden = 6, num iterations = 2000, learning rate = 0.005

 

Anyway, when I do this:

 

val trainSetRDD = sc.makeRDD(data)

 

I get this fit:

 

17/01/18 15:30:15 INFO DistriOptimizer$: [Epoch 80 2400/2500][Iteration 2000][Wall Clock 96.279817182s] Train 100 in 0.032721897seconds. Throughput is 3056.057538473396 records/second. Loss is 0.4063624393630257.

 

But when I do this:

 

val numExecutors = 1 // args(3).toInt
val numCores     = 1 // args(4).toInt

val trainSetRDD = sc.makeRDD(data).coalesce(numExecutors*numCores, true).coalesce(numExecutors)

 

I get a different worse fit:

 

17/01/18 15:26:54 INFO DistriOptimizer$: [Epoch 80 2400/2500][Iteration 2000][Wall Clock 102.645296308s] Train 100 in 0.03926751seconds. Throughput is 2546.634609630201 records/second. Loss is 0.7909712745636441.

 

I do the coalesce steps because of what I read here:  https://github.com/intel-analytics/BigDL/pull/353 regarding the error above.

 

Is the coalesce advised?  And any solution for the above error ?

 

Thanks

 

V.

 

From: Jason Dai <jaso...@gmail.com>


Date: Wednesday, January 18, 2017 at 4:04 AM
To: Yiheng Wang <yih...@gmail.com>

Cc: "Brzeski, Vadim" <vbrz...@ebay.com>, "Li, Zhichao" <zhich...@intel.com>, "Zhang, Yao" <yao....@intel.com>, "Wan, Yan" <yan...@intel.com>, BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix

 

BTW, Optimizer.optimize() will return a trained model after it's done, while the original model passed to Optimizer is not updated when running on Spark.

To unsubscribe from this group and stop receiving emails from it, send an email to bigdl-user-gro...@googlegroups.com.
To post to this group, send email to bigdl-us...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "BigDL User Group" group.

To unsubscribe from this group and stop receiving emails from it, send an email to bigdl-user-gro...@googlegroups.com.
To post to this group, send email to bigdl-us...@googlegroups.com.


For more options, visit https://groups.google.com/d/optout.

Yiheng Wang

Jan 19, 2017, 12:13:11 AM
to von Brzeski, Vadim, Jason Dai, Li, Zhichao, Zhang, Yao, Wan, Yan, BigDL User Group
Hi Vadim

Something needs to be clarified:

"First, I had to  do “Engine.init(1, 1, true).get” instead of “Engine.init(1, 1, true).get”

Do you mean use Engine.init(1, 1, true) instead of Engine.init(4, 1, true)?

I think the partition number of the RDD from sc.makeRDD(data) may be too small. We have seen similar issues before.

Please pass a partition number to that method, like sc.makeRDD(data, nodeNumber * coreNumber)

And if the coreNumber is 1, you may need to set the executor cores to 1 by --executor-cores 1

If that does not solve the problem, can you provide the whole example code, the Spark version, deploy mode and your spark-submit command? Then we can try to reproduce the issue.

You needn't do the coalesce steps. The root cause is that the input RDD has too few partitions; increasing the input RDD partition count solves this issue in my experience.

Regarding the different loss: I tried your hyper-parameters; the coalesce run goes to loss 0.4 but the non-coalesce run goes to loss 0.6. I think it is random fluctuation in SGD...

Regards

Yiheng




Yiheng Wang

Jan 19, 2017, 12:57:44 AM
to von Brzeski, Vadim, Jason Dai, Li, Zhichao, Zhang, Yao, Wan, Yan, BigDL User Group
Also, please set shuffleLocalityEnabled = false, to see if it resolves the multi-task-on-same-node issue.

Yiheng Wang

Jan 19, 2017, 12:58:39 AM
to von Brzeski, Vadim, Jason Dai, Li, Zhichao, Zhang, Yao, Wan, Yan, BigDL User Group
Sorry, I mean set the property spark.shuffle.reduceLocality.enabled to true.

von Brzeski, Vadim

Jan 19, 2017, 1:30:51 PM
to Yiheng Wang, Jason Dai, Li, Zhichao, Zhang, Yao, Wan, Yan, BigDL User Group

Hi Yiheng –

 

I did as you suggested, and things seem OK on my small sample dataset.  I am now running into the  “Detect multi-task run on one Executor/Container. Currently not support this” issue on my real large dataset.

 

Right before calling optimize(), I do this re-partition as you suggested:

 

val trainSetRDD =
  if (doRepartition) {
    xFit.repartition(numExecutors * numCores)
  } else {
    xFit
  }
log.info("trainSetRDD num partitions = "+trainSetRDD.getNumPartitions)
log.info("trainSetRDD count = "+trainSetRDD.count)

val validationSetRDD =
  if (doRepartition) {
    xVal.repartition(numExecutors * numCores)
  } else {
    xVal
  }

 

log.info("validationSetRDD num partitions = "+validationSetRDD.getNumPartitions)
log.info("validationSetRDD count = "+validationSetRDD.count)

val batchingT = com.ebay.mktgscience.oom.bigDL.OOMBatching(batchSize, sampleShape)
val trainSet = DataSet.rdd(trainSetRDD) -> batchingT
log.info("trainSet.size = "+trainSet.size())

val batchingV = com.ebay.mktgscience.oom.bigDL.OOMBatching(batchSize, sampleShape)
val validationSet = DataSet.rdd(validationSetRDD) -> batchingV
log.info("validationSet.size = "+validationSet.size())

 

And I get this result:

 

17/01/19 10:57:06 INFO OOM_Train_NN$: trainSetRDD num partitions = 500
17/01/19 10:59:16 INFO OOM_Train_NN$: trainSetRDD count = 1085223099
17/01/19 10:59:16 INFO OOM_Train_NN$: validationSetRDD num partitions = 500
17/01/19 10:59:39 INFO OOM_Train_NN$: validationSetRDD count = 120595645
17/01/19 11:00:51 INFO OOM_Train_NN$: trainSet.size = 1085223099
17/01/19 11:01:48 INFO OOM_Train_NN$: validationSet.size = 120595645
17/01/19 11:01:48 INFO DistriOptimizer$: Cache thread models...
17/01/19 11:01:49 ERROR TaskSetManager: Task 469 in stage 91.0 failed 1 times; aborting job
17/01/19 11:01:49 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 469 in stage 91.0 failed 1 times, most recent failure: Lost task 469.0 in stage 91.0 (TID 11391, hdc9-phx04-0160-0117-032.stratus.phx.ebay.com): java.lang.IllegalArgumentException: requirement failed: Detect multi-task run on one Executor/Container. Currently not support this
        at scala.Predef$.require(Predef.scala:233)
        at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$9.apply(DistriOptimizer.scala:366)

 

Here is my entire spark-submit, spark version 1.6.1:

 

***********

 

export OMP_NUM_THREADS=1
export KMP_BLOCKTIME=0
export OMP_WAIT_POLICY=passive
export DL_ENGINE_TYPE=mklblas

HIVEVER=1.2.1000.2.4.2.0-258
HIVEDIR=hive-$HIVEVER
APACHEVER=2.7.1.2.4.2.0-258

MODELNAME=nn_bigDL
N_ITER=100
N_BATCH=500000        # inside OOMBatching, I do: private val batchSize = Utils.getBatchSize(totalBatch)
LEARNING_RATE=0.01
N_HIDDEN=100
N_EXECUTORS=500
N_CORES=1
DO_REPARTITION=true
PRIORMODEL=lm_3

/apache/spark/bin/spark-submit \
   --name "OOM_Model_NN2" \
   --master "yarn" \
   --num-executors $N_EXECUTORS \
   --executor-cores $N_CORES \
   --deploy-mode "cluster" \
   --driver-memory 12G \
   --executor-memory 32G \
   --conf "spark.rpc.askTimeout=240s" \
   --conf "spark.yarn.executor.memoryOverhead=6000" \
   --conf "spark.network.timeout=2000" \
   --conf "spark.executor.heartbeatInterval=60s" \
   --conf "spark.sql.shuffle.partitions=500" \
   --conf "spark.executor.extraJavaOptions=-server -XX:MaxPermSize=1024m -XX:+UseG1GC" \
   --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
   --conf "spark.kryoserializer.buffer=256m" \
   --conf "spark.kryoserializer.buffer.max=1024m" \
   --conf "spark.scheduler.minRegisteredResourcesRatio=1" \
   --conf "spark.yarn.maxAppAttempts=1" \
   --conf "spark.shuffle.reduceLocality.enabled=true" \
   --driver-java-options "-XX:MaxPermSize=2G -XX:+UseG1GC" \
   --driver-library-path "/apache/hadoop/lib/native:/apache/hadoop/lib/native/Linux-amd64-64" \
   --driver-class-path "/apache/hadoop/share/hadoop/common/hadoop-common-$APACHEVER.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.2.4.2.0-258.jar:/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-$APACHEVER.jar" \
   --jars "/apache/hadoop/share/hadoop/common/hadoop-common-$APACHEVER.jar,/apache/hadoop/lib/hadoop-lzo-0.6.0.2.4.2.0-258.jar,/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-$APACHEVER.jar,/apache/hive/lib/hive-metastore-$HIVEVER.jar,/apache/hive/lib/hive-common-$HIVEVER.jar,/apache/spark/lib/datanucleus-api-jdo-3.2.6.jar,/apache/spark/lib/datanucleus-core-3.2.10.jar,/apache/spark/lib/datanucleus-rdbms-3.2.9.jar" \
   --verbose \
   --queue "hddq-exprce-orgmrktg" \
   --files /apache/hive/conf/hive-site.xml \
   --class "com.ebay.mktgscience.oom.bigDL.OOM_Train_NN" \
   ./jars/oom-model-1.0-SNAPSHOT.jar $MODELNAME $N_ITER $N_BATCH $LEARNING_RATE $N_HIDDEN $N_EXECUTORS $N_CORES $DO_REPARTITION $PRIORMODEL

 

************

 

Thanks.


V.

 


ding.ding

Jan 19, 2017, 5:48:08 PM
to BigDL User Group, vbrz...@ebay.com
We will look at it.

BTW, to achieve better performance we don't expect a big partition count for the training data or a big executor number; 500 seems large here. Could you try lowering the executor number and increasing the executor core count accordingly, and reducing the training data partition count by coalesce, like:
xFit.repartition(numExecutors * numCores).coalesce(numExecutors)
In our experience, this avoids the "multi-task run on one Executor/Container" exception and gives good performance.

Hope it helps.

von Brzeski, Vadim

Jan 19, 2017, 7:40:04 PM
to ding.ding, BigDL User Group, Yiheng Wang, Jason Dai

Thanks for the advice.  Already in progress :) with:

 

N_EXECUTORS=200

N_CORES=1

 

(and only doing xFit.repartition(numExecutors * numCores) as before, no coalesce)

 

Seems like it is working! It’s already done multiple rounds of

count at DistriOptimizer.scala:399 

reduce at DistriOptimizer.scala:220

 

Keep fingers crossed!

 

BTW, on another topic: any sample code for computing predictions given a fitted model? I tried it myself, writing a class like Batching, but I got this Exception:

 

scala> val predictor = OOMPredict(batchSize, sampleShape, modelBroadcast.asInstanceOf[Broadcast[Module[Double]]])
predictor: OOMPredict = $iwC$$iwC$OOMPredict@67023824

scala> val predictions = DataSet.rdd(trainSetRDD) -> predictor
java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
                - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@6902e2bf)

 

Thanks.

 

V.

邱鑫

Jan 19, 2017, 9:24:20 PM
to BigDL User Group, vbrz...@ebay.com
Hi, Vadim

https://github.com/intel-analytics/BigDL/blob/master/dl/src/main/scala/com/intel/analytics/bigdl/example/imageclassification
We have two examples for using a fitted model: one is ImagePredictor, the other is ModelValidator.

Bests,
-Xin


On Friday, January 20, 2017 at 8:40:04 AM UTC+8, von Brzeski, Vadim wrote:

von Brzeski, Vadim

Jan 20, 2017, 3:00:09 AM
to 邱鑫, BigDL User Group, Jason Dai

Hi Xin –

 

Thanks.  I tried following the /loadmodel example.

 

I tried saving / loading the model to HDFS, but was not successful.  It seems like the saveModel() and Module.load() methods only operate on the local filesystem (true?).  How do I save / load models when I run a spark-submit job on a cluster?

 

Thanks.

 

V.

Jason Dai

Jan 20, 2017, 3:16:56 AM
to von Brzeski, Vadim, 邱鑫, BigDL User Group
Usually the model is saved/loaded from Spark driver; if one is using yarn-client mode, he or she should be able to access the local file system on the driver node. If one is using yarn-cluster mode, he or she needs to save/load the module from the distributed file system; we just opened an issue to add support for that.
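For the yarn-client case, a sketch of what that looks like from the driver (the method names and overwrite flag are assumptions based on the BigDL version discussed in this thread; verify against your build):

// Save from the driver to the driver's local filesystem...
trainedModel.save("/tmp/sample_nn.bigdl", true) // second arg: overwrite (an assumption)
// ...and load it back later, again on the driver.
val restored = Module.load[Double]("/tmp/sample_nn.bigdl")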

Thanks,
-Jason

邱鑫

Jan 20, 2017, 3:17:27 AM
to BigDL User Group, vbrz...@ebay.com
Hi, Vadim

Yeah, only the local file system. So you need to use yarn-client mode on a YARN cluster; then the driver will run on the machine where you type the spark-submit, and save/load will work properly.
Use "--master yarn --deploy-mode client" to enable yarn-client mode.

Thanks for your reply; I just created issue #396 for this feature.

Bests,
-Xin


On Friday, January 20, 2017 at 4:00:09 PM UTC+8, von Brzeski, Vadim wrote:

Jason Dai

Jan 20, 2017, 6:38:55 AM
to 邱鑫, BigDL User Group, von Brzeski, Vadim
BTW, for your previous example, running BigDL training using a large number (say, 200) of small executors (with only one core each) is actually bad for training speed, as BigDL needs to synchronize the parameters between all the executors. We would suggest using a small number of more powerful executors, e.g., 16 executors each with 25 cores, in BigDL training, so as to minimize network overheads.

Thanks,
-Jason 


von Brzeski, Vadim

Jan 20, 2017, 11:20:23 AM
to Jason Dai, 邱鑫, BigDL User Group

Thanks for the advice!


von Brzeski, Vadim

Jan 22, 2017, 11:12:44 PM
to Jason Dai, BigDL User Group

Jason –

 

On your last comment about executors with many cores: I am limited to 8 cores in my environment (per policy).

Anyway, I have been able to successfully run a few times with 160 executors and 6 cores each (8 cores gives me out-of-memory problems).

 

However, sometimes I get a job failed with this error (happens after about 100 or so iterations; all of a sudden it dies): 

 

17/01/22 14:19:18 ERROR YarnClusterScheduler: Lost executor 38 on phxaishdc9dn0781.phx.ebay.com: Container marked as failed: container_e152_1483654296013_217381_02_000486 on host: phxaishdc9dn0781.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
17/01/22 14:19:18 ERROR TaskSetManager: Task 30 in stage 2458.0 failed 1 times; aborting job
17/01/22 14:19:18 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 30 in stage 2458.0 failed 1 times, most recent failure: Lost task 30.0 in stage 2458.0 (TID 59323, phxaishdc9dn0781.phx.ebay.com): ExecutorLostFailure (executor 38 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_217381_02_000486 on host: phxaishdc9dn0781.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 30 in stage 2458.0 failed 1 times, most recent failure: Lost task 30.0 in stage 2458.0 (TID 59323, phxaishdc9dn0781.phx.ebay.com): ExecutorLostFailure (executor 38 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_217381_02_000486 on host: phxaishdc9dn0781.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1855)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1868)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1881)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
        at org.apache.spark.rdd.RDD.count(RDD.scala:1164)
        at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:236)
        at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:532)
        at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN$.main(OOM_Train_NN.scala:306)
        at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN.main(OOM_Train_NN.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)

 

This last one happened with 200 executors and 6 cores.  (I have been trying bigger and bigger batch sizes too, and thus the 200 number of executors; this last one was 18e6 total batch = 90K per executor so I wouldn’t run out of memory.  How large can I make the batch per executor w/ 40GB executor memory and 6 cores per executor?)

 

V.

 


Jason Dai

Jan 23, 2017, 12:28:14 AM
to von Brzeski, Vadim, BigDL User Group
Hi Vadim,

Looks like your YARN container got killed, most likely due to exceeding memory limits? We use Intel MKL, which can allocate its own memory in native code. Maybe you can try tuning the "spark.yarn.executor.memoryOverhead" config.

Thanks,
-Jason


Jason Dai

Jan 23, 2017, 7:00:36 AM
to von Brzeski, Vadim, BigDL User Group
Another possibility is that there are too many GCs, and the executors are removed by the master because the heartbeat times out. You can check the Spark logs to see if there are such error messages.

Thanks,
-Jason

von Brzeski, Vadim

Jan 23, 2017, 11:17:22 AM
to Jason Dai, BigDL User Group

Indeed – I did see such heartbeat time out messages.  That’s probably it then.

 

Thanks!

 

V.

 


von Brzeski, Vadim

Jan 26, 2017, 8:23:21 PM
to Jason Dai, BigDL User Group

Hi Jason (and all)

 

So I have trained the 1B+ set a few times, but am not getting the results I want, so I am playing around with different parameters, number of hiddens, layers, etc.  Here’s the thing:

 

1. Go with 160 exec, 6 cores each. 40GB per exec, with 8GB overhead. Total batch 30M, batch per exec 187500. 2 ReLU layers, 100 hiddens each. Finishes OK.

2. Make my network a bit more complex: 3 ReLU layers, 100 hiddens each. Same job config. Get this: Container killed by YARN for exceeding memory limits. 48.0 GB of 48 GB physical memory used.

3. OK, so then I go with 200 exec, 5 cores each. Then I get the dreaded: requirement failed: Detect multi-task run on one Executor/Container. Currently not support this

 

Any ideas?

 

V.

 


Jason Dai

Jan 26, 2017, 8:51:21 PM
to von Brzeski, Vadim, BigDL User Group
Hi Vadim,

First I think your batch size seems too large for the learning rate; try lowering the batch size (for instance we used a batch size of 1K~2K when training the Inception model in our example), and tuning the hyper-parameters for better accuracy.

And to address the OOM problem, you can also try reducing the batch size (while keeping the core#) for now. We'll try to reproduce the "multi-task run" issues on our side.

Thanks,
-Jason





von Brzeski, Vadim

Jan 26, 2017, 10:09:56 PM
to Jason Dai, BigDL User Group

Thanks.  Cut down my batch size, trying again.  Sometimes I also get this one:

 

17/01/26 19:57:37 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 715 in stage 25.0 failed 1 times, most recent failure: Lost task 715.0 in stage 25.0 (TID 9276, phxdpehdc9dn2398.stratus.phx.ebay.com): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_279325_01_2122733 on host: phxdpehdc9dn2398.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 715 in stage 25.0 failed 1 times, most recent failure: Lost task 715.0 in stage 25.0 (TID 9276, phxdpehdc9dn2398.stratus.phx.ebay.com): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_279325_01_2122733 on host: phxdpehdc9dn2398.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
                at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
                at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
                at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
                at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
                at scala.Option.foreach(Option.scala:236)
                at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
                at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
                at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
                at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
                at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
                at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1855)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1868)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1881)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
                at org.apache.spark.rdd.RDD.count(RDD.scala:1164)
                at com.intel.analytics.bigdl.dataset.DistributedDataSet$class.transform(DataSet.scala:180)
                at com.intel.analytics.bigdl.dataset.CachedDistriDataSet.transform(DataSet.scala:208)
                at com.intel.analytics.bigdl.dataset.AbstractDataSet$class.$minus$greater(DataSet.scala:91)
                at com.intel.analytics.bigdl.dataset.CachedDistriDataSet.$minus$greater(DataSet.scala:208)
                …

 

This happens during the batching transformation.  Not sure if this is something on your end or on the Spark/Hadoop side.

 

V.

 

 


Jason Dai

unread,
Jan 26, 2017, 11:32:15 PM
to von Brzeski, Vadim, BigDL User Group
Looks like Spark is cancelling the job as the executor (YARN container) is lost - maybe due to YARN killing it? Are there any error messages from the Spark tasks? 

BTW, for the "Detect multi-task run" error, have you set "spark.shuffle.reduceLocality.enabled" to false? That usually fixes the error in our environment. We are currently working on a fix for BigDL to ignore the error and continue running.

Thanks,
-Jason
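For reference, that is an ordinary Spark property, so either of these should work (a minimal sketch, assuming you launch via spark-submit or build the SparkConf yourself):

// on the command line:
//   spark-submit --conf spark.shuffle.reduceLocality.enabled=false ...
// or in code, before the SparkContext is created:
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.reduceLocality.enabled", "false")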


To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: BigDL User Group <bigdl-user-group@googlegroups.com>

To unsubscribe from this group and stop receiving emails from it, send an email to bigdl-user-group+unsubscribe@googlegroups.com.
To post to this group, send email to bigdl-user-group@googlegroups.com.


For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to a topic in the Google Groups "BigDL User Group" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/bigdl-user-group/sIAdlDt71Gc/unsubscribe.

To unsubscribe from this group and all its topics, send an email to bigdl-user-group+unsubscribe@googlegroups.com.
To post to this group, send email to bigdl-user-group@googlegroups.com.

von Brzeski, Vadim

unread,
Jan 27, 2017, 7:11:09 PM
to Jason Dai, BigDL User Group

This

 

Detect multi-task run on one Executor/Container. Currently not support this at scala.Predef$.require(Predef.scala:219)

 

is really killing me now, even after decreasing the batch size.  I even upped the executor cores to 10 (100 executors), with spark.shuffle.reduceLocality.enabled=false.

 

I can’t get a successful run anymore.  About to give up on this until this is fixed – not possible to do any real work in this situation.

 

V.

 

 


dingdi...@gmail.com

unread,
Jan 27, 2017, 7:44:04 PM
to BigDL User Group, jaso...@gmail.com, vbrz...@ebay.com
We have checked in a temporary fix for the "Detect multi-task run" exception; please sync the latest code and call disableCheckSingleton on the optimizer. See https://github.com/intel-analytics/BigDL/blob/master/dl/src/main/scala/com/intel/analytics/bigdl/models/inception/Train.scala

Besides, I have some doubt that 100 executors were really launched in your cluster, given that you hit the exception; could you check?

On Friday, January 27, 2017 at 4:11:09 PM UTC-8, von Brzeski, Vadim wrote:


Jason Dai

unread,
Jan 27, 2017, 8:16:11 PM
to dingdi...@gmail.com, BigDL User Group, von Brzeski, Vadim
Hi Vadim,

You will need to pull our latest code and build the jar for the fix mentioned above; if you are still using the old jar, I think you can do something like:

val optimizer = Optimizer(
  model = model,
  dataset = trainSet,
  criterion = new MSECriterion[Double]()
  // cast to DistriOptimizer to reach the temporary escape hatch, then
  // skip the "Detect multi-task run on one Executor/Container" check:
).asInstanceOf[DistriOptimizer[Double]].disableCheckSingleton()


This should work as long as you don't use validation data (i.e., no Optimizer.setValidation); we are currently working on a fix to have it work on validation data too.

Thanks,
-Jason

von Brzeski, Vadim

unread,
Jan 27, 2017, 11:30:05 PM
to Jason Dai, dingdi...@gmail.com, BigDL User Group

Thanks!!  Will give it a shot. 

 

BTW: after many trials, I went back to my 160 executors / 6 cores setup and it seems to be running on an older cluster we have, which limits us to 8 cores per executor.  I was having some memory issues there with a 3-hidden-layer network and a very large batch, but now that I have lowered my batch, it runs.  This older cluster is running Spark 1.6.1 w/ Scala 2.10.

 

We also have a newer cluster w/ Spark 1.6.2 and Scala 2.11, with higher-memory nodes, where you can have 10-12 cores per executor.  That is where I kept running into the aforementioned bug _all the time_, i.e. no successful runs on the new cluster, and where I will try your patch mentioned below.  I will let you know.

 

Again, thanks for your support.

 

Vadim.

von Brzeski, Vadim

unread,
Jan 29, 2017, 12:58:00 PM
to Jason Dai, BigDL User Group

Hi Jason –

 

The fix is working well, I can now start running jobs OK, but I am running into that “lost” container error.  Here is the trace.  (400 execs x 4 cores, 40GB per exec, 8GB overhead.)  Here’s the network:

 

17/01/29 01:38:19 INFO OOM_Train_NN$: DeepLearning Network =

nn.Sequential {

  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]

  (1): nn.Reshape(69)

  (2): nn.Linear(69 -> 100)

  (3): nn.ReLU

  (4): nn.Linear(100 -> 100)

  (5): nn.ReLU

  (6): nn.Linear(100 -> 100)

  (7): nn.ReLU

  (8): nn.Linear(100 -> 1)

}

17/01/29 01:38:20 INFO DistriOptimizer$: Cache thread models...

17/01/29 01:38:27 INFO DistriOptimizer$: Cache thread models... done

17/01/29 01:38:27 INFO DistriOptimizer$: config  {

                learningRate: 0.01

                maxDropPercentage: 0.0

                momentum: 0.9

                warmupIterationNum: 200

                learningRateDecay: 0.002

                dampening: 0.0

                dropPercentage: 0.0

                comupteThresholdbatchSize: 100

}

 

 

17/01/29 10:12:30 INFO DistriOptimizer$: [Epoch 4 28800000/1085245782][Iteration 688][Wall Clock 21328.862921253s] Train 4800000 in 51.694058478seconds. Throughput is 92853.99795109505 records/second. Loss is 11173.878006300865.

17/01/29 10:13:19 INFO DistriOptimizer$: [Epoch 4 33600000/1085245782][Iteration 689][Wall Clock 21380.556979731s] Train 4800000 in 49.042573606seconds. Throughput is 97874.14580976956 records/second. Loss is 10773.208485912715.

17/01/29 10:13:38 ERROR YarnClusterScheduler: Lost executor 33 on phxdpehdc9dn2651.stratus.phx.ebay.com: Container marked as failed: container_e152_1483654296013_307643_02_000062 on host: phxdpehdc9dn2651.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node

17/01/29 10:13:38 ERROR TaskSetManager: Task 193 in stage 15229.0 failed 1 times; aborting job

17/01/29 10:13:38 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 15229.0 failed 1 times, most recent failure: Lost task 193.0 in stage 15229.0 (TID 570819, phxdpehdc9dn2651.stratus.phx.ebay.com): ExecutorLostFailure (executor 33 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_307643_02_000062 on host: phxdpehdc9dn2651.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node

Driver stacktrace:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 15229.0 failed 1 times, most recent failure: Lost task 193.0 in stage 15229.0 (TID 570819, phxdpehdc9dn2651.stratus.phx.ebay.com): ExecutorLostFailure (executor 33 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_307643_02_000062 on host: phxdpehdc9dn2651.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node

Driver stacktrace:

        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)

        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)

        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)

        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)

        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)

        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)

        at scala.Option.foreach(Option.scala:236)

        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)

        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)

        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)

        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)

        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)

        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1855)

        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1975)

        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1032)

        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)

        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)

        at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)

        at org.apache.spark.rdd.RDD.reduce(RDD.scala:1014)

        at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:220)

        at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:527)

        at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN$.main(OOM_Train_NN.scala:322)

        at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN.main(OOM_Train_NN.scala)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:498)

        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)

 

 

von Brzeski, Vadim

unread,
Jan 30, 2017, 12:38:58 AM
to Jason Dai, BigDL User Group

One more thing – that exception below is also preceded by this:

 

User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 71.0 failed 1 times, most recent failure: Lost task 193.0 in stage 71.0 (TID 13715, spades-0270-1003663.lvs02.eaz.ebayc3.com): java.util.concurrent.ExecutionException: java.util.NoSuchElementException: None.get
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at com.intel.analytics.bigdl.parameters.FutureResult$$anonfun$waitResult$1.apply(AllReduceParameter.scala:220)
        at com.intel.analytics.bigdl.parameters.FutureResult$$anonfun$waitResult$1.apply(AllReduceParameter.scala:220)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
        at scala.collection.Iterator$class.foreach(Iterator.scala:742)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at com.intel.analytics.bigdl.parameters.FutureResult.waitResult(AllReduceParameter.scala:220)
        at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:145)
        at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:125)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:347)
        at scala.None$.get(Option.scala:345)
        at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2$$anonfun$4.apply(AllReduceParameter.scala:139)
        at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2$$anonfun$4.apply(AllReduceParameter.scala:139)
        at scala.Option.getOrElse(Option.scala:121)
        at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2.call(AllReduceParameter.scala:139)
        at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2.call(AllReduceParameter.scala:135)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
Driver stacktrace:

 

 


Jason Dai

unread,
Jan 30, 2017, 5:36:08 AM
to von Brzeski, Vadim, BigDL User Group
Hi Vadim,

These exceptions seem to be caused by losing some Spark executors, probably due to the same problem of heartbeat timeout discussed before? Do you see such error messages? And can you check if there are a lot of GCs? We'll look into the memory consumption problems in our environment.

Thanks,
-Jason



von Brzeski, Vadim

unread,
Jan 30, 2017, 1:53:16 PM
to Jason Dai, BigDL User Group

Thanks.  BTW – I am trying to run this stuff, as I said, on a newer cluster, Spark 1.6.2, Scala 2.11.  So far no luck, and I think there is something messed up with our cluster (which we’re looking into), but here’s a new one – never seen this one before :)

 

Caused by: java.lang.Exception: Please initialize AllReduceParameter first!!

                at com.intel.analytics.bigdl.parameters.AllReduceParameter.readGradientPartition(AllReduceParameter.scala:93)

                at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition$lzycompute(AllReduceParameter.scala:61)

                at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition(AllReduceParameter.scala:61)

                at com.intel.analytics.bigdl.parameters.AllReduceParameter.aggregrateGradientParition(AllReduceParameter.scala:185)

                at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$optimize$1.apply(DistriOptimizer.scala:227)

                at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$optimize$1.apply(DistriOptimizer.scala:225)

                at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)

                at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)

                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)

                at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)

                at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

                at org.apache.spark.scheduler.Task.run(Task.scala:89)

                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)

                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

                at java.lang.Thread.run(Thread.java:745)

 

 


dingdi...@gmail.com

unread,
Jan 30, 2017, 2:43:35 PM
to BigDL User Group, jaso...@gmail.com, vbrz...@ebay.com
I think this exception has the same root cause as the "java.util.NoSuchElementException" in AllReduceParameter; it should be caused by losing some Spark executors.

On Monday, January 30, 2017 at 10:53:16 AM UTC-8, von Brzeski, Vadim wrote:


von Brzeski, Vadim

unread,
Feb 1, 2017, 12:44:18 PM
to dingdi...@gmail.com, BigDL User Group, jaso...@gmail.com

Are we sure this AllReduceParameter error is because of losing some executors?

 

I see this in the log (it is clean, no error messages until that one).  I am also now training with 10% of my data, and a batch size per executor of 3000.

 

2017-02-01 10:28:55 INFO  OOM_Train_NN$:295 - DeepLearning Network =

nn.Sequential {

  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]

  (1): nn.Reshape(69)

  (2): nn.Linear(69 -> 70)

  (3): nn.ReLU

  (4): nn.Linear(70 -> 70)

  (5): nn.ReLU

  (6): nn.Linear(70 -> 70)

  (7): nn.ReLU

  (8): nn.Linear(70 -> 1)

}

2017-02-01 10:28:55 INFO  DistriOptimizer$:400 - Cache thread models...

2017-02-01 10:28:58 INFO  DistriOptimizer$:402 - Cache thread models... done

2017-02-01 10:28:58 INFO  DistriOptimizer$:89 - config  {

                learningRate: 0.01

                maxDropPercentage: 0.0

                momentum: 0.9

                warmupIterationNum: 200

                learningRateDecay: 0.002

                dampening: 0.0

                dropPercentage: 0.0

                comupteThresholdbatchSize: 100

}

2017-02-01 10:28:58 INFO  DistriOptimizer$:90 - Shuffle data

2017-02-01 10:28:58 INFO  DistriOptimizer$:93 - Shuffle data complete. Takes 0.032758708s

2017-02-01 10:29:18 INFO  DistriOptimizer$:241 - [Epoch 1 0/108496693][Iteration 1][Wall Clock 0.0s] Train 1200000 in 15.646894009seconds. Throughput is 76692.53714569595 records/second. Loss is 10817.976028867404.

2017-02-01 10:29:37 INFO  DistriOptimizer$:241 - [Epoch 1 1200000/108496693][Iteration 2][Wall Clock 15.646894009s] Train 1200000 in 19.144417971seconds. Throughput is 62681.456381581425 records/second. Loss is 14067.692619235619.

2017-02-01 10:29:54 INFO  DistriOptimizer$:241 - [Epoch 1 2400000/108496693][Iteration 3][Wall Clock 34.79131198s] Train 1200000 in 16.798147873seconds. Throughput is 71436.4469864433 records/second. Loss is 9435.186635609934.

2017-02-01 10:30:17 ERROR TaskSetManager:74 - Task 297 in stage 148.0 failed 1 times; aborting job

2017-02-01 10:30:17 ERROR ApplicationMaster:95 - User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 297 in stage 148.0 failed 1 times, most recent failure: Lost task 297.0 in stage 148.0 (TID 22742, spades-0334-1027666.lvs02.eaz.ebayc3.com): java.lang.Exception: Please initialize AllReduceParameter first!!

                at com.intel.analytics.bigdl.parameters.AllReduceParameter.readGradientPartition(AllReduceParameter.scala:93)

                at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition$lzycompute(AllReduceParameter.scala:61)

                at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition(AllReduceParameter.scala:61)

 

 



dingdi...@gmail.com

unread,
Feb 1, 2017, 2:36:24 PM
to BigDL User Group, dingdi...@gmail.com, jaso...@gmail.com, vbrz...@ebay.com
When the program complains "Please initialize AllReduceParameter first", it failed to find the parameter in the block manager. However, the parameter is initialized and put in the block manager before training, and it is never removed after that. So we think the executor was lost and the task had to be scheduled on a different executor. We will try to repro this problem in our env. Besides, I was wondering if you could send us the Spark log (stderr) from the executor which threw the exception (from the log, it should be spades-0334-1027666.lvs02.eaz.ebayc3.com) and the application master log.
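(Assuming YARN log aggregation is enabled on the cluster, the aggregated logs for a finished application, including the ApplicationMaster log and each container's stderr, can usually be fetched with: yarn logs -applicationId <application id>.)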

On Wednesday, February 1, 2017 at 9:44:18 AM UTC-8, von Brzeski, Vadim wrote:


Jason Dai

unread,
Feb 2, 2017, 6:57:48 AM
to dingdi...@gmail.com, BigDL User Group, von Brzeski, Vadim
Hi Vadim,

This does look very weird - it seems that BigDL fails to find the local gradient in the local Spark block manager after it successfully trains two batches, but there are no executor failures. It is hard to imagine why (unless something is cleaning up blocks from the block manager, or a partition is moved to a new executor ...) - are there any messages related to dropping blocks because of memory pressure?

I wonder if you can try a few things:

1) Build BigDL using Scala 2.11 (see https://github.com/intel-analytics/BigDL/wiki/Build-Page); the jar released on Maven is built with Scala 2.10 for Spark 1.6 - not sure if that will cause any problems.

2) Set "spark.shuffle.reduceLocality.enabled" to false, so that Spark can assign each task to a different executor for training as much as possible.

3) Maybe set "spark.locality.wait" to a larger value (e.g., 10s), so that Spark will assign each partition to where it is cached; see the sketch after this list.
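A minimal sketch of (2) and (3) together, assuming the settings are applied before the SparkContext is created (the 10s value is just the example above, not a tuned recommendation):

val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.reduceLocality.enabled", "false") // (2) let tasks spread across executors
  .set("spark.locality.wait", "10s")                    // (3) wait longer for the executor caching the partition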

We will try to reproduce the problem using your configs (Spark, Scala, YARN, etc.).

Thanks,
-Jason

von Brzeski, Vadim

unread,
Feb 3, 2017, 12:53:12 AM
to Jason Dai, dingdi...@gmail.com, BigDL User Group

Hi Jason –

 

Looks like things have stabilized – I am now able to run jobs consistently.  I had already done (1) and (2) from your list above.  I think what has made the difference is the following:

 

a)       Your latest patch

b)       Running with a much smaller batch size (2K – 3K per executor)

c)       150 nodes, 6 cores per node

d)       training on a fraction of my 1.09B records; right now I have done a few runs with up to 30% of the data (~300M records), and I continue pushing it further...

 

Thanks for all your help.  If I am ultimately successful with this, I will definitely advertise BigDL and your support here at eBay.

 

V.

 


Jason Dai

unread,
Feb 3, 2017, 3:39:28 AM
to von Brzeski, Vadim, dingdi...@gmail.com, BigDL User Group
Great! We'll try to look into the memory consumption issues on our side.

Thanks,
-Jason
