I have the following simple NN training code (just getting my feet wet), modeled after the example / tutorials here https://github.com/intel-analytics/BigDL/wiki/Getting-Started.
All is well until I hit the actual optimize() call – see below – when this error occurs:
17/01/13 17:56:15 ERROR ThreadPool$: Error: java.lang.IllegalArgumentException: requirement failed: input must be vector or matrix
at scala.Predef$.require(Predef.scala:233)
at com.intel.analytics.bigdl.nn.Linear.updateOutput(Linear.scala:66)
at com.intel.analytics.bigdl.nn.Linear.updateOutput(Linear.scala:29)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:33)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(LocalOptimizer.scala:116)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$invokeAndWait$1$$anonfun$apply$2.apply(Engine.scala:103)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Here is my code:
*************************
val sc = new SparkContext(
  Engine.init(1, 1, true).get
    .setAppName("Sample_NN")
    .set("spark.akka.frameSize", 64.toString)
    .set("spark.task.maxFailures", "1")
)

// make up some data
val data = (0 to 100).collect {
  case i if i > 75 || i < 25 =>
    (0 to 100).collect {
      case j if j > 75 || j < 25 =>
        val res =
          if (i > 75 && j < 25) 23.0
          else if (i < 25 && j > 75) -45
          else 0
        (Array(i / 100.0 + 1, j / 100.0 + 2), res)
    }
}.flatMap(x => x)
val batchSize = 4
val trainSet = DataSet.array(data.toArray).transform(ToSample(1,2)).transform(SampleToBatch(batchSize))
val validationSet = trainSet
val layer1 = Linear[Double](2,4)
val layer2 = ReLU[Double]()
val output = Sum[Double]()
val model = Sequential[Double]().
add(layer1).
add(layer2).
add(output)
val state = T(
  "learningRate" -> 0.01,
  "weightDecay" -> 0.0005,
  "momentum" -> 0.9,
  "dampening" -> 0.0
)
val optimizer = Optimizer(
model = model,
dataset = trainSet,
criterion = new MSECriterion[Double]()
)
optimizer.
  setState(state).
  // setValidation(Trigger.everyEpoch, validationSet, Array(new Loss[Double])).
  setOptimMethod(new Adagrad[Double]()).
  optimize()
**************************
SampleToBatch is the one in BigDL here: https://github.com/intel-analytics/BigDL/blob/master/dl/src/main/scala/com/intel/analytics/bigdl/dataset/Transformer.scala
as is Sample, here: https://github.com/intel-analytics/BigDL/blob/master/dl/src/main/scala/com/intel/analytics/bigdl/dataset/Types.scala
ToSample is my own, defined as follows:
******************
object ToSample {
  def apply(nRows: Int, nCols: Int): ToSample =
    new ToSample(nRows, nCols)
}

class ToSample(nRows: Int, nCols: Int)
  extends Transformer[(Array[Double], Double), Sample[Double]] {

  private val buffer = new Sample[Double]()
  private var featureBuffer: Array[Double] = null
  private var labelBuffer: Array[Double] = null

  override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[Sample[Double]] = {
    prev.map(x => {
      if (featureBuffer == null || featureBuffer.length < nRows * nCols) {
        featureBuffer = new Array[Double](nRows * nCols)
      }
      if (labelBuffer == null || labelBuffer.length < nRows) {
        labelBuffer = new Array[Double](nRows)
      }
      var i = 0
      while (i < nRows) {
        Array.copy(x._1, 0, featureBuffer, i * nCols, nCols)
        labelBuffer(i) = x._2
        i += 1
      }
      buffer.copy(featureBuffer, labelBuffer, Array(nRows, nCols), Array(nRows))
    })
  }
}
********************
Inspecting the MiniBatch, all seems fine:
scala> val q = trainSet.toLocal.data(false)
q: Iterator[com.intel.analytics.bigdl.dataset.MiniBatch[Double]] = non-empty iterator
scala> val z = q.next()
z: com.intel.analytics.bigdl.dataset.MiniBatch[Double] =
MiniBatch((1,.,.) =
1.0 2.0
(2,.,.) =
1.0 2.01
(3,.,.) =
1.0 2.02
(4,.,.) =
1.0 2.03
[com.intel.analytics.bigdl.tensor.DenseTensor of size 4x1x2],0.0
0.0
0.0
0.0
[com.intel.analytics.bigdl.tensor.DenseTensor of size 4x1])
so I can’t understand why I am getting the above error.
Any help would be appreciated.
Thanks.
Vadim.
Thanks Yiheng! That worked.
Vadim.
Spoke too soon :(
It seems the solution works, but only if batchSize equals the number of outputs in the Linear layer (??), i.e.:
Works OK:
val batchSize = 6
val sampleShape = Array(1,2)
val trainSet = DataSet.array(data.toArray).transform(ToSample(1,2)).transform(SampleToBatch(batchSize))
val validationSet = trainSet
val layer1 = Linear[Double](2,6)
val layer2 = ReLU[Double]()
val output = Sum[Double]()
val model = Sequential[Double]().
add(Reshape(Array(2))).
…etc.
Fails – see exception below:
val batchSize = 10
val sampleShape = Array(1,2)
val trainSet = DataSet.array(data.toArray).transform(ToSample(1,2)).transform(SampleToBatch(batchSize))
val validationSet = trainSet
val layer1 = Linear[Double](2,6)
val layer2 = ReLU[Double]()
val output = Sum[Double]()
val model = Sequential[Double]().
add(Reshape(Array(2))).
…etc.
17/01/16 19:51:22 ERROR ThreadPool$: Error: java.lang.IllegalArgumentException: requirement failed: inconsistent tensor size
at scala.Predef$.require(Predef.scala:233)
at com.intel.analytics.bigdl.tensor.DenseTensorApply$.apply2(DenseTensorApply.scala:63)
at com.intel.analytics.bigdl.tensor.DenseTensor.map(DenseTensor.scala:396)
at com.intel.analytics.bigdl.nn.MSECriterion.updateOutput(MSECriterion.scala:33)
at com.intel.analytics.bigdl.nn.MSECriterion.updateOutput(MSECriterion.scala:27)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractCriterion.forward(AbstractCriterion.scala:43)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(LocalOptimizer.scala:117)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:110)
at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$invokeAndWait$1$$anonfun$apply$2.apply(Engine.scala:103)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Date: Friday, January 13, 2017 at 9:59 PM
To: Yan Wan <yan...@intel.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>, "Brzeski, Vadim" <vbrz...@ebay.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
.add(Reshape(Array(batchSize*1, 2))).
Hi,
Modules that support batching and can take a multi-dimensional input tensor (with no restriction on the number of input dimensions) usually have an argument named `nInputDims`.
This argument is "the number of dimensions of a single input". Because the input tensor can have any number of dimensions, the module cannot automatically infer whether the user is passing a batch, or what the actual input dimensions are once the batch dimension is excluded. Hence `nInputDims` gives the number of dimensions of one input, excluding the batch dimension. In your case, the `nInputDims` of the Sum layer should be 1, because the input size is batchSize * 6 (one dimension excluding the batch dimension).
The sum layer should be `val output = Sum(nInputDims = 1)`.
So your code should be something like:
“””
import com.intel.analytics.bigdl.numeric.NumericDouble
val batchSize = 10
val trainSet = DataSet.array(data.toArray) -> ToSample(1, 2) -> SampleToBatch(batchSize)
val layer1 = Linear(2, 6)
val layer2 = ReLU()
val output = Sum(nInputDims = 1)
val model = Sequential()
.add(Reshape(Array(2)))
.add(layer1)
.add(layer2)
.add(output)
“””
As for why `batchSize = 6` runs without error: the Sum layer receives a 6 x 6 tensor and sums along the first dimension, so the criterion gets a 1-dimensional tensor of size 6. Because that size happens to equal the number of labels (= batchSize), no error is raised, but the logic is still not correct. With `batchSize = 10`, the criterion gets a 1-dimensional tensor of size 6 (the number of Linear outputs), which cannot match the 10 labels, so the run fails. (See the shape check after the full code below.)
Hope that helps.
Best regards
Yao
Full worked code attached:
“””
object ToSample {
def apply(nRows: Int, nCols: Int)
: ToSample =
new ToSample(nRows, nCols)
}
class ToSample(nRows: Int, nCols: Int)
extends Transformer[(Array[Double], Double), Sample[Double]] {
private val buffer = new Sample[Double]()
private var featureBuffer: Array[Double] = null
private var labelBuffer: Array[Double] = null
override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[Sample[Double]] = {
prev.map(x => {
if (featureBuffer == null || featureBuffer.length < nRows * nCols) {
featureBuffer = new Array[Double](nRows * nCols)
}
if (labelBuffer == null || labelBuffer.length < nRows) {
labelBuffer = new Array[Double](nRows)
}
var i = 0
while (i < nRows) {
Array.copy(x._1, 0, featureBuffer, i * nCols, nCols)
labelBuffer(i) = x._2
i += 1
}
buffer.copy(featureBuffer, labelBuffer,
Array(nRows, nCols), Array(nRows))
})
}
}
// make up some data
val data = (0 to 100).collect {
case i if i > 75 || i < 25 ⇒
(0 to 100).collect {
case j if j > 75 || j < 25 ⇒
val res =
if (i > 75 && j < 25) 23.0
else if (i < 25 && j > 75) -45
else 0
(Array(i / 100.0 + 1, j / 100.0 + 2), res)
}
}.flatten
val sc = new SparkContext(
Engine.init(1, 1, true).get
.setAppName("Sample_NN")
.set("spark.akka.frameSize", 64.toString)
.set("spark.task.maxFailures", "1")
.setMaster("local[4]")
)
import com.intel.analytics.bigdl.numeric.NumericDouble
val batchSize = 10
val trainSet = DataSet.array(data.toArray) -> ToSample(1, 2) -> SampleToBatch(batchSize)
val layer1 = Linear(2, 6)
val layer2 = ReLU()
val output = Sum(nInputDims = 1)
val model = Sequential()
.add(Reshape(Array(2)))
.add(layer1)
.add(layer2)
.add(output)
val state =
T(
"learningRate" -> 0.01,
"weightDecay" -> 0.0005,
"momentum" -> 0.9,
"dampening" -> 0.0
)
val optimizer = Optimizer(
model = model,
dataset = trainSet,
criterion = new MSECriterion[Double]()
)
optimizer.
setState(state).
// setValidation(Trigger.everyEpoch, validationSet, Array(new Loss[Double])).
setOptimMethod(new Adagrad[Double]()).
optimize()
“””
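To make the shape mismatch described above concrete, here is a small check of my own (a sketch, not from the thread; it assumes the BigDL Tensor/Sum API used elsewhere in this message):

import com.intel.analytics.bigdl.nn.Sum
import com.intel.analytics.bigdl.tensor.Tensor

val input = Tensor[Double](10, 6).rand()  // batchSize = 10 rows, 6 Linear outputs per row

// Default Sum collapses dimension 1, which here is the batch dimension, so the
// criterion would end up comparing 6 summed values against 10 labels.
println(Sum[Double]().forward(input).size().mkString("x"))

// With nInputDims = 1, dimension 1 is treated as the batch and the feature
// dimension is summed instead, giving one value per sample (10 of them).
println(Sum[Double](nInputDims = 1).forward(input).size().mkString("x"))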
Hi –
The above suggestions worked (thanks Yao), but now I have a different problem.
When I train locally using the code (the one in the above thread and the working example you provided), all is fine. I get reasonable model weights (model.getParameters returns non-zero values for weights and biases), and a good fit in training.
But when I try the same on RDD, I get all biases = 0 and a bad fit.
Again, the network:
val layer1 = Linear[Double](dimInput,nHidden)
val layer2 = ReLU[Double]()
val layer3 = Linear[Double](nHidden,nHidden)
val layer4 = ReLU[Double]()
val output = Linear[Double](nHidden,1) //Sum[Double](nInputDims = 1)
val model = Sequential[Double]().
add(Reshape(Array(dimInput))).
add(layer1).
add(layer2).
add(layer3).
add(layer4).
add(output)
Local:
val trainSet = DataSet.array(data.toArray).transform(ToSample(1,dimInput)).transform(SampleToBatch(batchSize))
leads to this:
scala> println(model.getParameters())
(0.6438069051434786
-0.5983624270225454
-1.222042899057738
1.2080124683211286
0.09588958746823727
…..
….
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67],-0.029515928487775192
-0.08776935671924736
0.05927649370933306
0.18495866162243968
-0.003225973822120612
-0.008541260257896508
….
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67])
and a good fit.
But with RDD:
val sampleShape = Array(1,dimInput)
val batching = OOMBatching(batchSize, sampleShape)
val trainSetRDD = sc.makeRDD(data).coalesce(numExecutors*numCores, true).coalesce(numExecutors)
val trainSet = DataSet.rdd(trainSetRDD) -> batching
where OOMBatching:
object OOMBatching {
  def apply(batchSize: Int, sampleShape: Array[Int]): OOMBatching =
    new OOMBatching(batchSize, sampleShape)
}

/**
 * Batching samples into mini-batch
 * @param batchSize The desired mini-batch size.
 * @param sampleShape Shape of the training sample
 */
class OOMBatching(batchSize: Int, sampleShape: Array[Int]) extends
  Transformer[(Array[Double], Double), MiniBatch[Double]] {

  override def apply(prev: Iterator[(Array[Double], Double)]): Iterator[MiniBatch[Double]] = {
    new Iterator[MiniBatch[Double]] {
      private val featureTensor: Tensor[Double] = Tensor[Double]()
      private val labelTensor: Tensor[Double] = Tensor[Double]()
      private var featureData: Array[Double] = null
      private var labelData: Array[Double] = null
      private val featureLength = sampleShape.product
      private val labelLength = 1

      override def hasNext: Boolean = prev.hasNext

      override def next(): MiniBatch[Double] = {
        if (prev.hasNext) {
          var i = 0
          while (i < batchSize && prev.hasNext) {
            val sample = prev.next()
            if (featureData == null || featureData.length < batchSize * featureLength) {
              featureData = new Array[Double](batchSize * featureLength)
            }
            if (labelData == null || labelData.length < batchSize * labelLength) {
              labelData = new Array[Double](batchSize * labelLength)
            }
            Array.copy(sample._1, 0, featureData, i * featureLength, featureLength)
            labelData(i) = sample._2
            i += 1
          }
          featureTensor.set(Storage[Double](featureData), storageOffset = 1,
            sizes = Array(i) ++ sampleShape)
          labelTensor.set(Storage[Double](labelData), storageOffset = 1,
            sizes = Array(i, 1))
          MiniBatch(featureTensor, labelTensor)
        } else {
          null
        }
      }
    }
  }
}
leads to this: all bias parameters exactly 0:
scala> print(model.getParameters)
(0.5835011785523165
-0.29177034775166744
-0.19938245877956728
0.17784682057321344
0.5514315980564399
….
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67],0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
….
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
[com.intel.analytics.bigdl.tensor.DenseTensor of size 67])
Thanks again!
V.
I guess this is due to an inconsistent batch size.
The "batchSize" within OOMBatching is the batch per node, i.e. if your cluster size is 4, then the total batch size is batchSize * 4. How about specifying the batchSize within OOMBatching as "batchSize used in local" / cluster_size?
Thanks,
Zhichao
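To spell out that arithmetic, a back-of-the-envelope sketch (the variable names are mine, purely illustrative):

// Each executor builds its own MiniBatch, so the batchSize passed to
// OOMBatching is a per-node value, not the global one.
val clusterSize = 4          // number of executors (illustrative)
val localBatchSize = 100     // the batch size that worked in local mode
val perNodeBatchSize = localBatchSize / clusterSize   // what OOMBatching should use
val effectiveGlobalBatch = perNodeBatchSize * clusterSize
// effectiveGlobalBatch == 100, i.e. the same global batch as the local run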
Hi Jason, Zhichao –
Thanks, but I don’t think that’s it.
1) When I tried it, I had numExecutors = numCores = 1, and batchSize = 100.
2) Then I tried numExecutors = 4, numCores = 1, and with Utils.getBatchSize(totalBatch), this time with batchSize = 400 → same result.
Here's my Engine conf and data generation mechanism below, in case you want to try it. Like I said, it works great in local mode.
(BTW: where I'm headed with this: my real dataset (for a regression problem) has 1B+ rows and 70 columns. Am I delusional trying something like this with Big-DL?)
V.
val sc = new SparkContext(
Engine.init(numExecutors, numCores, true).get
.setAppName("Sample_NN")
.set("spark.akka.frameSize", 64.toString)
.set("spark.task.maxFailures", "1")
.set("spark.scheduler.minRegisteredResourcesRatio", "1")
)
sc.setLogLevel("ERROR")
val dimInput = 2
val data = (0 to 100).collect {
case i if i > 75 || i < 25 =>
(0 to 100).collect {
case j if j > 75 || j < 25 =>
val res =
if (i > 75 && j < 25) 2.0
else if (i < 25 && j > 75) -4.0
else 0
(Array(i / 100.0, j / 100.0), res)
}
}.flatten
val sampleShape = Array(1,dimInput)
From: Jason Dai <jaso...@gmail.com>
Date: Tuesday, January 17, 2017 at 9:14 PM
To: "Li, Zhichao" <zhich...@intel.com>
Cc: "Brzeski, Vadim" <vbrz...@ebay.com>, "Zhang, Yao" <yao....@intel.com>, Yiheng Wang <yih...@gmail.com>, "Wan, Yan" <yan...@intel.com>, BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim,
Hi guys –
Thanks for your help so far and your offer to help in future.
Yiheng (and all) – here is what I found:
First, I had to do “Engine.init(1, 1, true).get” instead of “Engine.init(1, 1, true).get” because sometimes I get this error: requirement failed: Detect multi-task run on one Executor/Container. Currently not support this.
I also set nHidden = 6, num iterations = 2000, learning rate = 0.005
Anyway, when I do this:
val trainSetRDD = sc.makeRDD(data)
I get this fit:
17/01/18 15:30:15 INFO DistriOptimizer$: [Epoch 80 2400/2500][Iteration 2000][Wall Clock 96.279817182s] Train 100 in 0.032721897seconds. Throughput is 3056.057538473396 records/second. Loss is 0.4063624393630257.
But when I do this:
val numExecutors = 1 // args(3).toInt
val numCores = 1 // args(4).toInt
val trainSetRDD = sc.makeRDD(data).coalesce(numExecutors*numCores, true).coalesce(numExecutors)
I get a different, worse fit:
17/01/18 15:26:54 INFO DistriOptimizer$: [Epoch 80 2400/2500][Iteration 2000][Wall Clock 102.645296308s] Train 100 in 0.03926751seconds. Throughput is 2546.634609630201 records/second. Loss is 0.7909712745636441.
I do the coalesce steps because of what I read here: https://github.com/intel-analytics/BigDL/pull/353 regarding the error above.
Is the coalesce advised? And any solution for the above error ?
Thanks
V.
From: Jason Dai <jaso...@gmail.com>
Date: Wednesday, January 18, 2017 at 4:04 AM
To: Yiheng Wang <yih...@gmail.com>
Cc: "Brzeski, Vadim" <vbrz...@ebay.com>, "Li, Zhichao" <zhich...@intel.com>, "Zhang, Yao" <yao....@intel.com>, "Wan, Yan" <yan...@intel.com>, BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
BTW, Optimizer.optimize() will return a trained model after it's done, while the original model passed to Optimizer is not updated when running on Spark.
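A minimal sketch of what that means in practice (variable names reused from the code above; the key point is to read parameters from the returned module):

// On Spark (DistriOptimizer), the trained weights live in the copy that
// optimize() returns, not in the original `model` object.
val trainedModel = optimizer.
  setState(state).
  setOptimMethod(new Adagrad[Double]()).
  optimize()

println(trainedModel.getParameters())  // trained weights and biases
println(model.getParameters())         // may still show the untrained values on Spark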
Set spark.shuffle.reduceLocality.enabled to true.
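Presumably that flag goes on the SparkConf built from Engine.init (the same pattern used earlier in this thread), or equivalently as a --conf on spark-submit; a sketch, reusing the numExecutors/numCores values from the code below:

val conf = Engine.init(numExecutors, numCores, true).get
  .setAppName("Sample_NN")
  .set("spark.shuffle.reduceLocality.enabled", "true")
val sc = new SparkContext(conf)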
Hi Yiheng –
I did as you suggested, and things seem OK on my small sample dataset. I am now running into the “Detect multi-task run on one Executor/Container. Currently not support this” issue on my real large dataset.
Right before calling optimize(), I do this re-partition as you suggested:
val trainSetRDD =
  if (doRepartition) {
    xFit.repartition(numExecutors * numCores)
  } else {
    xFit
  }
log.info("trainSetRDD num partitions = "+trainSetRDD.getNumPartitions)
log.info("trainSetRDD count = "+trainSetRDD.count)
val validationSetRDD =
if (doRepartition) {
xVal.repartition(numExecutors * numCores)
} else {
xVal
}
log.info("validationSetRDD num partitions = "+validationSetRDD.getNumPartitions)
log.info("validationSetRDD count = "+validationSetRDD.count)
val batchingT = com.ebay.mktgscience.oom.bigDL.OOMBatching(batchSize, sampleShape)
val trainSet = DataSet.rdd(trainSetRDD) -> batchingT
log.info("trainSet.size = "+trainSet.size())
val batchingV = com.ebay.mktgscience.oom.bigDL.OOMBatching(batchSize, sampleShape)
val validationSet = DataSet.rdd(validationSetRDD) -> batchingV
log.info("validationSet.size = "+validationSet.size())
And I get this result:
17/01/19 10:57:06 INFO OOM_Train_NN$: trainSetRDD num partitions = 500
17/01/19 10:59:16 INFO OOM_Train_NN$: trainSetRDD count = 1085223099
17/01/19 10:59:16 INFO OOM_Train_NN$: validationSetRDD num partitions = 500
17/01/19 10:59:39 INFO OOM_Train_NN$: validationSetRDD count = 120595645
17/01/19 11:00:51 INFO OOM_Train_NN$: trainSet.size = 1085223099
17/01/19 11:01:48 INFO OOM_Train_NN$: validationSet.size = 120595645
17/01/19 11:01:48 INFO DistriOptimizer$: Cache thread models...
17/01/19 11:01:49 ERROR TaskSetManager: Task 469 in stage 91.0 failed 1 times; aborting job
17/01/19 11:01:49 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 469 in stage 91.0 failed 1 times, most recent failure: Lost task 469.0 in stage 91.0 (TID 11391, hdc9-phx04-0160-0117-032.stratus.phx.ebay.com): java.lang.IllegalArgumentException: requirement failed: Detect multi-task run on one Executor/Container. Currently not support this
at scala.Predef$.require(Predef.scala:233)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$9.apply(DistriOptimizer.scala:366)
Here is my entire spark-submit, spark version 1.6.1:
***********
export OMP_NUM_THREADS=1
export KMP_BLOCKTIME=0
export OMP_WAIT_POLICY=passive
export DL_ENGINE_TYPE=mklblas
HIVEVER=1.2.1000.2.4.2.0-258
HIVEDIR=hive-$HIVEVER
APACHEVER=2.7.1.2.4.2.0-258
MODELNAME=nn_bigDL
N_ITER=100
N_BATCH=500000 # inside OOMBatching, I do: private val batchSize = Utils.getBatchSize(totalBatch)
LEARNING_RATE=0.01
N_HIDDEN=100
N_EXECUTORS=500
N_CORES=1
DO_REPARTITION=true
PRIORMODEL=lm_3
/apache/spark/bin/spark-submit \
--name "OOM_Model_NN2" \
--master "yarn" \
--num-executors $N_EXECUTORS \
--executor-cores $N_CORES \
--deploy-mode "cluster" \
--driver-memory 12G \
--executor-memory 32G \
--conf "spark.rpc.askTimeout=240s" \
--conf "spark.yarn.executor.memoryOverhead=6000" \
--conf "spark.network.timeout=2000" \
--conf "spark.executor.heartbeatInterval=60s" \
--conf "spark.sql.shuffle.partitions=500" \
--conf "spark.executor.extraJavaOptions=-server -XX:MaxPermSize=1024m -XX:+UseG1GC" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.kryoserializer.buffer=256m" \
--conf "spark.kryoserializer.buffer.max=1024m" \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.yarn.maxAppAttempts=1" \
--conf "spark.shuffle.reduceLocality.enabled=true" \
--driver-java-options "-XX:MaxPermSize=2G -XX:+UseG1GC"\
--driver-library-path "/apache/hadoop/lib/native:/apache/hadoop/lib/native/Linux-amd64-64" \
--driver-class-path "/apache/hadoop/share/hadoop/common/hadoop-common-$APACHEVER.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.2.4.2.0-258.jar:/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-$APACHEVER.jar" \
--jars "/apache/hadoop/share/hadoop/common/hadoop-common-$APACHEVER.jar,/apache/hadoop/lib/hadoop-lzo-0.6.0.2.4.2.0-258.jar,/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-$APACHEVER.jar,/apache/hive/lib/hive-metastore-$HIVEVER.jar,/apache/hive/lib/hive-common-$HIVEVER.jar,/apache/spark/lib/datanucleus-api-jdo-3.2.6.jar,/apache/spark/lib/datanucleus-core-3.2.10.jar,/apache/spark/lib/datanucleus-rdbms-3.2.9.jar" \
--verbose \
--queue "hddq-exprce-orgmrktg" \
--files /apache/hive/conf/hive-site.xml \
--class "com.ebay.mktgscience.oom.bigDL.OOM_Train_NN" \
./jars/oom-model-1.0-SNAPSHOT.jar $MODELNAME $N_ITER $N_BATCH $LEARNING_RATE $N_HIDDEN $N_EXECUTORS $N_CORES $DO_REPARTITION $PRIORMODEL
************
Thanks.
V.
From: <bigdl-us...@googlegroups.com> on behalf of Yiheng Wang <yih...@gmail.com>
Date: Wednesday, January 18, 2017 at 9:13 PM
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: Jason Dai <jaso...@gmail.com>, "Li, Zhichao" <zhich...@intel.com>, "Zhang, Yao" <yao....@intel.com>, "Wan, Yan" <yan...@intel.com>, BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim
One thing I want to clarify:
"First, I had to do “Engine.init(1, 1, true).get” instead of “Engine.init(1, 1, true).get”
Do you mean use Engine.init(1, 1, true) instead of Engine.init(4, 1, true)?
Thanks for the advice. Already in progress :) with:
N_EXECUTORS=200
N_CORES=1
(and only doing xFit.repartition(numExecutors * numCores) as before, no coalesce)
Seems like it is working! It's already done multiple rounds of
count at DistriOptimizer.scala:399
reduce at DistriOptimizer.scala:220
Keep fingers crossed!
BTW, on another topic: any sample code for computing predictions given a fitted model? I tried it myself, writing a class like Batching, but I got this exception:
scala> val predictor = OOMPredict(batchSize, sampleShape, modelBroadcast.asInstanceOf[Broadcast[Module[Double]]])
predictor: OOMPredict = $iwC$$iwC$OOMPredict@67023824
scala> val predictions = DataSet.rdd(trainSetRDD) -> predictor
java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@6902e2bf)
Thanks.
V.
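Regarding the prediction question above: the NotSerializableException usually means the closure captured the SparkContext itself. A minimal sketch of one way around it, broadcasting the trained model and scoring inside mapPartitions (my own sketch, not BigDL sample code; the valueAt call and the 1-D input shape are assumptions for this toy data):

import com.intel.analytics.bigdl.tensor.{Storage, Tensor}

// trainSetRDD: RDD[(Array[Double], Double)] as built earlier in the thread;
// trainedModel is the module returned by optimizer.optimize().
val bcModel = sc.broadcast(trainedModel)

val predictions = trainSetRDD.mapPartitions { iter =>
  val localModel = bcModel.value   // only the broadcast is captured, never sc
  val input = Tensor[Double]()
  iter.map { case (features, _) =>
    input.set(Storage[Double](features), storageOffset = 1, sizes = Array(features.length))
    // copy the scalar out, since forward() reuses its output buffer
    localModel.forward(input).asInstanceOf[Tensor[Double]].valueAt(1)
  }
}
predictions.take(5).foreach(println)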
Hi Xin –
Thanks. I tried following the /loadmodel example.
I tried saving / loading the model to HDFS, but was not successful. It seems like the saveModel() and Module.load() methods only operate on the local filesystem (true?). How do I save / load models when I run a spark-submit job on a cluster?
Thanks.
V.
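On the save/load question: I'm not sure whether this BigDL version accepts HDFS paths directly, so a conservative workaround (a sketch under that assumption, with illustrative paths) is to save/load on the driver's local disk and shuttle the file to/from HDFS with Hadoop's FileSystem API:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// 1) Save locally on the driver with the BigDL call you already use
//    (e.g. the saveModel() mentioned above), writing to /tmp/oom_model.bigdl.
// 2) Push the file to HDFS.
val fs = FileSystem.get(new URI("hdfs:///"), new Configuration())
fs.copyFromLocalFile(new Path("/tmp/oom_model.bigdl"),
  new Path("/user/vadim/models/oom_model.bigdl"))

// Later run: pull it back to local disk, then Module.load from there.
fs.copyToLocalFile(new Path("/user/vadim/models/oom_model.bigdl"),
  new Path("/tmp/oom_model.bigdl"))
// val restored = Module.load[Double]("/tmp/oom_model.bigdl")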
Thanks for the advice!
Jason –
On your last comment – executors w/ many cores: I am limited to 8 cores in my environment (per policy).
Anyway, I have been able to successfully run a few times with 160 executors and 6 cores each (8 cores gave me out-of-memory problems).
However, sometimes I get a job failed with this error (happens after about 100 or so iterations; all of a sudden it dies):
17/01/22 14:19:18 ERROR YarnClusterScheduler: Lost executor 38 on phxaishdc9dn0781.phx.ebay.com: Container marked as failed: container_e152_1483654296013_217381_02_000486 on host: phxaishdc9dn0781.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
17/01/22 14:19:18 ERROR TaskSetManager: Task 30 in stage 2458.0 failed 1 times; aborting job
17/01/22 14:19:18 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 30 in stage 2458.0 failed 1 times, most recent failure: Lost task 30.0 in stage 2458.0 (TID 59323, phxaishdc9dn0781.phx.ebay.com): ExecutorLostFailure (executor 38 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_217381_02_000486 on host: phxaishdc9dn0781.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 30 in stage 2458.0 failed 1 times, most recent failure: Lost task 30.0 in stage 2458.0 (TID 59323, phxaishdc9dn0781.phx.ebay.com): ExecutorLostFailure (executor 38 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_217381_02_000486 on host: phxaishdc9dn0781.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1855)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1881)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD.count(RDD.scala:1164)
at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:236)
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:532)
at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN$.main(OOM_Train_NN.scala:306)
at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN.main(OOM_Train_NN.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
This last one happened with 200 executors and 6 cores. (I have been trying bigger and bigger batch sizes too, and thus the 200 number of executors; this last one was 18e6 total batch = 90K per executor so I wouldn’t run out of memory. How large can I make the batch per executor w/ 40GB executor memory and 6 cores per executor?)
V.
From: Jason Dai <jaso...@gmail.com>
Date: Friday, January 20, 2017 at 3:38 AM
To: 邱鑫 <qiuxin...@gmail.com>
Indeed – I did see such heartbeat timeout messages. That's probably it then.
Thanks!
V.
From: Jason Dai <jaso...@gmail.com>
Date: Monday, January 23, 2017 at 4:00 AM
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Another possibility is that there are too many GCs, and the executors are removed by the master because the heartbeat times out. You can check the Spark logs to see if there are such error messages.
Thanks,
-Jason
Hi Jason (and all)
So I have trained the 1B+ set a few times, but am not getting the results I want, so I am playing around with different parameters, number of hiddens, layers, etc. Here’s the thing:
1. Go with 160 exec, 6 cores each. 40GB per exec, with 8GB overhead. Total batch 30M, batch per exec 187500. 2 ReLU layers, 100 hiddens each. Finishes ok.
2. Make my network a bit more complex: 3 ReLU layers, 100 hiddens each. Same job config. Get this: Container killed by YARN for exceeding memory limits. 48.0 GB of 48 GB physical memory used.
3. OK, so then I go with 200 exec, 5 cores each. Then I get the dreaded: requirement failed: Detect multi-task run on one Executor/Container. Currently not support this
Any ideas?
V.
Thanks. Cut down my batch size, trying again. Sometimes I also get this one:
17/01/26 19:57:37 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 715 in stage 25.0 failed 1 times, most recent failure: Lost task 715.0 in stage 25.0 (TID 9276, phxdpehdc9dn2398.stratus.phx.ebay.com): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_279325_01_2122733 on host: phxdpehdc9dn2398.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 715 in stage 25.0 failed 1 times, most recent failure: Lost task 715.0 in stage 25.0 (TID 9276, phxdpehdc9dn2398.stratus.phx.ebay.com): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_279325_01_2122733 on host: phxdpehdc9dn2398.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1855)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1881)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD.count(RDD.scala:1164)
at com.intel.analytics.bigdl.dataset.DistributedDataSet$class.transform(DataSet.scala:180)
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet.transform(DataSet.scala:208)
at com.intel.analytics.bigdl.dataset.AbstractDataSet$class.$minus$greater(DataSet.scala:91)
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet.$minus$greater(DataSet.scala:208)
….
During the batching transformation. Not sure if this is something on your end or on Spark/Hadoop.
V.
This error –
Detect multi-task run on one Executor/Container. Currently not support this at scala.Predef$.require(Predef.scala:219)
– is really killing me now, after decreasing the batch size. I even upped the executor cores to 10 (100 executors), with spark.shuffle.reduceLocality.enabled=false.
I can’t get a successful run anymore. About to give up on this until this is fixed – not possible to do any real work in this situation.
V.
From: Jason Dai <jaso...@gmail.com>
Date: Thursday, January 26, 2017 at 8:32 PM
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Looks like Spark is cancelling the job as the executor (YARN container) is lost - maybe due to YARN killing it? Are there any error messages from the Spark tasks?
-Jason
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim,
First I think your batch size seems too large for the learning rate; try lowering the batch size (for instance we used a batch size of 1K~2K when training the Inception model in our example), and tuning the hyper-parameters for better accuracy.
And to address the OOM problem, you can also try reducing the batch size (while keeping the core#) for now. We'll try to reproduce the "multi-task run" issues on our side.
Thanks,
-Jason
val optimizer = Optimizer(
model = model,
dataset = trainSet,
criterion = new MSECriterion[Double]()
).asInstanceOf[DistriOptimizer[Double]].disableCheckSingleton()
Thanks!! Will give it a shot.
BTW: after many trials, I went back to my 160 executors / 6 cores setup and it seems to be running on an older cluster we have, which limits us to 8 cores per exec. I was having some memory issues with a 3-hidden-layer network and a very large batch on this cluster, but now since I have lowered my batch, it seems to be running. This older cluster is running Spark 1.6.1 w/ Scala 2.10.
We also have a newer cluster w/ Spark 1.6.2 Scala 2.11, with higher mem nodes, where you can have 10-12 cores per exec. This is where I kept running into the aforementioned bug _all the time_, i.e. no successful runs on this new cluster. This is where I will try your patch mentioned below. I will let you know.
Again, thanks for your support.
Vadim.
Hi Jason –
The fix is working well, I can now start running jobs OK, but I am running into that "lost" container error. Here is the trace (400 execs x 4 cores, 40 GB per exec, 8 GB overhead). Here's the network:
17/01/29 01:38:19 INFO OOM_Train_NN$: DeepLearning Network =
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]
(1): nn.Reshape(69)
(2): nn.Linear(69 -> 100)
(3): nn.ReLU
(4): nn.Linear(100 -> 100)
(5): nn.ReLU
(6): nn.Linear(100 -> 100)
(7): nn.ReLU
(8): nn.Linear(100 -> 1)
}
17/01/29 01:38:20 INFO DistriOptimizer$: Cache thread models...
17/01/29 01:38:27 INFO DistriOptimizer$: Cache thread models... done
17/01/29 01:38:27 INFO DistriOptimizer$: config {
learningRate: 0.01
maxDropPercentage: 0.0
momentum: 0.9
warmupIterationNum: 200
learningRateDecay: 0.002
dampening: 0.0
dropPercentage: 0.0
comupteThresholdbatchSize: 100
}
7/01/29 10:12:30 INFO DistriOptimizer$: [Epoch 4 28800000/1085245782][Iteration 688][Wall Clock 21328.862921253s] Train 4800000 in 51.694058478seconds. Throughput is 92853.99795109505 records/second. Loss is 11173.878006300865.
17/01/29 10:13:19 INFO DistriOptimizer$: [Epoch 4 33600000/1085245782][Iteration 689][Wall Clock 21380.556979731s] Train 4800000 in 49.042573606seconds. Throughput is 97874.14580976956 records/second. Loss is 10773.208485912715.
17/01/29 10:13:38 ERROR YarnClusterScheduler: Lost executor 33 on phxdpehdc9dn2651.stratus.phx.ebay.com: Container marked as failed: container_e152_1483654296013_307643_02_000062 on host: phxdpehdc9dn2651.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
17/01/29 10:13:38 ERROR TaskSetManager: Task 193 in stage 15229.0 failed 1 times; aborting job
17/01/29 10:13:38 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 15229.0 failed 1 times, most recent failure: Lost task 193.0 in stage 15229.0 (TID 570819, phxdpehdc9dn2651.stratus.phx.ebay.com): ExecutorLostFailure (executor 33 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_307643_02_000062 on host: phxdpehdc9dn2651.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 15229.0 failed 1 times, most recent failure: Lost task 193.0 in stage 15229.0 (TID 570819, phxdpehdc9dn2651.stratus.phx.ebay.com): ExecutorLostFailure (executor 33 exited caused by one of the running tasks) Reason: Container marked as failed: container_e152_1483654296013_307643_02_000062 on host: phxdpehdc9dn2651.stratus.phx.ebay.com. Exit status: -100. Diagnostics: Container released on a *lost* node
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1855)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1975)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1032)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1014)
at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:220)
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:527)
at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN$.main(OOM_Train_NN.scala:322)
at com.ebay.mktgscience.oom.bigDL.OOM_Train_NN.main(OOM_Train_NN.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
From: Jason Dai <jaso...@gmail.com>
Date: Friday, January 27, 2017 at 5:16 PM
To: "dingdi...@gmail.com" <dingdi...@gmail.com>
One more thing – that exception below is also preceded by this:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 71.0 failed 1 times, most recent failure: Lost task 193.0 in stage 71.0 (TID 13715, spades-0270-1003663.lvs02.eaz.ebayc3.com): java.util.concurrent.ExecutionException: java.util.NoSuchElementException: None.get
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.intel.analytics.bigdl.parameters.FutureResult$$anonfun$waitResult$1.apply(AllReduceParameter.scala:220)
at com.intel.analytics.bigdl.parameters.FutureResult$$anonfun$waitResult$1.apply(AllReduceParameter.scala:220)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.intel.analytics.bigdl.parameters.FutureResult.waitResult(AllReduceParameter.scala:220)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:145)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:125)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2$$anonfun$4.apply(AllReduceParameter.scala:139)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2$$anonfun$4.apply(AllReduceParameter.scala:139)
at scala.Option.getOrElse(Option.scala:121)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2.call(AllReduceParameter.scala:139)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$$anonfun$3$$anon$2.call(AllReduceParameter.scala:135)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Driver stacktrace:
From: <bigdl-us...@googlegroups.com> on behalf of "Brzeski, Vadim" <vbrz...@ebay.com>
Date: Sunday, January 29, 2017 at 9:57 AM
To: Jason Dai <jaso...@gmail.com>
Thanks. BTW – as I said, I am trying to run this on a newer cluster (Spark 1.6.2, Scala 2.11). So far no luck, and I think something is messed up with our cluster (we're looking into it), but here's a new error – I've never seen this one before :)
Caused by: java.lang.Exception: Please initialize AllReduceParameter first!!
at com.intel.analytics.bigdl.parameters.AllReduceParameter.readGradientPartition(AllReduceParameter.scala:93)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition$lzycompute(AllReduceParameter.scala:61)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition(AllReduceParameter.scala:61)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.aggregrateGradientParition(AllReduceParameter.scala:185)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$optimize$1.apply(DistriOptimizer.scala:227)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$optimize$1.apply(DistriOptimizer.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
From: <bigdl-us...@googlegroups.com> on behalf of Jason Dai <jaso...@gmail.com>
Date: Monday, January 30, 2017 at 2:36 AM
To: "Brzeski, Vadim" <vbrz...@ebay.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim,
These exceptions seem to be caused by losing some Spark executors – probably the same executor heartbeat timeout problem discussed before? Do you see any such error messages in the executor logs? And can you check whether there are a lot of GC pauses? We'll look into the memory consumption problems in our environment.
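For reference, a minimal sketch of the standard Spark settings to check/raise when chasing heartbeat timeouts, plus GC logging on the executors (the property names are standard Spark ones; the values here are only illustrative):
******************
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only: raise the executor heartbeat / network timeouts to
// rule out heartbeat-timeout-driven executor loss, and turn on GC logging.
val conf = new SparkConf()
  .setAppName("Sample_NN")
  .set("spark.executor.heartbeatInterval", "60s")   // default is 10s
  .set("spark.network.timeout", "600s")             // default is 120s
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
val sc = new SparkContext(conf)
******************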
Thanks,
-Jason
From: "Brzeski, Vadim" <vbrz...@ebay.com>
Are we sure this AllReduceParameter error is because of losing some executors?
I see this in the log (it is clean – no error messages until that one). I am also now training with only 10% of my data, and a batch size per executor of 3000.
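Concretely, the sampling and batching amount to roughly this (a sketch only – trainRDD stands in for my full RDD of (features, label) pairs; sample() is the standard Spark RDD API):
******************
// Sketch: take a 10% sample of the training data and use a smaller batch size.
// Assumption: trainRDD is an existing RDD[(Array[Double], Double)].
val sampledRDD = trainRDD.sample(withReplacement = false, fraction = 0.1, seed = 42L)

// 3000 records per executor per mini-batch; the sampled RDD then goes through
// the same ToSample / SampleToBatch transformers as before.
val batchSizePerExecutor = 3000
******************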
2017-02-01 10:28:55 INFO OOM_Train_NN$:295 - DeepLearning Network =
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]
(1): nn.Reshape(69)
(2): nn.Linear(69 -> 70)
(3): nn.ReLU
(4): nn.Linear(70 -> 70)
(5): nn.ReLU
(6): nn.Linear(70 -> 70)
(7): nn.ReLU
(8): nn.Linear(70 -> 1)
}
2017-02-01 10:28:55 INFO DistriOptimizer$:400 - Cache thread models...
2017-02-01 10:28:58 INFO DistriOptimizer$:402 - Cache thread models... done
2017-02-01 10:28:58 INFO DistriOptimizer$:89 - config {
learningRate: 0.01
maxDropPercentage: 0.0
momentum: 0.9
warmupIterationNum: 200
learningRateDecay: 0.002
dampening: 0.0
dropPercentage: 0.0
comupteThresholdbatchSize: 100
}
2017-02-01 10:28:58 INFO DistriOptimizer$:90 - Shuffle data
2017-02-01 10:28:58 INFO DistriOptimizer$:93 - Shuffle data complete. Takes 0.032758708s
2017-02-01 10:29:18 INFO DistriOptimizer$:241 - [Epoch 1 0/108496693][Iteration 1][Wall Clock 0.0s] Train 1200000 in 15.646894009seconds. Throughput is 76692.53714569595 records/second. Loss is 10817.976028867404.
2017-02-01 10:29:37 INFO DistriOptimizer$:241 - [Epoch 1 1200000/108496693][Iteration 2][Wall Clock 15.646894009s] Train 1200000 in 19.144417971seconds. Throughput is 62681.456381581425 records/second. Loss is 14067.692619235619.
2017-02-01 10:29:54 INFO DistriOptimizer$:241 - [Epoch 1 2400000/108496693][Iteration 3][Wall Clock 34.79131198s] Train 1200000 in 16.798147873seconds. Throughput is 71436.4469864433 records/second. Loss is 9435.186635609934.
2017-02-01 10:30:17 ERROR TaskSetManager:74 - Task 297 in stage 148.0 failed 1 times; aborting job
2017-02-01 10:30:17 ERROR ApplicationMaster:95 - User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 297 in stage 148.0 failed 1 times, most recent failure: Lost task 297.0 in stage 148.0 (TID 22742, spades-0334-1027666.lvs02.eaz.ebayc3.com): java.lang.Exception: Please initialize AllReduceParameter first!!
at com.intel.analytics.bigdl.parameters.AllReduceParameter.readGradientPartition(AllReduceParameter.scala:93)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition$lzycompute(AllReduceParameter.scala:61)
at com.intel.analytics.bigdl.parameters.AllReduceParameter.gradientPartition(AllReduceParameter.scala:61)
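For reference, the nn.Sequential dump above corresponds to roughly this model definition (a sketch reconstructed from the printed layer list; Double tensors and the usual BigDL imports/implicits assumed):
******************
import com.intel.analytics.bigdl.nn.{Sequential, Reshape, Linear, ReLU}

// Sketch of the network in the dump: Reshape(69) -> three Linear+ReLU blocks
// of width 70 -> Linear(70 -> 1) producing a single regression output.
val model = Sequential[Double]()
  .add(Reshape[Double](Array(69)))
  .add(Linear[Double](69, 70))
  .add(ReLU[Double]())
  .add(Linear[Double](70, 70))
  .add(ReLU[Double]())
  .add(Linear[Double](70, 70))
  .add(ReLU[Double]())
  .add(Linear[Double](70, 1))
******************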
From: "dingdi...@gmail.com" <dingdi...@gmail.com>
Date: Monday, January 30, 2017 at 11:43 AM
To: BigDL User Group <bigdl-us...@googlegroups.com>
Hi Jason –
Looks like things have stabilized – I am now able to run jobs consistently. I had already done (1) and (2) below. I think what has made the difference are the following (see the sketch after this list):
a) Your latest patch
b) Running with a much smaller batch size (2K – 3K per executor)
c) 150 nodes, 6 cores per node
d) training on a fraction of my 1.09B records; right now I have done a few runs with up to 30% of the data (~300M records), and continue pushing it further…..
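For (b) and (c), a sketch of what that looks like on the driver side (this follows the Engine.init(nodes, coresPerNode, onSpark) pattern from my earlier snippets; the values are the ones listed above):
******************
import org.apache.spark.SparkContext
import com.intel.analytics.bigdl.utils.Engine

// Sketch: 150 nodes, 6 cores per node, running on Spark; Engine.init returns
// an Option[SparkConf] in the BigDL version I'm on, hence the .get.
val sc = new SparkContext(
  Engine.init(150, 6, true).get
    .setAppName("OOM_Train_NN")
)

// Batch size kept in the 2K-3K range per executor.
val batchSizePerExecutor = 3000
******************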
Thanks for all your help. If I am ultimately successful with this, I will definitely advertise BigDL and your support here at eBay.
V.
From: <bigdl-us...@googlegroups.com> on behalf of Jason Dai <jaso...@gmail.com>
Date: Thursday, February 2, 2017 at 3:57 AM
To: "dingdi...@gmail.com" <dingdi...@gmail.com>
Cc: BigDL User Group <bigdl-us...@googlegroups.com>, "Brzeski, Vadim" <vbrz...@ebay.com>
Subject: Re: [bigdl-user-group] Re: trying simple NN, but getting error: requirement failed: input must be vector or matrix
Hi Vadim,