Language modeling using LSTM in BigDL


MR

Jun 22, 2017, 8:23:28 AM
to BigDL User Group

Hello everyone!


I'm using BigDL with Python for language modeling. I'm trying to do word prediction using an LSTM with GloVe.


I preprocessed the sentences using Spark, and now I have for the training set:

  1. For each sentence a list of vectors, one for each word
  2. For each sentence the vector representing the last word which is the output

For example a sentence in the training set like "hi how are" + "you" becomes in Python a tuple like this: ([vector1,vector2,vector3],[vector_out])


From these tuples, which are stored in an RDD, I would like to create Samples in order to train the LSTM.

The problem is that I don't know how to create a Sample RDD where the training labels are vectors. I think I'm quite confused about the Sample class.

I filled Sample.features with a matrix where each row is the vector representing a word. That should be OK: the LSTM layer will do the recurrence, taking one row per step.

What about Sample.labels? How can I train an LSTM in BigDL in order to get a model for tasks other than classification?


Sorry, I'm really new to the BigDL and Spark world. Thank you!

Li, Zhichao

Jun 22, 2017, 8:25:06 PM
to Mario Ruggieri, BigDL User Group

([vector1,vector2,vector3],[vector_out])  -> Sample(tensor, tensor) and a sample set can be represented as an RDD[Sample] for training or validation.

 

There’s a similar notebook example, also based on LSTM and GloVe: https://github.com/intel-analytics/BigDL/tree/v0.1.1/pyspark/bigdl/models/textclassifier.

 

Thanks,

Zhichao


MR

Jun 27, 2017, 6:46:34 AM
to BigDL User Group
Thank you for your answer. This is what I did:

For samples I created two matrices, one for inputs and one for outputs. For example:

"hi how are you" is transformed using GloVe into [vec1,vec2,vec3,vec4], and so:
input_matrix = [vec1, vec2, vec3], output_matrix = [vec2, vec3, vec4]

The input and output matrices are NumPy arrays, reshaped so that each row is a word vector. I used them to construct the Sample with Sample.from_ndarray().
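
A minimal NumPy sketch of the shifted input/output construction described above. The vectors are random stand-ins for GloVe embeddings and the dimension of 50 is an assumption made only for illustration; the two resulting arrays are what would then be passed to Sample.from_ndarray().

```python
import numpy as np

vector_dim = 50                      # assumed embedding size, for illustration only
rng = np.random.default_rng(0)

# Stand-ins for the GloVe vectors of "hi how are you" (vec1..vec4).
sentence_vectors = rng.uniform(-1, 1, (4, vector_dim)).astype("float32")

# Inputs are all words but the last; targets are the same words shifted by one.
input_matrix = sentence_vectors[:-1]   # [vec1, vec2, vec3]
output_matrix = sentence_vectors[1:]   # [vec2, vec3, vec4]

print(input_matrix.shape, output_matrix.shape)
```

Row i of output_matrix is the word that follows row i of input_matrix, so both matrices have the same number of rows (time steps).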

Then I created an LSTM for word prediction in this way:

model = Sequential()
model.add(Recurrent().add(LSTM(input_dim, hidden_dim)))
model.add(TimeDistributed(Linear(hidden_dim, output_dim)))

and the Optimizer uses TimeDistributedCriterion(MSECriterion()) as the criterion.
My aim is to get the next-word vectors as outputs and take the last one. Then I search for the nearest word in the GloVe dictionary.
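
The nearest-word lookup mentioned above could be sketched as follows. This is only an illustration: the helper name nearest_word, the toy 3-dimensional vectors and the choice of cosine similarity as the measure are assumptions, not taken from the attached script.

```python
import numpy as np

def nearest_word(predicted, vocab_vectors, vocab_words):
    """Return the vocabulary word whose vector has the highest cosine similarity."""
    # Normalise rows so that a dot product equals cosine similarity.
    vocab_norm = vocab_vectors / np.linalg.norm(vocab_vectors, axis=1, keepdims=True)
    pred_norm = predicted / np.linalg.norm(predicted)
    similarities = vocab_norm @ pred_norm
    return vocab_words[int(np.argmax(similarities))]

# Toy 3-dimensional "embeddings"; real GloVe vectors would be loaded from file.
vocab_words = ["hi", "how", "are", "you"]
vocab_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
], dtype="float32")

predicted = np.array([0.9, 1.1, 0.05], dtype="float32")
print(nearest_word(predicted, vocab_vectors, vocab_words))
```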

Now the problem is the following: when I start training with optimizer.optimize(), the first epoch is done correctly. From the second one I get this error:

2017-06-27 12:04:09 ERROR ThreadPool$:115 - Error: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 4800
    at java.util.Arrays.rangeCheck(Arrays.java:120)
    at java.util.Arrays.fill(Arrays.java:3114)
    at com.intel.analytics.bigdl.tensor.ArrayStorage.fill(ArrayStorage.scala:61)
    at com.intel.analytics.bigdl.tensor.ArrayStorage.fill(ArrayStorage.scala:23)
    at com.intel.analytics.bigdl.tensor.DenseTensor.fill(DenseTensor.scala:230)
    at com.intel.analytics.bigdl.tensor.DenseTensor.zero(DenseTensor.scala:243)
    at com.intel.analytics.bigdl.nn.TimeDistributedCriterion.updateGradInput(TimeDistributedCriterion.scala:115)
    at com.intel.analytics.bigdl.nn.TimeDistributedCriterion.updateGradInput(TimeDistributedCriterion.scala:36)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractCriterion.backward(AbstractCriterion.scala:81)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:200)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:191)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:191)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)


What's wrong? Can you help me? 


Thank you!



Jason Dai

Jun 27, 2017, 9:29:36 AM
to MR, BigDL User Group
Hi Mario,

I wonder if you can share the example, so that we can try to reproduce the issue on our side.

Thanks,
-Jason


EmmeR

Jun 27, 2017, 12:36:49 PM
to BigDL User Group
Please, look at the file attached. 
I also shared the dataset, which is stored in data.mat. It contains a vocabulary plus the training and test data, which are four-word lines. Each word is represented by its index in the vocabulary.
lstm_word_prediction.py and data.mat need to be in the same folder; otherwise, change the data file path.
This is the command that I use to launch the training:

${SPARK_HOME}/bin/spark-submit \
    --py-files ${PYTHON_API_ZIP_PATH}, [PATH TO lstm_word_prediction.py] \
    --jars ${BigDL_JAR_PATH} \
    --conf spark.driver.extraClassPath=${BigDL_JAR_PATH} \
    --conf spark.executor.extraClassPath=bigdl-0.2.0-SNAPSHOT-jar-with-dependencies.jar \
    --conf spark.executorEnv.PYTHONHASHSEED=${PYTHONHASHSEED} \
    lstm_word_prediction.py


and this is for testing:


${SPARK_HOME}/bin/spark-submit \
    --py-files ${PYTHON_API_ZIP_PATH}, [PATH TO lstm_word_prediction.py] \
    --jars ${BigDL_JAR_PATH} \
    --conf spark.driver.extraClassPath=${BigDL_JAR_PATH} \
    --conf spark.executor.extraClassPath=bigdl-0.2.0-SNAPSHOT-jar-with-dependencies.jar \
    --conf spark.executorEnv.PYTHONHASHSEED=${PYTHONHASHSEED} \
    lstm_word_prediction.py \
    --action test \
    --modelPath [PATH TO MODEL]


Sorry for the code quality, it is in the experimental phase.

I hope you will help me, thank you so much!



lstm_word_prediction.py
data.mat

Wan, Yan

Jun 27, 2017, 10:59:38 PM
to Jason Dai, MR, BigDL User Group

Hi

 

The error is thrown at

at com.intel.analytics.bigdl.nn.TimeDistributedCriterion.updateGradInput(TimeDistributedCriterion.scala:115)

In the source code, that line is gradInput.resizeAs(input).zero

If the input is an empty Tensor, the .zero operation will throw an error.

Could you please check the input size and share the example?
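
One way to check the input size on the Python side, before the Samples are built, might look like this. The helper check_pair is hypothetical, and the rule that features and labels are both 2D with the same number of rows is an assumption based on the per-sample layout described earlier in this thread.

```python
import numpy as np

def check_pair(features, labels):
    """Raise ValueError if a (features, labels) pair is empty or misaligned."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    if features.size == 0 or labels.size == 0:
        raise ValueError("empty array: features %s, labels %s"
                         % (features.shape, labels.shape))
    if features.ndim != 2 or labels.ndim != 2:
        raise ValueError("expected 2D arrays (seq_len x dim)")
    if features.shape[0] != labels.shape[0]:
        raise ValueError("sequence lengths differ: %d vs %d"
                         % (features.shape[0], labels.shape[0]))
    return features.shape, labels.shape

print(check_pair(np.zeros((3, 50)), np.zeros((3, 50))))
```

Mapping such a check over the RDD of (features, labels) pairs before creating Samples would rule out empty or misaligned inputs on the data side.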

 

Bests,

Yan


EmmeR

Jun 28, 2017, 4:02:54 AM
to BigDL User Group
I shared the code in the previous answer. Do you mean the input sample to the LSTM? It is never empty. The error occurs from the second epoch. 

If I use (for example) 10 epochs, the error starts from the second epoch and the optimizer saves only the first 5 epochs in the checkpoint path (model.2328, model.4655, model.6982, model.9309, model.11636). This is very strange behaviour.
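
One reading of these file names, assuming the numeric suffix is the optimizer's iteration counter at the moment of saving (an assumption, not confirmed in this thread): the suffixes are evenly spaced, which would be consistent with a fixed number of iterations per epoch and one checkpoint per epoch. The arithmetic can be checked with a few lines:

```python
checkpoints = [2328, 4655, 6982, 9309, 11636]

# Differences between consecutive checkpoint suffixes.
steps = [b - a for a, b in zip(checkpoints, checkpoints[1:])]
print(steps)

iterations_per_epoch = steps[0]
# Check whether each suffix equals epoch * iterations_per_epoch + 1 for epochs 1..5.
expected = [epoch * iterations_per_epoch + 1 for epoch in range(1, 6)]
print(expected == checkpoints)
```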



EmmeR

Jun 29, 2017, 7:52:21 AM
to BigDL User Group
I read the TimeDistributedCriterion.updateGradInput function and the problem seems to be here:

val timeDim = 2
require(input.size(timeDim) == target.size(timeDim),
  s"target should have as many elements as input")
gradInput.resizeAs(input).zero()
val nstep = input.size(timeDim)

From the second epoch maybe input is empty...but why???

On the other hand, the input and target tensors here are intended to be 3D. I'm using a 2D tensor for input and target, where each row represents the vector for one LSTM time step. What's wrong? Can someone help me?





Wan, Yan

Jun 29, 2017, 10:30:39 PM
to EmmeR, BigDL User Group

Hi,

 

The input should be a 3D tensor. If there is only one utterance, please still add a leading dimension of size 1 to represent a batch size of 1.

The target should be a 2D tensor whose first dimension is the batch size.

Previously, you mentioned:

For example a sentence in the training set like "hi how are" + "you" becomes in Python a tuple like this: ([vector1,vector2,vector3],[vector_out])

Do you mean that the input is “hi how are” and the target is “you”?

In that case, it is not a typical language model, and the criterion should probably be a distance function.

If the input of the Recurrent layer is three words, the output will also be three word vectors, predicting the next three words. Please add a Select(2, -1) layer so that the model selects the last output as the predicted word, and use a distance function as the criterion to measure how close the predicted word is to the target word.
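
In NumPy terms, selecting the last output and comparing it with a 2D target amounts to roughly the following. The sizes are illustrative, and mean squared error is used only as a stand-in for whichever distance function is chosen; this is a sketch of the shapes involved, not of BigDL's Select layer itself.

```python
import numpy as np

batch_size, seq_len, vector_dim = 2, 3, 4   # illustrative sizes

# Stand-in for the per-time-step output of Recurrent + TimeDistributed(Linear).
outputs = np.arange(batch_size * seq_len * vector_dim, dtype="float32")
outputs = outputs.reshape(batch_size, seq_len, vector_dim)

# Keep only the last time step: (batch, time, dim) -> (batch, dim).
last_step = outputs[:, -1, :]

# 2D target: one word vector per sample in the batch.
targets = np.zeros((batch_size, vector_dim), dtype="float32")

# Mean squared error as a stand-in distance; any distance could be used here.
mse = float(np.mean((last_step - targets) ** 2))
print(last_step.shape, mse)
```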

EmmeR

Jun 30, 2017, 7:13:54 AM
to BigDL User Group
If you read my last answers, my idea was to use n words for input and n words for output, as in a typical language model task.

The input in Python should be a 2D tensor (a numpy matrix) of sequence_len x vector_dim size, where each row represents a word. 

The Optimizer turns the 2D matrix into a 3D Scala tensor, using the specified batch size as the first dimension. If I use 3D numpy arrays as inputs, the Optimizer throws an error because it turns them into a 4D tensor.
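
The shape effect described here can be illustrated with plain NumPy. np.stack stands in for whatever the Optimizer does internally when it assembles a mini-batch, which is not shown in this thread, and the sizes are assumptions.

```python
import numpy as np

seq_len, vector_dim, batch_size = 3, 50, 4   # illustrative sizes

# Each sample is a 2D matrix: one row per word.
samples_2d = [np.zeros((seq_len, vector_dim), dtype="float32")
              for _ in range(batch_size)]
batch = np.stack(samples_2d)          # the batch axis is added in front
print(batch.shape)

# If each sample already carries a leading axis, batching adds another one.
samples_3d = [s[np.newaxis, ...] for s in samples_2d]
batch_4d = np.stack(samples_3d)
print(batch_4d.shape)
```

This is why per-sample arrays stay 2D (sequence_len x vector_dim) on the Python side: the batch dimension is added later.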

This is my model:

model = Sequential()
model.add(Recurrent().add(LSTM(input_dim, hidden_dim)))
model.add(TimeDistributed(Linear(hidden_dim, output_dim)))

If I take the predictions after the first epoch, I get a matrix (for each test sample) of n_words x vector_dim elements, as I expected.
The problems start from the second epoch.

EmmeR

Jul 2, 2017, 7:03:25 AM
to BigDL User Group
Sorry, did someone read the attached source code?
Thank you very much for your help, it's really important for me.

zhangl...@gmail.com

Jul 2, 2017, 9:53:38 PM
to BigDL User Group
Ok, I am trying to reproduce this issue, please wait some time.

zhangl...@gmail.com

Jul 3, 2017, 8:27:45 AM
to BigDL User Group
Hi, we have reproduced your issue and just fixed it in our latest code. Please pull the master branch and try again. Thank you.

Cherry



EmmeR

Jul 4, 2017, 3:39:55 AM
to BigDL User Group, zhangl...@gmail.com
Thank you so much, this is great news for the BigDL community! It works fine!

Can you tell me which validation method can be used for continuous outputs in order to validate the model?

Wan, Yan

Jul 5, 2017, 12:15:00 AM
to EmmeR, BigDL User Group, zhangl...@gmail.com

Hi,

 

The CosineEmbeddingCriterion can be used to evaluate the distance between the continuous output and target.

EmmeR

Jul 5, 2017, 3:49:02 AM
to BigDL User Group
Yes, but it gives me this error:

Error: java.lang.ClassCastException: com.intel.analytics.bigdl.tensor.DenseTensor cannot be cast to com.intel.analytics.bigdl.utils.Table

MSECriterion() works fine, but for my purpose the cosine distance is better. How can I fix this error?

On the other hand I need a validation method for optimizer.set_validation(). Top1Accuracy, TopNAccuracy and Loss don't work as expected.

Jason Dai

Jul 5, 2017, 4:26:30 AM
to EmmeR, BigDL User Group
You may refer to https://github.com/intel-analytics/BigDL/pull/1130 for a sample language model implemented in Python. Please note that it is still a work in progress, and the interface may change before it is merged into the project.

Thanks,
-Jason 


EmmeR

Jul 5, 2017, 6:09:44 AM
to BigDL User Group
Ok, it is still in experimental stage, thank you for your work.

For the CosineEmbeddingCriterion(), which is in master, I was wondering how I can create a Sample from the Python side. I have to create a table (tuple?) for the input containing two tensors: one for the inputs and one for the output targets. The target y is a vector of 1 and -1.

Now the problem is that Sample is constructed to have just one np.array for input and one np.array for output.


shell...@gmail.com

Jul 6, 2017, 3:16:07 AM
to BigDL User Group
Please refer to https://github.com/intel-analytics/BigDL/wiki/Python-Support for the Python part.

For a simple example, you can create a Sample by:

from bigdl.nn.layer import *
from bigdl.nn.criterion import *
from bigdl.optim.optimizer import *
from bigdl.util.common import *

input = np.random.uniform(0, 1, (3, 5)).astype("float32")
target = np.random.uniform(0, 1, (3)).astype("float32")
sample = Sample.from_ndarray(input, target)

> sample
<bigdl.util.common.Sample at 0x7f7bf497ad50>

> input
array([[ 0.9237712 , 0.68297315, 0.49127841, 0.33461636, 0.74126232],
       [ 0.87127173, 0.75266618, 0.02735208, 0.58697593, 0.98401189],
       [ 0.46674389, 0.54515141, 0.31183699, 0.80972594, 0.70999968]], dtype=float32)

> target
array([ 0.83898467, 0.45599914, 0.49628651], dtype=float32)

EmmeR

Jul 6, 2017, 11:54:16 AM
to BigDL User Group
Thank you, but I know how to create a Sample in Python. My problem is creating a Sample for the CosineEmbeddingCriterion().

shell...@gmail.com

Jul 9, 2017, 10:55:22 PM
to BigDL User Group
Hi,

The CosineEmbeddingCriterion receives a Table input {x1, x2} and a Table target {1, -1}, and calculates the cosine similarity between x1 and x2.
This might not be the function you want, since in your scenario the distance has to be calculated between the input and the target.
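
For reference, a NumPy sketch of the usual Torch-style cosine embedding loss for a single pair. That BigDL's CosineEmbeddingCriterion follows exactly this formula, including the margin default of 0, is an assumption; the function name is illustrative.

```python
import numpy as np

def cosine_embedding_loss(x1, x2, y, margin=0.0):
    """Torch-style cosine embedding loss for one pair; y is 1 or -1."""
    cos = float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))
    if y == 1:
        return 1.0 - cos              # similar pair: penalise low similarity
    return max(0.0, cos - margin)     # dissimilar pair: penalise high similarity

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(cosine_embedding_loss(a, a, 1))    # identical vectors, labelled similar
print(cosine_embedding_loss(a, b, 1))    # orthogonal vectors, labelled similar
print(cosine_embedding_loss(a, a, -1))   # identical vectors, labelled dissimilar
```

With y fixed to 1, the loss reduces to 1 - cos(x1, x2), which is the cosine distance between a predicted vector and its target.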

We are currently working on supporting a CosineDistanceCriterion.

Will update information in this issue:


Bests,
Yan