Use cnndailymail with the transformer model


fdie...@googlemail.com

Oct 22, 2017, 3:24:44 PM
to tensor2tensor
Hi All,

I am wondering if anyone has already tried to apply the Transformer model to the CNN/DailyMail summarization task. Though the vanilla Transformer isn't designed for summarization, it would serve as an initial data point. This is in line with the paper https://arxiv.org/abs/1704.04368, which presents such a data point for a 'traditional' LSTM-to-LSTM + attention model with ROUGE-1/2/L of 31.33% / 11.81% / 28.83%. In other words, the question is whether the Transformer performs similarly on this task compared to these numbers.

I've already played around with this using the transformer_big_single_gpu configuration (cutting the inputs and targets to the leading 50 or 100 tokens). However, my results are pretty miserable: ROUGE-L = 12-14% on the test set. The strange thing is that on a 10k subset of the training set my ROUGE-L numbers are even worse, ~10%, and stable over many epochs (trained up to 1000k epochs), even though the loss goes down constantly.

Btw, another finding was that the decoder output was at most 100 tokens long. There seems to be a parameter that prevents the generation of longer output sequences. Can anyone point me to this magic parameter? Even after inspecting the code I wasn't able to identify it (note that I didn't use interactive decoding, which obviously comes with such a parameter and has a default of 100).

Well, thanks a lot in advance
Best
Frank

Lukasz Kaiser

Oct 22, 2017, 8:52:30 PM
to fdie...@googlemail.com, urva...@stanford.edu, tensor2tensor
Hi Frank.

Urvashi (CC) has been working on CNN/DailyMail in T2T. There were
multiple problems: the data wasn't prepared well (that PR is already
merged on github, but not yet in the pip release), and then we need a
script to re-tokenize and compute a comparable ROUGE. Also,
transformer_prepend seems to be doing much better than base.
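Concretely, that just means swapping the hparams set in the trainer call, e.g. passing

--model=transformer
--hparams_set=transformer_prepend

instead of the base set.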

Urvashi, when you have some results, can you tell us more?

Thanks for your interest and for sharing the work, guys!

Lukasz

vinc...@yahoo.com

Nov 2, 2017, 11:21:28 AM
to tensor2tensor
Hello Lukasz,

I tried transformer_prepend, but it is going nowhere in terms of ROUGE or any other metric.

Is there something broken in the problem setup?

Anyone else with results?

Cheers.
Vincent

vinc...@yahoo.com

Nov 2, 2017, 3:14:12 PM
to tensor2tensor

Hmmm, maybe I need to wait much longer. I'll keep you posted.

vinc...@yahoo.com

Nov 3, 2017, 1:42:06 PM
to tensor2tensor
The config is transformer_prepend.

After more than 500K steps with batch_size 1024:

Approx BLEU is slightly over 0.6,
rouge_2_fscore is about 0.68,
rouge_L_fscore is 0.76.

But when I decode a sample of 50 paragraphs from the cnndm test set, it's really bad...

I am wondering if subwords are suited for this task, because there are a lot of non-existent words in the output.

fdie...@googlemail.com

Nov 4, 2017, 4:05:30 PM
to tensor2tensor
Vince, are the numbers you report from the dev set (automatically generated during training)? If so, did you set the option --eval_run_autoregressive? If not, the ground-truth targets are fed back into the decoder instead of the actually decoded symbols, which may explain the overly good-looking BLEU and ROUGE scores.

Btw, I was able to install everything and I'm now running the experiment (started a few hours ago). I am using

--problem=summarize_cnn_dailymail32k
--model=transformer
--hparams_set=transformer_base_single_gpu

First I tried transformer_big_single_gpu, but that failed because it used too much memory (on a P6000); so far it looks like transformer_base_single_gpu works. I am wondering, though, whether one needs to reduce the sequence lengths by setting --hparams="max_input_seq_length=XXX,max_target_seq_length=YYY". I am trying XXX=YYY=50 and =100. Did you restrict the sequence lengths for your results too?
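Put together, the full invocation looks roughly like this (the paths and the step count below are placeholders, not my actual values):

# DATA_DIR, OUTPUT_DIR and train_steps are placeholders
t2t-trainer \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --problem=summarize_cnn_dailymail32k \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --hparams="max_input_seq_length=100,max_target_seq_length=100" \
  --eval_run_autoregressive \
  --train_steps=500000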

vinc...@yahoo.com

Nov 4, 2017, 4:48:26 PM
to tensor2tensor
Thanks for the info.
I did not set that option; I will.
For the sequence length you may be right, but I think the default value should work.
However, I had to set the batch size to 1024, and if you want to train with base_single_gpu (as I did), learning_rate=0.05 works better than the default 0.2.

Results are not good at all.

I think for this dataset we need to work with words, not subwords: the corpus is small and already tokenized. I will try.
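To be explicit, the overrides I mentioned above amount to passing roughly

  --hparams_set=transformer_base_single_gpu \
  --hparams="batch_size=1024,learning_rate=0.05"

to the trainer, on top of the flags you listed.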

fdie...@googlemail.com

Nov 7, 2017, 9:54:30 AM
to tensor2tensor
Vince,

in the meantime I have run a couple of experiments with the summarize_cnn_dailymail32k problem.
The results are, however, rather bad (the numbers below are on the val set, obtained during training with --eval_run_autoregressive set):

PROBLEM=summarize_cnn_dailymail32k
MODEL=transformer

transformer_base_single_gpu   max_input_seq_length=100   max_target_seq_length=100   learning_rate=0.2    batch_size=2048
Step        rouge_2_fscore    rouge_L_fscore
645469      0.1638            0.1499
740388      0.1554            0.1467

transformer_base_single_gpu   max_input_seq_length=50    max_target_seq_length=50    learning_rate=0.2    batch_size=2048
Step        rouge_2_fscore    rouge_L_fscore
169911      0.1229            0.1460
900000      0.1139            0.1423

transformer_base_single_gpu   max_input_seq_length=100   max_target_seq_length=100   learning_rate=0.05   batch_size=1024
Step        rouge_2_fscore    rouge_L_fscore
1063285     0.1323            0.1416
1292743     0.1308            0.1294

transformer_big_single_gpu    max_input_seq_length=100   max_target_seq_length=100   learning_rate=0.05   batch_size=1024
Step        rouge_2_fscore    rouge_L_fscore
301559      0.1621            0.1563
587090      0.1559            0.1474

Though not directly comparable, since they were generated on the val set rather than the test set, all these numbers are very far from the LSTM-to-LSTM + attention numbers reported in "Get To The Point: Summarization with Pointer-Generator Networks" (https://arxiv.org/abs/1704.04368), where the authors report ROUGE-1/2/L of 31.33% / 11.81% / 28.83% on the test set.

So what is your experience with this? Do you get similarly bad numbers?

vinc...@yahoo.com

Nov 7, 2017, 10:55:03 AM
to tensor2tensor

According to Lukasz, transformer_prepend is the way to go, but I didn't get any good results with it either.

Regarding your experiments, something is wrong. You can't expect any results with 100 as the input sequence length. Remember, these are subwords, and the stories are several hundred words long on the input side, so you need at least 1000 tokens.

Anyway, I have the feeling something is wrong with the setup.

fdie...@googlemail.com

Nov 7, 2017, 1:31:47 PM
to tensor2tensor
Vince, wrt the number of tokens, I believe that 100 tokens (even when using subword tokens) should give at least reasonable results. My argument is as follows.
In the aforementioned paper 'Get To The Point: Summarization with Pointer-Generator Networks' (https://arxiv.org/abs/1704.04368) the following numbers are given:

lead-3 baseline (ours): ROUGE-1/2/L = 40.34 / 17.70 / 36.57, i.e. when using just the first three sentences as the summary!

Counting the number of tokens in the first three sentences of the train, dev and test parts, I get an average of 55, 81 and 81 tokens, respectively.

Calculating the 'full word token coverage' of these three sentences wrt the word-piece vocabulary, I get a coverage rate of 86% for train, dev and test.

Thus I am pretty confident that with 100 word-piece tokens I should cover, on average, roughly the first three sentences (this holds in particular for the training data).

This truncation approach is of course suboptimal, but it appears reasonable to me for testing the feasibility of the Transformer model for summarization. I mean, one should at least get reasonably close to the LSTM-to-LSTM + attention numbers reported above (assuming the Transformer performs similarly well).

vinc...@yahoo.com

Nov 7, 2017, 2:05:48 PM
to tensor2tensor
Am I reading this right?

From Section 4 (Dataset): "We use the CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016), which contains online news articles (781 tokens on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average). We used scripts supplied by Nallapati et al. (2016) to obtain the same version of the data, which has 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs."

vinc...@yahoo.com

Nov 7, 2017, 2:08:03 PM
to tensor2tensor
and then

"During training and at test time we truncate the article to 400 tokens and limit the length of the summary to 100 tokens for training and 120 tokens at test time."



On Tuesday, November 7, 2017 at 19:31:47 UTC+1, fdie...@googlemail.com wrote:

fdie...@googlemail.com

Nov 7, 2017, 2:22:26 PM
to tensor2tensor
Vince, I'm not sure what you're asking me. What I'm saying is just that most of the information needed to get reasonable ROUGE scores is in the first 3 sentences of the stories, and that one covers them roughly with the first 100 tokens. One should not expect to get competitive results by restricting the data in this way. However, it should provide enough information to get something reasonable, and not the numbers I sent around - at least if the Transformer model does something useful wrt this task.

vinc...@yahoo.com

Nov 7, 2017, 3:16:38 PM
to tensor2tensor

You wrote these results were obtained "using the first 3 sentences".

100 tokens is not very long, that's all I am saying. But yes, it should give you more than what you get.

I am trying other settings as well.

Lukasz Kaiser

Nov 7, 2017, 6:09:10 PM
to vinc...@yahoo.com, tensor2tensor
Below are results and comments from Urvashi using transformer_prepend
and transformer_decoder as model. A PR with a ROUGE script will be coming,
but the email can give some hints for now.

Lukasz

----------

The rouge scores from the last generated model are:
INFO:tensorflow:rouge_1_f_score: 0.2557
INFO:tensorflow:rouge_1_precision: 0.2579
INFO:tensorflow:rouge_1_recall: 0.2558
INFO:tensorflow:rouge_2_f_score: 0.1292
INFO:tensorflow:rouge_2_precision: 0.1305
INFO:tensorflow:rouge_2_recall: 0.1290
INFO:tensorflow:rouge_l_f_score: 0.2438
INFO:tensorflow:rouge_l_precision: 0.2459
INFO:tensorflow:rouge_l_recall: 0.2438

The decoder does output some empty predictions. The ROUGE script identifies these as empty input and prints a warning to the console, but note that such examples are simply treated as having a ROUGE score of 0.0, which is the expected behavior.

In the pointer-generator paper referenced above (https://arxiv.org/abs/1704.04368), the results of a baseline seq2seq with attention and a 50k vocabulary on the test set are:
ROUGE-1 = 31.33
ROUGE-2 = 11.81
ROUGE-L = 28.83

Qualitatively, the outputs seem to be rather short; in fact, they are cut off mid-sentence. Example: “Sam Fuller Jr. says he decided to leave the
Church of ND in 2007 after almost taking his own life. He spent 28
days in a mental institution following the incident, and that is when
he finally left the church. The 47-year-old”

The average length of a target is 56 tokens and the max length is 1722 tokens, while the average length of a decoded sequence is 39 tokens and the max length is 97. I noticed the hparam "max_length" is set to 256. Does this apply to both source and target sequences? If so, we might need to make it longer. I ran a quick experiment, placing some dev example sources in a file and decoding with the decode_from_file flag turned on. The outputs for these were extremely bizarre. I've attached a log with the outputs, but they are identical for all the examples, they are extremely repetitive, and most importantly they are completely different from when I decode the dev set directly. Any ideas on what might be causing this? It looks like a definite bug, but I'm not sure if it's cropped up before.
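For reference, the two decoding setups correspond roughly to the following invocations (the paths and file names are placeholders, and the exact binary/flag spellings are from memory, so they may differ slightly in the current release):

# decoding the dev set directly (no decode_from_file flag)
t2t-decoder \
  --data_dir=$DATA_DIR \
  --output_dir=$TRAIN_DIR \
  --problem=summarize_cnn_dailymail32k \
  --model=transformer \
  --hparams_set=transformer_prepend

# decoding sources placed in a text file
t2t-decoder \
  --data_dir=$DATA_DIR \
  --output_dir=$TRAIN_DIR \
  --problem=summarize_cnn_dailymail32k \
  --model=transformer \
  --hparams_set=transformer_prepend \
  --decode_from_file=dev_sources.txt \
  --decode_to_file=dev_decodes.txt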

est namoc

Sep 17, 2018, 12:34:39 AM
to tensor2tensor
I used the transformer_prepend hparams set and the transformer model with additional params (hidden_layer = 5, batch_size = 2048), and ran on 2 GPUs for 250k steps (500k steps in total).

The output is extremely repetitive as well. I am using the decode_from_file flag and am not aware of another way to do it. May I know how you are decoding without that flag, Lukasz?

However, the decode_from_file flag does work well in a typical translation task.

tim

Sep 18, 2018, 4:56:21 AM
to tensor2tensor

Unfortunately, I have the same problem.
I used TensorFlow 1.9 and tensor2tensor 1.9, with the transformer model and the transformer_prepend hparams as the README suggests.
I ran on 2 GPUs for 250k steps.
The decoded results are either too long or too short.

""""""""""""""""""" result example """"""""""""""""""""
INFO:tensorflow:Inference results OUTPUT: By. Daily Mail Reporter. Follow @@Mail Reporter. Follow @@Mail Reporter. Follow @@Mail Reporter. Follow @@Mail Reporter. Follow @@Mail Reporter. Follow @@Mail Reporter.  #Mail Reporter. Follow @@Mail Reporter.  #Mail Reporter. Follow @@Mail Reporter.  #Mail Reporter.  #Mail Reporter.  #Mail Reporter.  #Mail Reporter.  #Mail Reporter.  #Mail Reporter.  #Mail Reporter. 
......repeat many times

 """"""""""""""""""" result example """"""""""""""""""""
INFO:tensorflow:Inference results OUTPUT: By. Sam


On Monday, September 17, 2018 at 12:34:39 PM UTC+8, est namoc wrote: