How to use Keras RNN/LSTM/GRU to predict a subsequence from a sequence


gund...@gmail.com

May 12, 2016, 3:36:38 AM
to Keras-users
I have data that looks like this:

[image: sample rows of the data, with "Antigen" and "Epitope Core" columns]

It can be viewed here and is also loaded in the code below.
In actuality I have ~7000 samples (rows).


The task is: given an antigen, predict the corresponding epitope.
The epitope is always a substring of the antigen.
Here is my code for a recurrent neural network under Keras. It was modeled on the Keras example.


My questions are:

 1. Can a recurrent neural network be used to predict a subsequence as posed above?
 2. How can I improve the accuracy of my model?
 3. How can I modify my code so that it runs faster?


Here is my running code, which gives a very poor accuracy score.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from __future__ import print_function
    import sys
    import pandas as pd
    import numpy as np
    from keras.models import Sequential
    from keras.engine.training import slice_X
    from keras.layers.core import Activation, RepeatVector, Dense
    from keras.layers import recurrent, TimeDistributed
    from six.moves import range


    class CharacterTable(object):
        '''
        Given a set of characters:
        + Encode them to a one-hot integer representation
        + Decode the one-hot integer representation to their character output
        + Decode a vector of probabilities to their character output
        '''
        def __init__(self, chars, maxlen):
            self.chars = sorted(set(chars))
            self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
            self.indices_char = dict((i, c) for i, c in enumerate(self.chars))
            self.maxlen = maxlen

        def encode(self, C, maxlen=None):
            maxlen = maxlen if maxlen else self.maxlen
            X = np.zeros((maxlen, len(self.chars)))
            for i, c in enumerate(C):
                X[i, self.char_indices[c]] = 1
            return X

        def decode(self, X, calc_argmax=True):
            if calc_argmax:
                X = X.argmax(axis=-1)
            return ''.join(self.indices_char[x] for x in X)


    class colors:
        ok = '\033[92m'
        fail = '\033[91m'
        close = '\033[0m'


    INVERT = True
    HIDDEN_SIZE = 128
    BATCH_SIZE = 64
    LAYERS = 3
    # Try replacing LSTM with GRU or SimpleRNN
    RNN = recurrent.LSTM


    def main():
        """
        Epitope_core = answers
        Antigen      = questions
        """
        epi_antigen_df = pd.io.parsers.read_table("http://dpaste.com/0BPTWTN.txt")
        antigens = epi_antigen_df["Antigen"].tolist()
        epitopes = epi_antigen_df["Epitope Core"].tolist()

        # Collect the alphabet of amino-acid characters seen in the data
        allchars = list(set("".join(antigens + epitopes)))
        aa_chars = "".join(allchars)
        sys.stderr.write(aa_chars + "\n")

        max_antigen_len = len(max(antigens, key=len))
        max_epitope_len = len(max(epitopes, key=len))

        # One-hot encode inputs and targets
        X = np.zeros((len(antigens), max_antigen_len, len(aa_chars)), dtype=np.bool)
        y = np.zeros((len(epitopes), max_epitope_len, len(aa_chars)), dtype=np.bool)

        ctable = CharacterTable(aa_chars, max_antigen_len)

        sys.stderr.write("Begin vectorization\n")
        for i, antigen in enumerate(antigens):
            X[i] = ctable.encode(antigen, maxlen=max_antigen_len)
        for i, epitope in enumerate(epitopes):
            y[i] = ctable.encode(epitope, maxlen=max_epitope_len)

        # Shuffle (X, y) in unison
        indices = np.arange(len(y))
        np.random.shuffle(indices)
        X = X[indices]
        y = y[indices]

        # Explicitly set apart 10% for validation data that we never train over.
        # Integer division: plain "/" would give a float index under Python 3.
        split_at = len(X) - len(X) // 10
        (X_train, X_val) = (slice_X(X, 0, split_at), slice_X(X, split_at))
        (y_train, y_val) = (y[:split_at], y[split_at:])

        sys.stderr.write("Build model\n")
        model = Sequential()
        # "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE.
        # Note: in a situation where your input sequences have a variable length,
        # use input_shape=(None, nb_feature).
        model.add(RNN(HIDDEN_SIZE, input_shape=(max_antigen_len, len(aa_chars))))
        # For the decoder's input, repeat the encoded input for each output time step
        model.add(RepeatVector(max_epitope_len))
        # The decoder RNN could be multiple stacked layers or a single layer
        for _ in range(LAYERS):
            model.add(RNN(HIDDEN_SIZE, return_sequences=True))

        # For each step of the output sequence, decide which character should be chosen
        model.add(TimeDistributed(Dense(len(aa_chars))))
        model.add(Activation('softmax'))

        model.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])

        # Train the model each generation and show predictions against the validation set
        for iteration in range(1, 200):
            print()
            print('-' * 50)
            print('Iteration', iteration)
            model.fit(X_train, y_train, batch_size=BATCH_SIZE, nb_epoch=5,
                      validation_data=(X_val, y_val))
            # Select 10 samples from the validation set at random to visualize errors
            for i in range(10):
                ind = np.random.randint(0, len(X_val))
                rowX, rowy = X_val[np.array([ind])], y_val[np.array([ind])]
                preds = model.predict_classes(rowX, verbose=0)
                q = ctable.decode(rowX[0])
                correct = ctable.decode(rowy[0])
                guess = ctable.decode(preds[0], calc_argmax=False)
                # print('Q', q[::-1] if INVERT else q)
                print('T', correct)
                print(colors.ok + '☑' + colors.close if correct == guess
                      else colors.fail + '☒' + colors.close, guess)
                print('---')


    if __name__ == '__main__':
        main()
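
For reference, here is a quick round trip through the CharacterTable class above (a sketch; the three-letter alphabet is invented purely for illustration). Note what happens to the all-zero padding rows when decoding:

    # Hypothetical alphabet "ACD" with maxlen 5, just to demonstrate encode/decode
    ctable = CharacterTable("ACD", maxlen=5)
    onehot = ctable.encode("CAD")       # shape (5, 3); rows 3-4 stay all zeros (padding)
    print(ctable.decode(onehot[:3]))    # "CAD"
    print(ctable.decode(onehot))        # "CADAA": all-zero rows argmax to the first character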

Xose

May 16, 2016, 8:24:38 AM
to Keras-users
Unless the image of the data is truncated, I don't see that the Epitope is a substring of the Antigen; it looks like a shorter, different sequence. If the Epitope is always a substring of the Antigen, you could tackle the problem as a labeling problem (for each character of the antigen, decide whether it is part of the Epitope sequence) instead of as an RNN encoder-decoder. If it is a different sequence, your approach is right, but I'm not sure the data will be enough; I don't have much experience with RNN encoder-decoders.
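
For illustration, a minimal sketch of that labeling formulation, using the same Keras 1.x API as the code above (the sizes and the y_labels target array are hypothetical stand-ins; you would build y_labels by marking which antigen positions the epitope occupies):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import recurrent, TimeDistributed
    from keras.layers.core import Dense, Activation

    max_antigen_len, n_chars = 50, 20   # stand-ins; take these from your data
    n_samples = 1000                    # stand-in

    # One binary decision per antigen position: is this character part of the epitope?
    model = Sequential()
    model.add(recurrent.LSTM(128, return_sequences=True,
                             input_shape=(max_antigen_len, n_chars)))
    model.add(TimeDistributed(Dense(1)))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # X: one-hot antigens as in the code above; y_labels: 1 where the character
    # belongs to the epitope substring, 0 elsewhere (including padding)
    X = np.zeros((n_samples, max_antigen_len, n_chars), dtype=np.bool)
    y_labels = np.zeros((n_samples, max_antigen_len, 1), dtype=np.bool)
    # model.fit(X, y_labels, batch_size=64, nb_epoch=5)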

Also, I may be wrong, but does your model correctly handle the padded sequences? Do you have an output class for the padding character? Otherwise, your model will try to predict real characters for the padding positions.
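
For example, one way to give padding its own class is to extend the alphabet with a pad symbol and pad every epitope before encoding (a sketch against the code above; the "#" symbol is an arbitrary choice, anything outside the amino-acid alphabet works):

    PAD = "#"  # must not occur in the sequences themselves
    aa_chars = "".join(allchars) + PAD
    # Pad each epitope to max_epitope_len so padded positions map to the PAD class
    epitopes = [e + PAD * (max_epitope_len - len(e)) for e in epitopes]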

For the error analysis, try to get global numbers, especially at the beginning when you have a lot of errors, to catch whether you are always making the same mistake across all sequences.
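
For instance, something along these lines on top of the code above (a sketch; predict_classes is the same call as in the per-sample loop, applied to the whole validation set):

    # Global error numbers over the whole validation set
    preds = model.predict_classes(X_val, verbose=0)   # (n_val, max_epitope_len) class indices
    truth = y_val.argmax(axis=-1)                     # one-hot targets -> class indices
    print("per-character accuracy: %.3f" % (preds == truth).mean())
    print("whole-sequence accuracy: %.3f" % (preds == truth).all(axis=-1).mean())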

gund...@gmail.com

May 18, 2016, 9:36:36 PM
to Keras-users

Hi


Unless the image of the data is truncated, I don't see that the Epitope is a substring of the Antigen; it looks like a shorter, different sequence.


Indeed, it's always a substring of the antigen. My image above is truncated; the actual data set can be seen here, and it's downloadable too.
 
If the Epitope is always a substring of the Antigen, you could tackle the problem as a labeling problem (for each character of the antigen, decide whether it is part of the Epitope sequence) instead of as an RNN encoder-decoder.


Do you mean a multiclass classification problem? Can you give some example code?

Also, I may be wrong, but does your model correctly handle the padded sequences? Do you have an output class for the padding character? Otherwise, your model will try to predict real characters for the padding positions.

I did not include any padding. How can I include that? In the antigen, the epitope, or both?


For the error analysis, try to get global numbers, especially at the beginning when you have a lot of errors, to catch whether you are always making the same mistake across all sequences.



How can I implement that?


Thanks, and I hope to hear from you again.


Best,
P.D.  