Most performant way to read in-memory matrix training set

Patrick McCarthy

Mar 9, 2021, 9:22:04 AM
to TensorFlow End Users - GETTING STARTED, TUTORIALS & HOW-TO'S

I'm looking to find the fastest possible way to train a small model from datasets in memory.

The model itself is simply Word2Vec, but instead of learning online from a text file, I'm following the method of Li 2019 (https://link.springer.com/article/10.1007/s41019-019-0096-6).

Basically, to train a vocabulary V for S steps, you create a prestaged target matrix S*V, and similarly a negative contrasting matrix which can also be S*V. To train, at each step i the algorithm takes a slice target[i,:] from either the target or the negative matrix.
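
To make the layout concrete, here is a toy NumPy sketch of the slicing described above. The sizes, the random contents, and the helper name training_slice are made up for illustration; the real matrices would be built with the method from the paper, which this does not reproduce.

```python
import numpy as np

S, V = 4, 6  # hypothetical sizes: 4 training steps, vocabulary of 6 words
rng = np.random.default_rng(0)

# Placeholder contents (assumption): random word ids standing in for the
# prestaged target and negative matrices, each of shape S*V.
target = rng.integers(0, V, size=(S, V), dtype=np.int32)
negative = rng.integers(0, V, size=(S, V), dtype=np.int32)

# Every slice is paired with the full vocabulary, one id per word.
vocab = np.arange(V, dtype=np.int32)


def training_slice(matrix, i, label):
    """Return (vocab ids, paired ids, labels) for step i of one matrix."""
    paired = matrix[i, :]                       # the slice matrix[i,:]
    labels = np.full(V, label, dtype=np.int32)  # 1 for target, 0 for negative
    return vocab, paired, labels


words, contexts, labels = training_slice(target, 2, 1)
neg_words, neg_contexts, neg_labels = training_slice(negative, 2, 0)
```

Each step therefore yields V (word, paired word, label) triples, which is the same structure the generators below produce.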

QUESTION - What is the fastest way to feed this data to a keras model?

According to the TensorBoard profiler, my current approach has an average step time of 67 µs (min 24 µs, max 6707 µs) on Google Colab.

My current approach is to put each matrix into a generator, and use tf.data.Dataset.from_generator to read from them, and then sample between the two datasets:

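from typing import Any, Dict

import numpy as np
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def build_pos_neg_generators(positive_matrix: np.ndarray, negative_matrix: np.ndarray, conf: Dict[str,Any]) -> tf.data.Dataset: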

    positive_t = tf.transpose(tf.constant(positive_matrix))
    negative_t = tf.transpose(tf.constant(negative_matrix))

    def pos_generator():
    
        VOCAB = tf.expand_dims(tf.range(conf['vocab_size']),1)
        # Labels are declared as int32 in output_signature, so create them as int32.
        ONES = tf.ones((conf['vocab_size'],1), dtype=tf.int32)
        
        for i in range(conf['num_pos_columns']):
            yield (
                (VOCAB,
                tf.expand_dims(tf.nn.embedding_lookup(positive_t,i),1)),
                ONES
            )

    def neg_generator():
        
        VOCAB = tf.expand_dims(tf.range(conf['vocab_size']),1)
        # Labels are declared as int32 in output_signature, so create them as int32.
        ZEROS = tf.zeros((conf['vocab_size'],1), dtype=tf.int32)

        for i in range(conf['num_neg_columns']):
            yield (
                (VOCAB,
                tf.expand_dims(tf.nn.embedding_lookup(negative_t,i),1)),
                ZEROS
            )

    num_ns = conf['num_ns']
    vocab_size = conf['vocab_size']

    pos_dset = (
        tf.data.Dataset.from_generator(pos_generator,
            output_signature=(
                            (tf.TensorSpec(shape=(vocab_size,1),dtype=tf.int32),
                            tf.TensorSpec(shape=(vocab_size,1),dtype=tf.int32)),
                            tf.TensorSpec(shape=(vocab_size,1),dtype=tf.int32)
                            ))
    )        
    neg_dset = (
        tf.data.Dataset.from_generator(neg_generator,
            output_signature=(
                            (tf.TensorSpec(shape=(vocab_size,1),dtype=tf.int32),
                            tf.TensorSpec(shape=(vocab_size,1),dtype=tf.int32)),
                            tf.TensorSpec(shape=(vocab_size,1),dtype=tf.int32)
                            ))
        .repeat(num_ns)
    )

    # Draw one positive example for every num_ns negative examples, on average.
    return tf.data.experimental.sample_from_datasets(
        [pos_dset, neg_dset],
        weights=[1 / (num_ns + 1), num_ns / (num_ns + 1)],
    ).prefetch(AUTOTUNE)

Is there a better way to read from these two fixed matrices?