1. Run your session with device placement logging:
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))  # prints each op's assigned device
That'll show whether any ops were errantly placed on the CPU. Misplacement can cause large transfers: for instance, if your dataset is a tensor on the CPU and a `Rank` op on it runs on the GPU, the entire dataset gets copied to the GPU on each invocation.
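If the log reveals a misplaced op, you can pin it explicitly with `tf.device`. Here's a rough sketch of co-locating such ops with the data (the `train_data` array here is just a stand-in for illustration):

with tf.device('/cpu:0'):
    dataset = tf.constant(train_data)  # big dataset tensor stays on the CPU
    rank = tf.rank(dataset)            # pin shape-bookkeeping ops like Rank next to the data,
                                       # so the full tensor is never copied to the GPU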
2. Don't use feed_dict. Anything that goes through feed_dict lives in Python-land, hence on the CPU, and has to be copied to the GPU on every call. If you have Python values you need to reuse, save them into a TensorFlow variable once and read from the variable afterwards. For instance, here's a snippet that saves an MNIST `batchSize` x 10 labels matrix into a variable:
targets = tf.Variable(tf.zeros((batchSize, 10), dtype=dtype))  # variable lives in device memory
targets_init = targets.assign(labels_onehot)
sess.run(targets_init, feed_dict={labels_placeholder: train_labels[:batchSize]})  # feed once, not per step
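After that one-time assignment, later steps read `targets` straight from device memory instead of feeding labels each iteration. A rough sketch of the payoff, assuming you already have a `logits` tensor and some choice of optimizer (neither is part of the snippet above, and `numSteps` is a placeholder):

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
for _ in range(numSteps):
    sess.run(train_op)  # no feed_dict here: the labels already live on the GPU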