Per-feature metric collection


Eugene Kuznetsov

Jun 19, 2019, 7:40:12 PM
to tensor2tensor
I want to extract an internal metric (more specifically, approx_bleu_score) for every feature in the test set.

I know that I can run t2t-eval, but that gives me the average of approx_bleu_score over the entire set, instead of one entry per feature.

I can run t2t-decode and then hack into t2t-bleu to dump per-feature values, but (1) that's not the exact same metric, and (2) it takes way, way longer (for the same set, t2t-decode takes something like 50x as long as t2t-eval, and it's becoming prohibitively slow).

The idea is to use approx_bleu_score values to identify and filter out mislabeled features. (I can do that via t2t-decode, but I think that the internal metric would be better for this purpose, or at least would identify a different set of mislabeled features.)

I tried to dig into the code and got as far as finding where the metric is calculated (utils/metrics.py:problem_metric_fn -> utils/bleu_hook.py:bleu_score). I can modify bleu_score() to return an array instead of a single value, but the return value is then passed to tf.metrics.mean(), and I'm not sure how to get the array past that.
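For reference, the wrapper boils down to something like this (paraphrased from memory; the exact signature varies by version):

  # tensor2tensor/utils/metrics.py (paraphrased)
  def problem_metric_fn(predictions, features, labels):
      # metric_fn here is e.g. bleu_hook.bleu_score, which returns a
      # score tensor plus weights for the batch...
      scores, weights = metric_fn(predictions, labels)
      # ...and this mean() is what collapses everything into a single
      # streaming scalar.
      return tf.metrics.mean(scores, weights)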

Is there any reasonably simple way to get this done?

Lukasz Kaiser

Jun 19, 2019, 8:32:11 PM
to Eugene Kuznetsov, tensor2tensor
Hmm, I think this is one case where the framework fights against what you want.

I think with hacking you're on the right track: how about just removing the mean and returning it all for your use?
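Something like this, perhaps (untested):

  # in tensor2tensor/utils/metrics.py, problem_metric_fn():
  # instead of
  #   return tf.metrics.mean(scores, weights)
  # just pass the raw per-example tensor through:
  return scores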

Lukasz


Eugene Kuznetsov

Jun 19, 2019, 9:02:22 PM
to tensor2tensor
That is not enough: Estimator validates eval_metric_ops and insists on (metric_value, update_op) tuples, so returning the raw tensor fails:

  File "/home/aidev/.local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1460, in _call_model_fn_eval
    features, labels, model_fn_lib.ModeKeys.EVAL, config)
  File "/home/aidev/.local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/aidev/.local/lib/python2.7/site-packages/tensor2tensor/utils/t2t_model.py", line 1414, in wrapping_model_fn
    use_tpu=use_tpu)
  File "/home/aidev/.local/lib/python2.7/site-packages/tensor2tensor/utils/t2t_model.py", line 1533, in estimator_model_fn
    losses_dict)
  File "/home/aidev/.local/lib/python2.7/site-packages/tensor2tensor/utils/t2t_model.py", line 1683, in estimator_spec_eval
    loss=loss)
  File "/home/aidev/.local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/model_fn.py", line 194, in __new__
    eval_metric_ops = _validate_eval_metric_ops(eval_metric_ops)
  File "/home/aidev/.local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/model_fn.py", line 572, in _validate_eval_metric_ops
    'tuples, given: {} for key: {}'.format(value, key))
TypeError: Values of eval_metric_ops must be (metric_value, update_op) tuples, given: Tensor("Neg:0", shape=(?, ?, 1, 1), dtype=float32) for key: metrics-translate_enru_iwslt32k/targets/neg_log_perplexity

Eugene Kuznetsov

Jun 19, 2019, 10:55:51 PM
to tensor2tensor
I think I got it:

tensorflow/python/ops/metrics_impl.py:

def accuracy(values, weights, name=None):
  with variable_scope.variable_scope(name, 'acc', (values, weights)):
    values = math_ops.to_float(values)

    # Fixed-size buffer for per-feature scores; must be at least as
    # large as the eval set. `count` tracks how many slots are filled.
    total = metric_variable([100000], dtypes.float32, name='total')
    count = metric_variable([], dtypes.int32, name='count')

    if weights is None:
      num_values = math_ops.to_float(array_ops.size(values))
    else:
      values, _, weights = _remove_squeezable_dimensions(
          predictions=values, labels=None, weights=weights)
      weights = weights_broadcast_ops.broadcast_weights(
          math_ops.to_float(weights), values)
      values = math_ops.multiply(values, weights)
      # For the BLEU use-case the weights are all ones, so this equals
      # array_ops.size(values); otherwise the two counters would drift.
      num_values = math_ops.reduce_sum(weights)

    with ops.control_dependencies([values]):
      # Scatter this batch's values into the next free slots.
      update_total_op = state_ops.scatter_update(
          total,
          math_ops.range(count, count + array_ops.size(values)),
          values)
    # Bump `count` only after the scatter has read it, so the offsets
    # it uses are still the pre-update ones.
    with ops.control_dependencies([update_total_op]):
      update_count_op = state_ops.assign_add(
          count, math_ops.to_int32(num_values))
    return total, control_flow_ops.group((update_total_op, update_count_op))

(I had to repurpose metrics.accuracy because TensorFlow refused to see the function when I created my own.)

tensor2tensor/utils/metrics.py, problem_metric_fn():
  ...
  return (tf.metrics.accuracy(scores, weights)
          if metric_fn == bleu_hook.bleu_score
          else tf.metrics.mean(scores, weights))

It's really ugly, but seems to work.
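To turn the buffer into a list of suspect features afterwards, the post-processing is plain numpy. A sketch (the dump step and file name are hypothetical -- e.g. fetch the `total` variable's value once the eval loop finishes and np.save it):

  import numpy as np

  # Hypothetical input: the first `count` entries of the 100000-slot
  # `total` buffer above, one approx_bleu value per eval feature.
  scores = np.load('per_feature_bleu.npy')

  # The lowest-scoring features are the mislabeling candidates.
  order = np.argsort(scores)
  for idx in order[:50]:
      print(idx, scores[idx])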

Eugene Kuznetsov

Jun 20, 2019, 12:14:58 AM
to tensor2tensor
By the way, it looks like the default behavior for t2t-eval is to run 100 "steps" (for some unspecified definition of "step") and quit. It took me some time to understand why I was only getting a fraction of the expected data (I had 40k features, and 100 steps only covered around 13k of them). I don't think that's correct, or what users expect.

Lukasz Kaiser

Jun 30, 2019, 1:53:55 PM
to Eugene Kuznetsov, tensor2tensor
Some data-sets have evaluation data that's so large that it's not possible to run on all of it in each eval -- so we run some part, 100 batches (steps). It's not that different from training. But yes - you need to be careful with that part.

Lukasz


Eugene Kuznetsov

Jun 30, 2019, 3:51:46 PM
to tensor2tensor
It is different from training, because training eventually works its way through the entire dataset, but evaluation only tackles the first 100 batches. You are essentially second-guessing the user, who typically chooses the evaluation set to be small enough to be tractable, and does not expect you to effectively throw away an unspecified fraction of it.

This is made even more confusing by the choice of 100 batches, since most reasonable people would look at the progress, which is reported as "10/100", "20/100", etc., and assume that the tool is reporting percentages (10% of the eval set, 20% of the eval set, and so on).

At the very least, t2t-eval should either default to running the entire evaluation set, or have an explicit command-line flag to enable this behavior (in addition to the existing command-line option to set the number of steps). Limiting eval to 100 batches would be acceptable for periodic evaluation inside t2t-trainer, assuming that it cycles through the set and does not just go back to the start every time.
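In the meantime, the workaround is to size the step count to the eval set by hand with that option. In my case 100 steps covered ~13k of 40k features, i.e. roughly 130 features per batch, so something like:

  t2t-eval ... --eval_steps=308   # ceil(40000 / 130); assuming the flag is --eval_steps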

