multi-class scoring and multi-label classification


Joseph Picone

Apr 13, 2024, 12:15:22 AM
to nedc_research, ECE 8527, ece_sd_...@googlegroups.com, ece_sd_20...@googlegroups.com
Here is a nice discussion of how you convert a 4x4 confusion matrix to
F1 scores for a multi-class detection problem:

https://www.baeldung.com/cs/multi-class-f1-score#:~:text=For%20a%20multi%2Dclass%20classification,distinct%20classifiers%20for%20each%20class.

For our cardiology data, we have a slightly different situation. There
are six possible outcomes, but the output can have one or more of them.
For example:

ref = [1, 0, 0, 1, 0, 0]
ref = [1, 1, 1, 0, 0, 0]

are possible. So there are 2^6 = 64 possible outcomes. We could compute
F1 scores for each of these, and then average them, but that is a bit of
overkill.

A general problem you don't see discussed much is how to score a
system when the output can have multiple attributes. Continuing the
example above:

        [A1, A2, A3, A4, A5, A6]
ref  =  [ 1,  0,  0,  1,  0,  0]
hyp1 =  [ 1,  1,  0,  0,  1,  0]
hyp2 =  [ 1,  0,  1,  1,  0,  1]

Is hyp1 better or worse than hyp2?

We could compute metrics for each attribute using a one vs. all approach:

Sens(A1)
Sens(A2)
...

We could then average the per-attribute sensitivities into an overall
sensitivity score. We could also use a weighted average if the
attributes do not all occur the same number of times.

This problem is apparently now called multi-label or multi-output
classification:

https://en.wikipedia.org/wiki/Multi-label_classification
https://scikit-learn.org/stable/modules/multiclass.html

The scoring paradigm we will use for cardiology is the macro- and
micro-averaging described here:

https://www.evidentlyai.com/classification-metrics/multi-class-metrics#:~:text=Accuracy%20measures%20the%20proportion%20of,predictions%20made%20by%20the%20model.
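
If you want to sanity-check your own numbers, scikit-learn implements
both averages for multi-label indicator arrays. A quick sketch (toy
data again):

from sklearn.metrics import precision_score, recall_score, f1_score

ref = [[1, 0, 0, 1, 0, 0],   # rows = records, cols = the six labels
       [1, 1, 1, 0, 0, 0]]
hyp = [[1, 1, 0, 0, 1, 0],
       [1, 0, 1, 1, 0, 1]]

# micro: pool TP/FP/FN over all labels, then compute the metric once
# macro: compute the metric per label, then take the unweighted mean
for avg in ("micro", "macro"):
    p = precision_score(ref, hyp, average=avg, zero_division=0)
    r = recall_score(ref, hyp, average=avg, zero_division=0)
    f = f1_score(ref, hyp, average=avg, zero_division=0)
    print("%s: prec / rec / f1 = %.4f / %.4f / %.4f" % (avg, p, r, f))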

-Joe

Joseph Picone

Apr 13, 2024, 4:35:49 PM
to ece_sd_...@googlegroups.com, ECE 8527
I revised the scoring script based on this very nice article:

https://medium.com/synthesio-engineering/precision-accuracy-and-f1-score-for-multi-label-classification-34ac6bdfb404

Things are now much clearer and cleaner.

What we are computing are measures known as micro-F1 and macro-F1. For
the cardiology data, here is what we see:

nedc_130_[1]: p
/data/isip/data/tnmg_code/ece_8527/evaluation
nedc_130_[1]: score.py ref_tnmg.csv hyp_tnmg.csv
Metric 1: simple accuracy
err / acc = 0.0242 / 0.9758

Metric 2: micro accuracy / precision / recall / f1
micro acc / prec / rec / f1 = 0.9960 / 0.9207 / 0.9557 / 0.9379

Metric 3: macro accuracy / precision / recall / f1
 [1dAVb] acc / prec / rec / f1 = 0.9927 / 0.8667 / 0.9286 / 0.8966
  [RBBB] acc / prec / rec / f1 = 0.9952 / 0.8947 / 1.0000 / 0.9444
  [LBBB] acc / prec / rec / f1 = 1.0000 / 1.0000 / 1.0000 / 1.0000
    [SB] acc / prec / rec / f1 = 0.9952 / 0.8333 / 0.9375 / 0.8824
    [AF] acc / prec / rec / f1 = 0.9964 / 1.0000 / 0.7692 / 0.8696
    [ST] acc / prec / rec / f1 = 0.9964 / 0.9474 / 0.9730 / 0.9600
   macro acc / prec / rec / f1 = 0.9960 / 0.9237 / 0.9347 / 0.9255

It is interesting to see those LBBB numbers :)
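
For the curious, each line above reduces to four counts per label.
A minimal sketch of the arithmetic (this is just the idea, not the
actual score.py):

import numpy as np

def label_metrics(ref, hyp):
    # ref, hyp: binary indicator matrices (rows = records, cols = labels)
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    tp = ((ref == 1) & (hyp == 1)).sum(axis=0)
    fp = ((ref == 0) & (hyp == 1)).sum(axis=0)
    fn = ((ref == 1) & (hyp == 0)).sum(axis=0)
    tn = ((ref == 0) & (hyp == 0)).sum(axis=0)
    acc  = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / np.maximum(tp + fp, 1)        # 0 when nothing detected
    rec  = tp / np.maximum(tp + fn, 1)        # 0 when no positives
    f1   = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
    return acc, prec, rec, f1

# the macro line is just the mean over the per-label columns, e.g.:
# macro_f1 = label_metrics(ref, hyp)[3].mean()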

This is the scoring program we will use for the ECE 8527 class project.

Later I will turn this into nedc_cardio_eval :)

-Joe

Joseph Picone

Apr 21, 2024, 1:17:42 PM
to ECE 8527
> Just to confirm: is metric 1 (simple accuracy) unhealthy vs. healthy?
>

Not quite.

> Metric 3: macro accuracy / precision / recall / f1
>   [1dAVb] acc / prec / rec / f1 = 0.9927 / 0.8667 / 0.9286 / 0.8966
>    [RBBB] acc / prec / rec / f1 = 0.9952 / 0.8947 / 1.0000 / 0.9444
>    [LBBB] acc / prec / rec / f1 = 1.0000 / 1.0000 / 1.0000 / 1.0000
>      [SB] acc / prec / rec / f1 = 0.9952 / 0.8333 / 0.9375 / 0.8824
>      [AF] acc / prec / rec / f1 = 0.9964 / 1.0000 / 0.7692 / 0.8696
>      [ST] acc / prec / rec / f1 = 0.9964 / 0.9474 / 0.9730 / 0.9600

>     macro acc / prec / rec / f1 = 0.9960 / 0.9237 / 0.9347 / 0.9255

There are six possible outcomes for each slide. A slide can have any
combination of these six things. So your system does this:

.dat file               machine
8-channel            => learning => [0,1,0,1,0,1]
16-bit samples          magic       6D vector of
300 Hz sample freq                  outcomes
2200 samples/channel

This is not a two-way decision. There are 2^6 possible outcomes.
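
In code terms, one way to read the simple accuracy above: a record
only counts as correct when the whole 6-D vector matches. A sketch
with made-up vectors (numpy assumed):

import numpy as np

ref = np.array([[0, 0, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 0]])
hyp = np.array([[0, 0, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 0, 0, 0]])

# a record is correct only if all six bits agree (2 of 3 here)
acc = (ref == hyp).all(axis=1).mean()
print("err / acc = %.4f / %.4f" % (1.0 - acc, acc))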

I sorted the data into healthy (all vectors are [0,0,0,0,0,0])
and unhealthy (all vectors have at least one "1") just to make the
data a little easier to understand. For training and dev testing you
really should pool all the data into one file.

For evaluation, I will give you the .dat files, but not the answers. You
will give me a spreadsheet of vectors that is in the same format as the
scoring examples:

nedc_130_[1]: head ../evaluation/tests/ref_tnmg.csv
1dAVb,RBBB,LBBB,SB,AF,ST
0,0,0,0,0,0
0,0,1,0,0,0
0,0,0,0,0,0
0,0,0,0,0,0
0,0,0,0,0,0
0,0,0,0,0,0
0,0,0,0,0,0
0,0,0,0,0,0
0,0,0,0,0,0
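
Writing your hypotheses in that format takes only a few lines of
Python (standard library csv; the hyp list here is a placeholder for
whatever your system produced, one 0/1 vector per .dat file, in the
same order as the reference):

import csv

LABELS = ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]

# one 0/1 vector per .dat file (toy values shown here)
hyp = [[0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0]]

with open("hyp_tnmg.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(LABELS)        # header: 1dAVb,RBBB,LBBB,SB,AF,ST
    w.writerows(hyp)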

-Joe


