Effective Detection of Multimedia Protocol Tunneling using Machine Learning
Diogo Barradas, Nuno Santos, Luís Rodrigues
https://censorbib.nymity.ch/#Barradas2018a
https://www.usenix.org/conference/usenixsecurity18/presentation/barradas
This paper builds better classifiers for
Facet (
https://censorbib.nymity.ch/#Li2014a),
CovertCast (
https://censorbib.nymity.ch/#McPherson2016a), and
DeltaShaper (
https://censorbib.nymity.ch/#Barradas2017a). These three
systems all work by tunneling data inside an encrypted video stream.
Their original authors evaluated them in three different ways. This
paper unifies the evaluation, checking each system against each of the
previously used evaluation techniques: it turns out that the Χ² test (as
used by the Facet paper) outperforms Kullback–Leibler divergence (used
by CovertCast) and the earth mover's distance (used by DeltaShaper).
CovertCast was especially detectable by the Χ² test; the authors
speculate that this may be because CovertCast was tuned for video
settings different from those YouTube now uses.
They then test additional classifiers based on decision trees, and find
that these outperform even the Χ² test. (They tried only decision
tree–based techniques, not any other ML algorithms.) Their training and
test data are synthetic: between 200 and 1000 videos from YouTube,
either livestreams, videos of chat sessions, or generic popular videos,
depending on the intended usage of the system under test, captured for
60 seconds in the steady state. They (separately) test two different
feature sets: (1) summary statistics such as packet length mean,
variance, and percentiles, as well as packet burst lengths; and (2)
packet length histograms, quantized into 5-byte buckets. Feature set (2)
performs a little better. They report a 70% true positive rate against
Facet, with a very small false positive rate: by "very small" they mean
1% or 2%. (Cf.
https://censorbib.nymity.ch/#Wang2015a §7, which
speculates on the impact of an FPR of 1%.) The amount of state the
classifier requires per connection is around 2 KB.
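As a rough illustration of this pipeline, the sketch below builds both
feature sets and trains a random forest, one kind of decision tree
ensemble, with scikit-learn. The data are random placeholders, and none
of the parameter values are taken from the paper:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    def summary_features(lengths):
        # Feature set (1): summary statistics of a flow's packet lengths.
        # (The paper also uses packet burst lengths, omitted here.)
        percentiles = np.percentile(lengths, [10, 25, 50, 75, 90])
        return np.concatenate(([np.mean(lengths), np.var(lengths)],
                               percentiles))

    def histogram_features(lengths, bucket=5, max_len=1500):
        # Feature set (2): packet lengths quantized into 5-byte buckets.
        bins = np.arange(0, max_len + bucket, bucket)
        hist, _ = np.histogram(lengths, bins=bins)
        return hist / max(hist.sum(), 1)

    # Placeholder data: one 60-second capture yields one feature vector.
    rng = np.random.default_rng(0)
    X = np.array([histogram_features(rng.integers(60, 1501, size=500))
                  for _ in range(400)])
    y = rng.integers(0, 2, size=400)   # 1 = covert, 0 = legitimate

    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))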
Finally, they try out some unsupervised techniques for classification.
The advantage of these is that they don't require separate labeled
covert/non-covert training sets, though they do need a clean corpus of
known non-covert traffic. They test a one-class SVM, an autoencoder, and
an isolation forest. None of them performs as well as the supervised
techniques, even when given the advantage of labeled covert traffic for
optimizing their hyperparameters. Only the autoencoder approach shows
promise.
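A minimal scikit-learn sketch of the three approaches follows, with an
MLP trained to reproduce its input standing in for the paper's
autoencoder; all shapes, thresholds, and hyperparameters here are
illustrative assumptions:

    import numpy as np
    from sklearn.svm import OneClassSVM
    from sklearn.ensemble import IsolationForest
    from sklearn.neural_network import MLPRegressor

    # Placeholder features from a corpus of known non-covert traffic.
    rng = np.random.default_rng(0)
    X_clean = rng.random((300, 60))

    # One-class SVM and isolation forest model "normal" traffic only;
    # predict() returns -1 for flows falling outside the learned model.
    ocsvm = OneClassSVM(nu=0.01).fit(X_clean)
    iforest = IsolationForest(contamination=0.01, random_state=0)
    iforest.fit(X_clean)

    # Crude autoencoder stand-in: an MLP trained to reconstruct its
    # input; flows with high reconstruction error are flagged.
    ae = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                      random_state=0)
    ae.fit(X_clean, X_clean)

    X_new = rng.random((10, 60))      # unlabeled traffic to score
    recon_error = np.mean((ae.predict(X_new) - X_new) ** 2, axis=1)
    print(ocsvm.predict(X_new))       # -1 = flagged as covert
    print(iforest.predict(X_new))     # -1 = flagged as covert
    print(recon_error > 0.1)          # threshold is illustrative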