IMRBOK Help Needed: Need Help with VCDB and R


Jeff Lowder

Nov 25, 2022, 9:37:12 PM
to SiRA-public
Hi,

Ask: I'm looking for someone with more R experience than I have to help analyze the incident data in the VERIS Community Database (VCDB, http://veriscommunity.net/vcdb.html). Specifically, I'm looking for a volunteer to generate a "trellis display" (or "small multiples") of 7 Partial Dependence Plots (PDPs), one for each of the 7 predictor variables listed below.

- Predictor Variables (all 7 of which correspond to the 7 categories of "Action" in the VERIS schema):
 -- malware
 -- hacking
 -- social
 -- misuse
 -- physical
 -- error
 -- environmental
- Prediction Variable: Time to Discovery (timeline.discovery.value, as defined at http://veriscommunity.net/discovery.html#section-incident-timeline)

Would someone be willing to assist with this?

Thanks in advance!

Jeff

Corey Neskey

Dec 4, 2022, 1:19:46 PM
to SiRA-public
Hello Jeff,

Play with the plotmo package. I made a basic random forest model and named it classifier_RF. Assuming you already have a model built that you want a PDP for, you'd just run this, swapping in your own model for 'classifier_RF':

plotmo(classifier_RF,
       pmethod="partdep")

Which makes:

[Image: Rplot.png, the plot produced by the plotmo call above]
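In case it helps to see what a partial dependence value actually is, here's a minimal sketch in base R. The data are simulated (not VCDB), `lm()` is only a stand-in for whatever model you fit, and `partial_dependence()` is a hypothetical helper I made up for illustration. The idea: force one predictor to a fixed value in every row, predict, and average; repeat for each value of interest.

```r
# Simulated stand-in data (NOT real VCDB records): two 0/1 action flags
# and a numeric response playing the role of time to discovery.
set.seed(1)
n <- 200
toy <- data.frame(
  hacking = rbinom(n, 1, 0.5),
  error   = rbinom(n, 1, 0.3)
)
toy$discovery <- 10 + 5 * toy$hacking - 3 * toy$error + rnorm(n)

# Any model with a predict() method would do; lm() is just a stand-in.
fit <- lm(discovery ~ hacking + error, data = toy)

# Hypothetical helper: partial dependence of the prediction on `var`.
# For each grid value, overwrite `var` in every row, predict, then average.
partial_dependence <- function(model, data, var, grid = c(0, 1)) {
  sapply(grid, function(v) {
    forced <- data
    forced[[var]] <- v
    mean(predict(model, newdata = forced))
  })
}

pd_hacking <- partial_dependence(fit, toy, "hacking")
pd_hacking  # average prediction with hacking forced to 0, then to 1
```

As far as I understand the plotmo docs, `pmethod="partdep"` does this kind of averaging over the data, whereas the default `pmethod="plotmo"` holds the other predictors at fixed background values instead.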

For nerds:
``` R
library(readr)
library(tidyverse)
library(randomForest)
library(plotmo)
library(caTools)

# 1. Import the veris csv
vcdb <- read_csv("VCDB-master/data/csv/vcdb.csv/vcdb.csv")

# 2. Narrow to just the `action.*` columns and `timeline.discovery.value`
vcdb_ad <- select(vcdb,starts_with("action."))
vcdb_ad$"timeline.discovery.value" <- vcdb$"timeline.discovery.value"
vcdb_ad <- select(vcdb_ad,-ends_with(".notes"))
vcdb_ad <- select(vcdb_ad,-ends_with(".cve"))
vcdb_ad <- select(vcdb_ad,-contains("vector"))
vcdb_ad <- select(vcdb_ad,-contains("variety"))
vcdb_ad <- select(vcdb_ad,-contains("result"))
vcdb_ad <- select(vcdb_ad,-contains("action.social."))
vcdb_ad <- select(vcdb_ad,-contains("action.malware."))
vcdb_ad <- select(vcdb_ad,-contains("unknown"))

# 3. Trim the leading "action." from the variable names.
#    (Anchor and escape the dot; an unescaped "." matches any character.)
names(vcdb_ad) <- sub("^action\\.", "", names(vcdb_ad))

# 4. Drop rows w/o timeline values (missing values are read in as NA)
vcdb_ad_remna <- filter(vcdb_ad, !is.na(timeline.discovery.value))

# 5. Change TRUE and FALSE to 1, 0 (multiplying by 1 coerces logical columns to numeric)
vcdb_ad_remna_char <- vcdb_ad_remna*1

# 6. Fit a random forest to the data

# Structure
str(vcdb_ad_remna_char)

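# Splitting data into train and test sets.
# sample.split() expects the outcome vector, not the whole data frame.
set.seed(123)  # arbitrary seed so the split is reproducible
split <- sample.split(vcdb_ad_remna_char$timeline.discovery.value, SplitRatio = 0.7)
table(split)

train <- subset(vcdb_ad_remna_char, split == TRUE)
test <- subset(vcdb_ad_remna_char, split == FALSE)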

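# Fitting Random Forest to the train dataset.
# Column 8 is timeline.discovery.value (the response); the other 7 are the action flags.
# Note: because the response is numeric here, randomForest fits a regression forest.
classifier_RF = randomForest(x = train[-8],
                             y = train$timeline.discovery.value,
                             ntree = 10)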
classifier_RF

# Predicting the Test set results
y_pred = predict(classifier_RF, newdata = test[-8])

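# Cross-tabulate observed vs. predicted values.
# (This only reads as a confusion matrix if the response is a factor.)
confusion_mtx = table(test[, 8], y_pred)
confusion_mtx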

# Plotting model
plot(classifier_RF)

# Importance scores (printed as a table)
importance(classifier_RF)

# Variable importance plot
varImpPlot(classifier_RF)

plotres(classifier_RF)

plotmo(classifier_RF,
       pmethod="partdep")

```
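For the original ask (one PDP per action category, laid out as small multiples), something like the sketch below might be a starting point. It runs on simulated 0/1 flags rather than real VCDB data, and the names `toy` and `rf` are made up for illustration. As I read the plotmo docs, `all1 = TRUE` asks for every predictor rather than plotmo's default selection, and `degree2 = FALSE` drops the interaction plots, which should leave a grid of 7 single-variable panels.

```r
library(randomForest)
library(plotmo)

set.seed(42)
n <- 300
actions <- c("malware", "hacking", "social", "misuse",
             "physical", "error", "environmental")

# Simulated 0/1 flags standing in for the seven action columns
# (NOT real VCDB data), plus a made-up numeric response.
toy <- as.data.frame(matrix(rbinom(n * length(actions), 1, 0.3),
                            nrow = n,
                            dimnames = list(NULL, actions)))
toy$discovery <- 5 + 4 * toy$hacking + 2 * toy$malware + rnorm(n)

# Numeric response, so randomForest fits a regression forest.
rf <- randomForest(x = toy[actions], y = toy$discovery, ntree = 50)

# One partial dependence panel per predictor, no interaction panels.
plotmo(rf,
       pmethod = "partdep",
       all1 = TRUE,
       degree2 = FALSE,
       caption = "Partial dependence on each action category (simulated data)")
```

With the real data you'd swap `rf` for `classifier_RF` from the script above and keep the same `plotmo()` arguments.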