Rapidminer Turbo Prep

Ermelindo Klatt

Aug 5, 2024, 1:21:53 AM

to incetdecam

"RapidMiner Studio has always provided sophisticated data prep capabilities, but it was a bit daunting for analysts to learn," said Ingo Mierswa, RapidMiner founder. "With Turbo Prep, analysts now have access to a purpose-built data prep experience right inside RapidMiner Studio. When combined with RapidMiner Auto Model, analysts can now easily build predictive models on their own. Even experienced data scientists will love the productivity gains they'll get with Turbo Prep."

Turbo Prep is included with paid and academic RapidMiner Studio licenses. Everyone who upgrades to RapidMiner 9 will automatically receive a 30-day trial of RapidMiner Studio Large, including Turbo Prep.

RapidMiner brings artificial intelligence to the enterprise through an open and extensible data science platform. Built for analytics teams, RapidMiner unifies the entire data science lifecycle from data prep to machine learning to predictive model deployment. 400,000 analytics professionals use RapidMiner products to drive revenue, reduce costs, and avoid risks. For more information, visit rapidminer.com.

But not for much longer, folks! RapidMiner recently released a really nice feature for data preparation: RapidMiner Turbo Prep. You will soon know why we picked this name, but the basic idea is that Turbo Prep provides a new data preparation experience that is fast and fun to use, with a drag-and-drop interface.

Once you load the data, it can be seen immediately in a data-centric view, along with some data quality indicators. At the top of the columns, the distributions and quality measurements of the data are displayed. These indicate whether the columns will be helpful for machine learning and modeling. If, for example, the majority of the data in a column is missing, this could confuse a machine learning model, so it is often better to remove the column altogether. If the column acts as an ID, practically all of its values occur only once in the data set, so it is not useful for identifying patterns and should also be removed.
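To make the two quality checks above concrete, here is a minimal Python sketch that computes the share of missing values and the share of unique values per column. It is not how Turbo Prep computes its indicators; the function names, the missing-value markers, and the 0.5 and 0.95 thresholds are all assumptions chosen for illustration.

```python
import csv
import io

def column_quality(rows, missing_markers=("", "?", "NA")):
    """Return {column: (missing_fraction, unique_fraction)} for a list of dict rows."""
    if not rows:
        return {}
    total = len(rows)
    report = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        present = [v for v in values if v not in missing_markers]
        missing_fraction = 1 - len(present) / total
        # Share of distinct values among the non-missing ones; near 1.0 means ID-like.
        unique_fraction = len(set(present)) / len(present) if present else 0.0
        report[col] = (missing_fraction, unique_fraction)
    return report

def columns_to_drop(report, max_missing=0.5, max_unique=0.95):
    """Flag columns that are mostly missing or behave like an ID (assumed thresholds)."""
    return [
        col for col, (missing, unique) in report.items()
        if missing > max_missing or unique > max_unique
    ]

# Tiny made-up table: 'id' is unique per row, 'notes' is mostly empty.
sample = io.StringIO(
    "id,city,notes\n"
    "1,Boston,\n"
    "2,Boston,\n"
    "3,Austin,ok\n"
    "4,Austin,\n"
)
rows = list(csv.DictReader(sample))
report = column_quality(rows)
print(columns_to_drop(report))  # ['id', 'notes']
```

On real data the thresholds would need tuning, since a numeric measurement column can also have mostly unique values without being an ID.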

Now, we need to merge the two data sets together. Turbo Prep uses smart algorithms to intelligently identify data matches. Two data sets are a good match if they have two columns that match with each other. And two columns match well with each other if they contain similar values. In this example, we see a pretty high match of 94%.
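The article does not say which algorithm Turbo Prep uses to score matches. As a rough illustration of the idea that two columns match when they contain similar values, the sketch below scores column pairs with Jaccard overlap of their distinct values. The choice of Jaccard, the function names, and the sample tables are assumptions, not RapidMiner's method.

```python
def column_match(values_a, values_b):
    """Jaccard overlap of the distinct values in two columns (0.0 to 1.0)."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def best_join_columns(table_a, table_b):
    """Return (score, column_a, column_b) for the best-matching column pair.

    Each table is a dict mapping column name -> list of values.
    """
    return max(
        (column_match(vals_a, vals_b), name_a, name_b)
        for name_a, vals_a in table_a.items()
        for name_b, vals_b in table_b.items()
    )

# Made-up tables: the customer identifiers overlap on 3 of 5 distinct values.
customers = {
    "customer_id": ["c1", "c2", "c3", "c4"],
    "region": ["east", "west", "east", "west"],
}
orders = {
    "cust": ["c1", "c2", "c3", "c5"],
    "amount": ["10", "20", "30", "40"],
}
score, col_a, col_b = best_join_columns(customers, orders)
print(f"{col_a} <-> {col_b}: {score:.0%}")  # customer_id <-> cust: 60%
```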

And just like Turbo Prep, Auto Model processes can be opened in RapidMiner Studio, showing the full process with annotations. With Auto Model, every step is explained with its importance and why certain predictions are made during model creation. We can see exactly how the full model was created; there are no black boxes!

This technical article will teach you how to pre-process data, create your own neural networks, and train and evaluate models using the US-CERT's simulated insider threat dataset. The methods and solutions are designed for non-domain experts, particularly cyber security professionals. We will start our journey with the raw data provided by the dataset and provide examples of different pre-processing methods to get it "ready" for the AI solution to ingest. We will ultimately create models that can be re-used for additional predictions based on security events. Throughout the article, I will also point out the applicability and return on investment depending on your existing Information Security program in the enterprise.

Note: To use and replicate the pre-processed data and steps we use, prepare to spend 1-2 hours on this page. Stay with me and try not to fall asleep during the data pre-processing portion. What many tutorials don't state is that, if you're starting from scratch, data pre-processing takes up to 90% of your time when doing projects like these.

The author provides these methods, insights, and recommendations *as is* and makes no claim of warranty. Please do not use the models you create in this tutorial in a production environment without sufficient tuning and analysis before making them a part of your security program.

It's important for newcomers to any data science discipline to know that the majority of your time will be spent pre-processing and analyzing the data you have. That includes cleaning up the data, normalizing it, extracting any additional meta insights, and then encoding the data so that it is ready for an AI solution to ingest.

Examining the raw US-CERT data requires you to download compressed files that must be extracted. Note just how large the sets are compared to how much we will use and reduce at the end of the data pre-processing.

In our article, we saved a bunch of time by going directly to answers.tar.bz2, which contains the insiders.csv file for identifying which datasets and individual extracted records are of value. It is worth stating that the index provided also has correlated record numbers in extended data, such as the file and psychometric data. We didn't use the extended metadata in this tutorial because of the extra time needed to correlate and consolidate all of it into a single CSV.

To see a more comprehensive set of feature sets extracted from this same data, consider checking out this research paper called "Image-Based Feature Representation for Insider Threat Classification." We'll be referring to that paper later in the article when we examine our model accuracy.

Before getting the data encoded and ready for a function to read it, we need to get the data extracted and categorized into columns, including the one we need to predict. Let's use good old Excel to insert a column into the CSV. Prior to taking the screenshot, we added all the rows from the datasets referenced in "insiders.csv" for scenario 2.

The scenario (2) is described in scenarios.txt: "User begins surfing job websites and soliciting employment from a competitor. Before leaving the company, they use a thumb drive (at markedly higher rates than their previous activity) to steal data."

The photo above shows a snippet of all the different record types, essentially appended to each other and sorted by date. Note that the different vectors (http vs. email vs. device) do not all align easily and have different contexts in the columns. This is not optimal by any means, but since the insider threat scenario includes multiple event types, this is what we'll have to work with for now. This is usually the case with data that you correlate based on time and multiple events tied to a specific attribute or user, as a SIEM does.

In the aggregation set, we combined the relevant CSVs after moving all of the items mentioned in insiders.csv for scenario 2 into the same folder. To build the entire 'true positive' only portion of the dataset, we used PowerShell as shown below:
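For readers who prefer Python, the same kind of aggregation can be sketched as follows. This is not the author's PowerShell; it is a minimal sketch that assumes every extracted file in the folder is a CSV with a header row, and the function name and the added "vector" column (taken from the file name) are hypothetical.

```python
import csv
from pathlib import Path

def combine_csvs(folder, output_path, source_column="vector"):
    """Append every CSV in `folder` into one file, keeping the union of columns.

    Write the output outside `folder`, otherwise a second run would read it back in.
    """
    rows, fieldnames = [], [source_column]
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row[source_column] = path.stem  # remember which file the row came from
                rows.append(row)
                for name in row:
                    if name not in fieldnames:
                        fieldnames.append(name)
    with open(output_path, "w", newline="") as handle:
        # restval fills columns that a given record type does not have.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

# Example call (paths are placeholders):
# combine_csvs("scenario2_extracts", "true_positives.csv")
```

Because http, email, and device records have different columns, the output has blank cells wherever a record type lacks a field, which mirrors the misaligned columns described above.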

Right now we have a completely imbalanced dataset that contains only true positives. We'll also have to add true negatives, and the best approach is to have an equal number of records of each type representing non-threat activity, for a 50/50 split. This is almost never the case with security data, so we'll do what we can, as you'll find below. I also want to point out that if you're doing manual data processing in an OS shell, whatever you import into a variable stays in memory and is not released or garbage collected by itself. As you can see from my PowerShell memory consumption, after a bunch of data manipulation and CSV wrangling I've bumped my usage up to 5.6 GB.
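One common way to approach a 50/50 split is to randomly downsample the larger class. The sketch below illustrates that idea only; the article does not state that this exact procedure was used, and the function name and fixed seed are assumptions.

```python
import random

def balance_classes(positives, negatives, seed=42):
    """Randomly downsample the larger class so both classes have equal size."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    target = min(len(positives), len(negatives))
    return rng.sample(positives, target), rng.sample(negatives, target)

# Made-up records: 3 threat rows and 10 benign rows.
positives = [{"id": i} for i in range(3)]
negatives = [{"id": i} for i in range(100, 110)]
pos, neg = balance_classes(positives, negatives)
print(len(pos), len(neg))  # 3 3
```

Downsampling throws data away, so with very few true positives it may be worth comparing against class weighting or oversampling instead.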

Let's look at the R1 dataset files. We'll need to pull records that we know are confirmed true negatives (non-threats) for each of the 3 types, from files with the same names we used in the true positive dataset extracts (again, they come from the R1 dataset, which has benign events).

We'll merge a number of records from all 3 of the R1 true negative data sets: the logon, http, and device files. Note that in the R1 true negative set we did not find an emails CSV, which adds to the imbalance of our aggregate data set.

In pre-processing our data, we've already added all the records of interest below and selected various other true-negative (non-threat) records from the R1 dataset. Now we have our baseline of threats and non-threats concatenated in a single CSV. On the left, we've added a new column that denotes true/false (1 or 0), filled in with find and replace.
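The labeling step can also be scripted instead of done with find and replace. Here is a minimal sketch that prepends a 1/0 label column and concatenates the two sets; the column name "threat", the helper names, and the sample rows are hypothetical.

```python
TRUE_STRINGS = {"true", "1", "yes"}

def to_binary(value):
    """Map a true/false style value to 1 or 0."""
    return 1 if str(value).strip().lower() in TRUE_STRINGS else 0

def add_label(rows, label, column="threat"):
    """Return copies of the rows with the label as the first (leftmost) column."""
    return [{column: to_binary(label), **row} for row in rows]

# Made-up rows standing in for the true positive and true negative extracts.
threats = [{"user": "u1", "vector": "device"}]
benign = [{"user": "u2", "vector": "http"}]
dataset = add_label(threats, "TRUE") + add_label(benign, "FALSE")
print(dataset)
```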

Above, you can also see we started changing true/false strings to numerical categories. This is the beginning of our path to encoding the data through manual pre-processing, a hassle we could save ourselves, as we will see in later steps, by using RapidMiner Studio and the Pandas DataFrame library in Python for TensorFlow. We just wanted to illustrate some of the steps and considerations you'll have to perform. Following this, we will continue processing our data for a bit. Let's highlight what we can do using Excel functions before going the fully automated route.

We're also going to manually convert the date field into Unix epoch time for the sake of demonstration; as you can see, it becomes a large integer in a new column. To remove the old column in Excel for the rename, create a new sheet such as 'scratch' and cut the old date (non-epoch timestamp) values into that sheet. Reference the sheet along with the formula you see in the cell to achieve this effect. The formula is "=(C2-DATE(1970,1,1))*86400", without quotes.
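The Excel formula subtracts the epoch date from the cell value (a day count) and multiplies by 86400 seconds per day. A minimal Python equivalent is sketched below. The "%m/%d/%Y %H:%M:%S" input format is an assumption about how the date column is written, and the timestamp is treated as UTC, which matches the formula's lack of any time zone handling.

```python
from datetime import datetime, timezone

def to_epoch(date_text, fmt="%m/%d/%Y %H:%M:%S"):
    """Convert a timestamp string to Unix epoch seconds, treating it as UTC."""
    parsed = datetime.strptime(date_text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

print(to_epoch("01/02/1970 00:00:00"))  # one day after the epoch: 86400
```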

In our last manual pre-processing example, you need to format the CSV to 'categorize' the data by label encoding it. You can automate this, or use one-hot encoding methods, via a data dictionary in a script; in our case, we show you the manual method of mapping this in Excel, since we have a finite set of vectors for the records of interest (http is 0, email is 1, and device is 2).
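The data dictionary approach mentioned above can be sketched in a few lines of Python. The mapping uses the same codes as the manual Excel method (http is 0, email is 1, device is 2); the function names are hypothetical, and the one-hot variant is included only to show the difference between the two encodings.

```python
# Same codes as the manual mapping: http is 0, email is 1, device is 2.
VECTOR_CODES = {"http": 0, "email": 1, "device": 2}

def label_encode(vector):
    """Return the integer category for a record's vector."""
    return VECTOR_CODES[vector]

def one_hot(vector):
    """Return a list with a single 1 at the vector's position."""
    encoded = [0] * len(VECTOR_CODES)
    encoded[VECTOR_CODES[vector]] = 1
    return encoded

print(label_encode("email"))  # 1
print(one_hot("device"))      # [0, 0, 1]
```

Label encoding keeps a single column but implies an order between categories, while one-hot encoding avoids that implied order at the cost of one column per category.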