Difficulty encoding to One Hot Data

Jorick van Bon

Apr 29, 2024, 6:25:50 AM
to NCCU DS4CS
Dear Mr. Hsiao,

I have been working ahead because I will have very little time in the upcoming week, so I have already started on H-calls4fam_TFDS_rnn. The RNN model is clear to me, and I was able to get through the first question without any problems. However, Q2 asks us to use one-hot encoding and a dense neural network to classify the same dataset, and I have been stuck on this problem for multiple days now. I was therefore hoping to get some help or hints.

I have been unable to apply OneHotEncoder to train_dataset and val_dataset, with one exception: a method that uses for loops to append every value, as NumPy values, to X_train, y_train, X_val and y_val lists. These lists are then converted to NumPy arrays, encoded, passed through the one-hot encoder, and finally passed to to_categorical. I feel this might be too complicated and not the correct method, but all other attempts have been unsuccessful.

However, when I try to use these X_train, y_train, etc. for the model, I cannot set a proper input_shape: when I set it to X_train.shape[1], validating with X_val does not work, and vice versa. In other words, the only way I have been able to get the model to work is to train it without validation data and to test it on the training dataset when making the confusion matrix; otherwise it gives errors about the input. This seems incorrect, but as mentioned, all of my other attempts over the past days have been unsuccessful.

How could I successfully implement the model, or what is going wrong? I can understand if you cannot provide the full answer, but some guidance would be greatly appreciated. 

Thank you in advance,

Kind regards,
Jorick van Bon

Mike Hsiao

Apr 29, 2024, 10:43:47 AM
to NCCU DS4CS
Hi, Jorick,

1)
There should be 121 vocabulary entries, so, similar to D02_DynamicAnalysis_win.ipynb, you can create a big table that stores the frequency of each call invoked by every malware sample. Actually, D02 has already done this for you, but you need to change it to a one-hot style table.

I suggest using the following function on the big table.
df.map(lambda x: 1 if x > 0 else 0)

You can then easily convert the DataFrame to a NumPy array for later neural network analysis.
One more thing: the correct name for this kind of preprocessing is "multi_hot", not "one_hot". I hope you were not misled by the name.
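
As a rough sketch of the steps in 1), assuming a small made-up table in place of the real one from D02 (the call names and counts below are hypothetical, only the name malware_call_df comes from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the D02 frequency table: one row per malware
# sample, one column per API call, each cell a call count.
malware_call_df = pd.DataFrame(
    {"CreateFile": [3, 0, 1], "RegOpenKey": [0, 0, 5], "WriteFile": [2, 7, 0]},
    index=["sample_a", "sample_b", "sample_c"],
)

# Multi-hot: 1 if the call was invoked at least once, otherwise 0.
# This gives the same result as df.map(lambda x: 1 if x > 0 else 0).
# Note that DataFrame.map requires pandas >= 2.1; older versions call it
# DataFrame.applymap.
multi_hot_df = (malware_call_df > 0).astype("int8")

# Convert to a NumPy array for the dense network.
X = multi_hot_df.to_numpy(dtype="float32")
print(X.shape)
```

Because the table is built once over all samples, splitting its rows into training and validation sets afterwards keeps the number of columns identical in both, so a single input_shape fits X_train and X_val.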

2) 
If you want to handle the data in a more modern way, you may want to check CategoryEncoding.
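
A minimal sketch of what that could look like with the Keras CategoryEncoding layer, assuming the calls are already integer-encoded with ids from 0 to 120 and every sequence has the same length (the ids below are made up for illustration):

```python
import numpy as np
import tensorflow as tf

NUM_TOKENS = 121  # vocabulary size mentioned in 1)

# Hypothetical integer-encoded call sequences, one row per sample.
# Assumption: every id lies in [0, NUM_TOKENS).
call_ids = np.array([[5, 17, 5, 120], [0, 3, 3, 3]], dtype="int32")

# multi_hot marks each id that occurs at least once in a row.
encoder = tf.keras.layers.CategoryEncoding(
    num_tokens=NUM_TOKENS, output_mode="multi_hot"
)
X = encoder(call_ids).numpy()
print(X.shape)
```

The output width is always NUM_TOKENS, regardless of which calls appear in a given split, so the dense model can use one fixed input shape. If the sequences are padded with 0, id 0 would also be marked as present, so the padding value may need separate handling.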

3)
If there are still problems, you can always export the table (malware_call_df) from D02_DynamicAnalysis_win and run classification and testing in Orange.


Thanks,
HSIAO



On Monday, April 29, 2024 at 6:25:50 PM [UTC+8], jorick...@gmail.com wrote: