Creating ARFF files from (Kaggle) CSV files

Johannes Amtén

Oct 21, 2013, 10:06:16 AM
to wekamooc...@googlegroups.com

Oftentimes when you have data that you want to crunch, it will be in CSV format and will need to be converted to ARFF for processing in Weka.

For instance, most Kaggle problems come with two CSV files, one for training and one for testing. The training file will contain class data, whereas we are supposed to fill in the class data in the test file. What is the most convenient way of converting these files to the ARFF format?

You can import each CSV file into Weka and export it as an ARFF file. However, this would lead to ARFF files that are incompatible with each other, because the nominal attribute definitions are derived from each file separately: the order of the values may differ, and some values may be missing from the definitions in one file or the other.

I guess you could copy-paste the training and testing CSV files into one big CSV file, import that file into Weka and save the result as an ARFF file. Then you could copy the header from that ARFF file and paste it in place of the CSV header row in the training and testing CSV files, turning them into training and testing ARFF files. Would that be the most efficient way to convert the CSV files into ARFF files?
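The shared-header idea described above can be sketched in plain Python, without Weka. This is only an illustration under several assumptions: both CSV files have the same columns (with the class column left empty in the test file), a column counts as numeric only if every non-missing value parses as a number, every column has at least one value somewhere, and no attribute name or value needs ARFF quoting. The function names and the sample data are made up for the example.

```python
import csv
import io

def read_csv_rows(text):
    """Parse CSV text into (header, rows)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, [row for row in reader]

def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def shared_header(relation, header, all_rows):
    """Build ARFF header lines from the rows of BOTH files combined."""
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        values = [row[i] for row in all_rows if row[i] not in ("", "?")]
        if values and all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Sorted union of labels, so both files get identical definitions.
            labels = sorted(set(values))
            lines.append("@attribute %s {%s}" % (name, ",".join(labels)))
    return lines + ["", "@data"]

def to_arff(header_lines, rows):
    """Join the shared header with one file's rows; empty cells become '?'."""
    data = [",".join(v if v != "" else "?" for v in row) for row in rows]
    return "\n".join(header_lines + data) + "\n"

# Hypothetical sample data: the class column 'play' is empty in the test file.
train_csv = "outlook,temp,play\nsunny,85,no\nrainy,70,yes\n"
test_csv = "outlook,temp,play\novercast,64,\nsunny,72,\n"

header, train_rows = read_csv_rows(train_csv)
_, test_rows = read_csv_rows(test_csv)
head = shared_header("weather", header, train_rows + test_rows)
train_arff = to_arff(head, train_rows)
test_arff = to_arff(head, test_rows)
```

Because the header is computed once from the combined rows, "overcast" appears in the training file's attribute definition even though it only occurs in the test data.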

Peter Reutemann

Oct 21, 2013, 11:37:23 PM
to WekaMOOC
Pretty much. I'm not aware of an elegant solution.

But after you've created the large CSV file, you could simply use the
Explorer and the RemoveRange instance filter to remove the correct
range of instances from the combined dataset to generate ARFF files
with the same content as the original CSV files.

For instance, say the first CSV file has 150 data rows and the second
250. Then you could use the following two filter setups to re-create
the original datasets from the combined dataset:
- recreate 1st dataset
weka.filters.unsupervised.instance.RemoveRange -V -R 1-150
- recreate 2nd dataset
weka.filters.unsupervised.instance.RemoveRange -R 1-150
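
To make the index arithmetic of the two setups explicit, here is a small plain-Python imitation of the filter's behaviour. It is not Weka code; the function name remove_range and the placeholder rows are invented for the illustration. It assumes 1-based inclusive ranges, with the invert flag standing in for -V (keep only the selected range instead of removing it).

```python
def remove_range(instances, first, last, invert=False):
    """Drop the 1-based inclusive range first..last from a list.

    With invert=True (standing in for -V), keep only that range instead.
    """
    in_range = instances[first - 1:last]
    out_of_range = instances[:first - 1] + instances[last:]
    return in_range if invert else out_of_range

# Placeholder combined dataset: 150 training rows followed by 250 test rows.
combined = ["train_%d" % i for i in range(1, 151)] + \
           ["test_%d" % i for i in range(1, 251)]

first_dataset = remove_range(combined, 1, 150, invert=True)   # like -V -R 1-150
second_dataset = remove_range(combined, 1, 150)               # like -R 1-150
```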

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cms.waikato.ac.nz/~fracpete/ Ph. +64 (7) 858-5174

Olivier Boudry

Oct 22, 2013, 4:56:04 AM
to wekamooc...@googlegroups.com
Hi Johannes,

I use R with the RWeka library when I have to convert files to ARFF.

> library(RWeka)
> setwd("C:/Users/obou/Desktop/Octave/pca")
> data = read.csv("SAheart.csv")
> write.arff(data, "SAheart.arff")

That's the most convenient method I've found to convert between the two file formats. I've also noticed that Weka sometimes has problems with .csv files that R opens without problems (newlines inside quoted fields, for example).
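
If the embedded newlines are what trips up the import, one possible workaround is to flatten them before loading the file. The following is a minimal Python sketch, not something from Weka or RWeka; the function name and the sample text are made up, and it assumes that replacing each line break inside a field with a single space is acceptable for your data.

```python
import csv
import io

def flatten_quoted_newlines(csv_text):
    """Rewrite CSV text so that no field contains a line break."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Join the pieces of a multi-line field with single spaces.
        writer.writerow([" ".join(field.splitlines()) for field in row])
    return out.getvalue()

# Hypothetical input with a newline inside a quoted field.
raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'
cleaned = flatten_quoted_newlines(raw)
```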

Peter Reutemann

Oct 25, 2013, 12:06:22 AM
to WekaMOOC
> Pretty much. I'm not aware of an elegant solution.

I've attached an interactive ADAMS flow that you can use with today's
0.4.4 release (https://adams.cms.waikato.ac.nz/). Just open the flow
with the "Flow editor" tool and run it. It will prompt you for input
files, output directory and relation name.
kaggle.flow