We apologize for cross-posting.
New dataset released: SMS Spam Collection v.1
-----------------------------------------------------------------------
The SMS Spam Collection v.1 is a public set of labeled SMS messages
collected for mobile phone spam research. It consists of a single
dataset of 5,574 real, non-encoded English messages, each tagged as
either legitimate (ham) or spam.
The collection is free for all purposes, and it is publicly available
at:
http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
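For readers who want to check the ham/spam split after downloading,
the following is a minimal sketch. It assumes the corpus is stored as
plain text with one message per line, in the form label, TAB, message
text; this layout is an assumption and is not stated in this
announcement. The function name parse_sms_lines and the sample
messages are hypothetical.

```python
from collections import Counter


def parse_sms_lines(lines):
    """Parse 'label<TAB>message' lines into (label, text) pairs.

    The tab-separated layout is an assumption about the file format.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, sep, text = line.partition("\t")
        if not sep or label not in ("ham", "spam"):
            raise ValueError(f"unexpected line: {line!r}")
        records.append((label, text))
    return records


# Made-up sample standing in for the real file contents.
sample = [
    "ham\tSee you at lunch?\n",
    "spam\tYou have won a prize, call now\n",
    "ham\tRunning late, sorry\n",
]
records = parse_sms_lines(sample)
counts = Counter(label for label, _ in records)
print(counts["ham"], counts["spam"])
```

With the real file, an open file object can be passed in place of the
sample list, since the function accepts any iterable of lines.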
This corpus has been collected from sources on the Internet that are
free or free for research use, including the Grumbletext Web site, the
NUS SMS Corpus, Caroline Tagg's PhD Thesis, and a smaller previous collection
(SMS Spam Corpus v.0.1:
http://www.esp.uem.es/jmgomez/smsspamcorpus/,
available for historical comparison).
A comprehensive study of this corpus can be found in the following
paper, which offers a number of statistics, studies and baseline
results for several machine learning methods:
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the
study of SMS Spam Filtering: New Collection and Results. Proceedings
of the 2011 ACM Symposium on Document Engineering (ACM DOCENG'11),
Mountain View, CA, USA, 2011. (Accepted)
With best regards,
Tiago A. Almeida
School of Electrical and Computer Engineering
University of Campinas, Sao Paulo, Brazil
José María Gómez Hidalgo
R&D Department, Optenet
Las Rosas, Madrid, Spain