in simple text classification, bayes works on tokens. one easy way
to build a corpus is to split the JS into tokens with a
regex. here's an example from the test/ directory:
>>> doc[:100]
'<SCRIPT language="javascript">\n var p_url =
"http://paksusic.cn/nuc/exe.php";\nfunction SS()\n{'
>>> m = re.findall(r'\w+[^\w]', doc)
>>> len(m)
373
>>> m
['SCRIPT ', 'language=', 'javascript"', 'var ', 'p_url ', 'http:',
'paksusic.', 'cn/', 'nuc/', 'exe.', 'php"', 'function ', 'SS(',
'try{', 'ret=', 'new ', 'ActiveXObject(', 'snpvw.', 'Snapshot ',
'Viewer ', 'Control.', '1"', 'var ', 'arbitrary_file ', 'p_url;', 'var
', 'dest ', 'C:', 'Program ', 'Files/', 'Outlook ', 'Express/',
'wab.', "exe'", 'document.', 'write(', 'object ', 'classid=',
'clsid:', 'F0E42D60-', '368C-', '11D0-', 'AD81-', "00A0C90DC8D9'",
'id=', "attack'", 'object>', 'attack.', 'SnapshotPath ', ...
which is truncated but you get the idea. looking at the token
distribution over malicious and then benign JS samples i wonder if
this would work.
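fwiw, here's roughly what i have in mind, sketched end to end. the
corpora below are made-up toy strings (not from the test/ directory),
and the add-one smoothing is just one arbitrary choice:

```python
import math
import re
from collections import Counter

def tokenize(doc):
    # same idea as above: a run of word chars plus one trailing non-word char
    return re.findall(r'\w+\W', doc)

def train(docs):
    # token counts across a list of documents
    counts = Counter()
    for d in docs:
        counts.update(tokenize(d))
    return counts

def score(doc, mal_counts, ben_counts):
    # sum of per-token log-likelihood ratios with add-one smoothing;
    # positive score => tokens look more like the malicious corpus
    mal_total = sum(mal_counts.values())
    ben_total = sum(ben_counts.values())
    s = 0.0
    for tok in tokenize(doc):
        p_mal = (mal_counts[tok] + 1) / (mal_total + 1)
        p_ben = (ben_counts[tok] + 1) / (ben_total + 1)
        s += math.log(p_mal / p_ben)
    return s

# hypothetical toy corpora, one sample each
mal = train(['eval(unescape("%41%42")); document.write("x");'])
ben = train(['function add(a, b) { return a + b; }'])

print(score('eval(unescape("%43"));', mal, ben))              # > 0, leans malicious
print(score('function sub(a, b) { return a - b; }', mal, ben))  # < 0, leans benign
```

with real corpora you'd train on the whole malicious and benign sample
sets and pick a threshold, but the shape of the thing is about this small.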
have you considered a simple strategy like this? it'd be simpler to
implement and test than building the feature vector you described.
-- jose