JavaScript features extractor

Neha Jain

Jul 19, 2010, 6:09:00 AM
to pho...@googlegroups.com
Hello Reader,

JavaScript is quite a nasty language in itself (sorry if that offends anyone). I say this after analyzing its features and working on building a feature extractor for it, though that is also what makes it interesting.
JS uses encoding, string splitting, and type conversions quite often, and many times all in a single line. For instance, one approach to iframe creation is:

document.write('<iframe ...>');

and another is document.write('<' + String.fromCharCode(105) + 'fram' + unescape('%65') + '>');

Now, a simple regex search like:

re.findall(r"<iframe.*>", js)

is going to miss it.
So I think the obvious step is to write some code that takes such a script as input and returns properly normalized JS to the feature extractor, so that no such attempt goes undetected. The other option is to assume that attempts to hide tags like iframe, object, embed, etc. are deliberate and occur only in malicious scripts, and to write a function that marks any such JS as malicious as soon as it finds it and drops it from further investigation. (Comments?)
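
A rough sketch of the second option, just to make it concrete. The indicator list, the function name looks_obfuscated, and the threshold are all assumptions for illustration, not something taken from the existing code:

```python
import re

# Hypothetical indicator list: calls that commonly appear when a tag name
# is being assembled at runtime. This list is an assumption, not exhaustive.
OBFUSCATION_PATTERNS = [
    r"String\.fromCharCode\s*\(",
    r"\bunescape\s*\(",
    r"\beval\s*\(",
]

def looks_obfuscated(js, threshold=2):
    """Return True if at least `threshold` distinct indicators occur in js."""
    hits = sum(
        1 for pattern in OBFUSCATION_PATTERNS
        if re.search(pattern, js, re.IGNORECASE)
    )
    return hits >= threshold

hidden = "document.write('<' + String.fromCharCode(105) + 'fram' + unescape('%65') + '>');"
plain = "document.write('<iframe src=\"x\">');"
print(looks_obfuscated(hidden), looks_obfuscated(plain))
```

The obvious risk is false positives: legitimate scripts also call unescape and eval, so the threshold would need tuning on real samples.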

I was thinking of using split(';') cascaded with split('+') and so on, but that is going to get quite long. Some pointers on how to do this in a short and smart way would be very generous, and I would be grateful.
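
One direction that might be shorter than cascaded splits, as a minimal sketch only: decode the String.fromCharCode(...) and unescape(...) calls with regex substitutions, then fold adjacent string literals joined by '+'. The names normalize_js, _quote, _from_char_code and _unescape are hypothetical helpers, and the sketch assumes single-quoted literals with constant arguments; it does not follow variables or nested calls.

```python
import re
from urllib.parse import unquote

def _quote(text):
    # Re-wrap decoded text as a single-quoted JS string literal.
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"

def _from_char_code(match):
    # "105, 102" -> 'if'
    codes = [int(c) for c in match.group(1).split(",")]
    return _quote("".join(chr(c) for c in codes))

def _unescape(match):
    # '%65' -> 'e' (only plain %XX escapes are handled here)
    return _quote(unquote(match.group(1)))

CHAR_CODE = re.compile(
    r"String\.fromCharCode\(\s*(\d+(?:\s*,\s*\d+)*)\s*\)", re.IGNORECASE)
UNESCAPE = re.compile(r"unescape\(\s*['\"]([^'\"]*)['\"]\s*\)")
CONCAT = re.compile(r"'((?:[^'\\]|\\.)*)'\s*\+\s*'((?:[^'\\]|\\.)*)'")

def normalize_js(js):
    js = CHAR_CODE.sub(_from_char_code, js)
    js = UNESCAPE.sub(_unescape, js)
    # Fold 'a' + 'b' into 'ab' until nothing changes any more.
    previous = None
    while previous != js:
        previous = js
        js = CONCAT.sub(lambda m: "'" + m.group(1) + m.group(2) + "'", js)
    return js

script = "document.write('<' + String.fromCharCode(105) + 'fram' +unescape('%65') + '>');"
print(normalize_js(script))                            # document.write('<iframe>');
print(re.findall(r"<iframe.*?>", normalize_js(script)))  # ['<iframe>']
```

After normalization the plain iframe regex matches again, so the feature extractor itself would not need to know about the obfuscation.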

--
Sincere Regards,
Neha Jain

Neha Jain

Jul 21, 2010, 6:04:02 AM
to pho...@googlegroups.com
The following are some of the regular expressions I have been working with to get the function calls and variable names. They seem to work all right.
# function calls: fetches names like eval, document.getElementById, etc.
m = re.findall(r"(\w+|\w+\.\w+|\w+\.\w+\.\w+)\(.*?\);", js)

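# variable extraction:
# (name, constructor) pairs for declarations like "var x = new Foo(...);"
v_new = re.findall(r"var\s+(\w+)\s*=\s*new\s+(\w+)\(.*?\);?", js)
# names of all variables declared with var (kept separate so the first result is not overwritten)
v = re.findall(r"var\s+(\w+)\s*[=;,]+", js)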
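
To show what these patterns return, here is a small self-contained run. The sample script and the names sample_js, calls, new_objects and variables are made up for illustration:

```python
import re

# Made-up sample input, only for illustration.
sample_js = """
var xhr = new XMLHttpRequest();
var count = 0;
var el = document.getElementById('frame');
eval(payload);
"""

# Function calls with up to two dotted prefixes.
calls = re.findall(r"(\w+|\w+\.\w+|\w+\.\w+\.\w+)\(.*?\);", sample_js)

# (variable, constructor) pairs from "var x = new Foo(...)".
new_objects = re.findall(r"var\s+(\w+)\s*=\s*new\s+(\w+)\(.*?\);?", sample_js)

# Every name declared with var.
variables = re.findall(r"var\s+(\w+)\s*[=;,]+", sample_js)

print(calls)        # ['XMLHttpRequest', 'document.getElementById', 'eval']
print(new_objects)  # [('xhr', 'XMLHttpRequest')]
print(variables)    # ['xhr', 'count', 'el']
```

Note that the call pattern also picks up constructors such as XMLHttpRequest, and that it only sees calls that end in ");" on a single line.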

These seem to work better than the earlier implementations in javascriptfeatures.py.
I am working on improving them, as well as the other regular expressions used for feature extraction. I was wondering, though, whether I am on the right track, or whether there is another approach to the same task that is better in terms of time and space.