Hi!
The code I use for something similar looks roughly like this:
String.implement({
sanitiseWord: function() {
var s = this.replace(/\r/g, '\n').replace(/\n/g, ' ');
var rs = [];
rs.push(/<!--.+?-->/g); // comments
rs.push(/<title>.+?<\/title>/g); // Title
rs.push(/<style[^>]*?>.+?<\/style>/g); // Style info
rs.push(/<(\/)?(meta|link|style|div|head|html|body|span|table|colgroup|col|tbody|thead|tfoot|tr|td|font|!\[)[^>]*?>/g); // Unnecessary tags
rs.push(/<[^>\s]*?:[^>]*?>/g); // Namespaced elements
rs.push(/<\?[^>]*?>/g); // Processing instructions
rs.push(/<[^>]*?\?>/g); // Processing instructions
rs.push(/ v:.*?=".*?"/g); // Weird nonsense attributes
rs.push(/ style=".*?"/g); // Styles
rs.push(/ class=".*?"/g); // Classes
rs.push(/(&nbsp;){2,}/g); // Redundant &nbsp;s
rs.push(/<p>(\s|&nbsp;)*?<\/p>/g); // Empty paragraphs
for (var i = 0; i < rs.length; i++) {
s = s.replace(rs[i], '');
}
s = s.replace(/\s+/g, ' ');
var el = new Element('span');
return el.set('html', s).get('html'); // Balance unbalanced tags
}
});
It works pretty well, stripping out all but the simplest HTML tags.
Just to be safe, though, I run the same regular expressions
server-side, and then pass the text through HTMLTidy (well, jTidy, the
Java implementation) with some pretty strict parameters.
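
If anyone wants to try the regex pass on its own, outside the browser, here is a rough standalone sketch in plain JavaScript. It is not my server-side code (that part is Java), just the same patterns in a function that doesn't need MooTools. The name sanitiseWordHtml and the sample input are made up for illustration, the last two patterns assume the intent is to match &nbsp; entities, and the final tag-balancing step is left out because it needs a DOM.

```javascript
// Standalone sketch of the regex pass (no MooTools, no DOM).
// sanitiseWordHtml is a hypothetical name, not part of any library.
function sanitiseWordHtml(html) {
  // Flatten line breaks first so '.' in the patterns can span former line boundaries.
  var s = html.replace(/\r/g, '\n').replace(/\n/g, ' ');
  var patterns = [
    /<!--.+?-->/g,                 // comments
    /<title>.+?<\/title>/g,        // title
    /<style[^>]*?>.+?<\/style>/g,  // style info
    /<(\/)?(meta|link|style|div|head|html|body|span|table|colgroup|col|tbody|thead|tfoot|tr|td|font|!\[)[^>]*?>/g, // unnecessary tags
    /<[^>\s]*?:[^>]*?>/g,          // namespaced elements such as <o:p>
    /<\?[^>]*?>/g,                 // processing instructions
    /<[^>]*?\?>/g,                 // processing instructions
    / v:.*?=".*?"/g,               // v: attributes
    / style=".*?"/g,               // inline styles
    / class=".*?"/g,               // classes
    /(&nbsp;){2,}/g,               // runs of &nbsp; (assumed intent)
    /<p>(\s|&nbsp;)*?<\/p>/g       // empty paragraphs
  ];
  for (var i = 0; i < patterns.length; i++) {
    s = s.replace(patterns[i], '');
  }
  // Collapse whatever whitespace is left into single spaces.
  return s.replace(/\s+/g, ' ');
}

// Made-up sample of the kind of markup Word produces on paste.
var pasted =
  '<html><head><title>Doc</title><style>p { margin: 0 }</style></head>\r\n' +
  '<body><!-- Word junk --><p class="MsoNormal" style="margin:0cm">' +
  'Hello <o:p></o:p><b>world</b></p>\n' +
  '<p>&nbsp;</p></body></html>';

console.log(sanitiseWordHtml(pasted).trim());
// prints: <p>Hello <b>world</b></p>
```

The order matters a bit: the comment, title and style patterns have to run before the generic tag pattern, otherwise the tags get stripped and their contents are left behind as text.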