How to strip all HTML code for a block of text?

blabla12345

Aug 4, 2010, 12:29:37 PM
to Railo
Hi,

I have a block of text (possibly several paragraphs) with some
HTML/CSS code embedded in it. What I'd like to do is display it
inside a read-only Textarea tag as if (or almost as if) it had been
rendered by a web browser, so that something like
<span style="font-weight: bold;">bold</span> would look like:

Bold (whether it is actually bold is not important here)

I'm using Textarea for its Cols and Rows attributes; I don't see any
difference from using an Iframe for that matter.

Any thoughts?

Thanks.

Don

Ryan LeTulle

Aug 4, 2010, 12:34:20 PM
to ra...@googlegroups.com

Peter Boughton

Aug 4, 2010, 12:52:02 PM
to ra...@googlegroups.com
For doing this in CF, have a look at AntiSamy (Java-based project).

http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project
and
http://blog.pengoworks.com/index.cfm/2008/1/3/Using-AntiSamy-to-protect-your-CFM-pages-from-XSS-hacks


The intent of that project is to produce "clean" HTML with no
injection attacks in - but it should also be possible to simply create
a policy that says "remove all HTML".
(Though I've not used it yet, so don't know how that would actually work.)


Note that the JS-based script Ryan has linked to is just a very
simple/crude regular expression (/<\S[^><]*>/g) which is easy to trip
up, so not really recommended for any user-facing stuff.
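
A minimal sketch of how that regular expression can be tripped up. The function name crudeStrip and the sample inputs are made up for illustration; only the regex itself comes from the linked script.

```javascript
// Hypothetical wrapper around the regex quoted above.
function crudeStrip(text) {
  return text.replace(/<\S[^><]*>/g, '');
}

// Works for simple markup.
var simple = crudeStrip('<b>bold</b>');

// Plain text containing < and > loses everything in between.
var comparison = crudeStrip('1<2 and 3>2');

// A ">" inside an attribute value ends the match too early,
// leaving part of the tag behind.
var attribute = crudeStrip('<a title="a>b">link</a>');

console.log(simple);      // bold
console.log(comparison);  // 12
console.log(attribute);   // b">link
```

Both failures come from treating any "<...>" run as a tag, which is why a real HTML parser or sanitizer is the safer choice for user-facing input.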

Ryan LeTulle

Aug 4, 2010, 12:55:01 PM
to ra...@googlegroups.com

Peter Boughton

Aug 4, 2010, 1:04:26 PM
to ra...@googlegroups.com
Yeah, cflib really needs a way to post comments (or better still, code
reviews). :/

blabla12345

Aug 4, 2010, 1:09:02 PM
to Railo
Thank you all.

The scriptKit is pretty neat with a simple test.

Is the following js file the only file required?
http://www.javascriptkit.com/jkincludes/dropdowntabs.js
Not sure if scripts.css is required as well...

and of course, the credit would be maintained with the source code.


On Aug 4, 12:55 pm, Ryan LeTulle <bayous...@gmail.com> wrote:
> how bout:
>
> http://www.cflib.org/index.cfm?event=page.udfbyid&udfid=12
>
> complain to ray
>
> </ryan>
>
> On Wed, Aug 4, 2010 at 11:52 AM, Peter Boughton <bought...@gmail.com> wrote:
> > For doing this in CF, have a look at AntiSamy (Java-based project).
>
> >http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project
> > and
>
> >http://blog.pengoworks.com/index.cfm/2008/1/3/Using-AntiSamy-to-prote...

blabla12345

Aug 4, 2010, 3:57:34 PM
to Railo
OK, sorry, I didn't look closely at the source code before; now I
have.

Here's the scoop,

<script type="text/javascript">

// Strip HTML Tags (form) script- By JavaScriptKit.com (http://
www.javascriptkit.com)
// For this and over 400+ free scripts, visit JavaScript Kit-
http://www.javascriptkit.com/
// This notice must stay intact for use

function stripHTML(){
var re= /<\S[^><]*>/g
for (i=0; i<arguments.length; i++)
arguments[i].value=arguments[i].value.replace(re, "")
}

<cfoutput>
stripHTML(#myCFhtmlData#,document.getElementById('sdata').value);
<cfoutput>
</script>

<form>
<textarea id="sdata" cols="40" rows="35"/>
</form>

OUTCOME:
textarea is blank (no data)

Is it because document.getElementById('fieldName').value may not be
used to populate a textarea?

Thanks.

>On Aug 4, 12:34 pm, Ryan LeTulle <bayous...@gmail.com> wrote:
> http://www.javascriptkit.com/script/script2/removehtml.shtml
>
> </ryan>
>

Ryan LeTulle

Aug 4, 2010, 4:02:50 PM
to ra...@googlegroups.com
put this after </form>:

<script type="text/javascript">
stripHTML();
</script>

</ryan>

blabla12345

Aug 4, 2010, 4:33:42 PM
to Railo
Nope.

See the last portion of the script:

<cfoutput>
stripHTML(#myCFhtmlData#,document.getElementById('sdata').value);
<cfoutput>

Instead, how could we turn the JS into a CF script?

<cfscript>
function removeHTML(str) {
var str.re= /<\S[^><]*>/g;
for (i=0; i<arguments.length; i++)
{
arguments[i].value=arguments[i].value.replace(str.re, "");
}
// return html remove text
return(str);
}
</cfscript>

What do we use for arguments.length
for the cfscript?

Thanks.

On Aug 4, 4:02 pm, Ryan LeTulle <bayous...@gmail.com> wrote:
> put this after </form>:
>
> <script type="text/javascript">
> stripHTML():
> </script>
>
> </ryan>
>

Peter Boughton

Aug 4, 2010, 4:51:16 PM
to ra...@googlegroups.com
Here's the quick and dirty CFML version:

<textarea id="sdata" cols="40" rows="35">#HtmlEditFormat(
MyCfHtmlData.replaceAll( '<[^<>]++>' , '' ) )#</textarea>

That'll remove all tag-like constructs and escape any stray < that
might get left over.


If for some reason you need to do it in JS, use this:

function stripTags(text) { return text.replace( /<[^<>]+>/g , '' ); }

document.getElementById('sdata').innerHTML =
stripTags('<cfoutput>#JsStringFormat(MyCfHtmlData)#</cfoutput>');


But, as I said before, using a regex to remove things that might look
like tags is not the best solution.
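
On the earlier question of whether .value can be used to populate a textarea: it can. A minimal sketch, assuming the element with id 'sdata' already exists when the script runs (for example, the script is placed after the </form>). The helper name fillTextarea is hypothetical; stripTags is the same function as above.

```javascript
// Same tag-stripping regex as in the JS version above.
function stripTags(text) {
  return text.replace(/<[^<>]+>/g, '');
}

// Hypothetical helper: takes the textarea element itself (not its
// .value string) and assigns the stripped text to its value.
function fillTextarea(textarea, html) {
  textarea.value = stripTags(html);
  return textarea;
}

// Browser usage (assumes <textarea id="sdata"> is already in the page):
// fillTextarea(document.getElementById('sdata'),
//              '<span style="font-weight: bold;">bold</span> text');
```

Assigning to .value treats the string as plain text, so a stray < left over after stripping is shown literally and does not need escaping on the JS side.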

blabla12345

Aug 4, 2010, 5:38:24 PM
to Railo
One more thing, how could we leave the <p> and </p> tags intact? Just
these. Many thanks.

blabla12345

Aug 4, 2010, 5:33:48 PM
to Railo
Beautiful, Peter. You rock. Many thanks.

Don

On Aug 4, 4:51 pm, Peter Boughton <bought...@gmail.com> wrote:

Peter Boughton

Aug 4, 2010, 5:57:23 PM
to ra...@googlegroups.com
> One more thing, how could we leave the <p> and </p> tag intact.  Just
> this guy.

The proper answer is to use the AntiSamy project I linked to earlier -
it'll make it trivial to do small changes like this.


However, the previous regex can be adapted with a negative lookahead
to exclude P tags, like so:

(?!</?p>)<[^<>]+>

Which will then not remove <p> or </p> tags - but we're close to the
limit of what's sensible with regex and HTML - any more complex and
things start getting messy.
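
A minimal sketch of that lookahead regex in JS, to show what it does and does not keep. The function name stripTagsKeepP and the sample strings are made up for illustration.

```javascript
// The lookahead (?!</?p>) refuses to match at a position where a bare
// <p> or </p> starts; every other tag-like construct is removed.
function stripTagsKeepP(text) {
  return text.replace(/(?!<\/?p>)<[^<>]+>/g, '');
}

var kept = stripTagsKeepP('<p>Hello <b>world</b></p>');

// Only the exact forms <p> and </p> are protected: a P tag with
// attributes is still stripped, while its closing tag survives.
var withAttribute = stripTagsKeepP('<p class="x">Hi</p>');

console.log(kept);           // <p>Hello world</p>
console.log(withAttribute);  // Hi</p>
```

The second case is an example of the messiness mentioned above: handling attributes, upper-case tags, or more than one allowed tag quickly makes the pattern hard to maintain.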

blabla12345

Aug 4, 2010, 6:26:27 PM
to Railo
Thanks, Peter. I just realized I was being silly: inside a Textarea,
the <p> and </p> tags are useless. Oh well, I'll let it sit there for
now or look into iframe ...

Don

Stefan

Aug 4, 2010, 9:29:22 PM
to Railo
You could replace <p> with chr(10) to make a line break in the
textarea. It takes a bit of cumbersome parsing to put the <p></p> back
after submit. I would not be surprised if there is a UDF for it at
CFLIB, but I have not checked.
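
A rough sketch of that round trip in JS rather than CFML ("\n" is the same character as chr(10)). Both function names are hypothetical, and the sketch assumes bare <p> tags without attributes.

```javascript
// Turn each opening <p> into a line break, then strip whatever tags
// remain (including </p>) and drop the leading line break.
function htmlToTextareaText(html) {
  return html
    .replace(/<p>/gi, '\n')
    .replace(/<[^<>]+>/g, '')
    .replace(/^\n+/, '');
}

// After submit: wrap each non-empty line in <p>...</p> again.
// Inline markup removed earlier is not restored, and the text is
// not escaped here.
function textToParagraphs(text) {
  return text
    .split(/\n+/)
    .filter(function (line) { return line.trim() !== ''; })
    .map(function (line) { return '<p>' + line + '</p>'; })
    .join('');
}

var plain = htmlToTextareaText('<p>One</p><p>Two <b>bold</b></p>');
console.log(JSON.stringify(plain));    // "One\nTwo bold"
console.log(textToParagraphs(plain));  // <p>One</p><p>Two bold</p>
```

The round trip is lossy: the <b> in the sample is gone for good, which is part of why putting the markup back is cumbersome.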

denstar

Aug 4, 2010, 10:57:42 PM
to ra...@googlegroups.com
There's the Jericho HTML java library as well (AntiSamy is freakishly cool BTW).

Does a pretty nice job of either stripping HTML or formatting HTML as
plain text.

Easy as pie to use from CFML, too! Just drop the jericho jar in the
WEB-INF/lib folder, and do something like this:

source = createObject("java","net.htmlparser.jericho.Source");
plainText = source.init("your HTML here").getRenderer().toString;

Might even have a railo extension for it out at some point. Jericho
does other nifty stuff too.

:Den

--
All action is for the sake of some end; and rules of action, it seems
natural to suppose, must take their whole character and color from the
end to which they are subservient.
John Stuart Mill

denstar

Aug 4, 2010, 11:00:56 PM
to ra...@googlegroups.com
Er-

plainText = source.init("your HTML here").getRenderer().toString();

Even.

blabla12345

Aug 4, 2010, 11:15:19 PM
to Railo
Stefan, thanks, the chr(10) line break works.

Don

Todd Rafferty

Aug 5, 2010, 2:36:18 PM
to ra...@googlegroups.com
Speaking of AntiSamy:
http://www.petefreitag.com/item/760.cfm

--
~Todd Rafferty ** Volunteer Railo Open Source Community Manager ** http://getrailo.org/