Jason Grout

unread,

Oct 4, 2012, 5:50:21 PM10/4/12

to sage-devel

(apologies for possible multiple posts--I've sent this twice to gmane
and it hasn't appeared)

I've implemented some sanitizing of public worksheets [1] and applied it
to demo.sagenb.org as a test. The concerns from before were that
javascript was executing on the page, leading to malware being on the page.

Can people test the new html sanitizing being done on demo.sagenb.org?

If it looks good (especially to William), I'll roll it out to the other
*.sagenb.org servers and we'll turn published worksheets back on.

Thanks,

Jason

[1] https://github.com/sagemath/sagenb/pull/100

kcrisman

unread,

Oct 4, 2012, 8:49:34 PM10/4/12

to sage-...@googlegroups.com

On Thursday, October 4, 2012 5:50:25 PM UTC-4, jason wrote:

(apologies for possible multiple posts--I've sent this twice to gmane
and it hasn't appeared)

I've implemented some sanitizing of public worksheets [1] and applied it
to demo.sagenb.org as a test. The concerns from before were that
javascript was executing on the page, leading to malware being on the page.

return text.replace('<', '<')

Wow, what a hammer; so does that just mean all the html structure becomes visible? (In the event that branch is reached.)

Jason Grout

unread,

Oct 4, 2012, 9:00:01 PM10/4/12

to sage-...@googlegroups.com

On 10/4/12 7:49 PM, kcrisman wrote:
>
>
> On Thursday, October 4, 2012 5:50:25 PM UTC-4, jason wrote:
>
> (apologies for possible multiple posts--I've sent this twice to gmane
> and it hasn't appeared)
>
> I've implemented some sanitizing of public worksheets [1] and
> applied it

> to demo.sagenb.org <http://demo.sagenb.org> as a test. The concerns

> from before were that
> javascript was executing on the page, leading to malware being on
> the page.
>
>
> return text.replace('<', '<')
>
> Wow, what a hammer; so does that just mean all the html structure
> becomes visible? (In the event that branch is reached.)
>

Yep. I figure that's better than just returning an empty string. Do
you think an empty string is better? Note that this would only be for
output in an <html> block.

Thanks,

Jason

Jason Grout

unread,

Oct 4, 2012, 5:32:55 PM10/4/12

to sage-...@googlegroups.com

I just submitted a pull request [1] that does some sanitizing for public
worksheets. I applied this to demo.sagenb.org and temporarily
re-enabled published worksheets there. Can people test this to see if
it prevents bad content from making it through in published worksheets?

I suppose we still have the problem of a worksheet that may have bad
content that is stripped out when viewed as a public worksheet, but that
bad content is preserved and loaded when a user clicks "Edit a copy".

If the changes above look okay (especially to William), I'll roll them
out to the rest of *.sagenb.org.

Andrea Lazzarotto

unread,

Oct 5, 2012, 4:54:51 AM10/5/12

to sage-...@googlegroups.com

By "can people test this" do you mean I am allowed to try to craft an awful piece of malicious code with injected JS (of course without doing something too bad)?

Andrea Lazzarotto
(inviato da Android)

Volker Braun

unread,

Oct 5, 2012, 7:11:50 AM10/5/12

to sage-...@googlegroups.com

Looks good!

To decrease the value of sagenb.org as spam link farm we should probably also add add_nofollow=True:

html_cleaner = SageCleaner(page_structure=False, remove_tags=('head', 'title'), style=True, add_nofollow=True)

William Stein

unread,

Oct 6, 2012, 9:06:20 AM10/6/12

to sage-...@googlegroups.com

On Thu, Oct 4, 2012 at 2:50 PM, Jason Grout <jason...@creativetrax.com> wrote:
> (apologies for possible multiple posts--I've sent this twice to gmane and it
> hasn't appeared)
>
> I've implemented some sanitizing of public worksheets [1] and applied it to
> demo.sagenb.org as a test. The concerns from before were that javascript
> was executing on the page, leading to malware being on the page.
>
> Can people test the new html sanitizing being done on demo.sagenb.org?
>
> If it looks good (especially to William), I'll roll it out to the other
> *.sagenb.org servers and we'll turn published worksheets back on.

I hate to criticize this, since I know it was a lot of work to
implement. But since I'm the one who gets my network cutoff when
things go wrong, I've got to take this seriously.

I happen to have read straight through the book "The Tangled Web: A
Guide to Securing Modern Web Applications" this summer. There are
thousands of known and truly dangerous exploits that one can sneak
into an HTML+CSS document. (It's amazing the sorts of crazy evil
things one can do using weird character sets combined with browser
implementation issues, especially if said browser is IE.) The book
convincingly argues that the only way to have any hope of making a
safe HTML document is to use a whitelist approach.

The code you wrote uses the exact opposite approach -- the blacklist
approach. Here's a typical stackoverflow discussion about exactly
the approach in your code:

http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/2702587#2702587

Note the comment there: "Note that this uses a blacklist approach to
filter out evil bits, rather than whitelist, but only a whitelisting
approach can guarantee safety. – Søren Løvborg Nov 26 '11 at 21:10"

Also, the lxml documentation describes the function you're using as:
"Removes unwanted tags and content." Instead, the "whitelist
approach" to sanitizing an HTML+CSS document is to "include wanted
tags and content", which is a very, very different thing.

There are Python libraries that implement a whitelist approach, e.g.,
see the discussion here:

http://stackoverflow.com/questions/1606201/how-can-i-make-html-safe-for-web-browser-with-python

-- William

P.S. Also, this worries me a little:

if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'):

I wonder if there is a way to put malware into a mathjax script tag?

William Stein

unread,

Oct 6, 2012, 9:08:59 AM10/6/12

to sage-...@googlegroups.com

There's a FAQ linked to from that page that nicely explains whitelist/blacklist:

http://jacobian.org/writing/untrusted-users-and-html/

"I’ve literally seen hundreds of recipes for stripping unsafe HTML
that are about as effective as a screen door on a submarine."

William

>
> -- William
>
> P.S. Also, this worries me a little:
>
> if el.tag=='script' and el.get('type')=='math/tex' and not el.get('src'):
>
> I wonder if there is a way to put malware into a mathjax script tag?

--
William Stein
Professor of Mathematics
University of Washington
http://wstein.org

Andrea Lazzarotto

unread,

Oct 6, 2012, 9:12:19 AM10/6/12

to sage-...@googlegroups.com

2012/10/6 William Stein <wst...@gmail.com>

I wonder if there is a way to put malware into a mathjax script tag?

Probably yes.

I ask again: Jason, can I try to hack?

Thank you.

--
Andrea Lazzarotto - http://andrealazzarotto.com

Volker Braun

unread,

Oct 6, 2012, 11:19:30 AM10/6/12

to sage-...@googlegroups.com

Before you even get to the question of black/whitelisting you have to deal with malformed documents. Are your rules (black or white) going to apply to subtly broken tags? I think lxml does the only sane thing here: Parse the document into a valid xml document, apply rules, and then write everything out into a new (valid) xml document.

The lxml.html.clean.Cleaner class defaults to removing unknown tags and, for the known tags, removing those that are troublesome. So it is technically whitelisting. The main difference to BeautifulSoup, say, is that it uses C libraries to do the heavy lifting so its faster.

William Stein

unread,

Oct 6, 2012, 11:58:19 AM10/6/12

to sage-...@googlegroups.com

On Sat, Oct 6, 2012 at 8:19 AM, Volker Braun <vbrau...@gmail.com> wrote:
> Before you even get to the question of black/whitelisting you have to deal
> with malformed documents. Are your rules (black or white) going to apply to
> subtly broken tags? I think lxml does the only sane thing here: Parse the
> document into a valid xml document, apply rules, and then write everything
> out into a new (valid) xml document.
>
> The lxml.html.clean.Cleaner class defaults to removing unknown tags and, for
> the known tags, removing those that are troublesome. So it is technically
> whitelisting.

What is the whitelist it is using, and why is that whitelist a good
choice? I don't see a whitelist in the lxml.html.cleaner docs.

The page [1] has a publicly discussed and thought through white list
of tags and css. In fact, [1] is much longer than the source code [2]
of lxml's Cleaner. Moreover, reading the source of [2] gives me no
confidence whatever that it generates something safe. e.g., this is
not confidence inspiring to me:

# This is an IE-specific construct you can have in a stylesheet to
# run some Javascript:
_css_javascript_re = re.compile(
r'expression\s*\(.*?\)', re.S|re.I)

[1] http://wiki.whatwg.org/wiki/Sanitization_rules
[2] http://lxml.de/api/lxml.html.clean-pysrc.html

> --
> You received this message because you are subscribed to the Google Groups
> "sage-devel" group.
> To post to this group, send email to sage-...@googlegroups.com.
> To unsubscribe from this group, send email to
> sage-devel+...@googlegroups.com.
> Visit this group at http://groups.google.com/group/sage-devel?hl=en.

Volker Braun

unread,

Oct 6, 2012, 2:17:24 PM10/6/12

to sage-...@googlegroups.com

The whitelist is lxml.html.defs.tags, see http://lxml.de/api/lxml.html.defs-module.html

Cleaning CSS should probably be considered a separate problem, especially since Microsoft decided in their infinite wisdom to allow embedded javascript in CSS files (hence the _css_javascript_re). But we use none of that since Jason's patch explicitly removes all style tags.

Andrea Lazzarotto

unread,

Oct 6, 2012, 2:24:02 PM10/6/12

to sage-...@googlegroups.com

2012/10/6 Volker Braun <vbrau...@gmail.com>

But we use none of that since Jason's patch explicitly removes all style tags.

What about style attributes?

Rob Beezer

unread,

Oct 6, 2012, 4:19:33 PM10/6/12

to sage-...@googlegroups.com

On Thursday, October 4, 2012 2:50:25 PM UTC-7, jason wrote:

to demo.sagenb.org as a test.

I've been regularly getting

503 Service Unavailable

No server is available to handle this request.

back from demo.sagenb.org the past couple of days. Has anybody else been successful testing these changes?

Thanks,
Rob

Andrea Lazzarotto

unread,

Oct 6, 2012, 4:28:58 PM10/6/12

to sage-...@googlegroups.com

2012/10/6 Rob Beezer <goo...@beezer.cotse.net>

Has anybody else been successful testing these changes?

No. Partially because I got those errors too and partially because I'm waiting to be authorized by Jason to intentionally try to inject some proof of concept XSS in the public worksheets.

Rob Beezer

unread,

Oct 6, 2012, 4:32:35 PM10/6/12

to sage-...@googlegroups.com

Thanks, Andrea. I want to run some of textbook-worksheets through this, especially since they have been mangled by the lxml module once already. ;-)

Rob

Volker Braun

unread,

Oct 6, 2012, 5:18:34 PM10/6/12

to sage-...@googlegroups.com

Only the html() output is run through lxml, not the whole worksheet. Indeed we do want to allow the Sage server to put javascript into the notebook. So to clarify: Only html() output injected by the user into the notebook is sanitized and stripped of styles.

Jason Grout

unread,

Oct 7, 2012, 12:47:15 AM10/7/12

to sage-...@googlegroups.com

On 10/6/12 1:17 PM, Volker Braun wrote:
> The whitelist is lxml.html.defs.tags, see
> http://lxml.de/api/lxml.html.defs-module.html
>
> Cleaning CSS should probably be considered a separate problem,
> especially since Microsoft decided in their infinite wisdom to
> allow embedded javascript in CSS files (hence the _css_javascript_re).
> But we use none of that since Jason's patch explicitly removes all style
> tags.

William: very good points, and Volker: very good points back!

In comparing the html5lib allowed tags list and the lxml allowed tag
list, I see some differences.

Allowed in html5lib, but not lxml:

article, aside, audio, canvas, command, datagrid, datalist, details,
dialog, event-source, figcaption, figure, footer, header, keygen, m,
meter, multicol, nav, nextid, output, progress, section, sound, source,
spacer, time, video

Allowed in lxml, but not html5lib (this list takes out the tags that we
are blocking explicitly (e.g., style=False), or implictly with the
default options):

base, basefont, bdo, body, html, isindex, noscript

(html is just allowed because we repurposed the html tag in Sage. We
should fix that to strip off the outermost html tag and then remove all
other html tags).

It would be easy to add the latter tags to the remove_tags list for
lxml. On the other hand, it would also be easy to switch to html5lib.
I agree with Volker that fundamentally, lxml and html5lib by default
approach things in the same way: a whitelist of pre-defined tags are
allowed.

Thanks,

Jason

Andrea Lazzarotto

unread,

Oct 7, 2012, 5:02:00 AM10/7/12

to sage-...@googlegroups.com

2012/10/4 Jason Grout <jason...@creativetrax.com>

I've implemented some sanitizing of public worksheets [1] and applied it to demo.sagenb.org as a test. The concerns from before were that javascript was executing on the page, leading to malware being on the page.

Are you sure you did it?

http://demo.sagenb.org/home/pub/427/

Here is a normal javascript which doesn't get filtered.

Jason Grout

unread,

Oct 9, 2012, 10:52:19 AM10/9/12

to sage-...@googlegroups.com

On 10/7/12 4:02 AM, Andrea Lazzarotto wrote:
>
>
> 2012/10/4 Jason Grout <jason...@creativetrax.com
> <mailto:jason...@creativetrax.com>>

>
> I've implemented some sanitizing of public worksheets [1] and

> applied it to demo.sagenb.org <http://demo.sagenb.org> as a test.

> The concerns from before were that javascript was executing on the
> page, leading to malware being on the page.
>
>
> Are you sure you did it?
>
> http://demo.sagenb.org/home/pub/427/
>
> Here is a normal javascript which doesn't get filtered.
>

You're right. demo.sagenb.org was actually using the old code. I've
fixed that now (and since demo was having problems sometimes restarting
(which has been a problem before my changes...), I also put the
javascript filtering on alpha.sagenb.org).

Thanks,

Jason

Jason Grout

unread,

Oct 9, 2012, 11:27:00 AM10/9/12

to sage-...@googlegroups.com

On 10/6/12 11:47 PM, Jason Grout wrote:
> It would be easy to add the latter tags to the remove_tags list for
> lxml. On the other hand, it would also be easy to switch to html5lib. I
> agree with Volker that fundamentally, lxml and html5lib by default
> approach things in the same way: a whitelist of pre-defined tags are
> allowed.

By the way, I am now removing the tags additionally specified in the
html5lib library:
https://github.com/jasongrout/sagenb/commit/003436a47bc97429ea25287d2aa15fc4a2b527eb

Thanks,

Jason

Andrea Lazzarotto

unread,

Oct 9, 2012, 1:19:11 PM10/9/12

to sage-...@googlegroups.com

2012/10/9 Jason Grout <jason...@creativetrax.com>

By the way, I am now removing the tags additionally specified in the html5lib library

I tried every attack to which Firefox is vulnerable from http://html5sec.org and none worked. I tried to exploit the mathjax script just a bit, I'm not a security expert.

BTW your cleaning script also strips down all style, including the "highlighted" text made with the TinyMCE editor.

Best regards,

Reply all

Reply to author

Forward