aloha - internal links

4 views
Skip to first unread message

Laurent Savaëte

unread,
Sep 5, 2012, 7:16:13 AM9/5/12
to ductus-d...@googlegroups.com
It took almost a week to figure where the bug was, but I finally nailed
it :)

You create links the same way for both internal and external links, you
can search internal ones by url (eg. type "rest" in the url input and
you'll be shown a list of pages named "/en/Restaurant_vocabulary" and so
on).
Search *should* work whether you type a urn or url (ie: both "en:rest"
or "/en/rest" should bring up the same results).
If you don't pick any result from the list, aloha will assume you're
doing an external link (like if you type in "http://wikiotics.org" and
hit enter).

For now, we need to store two data-xxx microdata attributes in the links
to work out the difference between internal and external ones.
I think there's a way to do without these extra attributes, but I'll
work on this once more pressing stuff is solved.
A CSS class is added to each link, so we can display them accordingly in
view mode.
The good news is, it works with the standard aloha link plugin and just
a simple "repository" (about 100 lines of code) requesting links from
the backend. So the volume of (our own) code is very limited.

Jim, it would be great if you had a quick look at the few commits that
contain python code. Just skip the js code, it's pretty clean (I think).

Edit http://laurent.dev.wikiotics.net/en/visual_example2 for a test drive.

Jim Garrison

unread,
Sep 6, 2012, 12:42:12 AM9/6/12
to ductus-d...@googlegroups.com
On 09/05/12 04:16, Laurent Savaëte wrote:
> It took almost a week to figure where the bug was, but I finally nailed
> it :)
>
> You create links the same way for both internal and external links, you
> can search internal ones by url (eg. type "rest" in the url input and
> you'll be shown a list of pages named "/en/Restaurant_vocabulary" and so
> on).
> Search *should* work whether you type a urn or url (ie: both "en:rest"
> or "/en/rest" should bring up the same results).
> If you don't pick any result from the list, aloha will assume you're
> doing an external link (like if you type in "http://wikiotics.org" and
> hit enter).
>
> For now, we need to store two data-xxx microdata attributes in the links
> to work out the difference between internal anexternal ones.
> I think there's a way to do without these extra attributes, but I'll
> work on this once more pressing stuff is solved.
> A CSS class is added to each link, so we can display them accordingly in
> view mode.
> The good news is, it works with the standard aloha link plugin and just
> a simple "repository" (about 100 lines of code) requesting links from
> the backend. So the volume of (our own) code is very limited.
>
> Jim, it would be great if you had a quick look at the few commits that
> contain python code. Just skip the js code, it's pretty clean (I think).
>
> Edit http://laurent.dev.wikiotics.net/en/visual_example2 for a test drive.

I think the code looks great. From my perspective it is ready to be
merged into master whenever you think is good. I am really excited
about this, and appreciate you taking the time to look into it and get
everything right.

I have only a few comments.

In d92b35cf3 you changed the API of search_pages(), as now it is
supposed to accept a dictionary of various things. However, I don't
think you updated all the existing code to the new API, so some things
are probably broken. Actually, my vote would be to stick with the old
API, but it doesn't really matter. If you need to pass a dict to the
function, you can do it by calling search_pages(**d). For more
information on this syntax, see
http://www.saltycrane.com/blog/2008/01/how-to-use-args-and-kwargs-in-python/
(I recall that you were asking about it recently, and it's quite
simple/elegant once you get the hang of it). If you want me to propose
a patch of how I would do the API, let me know, it's only a few lines.

I would argue that the function urn_from_url() is misnamed as it does
not return a urn. I would call it fqpagename_from_url().

Obviously we need to start sanitizing input for real before we deploy to
the main site. I think it makes most sense to sanitize on output, for
one reason: it is possible that new attacks will be found over time,
which will require more aggressive sanitization in the future. In this
case, we would likely want to store the sanitized version in cache
(memcache or disk cache), as the operation can be somewhat expensive
(right?). Even with this enabled, we would probably want to sanitize on
input, just to keep ridiculous stuff out of the database altogether.
Actually, for now, sanitizing on input is sufficient, though we should
acknowledge that in the future we will probably end up sanitizing on
output as well (as soon as we find the need to sanitize something more
aggressively). So those are my thoughts, though they require no action.
If you haven't made one already, it would be very nice to have a test
case to make sure sanitization is actually working...

Laurent Savaëte

unread,
Sep 8, 2012, 12:49:33 PM9/8/12
to ductus-d...@googlegroups.com

> I think the code looks great. From my perspective it is ready to be
> merged into master whenever you think is good. I am really excited
> about this, and appreciate you taking the time to look into it and get
> everything right.
>
> I have only a few comments.
>
> In d92b35cf3 you changed the API of search_pages(), as now it is
> supposed to accept a dictionary of various things. However, I don't
> think you updated all the existing code to the new API, so some things
> are probably broken. Actually, my vote would be to stick with the old
> API, but it doesn't really matter. If you need to pass a dict to the
> function, you can do it by calling search_pages(**d). For more
> information on this syntax, see
> http://www.saltycrane.com/blog/2008/01/how-to-use-args-and-kwargs-in-python/
> (I recall that you were asking about it recently, and it's quite
> simple/elegant once you get the hang of it). If you want me to propose
> a patch of how I would do the API, let me know, it's only a few lines.

Ok, done. I had seen the **kwargs scheme before, but never seen a clear
explanation for it. That blog post was a great one!

> I would argue that the function urn_from_url() is misnamed as it does
> not return a urn. I would call it fqpagename_from_url().

corrected.

> Obviously we need to start sanitizing input for real before we deploy to
> the main site. I think it makes most sense to sanitize on output, for
> one reason: it is possible that new attacks will be found over time,
> which will require more aggressive sanitization in the future. In this
> case, we would likely want to store the sanitized version in cache
> (memcache or disk cache), as the operation can be somewhat expensive
> (right?). Even with this enabled, we would probably want to sanitize on
> input, just to keep ridiculous stuff out of the database altogether.
> Actually, for now, sanitizing on input is sufficient, though we should
> acknowledge that in the future we will probably end up sanitizing on
> output as well (as soon as we find the need to sanitize something more
> aggressively). So those are my thoughts, though they require no action.
> If you haven't made one already, it would be very nice to have a test
> case to make sure sanitization is actually working...

as it stands now, there are 2 whitelisting systems:
- 1 in aloha which *cleans up* before sending to the server. You can
find the list of allowed tags/attributes in the edit_text.html
template. Anything else is removed by aloha
- 1 on the server, which *validates* what it receives. The current
whitelist is the default genshi list plus two aloha-xxxxx attributes
(for internal/external links). See genshi docs (and mostly source code)
for the list. It is currently a bit more lax than what aloha does, but
it's reasonably safe I'd say. If you think we should tighten it up, we
can just redefine the list when initializing the validator.

In terms of efficiency, the validator parses the incoming HTML stream
into a tree, then runs across the whole tree and modifies (removes)
what it doesn't like. Then I compare original VS cleaned strings, and
throw an error if there's a difference. And then we do it all a second
time :)
We could speed it up by deriving a class from the genshi sanitizer, and
redifining its clean method so that it raises an exception instead of
cleaning up. It's a page long, so not a big job.
And obviously, we should find a way to avoid doing the job twice!

Speed aside, I think the code is reasonably safe as it is. I'll go
ahead and merge into master.

Reply all
Reply to author
Forward
0 new messages