
Sanitizing user inputs in multiple languages


Sam

Apr 4, 2007, 9:01:04 PM
An application I am developing needs to accept input in any of the 15
languages I've opted for, from a single, common HTML form.

I generally sanitize user inputs from an HTML form by specifying a
list of allowed characters. How can I do a similar sanitization for
inputs that can be in any of the 15 languages if the encoding is
UTF-8? Is there a better method?

Jürgen Exner

Apr 4, 2007, 10:06:49 PM

Quite simple actually. Just take the superset of the white lists of
characters for each language.
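
As a rough illustration of the superset idea (a sketch in Python rather than Perl, with hypothetical and deliberately incomplete per-language lists), the white lists can be kept as sets and merged with a union:

```python
def span(first, last):
    """Return the set of characters from first to last, inclusive."""
    return {chr(c) for c in range(ord(first), ord(last) + 1)}

LATIN = span("a", "z") | span("A", "Z")

# Hypothetical sample white lists; a real application would need one
# carefully reviewed entry for each of its 15 languages.
WHITELISTS = {
    "en": LATIN,
    "de": LATIN | set("äöüÄÖÜß"),
    "ru": span("\u0410", "\u044f") | set("\u0401\u0451"),  # Cyrillic A..ya, Yo/yo
}

# Characters accepted regardless of language (an assumption for this sketch).
SHARED = span("0", "9") | set(" .,'-")

# The superset: the union of every per-language list plus the shared set.
ALLOWED = set().union(SHARED, *WHITELISTS.values())

def is_allowed(text):
    """True if every character of text is in the combined white list."""
    return all(ch in ALLOWED for ch in text)
```

With these sample lists, `is_allowed("Grüße")` passes and `is_allowed("<script>")` is rejected, because `<` and `>` appear in none of the lists.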

> Is there a better method?

Depends on your definition of "better".
More secure? Probably not; white lists are much more secure than black lists.
Easier? Well, depends on your definition of "sanitize". If you want to e.g.
eliminate x-site scripting, then you can simply remove those few characters
that are known to cause x-site scripting. There are modules to do that.
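
A minimal sketch of that idea, in Python rather than Perl and not taken from any particular module: the set of characters in `RISKY` is an assumption, and escaping on output (shown second) is a different technique from removal that is often used for the same purpose.

```python
import html

# Assumed set of characters that are significant in HTML markup.
RISKY = "<>&\"'"

def strip_risky(text):
    """Remove the HTML-significant characters outright."""
    return text.translate({ord(c): None for c in RISKY})

def escape_for_html(text):
    """Keep the characters but encode them as HTML entities on output."""
    return html.escape(text, quote=True)
```

Removal loses data (a name such as O'Brien becomes OBrien), while escaping keeps the text intact and only changes how it is written into the page.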

jue


Sam

Apr 5, 2007, 9:55:05 AM
( Discussion thread in Google Group -
http://groups.google.com/group/perl.beginners/browse_thread/thread/0f2f98e68b039cd2/95f19485ac944239#95f19485ac944239
)

Sam wrote:
> How can I do a similar sanitization for
> inputs that can be in any of the 15 languages if the encoding is
> UTF-8?

Jürgen Exner wrote:
> Quite simple actually. Just take the superset of the white lists of
> characters for each language.

Yes, I had something similar in mind too, but I am stuck on how to
actually implement it. Here's what I've picked up so far -

1. First make sure that input is UTF-8 encoded (
http://www.w3.org/International/questions/qa-forms-utf-8 )
2. Select the allowed characters in each language (all letters and
digits) and use them in a regex.
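
Step 1 can be sketched as follows (in Python rather than Perl, as an illustration only). Instead of the byte-level regex described on the W3C page, this version relies on a strict decode, which fails on any invalid byte sequence; the function name `decode_utf8` is hypothetical.

```python
def decode_utf8(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```

Once the bytes are decoded, step 2 can work on characters (code points) rather than on raw UTF-8 bytes.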

For implementing the second step, I found a useful UTF-8 encoding
table which has the UTF and Hex code for the characters in each
language ( http://www.utf8-chartable.de/ ).

Here's the problem - How do I identify the important characters of
a language I don't know? For example, I know some of the letters of
the Arabic language, but don't really know if characters like (for example)
the ARABIC POETIC VERSE SIGN are necessary. Second, how do I use the
UTF-8 hex codes for the characters in a regex?
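
One possible angle on both questions, sketched in Python rather than Perl and offered only as an assumption about what might help: after decoding, a regex can refer to characters by code point (for example `\u0621`) instead of by UTF-8 bytes, and the Unicode general category of each character can stand in for knowledge of the language. Letters and marks have categories starting with L and M, decimal digits are Nd, and symbols such as ARABIC POETIC VERSE SIGN (U+060E) fall under So.

```python
import re
import unicodedata

# Code-point range for the basic Arabic letters, hamza (U+0621) to yeh (U+064A).
ARABIC_LETTERS = re.compile(r"[\u0621-\u064A]+")

def is_word_char(ch):
    """True for letters (L*), combining marks (M*) and decimal digits (Nd)."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nd"

def is_plain_text(text):
    """True if text holds only letters, marks, digits and spaces."""
    return all(is_word_char(ch) or ch == " " for ch in text)
```

Here `is_plain_text` accepts Arabic, Cyrillic or Latin words alike without listing them, and rejects both `<` and the poetic verse sign because neither is a letter, mark or digit. Whether marks and symbols should be allowed is still a decision for each language.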

Somebody must have a better solution ... I do get the feeling that
this approach isn't great.

Jürgen Exner wrote:
> Well, depends on your definition of "sanitize". If you want to e.g.
> eliminate x-site scripting, ...

Yes, the final intention is to prevent x-site scripting, but I am not
aware of any widely used, popular modules for this.
