2.62 -- multiple fixes...


The Editor

Mar 9, 2009, 9:50:49 PM
to bolt...@googlegroups.com
Thanks to Hans and Linly, I've been able to put out a much improved
release. I modified some of the suggestions, omitted some and added a
few more. But I followed the posts pretty closely and think I
addressed most of the issues. Here's my fix list:

Fixed a bug in handling stamps by not defining the BOLTstampsDir.
Added % to several filters for improved utf handling.
Simplified the utf2url and url2utf functions, and got rid of the
utf_encode/decode lines.
Reworked the searchPageList code so the group parameter can handle utf
code. Thanks, Hans.
Changed the displayFmt code to handle utf characters.
Changed the logic switching in PageShortcuts; it should have been set to false.
Simplified the UTF8_strip so it now works properly.
Fixed a bug in BOLTstampsMax causing page histories to be lost. Very
sorry about this bug!
Fixed the rename command to work with utf pagenames.
Fixed the login command to allow utf member ids. This could allow just
about any kind of member name...

I did NOT yet do a lot of extensive testing on this release, so I'm
still calling it fairly experimental. But I was eager to get it out
the door so others can help with the testing. And also, to repair
the major bug in the new stampsMax parameter (which I'm very sorry
about). So hold off on using this on a production site until we've
gotten some feedback on this release. If there are additional tweaks
we need to make, I'll be happy to issue another release tomorrow or
the next day to polish it off further.

My many thanks again, esp to Hans for all his hard work poring through
the code and offering many excellent suggestions! These UTF page names
are quite cool!

Cheers,
Dan

Hans

Mar 10, 2009, 4:32:29 AM
to bolt...@googlegroups.com
First feedback: installed 2.62, now I only see 'Invalid Link.' for
every site link.
I have not tried to analyse this rather serious bug.

~Hans

The Editor

Mar 10, 2009, 5:17:57 AM
to bolt...@googlegroups.com
Oops, I was mostly testing utf pages and I missed this.

line ~1958 of engine.php should read:

return strtr($text, $utf2asc);

and not

$text = strtr($text, $utf2asc);

With this fix it now seems to work whether in utf mode or not.

I'm wondering though if we should have some additional functionality,
as creating utf pages and then switching utf mode off leaves %encoded
pages in the output. It seems these should all be utf no matter
what... unless escaped or something. I like the idea of doing this in
the domarkup function, which might simplify things in other places as
well. But we can try more ideas once we have the basics working.

I just uploaded a revised 2.62 version. Hopefully this will work a
bit better. :)

Cheers,
Dan

Linly

Mar 10, 2009, 5:24:44 AM
to BoltWire
Testing report:

Hi Dan,

Good news:

1) The {p1} is normal now.

2) The edit, data, copy, rename, delete, and source commands ALL test
well. Traditional Chinese, simplified Chinese, and Japanese characters
ALL tested well. Great.

3) These two links are pointing to the same page:

[[一二|+]]
[[%e4%b8%80%e4%ba%8c|+]]

Nice.

Normal news:

1) Chinese characters in page names can't be searched. I guess this is
because they are saved %-encoded in the index file.

I suggest that in the index file page names be saved as utf-8, just
like BoltWire treats the text.
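Linly's search problem can be pictured with a tiny sketch: if the index stores %-encoded names, a substring search for the raw characters can never match; decoding each entry before comparing (or storing decoded names, as suggested) fixes it. Python used purely for illustration, and the index entries here are hypothetical:

```python
from urllib.parse import unquote

# Hypothetical index entries as stored on disk (%-encoded page name).
index = ["%e4%b8%80%e4%ba%8c", "site.index", "main.home"]
term = "一二"

# Searching the raw index misses the Chinese page name...
raw_hits = [n for n in index if term in n]
print(raw_hits)          # []

# ...but decoding each entry first finds it.
decoded_hits = [n for n in index if term in unquote(n)]
print(decoded_hits)      # ['%e4%b8%80%e4%ba%8c']
```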

Bad news:

1) Some weird situation:

"[[{p}|+]]" can find the correct page, but

"[[~.{{p}:author}|+]]" or "[(lastmodified {p} %Y/%m%d)]" cannot find
the corresponding page. The former results in "Interwiki entry not
found."; the latter returns nothing at all. Both syntaxes used to work
on an English-named page.

2) The honor lines seem broken. I think this is the line spacing issue
that happened several versions ago.

3) The msg after deleting still shows the %-encoded page name:

Page %e4%b8%ad%e6%96%87 has been deleted.

Cheers, linly

Hans

Mar 10, 2009, 5:41:25 AM
to bolt...@googlegroups.com
Updating to the latest fix shows normal ascii text links okay. Any
utf-8 links are shown as raw % encoded text, not utf-8.

Are you requiring special config switches for utf-8 handling? I have
not used that so far with all the changes I proposed and tested.

Cheers

Hans

Mar 10, 2009, 6:09:33 AM
to bolt...@googlegroups.com
Okay, adding utfpages: true to site.config renders utf-8 names fine now.

I got a few test pages under page 一
and the following markup [(search group=一 fmt="* [[{+p}]] {+p1}")]
results in output with raw % code, not utf-8
like

* %e4%b8%80.%e8%a8%b1%e5%8a%9f%e8%93%8b 一
* %e4%b8%80.%e8%a8%b1%e5%8a%9f%e8%93%8b%e4%b8%80.%e8%a8%b1%e5%8a%9f%e8%93%8b


~Hans

Linly

Mar 10, 2009, 6:15:30 AM
to BoltWire
I have set "utfPages: true" in "site.config". Don't know whether this
is the reason or not.

Cheers, linly

Hans

Mar 10, 2009, 6:59:07 AM
to bolt...@googlegroups.com
As to the error in search displayFmt I reported:

function BOLTdisplayFmt needs a change to:
..................
foreach ((array)$outarray as $item) {
  $item = trim($item);
  if ($item === '') continue;
  $fmt2 = $fmt;
  $p = explode('.', $item);
  $fmt2 = str_replace('{+p0}', count($p), $fmt2);
  $fmt2 = str_replace('{+p}', BOLTurl2utf($item), $fmt2);
  foreach ($p as $i => $ii) {
    ............... etc

1. The (array) cast treats the var as an array, so foreach will not
report an error if an empty var is supplied (due to an empty search).

2. BOLTurl2utf($item) is needed to deliver proper {=p} values.

I suggested a simple urldecode for this before, but BOLTurl2utf may be fine.

~Hans

Hans

Mar 10, 2009, 7:00:06 AM
to bolt...@googlegroups.com
> 2. BOLTurl2utf($item) is needed to deliver proper {=p} values.

should read as

2. BOLTurl2utf($item) is needed to deliver proper {+p} values.

Hans

Mar 10, 2009, 7:22:11 AM
to bolt...@googlegroups.com
2009/3/10 Linly <linl...@gmail.com>:

>
> I have set "utfPages: true" in "site.config". Don't know this is the
> reason or not.

Yes, that did the trick.

I find the term 'utfpages' confusing. All web pages are utf-8 encoded
as standard via declaration in the skin HTML head. Thus using utf-8
text in the content displays okay.

But 'utfpages: true' refers to utf-8 encoded page names only.
'utfPageNames: true' would be a lot clearer as a config switch.
And I think this should be true by default, the same as utf-8
character decoding is set by default via the skin HTML head. ASCII
text in page names appears as normal, as it is the utf-8 standard way
to render the lower ASCII characters as normal characters.

The only issue is if Western European users wish to have their
diacritic characters transformed to non-diacritic lower ascii
characters. For this one should use a config variable, but not for
generally using utf-8 in page names. Perhaps a var called
simpleAsciiPageNames or basicAsciiPageNames.
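The diacritic-to-ascii transform described here can be sketched with Unicode decomposition: split each character into a base letter plus combining marks, then drop the marks. Python used purely for illustration; note this maps Öl to Ol, while the German convention Ö → Oe would need a language-specific table on top:

```python
import unicodedata

def strip_diacritics(name: str) -> str:
    # NFD decomposes e.g. è into e + combining grave accent;
    # dropping the combining marks leaves only the base letters.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("tèst"))  # test
print(strip_diacritics("Öl"))    # Ol
```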

Cheers,
~Hans

The Editor

Mar 10, 2009, 9:18:28 AM
to bolt...@googlegroups.com

Thanks again for the extremely fast testing and bug reporting. I'll
get to working on some fixes to what's been reported so far and try
and issue a release later today or tomorrow. I have a lot of catchup
on other stuff to do, but most of these don't seem too difficult to
resolve.

Plus, we may get better results thinking about it just a bit first.
I'm leaning toward trying to do all the conversion at the very last
minute, as it should solve many problems. But that might cause
problems with the links, and perhaps other things. Maybe none at all.
Will just take some experimentation. Fortunately a change like this
will be under the hood and not disrupt sites, which is my biggest
concern right now: getting something stable for folks like Linly to
be able to begin using.

As for the config setting, I have no problem with changing utfpages to
utfpagenames, though I'm not convinced it adds that much semantically.
We use the variable {page} to represent the page (name), and it's
never been confusing so far as I know. Similarly, when scrolling down
my pages dir, I don't see anything difficult about seeing a page as
either ascii or utf8 depending on whether its name is some.page or
%e2%a7.etc. Still, it's just a question of changing the parameter.

The bigger issue, perhaps, is whether or not it should be the default
setting. My original plan was to only have ascii page names, and I
only hesitantly offered minimal utf page name capabilities little by
little, never quite expecting we would have what we have now. And it
was never my intention to make it the default behavior. Just an option
I figured we'd stretch as far as we could without breaking things.

But like the conversion to utf encoding, once we got the combination
right, it turned out to be simpler than expected, and quite cool. So
I'm open to considering the possibility of changing the default
setting. Just still hesitant. It does have the nice advantage of
titles not being stripped of diacritics. My concerns are:

1) Latin based alphabets, as Hans mentioned. (On an earlier issue, we
could make our config setting pageNames: ascii or utf-8 with one of
them default). My guess is most users of accented latin-based letters
would prefer to be able to read a pagename like tèst as test rather
than t%e27st (or whatever it is). Creating a plugin of alternate
actions to automatically save titles with accents might be a useful
addition to this.

2) Speed. I thought the extra handling of utf conversions might affect
speed. But on my tests with utf pages so far, they seem slightly
faster. Curious.

3) Security. I am not sure what security implications there are of
allowing % encoded text in a page. For example, what if someone
encoded a javascript snippet and then cut and pasted the script into a
page? If BoltWire simply unencoded it, would we be opening ourselves
to an xss hack? Likely. (This, by the way, is one reason I have
everything run through a pair of central utf2url/url2utf functions--we
can easily add extra precautions if needed systemwide.)

4) I've not thought through all the case sensitivity issues. In most
things BoltWire is case insensitive by design. I'm not sure how
switching over will affect this. And particularly with diacritics that
DO have case conversions. I have some code for utf case changing, but
it's not all tested etc.

5) To be honest, I personally like it in the nice safe confines of
simple ascii page names, and probably for most English sites, it is
easiest. And maybe mostly for this reason, I'm still inclined to keep
ascii the default. :)
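Concern 3 above can be made concrete with a small sketch: a %-encoded script payload passes a filter that only looks for literal tags, and blind decoding turns it back into live markup; escaping the decoded text before output closes the hole. Python and the naive filter here are purely illustrative, not BoltWire's actual filter code:

```python
from urllib.parse import unquote
from html import escape

payload = "%3Cscript%3Ealert(1)%3C%2Fscript%3E"

def naive_filter(text: str) -> bool:
    # A filter that only checks for literal tags is blind to %-encoding.
    return "<script" not in text.lower()

print(naive_filter(payload))   # True: the encoded payload sails past
decoded = unquote(payload)
print(decoded)                 # <script>alert(1)</script>
# Emitting `decoded` raw would be an XSS; escaping after decode is safe:
print(escape(decoded))         # &lt;script&gt;alert(1)&lt;/script&gt;
```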

Cheers,
Dan

Hans

Mar 10, 2009, 12:02:00 PM
to bolt...@googlegroups.com
2009/3/10 The Editor <edi...@fast.st>:

> 1) Latin based alphabets, as Hans mentioned. (On an earlier issue, we
> could make our config setting pageNames: ascii or utf-8 with one of
> them default). My guess is most users of accented latin-based letters
> would prefer to be able to read a pagename like tèst as test rather
> than t%e27st (or whatever it is). Creating a plugin of alternate
> actions to automatically save titles with accents might be a useful
> addition to this.

I differ here. Web developers were forced to use basic ASCII in the
past. The growing support of Unicode with utf-8 encoding and decoding
makes it more and more anachronistic to use page names in urls with
stripped diacritics.
Take for instance the German word Öl (oil). A stripped page name like
Oel is just about readable to a German, but not friendly, and it fails
totally in Google searches. For Google you need the proper utf-8
encoded word, either as Öl or as a % encoded url.

Also, more browsers support utf-8 in their address bar: the % encoded
url http://de.wikipedia.org/wiki/%C3%96le reads in Firefox as
http://de.wikipedia.org/wiki/Öle. So it is getting more and more
convenient to use utf-8 always, as well as more practical and search
engine friendly.

Therefore I think that utf-8 page names should be the default. If you
only use basic Latin characters, without diacritics, and no characters
from other languages, you won't even notice. If you do, you will like
the default support for these. If you are a Western European who wants
to have diacritic characters transformed into basic Latin ones, you
should be able to set a config switch. And switching to such a
transformation mode should still show pages with the proper diacritics
in the titles. This should be automatic, with no plugin installation
needed for it. It really annoys me to enter something like Öl as the
page name for a new page, and then the page not only shows oel in the
url, but Oel as the page title, so that I immediately have to set the
title back to Öl.

> 2) Speed. I thought the extra handling of utf conversions might affect
> speed. But on my tests with utf pages so far, they seem slighter
> faster. Curious.

Good!

> 3) Security. I am not sure what security implications there are of
> allowing % encoded text in a page. For example, what if someone
> encoded a javascript snippet and then cut and paste the script to a
> page. If BoltWire simply unencoded it, would we be opening ourselves
> to an xss hack? Likely.  (This by the way is one reason I have
> everything run through a pair of central utf2url/url2utf functions--we
> can easily add extra precautions if needed systemwide).

I don't think this is an issue. Need to do some testing though.

> 4) I've not thought through all the case sensitivity issues. In most
> things BoltWire is case insensitive by design. I'm not sure how
> switching over will affect this. And particularly with diacritics that
> DO have case conversions. I have some code for utf case changing, but
> it's not all tested etc.

There may be some issues here. Hopefully not too big ones to tackle!
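To illustrate the case issue raised above: plain ASCII lowercasing is no longer enough once names can contain characters with non-ASCII case pairs, and some characters need a full Unicode case fold rather than a simple lowercase. A Python illustration (Python's str methods are Unicode-aware; whatever utf case-changing code BoltWire adopts would need equivalent behavior):

```python
# Characters like Ö/ö have case pairs outside ASCII, so a
# byte-oriented ASCII lowercase would fail to match them.
print("ÖL".lower() == "öl".lower())                  # True
# Some characters need full case folding, e.g. German ß vs SS:
print("straße".casefold() == "STRASSE".casefold())   # True
print("straße".lower() == "STRASSE".lower())         # False
```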

> 5) To be honest, I personally like it in the nice safe confines of
> simple ascii page names, and probably for most English sites, it is
> easiest. And maybe mostly for this reason, I'm still inclined to keep
> ascii the default.  :)

Well, that sounds very US centric ;-)). Time to embrace the world!
Even Canada and most European countries needed to deviate from the US
ASCII code standard and create their own versions to accommodate
diacritic characters etc. (before Unicode was developed).

Now with Unicode we can be much broader and step outside the
confinements of basic ASCII.
Just think of basic ASCII as a specialist subset of Unicode, and write
the program code for the general not the specialist case.

If you followed my posts about this you noticed that I was rather
sceptical that it would be doable. I thought it needed a different
approach, not just fixing things here and there. Well, we got quite
far with fixing things, so why not sail with it?

cheers,
~Hans

Hans

Mar 10, 2009, 12:16:07 PM
to bolt...@googlegroups.com
One more example for the need of utf-8 decoding:

For testing I switched off utfpages. Now a page with a nice Chinese
name appears as ....p=許功蓋 in Firefox's url bar, but as
%e8%a8%b1%e5%8a%9f%e8%93%8b in the Windows title bar and as
%e8%a8%b1%e5%8a%9f%e8%93%8b as the page title in the page (and in any
search result).

~Hans

Hans

Mar 10, 2009, 12:23:06 PM
to bolt...@googlegroups.com
Another option may be to have utf-8 names as default, and
BasicLatinNames as a plugin. That would offer the possibility to
restrict basic Latin names to certain page groups, while others can
enjoy the full freedom of utf-8 support.

~Hans

Linly

Mar 10, 2009, 12:48:43 PM
to BoltWire
Hi, Dan and Hans,

Since it is mostly the three of us walking together along the utf-8
road, I'd like to talk a little about my feelings. I'm not a
programmer, so the technical problems are not things I can discuss,
but I would like to share the sentimental side.

People outside the English world are all familiar with English,
especially in urls. The day Hans first solved the problem and made
true Chinese characters appear in the url bar, I was so touched that I
couldn't stop my tears from falling. They were happy tears. So rare;
after waiting so many days, BoltWire achieved this. Only a few other
programs can do the same. (Even PmWiki could not do it perfectly.)

This makes BoltWire one of the most advanced CMSes in i18n. I'm not a
Christian, but I always admire the missionaries who go to a foreign
country and use the local language to touch people.

Utf-8 page names may not add any function to BoltWire, but they make
people outside the English world LOVE BoltWire.

Cheers, linly


jacmgr

Mar 10, 2009, 1:05:23 PM
to BoltWire
I can't help or test this UTF encoding thing; I don't even understand
it. BUT, if it is achievable, then I vote it as the priority feature
above other features.

Seems like Hans and Dan are "in the zone" on this right now, and Linly
has got the testing covered, so I vote go...go...go.... as long as
that 3 man team keeps at it!

I vote it be the DEFAULT. I won't know the difference but the rest of
the world will.

Sorry I can't help, but I think you are onto a very important
milestone here, and it looks like you are real close! Good luck.

The world is actually a small place. Maybe I (Caucasian Anglo
Philadelphian of Italian-Irish descent) can get my Korean wife
interested in computers and wiki content! It'd be easier for her to
take an interest if it was in the Korean language, which I think
Linly's tests pretty much show will work!

Linly

Mar 10, 2009, 1:36:17 PM
to BoltWire
Hi jacmgr,

Take a look:

http://txtray.net/jargon/?p=%ED%95%98%ED%9A%8C%EB%A7%88%EC%9D%84
http://txtray.net/jargon/?p=하회마을

Both should work. Ask your wife, she would know those Korean
characters.

(I don't know Korean, the content of that page is copied from a Korean
newspaper site.)

Cheers, linly

fans

Mar 10, 2009, 1:52:35 PM
to BoltWire

Need utf pagename

Hans

Mar 10, 2009, 2:57:40 PM
to bolt...@googlegroups.com
Thanks for sharing that, Linly! It touched me too on an emotional
level; it filled me with surprise, joy, and delight.

I was really surprised to see Chinese and other higher Unicode
characters in the Firefox address bar. Internet Explorer only shows
the % encoded raw text. So the new browsers are on our side!

The only real limitations for utf-8 encoded page names I see are in
the restriction of php code to lower ASCII, and in BoltWire's clever
tying together of action names, markup function names, and php
function names. There is no translation layer built in, so we have to
restrict markup function names like [(search ...)] to lower ASCII.
Actually, action pages could be renamed to higher Unicode characters
without problems, I think, if one wants to get fancy. So there should
not be any limitations as to page names, but there is a limitation as
regards function names and variable names (do info variables work with
higher utf-8 characters?).

The programming challenge is to decode any utf-8 characters cleanly
for php handling, and encode any results cleanly.

Cheers,
~Hans

Linly

Mar 11, 2009, 1:05:41 AM
to BoltWire
> (do info variables work with higher
> utf-8 characters?)

Interesting situation here.

1) How does [(info)] treat Chinese page names?

If I use this form
<code>
[form]
[text target size=25]
[submit]
[session create]
[session nextpage {=target}]
[session info "field='{=target}' value='test 一二'
target='info.infotest'"]
[form]
</code>

I can create a new page and save info data to page "info.infotest".
If I create a page named "一二", it appears in info.infotest as this:

%e4%b8%80%e4%ba%8c: test 一二

The field is % encoded, and the value is direct Chinese. Can I
retrieve the value? The answer is no. Using this markup in page "一二":

{info.infotest::{p}}

It appears like this: "{info.infotest::一二}". No value returned.

2) Does the info counter work or not?

It's weird that the info counter can write direct Chinese to the
field. The "info.counter" page shows this line:

一二: 111

The field name is direct Chinese, but the value part is not right.
Actually I only viewed the page three times. Somehow the info counter
cannot add up the pageviews; it appends 1, 1, 1, all the time.
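The lookup failure in point 1 looks like a simple key mismatch: the field was stored %-encoded, but the {p} used for retrieval is the decoded name, so the two never compare equal. Normalizing both sides to one form (decoded utf-8, say) makes the lookup work. A Python sketch with hypothetical data, purely to illustrate:

```python
from urllib.parse import unquote

# Info data as saved: the field name stored %-encoded, as observed.
info = {"%e4%b8%80%e4%ba%8c": "test 一二"}
page = "一二"

print(info.get(page))          # None: the decoded key never matches
# Normalizing keys to decoded utf-8 on read makes the lookup succeed:
decoded_info = {unquote(k): v for k, v in info.items()}
print(decoded_info.get(page))  # test 一二
```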

Cheers, linly

Linly

Mar 11, 2009, 5:04:31 AM
to BoltWire
I've been thinking that the %-encoded code should ONLY be saved in one
place: the file name in the file system. Other than that, inside the
content of a page, no matter whether it is an interwiki link, info
data, index, or anything else, it should be saved as normal utf-8 text.

Then when an interwiki link is rendered to an html link, the
%-encoding is applied at that moment.

Is my understanding right or wrong?

Cheers, linly

The Editor

Mar 11, 2009, 6:02:47 AM
to bolt...@googlegroups.com
On Wed, Mar 11, 2009 at 5:04 AM, Linly <linl...@gmail.com> wrote:
>
> I've thinking that the %-encoded code should ONLY saved in one place -
> the file name in file system. Other then that, inside the content of a
> page, no matter it is interwiki link, info data, index or anything, it
> should save as normal utf-8 texts.
>
> While the interwiki link is rendered to html link, %-encoding is
> executed at this moment.
>
> Is my understanding right or wrong?


Thank you very much for your testing. This really helps. It also
shows, as Hans pointed out, that we have taken kind of a patchwork
approach to slowly adding UTF support. Mostly because I had no clue
how to do it initially. :)

The answer to your question is "probably".

First we need to really think about the best place to encode/decode
the utf chars so that we have maximum and simplest functionality.
Switching the decoding to the final output of the markup table (as
Hans suggested) will simplify a lot of code and make sure we never see
a %encoded url anywhere (unless escaped--a whole other issue). And
while we probably don't want the underlying page content %encoded (for
those who snoop around in there), there are some real possibilities,
if we did, that could make UTF page names work anywhere in the system.
For example, there's no reason we couldn't define some custom system
var like {二} and get it to work. Or a function like [(二 ....)]. (We
might already be able to do this with mapping). Crazy possibilities.

However, I have two concerns.

1) Security. Though I haven't yet tested, I'm concerned someone could
url_encode an XSS hack, drop it into any BoltWire comment box, and
wreak havoc. It would bypass all filters (I have had to add %'s now to
most filters to admit page names), and then if BoltWire blindly
decoded everything, it could output perfectly formed javascript to the
page. This may already be a vulnerability if you have the new utf
pagenames enabled.

2) Filters. We may have real issues when all our filters become
essentially meaningless, not to mention problems with markup rules.
The purpose of filters is to validate and clean user input, and block
disruptive or malicious code from being entered. But it kind of
defeats the whole idea if anything can be entered anywhere: it gets
urlencoded, and it sails past the filters with ease. Fun but scary.

I've also been really busy, and will be all this week. So the next
release may not be out till next week, unless I can squeeze in a
couple of quick fixes. It probably won't be till next week that I can
get some time to think through a more systematic way to integrate our
new UTF capabilities into BoltWire.

Cheers,
Dan

P.S. Input, particularly on the theoretical level right now, would be
really helpful. I'm almost persuaded by the recent posts in support of
making utf default, but still dragging my feet because of the
potential risks involved.

Linly

Mar 11, 2009, 9:03:01 AM
to BoltWire
> First we need to really think about the best place to encode/decode
> the utf chars so that we have maximum, and simplest functionality.
> Switching the decoding to the final output of the markup table (as
> Hans suggested) will simplify a lot of code and make sure we never see
> a %encoded url anywhere (unless escaped--a whole other issue). And
> while probably don't want the underlying page content %encoded (for
> those who snoop around in there), there are some real possibilities if
> we did that could make UTF page names work anywhere in the system. For
> example, there's no reason we couldn't define some custom system var
> like {二} and get it to work. Or a function like [(二 ....)]. (We might
> already be able to do this with mapping). Crazy possibilities.

I think if we came to "{{p}::二}", that would be far enough. :)

> 1) Security. Though I haven't yet tested, I'm concerned someone could
> url_encode a XSS hack, drop it into any BoltWire comment box, and
> wreak havoc. It would bypass all filters (I have had to add %'s now to
> most filters to admit page names), and then if BoltWire blindly
> decoded everything, it could output perfectly formed javascript to the
> page. This may already be a vulnerability if you have the new utf
> pagenames enabled.

I know nothing about this, but why not just put a pair of
<code></code> tags around the comment input, forcing all of the
content to become pure text? No code allowed. Many blog scripts use
this approach to prevent xss and other things.

Cheers, linly

The Editor

Mar 11, 2009, 11:48:17 AM
to bolt...@googlegroups.com
On Wed, Mar 11, 2009 at 9:03 AM, Linly <linl...@gmail.com> wrote:
>
>> First we need to really think about the best place to encode/decode
>> the utf chars so that we have maximum, and simplest functionality.
>> Switching the decoding to the final output of the markup table (as
>> Hans suggested) will simplify a lot of code and make sure we never see
>> a %encoded url anywhere (unless escaped--a whole other issue). And
>> while probably don't want the underlying page content %encoded (for
>> those who snoop around in there), there are some real possibilities if
>> we did that could make UTF page names work anywhere in the system. For
>> example, there's no reason we couldn't define some custom system var
>> like {二} and get it to work. Or a function like [(二 ....)]. (We might
>> already be able to do this with mapping). Crazy possibilities.
>
> I think if we came to "{{p}::二}", that would be far enough. :)

For now, perhaps. But I'm sure someone is going to want to have some
info retrieved on a non-ascii page. :) Probably you first, Linly!
Perhaps a report showing all the page summaries or something, with
some pages being utf? And being able to do {二} might be really
trivial if we get the encoding/decoding timed right. It's just the
implications that are interesting.

I just verified we can get function mapping already. For instance put
this in config.php:

$BOLTtoolmap['f']['追寻'] = 'search';

Then in a page put this :)

[(追寻 group=site)]

Though no one has done much with mapping yet, it is a very powerful
thing. I think we could develop translations where a language file not
only translates the messages, buttons, etc., but also includes some
kind of chinese.php file you simply enable, and all your functions,
commands, and conditions are instantly remapped to Chinese
equivalents. (The default English names still work; the Chinese ones
just get mapped to their English equivalents.) Cool?
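The $BOLTtoolmap remap above is essentially an alias table resolved before dispatch: look the incoming name up in the map, fall back to the name itself, then call the canonical handler. A Python sketch of that dispatch idea (the handler and names here are hypothetical, not BoltWire's actual internals):

```python
# Canonical English handlers, keyed by function name.
handlers = {"search": lambda group: f"searching group {group}"}

# Alias table in the spirit of $BOLTtoolmap['f'].
toolmap = {"追寻": "search"}

def call(name, **args):
    # Fall back to the name itself so English names keep working.
    canonical = toolmap.get(name, name)
    return handlers[canonical](**args)

print(call("追寻", group="site"))    # searching group site
print(call("search", group="site"))  # searching group site
```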

>> 1) Security. Though I haven't yet tested, I'm concerned someone could
>> url_encode a XSS hack, drop it into any BoltWire comment box, and
>> wreak havoc. It would bypass all filters (I have had to add %'s now to
>> most filters to admit page names), and then if BoltWire blindly
>> decoded everything, it could output perfectly formed javascript to the
>> page. This may already be a vulnerability if you have the new utf
>> pagenames enabled.
>
> I know nothing about this, but why not just put a pair of <code></
> code> surrounding the comment input, forcing all of the content become
> pure text? No code allowed. Many blog script use this approach to
> prevent xss or other things.

That can be done, as long as you don't want to allow markup in the
comments, but some might want to allow certain markups. Actually, that
suggests a need for something like <limit rules=vars,fmt,links>
</limit> for situations like this. Easy to do... Great idea.

But beside the point, we can't count on every user to be wise. We have
to have a way to protect them from themselves. I've just verified on
my own system there are some definite vulnerabilities with the utf
pages. Hans has pointed out others offlist. Unfortunately, these both
occur even with utf pages turned off. So this is something we need to
be careful about, on our end, and not put that responsibility on the
user end.

Cheers,
Dan

Hans

Mar 11, 2009, 12:16:11 PM
to bolt...@googlegroups.com
> But beside the point, we can't count on every user to be wise. We have
> to have a way to protect them from themselves. I've just verified on
> my own system there are some definite vulnerabilities with the utf
> pages. Hans has point out others offlist. Unfortunately, these both
> occur even with utf pages turned off. So this is something we need to
> be careful about, on our end, and not put that responsibility on the
> user end.

Just a quick comment about utf-8:

utfpages: false does not mean utf is turned off.
With charset=utf-8 in the skin's HTML head all wiki pages will be
treated as utf-8 encoded.
And forms allow utf-8 encoded character input, unless it is explicitly
filtered out.
Generally for content we want to allow this.
For page names we want percent encoding and forbid some characters.
For info (in-text) variables we probably want to allow, again with
disallowing certain characters.
For other variables we could be more strict and insist on
alphanumeric characters, plus a few others.
And unless input is urldecoded or otherwise filtered/restricted the
php functions deal with utf-8 encoded characters, not percent encoded
strings.

To help not get confused in my testing I use echo statements to see
the vars at various points in the process, and use test pages and test
input with Linly's nice Chinese characters, which show up in the echo,
so I know: aha, unrestricted utf-8 here!

Cheers,
Hans
