This works:
$agent->follow_link(url_regex => qr/librarians/, n => 1);
The corresponding XHTML code is:
<a href="mkbAdmin?func=librarians&lang=en">Edit&nbsp;Librarians</a>
I want the text-based match to work, because I use HTTP::Recorder to
generate the code automatically as I surf through a proxy, and it
generates code of the type that doesn't work.
This works:
$agent->follow_link(text => 'Logout', n => 1);
By the way HTTP::Recorder actually generates:
$agent->follow_link(text => 'Edit&nbsp;Librarians', n => '1');
HTML::TreeBuilder, or a module it's using, returns &nbsp; as a single
character, it might be that you have to
use the code instead.
Comment on http://johnbokma.com/perl/search-term-suggestion-tool.html
says: (&nbsp;, stored as char 225)
So you might want to try: "Edit\xe1Librarians".
Wild guess.
--
John
Arachnids near Coyolillo
http://johnbokma.com/perl/
So this works:
$agent->follow_link(text => "Edit\xa0Librarians", n => 1);
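A side note for anyone who would rather not hard-code the \xa0: here is a minimal sketch of a pattern that accepts either an ordinary space or a no-break space. It assumes the link text reaches WWW::Mechanize as a decoded string in which &nbsp; has become the single character U+00A0, as the working call above suggests. The helper name link_text_regex is made up for this example; text_regex is an existing follow_link parameter.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern from a human-readable label that
# accepts either a plain space (0x20) or a no-break space (0xA0)
# wherever the label contains a space.
sub link_text_regex {
    my ($label) = @_;
    my $pattern = join '[\x{20}\x{A0}]',
                  map { quotemeta($_) } split / /, $label;
    return qr/^$pattern$/;
}

my $re = link_text_regex('Edit Librarians');

# Both spellings of the link text match the same pattern.
print "plain space matches\n"    if "Edit Librarians"     =~ $re;
print "no-break space matches\n" if "Edit\xa0Librarians" =~ $re;

# With a WWW::Mechanize object in $agent this could be used as:
#   $agent->follow_link(text_regex => $re, n => 1);
```

The price is that generated code from HTTP::Recorder still has to be edited by hand or by script to use text_regex instead of text.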
> John Bokma wrote:
[..]
>> HTML::TreeBuilder, or a module it's using, returns &nbsp; as a single
>> character, it might be that you have to
>> use the code instead.
>>
>> Comment on http://johnbokma.com/perl/search-term-suggestion-tool.html
>> says: (&nbsp;, stored as char 225)
>>
>> So you might want to try: "Edit\xe1Librarians".
>>
>> Wild guess.
>>
> Thanks! But it should be \xa0.
Yeah, but HTML::TreeBuilder returns it as 225 :-D.
[..]
> So this works:
> $agent->follow_link(text => "Edit\xa0Librarians", n => 1);
Glad my post was able to help you in the right way.
--
John
I add that I have developed these command lines to convert back and forth:
sed -i '/&nbsp;/s/&nbsp;/\\xa0/g;/\\xa0/s/'\''/"/g' MKBTest.pl
sed -i '/\\xa0/s/\\xa0/\&nbsp;/g;/&nbsp;/s/"/'\''/g' MKBTest.pl
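For those who prefer to stay in Perl, the same round trip can be sketched roughly as below. This is only an illustration of what the two sed commands do, line by line, not a drop-in replacement. The function names are made up, and the sample line is the HTTP::Recorder output quoted earlier in the thread.

```perl
use strict;
use warnings;

# Turn &nbsp; into the \xa0 escape and switch the line to double quotes,
# because \xa0 is only interpreted inside "...".
sub entity_to_escape {
    my ($line) = @_;
    if ($line =~ s/&nbsp;/\\xa0/g) {   # only touch lines that had an entity
        $line =~ tr/'/"/;
    }
    return $line;
}

# The reverse direction: \xa0 back to &nbsp; and back to single quotes.
sub escape_to_entity {
    my ($line) = @_;
    if ($line =~ s/\\xa0/&nbsp;/g) {
        $line =~ tr/"/'/;
    }
    return $line;
}

my $recorded = q{$agent->follow_link(text => 'Edit&nbsp;Librarians', n => '1');};
my $fixed    = entity_to_escape($recorded);
print "$fixed\n";
print escape_to_entity($fixed), "\n";
```

Like the sed version, this swaps every quote on an affected line, so a line whose strings contain $ or @ would start interpolating once it is double-quoted.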
He's after a '&nbsp;', which is a non-breaking space, which is ASCII
0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
--
szr
s/ASCII/Unicode/
--
RGB
No, it's ASCII. Extended Ascii to be precise.
My ascii chart (an old printed out list I have) lists DEC 225 as
"Lowercase 'a' with acute accent" and DEC 160 as being reserved or a
blank (which is used as a non breaking space.)
These links show the same:
http://www.ascii-code.com/
http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml
--
szr
> RedGrittyBrick wrote:
>> szr wrote:
>>>
>>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>>>
>>>
>> s/ASCII/Unicode/
>
> No, it's ASCII. Extended Ascii to be precise.
Extended ASCII is a general name for several incompatible extensions to
ASCII. They are NOT ASCII.
But The above IS Unicode. Which is in itself also an extension of ASCII,
BTW.
>
> My ascii chart (an old printed out list I have) lists DEC 225 as
> "Lowercase 'a' with acute accent" and DEC 160 as being reserved or a
> blank (which is used as a non breaking space.)
>
> These links show the same:
> http://www.ascii-code.com/
The "extended" ASCII shown here is the Windows extension, which is itself
an extension of ISO-Latin-1, which is an extension of ASCII. The site
notes this, and is in itself correct. And it does not support your idea
of extended ASCII.
> http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml
This site is plain wrong. Don't believe everything on tha Intuhnet.
M4
The old printed out list I have doesn't make this distinction, but you
are right that Unicode is -an- extension.
>> My ascii chart (an old printed out list I have) lists DEC 225 as
>> "Lowercase 'a' with acute accent" and DEC 160 as being reserved or a
>> blank (which is used as a non breaking space.)
>>
>> These links show the same:
>> http://www.ascii-code.com/
>
> The "extended" ASCII shown here is the Windows extension, which is
> itself an extension of ISO-Latin-1, which is an extension of ASCII.
> The site notes this, and is in itself correct. And it does not
> support your idea of extended ASCII.
I got the same output on my Linux system in its xterm launched from KDE
as I did in SecureCRT on Windows, which matches the output used in
Windows.
This extended ASCII set I'm referring to is what HTML (such as &#160; aka
&nbsp;) is based on, or perhaps more precisely based on ISO-Latin-1.
>> http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml
>
> This site is plain wrong.
In what way? It's the same list in my O'Reilly HTML Pocket Reference, as
is the previous link.
> Don't believe everything on tha Intuhnet.
I don't, but it matches up with what things like HTML go by (again,
ISO-Latin-1 unless otherwise specified in the HEAD, META tags in the
case of HTML.)
--
szr
>>> http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml
>>
>> This site is plain wrong.
>
> In what way? It's the same list in my O'Reilly HTML Pocket Reference, as
> is the previous link.
Welcome to the wonderful world of character sets. Or how to lose your
sanity in a day. Read
http://en.wikipedia.org/wiki/Western_Latin_character_sets_%28computing%29
as a good introduction.
It is wrong because it says that the table is "extended ASCII". There is
no such thing. There's ISO-Latin-1, 2, 3, etc, the Windows character
set, the Macintosh character set, the IBM extended ASCII set, etc. And
those are actually used today (except possibly the Mac set, did they
switch?), there are many, many more that are not frequently used today.
In fact, that table seems to show the Windows character set
(Windows-1252). A character set which is actually used very little,
Windows NT and derivatives use UCS16 by preference and the Internet uses
mainly ISO-Latin-1 or UCS32, although ISO-Latin-15 is used too (it
contains the Euro sign, which ISO-Latin-1 does not).
My workstation uses ISO-Latin-15. In Windows I can enter characters by
holding down alt and typing their IBM Extended ASCII code on the numeric
keypad. So even saying ISO-Latin-1 is by default "the extended character
set" doesn't hold water, although it probably is the widest used
character set besides UCS16 and UCS32.
Extended ASCII is a concept, a character set that uses the ASCII codes
for the first 127 characters. There are many extended ASCII sets. Calling
one THE extended ASCII set is just plain wrong. And calling the Windows
character set THE extended ASCII set is just ludicrous.
That is why the world is switching to Unicode. One character set to rule
them all. But even with Unicode, which one? :-)
M4
-- I believe in standards. Everyone should have one. --
> RedGrittyBrick wrote:
>> szr wrote:
>>>
>>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>>>
>>
>> s/ASCII/Unicode/
>
> No, it's ASCII. Extended Ascii to be precise.
There's no such encoding as "extended ASCII." ISO/ANSI standard ASCII is
seven bits. Besides which, the document character set for HTML is clearly
stated to be Unicode in the HTML spec:
<http://www.w3.org/TR/REC-html40/charset.html#h-5.1>
sherm--
--
My blog: http://shermspace.blogspot.com
Cocoa programming in Perl: http://camelbones.sourceforge.net
> This extended ASCII set I'm referring to is what HTML (such as &#160; aka
> &nbsp;) is based on, or perhaps more precisely based on ISO-Latin-1.
It's clearly documented by the W3C that numeric entities in HTML refer to
Unicode code points:
<http://www.w3.org/TR/REC-html40/charset.html#h-5.1>
> I don't, but it matches up with what things like HTML go by (again,
> ISO-Latin-1 unless otherwise specified in the HEAD, META tags in the
> case of HTML.)
For one thing, document encoding is an entirely different animal; numeric
entities always refer to Unicode, even when the document encoding is not
Unicode.
For another, the *correct* way to communicate document encoding, whether
it's for an HTML, XML, or some other ML document, is to include it as part
of the content-type HTTP header.
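To make the distinction concrete, here is a minimal sketch using only the core Encode module. The helper decode_numeric_refs is made up for this example and handles numeric references only, not named entities such as &nbsp;. It shows that &#160; always resolves to code point U+00A0, while the bytes used to store that character depend on the document encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: resolve decimal and hexadecimal numeric character
# references. The number is a Unicode code point, whatever encoding the
# surrounding document happens to use.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/chr(hex $1)/ge;
    $text =~ s/&#([0-9]+);/chr($1)/ge;
    return $text;
}

my $text = decode_numeric_refs('Edit&#160;Librarians');
my $nbsp = substr($text, 4, 1);          # the character between the words
printf "code point: U+%04X\n", ord $nbsp;

# The same character serialises to different bytes per encoding.
printf "ISO-8859-1 bytes: %s\n", unpack('H*', encode('ISO-8859-1', $nbsp));
printf "UTF-8 bytes:      %s\n", unpack('H*', encode('UTF-8',      $nbsp));
```

In real code a module such as HTML::Entities would normally do the decoding; the hand-rolled substitution is only there to keep the sketch self-contained.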
Lots of people make this mistake. As your first reference says, ASCII is
a 7-bit character set and does not define a character at code-point 160.
> Extended Ascii to be precise.
To be imprecise!
There are many different incompatible character sets and encodings that
claim to be "Extended ASCII"
Read http://en.wikipedia.org/wiki/Extended_ascii
Especially
http://en.wikipedia.org/wiki/Extended_ascii#Character_set_confusion
See 160 = "lowercase a acute" in these "Extended ASCII" tables:
http://www.webopedia.com/TERM/E/extended_ASCII.html
http://www.telacommunications.com/nutshell/extascii.htm
http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm
http://telecom.tbi.net/asc-ibm.html
--
RGB
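The disagreement between those tables is easy to reproduce with the core Encode module. The following is a minimal sketch that interprets the single byte 0xA0 under three 8-bit encodings. The choice of cp437 as the "IBM" table is an assumption made for illustration, and nothing here is a diagnosis of what HTML::TreeBuilder does.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Interpret the same byte, 0xA0 (decimal 160), under different encodings
# and report which Unicode code point each one maps it to.
for my $encoding (qw(ISO-8859-1 cp1252 cp437)) {
    my $byte = "\xA0";
    my $char = decode($encoding, $byte);
    printf "%-10s byte 0xA0 -> U+%04X\n", $encoding, ord $char;
}
```

Under ISO-8859-1 and cp1252 the byte is U+00A0, the no-break space. Under cp437 it is U+00E1, lowercase a with acute accent, which is code 225 in the other two tables.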
You're right. Perhaps too many things competing for brain-time and
somehow that got by me when I should have known better. Thanks :-)
--
szr
> John Bokma wrote:
>> "M.O.B. i L." <mik...@df.lth.se> wrote:
>>
>>> John Bokma wrote:
>>
>> [..]
>>
>>>> HTML::TreeBuilder, or a module it's using, returns &nbsp; as a
>>>> single character, it might be that you have to
>>>> use the code instead.
>>>>
>>>> Comment on
>>>> http://johnbokma.com/perl/search-term-suggestion-tool.html says:
>>>> (&nbsp;, stored as char 225)
>>>>
>>>> So you might want to try: "Edit\xe1Librarians".
>>>>
>>>> Wild guess.
>>>>
>>> Thanks! But it should be \xa0.
>>
>> Yeah, but HTML::TreeBuilder returns it as 225 :-D.
>
> He's after a '&nbsp;',
Yes, I am aware of that. And somehow HTML::TreeBuilder or a module it uses
returns &nbsp; as \xe1.
--
John
Yes. The question is whether this is a bug in HTML::TreeBuilder or
whether there is a logical reason for it. DEC 225 doesn't seem to be a
space of any kind in any ascii list I've checked, but I don't doubt I've
missed one somewhere :-)
--
szr
>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>
> s/ASCII/Unicode/
Exactly. ISO-8859-* too.
--
Affijn, Ruud
"Gewoon is een tijger."
>>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>>
>> s/ASCII/Unicode/
>
> No, it's ASCII. Extended Ascii to be precise.
The ASCII character set is a 7-bit code and it contains 128 characters,
not more.
See also `man ascii`.
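The 7-bit limit is easy to check in code. A minimal sketch follows; the function name is_ascii is made up for this example.

```perl
use strict;
use warnings;

# True if every character of the string is in the 7-bit ASCII range
# 0..127; anything from 128 upwards, such as 160, is outside ASCII.
sub is_ascii {
    my ($string) = @_;
    return $string !~ /[^\x00-\x7F]/;
}

printf "%-16s %s\n", 'plain space:',
    is_ascii("Edit Librarians") ? 'ASCII' : 'not ASCII';
printf "%-16s %s\n", 'no-break space:',
    is_ascii("Edit\xa0Librarians") ? 'ASCII' : 'not ASCII';
```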
> RedGrittyBrick schreef:
>> szr:
>
>>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>>
>> s/ASCII/Unicode/
>
> Exactly. ISO-8859-* too.
No, no, HTML uses Unicode codepoints (which in this case coincide, but
that's beside the (code)point).
M4
>>>> He's after a '&nbsp;', which is a non-breaking space, which is
>>>> ASCII 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as
>>>> '&#160;'.
>>>
>>> s/ASCII/Unicode/
>>
>> Exactly. ISO-8859-* too.
>
> No, no, HTML uses Unicode codepoints (which in this case coincide, but
> that's beside the (code)point).
No, no, no, no, that depends on the encoding being used. Yes, numeric
references always refer to Universal Character Set code points,
regardless of the page's encoding, but HTML is not "limited" to that.
See also http://www.xs4all.nl/~rvtol/htmlcods.html which has been
rendered in many different (so non-"standard") ways in the past 10+
years. :)
> ISO-Latin-1
Normally called "ISO 8859-1" or "ISO Latin 1" or just "Latin-1".
Though ITYM: "ISO-8859-1". (the real one with two hyphens)
> A character set which is actually used very little,
> Windows NT and derivatives use UCS16 by preference and the Internet
> uses mainly ISO-Latin-1 or UCS32, although ISO-Latin-15 is used too
> (it contains the Euro sign, which ISO-Latin-1 does not).
>
> My workstation uses ISO-Latin-15. In Windows I can enter characters
> by holding down alt and typing their IBM Extended ASCII code on
> the numeric keypad. So even saying ISO-Latin-1 is by default "the
> extended character set" doesn't hold water, although it probably is
> the widest used character set besides UCS16 and UCS32.
s/ISO-Latin/ISO Latin/g
With UCS16 you probably mean "UCS-2", or "UTF-16" (which is an extension
of UCS-2).
With UCS32 you probably mean "UCS-4" (which is also called "UTF-32").
They are different... ISO Latin 1 is a character set (an unordered
collection of characters). ISO-8859-1 is a particular encoding of that
character set as 8-bit integers. There are others; in particular some
EBCDIC codepages.
Ben
--
Joy and Woe are woven fine,
A Clothing for the Soul divine William Blake
Under every grief and pine 'Auguries of Innocence'
Runs a joy with silken twine. b...@morrow.me.uk
--
For far more marvellous is the truth than any artists of the past imagined!
Why do the poets of the present not speak of it? What men are poets who can
speak of Jupiter if he were like a man, but if he is an immense spinning
sphere of methane and ammonia must be silent? [Feynmann] b...@morrow.me.uk
>>> ISO-Latin-1
>>
>> Normally called "ISO 8859-1" or "ISO Latin 1" or just "Latin-1".
>
> They are different... ISO Latin 1 is a character set (an unordered
> collection of characters). ISO-8859-1 is a particular encoding of that
> character set as 8-bit integers. There are others; in particular some
> EBCDIC codepages.
"ISO-8859-1" wasn't mentioned in the part that you quote, so I don't see
what you mean with "They".
Quoth "Dr.Ruud" <rvtol...@isolution.nl>:
I'm confused. You said "ISO 8859-1" and "ISO Latin 1" as though they
were equivalent, which they aren't. If you're trying to make
"ISO 8859-1" (sans hyphen) equivalent to "ISO Latin 1" but "ISO-8859-1"
(with hyphen) not, then I'd call that more than a little confusing. For
a start, how would you interpret "ISO 8859-9"? As the Latin-9 character
set used by ISO-8859-15, or as the ISO-8859-9 encoding of the Latin-5
character set?
FWIW, Perl agrees with me:
~% perl -MEncode -le'print Encode::resolve_alias "ISO 8859-9"'
iso-8859-9
~% perl -MEncode -le'print Encode::resolve_alias "ISO Latin-9"'
iso-8859-15
though allowing 'Latin-N' to mean 'the usual 8859-N encoding of the
Latin-9 character set' is arguably only increasing the confusion between
the two.
Ben
--
For the last month, a large number of PSNs in the Arpa[Inter-]net have been
reporting symptoms of congestion ... These reports have been accompanied by an
increasing number of user complaints ... As of June,... the Arpanet contained
47 nodes and 63 links. [ftp://rtfm.mit.edu/pub/arpaprob.txt] * b...@morrow.me.uk
> s/ISO-Latin/ISO Latin/g
>
> With UCS16 you probably mean "UCS-2", or "UTF-16" (which is an extension
> of UCS-2).
> With UCS32 you probably mean "UCS-4" (which is also called "UTF-32").
I stand corrected, I meant UCS-2 and -4. I was indeed confused by the UTF
encodings.
M4
> Martijn Lievaart schreef:
>> Dr.Ruud:
>>> RedGrittyBrick:
>>>> szr:
>
>>>>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>>>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>>>>
>>>> s/ASCII/Unicode/
>>>
>>> Exactly. ISO-8859-* too.
>>
>> No, no, HTML uses Unicode codepoints (which in this case coincide, but
>> that's beside the (code)point).
>
> No, no, no, no, that depends on the encoding being used. Yes, numeric
> references always refer to Universal Character Set code points,
> regardless of the page's encoding, but HTML is not "limited" to that.
No, no, no, no, no :-) You already said it yourself, numeric references
always refer to Unicode codepoints. That's the only point I was trying to
make, and why you cannot substitute ISO-8859-* above.
M4
Your "No, no," was about a limit I didn't imply, so was not about what I
wrote, but about what you limited it to. ("Fallacy of Distribution")
[ snip ]
You're right, I'm a bad reader.
M4
> [ snip ]
> You're right, I'm a bad reader.
And I am sorry that I didn't write it clearer.
Have a Happy Queen's Day!
For me a good day to work on some pet projects. And clean the house.
I live near the Amsterdam Museumplein, so I expect loud music from wrong
bands all afternoon and evening.
http://www.koninginnedagamsterdam.nl/Radio-538-Museumplein_348.php
> Martijn Lievaart schreef:
>> Dr.Ruud:
>
>> [ snip ]
>> You're right, I'm a bad reader.
>
> And I am sorry that I didn't write it clearer.
>
> Have a Happy Queen's Day!
We did, in appropriate drunkenness.
>
> For me a good day to work on some pet projects. And clean the house. I
> live near the Amsterdam Museumplein, so I expect loud music from wrong
> bands all afternoon and evening.
> http://www.koninginnedagamsterdam.nl/Radio-538-Museumplein_348.php
Ach, arm! We were camping, the first time with our dog, so I'm replying a
bit late. (Next time a campsite with Internet or HSDPA, GPRS is just too
damn slow to do anything other than read email).
M4