Problem with utf-8

erik

Feb 1, 2011, 2:05:47 PM
to support-querypath
I'm having a problem loading a UTF-8 string into QueryPath. The
string is remotely fetched HTML (via cURL). If I echo it out
directly there are no problems: quotes, hyphens, etc. all display
correctly. But take it one more step and I get mangled characters. For
example:


// prints correctly
echo $html;

// Quotes, etc all are ... mangled?
$QP = htmlqp($html);
echo $QP->html(); // or $QP->writeHTML()


Any suggestions?

TechnoSophos

Feb 1, 2011, 2:37:56 PM
to support-...@googlegroups.com
By default, htmlqp() tries to convert all documents to ISO-8859-1
before parsing them (plain old qp() doesn't do this). It does this
because many, many HTML documents mis-report the character set that
they use. So the best default strategy has been to normalize the
document.
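
To illustrate why normalizing to ISO-8859-1 can mangle text, here is a minimal sketch that uses only mb_convert_encoding(), not QueryPath itself. It assumes the mbstring extension is loaded, and the sample string and variable names are made up for illustration. Curly quotes have no ISO-8859-1 equivalent, so they cannot survive the round trip, while an accented letter such as "é" can.

```php
<?php
// Sample UTF-8 text: curly quotes are outside ISO-8859-1, "é" is inside it.
$original = "\u{201C}caf\u{E9}\u{201D}";

// Convert to ISO-8859-1 and back, roughly what a normalize-then-restore step does.
$latin1    = mb_convert_encoding($original, 'ISO-8859-1', 'UTF-8');
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Compare the result with the input to see what was lost.
$quotesSurvived = ($roundTrip === $original);
$accentSurvived = (strpos($roundTrip, "caf\u{E9}") !== false);

echo $quotesSurvived ? "unchanged\n" : "changed by the round trip\n";
echo $accentSurvived ? "accent kept\n" : "accent lost\n";
```

This only demonstrates the conversion step in isolation; what QueryPath actually does with a given document depends on the options passed to it.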

That said, you can actually alter the defaults, all of which are
documented here:
http://api.querypath.org/docs/group__querypath__core.html#gaef0367b722980142efd304a5ed41fb15

So, for example, you might want to try something like this:

$options = array(
'convert_from_encoding' => 'utf-8',
);

$QP = htmlqp($html, NULL, $options);

The source code for the htmlqp() function might be useful to look at,
too. It is very short. It basically creates an $options array and then
calls qp(). So you can see exactly which options are passed to it.

Sometimes tinkering with the data using mb_convert_encoding() and
family will help you find out if there is an encoding error in the
document. However, since those are the functions QueryPath uses
already, they may not help in this case.
http://br.php.net/manual/en/function.mb-convert-encoding.php
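
As a rough sketch of that kind of tinkering, the snippet below checks whether a byte string is well-formed UTF-8 and, if not, asks mbstring for a guess. The helper name describeEncoding() and the candidate list are assumptions for illustration, not part of QueryPath, and the guess is only a heuristic.

```php
<?php
/**
 * Hypothetical helper: report whether $bytes is well-formed UTF-8 and,
 * if not, which of a few candidate encodings mbstring guesses instead.
 */
function describeEncoding(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'valid UTF-8';
    }
    // Strict detection over a short candidate list; may return false.
    $guess = mb_detect_encoding($bytes, ['ISO-8859-1', 'Windows-1252'], true);
    return 'not valid UTF-8; guessed ' . ($guess === false ? 'unknown' : $guess);
}

echo describeEncoding("caf\xC3\xA9"), "\n"; // well-formed UTF-8 bytes for "café"
echo describeEncoding("caf\xE9"), "\n";     // a lone 0xE9 byte is not valid UTF-8
```

Running something like this on the fetched $html before handing it to htmlqp() shows whether the input really is UTF-8.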

Matt

> --
> You received this message because you are subscribed to the Google Groups "support-querypath" group.
> To post to this group, send email to support-...@googlegroups.com.
> To unsubscribe from this group, send email to support-queryp...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/support-querypath?hl=en.
>
>

--
http://technosophos.com
http://querypath.org

TechnoSophos

Feb 1, 2011, 2:43:03 PM
to support-...@googlegroups.com
Oh, and maybe it would be worth checking to see if the multibyte
library is enabled on your PHP. If QueryPath can't find
mb_convert_encoding(), it just assumes the character encoding is all
okay.
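
A quick way to check, sketched with standard PHP functions only (the helper name hasMultibyteSupport() is made up for this example):

```php
<?php
// Hypothetical helper: true when the mbstring extension and the
// conversion function are both available.
function hasMultibyteSupport(): bool
{
    return extension_loaded('mbstring') && function_exists('mb_convert_encoding');
}

if (hasMultibyteSupport()) {
    echo "mbstring is available\n";
} else {
    echo "mbstring is missing; encoding conversion will be skipped\n";
}
```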

If you do figure out what the problem is, please let me know. I'd
really like to get QueryPath to the point where nobody ever has to
worry about trivialities like what character encoding was used.

Matt

erik

Feb 1, 2011, 3:21:37 PM
to support-querypath
$options = array(
'convert_from_encoding' => 'utf-8',
);

This did indeed fix the problem. Something you may want to consider:
there are many NEW documents that aren't well-formed HTML but are
UTF-8 encoded. In this particular example it was a college sports team
website which is current, and it includes
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
in the head. Almost everything done recently is UTF-8, but it often has
other cruft which causes errors if you just use qp().

TechnoSophos

Feb 1, 2011, 6:50:40 PM
to support-...@googlegroups.com
Excellent! In the past, setting convert_from_encoding to auto has
handled UTF-8 pretty well, but I suppose it could falsely identify a
character set under certain circumstances.

Matt

Jānis Elmeris

Oct 21, 2013, 4:52:53 PM
to support-...@googlegroups.com
Setting options for qp() didn't help for me. It seems that the encoding gets broken by DOMDocument, which QueryPath uses. The problem is probably that I'm parsing an HTML fragment, not a whole HTML document, so there is no (HTML or XML) charset meta tag or BOM, and DOMDocument apparently decides the encoding is ISO-8859-something, not UTF-8.
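
One workaround sometimes used for fragments, which is not from this thread and is offered only as a sketch, is to turn every non-ASCII character into a numeric entity before DOMDocument sees the markup, so that its encoding guess no longer matters. The helper name loadUtf8Fragment() and the wrapping <div> are assumptions for illustration; the code needs the dom and mbstring extensions.

```php
<?php
/**
 * Hypothetical helper: parse a UTF-8 HTML fragment with DOMDocument
 * without relying on a charset declaration.
 */
function loadUtf8Fragment(string $fragment): DOMDocument
{
    // Replace each character above U+007F with a numeric entity (&#NNNN;),
    // leaving a pure-ASCII string that parses the same under any charset guess.
    $ascii = mb_encode_numericentity($fragment, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings for sloppy markup
    $dom->loadHTML('<div>' . $ascii . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom  = loadUtf8Fragment("<p>J\u{101}nis \u{201C}quoted\u{201D}</p>");
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

The resulting DOMDocument could then be passed to qp() in the same way a DOM is passed in elsewhere in this thread, though that combination has not been verified here.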

TechnoSophos

Oct 22, 2013, 11:52:16 AM
to support-...@googlegroups.com
Since QueryPath 3, the document is converted into ISO-8859-1 for HTML DOMs (previously the encoding was undefined, which caused many problems).

The reason for this is that libxml's HTML library is oriented to ISO-8859-1. So we convert from UTF-8 to ISO-8859-1 when dealing with HTML documents, and then try to convert back to UTF-8 when we're done.

Now that HTML5-PHP is done, I am trying to get that integrated into QueryPath so that we can do real UTF-8 in HTML. You can actually use HTML5-PHP today, and just pass the DOM into QueryPath.

Source: https://github.com/Masterminds/html5-php
Docs: http://masterminds.github.io/html5-php/

You should be able to do something like this:

$dom = \HTML5::loadHTML($html);
$qp = qp($dom);
// Do whatever with QueryPath

print \HTML5::saveHTML($dom);

