Re: Problem with some characters (and encoding)


TechnoSophos

Mar 21, 2013, 12:18:50 PM

to support-...@googlegroups.com

Sometimes tweaking the convert_to_encoding and convert_from_encoding options that are passed into qp()/htmlqp() will help. You can also see how I handle regular expressions in QueryPath\Entities::replaceAllEntities().

If the HTML is well-formed enough to use XML, you might also try plain qp() instead of htmlqp().

At the root of the issue is the fact that libxml assumes that HTML is in ISO-8859-1 instead of UTF-8 (but it assumes XML is in UTF-8). So I have had to develop various ways of trying to force conversions.
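One way to sidestep libxml's charset guess is to make the markup pure ASCII before parsing, by turning every non-ASCII character into a numeric character reference. The following is only a minimal sketch of that idea, not QueryPath's own conversion code; the function names `utf8_codepoint()` and `encode_non_ascii()` are made up for illustration, and the input is assumed to be valid UTF-8. If the mbstring extension is available, `mb_encode_numericentity()` does the same job.

```php
<?php
// Hypothetical helper: decode one UTF-8 encoded character to its code point.
function utf8_codepoint($char)
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1) {
        return $bytes[0]; // plain ASCII
    }
    // Strip the length-prefix bits from the lead byte.
    $codepoint = $bytes[0] & (0xFF >> ($count + 1));
    // Each continuation byte contributes its low six bits.
    for ($i = 1; $i < $count; $i++) {
        $codepoint = ($codepoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codepoint;
}

// Hypothetical helper: replace every non-ASCII character with &#NNNN;
// so the result is the same in ISO-8859-1 and in UTF-8.
function encode_non_ascii($html)
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u', // the u flag makes PCRE match whole UTF-8 characters
        function ($match) {
            return '&#' . utf8_codepoint($match[0]) . ';';
        },
        $html
    );
}

$sample = '<p>“Amen” – it’s done</p>';
$safe = encode_non_ascii($sample);
echo $safe, "\n";
```

The resulting string can then be handed to htmlqp() as usual. Since it contains only ASCII, the parser's assumed encoding no longer affects these characters.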

If all else fails, the PHP function strtr() accepts an array of replacement pairs and swaps all of the characters in a single call, so you do not need the foreach loop.
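A minimal sketch of the strtr() approach, assuming the source file is saved as UTF-8 and that the four characters from the original question are the only ones that need replacing. The variable names and the sample string are illustrative only.

```php
<?php
// Map each problematic character to its named HTML entity.
$map = array(
    '“' => '&ldquo;',
    '”' => '&rdquo;',
    '’' => '&rsquo;',
    '–' => '&ndash;',
);

$content = 'He said “yes” – it’s fine.';

// With an array argument, strtr() applies every replacement in one pass.
$encoded = strtr($content, $map);
echo $encoded, "\n";
```

The encoded string would then go to htmlqp() in place of the raw content.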

Thanks,

Matt

TechnoSophos

Blog: http://technosophos.com

Twitter: @technosophos


On Wednesday, March 20, 2013 at 6:41 PM, Mattia Borini wrote:

Hi,
I'm using QueryPath to parse a series of very large HTML files (the Holy Bible) translated into a particular language of my country.
Here is the situation:
- the files are in HTML format but contain only the body element
- I'm loading the files in PHP with fopen and fread (the workflow led me to this, up to now)
- all files I'm working with are saved in UTF-8 encoding
- I use the htmlqp function with no options (I tried different combinations, but none of them solved my problem)

The characters I'm having problems with are “, ”, ’ and –. All of these are replaced with a ? (question mark) in the output of the html() method.

My temporary solution ("temporary" because I'd like to know a more orthodox method) was to replace these characters with corresponding entities before using htmlqp:

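// Encode the problematic characters as named entities before parsing.
// The charset is passed explicitly because htmlentities() defaulted to
// ISO-8859-1 before PHP 5.4.
$entities = array("“", "”", "’", "–");
foreach ($entities as $entity) {
    $content = str_replace($entity, htmlentities($entity, ENT_QUOTES, 'UTF-8'), $content);
}
$content = htmlqp($content)-> ...some QueryPath code here... ->html();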

Does anybody have a suggestion?
I'll attach a very small piece of text which contains some of the problematic characters.
Thanks,
Mat.

PS: QP looks very cool!


Attachments:

- small.htm
