On Wednesday, March 20, 2013 at 6:41 PM, Mattia Borini wrote:
Hi,

I'm using QueryPath to parse a series of very large HTML files (the Holy Bible) translated into a particular language of my country.

Here is the situation:
- the files are in HTML format but contain only the body element
- I'm loading the files in PHP with fopen and fread (the workflow has led me to this so far)
- all the files I'm working with are saved in UTF-8 encoding
- I use the htmlqp function with no options (I tried different combinations, but none of them solved my problem)

The characters I'm having problems with are “, ”, ’ and –. All of these are replaced with a ? (question mark) in the output of the html() method.

My temporary solution ("temporary" because I'd like to know a more orthodox method) was to replace these characters with the corresponding entities before calling htmlqp:

    $entities = array("“", "”", "’", "–");
    foreach ($entities as $entity) {
        $content = str_replace($entity, htmlentities($entity), $content);
    }
    $content = htmlqp($content)-> ...some QueryPath code here... ->html();

Does anybody have a suggestion?

I'll attach a very small piece of text which contains some of the problematic characters.

Tnx,
Mat.

PS: QP looks very cool!
--
You received this message because you are subscribed to the Google Groups "support-querypath" group.
Attachments:
- small.htm
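
The workaround in the message lists the four problem characters by hand. Below is a minimal sketch of a more general pre-processing step that turns every non-ASCII character into a numeric entity before the markup is parsed. It assumes the mbstring extension is loaded, that the input really is UTF-8, and that this script file is itself saved as UTF-8. The helper name encode_non_ascii is hypothetical and is not part of QueryPath. The sketch only generalizes the workaround; it doesn't address why htmlqp loses the characters.

```php
<?php
// Hypothetical helper (not a QueryPath function): replace every
// non-ASCII character in a UTF-8 string with a decimal numeric entity.
function encode_non_ascii($utf8)
{
    // Map format: first code point, last code point, offset, mask.
    // This range covers everything above plain ASCII.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// Sample input using the characters named in the message.
$content = "<p>“Quoted” text, it’s here – done.</p>";
echo encode_non_ascii($content), "\n";
// prints: <p>&#8220;Quoted&#8221; text, it&#8217;s here &#8211; done.</p>
```

ASCII markup such as tags and attributes passes through unchanged, so the result can be handed to htmlqp in place of the str_replace loop.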