How to force html.parse to parse a string, not a URI?

Luis Miguel Morillas

Jun 12, 2012, 6:49:23 AM
to akar...@googlegroups.com
I need to use amara.bindery.html.parse on HTML pages, but sometimes
(depending on the encoding) it raises a ValueError:

  File "/home/lm/workspace/scrapinmo/src/dataextractor.py", line 94, in __init__
    self.page = html.parse(content)
  File "/home/lm/entornos/scraping/local/lib/python2.7/site-packages/amara/bindery/html.py", line 250, in parse
    doc = parser.parse(inputsource(source, None).stream, encoding=encoding)
  File "/home/lm/entornos/scraping/local/lib/python2.7/site-packages/amara/lib/_inputsource.py", line 84, in __new__
    raise ValueError("Does not appear to be well-formed XML")
ValueError: Does not appear to be well-formed XML


The _inputsource.py source has a FIXME comment at that line:

    else:
        #FIXME L10N
        raise ValueError("Does not appear to be well-formed XML")

How can I pass the sourcetype parameter when calling parse?

Regards,

-- luismiguel  (@lmorillas)

Luis Miguel Morillas

Jun 12, 2012, 6:56:29 AM
to akar...@googlegroups.com
2012/6/12 Luis Miguel Morillas <mori...@gmail.com>:
I think I've found it:

source = inputsource(arg=content, sourcetype=1)
page = html.parse(source)
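
For anyone who lands here with the same error, the two lines above can be wrapped in a small helper. This is only a sketch under a few assumptions: the import paths (amara.lib for inputsource, amara.bindery for html) are inferred from the Amara 2 layout visible in the traceback, the helper name parse_html_string is made up for illustration, and sourcetype=1 is taken as-is from this thread as the value that tells inputsource to treat the argument as literal content rather than a URI or file name.

```python
def parse_html_string(content, sourcetype=1):
    """Parse HTML markup held in a string with amara.bindery.html.parse.

    Hypothetical helper, not part of Amara. sourcetype=1 is the value
    reported in this thread for "the argument is the content itself".
    """
    # Deferred imports so this module can still be loaded where Amara is
    # not installed. The import paths are assumptions (Amara 2 layout).
    from amara.lib import inputsource
    from amara.bindery import html

    # Wrap the string explicitly instead of letting inputsource guess
    # whether it is markup, a file name, or a URI.
    source = inputsource(arg=content, sourcetype=sourcetype)
    return html.parse(source)
```

With this in place, the failing call would become page = parse_html_string(content) instead of html.parse(content).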


Regards,

-- luismiguel  (@lmorillas)

Uche Ogbuji

Jun 12, 2012, 11:05:55 AM
to akar...@googlegroups.com
On Tue, Jun 12, 2012 at 4:56 AM, Luis Miguel Morillas <mori...@gmail.com> wrote:
2012/6/12 Luis Miguel Morillas <mori...@gmail.com>:
> I must use amara.bindery.html.parse with html pages, but sometimes
> (depending on the encoding) it raises a ValueError

...
 
> I think I've found it:
>
> source = inputsource(arg=content, sourcetype=1)
> page = html.parse(source)

Yep, that's it. It's bitten me as well. I did add the above pattern to the html.parse docstring but I guess it should be added to the wiki docs as well. Also, I've long wanted to simplify inputsource and all that.


--
Uche Ogbuji                       http://uche.ogbuji.net
Weblog: http://copia.ogbuji.net
Poetry ed @TNB: http://www.thenervousbreakdown.com/author/uogbuji/
Founding Partner, Zepheira        http://zepheira.com
Linked-in: http://www.linkedin.com/in/ucheogbuji
Articles: http://uche.ogbuji.net/tech/publications/
Friendfeed: http://friendfeed.com/uche
Twitter: http://twitter.com/uogbuji
http://www.google.com/profiles/uche.ogbuji

Luis Miguel Morillas

Jun 12, 2012, 2:42:16 PM
to akar...@googlegroups.com
2012/6/12 Uche Ogbuji <uc...@ogbuji.net>:
>
>
> On Tue, Jun 12, 2012 at 4:56 AM, Luis Miguel Morillas <mori...@gmail.com>
> wrote:
>>
>> 2012/6/12 Luis Miguel Morillas <mori...@gmail.com>:
>> > I must use amara.bindery.html.parse with html pages, but sometimes
>> > (depending on the encoding) it raises a ValueError
>
>
> ...
>
>>
>> I think I've found it:
>>
>> source = inputsource(arg=content, sourcetype=1)
>> page = html.parse(source)
>
>
> Yep, that's it. It's bitten me as well. I did add the above pattern to the
> html.parse docstring but I guess it should be added to the wiki docs as
> well. Also, I've long wanted to simplify inputsource and all that.
>
>
OK, I'll add it to the wiki docs.

Could you create a ticket about simplifying inputsource?


Regards,


-- luismiguel  (@lmorillas)

Uche Ogbuji

Jun 12, 2012, 2:55:34 PM
to akar...@googlegroups.com
On Tue, Jun 12, 2012 at 12:42 PM, Luis Miguel Morillas <mori...@gmail.com> wrote:

> Could you create a ticket about simplifying inputsource?

There has been one for a long time:


I suppose all those old Trac tix should be migrated to GitHub.


Uche Ogbuji

Jun 12, 2012, 2:59:13 PM
to akar...@googlegroups.com
On Tue, Jun 12, 2012 at 12:55 PM, Uche Ogbuji <uc...@ogbuji.net> wrote:
> On Tue, Jun 12, 2012 at 12:42 PM, Luis Miguel Morillas <mori...@gmail.com> wrote:
>
>> Could you create a ticket about simplifying inputsource?
>
> There has been one for a long time:
>
>
> I suppose all those old Trac tix should be migrated to GitHub.
