Re: Improve performance

Leonard Richardson

Jun 21, 2012, 6:54:35 PM
to beauti...@googlegroups.com
I tried this out and it's a really impressive improvement. (330x in some cases.)

I'm not going to change Beautiful Soup yet because the installation
process for cchardet is so complicated, but once it's better packaged,
I'll make Beautiful Soup use cchardet if it's installed. In the
meantime, I can recommend cchardet for anyone who needs the Unicode
conversion process to run faster.
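
A minimal sketch of the "use cchardet if it's installed" idea, for readers who want the same behaviour in their own code. This is not Beautiful Soup's actual implementation; the names `detect_encoding` and `DETECTOR_NAME` are hypothetical, and the sketch assumes the detector returns a chardet-style dictionary with an "encoding" key.

```python
# Prefer the C-backed detector when it is installed, fall back to the
# pure-Python chardet, and otherwise run without automatic detection.
try:
    import cchardet as _detector
    DETECTOR_NAME = "cchardet"
except ImportError:
    try:
        import chardet as _detector
        DETECTOR_NAME = "chardet"
    except ImportError:
        _detector = None
        DETECTOR_NAME = None


def detect_encoding(data):
    """Return a guessed encoding name for a byte string, or None.

    Hypothetical helper: assumes detect() returns a dictionary with an
    "encoding" key, as chardet.detect() does.
    """
    if _detector is None:
        return None
    result = _detector.detect(data)
    return result.get("encoding")


if __name__ == "__main__":
    print(DETECTOR_NAME, detect_encoding(b"plain ascii text"))
```

Because the choice is made once at import time, callers do not need to know which library is doing the work.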

Leonard

On Wed, Jun 20, 2012 at 7:17 PM, PyYoshi.Jpn <renap...@gmail.com> wrote:
> Hi, everyone.
>
> I am developing cChardet.
> cChardet is faster than chardet because it is a Python binding to a C library.
> If you want to improve the performance of bs4, try this patch.
>
> Thanks,
>
> PyYoshi
>
> P.S.: Sorry for my poor English.

PyYoshi.Jpn

Jun 26, 2012, 4:24:50 AM
to beauti...@googlegroups.com
Hi, Leonard.

Thank you for trying this library. I'm sorry that the title sounded like a command. I meant to write "I improved the performance of bs4."

I have good news! I have simplified the build process:
$ cd cChardet
$ python setup.py build
$ python setup.py install
or
$ pip install -U cchardet
I am testing it on Windows 7 HP x64 with Python 2.7.3 and on Ubuntu 12.04 x64 with Python 2.7.3.
Feedback is welcome.

Leonard Richardson

Jul 3, 2012, 5:50:32 PM
to beauti...@googlegroups.com
PyYoshi,

I have integrated cchardet support into Beautiful Soup and it will be
in the next release. You can see my code in revision 246:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/revision/246

Speaking as a user, I have some feedback about the installation
instructions. I've never installed a Cython extension before, and it
took me a while to discover that the installation process requires two
dependencies other than Cython.

I'm running Ubuntu, and in addition to the 'cython' package I had to
install the "python-dev" package and the "g++" package before 'pip
install -U cchardet' would run to completion. It would be helpful to
have this information alongside the Cython dependency.

My only feedback about the library itself is that it's unfortunate
that it doesn't have the same API as chardet. In particular,
chardet.detect() returns a dictionary and cchardet.detect() returns a
string.

I understand that the API is different because cchardet is based on
the libcharsetdetect library, and not a C implementation of the
chardet algorithm. But since the library is called 'cchardet' I
thought it would look more like chardet.
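
To illustrate the difference, here is a small sketch of an adapter that accepts either result shape and always hands back a chardet-style dictionary. The function name `normalize_detection` is hypothetical, and the only assumption about the string form is that it is a bare encoding name, as described above.

```python
def normalize_detection(result):
    """Coerce a detector result into a chardet-style dictionary.

    Accepts either a dictionary (the shape chardet.detect() returns) or a
    bare encoding name (the shape described above for cchardet.detect()).
    """
    if isinstance(result, dict):
        return {
            "encoding": result.get("encoding"),
            "confidence": result.get("confidence"),
        }
    # A bare string (or None) carries no confidence information.
    return {"encoding": result, "confidence": None}


if __name__ == "__main__":
    print(normalize_detection("UTF-8"))
    print(normalize_detection({"encoding": "ascii", "confidence": 1.0}))
```

Calling code can then read `result["encoding"]` regardless of which library produced the value.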

Thanks again for creating this outstanding library.

Leonard

PyYoshi.Jpn

Jul 7, 2012, 2:13:15 AM
to beauti...@googlegroups.com
Leonard,

Thank you for your feedback.

I have fixed the problem that cchardet's API differed from chardet's.
I updated cchardet.detect(); it now returns a dictionary. (https://github.com/PyYoshi/cChardet/blob/master/src/cchardet/__init__.py)
Please reinstall it:
$ pip install -U cchardet

Thanks,

PyYoshi

On Wednesday, July 4, 2012 at 6:50:32 AM UTC+9, Leonard Richardson wrote: