Fwd: Fixing Cheese Shop search

David Wilson

Apr 23, 2013, 5:31:03 AM
to pypa...@googlegroups.com
Hi guys,

Just saw this group, forwarding the mail below for discussion.

I'm free this coming Sunday, and ideally the day could be spent fixing
search, but there's no point in working blindly when there is no
consensus. Basically the current search is so slow (and the solution
so simple) that this seems a no-brainer. So who needs convincing this
is a good idea before it can happen?* :-)

It seems what's in the demo repo below is almost good enough already.
It wouldn't take much work to deploy it as an XML-RPC server internally
on the same machine running PyPI, plus a 15-line patch to call out to
it from the PyPI source.
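
To make the proposed split concrete, here is a minimal sketch of a search
service exposed over XML-RPC, plus the kind of small call-out the PyPI
side would need. It is not the code in the demo repo: the names search,
start_server, remote_search and FAKE_INDEX are made up for illustration,
the dictionary scan merely stands in for the real Xapian index, and it
uses the Python 3 standard library rather than whatever PyPI runs on.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for the real full-text index: name -> summary.
FAKE_INDEX = {
    "flask": "A microframework for web applications",
    "xapian-haystack": "A Xapian backend for Haystack",
    "requests": "HTTP for humans",
}


def search(query, limit=20):
    """Return up to `limit` hits whose name or summary contains every term."""
    terms = query.lower().split()
    hits = []
    for name, summary in sorted(FAKE_INDEX.items()):
        text = (name + " " + summary).lower()
        if all(term in text for term in terms):
            hits.append({"name": name, "summary": summary})
    return hits[:limit]


def start_server(host="127.0.0.1", port=0):
    """Serve `search` over XML-RPC on loopback; port 0 lets the OS pick."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(search, "search")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def remote_search(url, query):
    """Roughly what the PyPI-side patch would do: forward the query."""
    with ServerProxy(url) as proxy:
        return proxy.search(query)


if __name__ == "__main__":
    server = start_server()
    host, port = server.server_address
    print(remote_search("http://%s:%d/" % (host, port), "xapian"))
    server.shutdown()
    server.server_close()
```

Keeping the service bound to 127.0.0.1 matches the "internally on the
same machine" idea: the web application stays the only public entry
point, and the search process can be restarted or replaced on its own.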

Again, PyPI has been growing organically for a very long time, and
dumping even more features in there doesn't seem like a great idea. I
looked at retrofitting PyPI with Flask, but there is simply too much
custom code to be sure things won't be broken by doing it in a hurry.

Thanks,


David

* Ideally before the end of summer.

---------- Forwarded message ----------

http://pypi.h1.botanicus.net/ is the same demo running behind Apache
with mod_gzip on a Core i7 920.

On 22 April 2013 02:11, David Wilson <d...@botanicus.net> wrote:
> Hi there,
>
> In a fit of madness caused by another 30-second-long PyPI search I
> decided to investigate the code, in the hope of finding something
> simple that would alleviate the extremely long search time.
>
> I discovered what appears to be a function that makes 6 SQL queries
> for each term provided by the search user. In turn, those queries
> expand to what appear to be %SUBSTRING% table scans across the
> releases table, which appears to contain upwards of half a gigabyte of
> strings.
>
> Now that the root cause has been located, what to do about it? I
> looked at hacking on the code, but it seems webui.py is already too
> massive for its own good, and in any case PostgreSQL's options for
> efficient search are quite limited. It might be all-round good if the
> size of that module started to drop.
>
> I wrote a crawler to pull a reasonable facsimile of the releases table
> on to my machine via the XML-RPC API, then arranged for Xapian to
> index only the newest releases for each package. The resulting
> full-text index weighs in at a very reasonable 334 MB, and searches
> complete almost immediately, even on my lowly Intel Atom colocated
> server.
>
> I wrote a quick hack Flask app around it, which you can see here:
>
> http://5.39.91.176:5000/
>
> The indexer takes as input a database produced by the crawler, which
> is smart enough to know how to use PyPI's exposed 'changelog' serial
> numbers. Basically it is quite trivial and efficient to run this setup
> in an incremental indexing mode.
>
> As you can see from the results, even my lowly colo is trouncing what
> is currently on PyPI, and so my thoughts tend toward making an
> arrangement like this more permanent.
>
> The crawler code weighs in at 150 lines, the indexer a meagre 113
> lines, and the Flask example app is 74 lines. Implementing an exact
> replica of PyPI's existing scoring function is already partially done
> at indexing time, and the rest is quite easy to complete (mostly
> cut-and-paste code).
>
> Updating the Flask example to provide an XML-RPC API (or similar),
> then *initially augmenting* the old search facility seems like a good
> start, with a view to removing the old feature entirely. Integrating
> indexing directly would be pointless; the PyPI code really doesn't
> need anything more added to it until it gets at least reorganized a
> little.
>
> So for the cost of 334 MB of disk, a cron job, and a lowly VPS with
> even just 700 MB RAM, PyPI's search pains might be solved permanently.
> Naturally I'm writing this mail because it bothers me enough to
> volunteer help. :)
>
> Prototype code is here: https://bitbucket.org/dmw/pypi-search (note:
> relies on a pre-alpha quality DB library I'm hacking on)
>
> Thoughts?
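
For anyone wondering what the incremental mode mentioned in the forwarded
message might look like, here is a rough sketch of one sync step driven
by changelog serial numbers. It is not the crawler from the prototype
repo. The method name changelog_since_serial and the row shape
(name, version, timestamp, action, serial) are assumptions about PyPI's
XML-RPC API as I recall it, and FakeChangelog is a made-up in-memory
stand-in so the example runs without touching the network.

```python
def packages_changed_since(client, last_serial):
    """Return (names, new_serial) for packages touched after `last_serial`.

    `client` is any object with a changelog_since_serial(serial) method
    yielding (name, version, timestamp, action, serial) rows. Both the
    method name and the row shape are assumptions, not confirmed API.
    """
    changed = set()
    newest = last_serial
    for name, _version, _timestamp, _action, serial in \
            client.changelog_since_serial(last_serial):
        changed.add(name)
        newest = max(newest, serial)
    return sorted(changed), newest


class FakeChangelog:
    """Hypothetical offline stand-in for the remote changelog API."""

    def __init__(self, rows):
        self.rows = rows

    def changelog_since_serial(self, serial):
        # Only rows strictly newer than the caller's last seen serial.
        return [row for row in self.rows if row[4] > serial]


if __name__ == "__main__":
    fake = FakeChangelog([
        ("flask", "0.9", 1366000000, "new release", 101),
        ("requests", "1.2.0", 1366000500, "new release", 102),
        ("flask", "0.9", 1366001000, "update description", 103),
    ])
    # The crawler last saw serial 101, so only later rows are returned.
    names, serial = packages_changed_since(fake, 101)
    print(names, serial)
```

A cron job would persist the returned serial, re-fetch and re-index only
the named packages, and pass the stored serial back in on the next run.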

Richard Jones

Apr 23, 2013, 5:48:41 AM
to David Wilson, pypa-dev
Hi David,

I saw your email, but I have no time for PyPI beyond handling users yelling at me to help them reset their passwords (that's not to say there's a huge number; it's more that I just don't have time for anything else right now).


    Richard