I was chatting with oalders on IRC the other day and I decided to join the group and help with the coding. I got very enthusiastic and, instead of studying for my exams, I rewrote large parts of the code.
I had a good look at the wiki articles and tried to implement as much as possible.
My fork lives at https://github.com/monken/cpan-api
(version described here is 29492103226d1c6dc3a1b43eca30bde9417ca4b9)
Here are the changes:
== ElasticSearch::Document ==
This is a Moose class I wrote, which helps to define elasticsearch
documents.
This is what a distribution looks like:
package MetaCPAN::Document::Distribution;
use Moose;
use ElasticSearch::Document;

# id => 1 marks this attribute as the document ID
has name => ( id => 1 );
has ratings => ( isa => 'Int', default => 0 );
has rating => ( required => 0, isa => 'Num' );
has [qw(pass fail na unknown)] => ( isa => 'Int', default => 0 );

__PACKAGE__->meta->make_immutable;
All attributes are required and ro by default. The name attribute is used as
the document ID (that's what id => 1 does), and so on.
MetaCPAN::Script::Mapping will then fetch all document classes and generate
the proper mapping for each. Based on its type constraint, every attribute
gets the correct elasticsearch type; indexing behaviour and other things
like boost can be controlled through additional attribute options.
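To illustrate, here is a minimal sketch of how such a mapping could be
derived from a document class (hypothetical code; the type table and the
mapping_for helper are my own inventions, the actual
MetaCPAN::Script::Mapping logic may differ):

  # translate Moose type constraints into elasticsearch core types
  my %es_type = ( Str => 'string', Int => 'integer', Num => 'float', Bool => 'boolean' );

  sub mapping_for {
      my $class = shift;    # a loaded MetaCPAN::Document:: class
      my %properties;
      for my $attr ( $class->meta->get_all_attributes ) {
          my $tc = $attr->has_type_constraint ? $attr->type_constraint->name : 'Str';
          $properties{ $attr->name } = { type => $es_type{$tc} || 'string' };
      }
      return { properties => \%properties };
  }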
You can index a document by calling
MetaCPAN::Document::Distribution->new( ... )->index($es);
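Under the hood, index($es) presumably boils down to something like this
(a sketch assuming the ElasticSearch client module from CPAN; the attribute
serialization is simplified):

  sub index {
      my ( $self, $es ) = @_;
      $es->index(
          index => 'cpan',
          type  => 'distribution',
          id    => $self->name,    # the attribute flagged with id => 1
          data  => { map { $_ => $self->$_ } qw(name rating ratings pass fail na unknown) },
      );
  }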
That's the first major change. Right now there are Author, Dependency,
Distribution, File, Module and Release classes. This means we can store
all versions of a distribution and access them as well.
I'll probably release ElasticSearch::Document to CPAN as a separate
distribution since it looks useful to other projects, too.
== bin/metacpan ==
Instead of having numerous different files which look and behave
differently, I tried to streamline the scripts in /elasticsearch.
bin/metacpan will try to load the first argument as a class from the
MetaCPAN::Script:: namespace.
# bin/metacpan author
This will instantiate MetaCPAN::Script::Author and call run() on it.
This is basically the former elasticsearch/index_author.pl script.
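The dispatch logic itself is tiny; something along these lines (a sketch in
plain core Perl, not necessarily the exact code in bin/metacpan):

  #!/usr/bin/env perl
  use strict;
  use warnings;

  my $command = shift @ARGV or die "usage: metacpan <command> [args]\n";
  my $class = 'MetaCPAN::Script::' . ucfirst $command;
  ( my $file = "$class.pm" ) =~ s{::}{/}g;
  require $file;       # loads e.g. MetaCPAN/Script/Author.pm
  $class->new->run;    # instantiate and call run()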
Some more commands:
Import tarball or tarballs included in folder
# bin/metacpan release /PATH/TO/CPAN/FOLDER/OR/TARBALL
Recreate (delete and create) the cpan index (it's a node actually?!)
# bin/metacpan index --recreate
Put mappings to server
# bin/metacpan mapping
== MetaCPAN::Script::Release ==
This module does the heavy work of indexing a whole release. It contains
some code from MetaCPAN::Dist and some code from a metacpan-like project
(https://github.com/monken/p5-pad). The results seem to be consistent
with MetaCPAN::Dist, but there are a few dists that cannot be parsed
and throw some kind of error.
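In essence, every tarball gets extracted to a temporary directory and its
modules are fed to Module::Metadata, roughly like this (a heavily simplified
sketch; the real code deals with many more edge cases):

  use Archive::Tar ();
  use File::Temp ();
  use Module::Metadata ();

  sub index_release {
      my ( $es, $tarball ) = @_;
      my $dir = File::Temp->newdir;
      my $tar = Archive::Tar->new($tarball);
      $tar->setcwd("$dir");
      $tar->extract;
      for my $file ( grep {/\.pm$/} $tar->list_files ) {
          my $info = Module::Metadata->new_from_file("$dir/$file");
          # build MetaCPAN::Document::Module, ::File etc. from
          # $info->name, $info->version and call ->index($es)
      }
  }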
== bin/metacpan server ==
This is worth an extra chapter. I integrated source/app.psgi and
a proxy for elasticsearch into one Twiggy server. The command
above will fire up the server, which is then accessible via
http://localhost:5000/.
The following endpoints exist:
/module/Path::Class
/author/KWILLIAMS
/distribution/Path-Class
/file/KWILLIAMS/Path-Class-0.23/Changes
/pod/Path::Class::File
/source/KWILLIAMS/Path-Class-0.23/Changes
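A plain GET on any of these returns the matching document, e.g.

  # curl http://localhost:5000/module/Path::Class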
Please have a look for yourself. The metadata has been expanded as well;
e.g. the file index now contains file size, sloc, mode etc.
The logic for those endpoints is located in the MetaCPAN::Plack::
namespace.
Whenever there is a /_search request on one of the endpoints,
the request is passed directly to elasticsearch. In any other case
the endpoint tries to DWIM.
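You can picture the dispatch like this (a hypothetical sketch using
Plack::App::Proxy; the real MetaCPAN::Plack:: classes are more involved):

  use Plack::App::Proxy ();

  # pass _search requests straight through to the elasticsearch index
  my $es = Plack::App::Proxy->new(
      remote => 'http://localhost:9200/cpan' )->to_app;

  my $app = sub {
      my $env = shift;
      return $es->($env) if $env->{PATH_INFO} =~ m{/_search$};
      # ... otherwise resolve the document by its natural key (DWIM)
      return [ 404, [ 'Content-Type' => 'text/plain' ], ['not found'] ];
  };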
Due to a (confirmed) bug in Twiggy, GET requests with a body cause
elasticsearch to stall. Just use POST instead if you are passing a
body.
# curl -XPOST http://localhost:5000/module/_search \
-d '{"query":{"wildcard":{"name":"Path*"}}}'
The implementation of the endpoints is completely non-blocking.
== Dist::Zilla ==
I also added a dist.ini file to the distribution. This will make
installing the deps much easier. Run this to get all prereqs installed:
# dzil listdeps | cpanm
(requires Dist::Zilla::PluginBundle::JQUELIN)
Instead of start_fresh.sh, run:
# bin/metacpan index --create # or --recreate if cpan already exists
# bin/metacpan mapping
# bin/metacpan author
# bin/metacpan release t/var/cpan/authors/id/K/KW/KWILLIAMS/Path-Class-0.23.tar.gz
# bin/metacpan server
Twiggy: Accepting connections at http://0.0.0.0:5000/ # YAY!
This will give you some data to play with.
An instance of this setup is running at http://api.netcubed.de:5000/,
so please have a go at that server and test it.
Now, I know my rewrite introduces a lot of changes to the API and some
fields have different names and so on. But I think we can make my code even
more consistent with the current code if there is a need for it
(search.metacpan.org?).
Please consider my rewrite as a replacement for the current code.
Cheers,
Moritz
(moonk on irc.freenode.net, mo on irc.perl.org)
> Hi there,
>
> I was chatting with oalders on IRC the other day and I decided to join the group and help with the coding. I got very enthusiastic and, instead of studying for my exams, I rewrote large parts of the code.
> I had a good look at the wiki articles and tried to implement as much as possible.
Hi Moritz,
I haven't had a chance to have a close look at your code as I'm wrapped up in baby-related things right now. Having said that, I have made you an owner of the CPAN-API organization, so you'll be able to merge your own changes into the master branch when you see fit. Also, what I have seen looks quite good.
That said, before we merge anything I think we need to make sure that none of your changes breaks search.metacpan.org. You should be able to test this relatively easily on your end by cloning the search site and running it in your browser:
https://github.com/CPAN-API/search-metacpan-org
Check out the nginx config files in /conf to see what sort of proxying you'd need to set up on your end in order to point the search site at the correct API etc. If that all looks good, we just need Mark (ioncache) to sign off on it, as he may be aware of some issues with the search site that the rest of us are not.
Also, I think it would be worth checking whether there is any difference in query speed between your more expansive index and our /cpan index, which currently contains only the latest module/dist etc.
Please let me know what you think. Also, what are the differences in disk space required etc between our index and the one you've created? Have you been able to index all of CPAN? How long does it take? I assume there is going to be some lack of coverage as it's very hard to parse all of the packages, but if you could give us some idea, that would be great.
Best,
Olaf
Hi Olaf,
I was talking to Clinton on the elasticsearch IRC channel and asked for some advice regarding our layout. He said that we will be fine sticking everything in one index. CPAN currently has around 60,000 releases of around 20,000 distributions. He said this is considered small; elasticsearch will handle it just fine and we shouldn't worry so much about performance.
I haven't indexed CPAN yet. I'm still trying to fix all those nasty releases with crappy version numbers (Math-Expr-LATEST.tar.gz, cmmtalk-ye2000.tar.gz or NIS-a2.tar.gz). However, running the indexer for about 2 hours generated 2.1 GB worth of data. This includes 1,700 distributions, 44,301 modules, 218,936 files and 22,617 dependencies. Scaling that up to all 20,000 distributions (roughly a factor of 12), I estimate a full run to take about one day and to generate around 25 GB of data. One possibility to reduce this is to store the HTMLified POD in a file on disk. Right now both the plain-text POD and the HTML version are stored in elasticsearch, but one is sufficient for full-text searching. The HTML version can be generated on demand and cached, like we already do for the source code.
I'd like to stress that we need to run this only once, since new uploads to CPAN will be handled in near-realtime by a different process. Apart from a few packages with version numbers that break Module::Metadata, there is complete coverage.
I will be off until Saturday, so don't expect any updates before next week.
I wish you all the best for the baby and hope you still get some sleep :-)
Cheers,
Moritz
> I was talking to Clinton on the elasticsearch IRC channel and asked for some advice regarding our layout. He said that we will be fine sticking everything in one index. CPAN currently has around 60,000 releases of around 20,000 distributions. He said this is considered small; elasticsearch will handle it just fine and we shouldn't worry so much about performance.
Excellent.
>
> I haven't indexed CPAN yet. I'm still trying to fix all those nasty releases with crappy version numbers (Math-Expr-LATEST.tar.gz, cmmtalk-ye2000.tar.gz or NIS-a2.tar.gz). However, running the indexer for about 2 hours generated 2.1 GB worth of data. This includes 1,700 distributions, 44,301 modules, 218,936 files and 22,617 dependencies. Scaling that up to all 20,000 distributions (roughly a factor of 12), I estimate a full run to take about one day and to generate around 25 GB of data.
OK. That's helpful to know. We'll likely have to move up to a bigger RackSpace package to accommodate this. :) We've also thought about moving this to actual hardware, so that's an option as well.
> One possibility to reduce this is to store the HTMLified POD in a file on disk. Right now both the plain-text POD and the HTML version are stored in elasticsearch, but one is sufficient for full-text searching. The HTML version can be generated on demand and cached, like we already do for the source code.
I think that's also a good idea. Mark would need to change how he fetches the HTML for search.metacpan.org, but that shouldn't be a big deal.
>
> I'd like to stress that we need to run this only once, since new uploads to CPAN will be handled in near-realtime by a different process. Apart from a few packages with version numbers that break Module::Metadata, there is complete coverage.
However, if we make a change which will affect all distros (like adding some new type of metadata), we'll likely need to reindex everything (or a lot of data), so that could slow things down. Right now the cloud instance is low on RAM, so if I have to reindex from scratch, I run it off proper hardware and then copy the new index over. We'd have to rethink that, which is fine, since it was never supposed to be a permanent solution.
>
> I will be off until Saturday, so don't expect any updates before next week.
Enjoy your break!
> I wish you all the best for the baby and hope you still get some sleep :-)
Thanks! So far, I'm surviving. :)
Best,
Olaf