search.api2.metacpan.org

Moritz Onken

Mar 5, 2011, 11:26:52 AM

to cpan...@googlegroups.com

Hi,

I was finally able to bring the new API to life on the server and imported OALDERS' CPAN folder into elasticsearch. search-metacpan-org has been adapted to support the new API and is available at http://search.api2.metacpan.org/.

The application server can be reached at http://metacpan.org:5001/ and the es instance resides at http://metacpan.org:9201/ (different cluster name so it doesn't interfere with the other one).

Before we go on and move to the new api I'd say that we should have a thorough look at the elasticsearch mappings. Changes to the mapping won't be possible after indexing cpan. The mapping can be found at http://metacpan.org:9201/_mapping?pretty=1 (old mapping is at http://metacpan.org:9200/_mapping?pretty=1). I hope to get clintongormley to have a look at it too, since he is the elasticsearch expert.

Cheers,
mo

Mark Jubenville

Mar 5, 2011, 7:01:14 PM

to cpan...@googlegroups.com

Thanks for all your work on this,

I'm just curious with your new index how one can do a search to get only the
latest version of each dist/module?

Essentially with the search site, I don't ever want to get anything but the
most recent version of a module or dist when doing a search except when I'm
on the Distribution info page, at which point I'd like to be able to get a
list of all versions.

For instance:

http://search.api2.metacpan.org/#/author/OALDERS

In Olaf's distribution list, it now shows all versions of each Distribution
when it should only show the most recent. Also, I don't think the name of the
Distribution should have the version number in it like that.

Basically the most common use case for search.cpan.org is to look up the
latest version of a module, so I'd like to continue in that vein if
possible.

Also, one last thing, is your new index fully indexed at this point?

When I search:

http://search.api2.metacpan.org/#/search/module/dbix::class

I only get a single module of Olaf's. Also I'm not sure the search is
working since the module that is shown is:

DBIx::MySQL::Replication::Slave

Which doesn't have DBIx::Class in the name at all.

Actually it appears that the index currently is only showing modules/dists
for OALDERS. So maybe you are using his modules just as a sample set.

Anyhow it's a nice piece of work getting all the versions of each release
into the index.

Mark Jubenville
ionc...@gmail.com

Olaf Alders

Mar 6, 2011, 12:45:27 AM

to cpan...@googlegroups.com

Hi Moritz,

On 2011-03-05, at 11:26 AM, Moritz Onken wrote:

> Hi,
>
> I was finally able to bring the new api to life on the server and imported OALDERS cpan folder in elasticsearch. search-metacpan-org has been adapted to support the new api and is available at http://search.api2.metacpan.org/.

Thanks for doing this!

>
> The application server can be reached at http://metacpan.org:5001/ and the es instance resides at http://metacpan.org:9201/ (different cluster name so it doesn't interfere with the other one).

Did you define a different cluster name or did this get set by default?

>
> Before we go on and move to the new api I'd say that we should have a thorough look at the elasticsearch mappings. Changes to the mapping won't be possible after indexing cpan.

I suppose this is because the update is a hassle and not because things are etched in stone? I don't mind additive changes down the line, but we really should firm things up now as you've suggested.

> The mapping can be found at http://metacpan.org:9201/_mapping?pretty=1 (old mapping is at http://metacpan.org:9200/_mapping?pretty=1).

I didn't realize you could actually fetch the mappings this way, but it makes sense now that you've pointed it out. :) We should really look this over in detail. I see that you've got the CPAN mirrors included now, which is great. I also see that you've mapped some fields with "include_in_all" : false. That's new to me, but again, a very good idea.

We'll need to firm up the author mappings for sure. I wanted people to be creative with that, and I believe they have done that. I haven't gotten an author pull request in some time, so this is a good time to freeze the mapping. I think we should add something like "coding_perl_since" => 'YYYY'. Down the line I see a Perl jobs site which allows you to find coders based on complex searches via the API. Knowing how many years of experience a developer has would be a key part of that.

I do see that the dynamic mapping has screwed up a few times and added PM names as field names.

Also, I think we should look at how to approach the META.yml (and META.json) files. There's a lot of info in those files which would be great to have in ES. I'm thinking specifically of the resource URLs, repository type etc. Down the line we'll want to know how many open bugs distribution X has in its GitHub repo, so having that info handy would be good. Since the META files are so flexible, it's hard to find one mapping to rule them all. So, we may want to choose the fields we care about and add those to the mapping. I'm no expert on mappings, so there may be a way of telling ES to include all fields but only to care about X fields when it comes to mapping. At any rate, being able to search on a lot of the fields found in the META files will make this service that much more helpful.

> I hope to get clintongormley to have a look at it too, since he is the elasticsearch expert.

Excellent idea.

Best,

Olaf

Moritz Onken

Mar 6, 2011, 6:07:15 AM

to cpan...@googlegroups.com

Hi Mark,

Am 06.03.2011 um 01:01 schrieb Mark Jubenville:

> Thanks for all your work on this,
>
> I'm just curious with your new index how one can do a search to get only the
> latest version of each dist/module?

The status property of each module/release/file can be either "latest" or "cpan".
If you add a term search for "latest" on this field, you always get only the
latest version of a release/file/module.
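
For illustration, here is a minimal sketch in Python of what such a search body could look like. It is not taken from the actual API code: the "author" field name and the helper name latest_only are assumptions, and only the "status" field and its "latest" value come from the description above.

```python
import json


def latest_only(query=None, size=10):
    """Wrap a query so that only documents with status == "latest" match."""
    status_term = {"term": {"status": "latest"}}
    if query is None:
        return {"query": status_term, "size": size}
    # A bool query with "must" requires every listed clause to match.
    return {"query": {"bool": {"must": [query, status_term]}}, "size": size}


# Hypothetical example: the latest releases of one author.
# The "author" field name is an assumption, not confirmed by the mapping.
body = latest_only({"term": {"author": "OALDERS"}})
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a _search request; leaving out the status term returns every indexed version instead.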

I compiled a list of searches against the new api at
https://github.com/monken/p5-pad/tree/master/jslib/06-lib/Pad/API. You might
find this useful.

>
> Essentially with the search site, I don't ever want to get anything but the
> most recent version of a module or dist when doing a search except when I'm
> on the Distribution info page, at which point I'd like to be able to get a
> list of all versions.
>
> For instance:
>
> http://search.api2.metacpan.org/#/author/OALDERS
>
> In Olaf's distribution list, it now shows all versions of each Distribution
> when it should only show the most recent. Also the name of the Distribution
> shouldn't really have the version number in it like that either I don't
> think.
>
> Basically the most common use case for search.cpan.org is to look up the
> latest version of a module, so I'd like to continue in that vein if
> possible.
>
> Also, one last thing, is your new index fully indexed at this point?
>

No, I only added OALDERS' folder to have some sample data. I haven't done
a full index run yet.

> When I search:
>
> http://search.api2.metacpan.org/#/search/module/dbix::class
>
> I only get a single module of Olaf's. Also I'm not sure the search is
> working since the module that is shown is:
>
> DBIx::MySQL::Replication::Slave

This definitely needs some fine tuning, although I think that ElasticSearch
is doing kind of the right thing: it breaks DBIx::Class down into the terms
dbix and class and searches the (analyzed) name field for them. The
score is quite low, so my guess is that anything that contains both terms
will be rated much higher. If you want an exact match, you can use
name.raw instead. This field is not analyzed and allows for wildcard
searches like DBIx::Class*
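
A rough Python sketch of the difference, for illustration only. The analyze function is a simplified stand-in (lowercase, split on non-alphanumerics) and not ElasticSearch's real analyzer; all function names here are made up for the example.

```python
import re
from fnmatch import fnmatchcase


def analyze(text):
    """Simplified stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def analyzed_match(query, name):
    """True if any query term occurs among the name's terms (OR semantics)."""
    return bool(set(analyze(query)) & set(analyze(name)))


def raw_wildcard_match(pattern, name):
    """Match the pattern against the untouched name, like a non-analyzed field."""
    return fnmatchcase(name, pattern)


name = "DBIx::MySQL::Replication::Slave"
print(analyze("DBIx::Class"))                    # ['dbix', 'class']
print(analyzed_match("DBIx::Class", name))       # True: the term "dbix" is shared
print(raw_wildcard_match("DBIx::Class*", name))  # False: the raw name differs
print(raw_wildcard_match("DBIx::Class*", "DBIx::Class::Schema"))  # True
```

This shows why DBIx::MySQL::Replication::Slave turns up for a search on the analyzed field even though DBIx::Class is not in its name, and why the raw field does not match it.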

>
> Which doesn't have DBIx::Class in the name at all.
>
> Actually it appears that the index currently is only showing modules/dists
> for OALDERS. So maybe you are using his modules just as a sample set.
>
> Anyhow it's a nice piece of work getting all the versions of each release
> into the index.

Thanks! Catch me on irc if you have time. We can do some _search'ing
together :-)

Moritz Onken

Mar 6, 2011, 6:19:22 AM

to cpan...@googlegroups.com

Am 06.03.2011 um 06:45 schrieb Olaf Alders:

> Hi Moritz,
>
> On 2011-03-05, at 11:26 AM, Moritz Onken wrote:
>
>> Hi,
>>
>> I was finally able to bring the new api to life on the server and imported OALDERS cpan folder in elasticsearch. search-metacpan-org has been adapted to support the new api and is available at http://search.api2.metacpan.org/.
>
> Thanks for doing this!
>
>>
>> The application server can be reached at http://metacpan.org:5001/ and the es instance resides at http://metacpan.org:9201/ (different cluster name so it doesn't interfere with the other one).
>
> Did you define a different cluster name or did this get set by default?
>

When I first fired up the instance it tried to join the other cluster. A quick glance at the docs showed that you can simply run bin/elasticsearch -Des.cluster.name=metacpan -f and it becomes a completely separate cluster.

>>
>> Before we go on and move to the new api I'd say that we should have a thorough look at the elasticsearch mappings. Changes to the mapping won't be possible after indexing cpan.
>
> I suppose this is because the update is a hassle and not because things are etched in stone? I don't mind additive changes down the line, but we really should firm things up now as you've suggested.
>

There is really a lot to the analyzers, and unfortunately you cannot change an analyzer without reindexing. One example: if you want to sort on the distribution name, you need an analyzer that does not split the value into tokens (i.e. the "keyword" tokenizer), because you cannot sort on fields that have more than one term. You also need a filter that lowercases the distribution to get a case-insensitive search. If you also want some kind of full-text search on the distribution field (e.g. search for class and find DBIx-Class), you need a multi_field. Right now only distribution has this analyzer applied, but we should definitely look into that and make sure that we have the right analyzer set on the right fields. We should probably meet on IRC some time and discuss this.
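
As a starting point for that discussion, a hedged sketch of how such a setup could be written down, expressed as Python dicts that serialize to the JSON elasticsearch expects. The analyzer name lowercase_keyword and the sub-field name analyzed are hypothetical, and the exact keys should be checked against the elasticsearch docs and the live mapping before relying on them.

```python
import json

# Hypothetical analyzer name. The "keyword" tokenizer keeps the whole value
# as a single term, and the "lowercase" filter makes matching case-insensitive.
settings = {
    "analysis": {
        "analyzer": {
            "lowercase_keyword": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            }
        }
    }
}

# multi_field: one source value, indexed in two different ways.
distribution_mapping = {
    "distribution": {
        "type": "multi_field",
        "fields": {
            # Single lowercased term: usable for sorting and exact matching.
            "distribution": {"type": "string", "analyzer": "lowercase_keyword"},
            # Tokenized variant (hypothetical name): lets a search for
            # "class" find DBIx-Class.
            "analyzed": {"type": "string", "index": "analyzed"},
        },
    }
}

print(json.dumps({"settings": settings, "mapping": distribution_mapping}, indent=2))
```

Keeping both variants under one field means the sortable form and the full-text form are always built from the same source value.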

>> The mapping can be found at http://metacpan.org:9201/_mapping?pretty=1 (old mapping is at http://metacpan.org:9200/_mapping?pretty=1).
>
> I didn't realize you could actually fetch the mappings this way, but it makes sense now that you've pointed it out. :) We should really look this over in detail. I see that you've got the CPAN mirrors included now, which is great. I also see that you've mapped some fields with "include_in_all" : false. That's new to me, but again, a very good idea.
>

include_in_all was actually added by elasticsearch by default, so credit goes to them :)

> We'll need to firm up the author mappings for sure. I wanted people to be creative with that, and I believe they have done that. I haven't gotten an author pull request in some time, so this is a good time to freeze the mapping. I think we should add something like "coding_perl_since" => 'YYYY'. Down the line I see a Perl jobs site which allows you to find coders based on complex searches via the API. Knowing how many years of experience a developer has would be a key part of that.
>

I added a new sample authors file in my repo: https://github.com/monken/cpan-api/commit/75b35ede3e210f0b1663cecad9960ffd462efa24 and also added some explanations. Maybe we can start from there.

> I do see that the dynamic mapping has screwed up a few times and added PM names as field names.

I saw that too. ES allows you to set an index as strict, which will cause the indexer to fail if it tries to add a field with no mapping. We should consider this.
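
To make the idea concrete, a small Python sketch of what strict behaviour means. This only imitates the check locally and does not use any elasticsearch API; the field names and the function names are hypothetical.

```python
def unmapped_fields(doc, mapped):
    """Return the top-level field names in doc that have no mapping."""
    return sorted(set(doc) - set(mapped))


def index_strict(doc, mapped):
    """Reject a document with unmapped fields instead of silently adding them."""
    extra = unmapped_fields(doc, mapped)
    if extra:
        raise ValueError("unmapped fields: " + ", ".join(extra))
    return doc


AUTHOR_FIELDS = {"pauseid", "name", "email"}  # hypothetical mapped fields

good = {"pauseid": "OALDERS", "name": "Olaf Alders"}
# A module name leaking in as a field name, as the dynamic mapping allowed.
bad = {"pauseid": "OALDERS", "Some::Module": "..."}

index_strict(good, AUTHOR_FIELDS)
try:
    index_strict(bad, AUTHOR_FIELDS)
except ValueError as err:
    print(err)  # unmapped fields: Some::Module
```

With a check like this the bad document fails loudly at index time rather than polluting the mapping.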

>
> Also, I think we should look at how to approach the META.yml (and META.json) files. There's a lot of info in those files which would be great to have in ES. I'm thinking specifically of the resource URLs, repository type etc. Down the line we'll want to know how many open bugs distribution X has in it's Github repo, so having that info handy would be good. Since the META files are so flexible, it's hard to find one mapping to rule them all. So, we may want to choose the fields we care about and add those to the mapping. I'm no expert on mappings, so there may be a way of telling ES to include all fields but only to care about X fields when it comes to mapping. At any rate, being able to search on a lot of the fields found in the META files will make this service that much more helpful.

If you look at http://search.api2.metacpan.org:5001/release/_search?q=* you can see that some releases have a "resources" field. That is extracted from META and has no mapping yet. I'm using CPAN::Meta to extract the metadata.
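
One way to "choose the fields we care about" is sketched below in Python, purely as an illustration. The actual extraction here is done with CPAN::Meta in Perl; the function name, the flattened output keys and the example.com URLs are all made up, and the input layout follows the CPAN META "resources" section as I understand it.

```python
def pick_resources(meta):
    """Flatten the META 'resources' entries we care about; skip absent ones."""
    res = meta.get("resources") or {}
    repo = res.get("repository") or {}
    if isinstance(repo, str):
        # Older META files give the repository as a plain URL string.
        repo = {"url": repo}
    bugs = res.get("bugtracker") or {}
    if isinstance(bugs, str):
        bugs = {"web": bugs}
    picked = {
        "homepage": res.get("homepage"),
        "repository_url": repo.get("url"),
        "repository_type": repo.get("type"),
        "bugtracker_web": bugs.get("web"),
    }
    return {key: value for key, value in picked.items() if value is not None}


# Hypothetical META data with placeholder URLs.
sample = {
    "name": "Some-Dist",
    "resources": {
        "repository": {"type": "git", "url": "git://example.com/some-dist.git"},
        "bugtracker": {"web": "http://example.com/some-dist/issues"},
        "x_custom": "ignored",
    },
}
print(pick_resources(sample))
```

Only the picked keys would then need an explicit mapping; everything else in the META file stays out of the index.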

>
>> I hope to get clintongormley to have a look at it too, since he is the elasticsearch expert.
>
> Excellent idea.
>
> Best,
>
> Olaf

Cheers,
mo

Olaf Alders

Mar 13, 2011, 11:19:46 PM

to cpan...@googlegroups.com

On 2011-03-06, at 6:07 AM, Moritz Onken wrote:

> Hi Mark,
>
> Am 06.03.2011 um 01:01 schrieb Mark Jubenville:
>
>> Thanks for all your work on this,
>>
>> I'm just curious with your new index how one can do a search to get only the
>> latest version of each dist/module?
>
> the status property of each module/release/file can be either "latest" or "cpan".
> If you add a term search on this field you can always get the latest version
> of a release/file/module.

Hi Moritz,

Do you have the middleware in place which was going to mimic the way the API currently operates? It makes total sense to me that "latest" would give you just the newest data if you're going to port 9200 directly. What we had initially intended with the proxied REST API was to keep the URLs very simple. So, if we could get the same sort of results with the new index (i.e. just the latest) via the "convenience" URLs which we had set up via proxy, I think that would be very helpful. It's not a dealbreaker, but I really like having a sane set of defaults applied to the URLs which beginners would use to get started.

Best,

Olaf

Olaf Alders

Mar 13, 2011, 11:37:38 PM

to cpan...@googlegroups.com

Hi Moritz,

On 2011-03-06, at 6:19 AM, Moritz Onken wrote:

>>> Before we go on and move to the new api I'd say that we should have a thorough look at the elasticsearch mappings. Changes to the mapping won't be possible after indexing cpan.
>>
>> I suppose this is because the update is a hassle and not because things are etched in stone? I don't mind additive changes down the line, but we really should firm things up now as you've suggested.
>>
>
> There is really a lot to the analyzers and unfortunately you cannot change an analyzer without reindexing. One example is that if you want to sort on the distribution name, you need an analyzer that has no tokenizer (i.e. "keyword") because you cannot sort on fields that have more than one term. Also you need a filter that lowercases the distribution to have a case insensitive search. If you also want some kind of full text search on the distribution field (e.g. search for class and find DBIx-Class) you need a multi_field. Right now only distribution has this analyzer applied, but we should definitely look into that and make sure that we have the right analyzer set on the right fields. We should probably meet on IRC some time and discuss this.

I'll likely defer to you on this mostly as you've got a better handle on the ES docs than I have at this point. :) I do know, however, that if we want to change things and need to re-index, we can do this more easily if we have a cluster set up. We could update one of the nodes on the cluster and then restart the others. The other nodes should then, upon reconnection, sync to the re-indexed machine and propagate the changes. We don't do this now as we only have the one node in our cluster.

>
>>> The mapping can be found at http://metacpan.org:9201/_mapping?pretty=1 (old mapping is at http://metacpan.org:9200/_mapping?pretty=1).
>>
>> I didn't realize you could actually fetch the mappings this way, but it makes sense now that you've pointed it out. :) We should really look this over in detail. I see that you've got the CPAN mirrors included now, which is great. I also see that you've mapped some fields with "include_in_all" : false. That's new to me, but again, a very good idea.
>>
>
> include_in_all was actually added by elasticsearch by default, so credits go to them :)

Nice! I'll likely add any comments on the mappings to the commits themselves. That may be an easier way to track discussions.

>
>> We'll need to firm up the author mappings for sure. I wanted people to be creative with that, and I believe they have done that. I haven't gotten an author pull request in some time, so this is a good time to freeze the mapping. I think we should add something like "coding_perl_since" => 'YYYY'. Down the line I see a Perl jobs site which allows you to find coders based on complex searches via the API. Knowing how many years of experience a developer has would be a key part of that.
>>
>
> I added a new sample authors file in my repo: https://github.com/monken/cpan-api/commit/75b35ede3e210f0b1663cecad9960ffd462efa24 and added also some explanations. Maybe we can start from there.

Indeed. I commented on this commit earlier this evening.

>
>
>> I do see that the dynamic mapping has screwed up a few times and added PM names as field names.
>
> I saw that too. ES allows to set an index as strict, which will cause the indexer to fail if it adds a field with no mapping. We should consider this.

Sounds like a good idea. Let's consider the free-for-all author mapping to have been concluded now. Any new fields either don't get included or go to the unmapped portion of the document.

>
>>
>> Also, I think we should look at how to approach the META.yml (and META.json) files. There's a lot of info in those files which would be great to have in ES. I'm thinking specifically of the resource URLs, repository type etc. Down the line we'll want to know how many open bugs distribution X has in it's Github repo, so having that info handy would be good. Since the META files are so flexible, it's hard to find one mapping to rule them all. So, we may want to choose the fields we care about and add those to the mapping. I'm no expert on mappings, so there may be a way of telling ES to include all fields but only to care about X fields when it comes to mapping. At any rate, being able to search on a lot of the fields found in the META files will make this service that much more helpful.
>
> If you look at http://search.api2.metacpan.org:5001/release/_search?q=* you can see that some releases have a "resources" field. That is extracted from META and has no mapping yet. I'm using CPAN::Meta to extract the metadata.

I'm getting a gateway error there right now. I did chat with David Golden on IRC about meta parsing. This may be helpful info for you -- it was for me: https://gist.github.com/868704

Best,

Olaf

Moritz Onken

Mar 14, 2011, 5:30:49 AM

to cpan...@googlegroups.com

>>
>
> I'm getting a gateway error there right now. I did chat with David Golden on IRC about meta parsing. This may be helpful info for you -- it was for me: https://gist.github.com/868704
>
> Best,
>
> Olaf
>

Actually the main elasticsearch server is down, too. I was looking into it, but I have no idea which version to start.

Cheers,
moritz

Moritz Onken

Mar 14, 2011, 5:31:58 AM

to cpan...@googlegroups.com

>
>
> Do you have the middleware in place which was going to mimic the way the API currently operates? It makes total sense to me that "latest" would give you just the newest data if you're going to port 9200 directly. What we had initially intended with the proxied REST API was to keep the URLs very simple. So, if we could get the same sort of results with the new index (ie just the latest) via the "convenience" URLs which we had set up via proxy, I think that would be very helpful. It's not a dealbreaker, but I really like having a sane set of defaults applied to the URLs which beginners would use to get started.

Yes, that is possible. Have a look at http://api.netcubed.de/module/DBIx::Class and http://api.netcubed.de/pod/DBIx::Class . They DWYM. The backend looks for the latest version of the Module and returns it.
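
In spirit, the lookup behind such a convenience URL could look like the following Python sketch. It works on an in-memory list instead of the real index; the "version" field, the version numbers and the function name are hypothetical, while "name" and "status" are the fields discussed earlier in this thread.

```python
def find_latest(docs, module):
    """Return the document for `module` whose status is "latest", or None."""
    for doc in docs:
        if doc["name"] == module and doc["status"] == "latest":
            return doc
    return None


# Hypothetical sample data; the version numbers are made up.
docs = [
    {"name": "DBIx::Class", "version": "0.1", "status": "cpan"},
    {"name": "DBIx::Class", "version": "0.2", "status": "latest"},
    {"name": "Other::Module", "version": "1.0", "status": "latest"},
]

print(find_latest(docs, "DBIx::Class"))
print(find_latest(docs, "No::Such::Module"))
```

A URL like /module/DBIx::Class then never needs a version in it; the caller only has to handle the case where nothing is found.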

Moritz

Olaf Alders

Mar 14, 2011, 2:59:32 PM

to cpan...@googlegroups.com

It was running inside a screen session under my login. :) I restarted it last night. The process had been killed. I think we need to bump up to a virtual server with more RAM. I'll likely do that this evening.

Olaf

Olaf Alders

Mar 15, 2011, 10:40:42 AM

to cpan...@googlegroups.com

Hi Folks,

I just upgraded the server instance to the 1 GB of RAM offering. That is probably still not nearly enough, but it appears to be more responsive right now, and I think it will have enough disk space for the work which Moritz is doing. The server required a reboot as part of the resize, so some services need a kick start. Moritz, I think you may need to fire up your ES instance? Also, cpanvote doesn't appear to be working for me, so I think Yanick may need to look at that.