Data about modules
From: Zbigniew Lukasiak <zzb...@gmail.com>
Date: Sat, 19 Sep 2009 23:25:05 +0200
Local: Sat, Sep 19 2009 5:25 pm
Subject: Data about modules

Hi,

I've forked the CpanHQ project on GitHub: http://github.com/zby/cpanhq
I am working on search.  I decided to use the database's native
search features to see what they can offer, and I am starting with
SQLite.
I have a prototype working (not yet pushed to GitHub).  It is very
simple: input fields for full text search on package name and
abstract, author cpanid, and date (to find only modules that have
a release after that date).

Now I am pondering simultaneously on several things - unfortunately
it's all kind of connected:

1. Database schema - I've done some exploratory benchmarking and am
attaching the test code and results.  I have to admit I have never
really optimized databases beyond the obvious indexing, so perhaps I
am a bit naive here.  I think I'll need to denormalize the schema a
bit if I want to let the user try different orderings and jump to
far-away pages of the result set: joining combined with ordering and
a high offset seems to be expensive.  On the other hand, when the
joined table narrows the query enough (i.e. no high offsets), the
searches are always fast.  Full text search, for example, works well:
in SQLite it requires joining another table, but it also narrows the
search enough.  Searching by the ratings of distributions (a joined
table, and assuming the ratings are uniformly distributed) can be too
slow, so perhaps we need to keep the ratings in the primary table
(i.e. denormalize the schema).

2. What are the most useful searches?

And the most important:

3. Where to take the data from?  Currently I just use what is
available from the packages themselves, plus 00whois.xml (author data)
and 02packages.details.txt.gz.  But surely one of the most important
search terms would be the ratings.
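
A minimal sketch of the trade-off in point 1, written in Python with the standard sqlite3 module so that it is self-contained. The table and column names (dist, rating, score) and the synthetic data are assumptions for illustration; this is not the CpanHQ schema or the attached benchmark. It stores the same ratings twice, once in a joined table and once as a denormalized, indexed column, and fetches a page at a high offset both ways. It only shows that the two queries return the same page; it makes no claim about timings.

```python
import sqlite3


def build_db(n=2000):
    """Create an in-memory database with ratings stored both ways."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dist (
            id     INTEGER PRIMARY KEY,
            name   TEXT NOT NULL,
            rating REAL                -- denormalized copy of rating.score
        );
        CREATE TABLE rating (
            dist_id INTEGER PRIMARY KEY REFERENCES dist(id),
            score   REAL NOT NULL
        );
        -- Index matching the ORDER BY of the denormalized query.
        CREATE INDEX dist_rating_idx ON dist (rating DESC, id);
    """)
    # Synthetic, roughly uniform ratings between 0.0 and 4.9.
    rows = [(i, f"Dist-{i}", (i * 37 % 50) / 10.0) for i in range(1, n + 1)]
    conn.executemany(
        "INSERT INTO dist (id, name, rating) VALUES (?, ?, ?)", rows)
    conn.executemany(
        "INSERT INTO rating (dist_id, score) VALUES (?, ?)",
        [(i, score) for i, _, score in rows])
    return conn


def page_joined(conn, limit, offset):
    """Normalized variant: order by a column of the joined table."""
    return conn.execute("""
        SELECT d.name FROM dist d
        JOIN rating r ON r.dist_id = d.id
        ORDER BY r.score DESC, d.id
        LIMIT ? OFFSET ?""", (limit, offset)).fetchall()


def page_denormalized(conn, limit, offset):
    """Denormalized variant: order by a column of the primary table."""
    return conn.execute("""
        SELECT name FROM dist
        ORDER BY rating DESC, id
        LIMIT ? OFFSET ?""", (limit, offset)).fetchall()


if __name__ == "__main__":
    conn = build_db()
    print(page_joined(conn, 10, 1500) == page_denormalized(conn, 10, 1500))
```

Timing the two functions on real data would be the next step; the denormalized query can be answered from a single index on the primary table, which is the effect the benchmark above is hinting at.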

--
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/

Attachments: benchmark3 (3K), a.pl (8K)

 
From: Shawn H Corey <shawnhco...@gmail.com>
Date: Sat, 19 Sep 2009 18:11:55 -0400
Local: Sat, Sep 19 2009 6:11 pm
Subject: Re: [rethinking-cpan] Data about modules

Zbigniew Lukasiak wrote:
> 2. What are the most useful searches?

> And the most important:

For both:  what the modules do.

When someone searches CPAN, they are looking for modules to help them
with a task.  They would choose search terms that are relevant to the
task they want to accomplish.

--
Just my 0.00000002 million dollars worth,
   Shawn

Programming is as much about organization and communication
as it is about coding.

I like Perl; it's the only language where you can bless your
thingy.


 
From: Zbigniew Lukasiak <zzb...@gmail.com>
Date: Sun, 20 Sep 2009 12:17:17 +0200
Local: Sun, Sep 20 2009 6:17 am
Subject: Re: [rethinking-cpan] Re: Data about modules
On Sun, Sep 20, 2009 at 12:11 AM, Shawn H Corey <shawnhco...@gmail.com> wrote:

> Zbigniew Lukasiak wrote:
>> 2. What are the most useful searches?

>> And the most important:

> For both:  what the modules do.

> When someone searches CPAN, they are looking for modules to help them
> with a task.  They would choose search terms that are relevant to task
> they want to accomplish.

OK - agreed, now what are the consequences?
One thing is that perhaps it does not really make sense to search by
rating, because when searching you first need to identify modules
that do what you need; only after you have found a list of them do you
need to compare their quality.  Fit for purpose is first - quality is
second.

Second, maybe we need some categories.  I remember I used to use the
categories on the front page of http://search.cpan.org, but for some
time now I have just used the search box to find things.  What are
your experiences?  Another idea is to use the recommended modules
pages at the p5p wiki:
http://www.perlfoundation.org/perl5/index.cgi?recommended_cpan_modules
If we use both, how do we align them?  As separate facets, or imported
into one tag space?  And how do we import that data - how do we
identify cases like Class::DBI, which is mentioned at
http://www.perlfoundation.org/perl5/index.cgi?recommended_database_mo...
only to say that it has been replaced by DBIx::Class?  How can I
detect this automatically if I load the data by web scraping?



 
From: Shawn H Corey <shawnhco...@gmail.com>
Date: Sun, 20 Sep 2009 09:59:54 -0400
Local: Sun, Sep 20 2009 9:59 am
Subject: Re: [rethinking-cpan] Re: Data about modules

Zbigniew Lukasiak wrote:
> On Sun, Sep 20, 2009 at 12:11 AM, Shawn H Corey <shawnhco...@gmail.com> wrote:
>> Zbigniew Lukasiak wrote:
>>> 2. What are the most useful searches?

>>> And the most important:
>> For both:  what the modules do.

>> When someone searches CPAN, they are looking for modules to help them
>> with a task.  They would choose search terms that are relevant to task
>> they want to accomplish.

> OK - agreed, now what are the consequences?

The first consequence is that you can't rely on the module's name to
provide all the search terms.  The question then becomes do you classify
the module by a set of pre-determined categories or do you allow the
author to enter his/her own tags (or both)?

The problem with categories is they are not very flexible.  Can a module
be placed into more than one category?  Can modules be placed into any
category in the hierarchy or just the leaf nodes?

The problem with tags is their variations.  One author may use the
acronym, another the full name, another the full name separated by
hyphens, another the name with slashes, etc.  If you use pre-determined
tags, you are just using non-hierarchical categories.
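
A rough sketch of how some of these variations could be folded together, in Python. It lowercases the tag and collapses spaces, hyphens, slashes and underscores into a single separator. Acronyms cannot be expanded mechanically, so the ALIASES table here is a hypothetical, hand-maintained mapping, and its one entry is only an example.

```python
import re

# Hypothetical alias table: acronyms mapped by hand to a canonical tag.
ALIASES = {"orm": "object-relational-mapping"}


def normalize_tag(raw):
    """Fold case and separator variants of a tag into one canonical form."""
    # Lowercase, then turn every run of non-alphanumerics into one hyphen.
    tag = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    # Expand known acronyms; unknown tags pass through unchanged.
    return ALIASES.get(tag, tag)


if __name__ == "__main__":
    for variant in ("ORM", "Object Relational Mapping",
                    "object-relational-mapping", "Object/Relational/Mapping"):
        print(variant, "->", normalize_tag(variant))
```

This only handles spelling-level differences; two authors choosing genuinely different words for the same thing would still end up with different tags.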

> One thing is that perhaps it does not really make sense to search by
> rating - because when searching you first need to identify modules
> that do what you need, only after you found a list of them you need to
> compare their quality.  Fit for purpose is first - quality is second.

Searching by rating may not make sense but sorting the results by rating
does.

One of the problems with ratings is when do they go out-of-date?
Every time the author updates the module?  Or only when there is a
significant amount of change?



 
From: Andy Lester <a...@petdance.com>
Date: Sun, 20 Sep 2009 10:44:52 -0500
Local: Sun, Sep 20 2009 11:44 am
Subject: Re: [rethinking-cpan] Re: Data about modules

On Sep 20, 2009, at 8:59 AM, Shawn H Corey wrote:

> The problem with categories is they are not very flexible.  Can a module
> be placed into more than one category?  Can modules be placed into any
> category in the hierarchy or just the leaf nodes?

Especially when it's something like Catalyst::DBIx::TagCloud::Spanish.

--
Andy Lester => a...@petdance.com => www.theworkinggeek.com => AIM:petdance


 
From: Zbigniew Lukasiak <zzb...@gmail.com>
Date: Sun, 20 Sep 2009 22:36:31 +0200
Local: Sun, Sep 20 2009 4:36 pm
Subject: Re: [rethinking-cpan] Re: Data about modules

On Sun, Sep 20, 2009 at 5:44 PM, Andy Lester <a...@petdance.com> wrote:

> On Sep 20, 2009, at 8:59 AM, Shawn H Corey wrote:

>> The problem with categories is they are not very flexible.  Can a module
>> be placed into more than one category?  Can modules be placed into any
>> category in the hierarchy or just the leaf nodes?

> Especially when it's something like Catalyst::DBIx::TagCloud::Spanish.

:)

To bootstrap the whole thing we need to use the existing data - that
is 03modlist.data.gz and the categories as they are at
http://search.cpan.org/.  But surely tags are the most flexible
way of marking sets and then retrieving them.  So we can import these
categories as the initial tags, and then let people assign more
categories to the same modules.  I think that to solve the problem
with too_many-Tag VERSIONS we'll need to use a predefined list of tags
(maybe later also a mechanism for defining a new tag and voting for
it).
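
A small sketch of what a predefined tag list with voting might look like, in Python. The class name TagRegistry, the votes_needed threshold and the example tag names are all assumptions made for illustration; nothing here is an existing CpanHQ interface.

```python
class TagRegistry:
    """Predefined tag list seeded from imported categories, plus voting."""

    def __init__(self, initial_tags, votes_needed=3):
        self.allowed = set(initial_tags)    # e.g. the imported categories
        self.votes_needed = votes_needed    # hypothetical acceptance threshold
        self.proposals = {}                 # proposed tag -> set of voter ids
        self.module_tags = {}               # module name -> set of tags

    def assign(self, module, tag):
        """Attach an allowed tag to a module; reject unknown tags."""
        if tag not in self.allowed:
            raise ValueError(f"unknown tag: {tag!r}")
        self.module_tags.setdefault(module, set()).add(tag)

    def vote(self, tag, voter):
        """Record one vote per voter; return True once the tag is allowed."""
        if tag in self.allowed:
            return True
        voters = self.proposals.setdefault(tag, set())
        voters.add(voter)
        if len(voters) >= self.votes_needed:
            self.allowed.add(tag)
            del self.proposals[tag]
            return True
        return False


if __name__ == "__main__":
    registry = TagRegistry(["Database Interfaces"], votes_needed=2)
    registry.assign("DBIx::Class", "Database Interfaces")
    registry.vote("orm", "voter-a")
    print(registry.vote("orm", "voter-b"))
```

Keeping votes as a set of voter ids means a repeated vote from the same person does not count twice.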

Another question is authorship - for registered modules there is the
author that registered the name space, but for those that never got
registered we only have the cpanid of the person uploading the
release.  Which one to use?  The name-space author seems more
important (and also more stable as it does not change with releases,
at least not commonly) - but it is available for only a minority of
modules.
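
One possible answer, sketched in Python and not something this thread settles: prefer the name-space author when there is one, fall back to the uploader's cpanid otherwise, and record which source was used so the two cases stay distinguishable. The function name and the cpanids in the usage lines are hypothetical.

```python
def pick_author(registered_author, uploader):
    """Return (cpanid, source), preferring the name-space registrant."""
    if registered_author:
        return registered_author, "registered"
    # Unregistered module: only the uploader of the release is known.
    return uploader, "uploader"


if __name__ == "__main__":
    print(pick_author("ALICE", "BOB"))
    print(pick_author(None, "BOB"))
```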



 