static site indexer filter

tomcloyd

Apr 22, 2008, 7:16:09 PM

to Webby

In lieu of a dynamic search facility, some sites work better (I think)
with a site index, although these are not so often seen. It's an older
metaphor, but still a very familiar one, so users ought not to have
difficulty with it. It's also conceptually easier to set up, and lower
cost to run.

I don't know of any existing routine which could be used as a final
filter in Webby, to set up this index - is there one? I'm amusing
myself thinking about the fun of writing one. As a beginning Ruby
programmer (I have more experience with several other languages), it
looks like a fairly easy, and useful project.

Any thoughts, anyone?

Tim Pease

Apr 24, 2008, 10:51:19 PM

to webby...@googlegroups.com

There was this little challenge that I threw out a week or two ago.

<http://groups.google.com/group/webby-forum/browse_thread/thread/b34699b71bb3e079>

Don't know if a sitemap is the same concept as your site index, but
they sound very similar. If you feel like coding this up, I'm sure
others would find it useful, too.

Blessings,
TwP

Bruce Williams

Apr 25, 2008, 12:10:04 AM

to webby...@googlegroups.com

Incidentally, I had to do this just today for some documentation at work:
http://pastie.caboo.se/186595

Cheers,
Bruce

---
Bruce Williams
http://codefluency.com
twitter: wbruce

tomcloyd

Apr 27, 2008, 7:49:40 AM

to Webby

This looks interesting, and useful, BUT it's not a site index, it's a
site *map*. I will likely make use of it - thanks!

What I have in mind would work like this:

1. All page files in a target directory would be processed, except
for those on an "ignore" list.
2. All content between a list of tags would be processed. Such a list
might look like
<h1>
<div id="maincontent">
<div id="sidebarRight">
etc...
3. Each word in the target areas on each page processed would, if not
already there, become a key in the index hash. The associated value for
each key would be an array containing the relative URL and title of
each page where the word is found. Obviously, one would need to create
and extend, over time, a "stop list" of words which do NOT go into
the index (because they are trivial, irrelevant, etc.).
4. The index hash is then output in HTML as an alphabetized list of
words, with associated page title links.
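
The four steps above can be sketched in plain Ruby. This is only a minimal illustration, not an existing Webby filter: `IndexedPage`, `build_index`, `index_to_html`, and the sample stop list are all hypothetical names, and it assumes the text of the target areas (step 2) has already been extracted from each page.

```ruby
require "set"

# Hypothetical stand-in for a processed page: relative URL, title, and the
# text already pulled out of the target areas.
IndexedPage = Struct.new(:url, :title, :text)

# Sample stop list; in practice it would grow as the output is reviewed.
STOP_WORDS = Set.new(%w[a an and in is of the to])

HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

def escape_html(text)
  text.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Step 3: map each word to the [url, title] pairs of the pages containing it.
def build_index(pages, stop_words = STOP_WORDS)
  index = Hash.new { |hash, word| hash[word] = [] }
  pages.each do |page|
    words = page.text.downcase.scan(/[a-z]+(?:'[a-z]+)?/).uniq
    words.each do |word|
      next if stop_words.include?(word)
      index[word] << [page.url, page.title]
    end
  end
  index
end

# Step 4: emit an alphabetized list of words with page-title links.
def index_to_html(index)
  items = index.keys.sort.map do |word|
    links = index[word].map do |url, title|
      %(<a href="#{escape_html(url)}">#{escape_html(title)}</a>)
    end
    "  <li>#{escape_html(word)}: #{links.join(', ')}</li>"
  end
  "<ul>\n#{items.join("\n")}\n</ul>"
end
```

Moving a word to the stop list and rerunning is then enough to drop it from the output.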

I would expect to run this routine, and aggressively move indexed
words to the stop list, leaving a selected list of important words to
be in the index.

The advantages of this are that it's simple, it could be set up on any
site (no server database needed), it uses a metaphor (a book
index) with which people are familiar, it can easily be updated at any
time, and one has complete control over content, both via the "stop
list" and manual editing.

So...if I don't find something that does this, and nobody beats me to
it, I'll probably take a stab at doing this myself. It's within my
capability, which I certainly cannot say about many things I'd like to
do with Ruby (which is one of many reasons I really, really like Webby -
'cause it's so much better than anything I might have even conceived
of doing).

This sort of thing would work best on a site which is focused on
written content - the kind of sites I run and build for others. It
wouldn't be appropriate for all sites, surely.

Any thoughts or reactions, anyone?

Tom

Bruce Williams

Apr 27, 2008, 11:07:14 AM

to webby...@googlegroups.com

Sounds like a good idea :-)

I'd use an attribute on each page you'd like to ignore to flag it (vs
maintaining a separate ignore list), Hpricot to yank out content to
process, etc. Looks like a fun little project!

tomcloyd

Apr 27, 2008, 11:59:51 AM

to Webby

Thanks very much for the suggestions - I'm not familiar with Hpricot,
though I've heard of it. I'll check into it.

Can you explain what you mean by using an "attribute" to flag a page?
Would that be something like inserting  in the <head>
tag? That's all I can imagine you might mean. That *would* appear to
be a simpler way to stop page indexing than what I'd proposed.

Tom

Bruce Williams

Apr 27, 2008, 12:15:37 PM

to webby...@googlegroups.com

Tom,

I'm talking about the metadata at the top of each page (in
content/**); I wouldn't process the output files in output/**
directly.

For example you could do something like the following:

---
title: Foo Bar
created_at: 2008-04-18 22:40:00 -06:00
ignore: true
filter:
- textile

and simply check for the `ignore' attribute on page objects.

Also, rather than just writing a script that processed content/**
files directly, I'd try to do it programmatically (probably in a Rake
task; Tim might have some tips here) by loading webby and using
Webby::Resources::DB#find to grab all the pages (see
http://webby.rubyforge.org/rdoc/classes/Webby/Resources/DB.html#M000056),
and checking for page.ignore -- and you could get the HTML output of
each page for processing by calling page.render and the URL by calling
page.url (see http://webby.rubyforge.org/manual/#h2_1_1).
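
The approach above (skip flagged pages, then take each remaining page's rendered HTML and URL) might look roughly like this. The sketch deliberately does not call Webby: `StubPage` is a hypothetical stand-in that only models the `ignore`, `render`, and `url` names mentioned above, and `pages_to_index` is an illustrative helper. In a real Rake task the pages would come from Webby's own resource lookup instead.

```ruby
# Hypothetical stand-in for a Webby page object. Only the three names
# discussed above (ignore, render, url) are modelled.
StubPage = Struct.new(:url, :ignore, :html) do
  def render
    html
  end
end

# Return [url, rendered_html] pairs for every page that is not flagged
# with `ignore: true` in its metadata.
def pages_to_index(pages)
  pages.reject { |page| page.ignore }
       .map { |page| [page.url, page.render] }
end
```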

tomcloyd

Apr 27, 2008, 9:35:38 PM

to Webby

Wow - that's a far more interesting approach than I had in mind. I
tend to keep things very simple - I often have no choice. Often for me
the question is not HOW to do something in Ruby but can I do it at
all. I don't have much time to work on things, and have to learn WHILE
I'm trying to get some piece of work accomplished. It's a luxury to
have time to read other people's code, or to study some aspect of the
language simply to learn more about it. Just a fact of my life.

So, I'm fascinated with your suggestions, as they open up whole new
paths of exploration and learning for me, and will likely result in
better results as well.

What I originally had in mind was simply a routine which would act
directly on a set of HTML files, regardless of origin. That would be
usable by all sorts of folks, should they wish.

At this point, I think I'd like to have it operate as you suggest,
from within Webby, because this will assist me in learning Webby more
quickly. Later, I can write another version which can use some of the
same code to realize my original concept.

Thanks again so much for your suggestions. I benefit greatly from
them.

Tom

Ana Nelson

Jun 18, 2008, 5:18:09 PM

to Webby

Sort of related to this thread, I have added

index: false

to the metadata of some of my pages when I don't want them to be
indexed by Google, but where I can't use a robots.txt file to exclude
the entire directory. For example, I don't want Google to index the
year and month archive pages of a blog; I only want the blog posts
themselves to show up.

In my layout I have:

<% unless p['index'] || p['index'].nil? -%>
<meta name="robots" content="noindex, follow" />
<% end -%>

I use this same code in my RSS feed
@pages.find(...).each do |p|
next unless p['index'] || p['index'].nil?

to skip things which aren't blog posts.

This might be a useful convention if someone wants to write a Sitemap
(http://www.sitemaps.org/) generator before I get around to it. :-)
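
To spell out the convention: a page counts as indexable unless its metadata explicitly says `index: false`, so pages with no `index` key at all are still included. A small sketch, using a plain hash as a stand-in for page metadata (`indexable?` is a hypothetical helper name, not part of Webby):

```ruby
# Truthy when the page should be indexed: either `index` is set to a
# true value, or it is not set at all.
def indexable?(meta)
  meta["index"] || meta["index"].nil?
end

no_flag  = {}                    # no `index` key: indexed
opted_in = { "index" => true }   # explicit true: indexed
excluded = { "index" => false }  # explicit false: skipped

[no_flag, opted_in, excluded].select { |meta| indexable?(meta) }
```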

Ana Nelson

Jun 18, 2008, 5:29:01 PM

to Webby

And of course that should be @page['index'] not p['index'] in my
layout.

(This is SO ironic. I should have generated my email text using webby
and live code.)

Tim Pease

Jun 22, 2008, 9:21:38 AM

to webby...@googlegroups.com

On Jun 18, 2008, at 3:18 PM, Ana Nelson wrote:

>
> This might be a useful convention if someone wants to write a Sitemap
> (http://www.sitemaps.org/) generator before I get around to it. :-)
>

A site indexer would be fantastic! If you're willing to share the code
when you're done, I'll gladly include it with the next release of webby.

Blessings,
TwP

Ana Nelson

Jun 23, 2008, 4:56:03 PM

to Webby

I've had a first go at a sitemap:
http://github.com/ananelson/webby/commit/5f4a64df3f48479a5e6448e6253ef28a3b43fa9b

Results look something like this:
http://pastie.org/220691

I've tried :sort_by => 'path' but that doesn't seem to do much. It
would be nice, for us people at least, to have the site's root come
first and the rest be sorted alphabetically.

Just now, though, I see that if I sort by created_at, pages where it
is nil would be dropped and I wouldn't have to skip them manually, so
maybe that's a better way to do it, since the robots don't care about
the ordering.
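
One way to get the root first and everything else alphabetical is to sort on a two-part key. This is a sketch over plain URL strings, not Webby's `:sort_by` option; treating `/` and `/index.html` as the root is an assumption, and `sitemap_order` is a hypothetical helper name.

```ruby
# URLs treated as the site root (an assumption for this sketch).
ROOT_URLS = ["/", "/index.html"].freeze

# Sort so that the root comes first, then everything else
# alphabetically, ignoring case.
def sitemap_order(urls)
  urls.sort_by { |url| [ROOT_URLS.include?(url) ? 0 : 1, url.downcase] }
end
```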

Tom Cloyd

Jun 23, 2008, 5:59:18 PM

to webby...@googlegroups.com

I'm way, way below you folks in skills, but I just have to say that I do
NOT grasp the idea of an autogenerated site map. I don't see how that
information is contained in the sparse matrix of hyperlinks IN a set of
pages, and it cannot reliably be obtained from directory structure,
since many of us don't use that notion for site organization.

But beyond all that, when I do a site map, I want the page groupings
listed in MY order, not alphabetical order, and I often want some kind
of brief description accompanying each page listing. I can envision how
that might be all set up with metadata, but it seems easier to me to
just keep a running outline of the conceptual organization of your site,
and expand that into a site map.

I keep wondering if I'm missing something here. Must be.

FINALLY - when I started this thread, what I was referring to was the
production of something akin to a book index, but for a website. Static
search output, if you will, but browsable. Tim had some comments about
how best to do this, and I liked them (and need to review them). Here's
a description I recently wrote to one of my website design customers
(and I expect to start on this this week - ASAP) - it describes a
standalone program, but I can see this as a part of Webby, easily enough:

"One sets up, as an option and not a necessity, a set of tags (keywords,
we call them in other contexts) which are associated with a page, and
which are put IN the page, but styled to be invisible in a browser. But
my program can find them. The point of the tags is to call special
attention to principal content. The tag words will appear in the index
output in bold font, indicating a MAIN source of information - the first
place a user might want to browse to.

"Regardless of whether or not a given page is tagged, all other words on
the page are indexed. The results are reviewed, and meaningless words
are put on a "stop" list, which causes them NOT to appear in the index.

"The output then generated shows main entries (the tags aforementioned),
and all others, alphabetically, grouped by letter. Following each entry
is a link to the page where this entry appears.

"It's that simple. The webmaster can direct the output by use of the
tags, or not. Either way, the site user can see the range of topics
available better with this tool than in any other way, all on one page.
Browsable. Formatted as the webmaster desires."
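
The output described above (tagged main entries in bold, everything grouped by letter) could be rendered along these lines. A sketch only: the entry shape (`:word`, `:url`, `:main`) and the `render_grouped_index` name are hypothetical, and it assumes the words have already been collected, filtered through the stop list, and are safe to place in HTML as-is.

```ruby
# Render index entries as one <h2> + <ul> block per initial letter.
# Entries flagged :main (the page's tags) are shown in bold.
def render_grouped_index(entries)
  groups = entries.sort_by { |entry| entry[:word].downcase }
                  .group_by { |entry| entry[:word][0].upcase }
  groups.map do |letter, items|
    lines = items.map do |entry|
      label = entry[:main] ? "<b>#{entry[:word]}</b>" : entry[:word]
      %(  <li>#{label} <a href="#{entry[:url]}">#{entry[:url]}</a></li>)
    end
    "<h2>#{letter}</h2>\n<ul>\n#{lines.join("\n")}\n</ul>"
  end.join("\n")
end
```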

It might be feasible to set this up as a rake task. That'd be cool,
but it's hardly my first priority, and besides I don't yet know how to
do that.

So...if someone beats me to this, cool. If not, I'll be happy to put my
code out for massaging by some more capable hands, if they so wish. I
just want the bloody functionality, yesterday.

t.

--

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tom Cloyd, MS MA, LMHC
Private practice Psychotherapist
Bellingham, Washington, U.S.A: (360) 920-1226
<< t...@tomcloyd.com >> (email)
<< TomCloyd.com >> (website & psychotherapy weblog)
<< sleightmind.wordpress.com >> (mental health issues weblog)
<< DirectPathDesign.TomCloyd.com >> (web site design & consultation)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Denis Defreyne

Jun 24, 2008, 4:54:46 AM

to webby...@googlegroups.com

On 23 Jun 2008, at 23:59, Tom Cloyd wrote:

> I'm way, way below you folks in skills, but I just have to say that
> I do NOT grasp the idea of an autogenerated site map. I don't see
> how that information is contained in the sparse matrix of
> hyperlinks IN a set of pages, and it cannot reliably be obtained
> from directory structure, since many of us don't use that notion for
> site organization.
>
> But beyond all that, when I do a site map, I want the page groupings
> listed in MY order, not alphabetical order, and I often want some
> kind of brief description accompanying each page listing. I can
> envision how that might be all set up with metadata, but it seems
> easier to me to just keep a running outline of the conceptual
> organization of your site, and expand that into a site map.

Hi,

Such an XML-based sitemap is actually meant to be used by search
engines. In addition to providing a complete list of all pages on a web
site (which makes hard-to-discover pages easy to find), it also allows
you to set priorities for pages and can also give a hint about a
page's update frequency, so spiders can fine-tune their crawl rates
for a site with an XML sitemap.

My site has an auto-generated XML sitemap (meant for spiders) as well
as an (auto-generated) HTML sitemap (meant for humans), and they're
generated in quite different ways (they have different purposes after
all).
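
For reference, an XML sitemap of this kind is a `urlset` of `url` entries, each with a `loc` and optional `lastmod`, `changefreq`, and `priority` elements (per sitemaps.org). A minimal sketch that builds one from plain hashes; `sitemap_xml`, `xml_escape`, and the sample data shape are hypothetical and not taken from any particular site or from Webby:

```ruby
XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;", "'" => "&apos;"
}.freeze

def xml_escape(value)
  value.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Build a sitemaps.org-style document. Each page is a hash with a
# required :loc and optional :lastmod, :changefreq, and :priority.
def sitemap_xml(pages)
  entries = pages.map do |page|
    fields = ["    <loc>#{xml_escape(page[:loc])}</loc>"]
    [:lastmod, :changefreq, :priority].each do |key|
      fields << "    <#{key}>#{xml_escape(page[key])}</#{key}>" if page[key]
    end
    "  <url>\n#{fields.join("\n")}\n  </url>"
  end
  [
    %(<?xml version="1.0" encoding="UTF-8"?>),
    %(<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">),
    *entries,
    "</urlset>"
  ].join("\n") + "\n"
end
```

The priority and changefreq values are only hints to the spiders, so leaving them out for most pages is fine.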

Hope this helps!

Denis

--
Denis Defreyne
denis.d...@stoneship.org

Tom Cloyd

Jun 24, 2008, 8:39:51 AM

to webby...@googlegroups.com

Thanks a bunch. I now officially 'have a clue'. Guess I'm still in the game!

Tom
