
sucking up to google


Kirk Is

Oct 30, 2002, 6:32:12 PM
ObHack:
I noticed Google was declining to archive the archives of my blog,
http://kisrael.com . I've noticed that Google seems reluctant to archive
other things of mine where it's heuristically obvious that the content is
dynamically generated (I assume a ".cgi" and/or CGI elements like ? and &
and = might be a tipoff.) This makes a certain amount of sense, a Google
'bot wouldn't want to be caught in a script that might be able to generate
an arbitrary amount of content.

That was my working assumption, anyone have an opinion on its correctness?

So anyway, my idea was to set up my archive using index.cgi's in a
directory structure, so that it shouldn't be as obvious that it's a
script, ala http://www.kisrael.com/arch/2002/10/
this seems to be a much better idea than my first thought of mirroring my
entire content as static HTML...to further reduce the redundancy, each
index.cgi is just a few lines of script that "require"s a single perl
script that does the work (symlinks can't run scripts on my webhost.)

I dunno if it works. For all I know, Google is declining to follow my
archive 'cause the content is two clicks away from my front page, or maybe
there's some secret header magic that Google pays more attention to, and
it's not the URL. So we'll find out next time Google sniffs out my site.

ObHTMLhack:
Not much of a hack, more of an easy HTML effect I just discovered:
I decided a flat list like my old archive http://kisrael.com/viewblog.cgi
was getting unwieldy, so I decided http://www.kisrael.com/arch/ should
put years in different columns. I cranked up the cellspacing, intending
to remove the border once I was happy with it, but I ended up liking the
effect of leaving the border at 1. A friend describes the effect as
"Escher-esque"

--
QUOTEBLOG: http://kisrael.com SKEPTIC MORTALITY: http://kisrael.com/mortal
"I would have made a good pope" --Richard M. Nixon

Eli the Bearded

Oct 30, 2002, 8:06:54 PM
In alt.hackers, Kirk Is <kirk...@alienbill.com> wrote:
> ObHack:
> I noticed Google was declining to archive the archives of my blog,
> http://kisrael.com . I've noticed that Google seems reluctant to archive
> other things of mine where it's heuristically obvious that the content is
> dynamically generated (I assume a ".cgi" and/or CGI elements like ? and &
> and = might be a tipoff.) This makes a certain amount of sense, a Google
> 'bot wouldn't want to be caught in a script that might be able to generate
> an arbitrary amount of content.
>
> That was my working assumption, anyone have an opinion on its correctness?

I have been told (it was probably two years ago) by a lead programmer
at Google that the googlebot will not index pages that look like CGIs
to avoid stupid robot tricks like accidentally voting, etc. Get rid of the
? and & in the URL and I think things will work fine.

> So anyway, my idea was to set up my archive using index.cgi's in a
> directory structure, so that it shouldn't be as obvious that it's a
> script, ala http://www.kisrael.com/arch/2002/10/
> this seems to be a much better idea than my first thought of mirroring my
> entire content as static HTML...to further reduce the redundancy, each
> index.cgi is just a few lines of script that "require"s a single perl
> script that does the work (symlinks can't run scripts on my webhost.)

Use a CGI script without an extension (or with, for a more obvious
situation) and then parse the $PATH_INFO variable to find out what
page to show. Eg:

httpd.conf: (tested)
ScriptAlias /archive /usr/local/httpd/cgi-bin/archive
or
.htaccess: (untested)
<Files archive>
    ForceType application/x-httpd-cgi
</Files>
or maybe:
.htaccess: (untested)
<Files archive>
    SetHandler cgi-script
</Files>

(I rarely use .htaccess and don't know its quirks well.)

Then

http://site/archive/2002/10/index.html

Will run the archive file as a CGI and it will have $PATH_INFO (in
the environment) set to "/2002/10/index.html". You can combine this
with query strings, too.
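
A minimal sketch of the PATH_INFO dispatch described above, written in Python for illustration rather than Perl. The function name `parse_archive_path` and the exact URL layout (`/YYYY/MM/` with an optional `index.html`) are assumptions based on the example URL, not anyone's actual script.

```python
import re

# Assumed layout: "/2002/10", "/2002/10/" or "/2002/10/index.html".
_ARCHIVE_PATH = re.compile(r"^/(\d{4})/(\d{1,2})(?:/(?:index\.html)?)?$")


def parse_archive_path(path_info):
    """Return (year, month) for an archive path, or None if it does not match."""
    match = _ARCHIVE_PATH.match(path_info or "")
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        return None  # reject impossible months instead of guessing
    return year, month
```

In a CGI script this would be called as `parse_archive_path(os.environ.get("PATH_INFO", ""))`, with a `None` result handled by showing the archive index or a 404 page.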

ObHack:
Sayeth http://www.kisrael.com/viewblog.cgi?date=2002.08.07:
: the gas pumps have that little doohickey so you don't have to hold it
: the whole time...

I typically just stick the gas cap in to hold the lever in place.
Crude ascii-art: the box is the frame around the lever, the *-line
is the lever, the %-object is the gas cap. This has the double
advantage of making it harder to forget to put the cap back. It does
not stop the auto-off feature of the pump. Only once have I
encountered a handle this didn't work on; I think that handle was
just defective, since other ones at that station work.


.-~~~~~~~~~~~~~~~~~~~-.
| **** |
| **** |
| ******** |
| ****** |
| %%%%%%% |
| %%%%%%% |
`-_ %%%%%%%%%%% __-'
~~~~~~~~~~~~~~~~

Elijah
------
doesn't buy gas often

Kirk Is

Oct 30, 2002, 11:40:52 PM
Eli the Bearded <*@eli.users.panix.com> wrote:
> In alt.hackers, Kirk Is <kirk...@alienbill.com> wrote:
>> ObHack:
> I have been told (it was probably two years ago) by a lead programmer
> at Google that the googlebot will not index pages that look like CGIs
> to avoid stupid robot tricks like accidentally voting, etc. Get rid of the
> ? and & in the URL and I think things will work fine.

Glad to hear that the algorithm might be that simple.

[snip]

> Use a CGI script without an extension (or with, for a more obvious
> situation) and then parse the $PATH_INFO variable to find out what
> page to show. Eg:

[snip]


> (I rarely use .htaccess and don't know its quirks well.)

> Then

> http://site/archive/2002/10/index.html

> Will run the archive file as a CGI and it will have $PATH_INFO (in
> the environment) set to "/2002/10/index.html". You can combine this
> with query strings, too.

That's pretty much what I'm doing already (in terms of being a cgi script
that reads an environment variable to figure out which part of the
archive to load). I think just using index.cgi's works well enough, since
I doubt google would bother to try to throw in an index.html (since it
wouldn't know if it was index.htm or default.htm instead of index.html).

Also, with my webspace, I'm not sure how much .htaccess I have anyway.

To cap it off, the idea of a .html actually being a script is disturbing
to me aesthetically wise speaking.

obHack:
I think since this has been a conversation about a hack, an ObHack is not so
Ob. So let that excuse the lameness of what I have to offer:
http://kisrael.com/viewblog.cgi?date=2002.10.21
is my most recent "small gif cinema". I timelapsed the cool view out my
office workspace window (a wharf by an ocean harbor) with a borrowed
QuickCam and converted the result to Real Media and animated Gif.
Animated GIFs are a real favorite of mine, arguably making 40x30 pixel
displays as little viewscreens is a bit hackish in itself.
( http://kisrael.com/viewblog.cgi?date=2002.10.14 has a particularly
cinematic one.)


--
QUOTEBLOG: http://kisrael.com SKEPTIC MORTALITY: http://kisrael.com/mortal

Love is two crickets hopping in the same direction --W.T.Vollmann

Kirk Is

Oct 30, 2002, 11:47:10 PM
Figuring two semi-hacks might add up to a whole hack, I'll add in the
image at http://www.kisrael.com/viewblog.cgi?date=2002.10.31 , a
jack-o-lantern's eye view of my wife. I stuck my tiny Canon digital camera
in a plastic baggie so it wouldn't get icky, and actually had Mo tell me
how to aim, otherwise the angle was too hard to judge. I'm pleased with
the result.

--
QUOTEBLOG: http://kisrael.com SKEPTIC MORTALITY: http://kisrael.com/mortal

"The time has come," the walrus said, / "To speak of manic things,
Of shots and shouts, and sealing dooms / Of commoners and kings."--Thurber

Eli the Bearded

Oct 31, 2002, 2:58:12 PM
In alt.hackers, Kirk Is <kirk...@alienbill.com> wrote:
> That's pretty much what I'm doing already (in terms of being a cgi script
> that reads an environmental variable to figure out which part of the
> archive to load) I think just using index.cgi's works well enough, since
> I doubt google would bother to try to throw in an index.html (since it
> wouldn't know if it was index.htm or default.htm instead of index.html)

No, I doubt Google would do that. Incidentally, you can make any file
the directory default with the DirectoryIndex directive. I include index.php
along with the more common index.html and index.cgi. The reasons I would
avoid lots of index.cgi files are maintenance and (on an ISP) file
count quotas.

> To cap it off, the idea of a .html actually being a script is disturbing
> to me aesthetically wise speaking.

HA! I have compiled C that uses PATH_INFO with ".html", no mere scripts.
I also use bogus (ignored by the script) PATH_INFO strings to set nice
filenames for things I expect the user to save to disk.

> Ob. So let that excuse the lameness of what I have to offer:
> http://kisrael.com/viewblog.cgi?date=2002.10.21

Hmmm. Better than I expected. I guess I see too many junk animated GIFs.

http://www.googlism.com/index.htm?ism=eli+the+bearded&type=1
"eli the bearded is not a frequent writer of erotica"

It's true, but I was hoping it could find more...

ObHack:
I get some data files mailed to me from a cron job. Oftentimes there
are problems with the data and it gets cleaned up and sent again.
Then I process it and generate a summary. Originally I would just
save the file in the right spot by hand and then run the summary tool.
But that's not automatic enough. So I wrote a little script that the
summary tool runs just before starting. The script scans my email
attachment directory backwards (chronologically) and finds the most
recent copy of the data file, then copies it to the right place. This
is simpler than getting procmail to extract an attachment to a
particular spot and simpler than copying the file by hand.
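
A rough sketch of that "grab the newest copy" step, in Python for illustration. The directory layout, the glob pattern and the name `copy_newest` are all assumptions; the original script is not shown in the post, so this is only one plausible way to do it.

```python
import shutil
from pathlib import Path


def copy_newest(attachment_dir, pattern, destination):
    """Copy the most recently modified file matching pattern to destination.

    Returns the path that was copied, or None when nothing matched.
    """
    candidates = [p for p in Path(attachment_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    # Newest by modification time, i.e. the last resend of the data file.
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    shutil.copy2(newest, destination)  # copy2 also preserves the mtime
    return newest
```

The summary tool would call something like `copy_newest("/path/to/attachments", "data-*.csv", "/path/to/current.csv")` before it starts; both paths and the pattern here are placeholders.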

Elijah
------
does not equate extension with filetype

Kirk Is

Oct 31, 2002, 7:35:50 PM
Eli the Bearded <*@eli.users.panix.com> wrote:
> In alt.hackers, Kirk Is <kirk...@alienbill.com> wrote:
[snip]

>> To cap it off, the idea of a .html actually being a script is disturbing
>> to me aesthetically wise speaking.

> HA! I have compiled C that uses PATH_INFO with ".html", no mere scripts.

Eww...

For the record, my webhost doesn't feed my scripts PATH_INFO, so I use
REQUEST_URI instead. (And to find out what variables were available, I
iterated over the keys of %ENV.)
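
The environment survey mentioned above can be sketched like this, in Python for illustration rather than Perl. `format_environment` is a hypothetical helper; a CGI script would print its result after a `Content-type: text/plain` header.

```python
import os


def format_environment(environ=None):
    """Return the environment as sorted NAME=value lines, one per variable."""
    env = os.environ if environ is None else environ
    return "\n".join(f"{name}={env[name]}" for name in sorted(env))
```

Running it once on the host shows whether PATH_INFO, REQUEST_URI or neither is available to scripts.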

>> Ob. So let that excuse the lameness of what I have to offer:
>> http://kisrael.com/viewblog.cgi?date=2002.10.21

> Hmmm. Better than I expected. I guess I see too many junk animated GIFs.

Well, glad to exceed your low expectations, then!

John Munch asked if I had any more of those, maybe I should make
a page that indexes them. I redirected him to my homebrew search engine,
but I had thought of making a gallery page before, and that spurred me to
go through with it:
http://kisrael.com/features/sgc.html


> http://www.googlism.com/index.htm?ism=eli+the+bearded&type=1
> "eli the bearded is not a frequent writer of erotica"
> It's true, but I was hoping it could find more...

(for people who were wondering, that's based offa today's blog entry...)

> Elijah
> ------
> does not equate extension with filetype

Heh.

ObHack:
I should probably save this, 'cause I'm having trouble remembering my
hacks when I need 'em, but what the heck, since I just assembled that
other page, there's this one:
http://kisrael.com/features/gb.html

Javascript Gamebuttons (observant readers may notice a similarity to the
previous one button animations,
http://groups.google.com/groups?selm=erOG8.303%24641.21825%40news.tufts.edu&oe=UTF-8&output=gplain
)
Fairly playable games where the entire input and output is confined to a
single 'normal' javascript button. I've always been interested in games
with minimal control schemes, and these push that to near a limit...
I was thinking about sponsoring a contest to make more of these, but never
quite got a round to it.


--
QUOTEBLOG: http://kisrael.com SKEPTIC MORTALITY: http://kisrael.com/mortal

"If I knew who Godot was, I would have said so in the play" --Beckett

Ilmari Karonen

Oct 31, 2002, 10:10:10 PM
In article <eli$02103...@qz.little-neck.ny.us>, Eli the Bearded wrote:
>
>HA! I have compiled C that uses PATH_INFO with ".html", no mere scripts.
>I also use bogus (ignored by the script) PATH_INFO strings to set nice
>filenames for things I expect the user to save to disk.

This might be a good time to post one of my latest silly PHP tricks. I
used to have a cron job to take incremental backups of a site I maintain
by SSHing to the host and using find, tar and gzip to get a compressed
tarball of all new or changed files. (I've posted the script here once,
I think.)

Anyway, both me and the site in question have moved since, and the old
setup didn't work very well anymore. What I needed was some way to do
the same thing with just a dumb browser (M$IE, in fact) on the client side.

The result was this:

<?php
$root = $_SERVER['DOCUMENT_ROOT'];

$logfile = "$root/error_log.txt";

$attemptfile = "$root/.last-backup-attempt";
$successfile = "$root/.last-backed-up";

// ?full=1 asks for a full backup; the default is incremental.
$type = (!empty($_GET['full']) ? "full" : "incr");

// The timestamp of the current attempt travels in the PATH_INFO file name.
$pathinfo = isset($_SERVER['PATH_INFO']) ? $_SERVER['PATH_INFO'] : '';
if (preg_match('/-([0-9]{8}T[0-9]{6})-/', $pathinfo, $m)) {
    $timestamp = $m[1];
} else {
    $timestamp = "none";
}
$lastattempt = gmdate('Ymd\\THis', @filemtime($attemptfile));

if ($timestamp != $lastattempt) {
    // The URL does not name the latest attempt: start a new attempt and
    // redirect to a URL whose last component is the file name to save as.
    touch($attemptfile); // BUG: touch() could fail
    clearstatcache();    // make sure filemtime() sees the new mtime

    $lastattempt = gmdate('Ymd\\THis', filemtime($attemptfile));

    $host = $_SERVER['SERVER_NAME'];
    $self_uri = "http://$host" . $_SERVER['SCRIPT_NAME'];

    $pathinfo = "/$host-$lastattempt-$type.tgz";

    $query = ($_GET ? '?'.$_SERVER['QUERY_STRING'] : '');

    header('Status: 302 Redirect');
    header('Location: ' . $self_uri . $pathinfo . $query);
    return;
}

set_time_limit(3600);

header('Content-type: application/x-just-save-it-please;');

$root = escapeshellarg($root);
$logfile = escapeshellarg($logfile);
$attemptfile = escapeshellarg($attemptfile);
$successfile = escapeshellarg($successfile);

// Incremental backups only pick up files newer than the last success marker.
$cond = ($type == "incr" ? "-newer $successfile" : "");

// --null comes before -T so that GNU tar reads the NUL-separated list.
passthru("exec 2>>$logfile " .
    " find $root -type f $cond -print0 | tar -cf - --null -T - | gzip -9 " .
    " && mv $attemptfile $successfile");

return;
?>

Note the weird self-redirection trick to suggest the correct filename to
the browser. There's also some interesting juggling with timestamps to
guarantee that no files are ever missed and that each tarball will
always contain all files modified between its nominal time and that of
the previous tarball.


Ps. Did I mention already that PHP is a pain-in-the-ass of a language?
Still, it could be worse. And if PHP it be, then PHP I'll write. Yet I
rather get the feeling that, compared to Perl for example, a lot of the
PHP core developers must have rather more enthusiasm than expertise, to
put it diplomatically. I mean, PHP does come with quite a lot of useful
functions built in. But when Perl provides a built-in way to solve some
problem, one can usually rely on the solution being correct. Alas...

--
Ilmari Karonen - http://www.sci.fi/~iltzu/
"In other words, you've been replaced by a very small shell script."
-- sjc to Dave Hinz on rec.humor.oracle.d

Kirk Is

Nov 2, 2002, 12:27:39 AM
Ilmari Karonen <il...@sci.invalid> wrote:
> Ps. Did I mention already that PHP is a pain-in-the-ass of a language?
> Still, it could be worse. And if PHP it be, then PHP I'll write. Yet I
> rather get the feeling that, compared to Perl for example, a lot of the
> PHP core developers must have rather more enthusiasm than expertise, to
> put it diplomatically. I mean, PHP does come with quite a lot of useful
> functions built in. But when Perl provides a built-in way to solve some
> problem, one can usually rely on the solution being correct. Alas...

Very true. There are many things that seem the way they are because it
was easier for the implementors *of* the language, not *in* the
language... iterators using something that's internal to an array and that
needs to be reset() if you want to loop again come to mind. And the
godawful scoping rules. And the way a nested loop I was trying to use just
wouldn't work... it's cool that they have so many builtin useful
functions, and I guess pretty decent performance v. Perl (unless you do
modperl) and actually a much more reasonable object model than Perl, but
still, it doesn't always feel quite ready for primetime to me.

ObHack:
They gave us nice photo frames at work...a curve of metal behind two panes
of glass. Co-worker discovered that they make ok cd holders, keeping them
up off the desk surface (either face down or face up) and out of the way.

--
QUOTEBLOG: http://kisrael.com SKEPTIC MORTALITY: http://kisrael.com/mortal

"And isn't sanity really just a one trick pony anyway? I mean all you get
is one trick, rational thinking, but when you're good and crazy, oooh
oooh oooh, the sky is the limit!" --The Tick

Sean Conner

Dec 20, 2002, 7:06:37 AM
On Fri, 01 Nov 2002 00:35:50 GMT, it was thus said that the Great Kirk Is wrote:
>Eli the Bearded <*@eli.users.panix.com> wrote:
>> In alt.hackers, Kirk Is <kirk...@alienbill.com> wrote:
>[snip]
>>> To cap it off, the idea of a .html actually being a script is disturbing
>>> to me aesthetically wise speaking.
>
>> HA! I have compiled C that uses PATH_INFO with ".html", no mere scripts.
>
>Eww...
>
>For the record, my webhost doesn't feed my scripts PATH_INFO, I use
>REQUEST_URI instead. (And to find out what variables were available, I
>iterated over the keys of %ENV )

Why not just put one copy of the script in /arch as you have it now
(index.cgi) but have it parse out REQUEST_URI for the rest of the path and
go from there? I put my test program in "$DOCUMENT_ROOT/test/index.cgi",
called it (http://www.example.net/test/foobar/baz) and got:

REQUEST_URI="/test/foobar/baz"

So how hard can it be to strip off "/test" (or even "/test/")?
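
For what it's worth, the stripping step might look like the sketch below, in Python for illustration. `path_below` is a hypothetical name, and the sketch assumes REQUEST_URI may carry a query string that should be dropped before comparing prefixes.

```python
from urllib.parse import urlsplit


def path_below(request_uri, prefix):
    """Return the part of the request path under prefix, or None if outside it."""
    path = urlsplit(request_uri).path  # drops any "?query" part
    prefix = prefix.rstrip("/")        # treat "/test" and "/test/" alike
    if path == prefix:
        return ""
    if path.startswith(prefix + "/"):
        return path[len(prefix) + 1:]
    return None                        # e.g. "/testing/x" is not under "/test"
```

Comparing against `prefix + "/"` rather than the bare prefix keeps a sibling directory such as `/testing` from being mistaken for something under `/test`.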

Alternatively, if your provider supports mod_rewrite you can have the
webserver redirect (internally---the client never sees any 300 codes) to the
script, rewriting the URL to put the CGI parameters back in, which is how I
do it for my blog (http://boston.conman.org/):

RewriteEngine on
RewriteBase /

RewriteRule ^([0-9][0-9])(.*) nph-boston.cgi/$1$2 [L]

basically, starting from the root, any URL starting with at least two
numbers is passed directly to the blogging software.
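
To make the effect of that rule concrete, here is a small Python imitation of it. This is only a sketch of the pattern matching, not of mod_rewrite itself; it assumes the per-directory behaviour where the leading slash is stripped before the pattern is applied, and `rewrite` is a hypothetical name.

```python
import re

# Same pattern as the RewriteRule: two leading digits, then anything.
_RULE = re.compile(r"^([0-9][0-9])(.*)")


def rewrite(url_path):
    """Return the internally rewritten path, or the path unchanged if no match."""
    relative = url_path[1:] if url_path.startswith("/") else url_path
    match = _RULE.match(relative)
    if match is None:
        return url_path
    return "/nph-boston.cgi/" + match.group(1) + match.group(2)
```

So `/2000/8/11-15` ends up at the script with the date as trailing path, while `/about/technical.html` is served as an ordinary file.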

Your working assumption on Google (about it not following CGIs
or URLs with certain characters in them) is probably correct. That's one
reason why I wrote my blogging software the way I did (another was that I
wanted certain features, which comprises my ObBlogHack below). Another
thing is that I do have a robots.txt file
(http://www.robotstxt.org/wc/norobots.html) and also use the robots <META>
tags (http://www.robotstxt.org/wc/exclusion.html#meta) on each page.

It also hasn't hurt that every single URL on my site has been valid from
'98 (technically from at least '95---I was fortunate in that I could set up
permanent redirects under the old domain to the new domain when I switched
in '98).

ObBlogHack:

Wanting to use a sane URL scheme, I wrote my blogging software to parse a
date (given in YYYY/MM/DD format) to retrieve the proper entries and display
them. So that "1999" would return all the entries in 1999, "1999/12" would
return entries in December of 1999 and "1999/12/4" would return entries on
December 4th, 1999 and "1999/12/5.2" would return the second entry on
December 5th, 1999. You can even specify a range:

2000/8/11-15

would return the entries from August 11th through the 15th inclusive
(which happens to be a trip I took through northern Florida---the actual
specification would be "2000/8/10.2-15.5", i.e., from the second entry on
August 10th to the fifth entry on August 15th).
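
A minimal sketch of parsing those specifications, in Python for illustration. It covers only the forms listed above (year, year/month, day, day.entry, and a day range within one month); the real software may well accept more, and `parse_spec` and the field names are assumptions.

```python
import re

_SPEC = re.compile(
    r"^(?P<year>\d{4})"
    r"(?:/(?P<month>\d{1,2})"
    r"(?:/(?P<day>\d{1,2})(?:\.(?P<part>\d+))?"
    r"(?:-(?P<end_day>\d{1,2})(?:\.(?P<end_part>\d+))?)?"
    r")?)?$"
)


def parse_spec(spec):
    """Return a dict of the integer fields present in spec, or None if invalid."""
    match = _SPEC.match(spec)
    if match is None:
        return None
    return {
        name: int(value)
        for name, value in match.groupdict().items()
        if value is not None
    }
```

Fields that are absent from the result mean "everything at that level", so `{"year": 1999}` selects the whole year and `{"year": 1999, "month": 12}` the whole month.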

More information can be found at

http://boston.conman.org/about/technical.html

-spc (Been awhile since I last posted here ... )


Kirk Is

Dec 21, 2002, 1:18:51 PM
Sean Conner <se...@area51.slab.conman.org> wrote:
> Why not just put one copy of the script in /arch as you have it now
> (index.cgi) but have it parse out REQUEST_URI for the rest of the path and
> go from there? I put my test program in "$DOCUMENT_ROOT/test/index.cgi",
> called it (http://www.example.net/test/foobar/baz) and got:

> REQUEST_URI="/test/foobar/baz"

> So how hard can it be to strip off "/test" (or even "/test/")?

Well, obviously, basic character manipulations are no sweat (in fact
that's simpler than what I do now)...the trouble is on my server,
it looks for a directory foobar/baz under test, bypassing index.cgi
entirely.

> As far as your working assumption on Google (about it not following CGIs
> or URLs with certain characters in them) is probably correct. That's one
> reason why I wrote my blogging software the way I did (another was that I
> wanted certain features, which comprises my ObBlogHack below). Another
> thing is that I do have a robots.txt file

I dunno. Google has browsed my site since then, but still won't follow
into /arch/ and from there to individual months ala /arch/2002/12. Is the
link structure too deep? Dang.

obLameHack:
The restrooms where I work have this nasty aerosol that autosprays every
few minutes. If you happen to be chewing gum, you suddenly taste the
chemicals. At first I held my breath, then I learned to stuff the gum
between my upper gums and the front of my mouth...


--
QUOTEBLOG: http://kisrael.com SKEPTIC MORTALITY: http://kisrael.com/mortal

Indeed, the Russians' predisposition for quiet reflection followed by
sudden preventive action explains why they led the field for many
years in both chess and ax murders. --Marshall Brickman, Playboy 4/73

Sean Conner

Dec 21, 2002, 11:14:25 PM
On Sat, 21 Dec 2002 18:18:51 GMT, it was thus said that the Great Kirk Is wrote:
>Sean Conner <se...@area51.slab.conman.org> wrote:
>
>> I put my test program in "$DOCUMENT_ROOT/test/index.cgi",
>> called it (http://www.example.net/test/foobar/baz) and got:
>
>> REQUEST_URI="/test/foobar/baz"
>
>> So how hard can it be to strip off "/test" (or even "/test/")?
>
>Well, obviously, basic character manipulations are no sweat (in fact
>that's simpler than what I do now)...the trouble is on my server,
>it looks for a directory foobar/baz under test, bypassing index.cgi
>entirely.

That's definitely an odd Apache setup then. I wonder if it's the
FrontPage extensions that are doing that?

>I dunno. Google has browsed my site since then, but still won't follow
>into /arch/ and from there to individual months ala /arch/2002/12. Is the
>link structure too deep? Dang.

Odd. Your robots.txt file looks fine to me, and my regular web site has
been indexed every which way, and it's pretty deep:

/people/spc/writings/murphy/

I don't know why Google isn't going into /arch on your site, unless it just
hasn't gotten there yet. Can you check your log files? Look for requests
with a googlebot user-agent (if it's stored).
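
One way to do that check, sketched in Python for illustration. It assumes the server writes Apache's "combined" log format (which includes the user-agent as the last quoted field); if the host uses the plain common format there is no user-agent to look at. `googlebot_paths` is a hypothetical name.

```python
import re

# host ident user [time] "METHOD path proto" status bytes "referer" "agent"
_COMBINED = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def googlebot_paths(lines):
    """Return the request paths of log lines whose user-agent mentions googlebot."""
    paths = []
    for line in lines:
        match = _COMBINED.match(line)
        if match and "googlebot" in match.group(3).lower():
            paths.append(match.group(2))
    return paths
```

Passing an open log file, as in `googlebot_paths(open("access_log"))`, gives the list of URLs the crawler actually requested; if nothing under /arch/ shows up, it never followed those links.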

ObCanWeIncludeHTMLHacksHereNow?Hack:

Earlier this year I sold my condo in Florida. One potential buyer wanted
to see what it looked like so I whipped up this site:

http://www.conman.org/people/spc/writings/hypertext/condo/

You'll see a layout of the condo with circled numbers. Each number
represents where I took a picture and the arrow next to it the direction the
photograph was taken. If you click on a number, it will show you that
picture and if you click in a room, it will take you to a list of
photographs taken in that room.

I think (it's been awhile) I wrote some quick scripts to generate most of
the pages just to keep me from going insane. I think it's a pretty decent
interface to a virtual tour of a house, given it was done in a day.

-spc (And if you have Mozilla, turn on the site navigation bar when
viewing the pages ... )
