Crawling basics for Gallery2

Mark A Walker

Dec 20, 2009, 4:28:50 AM
to SOFTplus GSiteCrawler
Hello guys, I've been reading this group for a long time and have been
amazed at the GSiteCrawler info here. I am using the beta version, as of
the last time I looked, and I also run a very large Gallery2 with many
images and keywords. A crawl takes almost 24 to 40 hrs and the results are
usually very good given the number of URLs it produces, yet I feel I can
cut down on many of the useless ones. If anyone out there has some tips
on configuring Gallery2 with GSiteCrawler, I'd sure appreciate it, such as
losing the session IDs or possibly navid and login. I've looked at
the basic setup they gave us, with similar exclusions in the sample
forum, but I'm not sure just what I can leave out.

webado

Dec 20, 2009, 9:01:48 AM
to SOFTplus GSiteCrawler
It's not GSiteCrawler you need to worry about, it's any search engine
robot.
So whatever changes you make, you have to keep those in mind.

Session IDs are a problem for robots. If you can configure the
Gallery2 software to get rid of them for robots and other non-
authenticated visitors, this will be a vast improvement.

You should also disallow robots from various types of URIs for
functions such as login, print, enlarge, comment, different sorting
methods, etc.
Without your website URL I am guessing.

Mark A Walker

Dec 20, 2009, 5:41:43 PM
to SOFTplus GSiteCrawler
The URL is http://www.godragracing.org/gallery/main.php which in
itself has really no printing or any of the more advanced stuff, just
the basic config. I was thinking of activating the mod URL rewrite
so that the actual pages have a friendly URL ending like .html,
but it would be implemented later since there are so many websites
already linking to it. I have yet to configure a decent robots.txt for
any of this originating from my home website; there is so much
confusion about wildcards and bots. I feel that on the main parts, really
everything but the gallery, I allow just the basics, but again, the
robots.txt does contain a lot. Please feel free to look and add your tips.
Sincerely Mark


Christina S

Dec 20, 2009, 8:32:43 PM
to gsitec...@googlegroups.com
I don't advise rewriting URLs now unless you simultaneously implement
301 redirections from the old (current) URLs to the new ones.
An issue I see right away is the overabundance of 302 redirections.

For instance a url like this:
http://www.godragracing.org/gallery/main.php?g2_view=keyalbum.KeywordAlbum&g2_keyword=Joe+Newsham&g2_highlightId=6036
gets 302 redirected to
http://www.godragracing.org/gallery/main.php?g2_view=keyalbum.KeywordAlbum&g2_keyword=Joe+Newsham

Basically the last parameter in the query string, g2_highlightId=6036, gets
removed through that redirection.
But 302 redirection has the result of preserving both urls, albeit both with
the content on the destination url.
So you have instant content duplication across 2 urls.

Multiply this by however many urls the site has.

It also makes it harder for robots to crawl if at every step they get
redirected (that regardless of which kind of redirection is used).
Basically for smooth crawling and indexing there should not be any kinds of
redirections in navigation.
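
One way to see which kind of redirection a URL returns is to request it without following redirects. The following is only a minimal sketch using Python's standard library; the function names `first_response` and `classify` are made up for illustration, and a server may answer HEAD requests differently from the GET requests a robot sends.

```python
import http.client
from urllib.parse import urlsplit

def first_response(url, timeout=10):
    """Return (status, Location header) for url WITHOUT following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # Rebuild the request target: path plus query string, if any.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def classify(status):
    """Map an HTTP status code to a short label for a crawl report."""
    if status in (301, 308):
        return "permanent redirect"
    if status in (302, 303, 307):
        return "temporary redirect"
    if 200 <= status < 300:
        return "ok"
    return "other"

# Example call (needs network access, so it is left commented out):
# status, location = first_response("http://www.godragracing.org/gallery/main.php")
# print(status, classify(status), location)
```

A URL that reports "temporary redirect" here is one of the 302 cases described above; "permanent redirect" is what the old URLs should return if they are ever rewritten.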

While you certainly have many "keywords", you don't actually have that much
text content to justify all those keywords.

But that's beside the point.

The login uri should be disallowed in robots.txt . Easiest would be with a
prefix:

Disallow: /gallery/main.php?g2_view=core.UserAdmin

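To see what a prefix rule like this covers, here is a rough sketch in Python that mimics plain prefix matching against the path plus query string. It is a simplification and not how any particular robot is implemented: wildcards are not handled, the helper name `is_disallowed` is made up, and the `g2_subView` and `g2_itemId` parameters in the sample URLs are assumptions for illustration only.

```python
from urllib.parse import urlsplit

# Prefix rule taken from the Disallow line above; wildcard syntax is not handled.
DISALLOW_PREFIXES = [
    "/gallery/main.php?g2_view=core.UserAdmin",
]

def is_disallowed(url, prefixes=DISALLOW_PREFIXES):
    """Return True if the URL's path plus query string starts with any prefix."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(target.startswith(prefix) for prefix in prefixes)

# Hypothetical sample URLs: the extra parameters are assumed, not taken from the site.
login_url = ("http://www.godragracing.org/gallery/main.php"
             "?g2_view=core.UserAdmin&g2_subView=core.UserLogin")
main_url = "http://www.godragracing.org/gallery/main.php"
reordered_url = ("http://www.godragracing.org/gallery/main.php"
                 "?g2_itemId=1&g2_view=core.UserAdmin")

print(is_disallowed(login_url))      # True: starts with the prefix
print(is_disallowed(main_url))       # False: no query string at all
print(is_disallowed(reordered_url))  # False: parameter order breaks the prefix
```

The last case shows the limit of a prefix rule: it only works when g2_view is the first parameter in the query string.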

All images wherever they may be should have alt text.


I'm crawling your site now with GSC.

It is set to include only web page urls and nothing else - so no image urls
or any media files. They don't belong in a general web sitemap.

So far I've had to ban URLs containing these items, as they produce duplicate
pages and some are virtually empty pages (no text, just an image that's
already been shown on another page).

/gallery/main.php?g2_view=core.DownloadItem
/gallery/main.php?g2_view=core.UserAdmin
/gallery/main.php?g2_view=keyalbum.KeywordAlbum


The above simulates the corresponding robots.txt directives.

Also I used Remove Parameter for:
g2_highlightId
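
The effect of removing that parameter can be sketched in a few lines of Python. This is only an illustration of the idea, not GSiteCrawler's own code, and the helper name `remove_parameter` is made up.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def remove_parameter(url, name):
    """Return url with every occurrence of the query parameter `name` dropped."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != name]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

with_highlight = ("http://www.godragracing.org/gallery/main.php"
                  "?g2_view=keyalbum.KeywordAlbum&g2_keyword=Joe+Newsham"
                  "&g2_highlightId=6036")
without_highlight = ("http://www.godragracing.org/gallery/main.php"
                     "?g2_view=keyalbum.KeywordAlbum&g2_keyword=Joe+Newsham")

# Both forms collapse to the same URL, so only one entry is left for the sitemap.
print(remove_parameter(with_highlight, "g2_highlightId") == without_highlight)
```

Applied to a whole list of crawled URLs and followed by de-duplication, this leaves a single entry for each pair of URLs that differ only in g2_highlightId.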

You will need to fix the software somehow to get rid of that and of the 302
redirection from urls containing that to others. No idea how you can manage
this in Gallery.

Crawling is excruciatingly slow due to all the redirections.

Also even after filtering out a lot of useless urls, it's found over 5000
and still going.

Just how many pages do you think you have? Just because you have over 10000
images doesn't mean there should be that many distinct pages indexed.


Christina
www.webado.net

Mark A Walker

Dec 21, 2009, 1:36:05 AM
to SOFTplus GSiteCrawler
Dear Christina,

This was exactly what I was looking for, and very detailed also, thank
you. I finished the crawl after an excruciating 30-plus hrs and it
gave me 3 XML sitemaps in the standard and gzipped forms. I believe it
found 467,000+ and indexed so far a little over 4,700. I know this is
wrong and felt it is hurting more than helping, just like the 302 that
G2 adds immediately for some reason, which I know Google hates.
I thought the keywords would be needed for searching the types of cars
people were looking for, and I don't think I can add any type of "alt"
to the images unless it's by descriptions only. I feel you have
definitely found the right way to go about this for me, and I will
recrawl with your added parameters after I see what Gallery 2
has to say about the redirect.


Christina S

Dec 21, 2009, 1:44:14 AM
to gsitec...@googlegroups.com
You don't need to recrawl. Just add those parameters and use the Filter Now
option.
It will remove all the extra urls from the sitemap.

In the end there are about 10000 which I found. Seems like a lot to me even
then.
The crawl took maybe an hour or two, but I don't really remember.


Mark A Walker

Dec 21, 2009, 12:17:47 PM
to SOFTplus GSiteCrawler
Again Christina, I thank you for looking into this and finding these
parameters; you must be very well versed in working these out. I
relied heavily on Gallery2 since I felt it was so much better than the
others; I create some large galleries with it and it is stable. Lately I
am also an avid image browser, not just text, since I feel
the blogs are unstoppable for SEO and make it to the top of a search
on an irrelevant subject with maybe one answer or so, compared to a full
story on a website. The many websites I produce all use the sitecrawler,
with mixed results that I have yet to figure out, since it's all
basic HTML and well written. I'm not real good with PHP, like in the
gallery, or with what needs to be done there. Google has really indexed
all the pages well, and if you go to site:www.godragracing.org you can see
there are 30 or forty pages of results, but only in text. I will begin going
through the steps you supplied. I have still yet to find a
definitive answer on the 302 redirects in the Gallery admin forums,
but what you have done so far is outstanding, like showing me not
to recrawl, which I would have done, but to use the filter instead. I wish
you the best and thank you again.

Sincerely Mark


Christina S

Dec 21, 2009, 12:24:11 PM
to gsitec...@googlegroups.com
You're welcome, Mark.

Good luck.

Personally, my early attempts at using Gallery for anything have failed. Too
many options for me, can't see the forest for the trees LOL.
I gave up on that long ago.
