"With a database this large the crawlers will be disabled"

Patrick Kinney

Jun 19, 2009, 2:00:25 PM

to gsitec...@googlegroups.com

When I opened GsiteCrawler today it told me:
"Warning!
The size of your database file is over 900mb - please compact it.
With a database this large the crawlers will be disabled."
I have no idea how to do that. Any suggestions? I don't really have a very big site, but it is an osCommerce site, so maybe that is different?
Thanks for any help,
Patrick

Patrick Kinney
Kinney's Shooting Supply, LLC
kinneysshootingsupply.com

"......the right of the people to keep and bear arms shall not be infringed."

webado2

Jun 19, 2009, 8:45:10 PM

to SOFTplus GSiteCrawler

You can compact from the File menu.

Do you have any clue how big the site should be and how many urls a
robot will actually find?

I have a hunch you need to disallow a bunch of things in robots.txt or
risk spawning an unlimited number of URLs.

Also, is it possible that session IDs get added to the URLs?
GSC can remove those, but the better solution would be to prevent them
from being added when robots crawl.

A page like http://kinneysshootingsupply.com/fajen-m-6.html?sort=2d&page=1&filter_id=13&sort=5a
is only a differently sorted and filtered version of
http://kinneysshootingsupply.com/fajen-m-6.html, so it should not be
separately indexed.
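
To illustrate the duplication, here is a minimal Python sketch (not something GSiteCrawler does itself) that drops the query string from each URL and shows both addresses collapsing to the same page. The helper name `strip_query` is made up for this example, and dropping every parameter is an assumption: if a parameter such as `page` leads to content that cannot be reached any other way, it would need different handling.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment (hypothetical helper)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# The two example URLs from this post.
crawled = [
    "http://kinneysshootingsupply.com/fajen-m-6.html?sort=2d&page=1&filter_id=13&sort=5a",
    "http://kinneysshootingsupply.com/fajen-m-6.html",
]

# A set keeps one entry per distinct page once the parameters are gone.
unique_pages = sorted({strip_query(url) for url in crawled})
print(unique_pages)
```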

Sebastian Lutz

Jun 19, 2009, 9:11:54 PM

to gsitec...@googlegroups.com

Hi,

You can disallow URLs containing a "?" in robots.txt.

Bye, Basti


Patrick Kinney

Jun 20, 2009, 7:00:16 AM

to gsitec...@googlegroups.com

It appears that the pages that include the "?", like the one in your example, are the extra pages. Can I just filter out the "html?" pages and still have all of my products found?
I got rid of about a third of the URLs by adding one other filter in GSiteCrawler.
Should I be disallowing all of this in robots.txt?
Thanks,
Patrick

webado2

Jun 20, 2009, 7:16:49 AM

to SOFTplus GSiteCrawler

OK, then yes, you should disallow anything that has a query string
after the .html, not just in GSiteCrawler's filter but also in
robots.txt.

Add this to the robots.txt file, under User-agent: *

Disallow: /*html?sort

If there are other query strings where the first parameter is something
other than sort, add a line for each. I'm not sure whether the simpler,
more general
Disallow: /*html?
would work.

Whatever else you added to the filter in GSC should also be added to
robots.txt.

Then in GSC import the robots.txt file again and refilter URL List and
Crawler queue.

Then start the crawl again.
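
A rough way to check which paths the two Disallow patterns above would catch is to emulate the wildcard matching in Python. This sketch assumes the crawler treats `*` as "any run of characters" and a trailing `$` as "end of URL", which is how Google documents it for Googlebot; a crawler without wildcard support will read these lines differently. The function names are hypothetical and this is not how GSiteCrawler itself evaluates robots.txt.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a wildcard Disallow value into a regex (assumed semantics)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(path: str, rules: list[str]) -> bool:
    """True if any rule matches from the start of the path plus query string."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


specific = ["/*html?sort"]  # only query strings whose first parameter is sort
general = ["/*html?"]       # any query string after .html

for path in ["/fajen-m-6.html",
             "/fajen-m-6.html?sort=2d&page=1",
             "/fajen-m-6.html?page=2"]:
    print(path, is_disallowed(path, specific), is_disallowed(path, general))
```

Under that assumption, the general `/*html?` rule blocks every query-string variant of an .html page while leaving the plain page crawlable, whereas `/*html?sort` only catches URLs whose first parameter is sort.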


Patrick Kinney

Jun 20, 2009, 12:30:46 PM

to gsitec...@googlegroups.com

Thanks a million. It looks like I have plenty to do, now.
Patrick
