When I opened GSiteCrawler today it told me:
"Warning! The size of your database file is over 900mb - please compact it. With a database this large the crawlers will be disabled."
I have no idea how to do that. Any suggestions? I don't really have a very big site, but it is an osCommerce site, so maybe that is different?
Thanks for any help,
Patrick
Patrick Kinney
Kinney's Shooting Supply, LLC
kinneysshootingsupply.com
"......the right of the people to keep and bear arms shall not be infringed."
Do you have any idea how big the site should be and how many URLs a
robot will actually find?
I have a hunch you need to disallow a bunch of things in robots.txt or
risk spawning an unlimited number of URLs.
Also, is it possible that session IDs are being added to the URLs?
GSC can remove those, but the better solution would be to prevent them
from being added when robots crawl.
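To illustrate why session IDs inflate the URL count, here is a minimal Python sketch that strips a session parameter so that otherwise identical URLs collapse into one. It assumes the parameter is called osCsid (the usual osCommerce name); the function name and the example.com URLs are made up for illustration, and this is not how GSC itself does it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: the session parameter is named "osCsid" (osCommerce's usual name).
# Add other names here if the site uses something different.
SESSION_PARAMS = {"oscsid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session-ID query parameters removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the session ones (case-insensitive).
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs: the same page fetched in two different sessions.
first = strip_session_id("http://example.com/product_info.php?products_id=5&osCsid=abc123")
second = strip_session_id("http://example.com/product_info.php?products_id=5&osCsid=zzz999")
print(first == second)  # both reduce to the same URL
```

Without that normalisation every new session makes each page look like a new URL to a crawler, which is one way a small shop can end up with a very large crawl database.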
You can disallow URLs containing a "?" in robots.txt.
By Basti
-----Original Message-----
From: gsitecrawler@googlegroups.com [mailto:gsitecrawler@googlegroups.com] On
Behalf Of webado2
Sent: Saturday, June 20, 2009 02:45
To: SOFTplus GSiteCrawler
Subject: [GSiteCrawler] Re: "With a database this large the crawlers will be
disabled"
You can compact from the File menu.
It appears that the pages that include the "?", like in the example below, are the extra pages. Can I just filter out the "html?" pages and still have all of my products found?
I got rid of about a third of the URLs by adding one other filter in GSiteCrawler.
Should I be disallowing all of this in the robots.txt?
Thanks,
Patrick
OK, then yes, you should disallow anything that has a query string
after the .html, not just in GSiteCrawler's filter but also in
robots.txt.
Add this to the robots.txt file, under User-agent: *
Disallow: /*html?sort
If there are other query strings where the first parameter is something
other than sort, add a line for each. I'm not sure whether the simpler,
more general
Disallow: /*html?
would work.
Whatever else you added to the filter in GSC should also be added to
robots.txt.
Then in GSC import the robots.txt file again and refilter the URL List and
Crawler queue.
Then start the crawl again.
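If you want to check a rule before relying on it, here is a rough Python sketch of the wildcard matching that Google documents for robots.txt (and that RFC 9309 describes): "*" matches any run of characters, a trailing "$" anchors the end, and everything else, including "?", is literal. The function name and sample paths are made up, and whether a given crawler, GSC included, follows these same rules is an assumption you would need to verify.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Test a robots.txt path pattern against a URL path plus query string.

    Sketch of the wildcard rules only: '*' is a wildcard, a trailing '$'
    anchors the end of the URL, and every other character is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and join the pieces with ".*" for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives robots.txt prefix semantics.
    return re.match(regex, path) is not None


# Hypothetical paths, not taken from the actual site.
samples = [
    "/rifle-scopes-c-21.html",
    "/rifle-scopes-c-21.html?sort=2a",
    "/rifle-scopes-c-21.html?page=2",
]
for rule in ("/*html?sort", "/*html?"):
    for path in samples:
        print(rule, path, rule_matches(rule, path))
```

Under these rules the narrower pattern only catches URLs whose query string starts with sort, while the general one catches any URL with a query string after html and leaves the plain .html pages alone.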