Exporting uncrawled URLs

dafalcon

Sep 30, 2009, 6:06:17 AM
to SOFTplus GSiteCrawler
Does anyone know any way to export URLs that have been found but are
waiting to be crawled?

I am crawling a huge site with millions of URLs, filtering out huge
numbers of them. I have crawled loads but there is a similar number
waiting to be crawled. This could take weeks/months so I'm looking for
a way to get hold of the URLs that are still in the queue to speed
things up.

Any suggestions would be most appreciated. Thanks in advance for any
help :-)

webado2

Sep 30, 2009, 8:10:16 AM
to SOFTplus GSiteCrawler
Sorry ... No.

You probably need a scripted sitemap generator that runs off the same
logic that generates the site's urls.

dafalcon

Sep 30, 2009, 8:43:02 AM
to SOFTplus GSiteCrawler
Thanks anyway, webado2.

Unfortunately that's not going to happen in the short term. Medium to
long term perhaps, once I've proven the value of a decent sitemap.

It's really frustrating to see 700,000 URLs queued in GSiteCrawler
and not be able to use them until they've been crawled. I'm convinced
there must be some way to extract them, possibly from the full project
file.

Christina S

Sep 30, 2009, 8:56:08 AM
to gsitec...@googlegroups.com
Well, they are possibly in the Access database that GSC uses, so maybe you
can export them from there yourself. But they would not have been filtered
by whatever is blocked in robots.txt or by on-page robots meta tags. They
are also not going to be ALL the urls the site has, because to find all urls
GSC must first crawl all urls.
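
The robots.txt part of that check can be done outside GSC. A minimal Python
sketch, assuming the queued urls have already been exported to a plain list
by some means (the thread does not confirm how); the function name, the
example.com urls and the sample rules are made up for illustration. It does
not cover on-page robots meta tags, which can only be seen by fetching each
page, and Python's standard parser does plain prefix matching only.

```python
from urllib.robotparser import RobotFileParser


def filter_allowed(urls, robots_txt, user_agent="*"):
    """Keep only the urls that the given robots.txt text allows."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; nothing is fetched here.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Hypothetical rules and urls, for illustration only.
ROBOTS = """\
User-agent: *
Disallow: /buy_now
Disallow: /reviews/
"""

queued = [
    "http://www.example.com/products/widget-1",
    "http://www.example.com/buy_now?id=1",
    "http://www.example.com/reviews/widget-1",
]

allowed = filter_allowed(queued, ROBOTS)
print(allowed)
```

Only urls that pass this check would be candidates for the sitemap; the
rest should be dropped rather than submitted.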

A bad sitemap (which includes urls that would be disallowed) is a lot worse
than no sitemap at all.

I don't know how your site is structured, but if it has a very clear
hierarchy, maybe you can build a sitemap of only the first couple of levels
in the hierarchy (by using filters in GSC, not in robots.txt) and produce
that sitemap, and let Googlebot find the lower level of the urls, as it
would anyway.
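
As a rough illustration of the "first couple of levels" idea, here is a
Python sketch that keeps only shallow urls and writes them as a sitemap. It
assumes depth can be read off the url path, which may not hold for every
site; `path_depth`, `shallow_sitemap` and the example.com urls are
hypothetical names, not anything GSC provides.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import escape


def path_depth(url):
    """Count non-empty path segments: '/' is 0, '/a/b/' is 2."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def shallow_sitemap(urls, max_depth=2):
    """Return sitemap XML listing only urls no deeper than max_depth."""
    kept = sorted({u for u in urls if path_depth(u) <= max_depth})
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in kept)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )


# Hypothetical urls, for illustration only.
sample = [
    "http://www.example.com/",
    "http://www.example.com/tools/",
    "http://www.example.com/tools/cabinets/",
    "http://www.example.com/tools/cabinets/model-42",
]
print(shallow_sitemap(sample, max_depth=2))
```

The deepest url is left out of the output on purpose; the crawler is
expected to discover it by following links from the listed pages.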

Christina
www.webado.net

Joe Germann

Sep 30, 2009, 9:02:28 AM
to gsitec...@googlegroups.com
I am a newbie to this stuff so maybe my simplistic approach to the same problem is helpful for your situation. Then again, maybe I have approached this all wrong.

I have an OSC-based eCommerce site and there is a community contribution that generates a sitemap, or actually several of them, pointed to by sitemapindex.xml.  This gets run once a day and is FTP'd to my server.  Upon examination it did not appear that Google liked indexing these sitemaps, so I started playing with gSiteCrawler.
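
For readers who have not seen the index arrangement, here is a small Python
sketch of the idea: split a long url list into files of at most 50,000
entries (the per-file limit in the sitemaps.org protocol) and build an index
that points at them. This is not the OSC contribution's code; the file
naming (sitemap1.xml, sitemap2.xml, ...) and the example.com base url are
assumptions for illustration.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file limit in the sitemaps.org protocol


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield consecutive slices of urls, each at most `size` long."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def sitemap_index(base_url, n_files):
    """Return index XML pointing at sitemap1.xml .. sitemapN.xml (assumed names)."""
    entries = "\n".join(
        f"  <sitemap><loc>{escape(base_url)}sitemap{i}.xml</loc></sitemap>"
        for i in range(1, n_files + 1)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>\n"
    )


# Hypothetical url list, for illustration only.
urls = [f"http://www.example.com/p/{i}" for i in range(120_001)]
parts = list(chunk(urls))
index_xml = sitemap_index("http://www.example.com/", len(parts))
print(len(parts), "sitemap files needed")
```

Each slice would then be written out as its own sitemap file, with the
index submitted in place of a single oversized sitemap.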

The first thing I noticed about gSiteCrawler was that it took forever to crawl my site. The next thing I noticed was a mess of URLs that were totally useless.  Things like "sort" lists, "buy_now", and "review" things that pointed to basically garbage pages, and lots and lots of product duplicates that were using different URL structures to point to the same (duplicate) page.

Figuring that this was not only a waste of gSiteCrawler's time, but also that of the SE Bots, I quickly started to play with both filters and robots.txt files.  It was a pain in the neck, but whenever I would find a junk sequence of URLs, I would stop gSiteCrawler, edit up a revised robots.txt file, filter the list, clear the crawl queue, and resume gSiteCrawler crawling.  It took a while, it was work not magic, but I now have the OSC generated sitemaps, and the gSiteCrawler generated sitemaps, and a finely tuned robots.txt file.
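
The "filter the list" step can be pictured with a short Python sketch. It
is only an approximation of what a substring filter does, not
gSiteCrawler's own implementation; the marker strings and the sample urls
are assumptions based on the junk types named above.

```python
# Hypothetical markers for the junk url types mentioned above.
JUNK_MARKERS = ("sort=", "buy_now", "review")


def drop_junk(urls, markers=JUNK_MARKERS):
    """Return the urls that contain none of the junk markers."""
    return [u for u in urls if not any(m in u.lower() for m in markers)]


# Made-up urls, for illustration only.
found = [
    "http://www.example.com/product_info.php?products_id=12",
    "http://www.example.com/index.php?cPath=3&sort=2a",
    "http://www.example.com/index.php?action=buy_now&products_id=12",
    "http://www.example.com/product_reviews.php?products_id=12",
]

clean = drop_junk(found)
print(clean)
```

Plain substring matching is blunt: a marker like "review" would also drop
a legitimate page with that word in its url, so each marker is worth
checking against the real url list before relying on it.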

Why two different sitemap generators?  Well, like I said in the beginning, I am a newbie and I want to make sure that the sitemap information I present is the best, most concise, and most efficient for any crawler to play with.  Adding a good robots.txt file to the equation can only help the SE Bots by letting them crawl what is important and not waste their precious time on my useless junk.

This was my approach and so far it seems that the SE Bots are liking it better.  I'll know with more analysis and tuning.

Regards,
Joe Germann

webado2

Sep 30, 2009, 9:14:28 AM
to SOFTplus GSiteCrawler
That is good advice for all sites, Joe, big and small. Especially
useful for very big sites.

Still, even when everything is set up perfectly, with all the filters in
place, the number of urls left to crawl may still end up being huge. That
assumes all efforts have been made to optimize the site: don't have a new
url for every tiny variation of the same item, but group them in one url
(for instance, 20 otherwise identical widgets varying only by size,
color and price should be on a single page rather than have their own
individual pages). Things like that.

There's no advantage in having a million urls for merely 10 000
distinct products, on the contrary.
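
One way to gauge how much of a url list is just variants of the same item
is to strip the variant parameters and count what remains. A Python sketch,
assuming the variations show up as query parameters named size and color
(a guess; real sites differ) and using made-up example.com urls:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that only select a variant of one product.
VARIANT_PARAMS = {"size", "color"}


def canonical(url, ignored=VARIANT_PARAMS):
    """Return the url with variant-only query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def collapse_variants(urls):
    """Return one canonical url per group of variants, in first-seen order."""
    seen = {}
    for u in urls:
        seen.setdefault(canonical(u), u)
    return list(seen)


# Made-up urls, for illustration only.
variants = [
    "http://www.example.com/widget?id=7&size=S&color=red",
    "http://www.example.com/widget?id=7&size=L&color=blue",
    "http://www.example.com/widget?id=8&size=M",
]
print(collapse_variants(variants))
```

This only measures the duplication; the fix suggested above is still to
serve the variants from a single page on the site itself.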

But as I said, I have no idea of the OP's website and whether any of
that applies.




dafalcon

Sep 30, 2009, 12:17:06 PM
to SOFTplus GSiteCrawler
Thanks for all your replies! Much appreciated.

The site is huge and I mean huge - over 10 million pages of unique
content. I'm filtering out all the stuff I don't want (all
duplication) but this has really slowed GSite down.

I'd rather have as many good URLs (i.e. including the ones in the
queue) in the sitemap as possible whilst I continue to crawl the site,
hence my desire to export the ones that haven't yet been crawled. I
plan to keep crawling them too to uncover more.

This sitemap is part of an ongoing strategy of improving the way the
site is crawled - so this is a very specific question about extracting
the uncrawled URLs rather than a request for sitemap help - thanks for
the advice anyway. I've looked into various ways of crawling the site
and sadly this has been the best way possible.

If anyone has ever managed to extract the uncrawled URLs I'd love to
hear how it was done.

Christina S

Sep 30, 2009, 8:12:04 PM
to gsitec...@googlegroups.com
10 million pages? Yikes!

Heck, what kind of site is it? Are you replicating the internet?

Christina
www.webado.net
