Sitemaps for large catalogs

Charlie Morris

Oct 24, 2019, 11:38:38 AM
to Blacklight Development
I see that there was some talk on this group about sitemaps several years ago (circa 2011). At the time, it seemed like the best advice for large catalogs was to have a friendly robots.txt. Generating sitemaps for more than 50,000 records raises a lot of issues, like the processing time for generating the sitemap(s) and keeping them up to date.

Does anyone have any advice or feedback on strategies for helping Google and others discover the items in a large Blacklight index? Is it worth attempting to generate a new sitemap, or is a permissive robots.txt probably the best bet? Or something else?

Thanks,
Charlie (Penn State)

Esmé Cowles

Oct 24, 2019, 11:46:51 AM
to blacklight-...@googlegroups.com
I definitely think it's worth the effort to get a sitemap working and kept fresh when you have a large index.

Our robots.txt (https://catalog.princeton.edu/robots.txt) bans basically everything that does a query to try to discourage robots from walking our site instead of using the sitemap, e.g.:

> User-agent: *
> Crawl-delay: 10
> Sitemap: https://catalog.princeton.edu/sitemap.xml.gz
> Disallow: /?q=*
> Disallow: /?f*
> Disallow: /*?q=*
> Disallow: /*?f*
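
A rough way to sanity-check which URLs rules like these cover is sketched below. It assumes the common crawler convention that `*` matches any run of characters and a trailing `$` pins the end of the URL; the function names and sample paths are made up for illustration, and individual crawlers may interpret the rules differently.

```python
import re

# The Disallow rules quoted above.
DISALLOW = ["/?q=*", "/?f*", "/*?q=*", "/*?f*"]


def rule_to_regex(rule):
    """Translate a robots.txt path rule into a regex anchored at the start.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the URL; everything else is literal.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query, rules=DISALLOW):
    """Return True if any rule matches the path plus query string."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in rules)


if __name__ == "__main__":
    for sample in ["/catalog/12345", "/?q=dogs", "/catalog?q=dogs",
                   "/catalog?f%5Bformat%5D%5B%5D=Book"]:
        print(sample, is_disallowed(sample))
```

Under this reading, record pages such as `/catalog/12345` stay crawlable while search and facet URLs are excluded. Note that the patterns only match when `q` or a parameter starting with `f` comes first in the query string.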

With that in place, we see Google/Bing/Yandex/etc. pretty much just fetching our individual resource pages. We've seen some badly-behaved bots continue to do queries, and we've done some things to block that:

1. Blocking deep paging (normal users don't page hundreds of pages into search results in our experience)
2. Blocking deep paging into facet lists (ditto)
3. Blocking any paging without a query

These have helped reduce the load from robots, and I think we've had only a single user report of an issue (one user was trying to download metadata).
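
The three rules above could be sketched as a simple request filter along these lines. This is only an illustration, not the code running on any of the sites in this thread: the cutoffs are invented, the parameter names `page`, `facet.page` and `q` are assumed, and "without a query" is taken to mean an empty or missing `q`.

```python
from urllib.parse import parse_qs

MAX_RESULT_PAGE = 50  # hypothetical cutoff for search result paging
MAX_FACET_PAGE = 20   # hypothetical cutoff for facet list paging


def _int_param(params, name, default=1):
    """Read an integer parameter, falling back to the default on bad input."""
    try:
        return int(params.get(name, [default])[0])
    except (TypeError, ValueError):
        return default


def should_block(query_string):
    """Return True if the request looks like robot-style deep paging."""
    params = parse_qs(query_string)  # blank values are dropped by default
    page = _int_param(params, "page")
    facet_page = _int_param(params, "facet.page")
    has_query = any(value.strip() for value in params.get("q", []))

    if page > MAX_RESULT_PAGE:        # 1. deep paging into results
        return True
    if facet_page > MAX_FACET_PAGE:   # 2. deep paging into facet lists
        return True
    if page > 1 and not has_query:    # 3. any paging without a query
        return True
    return False


if __name__ == "__main__":
    for qs in ["q=dogs&page=2", "q=dogs&page=500", "page=2"]:
        print(qs, should_block(qs))
```

In a real application the same checks would sit in front of the search controller and return an error status rather than a boolean.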

-Esmé
> --
> You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to blacklight-develo...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/blacklight-development/2702740b-86cb-4ea6-87e5-acee1ed2adab%40googlegroups.com.

Jack Reed

Oct 24, 2019, 11:48:34 AM
to blacklight-...@googlegroups.com

Charlie,

While not answering your question specifically, one of the outcomes from the Blacklight-LD meeting was a proof of concept of “on the fly” sitemap generation. This approach aims to solve both the “up to date” problem and the processing-time problem.

It provides on-demand sitemap generation: sitemaps are built dynamically using a performant selection of Solr docs based on a hash of the “id”.

A proof of concept is here: https://github.com/sul-dlss/SearchWorks/pull/2351
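
To make the hashing idea concrete, here is a minimal sketch that is separate from the code in the pull request. It assumes an MD5 hex digest of the id and a fixed-length hex prefix as the key for each sitemap "leaf"; the hash function, the prefix length and the function names are all assumptions for illustration. In a real deployment the hash would presumably be stored in the index so that a leaf can be fetched with a single prefix query instead of grouping in application code.

```python
import hashlib
from collections import defaultdict


def hashed_id(doc_id):
    """Hex digest of the document id (MD5 is an assumed choice)."""
    return hashlib.md5(doc_id.encode("utf-8")).hexdigest()


def leaf_prefixes(exponent):
    """All hex prefixes of the given length: 16 ** exponent leaves."""
    return [format(i, "0{}x".format(exponent)) for i in range(16 ** exponent)]


def bucket(doc_ids, exponent):
    """Group ids by the first `exponent` hex digits of their hash."""
    leaves = defaultdict(list)
    for doc_id in doc_ids:
        leaves[hashed_id(doc_id)[:exponent]].append(doc_id)
    return leaves


if __name__ == "__main__":
    ids = ["record-{}".format(n) for n in range(10_000)]
    leaves = bucket(ids, exponent=2)
    sizes = [len(docs) for docs in leaves.values()]
    print(len(leaves), min(sizes), max(sizes))
```

Because the hash spreads ids roughly evenly, every leaf holds a similar number of documents, and a document stays in the same leaf no matter when it was added to the index.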

 

This was worked on jointly by myself, @magibney, @agazzarini, and @netsensei. We haven’t rolled this out into production yet, but I am curious about others’ responses to this solution.

Best,

Jack


Charlie Morris

Oct 24, 2019, 12:02:06 PM
to blacklight-...@googlegroups.com
Esmé, we are blocking deep paging and deep paging in facets already, which is why I was feeling a bit easier about opening things up to crawlers. We also are adding a 10 second delay. So, feeling okay about respectful bots not causing unintended harm. Disrespectful bots are another story.

Can I ask about your sitemap generation? Are you generating hundreds of sitemaps on full indexes at index time? And then keeping them up-to-date as part of the add/update/delete process on your catalog? Looks like you all are using the sitemap_generator gem which I did find in simple Google searching prior to this email thread. Good to hear some affirmative feedback that it's worth it.

Thanks for the feedback.

-Charlie

Charlie Morris

Oct 24, 2019, 12:14:32 PM
to blacklight-...@googlegroups.com
Thanks Jack. I vaguely remember this work of yours (all of yours) and am glad to hear you chime in. It seems amazing that these sitemaps could be generated on the fly performantly! Have you tried it out locally with a big index yet? Is there a Solr version dependency? Also, it seems to me that this type of functionality would be a very nice addition to Blacklight itself (rather than a plugin).

-Charlie

Esmé Cowles

Oct 24, 2019, 12:26:35 PM
to blacklight-...@googlegroups.com
Charlie-

Yes, we update our sitemap weekly and we're currently at 294 sitemap segments. We're not doing anything to try to preserve existing ones, so we just write a whole new set of sitemap segments every week. In theory, I could imagine trying to keep older segments, and that might reduce the amount of robot traffic fetching records that hadn't changed, but that would be a lot more work (and deleting records would be harder to handle).
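
For readers unfamiliar with segmented sitemaps, the general shape of a full regeneration can be sketched as below. This is not how sitemap_generator is implemented; it only illustrates splitting a URL list into files of at most 50,000 entries (the limit in the sitemap protocol) and listing those files in a sitemap index. The function names and example.org URLs are placeholders.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit from the sitemap protocol


def segments(urls, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def urlset_xml(urls):
    """Serialize one sitemap segment as a <urlset> document."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def index_xml(segment_urls):
    """Serialize the sitemap index that points at every segment file."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in segment_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    urls = ["https://example.org/catalog/{}".format(n) for n in range(120_001)]
    parts = list(segments(urls))
    print([len(part) for part in parts])
    names = ["https://example.org/sitemap{}.xml".format(i + 1)
             for i in range(len(parts))]
    print(index_xml(names)[:80])
```

Rewriting every segment on each run keeps the logic this simple, which is the trade-off described above: no bookkeeping for deleted records, at the cost of regenerating everything.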

-Esmé

Charlie Morris

Nov 19, 2019, 10:51:46 AM
to Blacklight Development
Hello all,

I just wanted to report back that we implemented and deployed the on-demand sitemap solution that Jack brought up earlier in the thread; see https://github.com/psu-libraries/psulib_blacklight/commit/b536577e043cd2be3eea7a37008021a4b0839e77 if you are curious. We discovered a couple of optimizations and bugfixes along the way that are relevant to our local implementation, like specifying the lucene query parser in Solr (rather than, say, edismax) for the search at the show level, because it uses localParams.

We have been visited by the Googlebot, "leaf" renders took around 200-300ms, and all 7M+ records were discovered. Actual Google searches have been showing around 1000 to 2000 of our items thus far, up from only 6 hits a week ago. There has already been a small (read: very small) increase in traffic acquisition from Google.

We went with a "chunk target" (think of that as a "docs per leaf" number) of about 4,000, just to guarantee we'll always show 4,096 leaves. In reality, each leaf ends up referencing between 1700 and 2000 docs. There hasn't been any major change in latency on the server so far either. We'll keep an eye on it, but so far so good. Thanks to everyone for the advice and feedback.
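
The relationship between the chunk target and the leaf count can be worked through with a little arithmetic. The sketch below assumes leaves are keyed by a fixed number of hex digits, so the leaf count is always a power of 16; that is an assumption about the approach, not code taken from our commit, and the function name is made up.

```python
import math


def leaf_exponent(total_docs, chunk_target):
    """Smallest number of hex digits giving enough leaves for the target."""
    leaves_needed = math.ceil(total_docs / chunk_target)
    exponent = 1
    while 16 ** exponent < leaves_needed:
        exponent += 1
    return exponent


if __name__ == "__main__":
    total_docs = 7_000_000        # "7M+" records, rounded down
    chunk_target = 4_000          # the "docs per leaf" target
    exponent = leaf_exponent(total_docs, chunk_target)
    leaves = 16 ** exponent
    print(exponent, leaves, round(total_docs / leaves))
```

With 7 million documents and a target of 4,000, at least 1,750 leaves are needed. 16^2 = 256 is too few, so the next power of 16 is used: 16^3 = 4,096 leaves, or about 1,700 documents per leaf. That is why the real per-leaf numbers land well under the target.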

Best,
Charlie

Tom Cramer

Nov 19, 2019, 11:28:56 AM
to blacklight-...@googlegroups.com, Tom Cramer
Thanks for reporting back on PSU’s implementation, Charlie. If/when you see the number of Google referrals climb, it would be very interesting to hear about. 

Cheers, 

- Tom



Jack Reed

Jan 13, 2020, 9:29:17 AM
to blacklight-...@googlegroups.com

Charlie – all,

Based off of this work and the previous proof of concept work, we extracted this into a gem, blacklight_dynamic_sitemap (https://github.com/sul-dlss/blacklight_dynamic_sitemap) and just released v0.1.0. We are planning to ship this in our catalog SearchWorks and GeoBlacklight application EarthWorks in the near future.

Thanks to everyone who helped develop this collaborative solution. I wrote up a little blog post about this: https://www.jack-reed.com/2020/01/10/sitemaps-that-scale.html

Best,
Jack


cdmor...@gmail.com

Jan 6, 2021, 12:10:07 PM
to blacklight-...@googlegroups.com, Tom Cramer
In case there's still interest, just thought I'd mention that we're now seeing a somewhat more substantial number of referrals from Google and search engines overall (Bing, DuckDuckGo and Ecosia are the next biggest 3). Search has become our second highest source of traffic after "website" referrals (most entrances to our catalog come from our bento search results page); direct traffic was our second highest before. Originally we saw on the order of 20-40 visits a day originating from search engines, and now it's typically around 250. Google reports our index size as 106k (which is still a very small percentage of our catalog). It's funny how it grows: basically little change for weeks, and then all of a sudden a spike of 5-20k.

-Charlie
