How to prevent Google Search Appliance from over-generating?


Louise Gruenberg

Jun 14, 2016, 3:55:51 PM
to DSpace Technical Support
The American Library Association has an archive (https://alair.ala.org/) powered by DSpace Direct. We are attempting to implement federated search across about 100 web properties, which shouldn't have so much content that they exceed our license limit. ALAIR holds about 5,000 digital records, but the search appliance crawls about a quarter of a million URLs on the site; the /handle directory alone is responsible for 183,000+. Does anyone know what we would need to restrict to knock the crawl back to a reasonable level?

If you could also respond to lgrue...@ala.org (my work email), that would be great.

Luiz dos Santos

Jun 14, 2016, 4:13:01 PM
to DSpace Technical Support
Hi Louise,

    Actually, the Google bot doesn't tend to be the problem; it respects what you have in robots.txt (https://wiki.duraspace.org/display/DSDOC4x/Search+Engine+Optimization). However, in the worst case you can implement a servlet filter to block bots.
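For illustration, a robots.txt along these lines would keep well-behaved crawlers out of the faceted browse/search pages that multiply the URL count per item. The Disallow paths below are the usual DSpace (XMLUI) ones; verify them against your own installation before using this:

```
User-agent: *
# Block the browse/search facets that generate many URLs per item;
# item pages under /handle remain crawlable.
Disallow: /discover
Disallow: /search-filter
Disallow: /browse
```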

Best regards
Luiz
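The servlet-filter idea Luiz mentions could be sketched as below. This is a hedged illustration, not DSpace code: the class name and the blocked-crawler pattern are made up for the example, and in a real javax.servlet Filter this check would gate a response.sendError(403) inside doFilter:

```java
import java.util.regex.Pattern;

/** Sketch of the User-Agent check a bot-blocking servlet filter could apply. */
public class BotBlocker {

    // Illustrative pattern only: list the crawlers you actually want to block.
    private static final Pattern BLOCKED =
            Pattern.compile("(?i)gsa-crawler|badbot");

    /** Returns true when the request's User-Agent matches a blocked crawler. */
    public static boolean isBlockedBot(String userAgent) {
        return userAgent != null && BLOCKED.matcher(userAgent).find();
    }
}
```

In a filter, a matching request would be answered with 403 (or simply not passed down the chain) instead of being served, which stops crawlers that ignore robots.txt.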

 

--
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-tech...@googlegroups.com.
To post to this group, send email to dspac...@googlegroups.com.
Visit this group at https://groups.google.com/group/dspace-tech.
For more options, visit https://groups.google.com/d/optout.

Stuart A. Yeates

Jun 14, 2016, 4:21:29 PM
to Louise Gruenberg, DSpace Technical Support
We solved a related problem with:


cheers
stuart

--
...let us be heard from red core to black sky


Andrea Schweer

Jun 15, 2016, 4:38:05 AM
to Louise Gruenberg, DSpace Technical Support, lgrue...@ala.org
Hi,

We've had similar issues with GSA. There are two things that helped here:
Crawlers on the list in sitemap.xmap get special treatment with regard to the 'last modified' date of files (If-Modified-Since); if GSA isn't on that list, it will re-request the same files over and over again.

I just made a pull request for this: https://github.com/DSpace/DSpace/pull/1435
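The If-Modified-Since mechanism Andrea describes can be sketched as below. This is an illustration of the HTTP conditional-GET logic, not the code from the pull request; the class and method names are invented for the example:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Sketch of the conditional-GET check a crawler-aware handler performs. */
public class ConditionalGet {

    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.RFC_1123_DATE_TIME.withZone(ZoneOffset.UTC);

    /**
     * Returns 304 (Not Modified) when the client's If-Modified-Since date is
     * at or after the resource's last-modified time, 200 otherwise.
     * HTTP dates have one-second resolution, so truncate before comparing.
     */
    public static int status(String ifModifiedSince, Instant lastModified) {
        if (ifModifiedSince == null) {
            return 200; // unconditional request: always serve the file
        }
        Instant since = Instant.from(HTTP_DATE.parse(ifModifiedSince));
        Instant lm = lastModified.truncatedTo(ChronoUnit.SECONDS);
        return lm.isAfter(since) ? 200 : 304;
    }
}
```

When the server never answers 304 for a given crawler, that crawler re-downloads every file on every visit, which is exactly the re-request behaviour described above.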

cheers,
Andrea



-- 
Dr Andrea Schweer
Lead Software Developer, ITS Information Systems
The University of Waikato, Hamilton, New Zealand
+64-7-837 9120