How to prevent Google Search Appliance from over-generating?


Louise Gruenberg

Jun 14, 2016, 3:55:51 PM
to DSpace Technical Support
The American Library Association has an archive (https://alair.ala.org/) powered by DSpace Direct. We are attempting to implement federated search across about 100 web properties, which shouldn't have so much content that they exceed our license limit. ALAIR holds about 5,000 digital records, but the search appliance crawls about a quarter of a million URLs on the site; the /handle directory alone is responsible for 183,000+. Does anyone know what we would need to restrict to knock the crawl back to a reasonable level?

If you could also respond to lgrue...@ala.org (my work email), that would be great.

Luiz dos Santos

Jun 14, 2016, 4:13:01 PM
to DSpace Technical Support
Hi Louise,

    Actually, the Google bot doesn't tend to be the problem; it respects what you have in robots.txt (https://wiki.duraspace.org/display/DSDOC4x/Search+Engine+Optimization). However, in the worst case you can implement a servlet filter to block bots.
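For illustration, a robots.txt along these lines would keep well-behaved crawlers out of the faceted browse/search pages that multiply the URL count per item. The Disallow paths below are the usual DSpace (XMLUI) ones; verify them against your own installation before using this:

```
User-agent: *
# Block the browse/search facets that generate many URLs per item;
# item pages under /handle remain crawlable.
Disallow: /discover
Disallow: /search-filter
Disallow: /browse
```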

Best regards
Luiz
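The servlet-filter idea Luiz mentions could be sketched as below. This is a hedged illustration, not DSpace code: the class name and the blocked-crawler pattern are made up for the example, and in a real javax.servlet Filter this check would gate a response.sendError(403) inside doFilter:

```java
import java.util.regex.Pattern;

/** Sketch of the User-Agent check a bot-blocking servlet filter could apply. */
public class BotBlocker {

    // Illustrative pattern only: list the crawlers you actually want to block.
    private static final Pattern BLOCKED =
            Pattern.compile("(?i)gsa-crawler|badbot");

    /** Returns true when the request's User-Agent matches a blocked crawler. */
    public static boolean isBlockedBot(String userAgent) {
        return userAgent != null && BLOCKED.matcher(userAgent).find();
    }
}
```

In a filter, a matching request would be answered with 403 (or simply not passed down the chain) instead of being served, which stops crawlers that ignore robots.txt.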

 

--
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-tech...@googlegroups.com.
To post to this group, send email to dspac...@googlegroups.com.
Visit this group at https://groups.google.com/group/dspace-tech.
For more options, visit https://groups.google.com/d/optout.

Stuart A. Yeates

Jun 14, 2016, 4:21:29 PM
to Louise Gruenberg, DSpace Technical Support
We solved a related problem with:


cheers
stuart

--
...let us be heard from red core to black sky


Andrea Schweer

Jun 15, 2016, 4:38:05 AM
to Louise Gruenberg, DSpace Technical Support, lgrue...@ala.org
Hi,

We've had similar issues with GSA. There are two things that helped here:
Crawlers on the list in sitemap.xmap get special treatment with regard to the 'last modified' date of files (If-Modified-Since); if GSA isn't on that list, it will re-request the same files over and over again.

I just made a pull request for this: https://github.com/DSpace/DSpace/pull/1435
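The If-Modified-Since mechanism Andrea describes can be sketched as below. This is an illustration of the HTTP conditional-GET logic, not the code from the pull request; the class and method names are invented for the example:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Sketch of the conditional-GET check a crawler-aware handler performs. */
public class ConditionalGet {

    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.RFC_1123_DATE_TIME.withZone(ZoneOffset.UTC);

    /**
     * Returns 304 (Not Modified) when the client's If-Modified-Since date is
     * at or after the resource's last-modified time, 200 otherwise.
     * HTTP dates have one-second resolution, so truncate before comparing.
     */
    public static int status(String ifModifiedSince, Instant lastModified) {
        if (ifModifiedSince == null) {
            return 200; // unconditional request: always serve the file
        }
        Instant since = Instant.from(HTTP_DATE.parse(ifModifiedSince));
        Instant lm = lastModified.truncatedTo(ChronoUnit.SECONDS);
        return lm.isAfter(since) ? 200 : 304;
    }
}
```

When the server never answers 304 for a given crawler, that crawler re-downloads every file on every visit, which is exactly the re-request behaviour described above.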

cheers,
Andrea



-- 
Dr Andrea Schweer
Lead Software Developer, ITS Information Systems
The University of Waikato, Hamilton, New Zealand
+64-7-837 9120