Google search yields results in Spanish or French for English-language content described in AtoM


GR Mulcaster

Nov 15, 2017, 11:13:16 PM
to AtoM Users
While UTAS has been impressed with the increased discovery of items in AtoM, our project has thrown up some perplexing results, with the AtoM user interface sometimes defaulting to Spanish or French.
This extends to the links that Google's crawlers return, and also to links that are captured from AtoM and sent by email.

It also extends to pages that Google's crawlers indexed for the UTAS archival institution while the server was down, so the error messages in those results are in French or Spanish.

The intermittent outages experienced several weeks ago have been resolved, but the language response is a curious one.

We do have a small number of digital objects in the archive that are in Italian, and full transcriptions may be published at some later date once copyright issues are resolved, but the Archival Descriptions remain entirely in English.
Obviously, we do not want to mislead the potential visitors or researchers about the language capabilities of our archival material. 
 

Is there a way of locking down the UI language to English?
Or does the Google crawler have a method of bypassing the language settings?

One Archival Description for which this occurs is:


Is it possible that the UI language was inadvertently toggled to Spanish or French when this Archival Description was first created?


regards
Glenn Mulcaster



Librarian (Access and Discovery)
University of Tasmania Library

Phone: +61 3 6324 3061

Email: Glenn.M...@utas.edu.au

Mail: Locked Bag 1312, Launceston TAS 7250 

Dan Gillean

Nov 16, 2017, 12:07:38 PM
to ICA-AtoM Users
Hi Glenn, 

This is interesting! When I went to the description example you provided, it was displaying in English for me. I also tried using the new public CSV export from the clipboard on your site, and when I looked at the item-level record, the culture listed was en.  I thought at first that the page served up might in part be determined by the default culture settings in a user's browser (our accesstomemory.org website does this, for example) - but after checking with our team, it seems that AtoM is not performing any end-user language detection at present. So.... it's not that. 

In terms of Google crawlers, I think the best short-term solution for you will be to add a robots.txt file to your AtoM instance with rules disallowing the culture URL extensions. If you're unfamiliar with the robots.txt protocol, here's some basic information: 
You'll find many more resources like this with a simple web search. 

You should be able to add a disallow rule for certain URL parameters - this thread should give you a starting place: 
This is untested, but for this particular use case, I think the rule would be something along the lines of: 
  • Disallow: /*?sf_culture=*
You'll probably need to do a bit more reading and testing to make sure my best guess is accurate, but that should get you started in the right direction. Hopefully this way, Google and other crawlers that respect the protocol will stop indexing other languages. Keep in mind that some crawlers are jerks - there's nothing that compels them to respect a robots.txt directive - but most of the big ones like Google will. 
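As a rough way to check a candidate rule like the one above before deploying it, here is a minimal Python sketch. It assumes the wildcard convention that the major crawlers document for robots.txt paths (`*` matches any run of characters, a trailing `$` anchors the end, and everything else is a prefix match). The function names and the sample URLs are hypothetical and are not part of AtoM. A real robots.txt file would also need a `User-agent:` line above the `Disallow:` rule.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed convention: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and the pattern otherwise matches as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern: str, path_and_query: str) -> bool:
    """Return True if the Disallow pattern would match this path and query."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    samples = [
        "/example-record",
        "/example-record?sf_culture=fr",
        "/example-record?sort=alphabetic&sf_culture=es",
    ]
    for rule in ("/*?sf_culture=*", "/*sf_culture="):
        for path in samples:
            print(rule, path, is_blocked(rule, path))
```

Under these assumptions, the suggested rule matches `?sf_culture=fr` when it is the first query parameter, but not when `sf_culture` follows another parameter (`&sf_culture=es`). A broader pattern such as `/*sf_culture=` would cover both cases, though it is equally untested against real crawlers.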

Some other things that may help - I'm not sure, but it's worth a try: 

First, in Admin > Settings > i18n Languages, you could remove all the other cultures that you are not actively using in your application. You'll want to re-index after any changes you make here. See:
If desired, you can even disable the Language menu completely if you are not using it, via the Default page elements settings. See:
Neither of these options will prevent someone from manually adding ?sf_culture=fr or another culture variable to the end of a URL. However, it may prevent more web crawlers (and users) from stumbling across these pages?

Finally, this is not directly related to your issue, but it might still be of interest - did you know that AtoM has a CLI task to help improve search engine optimization? See: 
If you make the changes above and add a robots.txt file to your site, then you might consider running this task using the --ping option. Essentially, this option will alert Google and Bing to the new sitemap files and ask for your site to be reindexed. Hopefully this will replace the current results that you are seeing, and those crawlers should respect any directives found in your robots file. 

There are ways we could improve this in the application itself, to avoid this kind of issue - for example, adding some kind of custom filter in the application that ignores ?sf_culture= parameter in certain circumstances - however, such a solution will require development. If UTAS is interested in pursuing this option, please feel free to contact me off-list, and I can coordinate with our developers to prepare some estimates for you. 

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


GR Mulcaster

Nov 29, 2017, 6:42:15 PM
to AtoM Users
Thanks Dan 

Tim Hutchinson

Nov 30, 2017, 5:41:51 PM
to AtoM Users
I remember this being reported at least once before. Another suggestion at the time, to try to deal with pages that have already been indexed, was to use mod_rewrite to redirect the requests for the non-English pages, with a 301 status code (moved permanently).
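
To make the redirect idea concrete, here is a minimal Python sketch of the URL mapping such a rule would need to perform. It is not mod_rewrite configuration, only an illustration of the logic: if a request carries an `sf_culture` value other than the default, answer with a 301 pointing at the same URL without that parameter. The function name, the `en` default, and the example host are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_redirect(url: str, default_culture: str = "en"):
    """Return (301, target_url) for a non-default culture request, else None.

    Hypothetical helper: it strips every sf_culture parameter and keeps the
    remaining query parameters in their original order.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    cultures = [value for key, value in params if key == "sf_culture"]
    if not cultures or all(c == default_culture for c in cultures):
        return None  # nothing to redirect
    kept = [(key, value) for key, value in params if key != "sf_culture"]
    target = urlunsplit(parts._replace(query=urlencode(kept)))
    return 301, target


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(canonical_redirect("https://atom.example.org/example-record?sf_culture=fr"))
    print(canonical_redirect("https://atom.example.org/example-record?sort=alphabetic&sf_culture=es"))
    print(canonical_redirect("https://atom.example.org/example-record"))
```

A permanent redirect of this kind gives crawlers that already hold the non-English URLs a signal to replace them with the English one, which a robots.txt rule alone would not do.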

Tim