Configuring Google Mini to properly crawl MediaWiki?
sprak
Oct 3, 2008, 11:56:53 AM
to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini
We currently have a Google Mini set up to crawl our MediaWiki sites;
however, the Mini is not keeping all the pages in its index. Here is
the behavior we are seeing: a page gets crawled by the Mini and is
placed in the index. The crawler moves along and eventually comes to a
MediaWiki page that links back to the first page. This second page is
flagged as 'nofollow' in its metadata; the Mini does not crawl this
page or add it to the index, and it removes the first page from the
index because of the link to it.
This is not what we want to happen, and we have tried to set up
patterns to exclude the nofollow pages. We are clearly not catching
them all, though, and I have not seen a guide floating around that
outlines the best patterns/practices for getting a Mini to play nice
and index the content of a MediaWiki site. We could edit the MW source
and just have it not output the nofollow metadata, but this is not an
optimal solution overall.
Does anyone know of a good guide, or can anyone post their best
patterns/practices? For reference, here are the patterns we are
currently using:
#wiki exclusion list
contains:/Special:
contains:/User:
contains:/Talk:
contains:_Talk
contains:_talk
contains:=history
contains:&diff=
contains:action=edit
contains:Recentchanges&
contains:=Talk
contains:skins
#end wiki exclusions
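One way to see which URLs a list like this lets through is to replay it offline against sample URLs. The following is only a rough Python sketch, not the Mini's own matcher: it assumes a `contains:` line matches whenever the string appears anywhere in the URL and that `#` lines are comments. The host `wiki.example.com`, the sample URLs, and the helper names `parse_patterns` and `is_excluded` are made up for illustration.

```python
# Sketch: replay a "contains:" exclusion list against sample URLs.
# Assumption: a "contains:" pattern matches if the string occurs anywhere
# in the URL. This is not the appliance's real matching code.

PATTERNS = """\
#wiki exclusion list
contains:/Special:
contains:/User:
contains:/Talk:
contains:_Talk
contains:_talk
contains:=history
contains:&diff=
contains:action=edit
contains:Recentchanges&
contains:=Talk
contains:skins
#end wiki exclusions
"""


def parse_patterns(text):
    """Return the substrings from 'contains:' lines, skipping comments and blanks."""
    substrings = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("contains:"):
            substrings.append(line[len("contains:"):])
    return substrings


def is_excluded(url, substrings):
    """True if any exclusion substring occurs in the URL."""
    return any(s in url for s in substrings)


EXCLUDES = parse_patterns(PATTERNS)

# Hypothetical URLs in the shapes MediaWiki commonly produces.
SAMPLE_URLS = [
    "http://wiki.example.com/index.php/Main_Page",
    "http://wiki.example.com/index.php/Special:RecentChanges",
    "http://wiki.example.com/index.php?title=Main_Page&action=edit",
    "http://wiki.example.com/index.php?title=Main_Page&oldid=123",
    "http://wiki.example.com/index.php?title=Special:Recentchanges",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        status = "excluded" if is_excluded(url, EXCLUDES) else "crawled"
        print(f"{status:9} {url}")
```

Under that assumption, the `oldid=` URL and the `title=Special:` form in the sample list are not caught by any of the patterns, which may be the kind of gap described above. Whether those pages carry the nofollow metadata on a given wiki would still need to be checked.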
Thanks.
- Luis Cruz
bria...@gmail.com
Oct 6, 2008, 3:14:20 AM
Hi Luis,
If I understand this issue correctly (please correct me if I am
wrong), you have a page that you want in the index. This page is
crawled, but later on, after the Mini crawls pages that link to it and
carry a NOFOLLOW tag, Crawl Diagnostics shows an error saying that the
page is rejected by robots. Is this correct? If so, there is a known
issue where this is incorrectly reported in Crawl Diagnostics. It is
only a reporting issue: the page should still be crawled and should
still show up correctly in search. Can you verify whether these pages
are actually in the index (i.e. do they come back in searches)? If
so, then you can ignore those messages.
Brian
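To check which pages actually carry a NOFOLLOW robots meta tag, the page HTML can be inspected directly. This is a minimal Python sketch using only the standard library; it assumes the tag has the usual `<meta name="robots" content="...">` form. The class name `RobotsMetaParser` and the function `robots_directives` are made up for illustration.

```python
# Sketch: list the robots directives declared in a page's meta tags.
# Assumption: the directives live in <meta name="robots" content="...">.
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from every <meta name="robots"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex,nofollow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
    print(sorted(robots_directives(sample)))
```

Feeding it the saved HTML of a page that links to a dropped page would show whether `nofollow` is present there at all.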
piggybankpie
Oct 9, 2008, 2:43:21 PM
Here are my exception rules for MediaWiki... what a mess!