For similar 'URL to crawl' configure different users (regexp)

Florian

Sep 30, 2008, 12:01:06 PM

to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini

Account 1 (look for compID=...), search for OPTION:
http://intranet.company.com/catalogue/?Search=1&AccID=&ErpAccID=&Account=&PostCode=&ReplaceLine=&CtrID=&compID=1&filterCat=OPTION

Account 2 (look for compID=...), search for OPTION:
http://intranet.company.com/catalogue/default.aspx?Search=1&AccID=&ErpAccID=&Account=&PostCode=&ReplaceLine=&CtrID=&compID=2&filterCat=OPTION

Is this possible?

Crawler access with account 1 for URLs of the type:
http://intranet.company.com/catalogue/?*wildcard*compID=1*wildcard*

Crawler access with account 2 for URLs of the type:
http://intranet.company.com/catalogue/?*wildcard*compID=2*wildcard*

By wildcard I mean a regexp, which I would appreciate your help with.

bria...@gmail.com

Sep 30, 2008, 9:47:08 PM

to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini

Hi Florian,

If I understand you correctly, it sounds like you are trying to set
Crawl and Index -> Crawler Access patterns using two different
accounts depending on the compID? For example, account1 has access to
URLs that have compID=1 and account2 has access to URLs that have
compID=2 in the URL? Is this correct?

If so, then what you could use is something like

contains:compID=1 account1 domain password
contains:compID=2 account2 domain password
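To see what these two entries would select, here is a minimal Python sketch that treats `contains:` as a plain substring test. The `RULES` list, the `account_for` helper, the shortened query string, and the first-match-wins ordering are all assumptions made for illustration; the appliance's own matching may differ.

```python
# Hypothetical stand-in for "contains:" matching: a plain substring test.
# Rule order and first-match-wins behaviour are assumptions of this sketch.
RULES = [
    ("compID=1", "account1"),
    ("compID=2", "account2"),
]


def account_for(url):
    """Return the first account whose substring occurs in the URL, else None."""
    for needle, account in RULES:
        if needle in url:
            return account
    return None


# Shortened form of the URLs from the original question.
base = "http://intranet.company.com/catalogue/?Search=1&CtrID=&compID={}&filterCat=OPTION"

print(account_for(base.format(1)))   # account1
print(account_for(base.format(2)))   # account2
print(account_for(base.format(10)))  # account1, because "compID=1" is a prefix of "compID=10"
```

If the compID values ever reach two digits, a substring such as `compID=1&` (or a regexp) would be needed to keep the entries from overlapping.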

If my understanding is not correct, can you give a couple examples of
what you are trying to do?

Brian

On Oct 1, 1:01 am, Florian <florian.bl...@gmail.com> wrote:
> account 1(look for compID=...) search for OPTION
> http://intranet.company.com/catalogue/?Search=1&AccID=&ErpAccID=&Acco...
>
> account 2(look for compID=...) , search for OPTION
> http://intranet.company.com/catalogue/default.aspx?

Florian

Oct 1, 2008, 3:15:34 AM

to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini

Thanks for your answer, Brian.
I tried something like this:

regexpIgnoreCase:http://intranet\\.company\\.com/catalogue/.*(compID=X).*

In place of X I put each compID, creating 3 crawler access entries and
3 collections.
I uploaded three starting URLs in a web feed with NTLM auth.
The idea is that there are 3 types of users who see different
content depending on their credentials.
The problem is that a cookie is needed, and with such similar URLs you
don't know which cookie is being used at any given time.
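As a sanity check on the per-compID pattern itself, the sketch below tries it against the two example URLs from the first post using Python's `re` module. This is not the appliance's regexp engine: Python needs only a single backslash before each dot, `re.IGNORECASE` stands in for `regexpIgnoreCase:`, and the helper names and the ID set (1, 2, 3) are assumptions for illustration.

```python
import re


def access_pattern(comp_id):
    """Python spelling of the crawler access pattern for one compID."""
    return re.compile(
        r"http://intranet\.company\.com/catalogue/.*(compID="
        + re.escape(str(comp_id))
        + r").*",
        re.IGNORECASE,  # stands in for the regexpIgnoreCase: prefix
    )


# The two example URLs from the original question.
URLS = [
    "http://intranet.company.com/catalogue/?Search=1&AccID=&ErpAccID=&Account="
    "&PostCode=&ReplaceLine=&CtrID=&compID=1&filterCat=OPTION",
    "http://intranet.company.com/catalogue/default.aspx?Search=1&AccID=&ErpAccID="
    "&Account=&PostCode=&ReplaceLine=&CtrID=&compID=2&filterCat=OPTION",
]


def matching_ids(url, ids=(1, 2, 3)):
    """Return every compID whose pattern matches the URL from its start."""
    return [i for i in ids if access_pattern(i).match(url)]


for url in URLS:
    print(matching_ids(url))
# prints [1] and then [2]
```

Each example URL is claimed by exactly one pattern here, which suggests the pattern selection is not the source of the ambiguity; it says nothing about how cookies are handled once the pages are fetched.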

I keep getting errors:
Retrying URL: Host unreachable while trying to fetch robots.txt.
This happens even though there is a robots.txt that allows everything.
While testing, I also cannot delete the web feed; there is no button
for it.
I'm going crazy here.
