Specify a port number - Crawl and Index

Thomas Henry

Aug 31, 2010, 1:02:56 PM

to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini

I read in another forum that the Mini automatically assumes standard
HTTP ports unless you specify otherwise in the Crawl URLs. I have a
folder inside of a Tomcat 5.5 instance that I would like to crawl and
index, and that folder is SSL secured.

When I try to enter the specific port number and then select Save
URLs to Crawl, the Mini strips the port number from the URL and then
the folder cannot be resolved. For example:

https://www.mydomain.com:443/googlemini/

I need to be able to specify the port number. Is there another way to
do this?

Also, if I do get the Crawl URL to point to the proper port, will the
results of a query have the proper port in their links as well?

Thanks.

JMarkham

Aug 31, 2010, 1:20:51 PM

to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini

The :443 is stripped from that particular URL because it's
redundant. Port 443 is the standard HTTPS port.

If your server MUST have the port number for some odd reason, then try
eliminating the protocol from your crawl entry, as in www.mydomain.com:443/googlemini/.

Jeff

Dave Watts

Aug 31, 2010, 1:42:38 PM

to google-search-...@googlegroups.com

> I read in another forum that the Mini automatically assumes standard
> HTTP ports unless you specify otherwise in the Crawl URLs. I have a
> folder inside of a Tomcat 5.5 instance that I would like to crawl and
> index, and that folder is SSL secured.
>
> When I try and enter the specific port number, and then select Save
> URLs to Crawl, the Mini strips the port number from the URL and then
> the folder can not be resolved. I.E.:
>
> https://www.mydomain.com:443/googlemini/
>
> I need to be able to specify the port number, is there another way to
> do this?

As Jeff said, TCP/443 is the default port for HTTPS, so you don't have
to mention it. If you're using a non-standard HTTPS port, you should
be able to specify it the same way, though.
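To illustrate the behavior described here, the following is a minimal sketch of default-port normalization. It is not the Mini's actual code; the function name `strip_default_port` and the `DEFAULT_PORTS` table are hypothetical, and the sketch ignores user info and IPv6 hosts for brevity. It only shows why `:443` disappears from an `https` URL while a non-standard port such as `8443` (an example value, not from this thread) would be kept.

```python
from urllib.parse import urlsplit, urlunsplit

# Well-known default ports per scheme (hypothetical helper table).
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Drop the port from a URL when it is the default for its scheme."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        netloc = host  # the port is redundant, so leave it out
    else:
        netloc = f"{host}:{port}"  # a non-standard port must be kept
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(strip_default_port("https://www.mydomain.com:443/googlemini/"))
    print(strip_default_port("https://www.mydomain.com:8443/googlemini/"))
```

Both forms of the first URL point at the same resource, so a crawler that normalizes this way loses nothing by dropping `:443`.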

> Also, if I do get the Crawl URL to point to the proper port, will the
> results of a query have the proper port in their links as well?

Yes.

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/

Thomas Henry

Aug 31, 2010, 3:14:47 PM

to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini

Thank you both. You were correct.

One more question: I changed the shared folder's name from
googlemini to googlemini2, and Crawl Diagnostics reports that it has
found, crawled, and indexed all documents in the new shared folder.
Why do all of my search results still refer to googlemini?

Is this a caching issue?

brianb

Sep 1, 2010, 1:44:56 AM

to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini

If it is the same data, then the new URLs are probably getting filtered
out as duplicates. I would remove one of those folders from the
Follow and Crawl URLs, give the GSA about an hour to remove the
old URLs, and try again.
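As a rough illustration of why duplicate filtering could keep the old path in results, here is a minimal sketch of content-based deduplication. This is an assumption for illustration only, not the appliance's documented algorithm: the function `filter_duplicates`, the use of SHA-256, and the "first URL seen wins" rule are all hypothetical.

```python
import hashlib


def filter_duplicates(pages):
    """Return the URLs kept after dropping pages with identical bodies.

    `pages` is a list of (url, body) tuples in crawl order. The first URL
    seen for each distinct body is kept (an assumed rule, for illustration).
    """
    kept = {}
    for url, body in pages:
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        kept.setdefault(digest, url)  # later duplicates are ignored
    return list(kept.values())


if __name__ == "__main__":
    pages = [
        ("https://www.mydomain.com/googlemini/doc.html", "same content"),
        ("https://www.mydomain.com/googlemini2/doc.html", "same content"),
    ]
    print(filter_duplicates(pages))
```

Under these assumptions only the googlemini URL survives, which matches the symptom in the question. Once the old folder is no longer crawled, only the googlemini2 copy remains.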

Brian
