I know about robots.txt ... I have robots.txt on this server and I've verified through the Google Webmaster tools that my robots.txt has been read. What I would like to know, and stop, is why googlebot is downloading files, through FTP, from my ftp server: ftp.utexas.edu/ ftp.the.net.
Here's a sample from my xferlog:
Fri Mar 9 10:07:39 2007 8 66.249.72.107 966294 /ftp2/ubuntu/pool/universe/a/axiom/axiom-databases_20050901-1ubuntu1_all.deb b _ o a [email address] ftp 0 * c
Fri Mar 9 10:07:57 2007 1 66.249.66.196 13687 /ftp2/freebsd/ports/amd64/packages-7-current/All/jpegoptim-1.2.2.tbz b _ o a [email address] ftp 0 * c
Fri Mar 9 10:08:12 2007 1 66.249.72.107 897 /ftp1/slackware/slackware-current/source/n/netwatch/netwatch.SlackBuild b _ o a [email address] ftp 0 * c
Fri Mar 9 10:08:32 2007 1 66.249.66.196 149444 /ftp2/ubuntu/pool/universe/x/xbanner/xbanner_1.31-23_powerpc.deb b _ o a [email address] ftp 0 * c
Fri Mar 9 10:08:46 2007 1 66.249.66.196 7924 /ftp1/opendarwin/darwinsource/projects/other/postfix-147/postfix/src/util/safe_open.c b _ o a [email address] ftp 0 * c
Fri Mar 9 10:08:52 2007 1 66.249.72.107 16928 /ftp2/freebsd/ports/amd64/packages-5-stable/All/whirlgif-3.04.tbz b _ o a [email address] ftp 0 * c
Fri Mar 9 10:09:32 2007 1 66.249.72.107 46050 /ftp2/freebsd/ports/amd64/packages-5-stable/All/ripit-3.5.1.tbz b _ o a [email address] ftp 0 * c
Fri Mar 9 10:09:40 2007 1 66.249.66.196 11476 /ftp1/opendarwin/darwinsource/projects/apsl/diskdev_cmds-143/fsck_hfs.tproj/dfalib/SKeyCompare.c b _ o a [email address] ftp 0 * c
There doesn't seem to be any rhyme or reason to what the googlebots are downloading. Any information would be appreciated.
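For anyone who wants to tally transfers like these per client address, here is a minimal Python sketch. It assumes the standard xferlog field order (five timestamp tokens, transfer time, remote host, size, filename, then nine trailing fields). The function names, the `66.249.` prefix filter, and the `anon@example.com` username in the sample are illustrative assumptions (the real username is redacted in the log above), so treat this as a starting point rather than a finished tool.

```python
from collections import defaultdict

# Three entries modelled on the log above; the username is a placeholder
# because the real one is redacted in the original post.
SAMPLE = [
    "Fri Mar 9 10:07:39 2007 8 66.249.72.107 966294 /ftp2/ubuntu/pool/universe/a/axiom/axiom-databases_20050901-1ubuntu1_all.deb b _ o a anon@example.com ftp 0 * c",
    "Fri Mar 9 10:07:57 2007 1 66.249.66.196 13687 /ftp2/freebsd/ports/amd64/packages-7-current/All/jpegoptim-1.2.2.tbz b _ o a anon@example.com ftp 0 * c",
    "Fri Mar 9 10:08:12 2007 1 66.249.72.107 897 /ftp1/slackware/slackware-current/source/n/netwatch/netwatch.SlackBuild b _ o a anon@example.com ftp 0 * c",
]


def parse_xferlog_line(line):
    """Return a dict of the fields we care about, or None if the line is malformed."""
    parts = line.split()
    # 5 timestamp tokens + time + host + size + filename + 9 trailing fields
    if len(parts) < 18:
        return None
    try:
        size = int(parts[7])
    except ValueError:
        return None
    return {
        "host": parts[6],
        "bytes": size,
        # Count the trailing fields from the end so a filename containing
        # spaces does not shift them.
        "path": " ".join(parts[8:-9]),
        "direction": parts[-7],  # "o" = outgoing (a download), "i" = incoming
    }


def tally_downloads(lines, host_prefix="66.249."):
    """Map each matching client address to (files downloaded, total bytes)."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        entry = parse_xferlog_line(line)
        if entry is None or entry["direction"] != "o":
            continue
        if not entry["host"].startswith(host_prefix):
            continue
        totals[entry["host"]][0] += 1
        totals[entry["host"]][1] += entry["bytes"]
    return {host: tuple(counts) for host, counts in totals.items()}


if __name__ == "__main__":
    for host, (files, total) in sorted(tally_downloads(SAMPLE).items()):
        print(host, files, total)
```

To run it against a real log, pass the open file object instead of `SAMPLE`. Lines whose username field contains spaces would need extra handling.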
These aren't HTTP requests, they're FTP. The log samples are from /var/log/xferlogs, which is kept by vsftpd. Looking at the past week I see these googlebot addresses downloading files:
> Sure these are HTTP requests? Your robots.txt disallows Googlebot correctly. The first IP I've checked is indeed a crawler, not a Googler.
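One common way to check whether an address really belongs to a Google crawler is a reverse DNS lookup followed by a forward lookup that must map back to the same address. The Python sketch below illustrates that idea. The function name, the suffix list, and the injectable `reverse`/`forward` resolver parameters are assumptions added here so the logic can be exercised without network access. It is not necessarily how the check mentioned above was done.

```python
import socket

# Domains that Google crawler hostnames are expected to end with.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=None, forward=None):
    """Reverse-resolve ip, check the domain, then confirm with a forward lookup.

    reverse(ip) should return a hostname and forward(hostname) a list of IPv4
    addresses. Both default to live DNS lookups through the socket module.
    """
    if reverse is None:
        reverse = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward is None:
        forward = lambda name: socket.gethostbyname_ex(name)[2]

    try:
        host = reverse(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    # The leading dot in each suffix rejects names like "evilgooglebot.com".
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        return ip in forward(host)
    except OSError:
        return False
```

Calling `is_verified_googlebot("66.249.72.107")` with no extra arguments performs the live lookups. The forward step matters because anyone who controls the reverse zone for their own address range can make it return a Google-looking name.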
> On Mar 9, 5:18 pm, O wrote:
Actually, looking through my AOL search database, there are very few "clicks" on ftp:// URLs (perhaps 15?). Could it be that there is an FTP "service" out there that is redirecting HTTP requests to your FTP server? Perhaps a "file-search" service whose HTTP links are automatically redirected to the "closest" FTP server? I imagine that if Googlebot stumbled into a service like that, it might get tricked into downloading via FTP (but it would be strange if they didn't check for that and halt the crawl in those cases...hmm)
What about Google Code Search? They trawl FTP sites, download packages and index the contents of the files. I don't think they have a separate crawler user agent... but which robots.txt would they respect?