It is great to be able to set meta tags via the HTTP header!
But Sebastian, what does index,nofollow mean for a PDF file? Which links would be collected from a PDF file if you set index,follow for it in the HTTP header?
I knew someone would point that out. It's unusual, at least. If that PDF is a marketing study linking out to the competition, it probably even makes sense, hehe. Good fetch :) Actually, you can use noindex,noarchive,nosnippet too, making it even more of a dangling node. Everything valid in a robots meta tag can be stuffed into an X-Robots-Tag. Sebastian
> It is great to be able to set meta tags via the HTTP header!
> But Sebastian, what does index,nofollow mean for a PDF file? Which links would be collected from a PDF file if you set index,follow for it in the HTTP header?
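As a minimal sketch of the idea discussed here (sending robots directives in the HTTP header for a PDF), the Python function below builds the extra response header. The function name `robots_headers_for`, the directive list, and the example paths are assumptions made for illustration only; a real site would more likely set the header in its web server configuration.

```python
# Hypothetical helper: decide which X-Robots-Tag header to send for a URL path.
# The directive list covers only the values mentioned in this thread.
KNOWN_DIRECTIVES = {"index", "noindex", "follow", "nofollow", "noarchive", "nosnippet"}


def robots_headers_for(path, directives=("index", "nofollow")):
    """Return extra response headers for `path`; only PDF files get an X-Robots-Tag."""
    if not path.lower().endswith(".pdf"):
        return {}
    unknown = [d for d in directives if d not in KNOWN_DIRECTIVES]
    if unknown:
        raise ValueError("unknown robots directive(s): " + ", ".join(unknown))
    return {"X-Robots-Tag": ", ".join(directives)}


# Example paths are made up for the demonstration.
print(robots_headers_for("/studies/market-report.pdf"))
print(robots_headers_for("/index.html"))
```

The default of index,nofollow mirrors the combination asked about above; passing `("noindex", "noarchive", "nosnippet")` would give the stricter variant Sebastian mentions.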
I've been wondering ... would "nosnippet" remove the HTML version? or would you use "noarchive" for that? I could see a few applications where it would be great to be able to suppress the converted version.
Hi Sebastian, It is very interesting that you mentioned nofollow for PDF files, a new possibility created by setting meta tags via HTTP headers.
Some PDF files do not have well designed hyperlinks in them, some have URLs without hyperlinks, and your posting shows that it would be a good idea to re-consider the way some PDF files are done for better embedded links, maybe even to have some sort of site navigation. I assume that the links appearing in 'view as html' in the search results for a PDF file are the links collected by Googlebot like from any HTML file.
> I knew someone would point that out. It's unusual, at least. If that PDF is a marketing study linking out to the competition, it probably even makes sense, hehe. Good fetch :) Actually, you can use noindex,noarchive,nosnippet too, making it even more of a dangling node. Everything valid in a robots meta tag can be stuffed into an X-Robots-Tag. Sebastian
My choice would be nosnippet; it should remove the snippet and its extension, the view-as-HTML link. However, using both crawler directives should certainly remove it. I'd really like to know for sure ... Sebastian
> I've been wondering ... would "nosnippet" remove the HTML version? or would you use "noarchive" for that? I could see a few applications where it would be great to be able to suppress the converted version.
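To check which of the two directives under discussion a given file actually sends, a header value can be split into its parts. This is only a sketch: `parse_x_robots_tag` and `preview_directives` are hypothetical names, the sample value is made up, and the parser assumes a plain comma-separated value without a user-agent prefix. Whether either directive really suppresses the HTML version is the open question of this thread, so the code makes no claim about that.

```python
def parse_x_robots_tag(value):
    """Split an X-Robots-Tag value such as 'noarchive, nosnippet' into a set of directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def preview_directives(value):
    """Return the directives in `value` that this thread suspects may affect previews."""
    return parse_x_robots_tag(value) & {"noarchive", "nosnippet"}


# Sample header value, invented for the demonstration.
print(sorted(preview_directives("index, NoArchive, nosnippet")))
```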
But isn't the 'view as HTML' link a feature for all PDF files in search results? As far as I know nosnippet prevents the display of the snippet in search results, not the crawling and conversion to HTML of PDF files.
> My choice would be nosnippet; it should remove the snippet and its extension, the view-as-HTML link. However, using both crawler directives should certainly remove it. I'd really like to know for sure ... Sebastian
> On Aug 2, 9:26 am, JohnMu wrote:
> > I've been wondering ... would "nosnippet" remove the HTML version? or would you use "noarchive" for that? I could see a few applications where it would be great to be able to suppress the converted version.
Yep, that's why we aren't sure. The HTML version is something like a huge snippet; both snippets and HTML extracts are previews. Noarchive OTOH just makes Google's fetched copy of the file unviewable. Technically one could argue that Google can't/shouldn't/won't transform unviewable contents into HTML previews, but from a searcher's perspective grouping the snippet and the HTML version under preview makes more sense. That's why I vote for nosnippet, unless Google invents NOPREVIEW or NOTRANSFORM to suppress HTML versions of PDFs, transcripts from vids, text/link excerpts from flash, additional info from jpegs ...
> But isn't the 'view as HTML' link a feature for all PDF files in search results? As far as I know nosnippet prevents the display of the snippet in search results, not the crawling and conversion to HTML of PDF files.
> On Aug 2, 1:33 pm, Sebastian wrote:
> > My choice would be nosnippet; it should remove the snippet and its extension, the view-as-HTML link. However, using both crawler directives should certainly remove it. I'd really like to know for sure ... Sebastian
When you look at the 'view as HTML' page for a PDF file you see the message that Googlebot automatically generates HTML versions of documents as it crawls the web. I presume this means that a PDF document is automatically transformed into an HTML document to be crawled, and that content is extracted from it for the search result page (title, links collected, etc.) as for any HTML document, even when nosnippet or noarchive are specified?
> Yep, that's why we aren't sure. The HTML version is something like a huge snippet; both snippets and HTML extracts are previews. Noarchive OTOH just makes Google's fetched copy of the file unviewable. Technically one could argue that Google can't/shouldn't/won't transform unviewable contents into HTML previews, but from a searcher's perspective grouping the snippet and the HTML version under preview makes more sense. That's why I vote for nosnippet, unless Google invents NOPREVIEW or NOTRANSFORM to suppress HTML versions of PDFs, transcripts from vids, text/link excerpts from flash, additional info from jpegs ...
> Sebastian
> On Aug 2, 2:49 pm, cristina wrote:
> > But isn't the 'view as HTML' link a feature for all PDF files in search results? As far as I know nosnippet prevents the display of the snippet in search results, not the crawling and conversion to HTML of PDF files.
Maybe; it would make sense to store contents in a somewhat unified format, but we're talking about the *linked* HTML version available from the SERP. When you choose noarchive for an HTML page, it removes the "cached" link and the call from the toolbar as well. That does not mean that Google didn't keep a copy ;) BTW looking at the HTML version of a PDF might lead to ideas on optimizing the original for the engines ... Sebastian
> When you look at the 'view as HTML' page for a PDF file you see the message that Googlebot automatically generates HTML versions of documents as it crawls the web. I presume this means that a PDF document is automatically transformed into an HTML document to be crawled, and that content is extracted from it for the search result page (title, links collected, etc.) as for any HTML document, even when nosnippet or noarchive are specified?
> On Aug 2, 2:38 pm, Sebastian wrote:
> > Yep, that's why we aren't sure. The HTML version is something like a huge snippet; both snippets and HTML extracts are previews. Noarchive OTOH just makes Google's fetched copy of the file unviewable. Technically one could argue that Google can't/shouldn't/won't transform unviewable contents into HTML previews, but from a searcher's perspective grouping the snippet and the HTML version under preview makes more sense. That's why I vote for nosnippet, unless Google invents NOPREVIEW or NOTRANSFORM to suppress HTML versions of PDFs, transcripts from vids, text/link excerpts from flash, additional info from jpegs ...
> > Sebastian
Some of the search results shown at http://blogsci.com/randoms/academic-publishers-as-spammers show a full snippet but suppress the HTML version of the PDF. I wonder how they're doing that (besides the fact that they're cloaking to Google and trying to take out visitors hoping to view the PDFs)...