It is great to be able to set meta tags via the HTTP header!
But Sebastian, what does index,nofollow mean for a PDF file? What links would be collected from a PDF file if you had set index,follow for it in the HTTP header?
I knew someone would point that out. It's unusual, at least. If that PDF is a marketing study linking out to the competition, it probably even makes sense, hehe. Good fetch :) Actually, you can use noindex,noarchive,nosnippet too, making it an even more dangling node. Everything valid in a robots meta tag can be stuffed into an X-Robots-Tag.
Sebastian
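For anyone who wants to try this, here is a minimal Python sketch of how a server-side script might build the header discussed above. The function names, the example path, and the "PDFs get index,nofollow" policy are hypothetical, and the directive set only covers the values mentioned in this thread; it is an illustration, not how any particular server does it.

```python
# Directives mentioned in this thread. The list a search engine actually
# accepts may be longer; this set is an assumption made for the sketch.
KNOWN_DIRECTIVES = {"index", "noindex", "follow", "nofollow", "noarchive", "nosnippet"}


def x_robots_tag(directives):
    """Return a (header name, header value) pair for the given directives."""
    cleaned = [d.strip().lower() for d in directives]
    unknown = [d for d in cleaned if d not in KNOWN_DIRECTIVES]
    if unknown:
        raise ValueError(f"unknown directive(s): {unknown}")
    return ("X-Robots-Tag", ", ".join(cleaned))


def headers_for(path):
    """Hypothetical policy: PDFs get index,nofollow; other files get no extra headers."""
    headers = []
    if path.lower().endswith(".pdf"):
        headers.append(("Content-Type", "application/pdf"))
        headers.append(x_robots_tag(["index", "nofollow"]))
    return headers


print(headers_for("/studies/market-report.pdf"))
```

The same helper accepts the stricter combination too, e.g. `x_robots_tag(["noindex", "noarchive", "nosnippet"])`.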
I've been wondering ... would "nosnippet" remove the HTML version? or would you use "noarchive" for that? I could see a few applications where it would be great to be able to suppress the converted version.
Hi Sebastian, it is very interesting that you mentioned nofollow for PDF files, a new possibility created by setting meta tags via HTTP headers.
Some PDF files do not have well-designed hyperlinks in them, and some have URLs without hyperlinks. Your posting shows that it would be a good idea to reconsider how some PDF files are built, so that they have better embedded links, maybe even some sort of site navigation. I assume that the links appearing in 'view as HTML' in the search results for a PDF file are the links collected by Googlebot, as from any HTML file.
My choice would be nosnippet; it should remove the snippet and its extension, the view-as-HTML link. However, using both crawler directives should certainly remove it. I'd really like to know for sure ...
Sebastian
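To check what a server actually sends, the header value can be split into its directives. The sketch below is a rough illustration with hypothetical function names; it assumes a plain comma-separated value and ignores any other syntax a real header might carry. It only tells you whether both directives are present, not what Google does with them, which is exactly the open question here.

```python
def parse_x_robots_tag(value):
    """Split a value such as 'noarchive, nosnippet' into a set of lowercase directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def sends_both_preview_directives(value):
    """True if the header carries both nosnippet and noarchive."""
    return {"nosnippet", "noarchive"} <= parse_x_robots_tag(value)


print(sends_both_preview_directives("index, NoSnippet, noarchive"))
print(sends_both_preview_directives("index, nofollow"))
```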
But isn't the 'view as HTML' link a feature for all PDF files in search results? As far as I know nosnippet prevents the display of the snippet in search results, not the crawling and conversion to HTML of PDF files.
> On Aug 2, 9:26 am, JohnMu wrote:
> > I've been wondering ... would "nosnippet" remove the HTML version? or would you use "noarchive" for that? I could see a few applications where it would be great to be able to suppress the converted version.
Yep, that's why we aren't sure. The HTML version is something like a huge snippet; both snippets and the HTML extract are previews. Noarchive, OTOH, just makes Google's fetched copy of the file unviewable. Technically one could argue that Google can't/shouldn't/won't transform unviewable contents into HTML previews, but from a searcher's perspective grouping the snippet and the HTML version under "preview" makes more sense. That's why I vote for nosnippet, unless Google invents NOPREVIEW or NOTRANSFORM to suppress HTML versions of PDFs, transcripts from vids, text/link excerpts from Flash, additional info from JPEGs ...
> On Aug 2, 1:33 pm, Sebastian wrote:
> > My choice would be nosnippet, it should remove the snippet and its extension, the view-as HTML link. However, using both crawler directives should certainly remove it. I really want to know it for sure ...
> > Sebastian
When you look at the 'view as HTML' page for a PDF file, you see the message that Googlebot automatically generates HTML versions of documents as it crawls the web. I presume this means that a PDF document is automatically transformed into an HTML document to be crawled, with content extracted from it for the search result page (title, collected links, etc.) as for any HTML document, even when nosnippet or noarchive is specified?
> On Aug 2, 2:49 pm, cristina wrote:
> > But isn't the 'view as HTML' link a feature for all PDF files in search results? As far as I know nosnippet prevents the display of the snippet in search results, not the crawling and conversion to HTML of PDF files.
Maybe; it would make sense to store contents in a somewhat unified format, but we're talking about the *linked* HTML version available from the SERP. When you choose noarchive for an HTML page, it removes the "cached" link and the call from the toolbar as well. That does not mean that Google didn't keep a copy ;) BTW, looking at the HTML version of a PDF might lead to ideas on optimizing the original for the engines ...
Sebastian
> On Aug 2, 2:38 pm, Sebastian wrote:
> > Yep, that's why we aren't sure. The HTML version is something like a huge snippet, both snippets and HTML extract are previews. Noarchive OTOH just makes Google's fetched copy of the file unviewable. Technically one could argue that Google can't/shouldn't/won't transform unviewable contents into HTML previews, but from a searchers perspective grouping the snippet and the HTML version under preview makes more sense. That's why I vote for nosnippet, unless Google invents NOPREVIEW or NOTRANSFORM to suppress HTML versions of PDFs, transcripts from vids, text/link excerpts from flash, additional info from jpegs ...
> > Sebastian
Some of the search results shown at http://blogsci.com/randoms/academic-publishers-as-spammers show a full snippet but suppress the HTML version of the PDF. I wonder how they're doing that (besides the fact that they're cloaking to Google and trying to take out visitors hoping to view the PDFs)...
To me, it doesn't matter if they're a part of Google Scholar - what are they doing in the main web search index like that? Paid content from the Google News Archives is indexed and accessible through Google News (and *tagged* as paid content), but it's not included in the main web search index. Why is this treated differently?
> From what I've gathered, there's not currently a way for webmasters to tell us not to include the "View as HTML" option. As you can imagine, though, we're pretty intent on improving our conversion/interpretation process.
Thanks for the comment, Adam :-)
I actually really like that "View as HTML" option, sometimes my PDF reader just takes tooo long to get moving, especially when I just need to look up a very short comment in some online document. It's also great as a way to read documents from a server that is currently offline (like the cache).
This might be too much for a topic like this, but could you imagine paid content being included in the web search index sometime in the future (perhaps marked as such)? Or is "Adwords" enough, in your opinion, for paid content? (this is the random chit chat area, so feel free to answer off-the-record, unofficially, etc :-)) I can't make up my mind and understand both sides involved :-).