Google's rich snippets in Exhibit


David Huynh

May 22, 2009, 8:36:05 PM
to simile-...@googlegroups.com
Hi all,

Google recently introduced "rich snippets", which are basically
microformats and RDFa:


http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html

The idea is that if your web page is marked up with certain attributes
then search results from your web page will look better on Google.

So far exhibits' contents are not crawl-able at all by search engines,
because they are contained inside JSON files rather than in HTML, and
they are then rendered dynamically in the browser.

Since Google is starting to pay attention to structured data within web
pages, I think it might be a really good time to start thinking about
how to make exhibits crawl-able *and* compatible with Google's support
for microformats and RDFa at the same time. Two birds with one stone.

One possible solution is that if you use Exhibit within a PHP file, then
you could have the PHP file call some service like Babel to take your
JSON file and generate HTML with microformats or RDFa, and inject that
into a <noscript> block.

Please let me know if you have any thoughts on this!

Thanks,

David

Vincent Borghi

May 26, 2009, 4:02:05 AM
to simile-...@googlegroups.com
Hi,

As far as I understand, the solution you mention always roughly doubles
the volume of the served data: you serve the original JSON plus a
specially tagged version in a <noscript> block.

This works and is surely appropriate in many cases.

I would just add, as a remark, that since it costs bandwidth to serve
additional data (data specially tagged for Google) that in the general
case (a human visitor using a browser) is never used, an alternative
solution may be preferable in certain cases, when it is possible:

For those of us who can customize the httpd.conf configuration of our
Apache server, we may prefer the solution of serving, at the same URL,
two different versions:
- one version being the "normal" exhibit, for "normal" human visitors,
- the other, for (Google)bots, being ad-hoc HTML (either static or
dynamically generated by CGI or similar, with or without Babel).

This assumes we configure Apache to serve, for the same given URL, one
version or the other depending on the user agent that visits it (using
an appropriate "RewriteCond %{HTTP_USER_AGENT} ..." / "RewriteRule ..."
pair in the Apache httpd.conf).
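
For example, a minimal sketch (untested; the bot pattern and the name of
the pre-rendered file, exhibit-static.html, are just placeholders):

  # Hypothetical httpd.conf fragment: send known crawlers to a
  # pre-rendered static page; everyone else gets the normal exhibit.
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
  RewriteRule ^exhibit\.html$ /exhibit-static.html [L]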

Regards

John Clarke Mills

May 26, 2009, 1:36:12 PM
to SIMILE Widgets
Vincent,

Although the idea of detecting the user agent is a sound one, this can
also be construed as cloaking, and if you are caught, Google will
penalize you. I often flip a coin in my head on a subject like this,
because what you are saying makes perfect sense; however, we don't
always know how Googlebot is going to react.

Just some food for thought. There's a good chance I will be
attempting to combat this problem in the near future and I will report
back.

Cheers.


David Huynh

May 26, 2009, 7:49:43 PM
to simile-...@googlegroups.com
Search engines are only interested in crawling (probably) visible HTML
content, so anything to be crawled must be in HTML, and that spoils the
whole point of separating data from presentation. I think the only way
to have both separation of data and presentation as well as
crawl-ability is to store the data in JSON files or whatever, and have a
cached rendering of *some* of the data in HTML. Maybe you can specify
some ordering of the items as well as a cut-off limit, and that
determines which items--potentially the most interesting ones--get
rendered into HTML. That way you won't duplicate the data 100%.

So your PHP file will look something like this:

<html>
  <head>
    <link rel="exhibit/data" href="data1.json"
          type="application/json" />
    <link rel="exhibit/data" href="data2.rdf"
          type="application/rdf+xml" />
  </head>
  <body>
    ...
    <div ex:role="lens" id="template-1" ...>...</div>

    <noscript>
      <?php
        $curl_handle = curl_init();
        curl_setopt($curl_handle, CURLOPT_URL,
            'http://service.simile-widgets.org/exhibit-render?');
        curl_exec($curl_handle);
        curl_close($curl_handle);
      ?>
    </noscript>
  </body>
</html>

The trouble is how to pass data1.json, data2.rdf, and the lens template
to the exhibit-render web service. We could potentially make a PHP
library file such that, when you include it in another PHP file, it
parses the containing PHP file, extracts the data links and lens
templates, and calls the exhibit-render web service automatically.

<?php
  include("exhibit-rendering-lib.php");
  # id of lens template to use, sort-by expression, sort ascending, limit
  renderExhibit("template-1", ".age", true, 10);
?>

I don't know enough php to know if that's possible / easy.
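
That said, maybe something like this could work inside
exhibit-rendering-lib.php (a rough, untested sketch; the query parameter
names for the exhibit-render service are made up here):

  <?php
  // Sketch: read the including PHP file, pull out the Exhibit data
  // links and the named lens template, and forward them to the
  // rendering web service. Parameter names are hypothetical.
  function renderExhibit($templateId, $sortBy, $ascending, $limit) {
      $source = file_get_contents($_SERVER['SCRIPT_FILENAME']);

      // Collect the href attributes of <link rel="exhibit/data"> tags.
      preg_match_all(
          '/<link[^>]*rel="exhibit\/data"[^>]*href="([^"]+)"/i',
          $source, $matches);
      $dataLinks = $matches[1];

      // Grab the lens template element by its id (crude, non-nesting match).
      preg_match(
          '/<div[^>]*id="' . preg_quote($templateId, '/') . '".*?<\/div>/s',
          $source, $templateMatch);
      $template = isset($templateMatch[0]) ? $templateMatch[0] : '';

      // Hand everything to the web service; curl echoes the HTML response.
      $query = http_build_query(array(
          'data'      => implode(',', $dataLinks),
          'template'  => $template,
          'orderBy'   => $sortBy,
          'ascending' => $ascending ? 'true' : 'false',
          'limit'     => $limit,
      ));
      $curl_handle = curl_init();
      curl_setopt($curl_handle, CURLOPT_URL,
          'http://service.simile-widgets.org/exhibit-render?' . $query);
      curl_exec($curl_handle);
      curl_close($curl_handle);
  }
  ?>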

David

David Karger

May 26, 2009, 11:46:04 PM
to simile-...@googlegroups.com
Maybe instead of physical separation we can settle for logical separation.

Suppose we enable <link rel="exhibit/data" href="#local"> to specify
that the data can be found in the element with name or id "local" in the
HTML doc? That data can be CDATA-encoded and meets the goal of being
machine readable. It does require XML parsing, but that's a relatively
small cost.
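
For example, something like this (just a sketch; the
<script type="application/json"> carrier and the CDATA wrapper are only
one possible encoding):

  <link rel="exhibit/data" href="#local" type="application/json" />
  ...
  <script type="application/json" id="local"><![CDATA[
    { "items": [
        { "label": "Example Item", "type": "Item" }
    ] }
  ]]></script>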

David Legg

May 27, 2009, 6:22:26 AM
to simile-...@googlegroups.com

David Huynh wrote:
> Search engines are only interested in crawling (probably) visible HTML
> content, so anything to be crawled must be in HTML, and that spoils the
> whole point of separating data from presentation.

I think you have this a little skewed ;-)

Just because your data is stored as HTML doesn't mean it's not separated
from presentation... that's why CSS was born!

I think the answer is that you should encode your data as semantic HTML
markup, or POSH [1] as it's known. This has a lot of advantages: Google
can 'see' your data, browsers without Javascript enabled can see your
data, AND Exhibit then becomes a method for progressively enhancing the
display of the data.

This technique fits well with the current idea of 'progressive
enhancement' expounded by people like Jeremy Keith [2]. Currently if
you don't have Javascript enabled then an Exhibit page is useless.

Of course you also have to ask yourself... do you care if Google indexes
your data? If not, then you can continue as before.

Regards,
David Legg

[1] http://microformats.org/wiki/posh
[2] http://domscripting.com/author/

Matt Pasiewicz

May 27, 2009, 12:54:01 PM
to simile-...@googlegroups.com
Interesting ... I wonder if that might not also help with scalability of large datasets. I wonder if there is some middle ground to be found on this front ... not sure if we can get meaningful gains by having a mixed client-side/server-side hybrid (and still retain the "exhibit experience"), but I can't help but wonder.

David Huynh

May 27, 2009, 1:45:08 PM
to simile-...@googlegroups.com
Thanks for the links, although looking at [1], I can't find any example
of what POSH looks like.

Anyway, POSH and the like--microformats, RDFa--should work for simple
cases in which the data consists of relatively unconnected items. As
soon as you have more than one type of item and relationships between
them (e.g., papers and authors), these formats start to break down,
because you can't flatten a graph into a tree (HTML) without duplicating
data. Consider wanting to present 2 papers with a common author:

    Paper 1's title, by John Doe (University X) and Jane Smith (University Y)
    Paper 2's title, by Joe Anderson (University Z) and John Doe (University X)

If you model John Doe as a first-class entity, then in POSH,
microformats, RDFa, you have to repeat that information about him being
affiliated with University X. And whenever you have copies of data, you
run the risk of updating one copy and forgetting about the other copies.

Furthermore, sometimes you actually don't want to keep your data in
POSH, microformats, RDFa, etc., because they are hard to write and
manage. You might prefer to keep your data in a Google spreadsheet
(which Exhibit can access), or maybe you already have your data in a
conventional database. Or maybe your data comes from a web service
(e.g., your Del.icio.us bookmarks coming from a JSON feed). Etc. etc.

David

David Huynh

May 27, 2009, 1:54:12 PM
to simile-...@googlegroups.com
Matt Pasiewicz wrote:
> Interesting ... I wonder if that might not also help with scalability
> of large datasets. I wonder if there is some middle ground to be
> found on this front ... not sure if we can get meaningful gains by
> having a mixed client-side/server-side hybrid (and still retain the
> "exhibit experience"), but I can't help but wonder.
There was a project called Backstage, which I started but haven't had
time to continue, that attempts to deal with large datasets by
dynamically constructing an RDF database on a server, loading the data
into it, and then serving the data as any conventional web app would. As
the author of an exhibit, you would manage your HTML and data in much
the same way, but you would just link to a Backstage server rather than
(or in addition to) exhibit-api.js. Think of it as "renting a faceted
browsing web app on the fly".

There was a thread on that:
http://simile.mit.edu/mail/ReadMsg?&msgId=23836
but the demo is no longer running. You could still see the html file
http://people.csail.mit.edu/dfhuynh/misc/backstage-demo.html
but the server that it links to is no longer up
http://dfhuynh.csail.mit.edu:8181/backstage/api/backstage-api.js

Note that this still does not address the crawl-ability issue, since the
html still does not contain the data.

David

David Legg

May 27, 2009, 5:30:55 PM
to simile-...@googlegroups.com
Hi David,
> Thanks for the links, although ... I can't find any example
> of what POSH looks like.
>

That's because POSH is an approach and not a hard and fast standard. In
any case it doesn't look much different to Plain Old Standard Html ;-)

> Anyway, POSH and the like--microformats, RDFa--should work for simple
> cases in which the data consists of relatively unconnected items. As
> soon as you have more than one type of item and relationships between
> them (e.g., papers and authors), these formats start to break down,
> because you can't flatten a graph into a tree (HTML) without
> duplicating data.

I think you're looking at it backwards. The idea is to start from a
need to present some data to a human by way of a web page. If the
user's browser is capable of supporting CSS then you can make the data
look more appealing. If the user's browser supports Javascript then you
can bring the full power of exhibit in to group or sort or selectively
present the data.

In other words you don't flatten the graph into a tree, you start with
data encoded in HTML as a graph and build up the tree from the graph
just as you currently do with JSON or XML.

> Consider wanting to present 2 papers with a
> common author
>
> Paper 1's title, by John Doe (University X) and Jane Smith
> (University Y)
> Paper 2's title, by Joe Anderson (University Z) and John Doe
> (University X)
>

You could model this in HTML something like this:

<h1>Papers</h1>
<div class="paper">
  <h2>Paper Darts</h2>
  <p>A classic paper on a subject not to be missed</p>
  <p class="authors">
    <a href="#a1">John Doe</a>
    <a href="#a2">Jane Smith</a>
  </p>
</div>
<div class="paper">
  <h2>Riveting for pleasure</h2>
  <p class="authors">
    <a href="#a2">Jane Smith</a>
    <a href="#a3">Joe Anderson</a>
  </p>
</div>
...
<h1>Authors</h1>
<div class="author" id="a1">
  <h2>John Doe</h2>
  <p>John is an amiable fellow who doesn't say much.</p>
  <p class="university">
    <a href="#u1">University X</a>
  </p>
</div>
<div class="author" id="a2">
  <h2>Jane Smith</h2>
  <p>There is nothing plain about Jane.</p>
  <p class="university">
    <a href="#u2">University Y</a>
  </p>
</div>
<div class="author" id="a3">
  <h2>Joe Anderson</h2>
  <p>A Riveting chap.</p>
  <p class="university">
    <a href="#u3">University Z</a>
  </p>
</div>
...
<h1>Universities</h1>
<div class="university" id="u1">
  <h2>University X</h2>
</div>
<div class="university" id="u2">
  <h2>University Y</h2>
</div>
<div class="university" id="u3">
  <h2>University Z</h2>
</div>

> If you model John Doe as a first-class entity, then in POSH,
> microformats, RDFa, you have to repeat that information about him being
> affiliated with University X. And whenever you have copies of data, you
> run the risk of updating one copy and forgetting about the other copies.
>

I hope you can see that there is no need to duplicate data. HTML can
reference elements just like XML can and it's all accessible from the DOM.

> Furthermore, sometimes you actually don't want to keep your data in
> POSH, microformats, RDFa, etc., because they are hard to write and
> manage. You might prefer to keep your data in a Google spreadsheet
> (which Exhibit can access), or maybe you already have your data in a
> conventional database. Or maybe your data comes from a web service
> (e.g., your Del.icio.us bookmarks coming from a JSON feed). Etc. etc.
>

I understand your point that maintaining data in HTML could be a pain,
and that Exhibit was designed to let mere users do stuff without needing
to resort to databases or writing Java. However, I'm thinking of the
more likely case where your data is stored in some database (or
triplestore?) and the HTML page is generated from it automatically.

Regards,
David Legg

Gmail

May 28, 2009, 1:31:51 PM
to simile-...@googlegroups.com
Hi, the discussion has generated good implementation ideas.

David Legg wrote:
> I understand your point that maintaining data in HTML could be a pain,
> and that Exhibit was designed to let mere users do stuff without needing
> to resort to databases or writing Java. However, I'm thinking of the
> more likely case where your data is stored in some database (or
> triplestore?) and the HTML page is generated from it automatically.
>

Continuing to extend the concepts: a central Exhibit database contains
the data plus control elements that govern its rendering into HTML. It
also has a "backend" to interface with a server database for download,
a la "Backstage". Exhibit then not only groups and presents the data,
but also programs the control elements to render the selected data
elements in HTML.

Consider parameters determining the "crawlability" of the content. My
experience confirms that the success of crawling, and the response,
depends on the content: principally the frequency of the keywords and
the scope of the content. Google appears to prefer longer content that
maintains a higher repetition of the keywords. On that basis, rendering
data from the database into HTML will not produce pages that Google
rates highly. Controlling the presentation through Exhibit makes the
Google entry look better, but the page should offer extensive and
focused content to improve the rating.

Another facet of the discussion is user interaction. Based on my
integration experience, there are few databases or user clients that
offer data entry for JSON or RDF. Could the central database offer an
integration for user input, so that the page not only renders the data
but also stores user input in the chosen format? My preferred
application is enhancement of database information, and this approach
would continue to position Exhibit, Timeline and Timeplot as
visualization tools, not only for web applications.

The discussion is interesting and timely.

Regards,

Gary Gabriel


David Huynh

May 29, 2009, 12:38:02 AM
to simile-...@googlegroups.com
Hi David,

We actually have an "importer" that can grab data out of an inline HTML
table.

http://api.simile-widgets.org/exhibit/2.2.0/scripts/data/importers/html-table-importer.js

That's been around for maybe 2 years, but it hasn't gotten much use.
That's why I'm skeptical about the idea of humans managing structured
data within HTML.

Another point of skepticism is that--I think--authors don't usually
think of their audience as being composed of progressively more capable
individuals. For example, a painter doesn't start by thinking that some
of his audience have low vision, so he must paint in really big strokes
and high contrast to make sure that they can see his work, and then put
in progressively finer details to engage the audience members with
better vision. Painters usually just start painting what they have in
mind. Similarly, "data artists" probably start out thinking something
like, "I want to show a map of news articles about epidemics" [1],
rather than thinking, "I have data about epidemics and some in my
audience only have really basic non-scripting browsers". I'm not saying
that one behavior is better than the other; I'm just describing what I
think is the more common behavior. When an exhibit starts to get more
and more of an audience, its author might start to think about investing
effort in accessibility--because only then does accessibility have a
good ROI.

David
[1] http://epispider.org/

David Legg

May 29, 2009, 3:59:51 AM
to simile-...@googlegroups.com

David Huynh wrote:
> We actually have an "importer" that can grab data out of an inline HTML
> table...

> That's been around for maybe 2 years, but it hasn't gotten much use.
> That's why I'm skeptical about the idea of humans managing structured
> data within HTML.
>

Thanks for the reminder.

> Another point of skepticism is that--I think--authors don't usually
> think of their audience as being composed of progressively more capable
> individuals.

[snip]

> When an exhibit starts to get more and more of an audience,
> its author might start to think about investing effort in
> accessibility--because only then does accessibility have a good ROI.
>

You started this thread with a desire to make Exhibit pages visible to
the likes of Google. You have to think of Google as a human with
accessibility issues. It only thinks in terms of the structure of the
text. If you present your data as an HTML table, it won't be able to
work out the relative importance of one string to another. For SEO
(Search Engine Optimization) purposes you *have* to use the hierarchical
headers h1, h2, etc.

It looks to me like it is an impossible goal to let ordinary
non-technical people write an exhibit page that works *and* ranks well
on Google!

Regards,
David Legg

David Legg

May 29, 2009, 10:59:58 AM
to simile-...@googlegroups.com
Hi Gary,

> Consider parameters determining the "crawlability" of the content...


> Google appears to prefer longer content that maintains a higher
> repetition of the keywords. On that basis, rendering data from the
> database into HTML will not produce pages that Google rates highly.

I would hope that nobody is assuming that sticking the contents of a
large database into a web page is a good idea! If that is anyone's
intention, then they would be better off creating many smaller, more
focussed web pages, each of which links to a page where the entire
database is presented through Exhibit or Timeline or whatever.

> Controlling the presentation through Exhibit makes the Google entry
> look better, but the page should offer extensive and focused content
> to improve the rating.
>

Remember that Google will not see *any* of the better-looking,
better-organized content... no matter how brilliant Exhibit is. Google
*just* looks at the HTML; it doesn't care (much) about your clever CSS,
and it certainly doesn't run any of your Javascript. Try turning off
CSS, images, and Javascript in your browser, and *that* is what Google
sees.

> The discussion is interesting and timely.
>

I agree.

Regards,
David Legg
