Can't get Google Crawler to index my GWT-based site no matter what I do. Help!


Benjamin Possolo

Oct 27, 2012, 6:17:33 PM
to google-we...@googlegroups.com

I am unable to get my GWT-based site to be indexed by Google no matter what I do.

My URLs all look like this:

http://marketplace.styleguise.net/#!/home
http://marketplace.styleguise.net/#!/new-listings
http://marketplace.styleguise.net/#!/item/172001
http://marketplace.styleguise.net/#!/about

and so on.

I have a URL servlet handler that properly returns static HTML snapshots of my site (using HtmlUnit) when they are requested as:

http://marketplace.styleguise.net/?_escaped_fragment_=/home
http://marketplace.styleguise.net/?_escaped_fragment_=/new-listings
http://marketplace.styleguise.net/?_escaped_fragment_=/item/172001
http://marketplace.styleguise.net/?_escaped_fragment_=/about
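
For anyone unfamiliar with the scheme: the crawler rewrites the #! fragment into the _escaped_fragment_ query parameter (because fragments are never sent to the server), and the server reverses that mapping to recover the pretty URL before rendering a snapshot. A minimal sketch of the reversal — the class and method names are hypothetical, not from my actual servlet:

```java
import java.net.URLDecoder;

public class EscapedFragment {

    /**
     * Converts a crawler query string (e.g. "_escaped_fragment_=/home")
     * back into the hash-bang form (e.g. "#!/home").
     */
    public static String toHashBang(String queryString) {
        // Re-insert the "#!" where the crawler put the special parameter;
        // the leading "&?" also handles URLs that carry other parameters.
        String s = queryString.replaceFirst("&?_escaped_fragment_=", "#!");
        try {
            // The crawler percent-encodes the fragment, so decode it.
            return URLDecoder.decode(s, "UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always supported
        }
    }
}
```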

My host HTML page has the special meta tag:

<meta name="fragment" content="!">

Finally, I have a sitemap with about 15 URLs. One of them is the host page, the rest are all hash-bang based URLs.
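
For reference, the hash-bang URLs are listed in the sitemap in their pretty form (the crawler derives the _escaped_fragment_ URL itself). A minimal sketch using a few of the URLs above — note the order of the # and the !:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://marketplace.styleguise.net/</loc></url>
  <url><loc>http://marketplace.styleguise.net/#!/home</loc></url>
  <url><loc>http://marketplace.styleguise.net/#!/new-listings</loc></url>
  <url><loc>http://marketplace.styleguise.net/#!/item/172001</loc></url>
</urlset>
```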

Within the Webmaster Tools, Google is reporting only one page as having been indexed (the home page without a hash-bang). I've tried submitting my URLs individually using the Fetch As Googlebot tool but that seems to disallow one from submitting hash-bang URLs to the index, even if they fetch properly and the preview is correct. I've tried both with and without a robots.txt file as well. Nothing works!

This is driving me mad! Has anyone managed to get Google to index their GWT site? If so, I would REALLY appreciate any advice.

ant...@gmail.com

Oct 28, 2012, 10:28:04 AM
to google-we...@googlegroups.com
Your URLs could be further improved if you go through the proposal in 

Were I in your position, I would do the above and then generate a sitemap.xml for search-engine submission.


Antonios [dot] Chalkiopoulos [at] keepitcloud [dot] com

Joseph Lust

Oct 28, 2012, 11:50:27 AM
to google-we...@googlegroups.com
I see you're using Places and URL tokens; are you using GWTP? There is some built-in crawler support there that you could use, or at least investigate, if you're not already using GWTP.


Sincerely,
Joseph

Benjamin Possolo

Oct 29, 2012, 1:26:36 AM
to google-we...@googlegroups.com
Thanks for responding!

Yes, you are correct. I am using Places and Activities (but not GWTP). I use straight GWT throughout the entire app (including all the Editor and validation stuff).

GWTP does include a canned filter for handling crawlability. I have a very similar one, albeit slightly optimized.

// Imports added for completeness; MemcacheUtil is my own caching helper class.
import java.io.IOException;
import java.io.PrintWriter;
import java.net.URLDecoder;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

/**
 * Special filter that adds support for Google crawling as outlined here 
 * 
 * @author Benjamin Possolo
 */
public class GoogleCrawlerFilter implements Filter {

    private static final Logger log = Logger.getLogger(GoogleCrawlerFilter.class.getName());

    private static final ThreadLocal<WebClient> webClient = new ThreadLocal<WebClient>() {
        @Override
        protected WebClient initialValue() {
            log.info("Instantiating headless browser");
            WebClient wc = new WebClient(BrowserVersion.FIREFOX_3_6);
            wc.setThrowExceptionOnScriptError(false);
            wc.setThrowExceptionOnFailingStatusCode(false);
            wc.setCssEnabled(false);
            return wc;
        }
    };

    @Override
    public void init(FilterConfig config) throws ServletException {}

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        HttpServletResponse resp = (HttpServletResponse) response;
        String queryString = req.getQueryString();
        if( queryString != null && queryString.contains("_escaped_fragment_") ){
            log.info("Detected request from Google Crawler");
            //Google requests the URL with the place fragment as a query parameter.
            //It does this because URL fragments (the portion after the hash #) are
            //not sent with an HTTP request.
            //Convert the ugly URL back to the real URL that uses the hash-bang.
            queryString = queryString.replaceFirst("&?_escaped_fragment_=", "#!");
            queryString = URLDecoder.decode(queryString, "UTF-8");
            StringBuilder pageToCrawlSb = new StringBuilder(req.getScheme())
                .append("://").append(req.getServerName());
            if( req.getServerPort() > 0 ){
                pageToCrawlSb.append(':').append(req.getServerPort());
            }
            pageToCrawlSb.append(req.getRequestURI());
            if( !queryString.startsWith("#!") ){
                pageToCrawlSb.append('?');
            }
            pageToCrawlSb.append(queryString);
            String pageToCrawl = pageToCrawlSb.toString();
            log.log(Level.INFO, "Page being crawled: {0}", pageToCrawl);

            //Check whether a snapshot of the requested page already exists.
            String htmlSnapshot = MemcacheUtil.getHtmlSnapshot(pageToCrawl);
            if( htmlSnapshot == null ){
                try{
                    //Use HtmlUnit to render the requested page.
                    long start = System.currentTimeMillis();
                    log.info("Using headless browser to fetch page");
                    HtmlPage page = webClient.get().getPage(pageToCrawl);
                    log.info("Pumping javascript event loop for 8 seconds");
                    webClient.get().getJavaScriptEngine().pumpEventLoop(8000); //execute javascript for 8 seconds
                    long end = System.currentTimeMillis();
                    log.log(Level.INFO, "Time to generate page snapshot: {0} seconds", ((end - start) / 1000L));

                    //Add a special message to the top of the page so that anyone seeing
                    //the snapshot will know it is meant for Google crawling.
                    String snapshotMsg = new StringBuilder("<body>\n\n")
                        .append("<hr />\n")
                        .append("<center>\n")
                        .append("  <h3>\n")
                        .append("    You are viewing a non-interactive page that is intended for the crawler.<br/>\n")
                        .append("    You probably want to see this page: <a href=\"").append(pageToCrawl)
                        .append("\">").append(pageToCrawl).append("</a>\n")
                        .append("  </h3>\n")
                        .append("</center>\n")
                        .append("<hr />\n")
                        .toString();
                    htmlSnapshot = page.asXml();
                    htmlSnapshot = htmlSnapshot.replaceFirst("<body[^>]*>", snapshotMsg);

                    //Store the rendered page in memcache.
                    MemcacheUtil.putHtmlSnapshot(pageToCrawl, htmlSnapshot);
                }
                finally{
                    webClient.get().closeAllWindows();
                }
            }
            //Send the HTML snapshot back to the crawler.
            resp.setContentType("text/html; charset=UTF-8");
            PrintWriter writer = resp.getWriter();
            writer.print(htmlSnapshot);
        }
        else{
            chain.doFilter(request, response);
        }
    }

    @Override
    public void destroy() {
        //Never called on Google App Engine.
    }
}
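
For completeness, the filter also has to be registered in web.xml so it runs ahead of the normal request handling. A minimal sketch — the filter-name and the /* mapping are my assumptions, and the filter-class should be the fully qualified name (the package isn't shown above):

```xml
<filter>
  <filter-name>googleCrawlerFilter</filter-name>
  <filter-class>GoogleCrawlerFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>googleCrawlerFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
```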

Benjamin Possolo

Oct 29, 2012, 1:44:18 AM
to google-we...@googlegroups.com
On Sunday, October 28, 2012 7:28:05 AM UTC-7, ant...@gmail.com wrote:
Your URLs could be further improved if you go through the proposal in 
Were I in your position, I would do the above and then generate a sitemap.xml for search-engine submission.
Antonios [dot] Chalkiopoulos [at] keepitcloud [dot] com

Thank you for taking the time to respond.

I presume you are suggesting I get rid of the hash-bang entirely from my URLs. I didn't know that was possible yet (apparently the HTML5 History API permits this; thanks for pointing that out to me). However, by doing that, Google would no longer consider my site an "AJAX site". I am following the AJAX crawlability guidelines they document here. When they send the special query parameter (_escaped_fragment_), my server knows that it should generate and return an HTML snapshot of whatever page/GWT place is being crawled.

If I removed the hash-bang, Google would no longer send the _escaped_fragment_ query parameter with its crawl requests, and it would be near-impossible for me to know when I should generate a snapshot versus just returning the normal content.

Benjamin Possolo

Oct 29, 2012, 1:46:46 AM
to google-we...@googlegroups.com
On Sunday, October 28, 2012 7:28:05 AM UTC-7, ant...@gmail.com wrote:
Your URLs could be further improved if you go through the proposal in 

Were I in your position, I would do the above and then generate a sitemap.xml for search-engine submission.
Antonios [dot] Chalkiopoulos [at] keepitcloud [dot] com

I forgot to add: I already have a sitemap, which I submitted to Webmaster Tools. It's really short (only 15 URLs), but it doesn't seem to help at all.
Message has been deleted

Gonzalo Ferreyra Jofré

Oct 29, 2012, 7:28:11 AM
to google-we...@googlegroups.com
Hello,

the hash-bang is inverted (!#) in the URLs in your sitemap XML. It should be this way: #!

Try switching the position of the hash.

Benjamin Possolo

Oct 29, 2012, 12:06:15 PM
to google-we...@googlegroups.com
On Monday, October 29, 2012 4:28:11 AM UTC-7, Gonzalo Ferreyra Jofré wrote:
Hello,
the hash-bang is inverted (!#) in the URLs in your sitemap XML. It should be this way: #!
Try switching the position of the hash.
Oh wow, big mistake on my behalf. Thank you for catching that!!
I wonder if that will do the trick; Google always plays down the importance of sitemaps.
I'm fixing, uploading, and resubmitting, and I'll report back.

Benjamin Possolo

Oct 29, 2012, 1:40:03 PM
to google-we...@googlegroups.com
It looks like that may have done the trick. I am not 100% certain it was that, because my App Engine log files show a ton of traffic from Googlebot last night. Either way, thanks for finding that major mistake.

Googling "site:marketplace.styleguise.net" is now finally showing entries!!