How to cope with a website that fills HTML dynamically?


Kouichi NAKAMURA

Sep 25, 2017, 8:50:46 AM
to zotero-dev

As it turns out, the web page I'm trying to translate (e.g. http://www.abstractsonline.com/pp8/#!/4376/presentation/308) is dynamically generated: its source code contains only the few lines below, plus a handful of others, in the body element.

<div class="container content">
    <div id="header"></div>
    <div id="body"></div>
    <div id="footer"></div>
</div>

I don't know exactly what kind of technologies are used here, but I found the following line in the head element of the html.

<script data-main="js/main" src="js/lib/require.js"></script>

When I run doWeb() while viewing the page of a single entry, the doc object contains the actual page elements, so doWeb() works fine.

However, when I run doWeb() from a search results page (e.g. http://www.abstractsonline.com/pp8/#!/4376/presentations/rubinstein/1), the doc object passed to the callback of ZU.processDocuments(url, function (doc) { … }) does not seem to contain the actual page content, even though, as far as I can tell, I'm passing the same correct URLs as for single entries. So the same code that works for a single entry does not work for multiple input.

Also, probably for the same reason, the attachment is not working properly: the snapshot shows only the page layout without the actual content, along with a message saying that loading is taking a long time, even though the URL is correct.

// attachments
item.attachments = [{
    url: doc.URL,
    title: "Print page",
    mimeType: "text/html",
    snapshot: true
}];
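As an aside, for the single-entry case where the document is already fully rendered, Zotero translators commonly pass the live document object instead of a URL, so the snapshot captures the dynamically generated content rather than re-fetching a bare page. The sketch below is a hypothetical illustration of that pattern, not code from this thread:

```javascript
// Sketch: build a snapshot attachment from the already-rendered document
// instead of re-fetching the URL. Passing `document` rather than `url` is
// a common pattern in Zotero translators; treat the exact fields here as
// an assumption, not this translator's final code.
function makeSnapshotAttachment(doc) {
    return {
        title: "Print page",
        document: doc,   // snapshot the live DOM, not the bare URL
        snapshot: true
    };
}

// Usage inside doWeb(): item.attachments = [makeSnapshotAttachment(doc)];
```

This only helps when doc really is the fully loaded page, i.e. the single-entry case; it does not solve the multiples problem described above.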

Before I give up implementing multiple and attachments (I'm almost given them up), is there a good way to work around this kind of website? Any useful function to retrieve the actual page content with a given URL?

Cheers,

Kouichi

Dan Stillman

Sep 25, 2017, 6:38:35 PM
to zoter...@googlegroups.com
On 9/25/17 8:50 AM, Kouichi NAKAMURA wrote:

However, when I run doWeb() from a search results page (e.g. http://www.abstractsonline.com/pp8/#!/4376/presentations/rubinstein/1), the doc object passed to the callback of ZU.processDocuments(url, function (doc) { … }) does not seem to contain the actual page content, even though, as far as I can tell, I'm passing the same correct URLs as for single entries. So the same code that works for a single entry does not work for multiple input.


Without looking too closely, most likely what's happening is that the page itself is loading, causing processDocuments() to run your processing code, even though the actual page content hasn't yet been added to the page.

There's a utility function (monitorDOMChanges()) that might help here, but you shouldn't use it — we're in the process of changing processDocuments() to use XMLHttpRequest + DOMParser instead of using a hidden browser, which will mean that processDocuments() will only ever have access to the initial page structure (and be much faster, since it won't have to actually load full webpages in the background and run JavaScript).

The key thing in these cases is that the fact that the page is generated client-side means that you also have access to the same data it's using to generate the page, either inline as JSON or via an API request the page is making. Judging by the URL, this page is likely doing the latter. So you'd want to use the Network pane of the browser dev tools to see what requests it's making and make the same request from the translator, probably for JSON that you can then use directly.
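As a sketch of that approach: the field names below (AuthorBlock, End, PresentationNumber) come from the JSON response data Kouichi posts later in this thread, but everything else, including how it would be wired into ZU.doGet(), is an assumption for illustration, not this site's documented API.

```javascript
// Sketch: map one JSON presentation record onto Zotero item fields.
// Field names are taken from the response data shown later in this thread;
// the mapping itself is hypothetical.
function presentationToItem(jsonText) {
    var data = JSON.parse(jsonText);
    return {
        itemType: "presentation",
        // AuthorBlock is an HTML fragment; strip tags for a plain-text string
        presenter: data.AuthorBlock.replace(/<[^>]+>/g, ""),
        date: data.End,
        extra: "Presentation Number: " + data.PresentationNumber
    };
}

// In a translator this would run inside the request callback, roughly:
//   ZU.doGet(apiUrl, function (text) {
//       var item = new Zotero.Item("presentation");
//       // ...copy over fields from presentationToItem(text)...
//       item.complete();
//   });
```

The advantage over scraping the rendered DOM is that the JSON is already structured, so there's no dependence on the page's client-side rendering having finished.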

Whether you still get the data from the page for a single-page save is up to you. The API request might be cached by the browser, so requesting it again from the translator might not result in an additional network request even for a single-page save, and it would let you use the same code and also get properly structured data instead of having to scrape the page. But whether that makes sense depends on the site.

Sebastian Karcher

Sep 25, 2017, 11:21:37 PM
to zoter...@googlegroups.com
Admittedly, though, this is significantly more advanced than just writing a translator, so if this is too daunting for you (I remember you saying you just started with JavaScript), I'd take the translator with the code for multiples and attachments commented out, and either we'd initially accept it as is or one of us would look "under the hood" and add those bits. Obviously, if you want to have a go at this and ask for help when you're stuck, that's even better.



Kouichi NAKAMURA

Sep 26, 2017, 3:11:39 AM
to zotero-dev
Thank you, guys. I looked at the Network pane of the devtools. There are many requests for JSON data, and looking into some of them, I think they contain the content used to fill in the web page (see below). But issuing those requests from my own code to assemble the page is a bit overwhelming. I'd rather leave it as it is for now…

Kouichi


AuthorBlock: <b>*L.-L. PAI</b><sup>1</sup>,…Harvard Med. Sch., Boston, MA
ControlNumber: 14440
DisclosureBlock: &nbsp;<b>L. Pai:</b> None.&nbs…funds); Neurona Therapeutics.
End: 11/13/2017 10:00:00 AM
Id: 30154
Position: 18
PosterboardNumber: B8
PresentationNumber: 282.18

Kouichi NAKAMURA

Sep 28, 2017, 5:13:00 PM
to zotero-dev

Apparently, the page requested 10 pieces of JSON data. But what's the next step? How do I make the API request? And how do I know which one(s) to request, and in what order?


