Building a scraper service for TW5

Abraham Samma

Jan 18, 2018, 1:41:15 PM

to TiddlyWiki

Hello all,

Recently I have been experimenting with building an external service for scraping and converting web articles into readable versions that can be added into a tiddlywiki. For those who use Firefox's Pocket app, it is the same concept, just for TW5 instead.

I believe this would benefit a lot of folks who read around a lot but find it difficult to copy and paste entire articles into TW5. The wiki becomes a kind of human-run archiving system for the web.

The API works quite well. All that's left is to integrate it with TW5. I hope to share a demo with you all when it is done. Opinions and suggestions are welcome.
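
For readers wondering what the conversion step might look like, here is a minimal sketch in Python. It is not the actual service: the names `ReadableText` and `article_to_tiddler`, the tag `scraped`, and the `source` field are assumptions made for illustration. It keeps only the page title and paragraph text, using the standard library's `html.parser`, and wraps the result as a tiddler object inside a JSON array, which is a shape TW5 can import.

```python
import json
from html.parser import HTMLParser


class ReadableText(HTMLParser):
    """Collects the page title and paragraph text, skipping scripts, styles and menus."""

    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._stack = []   # open tags we track: title, p, and skipped containers
        self._buffer = []  # text pieces of the paragraph currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP or tag in ("title", "p"):
            self._stack.append(tag)
            if tag == "p":
                self._buffer = []

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
            if tag == "p":
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.paragraphs.append(text)

    def handle_data(self, data):
        if not self._stack or any(t in self.SKIP for t in self._stack):
            return
        if self._stack[-1] == "title":
            self.title += data
        elif self._stack[-1] == "p":
            self._buffer.append(data)


def article_to_tiddler(html, url):
    """Turn raw HTML into a dict of tiddler fields (hypothetical field choices)."""
    parser = ReadableText()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip() or url,
        "text": "\n\n".join(parser.paragraphs),  # blank lines separate paragraphs
        "tags": "scraped",
        "type": "text/vnd.tiddlywiki",
        "source": url,  # custom field, assumed for illustration
    }


page = (
    "<html><head><title>Hello</title></head><body>"
    "<p>First <b>bold</b> bit.</p><script>x()</script><p>Second.</p>"
    "</body></html>"
)
print(json.dumps([article_to_tiddler(page, "https://example.com/a")], indent=2))
```

Real pages are much messier than this (unclosed tags, text outside paragraphs), so a production service would likely lean on a dedicated readability library instead.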

In the meantime, have a good one.

Andreas Hahn

Jan 18, 2018, 3:06:59 PM

to tiddl...@googlegroups.com

Hi Abraham,

That sounds pretty interesting. I've had some thoughts about that in the
past too, but I've never tried to do it. Are you using (or planning to use) the
browser messaging mechanism that plugin libraries also use to inject the
scraped tiddlers into the TW? Will the scraping support XPath queries?
(That would be really cool.)
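
To make the XPath question concrete, here is a rough sketch of the kind of selection meant, using the limited XPath subset in Python's standard `xml.etree.ElementTree`. Nothing here reflects how the proposed service works: the function name `select_text` and the sample markup are made up, and the input must be well-formed XML/XHTML, which real pages often are not.

```python
import xml.etree.ElementTree as ET


def select_text(xhtml, path):
    """Return the text of every element matched by an ElementTree XPath expression."""
    root = ET.fromstring(xhtml)
    return ["".join(node.itertext()).strip() for node in root.findall(path)]


sample = """
<html><body>
  <div class="ad"><p>Buy now</p></div>
  <div class="article"><p>First point.</p><p>Second <em>point</em>.</p></div>
</body></html>
"""

# Select only the paragraphs inside the article container, ignoring the ad.
print(select_text(sample, ".//div[@class='article']/p"))
# ['First point.', 'Second point.']
```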

I look forward to your finished version, I'd definitely use it.

/Andreas

Mark S.

Jan 18, 2018, 6:07:23 PM

to TiddlyWiki

BJ's TiddlyClip works very much like a configurable web scraper. The main thing that's missing is the ability to download and link to images.

-- Mark

BJ

Jan 18, 2018, 6:51:42 PM

to TiddlyWiki

The new version of tiddlyclip allows snaps to be saved external to the wiki (with autolinking), and I have been thinking of adding a lib that allows webpages to be converted to 'single page apps', i.e. with images and style sheets internalized and autolinked. Adding support to download images would be simple...

BJ

Jan 18, 2018, 7:02:36 PM

to TiddlyWiki

Hi Abraham,

Maybe you would be interested in having your work interoperate with tiddlyclip. Info about tiddlyclip is at tiddlyclip.tiddlyspot.com. It is somewhat out of date, as I have been adding a lot of new things lately. The latest version (pre-release) of the TW plugin is at https://github.com/buggyj/tiddlyclip-plugin/releases and the latest version of the browser plugin is at https://github.com/buggyj/tiddlyclip/releases.

All the best

BJ

Abraham Samma

Jan 19, 2018, 11:41:26 AM

to TiddlyWiki

Thanks for your interest Andreas.

XPath will not be supported for now. And no, everything is parsed by the server before it is returned to your wiki. No browser messaging is used, which spares me from worrying about future browser vendor support ;-)

Other reasons for this design choice:

1. You can possibly do bulk scraping operations across the web (the server will do most of the hard work for you)
2. Simplicity (just drop the url that points to the resource to scrape)
3. Accessibility from mobile phones: not everyone uses Firefox on mobile, so add-ons are not an option there.
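
The points above could be sketched roughly as follows. This is an illustration under assumptions, not the service's real API: `scrape_many`, `fetch_html`, the injected `convert` callable, and the shape of the reply are all hypothetical. The caller hands over plain URLs, the server-side loop does the fetching and converting, and the reply is a JSON object holding the tiddlers and a list of failures.

```python
import json
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and decode it as UTF-8, replacing undecodable bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_many(urls, convert, fetch=fetch_html):
    """Fetch each URL, convert it to a tiddler dict, and collect failures separately."""
    tiddlers, failed = [], []
    for url in urls:
        try:
            tiddlers.append(convert(fetch(url), url))
        except Exception as error:  # one bad page should not abort the whole batch
            failed.append({"url": url, "error": str(error)})
    return json.dumps({"tiddlers": tiddlers, "failed": failed})


def as_html_tiddler(html, url):
    """Trivial stand-in converter: store the raw HTML under the URL as title."""
    return {"title": url, "text": html, "type": "text/html"}


# Demo with an in-memory "web" so the example runs without network access.
pages = {"https://example.com/a": "<p>Hello</p>"}
reply = scrape_many(
    ["https://example.com/a", "https://example.com/missing"],
    as_html_tiddler,
    fetch=pages.__getitem__,
)
print(reply)
```

Passing the fetcher in as a parameter keeps the batch loop testable offline; a real deployment would use the network fetcher and probably add timeouts and retries.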

Abraham Samma

Jan 19, 2018, 11:43:42 AM

to tiddl...@googlegroups.com

Yes I've checked out TiddlyClip. A very useful thing. Thanks for making this. It definitely complements what I have in mind, at least on the desktop browser.

@TiddlyTweeter

Jan 19, 2018, 12:21:46 PM

to TiddlyWiki

Very interesting theme. Bespoke SCRAPING ability is almost my definition of what decent net use is. I mean, isn't most of it about the search for information? Too much of the time you gotta recreate the stuff, so you don't, as you don't have the hours. Far better to scrape what you need.

TiddlyClip is extremely good but IMO there was still much manual work to do. That is why I'm looking forward to the next version. It avoids some issues set APIs have.

An issue that rapidly comes up with good scrapers (those using an API rather than ad hoc jabbing) is licensing.

I could not tell you how many times API-based scrapers of IMDB have been challenged to pay license fees. Many.

Best wishes
Josiah

TonyM

Jan 19, 2018, 7:37:55 PM

to TiddlyWiki

As a lay philosopher, not a licencing expert, here is my view.

I suppose this is all about fair use. If you try to recreate a database, it's not fair. If you capture bits of it for personal research, it is fair. If you republish it as your own, it may not be. I favor the collaborative approach: if you republish part, provide a link to the source and promote use of the source. This may be reasonable (but not always legal), so you need to read the IMDb licence.

IMDb contains publicly available information, but if you systematically tax the IMDb server then you are stealing their resources.

We are all in this together.

Tony

TonyM

Jan 19, 2018, 7:49:34 PM

to TiddlyWiki

P.s.

I am interested in and enthusiastic about developing integration with and conversion into TiddlyWiki. I also have skills in this area, but as I am busy at the moment I will lurk for now.

Good stuff
Tony

Abraham Samma

Jan 20, 2018, 5:34:38 AM

to TiddlyWiki

I totally agree with you. I think embedding a warning about this issue into the service would help inform end users before they get into trouble with copyright law.
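
One way such a warning might travel with the scraped content is sketched below. It is only an assumption about how the idea could be realised: the function `with_source_notice`, the field names `source` and `retrieved`, and the wording of the notice are all invented for illustration.

```python
from datetime import date

NOTICE = (
    "Scraped for personal reference. The original publisher may hold the "
    "copyright; check the source's terms before republishing."
)


def with_source_notice(tiddler, url, today=None):
    """Return a copy of the tiddler that records its origin and carries a notice."""
    stamped = dict(tiddler)  # leave the caller's dict untouched
    stamped["source"] = url
    stamped["retrieved"] = (today or date.today()).isoformat()
    stamped["text"] = f"{NOTICE}\n\nSource: {url}\n\n{tiddler.get('text', '')}"
    return stamped


example = with_source_notice(
    {"title": "Some article", "text": "Body of the article."},
    "https://example.com/article",
)
print(example["text"])
```

Keeping the link to the source in a field as well as in the text means the wiki can later list or filter everything that came from a given site.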

@TiddlyTweeter

Jan 20, 2018, 12:29:01 PM

to tiddl...@googlegroups.com

Ciao Abraham & TonyM

I'll stick with IMDB as an example use case on copyright issues, because I have lived through licensing problems numerous times, and because it's a very good resource base where the designers built in permalinks from the start. You can link to an IMDB title as close to "forever" as the net allows. It's the best one-stop, perennial database made to date for movies. Most other movie databases on the web use its data as their starting point.

Movies are interesting in that most everything that was "released" is, now, thankfully, data on IMDB. That's not because of legal requirement (ISBN is legally required for published books and ISSN for journals, not so movies). It is because the creators of IMDB recognized the possibility that all movies publicly released could be catalogued. THEY created the approach. In one sweep it gave cinephiles access to the full history of cinema.

It does have weaknesses within certain types of cinema where "release" is not clearly defined (e.g. African films in general and Nigeria in particular). But it is NOT a closing issue as their scope has ALSO highlighted issues in coverage that folk interested in International cinema are concerned with--and slowly they get addressed.

All of this is a way of pointing to the fact that IMDB needs financial support FOR their project to expand. Paid licensing is important to their survival.

My experience is they only get "The Hump" when commercial scrapers, across 1,000s of instances, making thousands of calls a day, are not contributing to their welfare.

I do NOT think WE should be sweating this. But awareness of the issue is just basic responsibility.

I really can't see 50 users of TW scraping IMDB for a handful of records a day as being any kind of problem.

I think the previous line applies to other scraping cases too.
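
To put the "handful of records a day" idea in concrete terms, a scraper could cap how many requests it makes to each site per day. The sketch below is hypothetical throughout: the class name `DailyBudget` and the default limit of 20 are arbitrary choices, not anything IMDB or any other site specifies.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlparse


class DailyBudget:
    """Allows at most `limit` requests per host per calendar day."""

    def __init__(self, limit=20):
        self.limit = limit
        self._day = None
        self._counts = Counter()

    def allow(self, url, today=None):
        """Return True and count the request if the host still has budget today."""
        today = today or date.today()
        if today != self._day:  # new day: reset all counters
            self._day = today
            self._counts.clear()
        host = urlparse(url).netloc
        if self._counts[host] >= self.limit:
            return False
        self._counts[host] += 1
        return True


budget = DailyBudget(limit=2)
for n in range(3):
    url = f"https://www.imdb.com/title/tt{n}"
    print(url, "ok" if budget.allow(url) else "skipped: daily budget used up")
```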

Best wishes
Josiah

Abraham Samma

Jan 21, 2018, 5:50:27 AM

to tiddl...@googlegroups.com

OK, Andreas, I *may* support XPath, just not now.

On Thursday, January 18, 2018 at 11:06:59 PM UTC+3, Andreas Hahn wrote:
