Query: About BJ's TiddlyClip as a "Screen Scraper"?


@TiddlyTweeter

Jan 14, 2018, 2:07:05 PM
to TiddlyWiki
BJ's TiddlyClip is a really interesting tool for bridging the gap between a self-contained TW and a browser environment of web pages waiting to be clipped.

As I slowly get better with TW I'm beginning to wonder if TiddlyClip could be used to "scrape" web pages, i.e. detect specific wanted sections and extract them. A common case, for me, would be extracting data for a single movie from IMDB and maybe processing it to populate a tiddler and some of its fields.

Any thoughts?

Best wishes
Josiah

Mark S.

Jan 14, 2018, 3:22:28 PM
to TiddlyWiki
Has TiddlyClip been updated for FF57?

Is the IMDB web structure consistent enough to allow web scraping?

-- Mark

coda coder

Jan 14, 2018, 3:52:11 PM
to TiddlyWiki
The solution would be here (assuming this is actually working - I've never tried it) https://stackoverflow.com/a/7744369

coda coder

Jan 14, 2018, 3:55:16 PM
to TiddlyWiki
Scratch that, they all appear to be dead.

A node guru could probably get something working via https://www.npmjs.com/package/imdb-api

BJ

Jan 14, 2018, 4:19:56 PM
to TiddlyWiki
TiddlyClip has been updated and is available from GitHub. However, I have added quite a few things and have not finished the documentation (tiddlyclip at tiddlyspot still has the old docs). The original functionality and mode of operation remain, so it should be OK to use, BUT it is a pre-release at the moment as I am still adding to it.

all the best
BJ

@TiddlyTweeter

Jan 15, 2018, 6:19:27 AM
to TiddlyWiki
Mark S. asked ...
Is the IMDB web structure consistent enough to allow web scraping?

Short answer: yes.

The page to be scraped is not actually the main landing page for a film title. It's a more detailed sub-page you need to scrape. The layout is consistent and parsing the HTML is enough.

Long answer: it's not exactly intuitive.

But the difficulties have more to do with the complexity of movies than with the complexity of IMDB data per se, which is well designed. (Issues like having more than one director, or selecting the top 5 main actors rather than the zillions in the full cast.)
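
The two issues mentioned above (several directors, and keeping only the top 5 actors) could be handled roughly like this once the data has been pulled out of the page. This is only a sketch: the input object and the field names `director` and `actors` are hypothetical, not an IMDB or TiddlyClip schema. The one thing taken from TiddlyWiki itself is the list format, where titles containing spaces are wrapped in double square brackets.

```javascript
// Sketch only: assumes the movie data has already been extracted from the page.
// The field names (director, actors) are hypothetical choices for illustration.

// Format names as a TiddlyWiki-style list: entries containing whitespace
// are wrapped in [[double square brackets]] and joined with spaces.
function toListField(names) {
  return names
    .map(function (name) {
      return /\s/.test(name) ? "[[" + name + "]]" : name;
    })
    .join(" ");
}

// Build tiddler fields: keep every director, but only the first
// maxActors cast members (5 when no limit is given).
function movieToFields(movie, maxActors) {
  var limit = maxActors === undefined ? 5 : maxActors;
  return {
    title: movie.title,
    director: toListField(movie.directors),
    actors: toListField(movie.cast.slice(0, limit))
  };
}

var fields = movieToFields({
  title: "Example Movie",
  directors: ["Director One", "Director Two"],
  cast: ["Actor A", "Actor B", "Actor C", "Actor D", "Actor E", "Actor F"]
});
```

Keeping the cut-off as a parameter means the "top 5" choice stays a per-user decision rather than something baked into the extraction step.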

Best wishes
Josiah 

@TiddlyTweeter

Jan 15, 2018, 6:30:41 AM
to TiddlyWiki
Ciao coda coder


coda coder wrote:
Scratch that, they all appear to be dead.

A node guru could probably get something working via https://www.npmjs.com/package/imdb-api

FWIW there are issues using programmatic access to IMDB. They are fine with modest usage/scraping for non-profit use. They are not fine with systematic, widely used APIs unless licensed. This is partly why I want a "home-made scraper" ... Movie Collectorz is an excellent program I used for years that used to auto-scrape IMDB ... but they were forced to discontinue their scraper because of licensing costs.

Best wishes
Josiah

@TiddlyTweeter

Jan 15, 2018, 7:30:22 AM
to TiddlyWiki
Ciao BJ

One thing that interests me is finer-grained control of extraction.

I was wondering whether there could be a fit between TiddlyClip and the version of your Flexitype plugin that allows sequences of regexes to be run? A use case could be scraping an IMDB page with TiddlyClip for its core movie data section and then passing it on to Flexitype for further precise extraction? Just a thought.
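
To make the "sequence of regexes" idea concrete, here is a minimal sketch in plain JavaScript. It is not TiddlyClip's or Flexitype's actual rule syntax; `runRegexSteps` and the three steps are hypothetical and only show the general pattern of feeding the output of one regex into the next.

```javascript
// Sketch only: a generic chain of regex steps, not TiddlyClip or Flexitype syntax.
// Each step is a [pattern, replacement] pair; the output of one step
// becomes the input of the next.
function runRegexSteps(text, steps) {
  return steps.reduce(function (current, step) {
    return current.replace(step[0], step[1]);
  }, text);
}

// Hypothetical steps: replace tags with spaces, collapse runs of
// whitespace, then drop a leading or trailing space.
var steps = [
  [/<[^>]*>/g, " "],
  [/\s+/g, " "],
  [/^ | $/g, ""]
];

var cleaned = runRegexSteps(
  "<h1>Example  Movie</h1>\n<p>Director:  Someone</p>",
  steps
);
```

Because the steps are just data, the order can be changed or a step added without touching the function that runs them.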

Best wishes
Josiah  

BJ wrote:
Tiddlyclip has been updated, and is available from github.... BUT it is a pre-release at the moment as I am still adding to it.

TonyM

Jan 15, 2018, 8:42:32 AM
to TiddlyWiki
BJ,

Thanks so much for upgrading TiddlyClip. There are a lot of keen users looking forward to it, so I am sure you are generating lots of good karma :)

Tony

BJ

Jan 15, 2018, 11:27:10 AM
to TiddlyWiki
Actually, TiddlyClip has always supported running any number of regexes.

@TiddlyTweeter

Jan 15, 2018, 11:54:44 AM
to tiddl...@googlegroups.com
Ciao BJ

It's unfortunate I'm not so bright on tech :-(. If you could point me to relevant docs or an example I'm sure I'll get there.

BJ

Jan 15, 2018, 12:48:11 PM
to TiddlyWiki
It's easier to use with the new version, which will be out soon - remind me again after the release

BJ

Mark S.

Jan 16, 2018, 11:22:36 PM
to TiddlyWiki
Thanks for maintaining this code, BJ!

A few comments.

There seem to be a LOT of undocumented @variables.

The Image and Link modes don't seem to hide/unhide the rule, but maybe I'm not using it correctly.

How do I make a Link inside a rule? The "|" destroys the table formatting causing the whole rule to not be run.

There is a reference to "Sections", but I only see how to identify categories. How/where are sections used?

It would be handy if there was a linkTEXT (or linkText) variable so that a complete link with name could be captured

Thanks!
Mark

Mark S.

Jan 17, 2018, 10:54:03 AM
to TiddlyWiki
I'm attempting to use macros in tiddlyclip

I have a macro makelink tagged globally with $:/tags/Macro.

Its contents are:

\define makelink(clip link) [[$clip$|$link$]]

In the body column of the clip rule I have:

\n\n((*@makelink(@text @linkURL)*))

But when I run it, I get:

source: invalid val @text @linkURL

and then

makelinkmarco not found

The @text and @linkURL variables worked by themselves. And the makelink macro works fine by itself (not called from tiddlyclip). Is there some sort of registration that has to happen for the macro to work?

Thanks!
Mark

BJ

Jan 17, 2018, 2:04:49 PM
to TiddlyWiki


On Wednesday, January 17, 2018 at 4:22:36 AM UTC, Mark S. wrote:
Thanks for maintaining this code, BJ!

A few comments.

There seem to be a LOT of undocumented @variables.

The Image and Link modes don't seem to hide/unhide the rule, but maybe I'm not using it correctly.
 
As of version 0.1.2 not everything is working as it used to.


How do I make a Link inside a rule? The "|" destroys the table formatting causing the whole rule to not be run.
There are a number of ways to get round this. You could use triple double quotes around the bar, i.e. """|""", but the rule will then display incorrectly in the TiddlyWiki; you will need to remove the 'rule' pragma in the tiddler to see the tiddlyclip rule shown correctly.

There is a reference to "Sections", but I only see how to identify categories. How/where are sections used?
You will only see one section, called 'default'. I've attached my config, which shows more. Each section is the name of a table of categories; these form the basis of the tiddlyclip context menus (see attached screenshots).
 

It would be handy if there was a linkTEXT (or linkText) variable so that a complete link with name could be captured
Maybe in the future; at present you need to highlight the text.

Thanks!
Mark
tiddlyclipconfig.png
contextmenu.png

BJ

Jan 17, 2018, 2:06:18 PM
to TiddlyWiki


On Wednesday, January 17, 2018 at 3:54:03 PM UTC, Mark S. wrote:
I'm attempting to use macros in tiddlyclip

I have a macro makelink tagged globally with $:/tags/Macro.

TiddlyClip macros are JavaScript only and are tagged with $:/tags/tiddlyclip.

Mark S.

Jan 17, 2018, 2:17:26 PM
to tiddl...@googlegroups.com
Is there any other magic? I have the javascript macro, but the variables inside it are "invisible".

My macro:

exports.name = "makelink";
exports.run = function(clip, link) {
    var ret = "[[" + clip + "|" + link + "]]";
    return ret;
};


(tagged with $:/tags/tiddlyclip)

When I run it via a clip, the result is "undefined", suggesting that it doesn't see one or more of the values that are passed.

Also, I'm looking through the code, but I can't find where linkURL and other variables are defined. Any hints?

Thanks!
-- Mark

Edit: I should have said, the results show up like this:

[[undefined|undefined]]

But each of those called variables works when I use them without passing them through a macro


BJ

Jan 17, 2018, 2:39:57 PM
to TiddlyWiki

On Wednesday, January 17, 2018 at 7:17:26 PM UTC, Mark S. wrote:
Is there any other magic? I have the javascript macro, but the variables inside it are "invisible".
 
The macros need to be of type application/javascript and module type 'macro'.

There needs to be a comma between parameters when calling the macro (missing from your example):
((*@makelink(@text,@linkURL)*))

Mark S.

Jan 17, 2018, 2:58:21 PM
to TiddlyWiki
Thanks! It was the comma. Always the little things that get you.

I see now that the @variables are being passed from the web extension. Or at least some of them.

Thanks!
Mark