Capture web page with formatting


dukeja

Jan 23, 2009, 11:23:15 AM
to TiddlyWiki
The thing I find most limiting in using TiddlyWiki for information or
knowledge management is the difficulty of "snipping" or "clipping" a
section or a whole page into TiddlyWiki. (Admittedly, it is not easy
to preserve web page formatting when cutting and pasting web content
into MS Word or OneNote either.) I have heard of Tiddlysnip, but it
seems to work only in plain text mode; rich text operation is on its
to-do list.

I have tried using a WYSIWYG editor (FCKeditor) to cut and paste web
content into a tiddler. The results are mixed. Some pages are
copied with very good format accuracy; others fail totally. I
know I may be naive about how technically difficult this
is, but I wonder if anyone can give some hints on how to do it
correctly. E.g. copy the web content into Nvu? Or save the web page and
then import it with some plugin?

Thanks for the help!

Duke

Mark S.

Jan 23, 2009, 2:27:09 PM
to TiddlyWiki
I've also been interested in this issue.

One of my conclusions is that the only application that does a really
good job of preserving content, images, and format is an application
built for exactly that purpose (mainly Surfulator). SQLnotes does a
decent job, though not quite as nice as Surfulator. Some of the other
dedicated web-clip applications only do a so-so job.

You can print out to a PDF file, and the look of the site is preserved
perfectly, but you're missing all the links and real text (unless your
PDF print engine can also do OCR).

With TW, one approach is to cut and paste the source of the target web
page into your tiddler, taking everything that you want from within
the <body> tags of the target page, and pasting it between the
<html></html> tags of the TW. But then the formatting will be that of
TW, not of the target page. You might be able to work around this by
cutting and pasting the stylesheet rules for the target page into the
stylesheet tiddler, giving them an extra outer class (e.g. .mypage h
{...} ) so that they don't clash with the tiddler styles, and then
enclosing your pasted text in a TW class wrapper (e.g.
{{mypage{<html>...</html>}}} ).
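
The steps above can be sketched as a small script. This is only a rough illustration, not a tested workflow: the function names and the "mypage" class are made up for the example, the CSS handling is deliberately naive (flat rules only, no @media blocks or comments), and it assumes the classic TW wrapper syntax {{class{...}}} around an <html> block.

```python
import re


def extract_body(page_html: str) -> str:
    """Return the markup between <body ...> and </body>.

    Falls back to the whole input if no body element is found.
    """
    match = re.search(r"<body[^>]*>(.*?)</body>", page_html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else page_html.strip()


def scope_css(css: str, outer_class: str) -> str:
    """Prefix every selector of flat CSS rules with .outer_class.

    Naive on purpose: nested blocks (@media), comments and braces inside
    strings are not handled. Selectors such as `body` or `html` will no
    longer match after prefixing and need manual attention.
    """
    scoped = []
    for selectors, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        prefixed = ", ".join(
            f".{outer_class} {s.strip()}"
            for s in selectors.split(",") if s.strip()
        )
        scoped.append(f"{prefixed} {{{body.strip()}}}")
    return "\n".join(scoped)


def wrap_for_tiddler(body_html: str, outer_class: str) -> str:
    """Wrap the pasted markup in a TW class wrapper around an <html> block."""
    return "{{" + outer_class + "{<html>" + body_html + "</html>}}}"


if __name__ == "__main__":
    page = "<html><head></head><body><h1>Title</h1><p>Text</p></body></html>"
    css = "h1, p.note { color: red; }\na { text-decoration: none }"
    print(scope_css(css, "mypage"))                        # goes into the stylesheet tiddler
    print(wrap_for_tiddler(extract_body(page), "mypage"))  # goes into the tiddler text
```

The scoped rules would go into the stylesheet tiddler and the wrapped markup into the tiddler itself; image references in the pasted markup still point at the original site.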

Whew! What a lot of work. And it won't be portable offline if
the page depends on images for its appearance.

Thinking about it from a different view, why do I need to preserve
*everything* on a web page? What I usually want is the text, one or
two images, the original url, and maybe one or two useful links.
Frequently pages are cluttered with banner ads, links to unrelated
information, etc. Why not just capture the useful stuff, and ignore
the rest?

One (1) way to do this is to do a screen capture and copy the captured
file to a subdirectory below TW. Use an image link (with the sizing
feature if necessary) to view the page. Use tiddlysnip to copy the
essential text you want and paste it below the image. If you keep the
images in a directory below your TW file, then you only have to copy one
directory to your USB drive when it's time to hit the road.

Or (2), copy the text you want with tiddlysnip, save the images you
want in a TW sub-directory, link up the images, and apply any
formatting you want. Insert any links from the source document that
are useful. Tiddlysnip will already have captured the source url. A
little more work, but the result is a reference page probably more
useful than the original page, and information that is consistent
across tiddlers.
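
As a rough sketch of what the tiddler text for (1) or (2) might look like once assembled: the helper below is hypothetical (its name, the "images" directory and the "original page" label are just for illustration), and it assumes the classic TW markup [img[tooltip|file]] for images and [[label|url]] for links.

```python
def build_clip_tiddler(text, source_url, image_files=(), links=(),
                       image_dir="images"):
    """Assemble tiddler text from a clipped snippet.

    text        -- the plain text that was clipped
    source_url  -- URL of the page the text came from
    image_files -- file names already saved under image_dir (below the TW file)
    links       -- (label, url) pairs worth keeping from the source page
    """
    parts = []
    if image_files:
        # One image per line, using a path relative to the TW file.
        parts.append("\n".join(
            f"[img[{name}|{image_dir}/{name}]]" for name in image_files))
    parts.append(text.strip())
    if links:
        parts.append("\n".join(
            f"* [[{label}|{url}]]" for label, url in links))
    parts.append(f"Source: [[original page|{source_url}]]")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_clip_tiddler(
        "The essential paragraph copied from the page.",
        "http://example.com/page",
        image_files=["shot.png"],
        links=[("Docs", "http://example.com/docs")],
    ))
```

Because the image paths are relative, copying the TW file together with the one image directory keeps the result portable.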

Another idea (3) that comes to mind is that you could save an entire
website in a directory below TW. There's a Tiddler plugin
(MiniBrowser ?) that will allow you to display a web page or url
inside of a tiddler. You could capture text information and insert it
into your tiddler, possibly hidden so that it can be searched. The
tiddler would use the plugin to display the site you have saved.
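
A rough sketch of idea (3), with two loud assumptions: it swaps the plugin for a plain <iframe> inside an <html> block, and it hides the captured text in a classic TW comment (/% ... %/) on the assumption that the search looks at the raw tiddler text. Neither has been checked against a particular TW version, and the function name, path and height are made up for the example.

```python
from html import escape


def build_embedded_site_tiddler(saved_page_path, searchable_text="",
                                height_px=600):
    """Return tiddler text that shows a locally saved page in an iframe.

    saved_page_path -- path of the saved page, relative to the TW file
    searchable_text -- optional text captured from the page, stored in a
                       hidden comment so it can be found by searching
    """
    src = escape(saved_page_path, quote=True)
    iframe = (f'<html><iframe src="{src}" '
              f'style="width:100%;height:{height_px}px"></iframe></html>')
    if not searchable_text.strip():
        return iframe
    # Break up any "%/" in the text so it cannot end the comment early.
    safe_text = searchable_text.strip().replace("%/", "% /")
    return iframe + "\n/%\n" + safe_text + "\n%/"


if __name__ == "__main__":
    print(build_embedded_site_tiddler(
        "sites/example/index.html",
        searchable_text="keywords copied from the saved page",
    ))
```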

Like I said, I haven't settled on one solution, but these are the ones
I've been experimenting with.

-- Mark

Joe Andrieu

Jan 23, 2009, 6:40:23 PM
to Tiddl...@googlegroups.com
Duke,

As I understand it, there isn't an easy way to capture HTML from within a
browser without a plug-in--because the javascript can't get access to the
full clipboard data with the HTML in it.

I'd love to be wrong about that, but I worked through this with Drag & Drop
between HTML pages. That uses the clipboard, just like copy & paste: you can
get the source URL and the TXT version of the source content from within
javascript, but not the source HTML. Again... correct me if someone knows
how to do that. Maybe things have changed or I'm just in error.

However, I am working on a plug-in based solution for this, using
TiddlyWiki. We (SwitchBook) are building a bit more than just a
scrapbooking tool, but one thing we will enable is a way to capture from web
pages and have that auto-import as a Tiddler in a TiddlyWiki hosted in the
explorer (vertical) toolbar for IE on the PC. Eventually, we'll get to
Firefox and then to Mac and Linux. The trick is that we are relying on
access to the browser DOM to get the capture data (and a bunch of behavioral
data as well) so we can render it accurately later. Eventually, we'll be
able to easily do a bunch of cool TiddlyWiki stuff with content captured
from websites, and yes, we'll make it easy to load your own TiddlyWiki app
into the plug-in window, so you can capture to your own TiddlyWiki if that's
what you want to do.

As Mark S. mentions, there are a lot of tricky things to get the captures to
work right. Hopefully we can solve that problem in a way lots of folks can
use.

We are still in development and while we met a bunch of the TiddlyWiki
community at Tiddly West last year, this is really the first introduction to
the community at large.

Hello. =)

You can get a sense of what we are working towards in the following blog
posts. We call it User-driven Search, and there isn't much at our official
website yet.

http://blog.joeandrieu.com/2008/07/12/towards-user-driven-search/

http://blog.joeandrieu.com/2008/07/20/notes-on-user-driven-search/

http://blog.joeandrieu.com/2009/01/19/farewell-google-notebook-move-over-searchwiki-we-need-a-search-map/


We intend to release all of the client-side code (Tiddly and Plug-in) as open
source. (Hopefully sooner rather than later.) We'll also release a reference
implementation of the server as open source.

So, greetings fellow TiddlyWikiers. Nice to meet you all and I'm looking
forward to getting to know more of you. If you are curious or want to get
involved, drop me a line or post a question to the list. We're fairly open
about what we're trying to do.

Cheers,

-j

--
Joe Andrieu
SwitchBook
http://www.switchbook.com
j...@switchbook.com
+1 (805) 705-8651

Mark S.

Jan 24, 2009, 12:14:03 AM
to TiddlyWiki
Just experimenting with idea (3). This does seem to be the way to go
if you really want to save a web page in detail, including links and
images. Firefox seems to do a good job of saving everything needed to
reconstruct the page. Of course, then you're not carrying it around in
one file, though with a utility like rsync it should still be pretty
portable. The other problem, if you don't have a 20 inch screen, is
that there may not be room for TW and the target web site -- you'll
have to scroll the internal web page left/right to see it all.

-- Mark

On Jan 23, 11:27 am, "Mark S." <throa...@yahoo.com> wrote:
> I've also been interested in this issue.
>

skye riquelme

Jan 24, 2009, 12:10:51 PM
to TiddlyWiki
Hi Again

Just to comment that I often work with the right side bar toggled off
(TiddlyTools) and sometimes the left one as well!! And with header
that just has a simple one-line menubar.....the whole TW takes up
almost no screen real-estate...I put the TiddlyTools toggle arrows in
the top menu...when I am working on the TW structure they are both
toggled on...for simple browsing ...toggled off!!

Skye

Eric Shulman

Jan 24, 2009, 12:35:09 PM
to TiddlyWiki
> Just to comment that I often work with the right side bar toggled off
> (TiddlyTools) and sometimes the left one as well!! And with header
> that just has a simple one-line menubar.....the whole TW takes up
> almost no screen real-estate...I put the TiddlyTools toggle arrows in
> the top menu...when I am working on the TW structure they are both
> toggled on...for simple browsing ...toggled off!!

You might also want to try this one:
http://www.TiddlyTools.com/#ToggleFullScreen

It basically combines ToggleLeftSidebar, ToggleRightSidebar and
ToggleSiteTitles into a single click. In addition, it automatically
adds a 'floating' button in the upper right corner of the page that
restores the full display.

Plus: it works as an "instant bookmarklet"... just drag the
"fullscreen" link from the above tiddler on TiddlyTools, and drop it
directly on your *browser's* toolbar to create a new button. All the
needed code is self-contained in the toolbar button, and you can click
it at any time to toggle fullscreen on/off... and it works for *any*
standard TiddlyWiki document... without needing to install anything at
all in those documents!!

enjoy,
-e
Eric Shulman
TiddlyTools / ELS Design Studios