HTML Cleaner: Remove HTML tags and attributes, inline styles, clean up everything to get plain text only

bimlas

Oct 14, 2019, 4:38:02 AM
to TiddlyWiki
Dear all,

When you copy text from a web page and paste it into the wiki, it usually appears in the style of the web page (for example, it has a white background or appears in a different font).

I just found an online tool that makes it easier to import HTML texts into TiddlyWiki: Select and copy the desired section from a web page, paste it into this tool and press the "Clean HTML" button to delete the unnecessary parts (inline style, classes). You can paste the stripped text into the wiki without any problems, and it will have the same appearance as other tiddlers.

Different options can be set for what to delete.


It is also possible to delete all HTML markup.
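
A minimal sketch of what such a clean-up does, in Python using only the standard library. This is not the code behind the online tool. The names `Cleaner`, `clean_html` and `DROPPED_ATTRS`, and the choice of which attributes count as noise, are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Attributes treated as presentational noise (an assumption for this sketch,
# not the online tool's actual option list).
DROPPED_ATTRS = {"style", "class", "id"}


class Cleaner(HTMLParser):
    """Re-emit HTML without the dropped attributes, or as plain text only."""

    def __init__(self, keep_tags=True):
        super().__init__(convert_charrefs=True)
        self.keep_tags = keep_tags
        self.out = []

    def handle_starttag(self, tag, attrs):
        if not self.keep_tags:
            return
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name not in DROPPED_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, with no end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if self.keep_tags:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape the text
        # only when the output is still HTML.
        self.out.append(escape(data, quote=False) if self.keep_tags else data)


def clean_html(source, keep_tags=True):
    parser = Cleaner(keep_tags)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Calling `clean_html(text)` keeps the tags but drops the listed attributes, while `clean_html(text, keep_tags=False)` returns only the text. The sketch makes no attempt to skip the contents of `<script>` or `<style>` elements.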

@TiddlyTweeter

Oct 14, 2019, 5:41:51 AM
to TiddlyWiki
bimlas

https://html-cleaner.com/

Nice tool. Useful.

FWIW, it should be possible to make a tool in TW to do that. Plus optionally convert HTML to WikiText.

When I get time I'll make a prototype.
There has also been some work that does conversion (I can't find it at the moment), and I think someone did it for Wikipedia pages?

Best wishes
TT

Mark S.

Oct 14, 2019, 10:02:06 AM
to TiddlyWiki
Somewhere, lost in this forum, I made such a cleaner/converter that works with BJ's tiddlyclip.
It uses regular expressions to do a usually-but-not-always successful cleanup. It uses JS,
so purists may not like it.

A more ambitious approach would be to load the HTML into its own DOM tree, find and parse
the elements, and convert to TW text.

Also in the forum is something we did (I believe you helped) with pandoc.

For tasks like these, it would be handy if TW spoke markdown natively. There are already
libraries and tools for HTML-to-markdown.

@TiddlyTweeter

Oct 14, 2019, 11:03:47 AM
to TiddlyWiki
Mark S. wrote:
Somewhere, lost in this forum, I made such a cleaner/converter that works with BJ's tiddlyclip.
It uses regular expressions to do a usually-but-not-always successful cleanup. It uses JS,
so purists may not like it.

Mark S.

Oct 14, 2019, 11:43:34 AM
to TiddlyWiki
Looks like it!

TonyM

Oct 14, 2019, 7:41:04 PM
to TiddlyWiki
Mark,

This is highly speculative!

I think this is the right time to mention it, because you are suggesting a mechanism similar to one I have been wondering about, and I wonder whether it could be generalised further.

Your talk of loading the HTML into its own DOM tree is interesting. I do not have the skills to do it, but I have been wondering whether we could use HTML forms and the like this way, with a TiddlyWiki widget pulling the entered values out of that separate DOM as variables, or into fields or tiddlers, rather than submitting to PHP etc.

I know this sounds a little odd and obscure, but I believe it may be a way to integrate the HTML solutions that proliferate on the web with TiddlyWiki. In addition, because they operate in a separate DOM, such interactions would not trigger a TiddlyWiki refresh until the activity has ended.

Why would this be important? As far as I understand, most HTML solutions have multiple pages, each with its own head and body, and moving between pages loads a whole new page. TiddlyWiki in effect lets us stay inside a single page, and this is what integrates all the features we love TiddlyWiki for, including its quine nature, the single-file option, and its refresh tree. All brilliant stuff. The same advantage also makes it difficult to introduce some functionality, because coded solutions must participate in the TiddlyWiki tree via widgets and plugins, which forces them to be integrated into TiddlyWiki. My thought is this: if other websites can move from page to page, introducing their own JavaScript and CSS as needed, is there a way to emulate this with a generic solution that transfers an independent DOM into TiddlyWiki objects? For example, I imagine a replacement for the HTML submit that, rather than posting to a PHP script or running JavaScript, posts the results into a tiddler, variables and/or fields, which the TiddlyWiki can then respond to.

Enabling this would allow additional solutions normally hosted outside TiddlyWiki to be brought inside it, further strengthening the single-file model. It would also allow more sophisticated ways to interact with the TiddlyWiki's host or other websites.

I have already experimented with some HTML/JavaScript and PHP solutions partially embedded inside TiddlyWiki, with some success, but they often involve multiple files in the same folder as the TiddlyWiki. Many of these files, such as CSS, data and HTML, can be hosted inside the TiddlyWiki. As far as I can see, all that remains is to deal with the following:
  • Passing the results of such interactions into TiddlyWiki rather than needing to post outside
  • Some issues with a shared HTML head/body

Please forgive my speculation. I think I have enough knowledge to see the possibilities, but not enough to ask in technical terms so that others understand what I am pointing towards. So I try to voice this when the opportunity arises, and your comment suggested that perhaps you could help consider these ideas, because you said:

A more ambitious approach would be to load the HTML into its own DOM tree, find and parse the elements, and convert to TW text.


Regards
Tony

Mark S.

Oct 15, 2019, 10:53:51 AM
to TiddlyWiki
Hi Tony

I wasn't thinking of anything that ambitious. JS has a tool that will let you take a clump of HTML text and convert it
into a JS DOM object. You can then follow the DOM tree inside the object to parse your text and formatting. Then
use that info to recreate TW text.
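
A rough sketch of that parse-then-walk idea. In a browser the "tool" would presumably be `DOMParser` (that reading is an assumption). To stay runnable outside a browser, the sketch below instead builds a small tree with Python's standard `html.parser` and walks it to emit TiddlyWiki wikitext. The names `Node`, `TreeBuilder`, `to_wikitext` and `html_to_wikitext` are hypothetical, and only a handful of tags are mapped.

```python
from html.parser import HTMLParser

# Elements that never have content or a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Tag-to-wikitext mappings; deliberately incomplete.
INLINE = {"b": "''", "strong": "''", "i": "//", "em": "//", "u": "__"}
HEADINGS = {"h1": "!", "h2": "!!", "h3": "!!!"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain strings


class TreeBuilder(HTMLParser):
    """Build a simple element tree, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        self.stack[-1].children.append(Node(tag, attrs))

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray ends.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def to_wikitext(node):
    if isinstance(node, str):
        return node
    inner = "".join(to_wikitext(child) for child in node.children)
    if node.tag in INLINE:
        return f"{INLINE[node.tag]}{inner}{INLINE[node.tag]}"
    if node.tag in HEADINGS:
        return f"{HEADINGS[node.tag]} {inner.strip()}\n\n"
    if node.tag == "a" and node.attrs.get("href"):
        return f"[[{inner}|{node.attrs['href']}]]"
    if node.tag == "li":
        return f"* {inner.strip()}\n"
    if node.tag == "p":
        return f"{inner.strip()}\n\n"
    if node.tag == "br":
        return "\n"
    return inner  # unknown tags: keep the text, drop the markup


def html_to_wikitext(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return to_wikitext(builder.root).strip()
```

Because the conversion works on the tree rather than on the raw string, attributes such as inline styles simply never reach the output, and an unclosed `<b>` is still treated as one element.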

Re your idea to use JS forms, I don't think that per se is possible in TW because of the way it is rendered. But
you might look at Jed's work with federation. What (I think) he does is pull a TW into an iframe, and then
communicate with the TW from the outside. Possibly you could use the same technique, putting whatever
form elements you need inside an HTML page that goes inside an iframe, then passing information in and out
of the frame to the calling TW. Speculative also.

@TiddlyTweeter

Oct 15, 2019, 11:34:05 AM
to TiddlyWiki
Mark S. wrote:
... JS has a tool that will let you take a clump of HTML text and convert it
into a JS DOM object. You can then follow the DOM tree inside the object to parse your text and formatting. Then
use that info to recreate TW text.

Sounds good, in that it might force an "AST" that deals with some of the issues when the HTML is malformed?

A downside with raw regex is that error avoidance gets very baroque, since it's not designed for it.

When the HTML is well formed, regex does well.
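
A small illustration of how quickly a naive tag-stripping pattern goes wrong, next to a parser-based version. Both helper names are hypothetical, the pattern is only one common naive choice, and the Python standard library parser stands in here for whatever parser a real tool would use.

```python
import re
from html.parser import HTMLParser


def strip_tags_regex(source):
    # Naive pattern: everything from "<" up to the next ">" counts as a tag.
    return re.sub(r"<[^>]+>", "", source)


class TextOnly(HTMLParser):
    """Collect text content and ignore all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_parser(source):
    parser = TextOnly()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)
```

On simple input such as `<p>Hello <b>world</b></p>` the two agree. On `<a title="a > b">link</a>` the pattern stops at the `>` inside the attribute value and leaves part of the tag behind, while the parser still returns just `link`.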

Just a comment
TT


@TiddlyTweeter

Oct 15, 2019, 2:18:30 PM
to TiddlyWiki
Mark S. wrote:
A more ambitious approach would be to load the HTML into its own DOM tree, find and parse the elements, and convert to TW text.

TonyM wrote:
This is highly speculative!

Really? 

To me it looks like a sensible incremental method to reduce errors in parsing. (Coda: IF it works.)

A limited innovation congruent with its genesis. IMO. 

Best wishes
TT

Suzanne McHale

Oct 16, 2019, 1:15:44 AM
to TiddlyWiki
I'd like an HTML-to-TW markup converter (macro?), if that were possible! I have a lot of documents in HTML markup that I would like to put into a TW with its native markup.