Tiddlywiki and regexp


Mohammad

Aug 15, 2019, 11:56:24 AM
to TiddlyWiki
TiddlyWiki supports regexp and this feature is quite powerful, yet many of us (like me) have little or no knowledge of it, so I would like to introduce this small resource for learning regexp quickly and easily.



Some time ago TiddlyTweeter had a discussion about giving examples of using regexp in TW, but it seems he is very busy with Polly!


By the way, if someone could create a wiki on GitHub or Tiddlyspot with practical examples of using regexp, it would be quite useful.


Cheers
Mohammad

PMario

Aug 15, 2019, 4:29:46 PM
to TiddlyWiki
Hi,

This https://www.regular-expressions.info/quickstart.html page and all linked pages there are among the best tutorials I've ever found about regexp in different languages.

The descriptions there use color coding and language that actually lets you understand regexp patterns.

have fun!
mario

coda coder

Aug 15, 2019, 6:37:06 PM
to TiddlyWiki

You can even save/share your regexes.

Mohammad

Aug 16, 2019, 1:56:15 AM
to TiddlyWiki
Thank you all!
Is there any page that shows examples and practice of regexp in TiddlyWiki?

These links may be related somehow


Cheers
Mohammad

@TiddlyTweeter

Aug 16, 2019, 2:57:08 AM
to TiddlyWiki
Mohammad wrote:
Is there any page that shows examples and practice of regexp in TiddlyWiki?

I have an almost finished system built on top of your Tiddler Commander for that.

It aims to ...

(1) show what regexes do and how;
(2) allow interactive testing against test data to improve one's regex;
(3) illustrate how you use regex in TW filters.

I don't have time to finish it now. Eventually I will because it interests me.

TT

Mohammad

Aug 16, 2019, 3:13:15 AM
to TiddlyWiki
Hi Josiah,
That's great news! Searching in TiddlyWiki is still one of the wildest areas!

Thank you, and I'm looking forward to seeing it!

--Mohammad

TonyM

Aug 21, 2019, 7:37:17 PM
to TiddlyWiki
Folks,

I have a great use case for some advanced regex. I would like to provide a macro with a tiddler name (default: the current tiddler), a field (default: text) and an HTML tag, e.g. `<table>`, `<tr>`, `<td>`, `<section>`, `<article>`, `<div>` etc., as documented here https://www.w3schools.com/html/html_blocks.asp

I would like a regex to search the target(s) for the HTML pairs, e.g. `<li>A List items</li>` or `<li style etc>A List items</li>`, and return the result optionally with the HTML tags still present or only the content between them.

Perhaps later we could enhance this to interrogate ids and other tag info, e.g. `<div id=nnnn>`

I think this could be implemented in a subfilter eg
\define html-tags() regex filter
<$set name=html-tag value="article">
<$list filter="[[tiddlername]subfilter<html-tags>]">

</$list>
</$set>


Regards
Tony

Jeremy Ruston

Aug 22, 2019, 4:31:05 AM
to tiddl...@googlegroups.com
Hi Tony

There's an old trope in software that one should never use regexps to parse HTML:
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

So, while I'd be happy to see general regexp support improved in TW5, I don't think it's appropriate to specifically shape that support for the task of parsing HTML.

Of course, TW5 already includes an HTML parser so perhaps the best approach might be to explore how to make that functionality be more usefully exposed to wikitext.
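
To make the parser-based approach concrete, here is a minimal sketch of pulling out the text between matching tags with a real HTML parser instead of a regexp. It is only an illustration under stated assumptions: Python's standard `html.parser` stands in for TW5's own HTML parser (whose interface is not shown in this thread), and the class name `TagContentCollector` is made up for the example.

```python
from html.parser import HTMLParser


class TagContentCollector(HTMLParser):
    """Collect the text found inside each outermost <tag>...</tag> pair."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.items = []     # finished text chunks, one per outermost element
        self._buffer = []   # text seen since the outermost tag was opened

    def handle_starttag(self, tag, attrs):
        # Attributes such as style="..." arrive in attrs and are ignored here.
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.items.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Text outside the target tag is simply dropped.
        if self.depth > 0:
            self._buffer.append(data)


collector = TagContentCollector("li")
collector.feed('intro\n<li>A list item</li>\nnoise\n<li style="color:red">Another item</li>\n')
collector.close()
li_items = collector.items
print(li_items)  # ['A list item', 'Another item']
```

Because the parser tracks open and close tags itself, attributes on the opening tag and stray text between elements need no special handling.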

Best wishes

Jeremy



TonyM

Aug 22, 2019, 5:21:34 AM
to TiddlyWiki
Jeremy,

You are aware I do not want so much to parse it as locate the content between matching tags.

Its intention is to access content delimited by html tags inside the text content.

Perhaps we could use it to retrieve items between the section div tags or all instances of text between the li tags.

Regards
Tony

TonyM

Aug 22, 2019, 5:27:29 AM
to TiddlyWiki
I will just add that I am method agnostic. I just want the ability to, in effect, transclude the content between some open and close tags.

Regards
Tony

@TiddlyTweeter

Aug 22, 2019, 9:43:00 AM
to TiddlyWiki
Jeremy

I just saw this and thought it very interesting!

Jeremy Ruston wrote
There's an old trope in software that one should never use regexps to parse HTML:


So, while I'd be happy to see general regexp support improved in TW5, I don't think it's appropriate to specifically shape that support for the task of parsing HTML.

Of course, TW5 already includes an HTML parser so perhaps the best approach might be to explore how to make that functionality be more usefully exposed to wikitext.

I agree that the base parsers are a good way to come at it. Why? Because they work with simple, sound aims. They do not try to regex everything, and they sit inside ASTs that give order.

That said, I think the extent of use of regex under the TW hood is a potential revelation to many, and worth understanding more.

Regarding https://blog.codinghorror.com/parsing-html-the-cthulhu-way/, it's exaggerated. IMO, it's a "straw man" argument about regex. It has its points, but they are way overstated.

A more measured view is that simple structures can be handled quite well with regex directly, but it's not a tool for complex situations. It's merely a great tool for textual deconstruction and reconstruction. At that it's excellent.
 
Best wishes
TT

Mark S.

Aug 22, 2019, 10:22:47 AM
to TiddlyWiki

There's that saying, "When all you have is a hammer, everything starts to look like a nail."

All we have is regex. It would be great to have some other tool for extracting actual DOM-like structures the way you could with TW classic. But we don't have it.

Actually, the tool we have for regexp is also a bit lacking. There's no tool for directly lifting desired target text. The new splitregexp only splits; it doesn't return the text we want to find. Here's my version that does most literally what you ask for:

<$vars realchars="[^\s]+">
<$list filter="[{test}splitregexp[\n]join[ ]splitregexp[<li>]butfirst[1]splitregexp[</li>]butlast[1]regexp<realchars>]">

</$list>
</$vars>

Input:

More text here
<li>line 3</li>
<li>line 2</li>
<li>line 1</li>
More text there

Output


Good luck!
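
For anyone following the filter step by step, here is a minimal sketch of the same split-based pipeline in Python. It is an illustration only, not TiddlyWiki code: Python's `re` module stands in for the JavaScript regular expressions TiddlyWiki uses, the function name `split_pipeline` is made up for this example, and any de-duplication TiddlyWiki may apply to filter results is not modelled.

```python
import re


def split_pipeline(text):
    # splitregexp[\n] then join[ ]: put the whole text on one line
    joined = " ".join(re.split(r"\n", text))
    # splitregexp[<li>] then butfirst[1]: drop everything before the first <li>
    parts = re.split(r"<li>", joined)[1:]
    # splitregexp[</li>]: split every remaining part at the closing tag
    pieces = [piece for part in parts for piece in re.split(r"</li>", part)]
    # butlast[1]: drop whatever follows the final </li>
    pieces = pieces[:-1]
    # regexp<realchars>: keep only pieces with at least one non-whitespace character
    return [p for p in pieces if re.search(r"[^\s]+", p)]


sample = "More text here\n<li>line 3</li>\n<li>line 2</li>\n<li>line 1</li>\nMore text there"
items = split_pipeline(sample)
print(items)  # ['line 3', 'line 2', 'line 1']
```

The whitespace-only pieces between consecutive items are what the final `realchars` test removes; any real text sitting between two items would survive that test.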

TonyM

Aug 22, 2019, 10:58:06 AM
to TiddlyWiki
Mark - Wow,

I will test it out tomorrow to see how far I can take it. 

I hope it works for multi-line tags

My interest would be also the option to return
<li>line 3</li>
<li>line 2</li>
<li>line 1</li>
or
Because keeping the valid tags can be made use of as well.

And also see how to handle the case where the list tag has a style, e.g. <li style="something">; it would be nice if we could return
<li style="something">line 1</li>
or
line 1

If so a lot can be done to extract useful content from html, even if just to summarise some content.

Perhaps further resolution would help like <section name=extract>content</section>

Or extract list items.

Even without external HTML, a tiddler's text field could use HTML block and inline elements (https://www.w3schools.com/html/html_blocks.asp) to structure its content, and such a regex macro could then extract parts of the tiddler text, such as a prepared extract, an excerpt, config settings or more.

Regards
Tony

@TiddlyTweeter

Aug 22, 2019, 11:23:54 AM
to tiddl...@googlegroups.com
Mark S. wrote:
All we have is regex.
 
It would be great to have some other tool for extracting actual DOM-like structures the way you could with TW classic. But we don't have it.

Actually, the tool we have for regexp is also a bit lacking. There's no tool for directly lifting desired target text.

I'd be interested in better documenting the regex operators TW has in the context of what JS regex can do.
I strongly believe that documentation needs referents, i.e. it should be informed by what regex "match" AND "replace" do in standard JS.

In raw form (pre-parser intervention) TW can, of course, do anything JS can, but that is not at the level where most are working.

TT







Mark S.

Aug 22, 2019, 12:08:20 PM
to TiddlyWiki
Re your 2nd question, you can make the filter slightly more robust:

[{test}splitregexp[\n]join[ ]splitregexp[<li.*?>]butfirst[1]splitregexp[</li>]butlast[1]regexp<realchars>]

Re your 1st question, I don't believe you can do this in a single filter. It will probably take multiple lines, if it is possible at all, because there are no core tools for grabbing the actual text you want, only for splitting. People have done a lot with splitting, but it gets tedious.

If you had a regular expression filter that could split and return groups (e.g. #2963) then you could simply search for and lift out the <li ...> group and the content group in one regular expression.
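
As a rough sketch of what "split and return groups" buys you, the following Python uses capture groups to lift the opening tag and the content in one regular expression. It does not show the proposal in #2963 or any existing TiddlyWiki operator; Python's `re` stands in for JS regex, and `LI_PATTERN` and `lift_li` are names invented for this example.

```python
import re

# Group 1: the opening tag with any attributes. Group 2: the content.
# DOTALL lets the content span several lines.
LI_PATTERN = re.compile(r"(<li\b[^>]*>)(.*?)</li>", re.DOTALL)


def lift_li(text, keep_tags=False):
    results = []
    for open_tag, content in LI_PATTERN.findall(text):
        content = content.strip()
        # Either rebuild the element with its original opening tag,
        # or return only what sat between the tags.
        results.append(f"{open_tag}{content}</li>" if keep_tags else content)
    return results


sample = 'before\n<li>line 1</li>\nin between\n<li style="something">line 2</li>\nafter'
contents = lift_li(sample)
tagged = lift_li(sample, keep_tags=True)
print(contents)  # ['line 1', 'line 2']
print(tagged)    # ['<li>line 1</li>', '<li style="something">line 2</li>']
```

Because only the matched groups are returned, text lying between two list items never reaches the output, and the same match serves both the "with tags" and "content only" cases.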

Mohammad

Aug 22, 2019, 12:23:48 PM
to tiddl...@googlegroups.com
Added to TW-Scripts!

Mark,
What does the part below do?

<$vars realchars="[^\s]+">

--Mohammad


Mark S.

Aug 22, 2019, 12:34:06 PM
to TiddlyWiki
That's a regular expression that says "match anything that is not whitespace". It's used to verify that a line is not empty. It has to be defined in a variable because it contains square brackets [].

Thanks!

Mohammad

Aug 22, 2019, 12:52:54 PM
to TiddlyWiki
Many thanks for the clarification.

I need these explanations when documenting your solution in TW-Scripts.

Cheers
Mohammad

TonyM

Aug 24, 2019, 11:50:46 PM
to TiddlyWiki
Mark,

Thanks for this. I only just got to test it; a test tiddler as follows is not working as I would expect:
zfdtshwfthf
<li>Content</li>
sfghn
<li>Content2</li>

sfghsfgh
<li>Content3</li>
sxgfhfgsdh

I would have hoped it would return
Content
Content2
Content3


If it were to return only the content between the `<li>` and `</li>` and not any other content from the test tiddler, I could do this:
\define output()

<$vars realchars="[^\s]+">
<$list filter="[{test data}splitregexp[\n]join[ ]splitregexp[<li.*?>]butfirst[1]splitregexp[</li>]butlast[1]regexp<realchars>addprefix[<li>]addsuffix[</li>]]">

</$list>
</$vars>
\end
<$wikify name=result text="<<output>>">
<<result>>
</$wikify>
This would find all list items in test (HTML copied from somewhere) and create a new list of only the list (li) items in the HTML.

Does that make sense?

Regards
Tony

@TiddlyTweeter

Sep 17, 2019, 7:08:21 AM
to TiddlyWiki
TonyM

It makes great sense to throw away unneeded text BETWEEN tags.

Unfortunately I could not get your version to work.

As far as I can see it just re-adds tags you just took off, and also adds them to text you need to excise.

Yes?
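
To make that observation concrete, here is a small Python sketch that walks through the same steps as the filter on data shaped like the test tiddler. It is a stand-in only: Python's `re` replaces the JS regex engine, `relist_items` is a made-up name, and TiddlyWiki's own result handling (for example any de-duplication) is not modelled.

```python
import re


def relist_items(text):
    # Hypothetical helper mirroring the filter steps; not TiddlyWiki code.
    joined = " ".join(text.split("\n"))                       # splitregexp[\n] join[ ]
    parts = re.split(r"<li.*?>", joined)[1:]                  # splitregexp[<li.*?>] butfirst[1]
    pieces = [p for part in parts for p in re.split(r"</li>", part)]
    pieces = pieces[:-1]                                      # butlast[1]
    kept = [p for p in pieces if re.search(r"[^\s]+", p)]     # regexp<realchars>
    return ["<li>" + p + "</li>" for p in kept]               # addprefix / addsuffix


test_data = (
    "zfdtshwfthf\n<li>Content</li>\nsfghn\n"
    "<li>Content2</li>\nsfghsfgh\n<li>Content3</li>\nsxgfhfgsdh"
)
relisted = relist_items(test_data)
for item in relisted:
    print(repr(item))
# '<li>Content</li>'
# '<li> sfghn </li>'
# '<li>Content2</li>'
# '<li> sfghsfgh </li>'
# '<li>Content3</li>'
```

Only the text before the first `<li>` and after the last `</li>` is discarded; the text between items passes the non-whitespace test and gets wrapped in tags like everything else.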

TT

@TiddlyTweeter

Sep 17, 2019, 7:50:23 AM
to TiddlyWiki
Ciao Mark

I'm late on this. I got really interested in this kind of extraction, which I think there is demand for.

Two issues I can't figure out ...

1 - does "<$vars realchars="[^\s]+">" need to be that? Rather than the equivalent shorthand "<$vars realchars="\S+">"? (Where you would not need the variable as there is no need for "[...]"??)

2 - WHEN you have text BETWEEN tags, is there a way to dump it?
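
On point 1, in plain regex terms `[^\s]+` and `\S+` describe the same thing: one or more non-whitespace characters. The short Python check below illustrates that on a few sample strings; Python's `re` stands in for JS regex here, and whether `\S+` can then be written inline in a TiddlyWiki filter is a separate question this sketch does not test.

```python
import re

negated_class = re.compile(r"[^\s]+")  # "anything that is not whitespace", as a character class
shorthand = re.compile(r"\S+")         # the built-in shorthand for the same set

samples = ["line 1", "   ", "", "\t\n", " x "]

# Compare every match each pattern finds in each sample.
agree = all(negated_class.findall(s) == shorthand.findall(s) for s in samples)
print(agree)  # True

# Both reject a whitespace-only line and accept one with real characters.
print(negated_class.search("   "), shorthand.search("   "))  # None None
print(shorthand.search(" x ").group())  # x
```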

Only if you have time and interest!

Best wishes
TT 

TonyM

Sep 17, 2019, 7:56:22 AM
to TiddlyWiki
Yt?

In this case I was extracting all list items from a more complex html source then relisting the items.
The result is clean with only list items.

Regards
Tony

@TiddlyTweeter

Sep 17, 2019, 8:03:02 AM
to TiddlyWiki
Right.

It's clean when you have consecutive items.

I'm trying to work out what to do when you don't.

TT