[OT] Help for a RegEx

Fonntuggnio

Aug 2, 2023, 1:39:17 PM

Sorry for being totally OT, but I failed to build a RegEx with
the "help" (rotfl) of three different so-called AIs, getting
nowhere.

I am scanning an HTML document (not in JavaScript, so I do
not have access to DOM nodes from inside) and I need to
match EVERY whole <p> tag.

By whole I mean starting from the <p up to the corresponding
</p>, but such paragraphs MAY (or may not) contain:

a long list of attributes, with zero or more valid \n \r \t
characters before the >;

an innerText, possibly multiline, also with zero or more
\n \r \t characters inside the text.

I have tried most suggestions from Bearly, ChatGPT and
You.com AI, but none worked.

(My test bed is the RegEx engine of the Kate editor with the
HTML loaded. It is handy since it highlights the matches in
yellow, and I can verify that the RegExes I tried fail to
detect perfectly valid paragraphs.)

If somebody happens to be familiar with RegExes supporting
"invisible" characters ... I'd be very grateful for any hint.
Ciao!

Christian Gollwitzer

Aug 2, 2023, 2:00:28 PM

On 02.08.23 at 19:38, Fonntuggnio wrote:

>
> Sorry for being totally OT, but I failed to build a RegEx with the "help"
> (rotfl) of three different so-called AIs, getting nowhere.
>
> I am scanning an HTML document (not in JavaScript, so I do not have
> access to DOM nodes from inside) and I need to match EVERY whole <p> tag.
>
> By whole I mean starting from the <p up to the corresponding </p>, but
> such paragraphs MAY (or may not) contain:

This may not be possible at all. RegExes cannot match nesting pairs,
i.e. if your <p>...</p> contains other <p>...</p> pairs then you have
reached the end of what an RE is capable of. Also, due to the way these
tags are structured, you need at least negative lookahead for it, which
not all RE engines support either.

If you do
<p.*</p>

then the RE would match from the first <p> to the last </p>, hence you
need to restrict the .* with a lookahead like (?!</p>), or use a
non-greedy RE.
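
To make the greedy / non-greedy / lookahead distinction concrete, here is a small sketch in Python (chosen only for illustration; the sample string is made up, and other engines may spell these constructs differently):

```python
import re

# Made-up sample: two paragraphs on a single line.
html = "<p>first</p><p class='x'>second</p>"

# Greedy: .* runs to the end of the line and backs up to the LAST </p>.
greedy = re.findall(r"<p.*</p>", html)

# Non-greedy: .*? stops at the FIRST </p> it can reach.
lazy = re.findall(r"<p.*?</p>", html)

# Negative lookahead: consume one character at a time, but never
# step onto the start of a closing </p>.
tempered = re.findall(r"<p(?:(?!</p>).)*</p>", html)

print(len(greedy), greedy)
print(len(lazy), lazy)
print(len(tempered), tempered)
```

On this input the greedy pattern yields a single match covering both paragraphs, while the other two yield one match per paragraph. In none of the three does '.' cross a line break by default.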

> a long list of attributes, with or without zero or more \n \r \t
> characters, valid, before the >.

> (my test is the RegEx engine from KATE Editor with the loaded HTML.

> If sb happens to be familiar with RegEx supporting "invisible"
> characters ... I'd be very grateful for any hint.

This may well be the problem. Some RE engines treat newline
characters as special, i.e. it may be that Kate matches only *within* a
line.

In short: maybe an RE engine is simply not a good tool for this. In that
case, use an XML parser instead.

Christian

Paavo Helde

Aug 2, 2023, 2:16:09 PM

02.08.2023 20:38 Fonntuggnio wrote:
>
> Sorry for being totally OT, but I failed to build a RegEx with the "help"
> (rotfl) of three different so-called AIs, getting nowhere.
>
> I am scanning an HTML document (not in JavaScript, so I do not have
> access to DOM nodes from inside) and I need to match EVERY whole <p> tag.
>
> By whole I mean starting from the <p up to the corresponding </p>, but
> such paragraphs MAY (or may not) contain:

I'm afraid HTML cannot be parsed with a regex in general. Also, the HTML
rules are very lax; for example, there is no guarantee that a
corresponding terminating tag actually appears.

Also, there is no guarantee that the actual content is contained in the
<p> tags; it might well be outside, and all the <p> tags might actually
be empty.

For extracting the content of unknown pages reliably you would probably
need some kind of state machine, with a fair knowledge of obscure HTML
rules. Of course, there are libraries for that.
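
One such library is Python's standard html.parser, which is itself a small state machine over the markup. What follows is a minimal sketch, assuming only the text inside <p> elements is wanted; the class name ParagraphCollector and the helper paragraph_texts are made up for this illustration:

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.inside = False      # True while we are inside a <p>
        self.chunks = []         # text pieces of the current paragraph
        self.paragraphs = []     # finished paragraphs

    def _finish(self):
        if self.inside:
            self.paragraphs.append("".join(self.chunks))
            self.chunks = []
            self.inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._finish()       # a new <p> implicitly ends an unclosed one
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish()

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

    def close(self):
        super().close()
        self._finish()           # flush a paragraph left open at end of input


def paragraph_texts(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = '<p class="a"\n\tstyle="x">one <b>bold</b>\nline</p><p>two<p>three</p>'
print(paragraph_texts(sample))
```

Attributes spread over several lines and a missing </p> are both tolerated here, which is exactly where the regex attempts tend to struggle.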

Scott Lurndal

Aug 2, 2023, 2:20:45 PM

The best way to deal with HTML is to use XSLT processors. You may want
to run the HTML text through a canonicalizer first.

Keith Thompson

Aug 2, 2023, 2:58:43 PM

Fonntuggnio <JoeFonn...@libbbero.it> writes:
> Sorry for being totally OT, but I failed to build a RegEx with the "help"
> (rotfl) of three different so-called AIs, getting nowhere.
>
> I am scanning an HTML document (not in JavaScript, so I do not have
> access to DOM nodes from inside) and I need to match EVERY whole <p>
> tag.

[...]

https://stackoverflow.com/a/1732454/827263

"TONY THE PONY HE COMES"

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

MarioCPPP

Aug 2, 2023, 6:53:02 PM

On 02/08/23 20:00, Christian Gollwitzer wrote:
> On 02.08.23 at 19:38, Fonntuggnio wrote:
>>
>> Sorry for being totally OT, but I failed to build a RegEx with
>> the "help" (rotfl) of three different so-called AIs,
>> getting nowhere.
>>
>> I am scanning an HTML document (not in JavaScript, so I do
>> not have access to DOM nodes from inside) and I need to
>> match EVERY whole <p> tag.
>>
>> By whole I mean starting from the <p up to the corresponding
>> </p>, but such paragraphs MAY (or may not) contain:
>
> This may not be possible at all. RegExes cannot match
> nesting pairs, i.e. if your <p>...</p> contains other <p>...</p>
This may be safely excluded. Other types of tags may be nested,
but not <p> itself. Is it still a problem?

> pairs then you have reached the end of what an RE is capable
> of. Also, due to the way these tags are structured, you need
> at least negative lookahead for it, which not all RE
> engines support either.
>
> If you do
> <p.*</p>
>
> then the RE would match from the first <p> to the last </p>,
> hence you need to restrict the .* with a lookahead like
> (?!</p>), or use a non-greedy RE.

Alas, the .* seems to fail on multi-line text and tabs.

>
>> a long list of attributes, with or without zero or more \n
>> \r \t characters, valid, before the >.
>
>> (my test is the RegEx engine from KATE Editor with the
>> loaded HTML. If sb happens to be familiar with RegEx
>> supporting "invisible" characters ... I'd be very grateful
>> for any hint.
>
> This may as well be the problem. Some RE engines treat
> newline characters as special, i.e. it may be that Kate
> matches only *within* a line.

Hmm, interesting.
What other editor would you recommend then?

>
> In short - maybe a RE engine is simply not a good tool to do
> that. Then use an XML parser instead.
>
> Christian
>

--
1) Resist, resist, resist.
2) If everyone pays taxes, the taxes get paid by everyone
MarioCPPP

MarioCPPP

Aug 2, 2023, 6:55:40 PM

On 02/08/23 20:15, Paavo Helde wrote:
> 02.08.2023 20:38 Fonntuggnio wrote:
>>
>> Sorry for being totally OT, but I failed to build a RegEx with
>> the "help" (rotfl) of three different so-called AIs,
>> getting nowhere.
>>
>> I am scanning an HTML document (not in JavaScript, so I do
>> not have access to DOM nodes from inside) and I need to
>> match EVERY whole <p> tag.

It is HTML generated by LibreOffice from an .odt, so rather well
formed (if not elegant).

>>
>> By whole I mean starting from the <p up to the corresponding
>> </p>, but such paragraphs MAY (or may not) contain:
>
> I'm afraid HTML cannot be parsed with a regex in general.
> Also, the HTML rules are very lax, for example there is no
> such guarantee that there actually appears a corresponding
> terminating tag.

I have read a lot of the generated code by hand, and I found
no evidence of bad formatting.

>
> Also, there is no guarantee that the actual content is
> contained in the <p> tags; it might well be outside, and all
> the <p> tags might actually be empty.

True, <td> and some others contain some rendered text. But
for my purposes just the paragraphs would suffice.

>
> For extracting the content of unknown pages

they are not unknown : they are .odt exported as HTML, by
LibreOffice.

> reliably you
> would probably need some kind of a state machine, with a
> fair knowledge of obscure HTML rules. Of course, there are
> libraries for that.
>

MarioCPPP

Aug 2, 2023, 6:56:38 PM

Both terms were unknown to me until now, so thank you:
now I can do some searching.

Paavo Helde

Aug 3, 2023, 4:36:29 AM

03.08.2023 01:55 MarioCPPP wrote:
> On 02/08/23 20:15, Paavo Helde wrote:
>>
>> For extracting the content of unknown pages
>
> they are not unknown : they are .odt exported as HTML, by LibreOffice.

Well, that makes things easier. If we can exclude some complications
like CDATA, HTML comments and nested tags, then it might indeed be
possible to use a regex to extract some content.

Be sure to use a non-greedy regex to match the closest end tag </p>, and
the equivalent of /s or dotall for '.' to match newlines (or use
(.|\r|\n) instead of dot). This seems to work at first glance:

grep -Po '<p (.|\r|\n)*?</p>' abc.xhtml

(-P is needed for grep to support non-greedy search).
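
The two pieces of advice above (non-greedy matching, and letting the dot cross newlines) can be tried out in a few lines of Python. This is only a sketch on a made-up sample; re.DOTALL is Python's name for the /s behaviour, and the group is written (?:...) so that findall returns whole matches rather than the group:

```python
import re

# Made-up sample: attributes and text both span several lines.
html = '<p class="a"\n\tstyle="x">multi\nline</p>'

# Without any flag, '.' refuses to cross the newline: no match at all.
no_flag = re.findall(r"<p .*?</p>", html)

# With DOTALL, '.' matches newlines too.
dotall = re.findall(r"<p .*?</p>", html, flags=re.DOTALL)

# Same effect without a flag: spell out the newline characters.
alternation = re.findall(r"<p (?:.|\r|\n)*?</p>", html)

print(no_flag)
print(dotall)
print(alternation)
```

Whether a given editor or tool offers an equivalent flag depends on its regex engine, so the explicit (.|\r|\n) form is the more portable of the two.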

MarioCPPP

Aug 3, 2023, 8:01:58 PM

On 03/08/23 10:36, Paavo Helde wrote:
> <p (.|\r|\n)*?</p>

Interesting. I tried this one and it detects most of the
paragraphs, except those that do not have attributes
within the opening tag.

Is there a way to include those as well?

Ben Bacarisse

Aug 3, 2023, 8:26:36 PM

MarioCPPP <NoliMihiFran...@libero.it> writes:

> On 03/08/23 10:36, Paavo Helde wrote:
>> <p (.|\r|\n)*?</p>
>
> Interesting. I tried this one and it detects most of the paragraphs,
> except those that do not have attributes within the opening tag.
>
> Is there a way to include those as well?

PH's regex insists on a space after the "<p". Whilst this is not
exactly the same as requiring an attribute it will be effectively the
same. You could try

<p[ >](.|\r|\n)*?</p>

but I can't stress enough -- none of this can really work in all cases.
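
The effect of the character class can be checked on a tiny made-up sample. The sketch below is in Python; the third pattern, which accepts any whitespace after "<p", is a hypothetical widening that nobody in this thread proposed, and the capturing group is written (?:...) only so that findall returns whole matches:

```python
import re

SPACE_ONLY = re.compile(r"<p (?:.|\r|\n)*?</p>")        # a space is required
SPACE_OR_GT = re.compile(r"<p[ >](?:.|\r|\n)*?</p>")    # space or '>'
ANY_WS_OR_GT = re.compile(r"<p[\s>](?:.|\r|\n)*?</p>")  # hypothetical widening

sample = (
    "<p>plain</p>\n"
    "<p align='left'>styled</p>\n"
    "<p\n\tclass='x'>attributes start on the next line</p>\n"
    "<pre>not a paragraph</pre>"
)


def count(pattern, text):
    """Number of non-overlapping matches of pattern in text."""
    return len(pattern.findall(text))


print(count(SPACE_ONLY, sample))
print(count(SPACE_OR_GT, sample))
print(count(ANY_WS_OR_GT, sample))
```

The first pattern misses the attribute-less paragraph, the second also misses an opening tag whose attributes begin after a line break, and none of them mistakes <pre> for a paragraph.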

--
Ben.

MarioCPPP

Aug 4, 2023, 8:17:22 PM

OK, thanks for the prudent disclaimer.

Your variant found 3883 matches, against 3644 matches found
by the previous one.

Scanning the book visually, it seems this RegEx is able to
find ALL of them.

I'll try variants with h1...h6 to find headings too.

Many thanks for this precious hint.

Now I'll try to understand the expression better :D
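
For anyone else trying to read the expression, here it is spelled out piece by piece, together with one possible heading variant. This is a Python sketch, not something posted in the thread: the names PARAGRAPH, HEADING, find_paragraphs and find_headings are invented, the group is made non-capturing so that findall returns whole matches, and the heading pattern assumes the standard h1 to h6 tags closed by a matching end tag.

```python
import re

# re.VERBOSE lets the pattern carry its own comments; whitespace is
# ignored outside character classes.
PARAGRAPH = re.compile(r"""
    <p              # start of an opening paragraph tag
    [ >]            # a space (attributes follow) or '>' (no attributes)
    (?:.|\r|\n)*?   # any characters, line breaks included, as few as possible
    </p>            # the nearest closing tag
""", re.VERBOSE)

# Heading variant: capture the tag name and require the same name in
# the closing tag through the backreference \1.
HEADING = re.compile(r"<(h[1-6])[ >](?:.|\r|\n)*?</\1>")


def find_paragraphs(html):
    return PARAGRAPH.findall(html)


def find_headings(html):
    return [(m.group(1), m.group(0)) for m in HEADING.finditer(html)]


sample = (
    "<h1 class='t'>Title</h1>\n<p>one</p>\n"
    "<h3>Sub\nhead</h3>\n<p style='x'>two\nlines</p>"
)
print(find_paragraphs(sample))
print(find_headings(sample))
```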

wij

Aug 13, 2023, 2:33:38 AM

On Thursday, August 3, 2023 at 1:39:17 AM UTC+8, Fonntuggnio wrote:
> Sorry for being totally OT, but I failed to build a RegEx with
> the "help" (rotfl) of three different so-called AIs, getting
> nowhere.
>
> I am scanning an HTML document (not in JavaScript, so I do
> not have access to DOM nodes from inside) and I need to
> match EVERY whole <p> tag.
>
> By whole I mean starting from the <p up to the corresponding
> </p>, but such paragraphs MAY (or may not) contain:
>
> a long list of attributes, with zero or more valid \n \r \t
> characters before the >.

The following is an example program for class Regex (a class wrapper for the
regex(3) functions). The regular expression "<p>.*</p>" should do most of the
job, except that real HTML involves comments, nested tags, malformed markup, etc.:

[]a_grep "<p>.*</p>" *html

-------------------------------------------------
/* Copyright is licensed by GNU LGPL, see file COPYING. by I.J.Wang 2023

Simulate grep command (Extended regular expression, ERE)

Build: make a_grep
*/
#include <Wy.stdio.h>
#include <Wy.unistd.h>
#include <Wy.regex.h>

using namespace Wy;

constexpr const char Red[]="\x1B[31m";
constexpr const char Reset[]= "\x1B[0m";

void sim_grep(Regex& rexpr, const char* fname)
{
  Errno r;
  String str;
  ::regmatch_t mbuf[5];
  RegFile regf(fname,O_RDONLY);
  RdBuf strm(regf);

  for(;strm.is_eof()==false;) {
    if((r=strm.read(str))!=Ok) {
      WY_THROW(r);
    }
    // Skip input in which the pattern does not match
    if((r=rexpr.regexec(str.c_str(),mbuf,WY_CARR_SIZE(mbuf),0))!=Ok) {
      continue;
    }
    // Print the text before the match, the match in red, then the rest
    cout << fname << ": ";
    cout << StrSeg(str.begin(), str.begin()+mbuf[0].rm_so);
    cout << Red << StrSeg(str.begin()+mbuf[0].rm_so,
                          str.begin()+mbuf[0].rm_eo) << Reset;
    cout << StrSeg(str.begin()+mbuf[0].rm_eo, str.end());
  }
}

int main(int argc, const char* argv[])
try {
  static const char usage[]="a_grep <pattern> <file>+" WY_ENDL;
  Errno r;

  if(argc<3) {
    cout << "Error: Invalid argument" WY_ENDL "Usage: "
         << usage << WY_ENDL;
    return -1;
  }
  const char* ptn= argv[1];
  Regex rexpr;

  // Compile the pattern as a POSIX extended regular expression
  if((r=rexpr.regcomp(ptn,REG_EXTENDED))!=Ok) {
    if(r!=EBADMSG) {
      WY_THROW(r);
    }
    String str;
    if((r=rexpr.regerror(str))!=Ok) {
      WY_THROW(r);
    }
    cout << str << WY_ENDL;
    return -1;
  }

  for(int i=2; i<argc; ++i) {
    const char* fname= argv[i];
    sim_grep(rexpr,fname);
  }

  cout << "OK" WY_ENDL;
  return 0;
}
catch(const Errno& e) {
  cerr << wrd(e) << WY_ENDL;
  return -1;
}
catch(...) {
  cerr << "main() caught(...)" WY_ENDL;
  throw;
}

jak

Aug 14, 2023, 4:37:21 AM

MarioCPPP wrote:

Could I ask you to kindly provide a piece of the text you are running
your tests on? Thanks in advance.

MarioCCCP

Aug 15, 2023, 8:15:52 AM

Well, actually no: it's an .ODF book converted to HTML,
written in LibreOffice, with just a main structural index,
headings down to the Title3 level, a few tables, and some
300,000 words of text. But, being unpublished and not meant
to be given away for free, I won't share the content.

The <p> tags are often heavily loaded with font style
attributes and some nested tags; the text is also
full of nested <i> and <b> tags (italics and bold). It's a
very large document, but not complex in structure. Just
badly designed (imvho).
For example, LibreOffice does not smartly track edits that
could be "collapsed". It just dumbly obeys and records the
formatting commands as they are, so it does not produce very
"CLEAN" HTML. But I won't revise it manually, it's huge. And
actually the books to be analyzed and steganized are SIX,
not just one.

I have decided to use a true XML parser though. Even if the
RegEx worked well, the program is growing too complex for a
string-only approach, and I need a truly DOM-aware
approach to edit the nodes' content.
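
If the export happens to be well-formed XHTML (an assumption; plain HTML with unclosed tags or entities such as &nbsp; would make this parser raise an error), a DOM-like pass could look roughly like the following Python sketch. The function names and the sample document are invented for illustration:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
P_TAG = f"{{{XHTML_NS}}}p"   # ElementTree spells namespaced tags as {uri}name


def paragraph_texts(xhtml):
    """Return the text of every <p>, nested <i>/<b> text included."""
    root = ET.fromstring(xhtml)
    return ["".join(p.itertext()) for p in root.iter(P_TAG)]


def number_paragraphs(xhtml):
    """Add an id attribute to every <p> and return the edited markup."""
    ET.register_namespace("", XHTML_NS)   # keep unprefixed tag names on output
    root = ET.fromstring(xhtml)
    for n, p in enumerate(root.iter(P_TAG), start=1):
        p.set("id", f"p{n}")
    return ET.tostring(root, encoding="unicode")


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p class="a">one <i>two</i>\nthree</p><p>four</p>'
    '</body></html>'
)

print(paragraph_texts(sample))
print(number_paragraphs(sample))
```

Editing happens on element objects rather than on substrings, so nested italics or bold no longer need any special handling.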

I am also considering abandoning GAMBAS and doing it in
JavaScript, which is able to act upon the HTML from inside;
injecting stuff is native to it.

Sorry if my reply is a bit frustrating, but those six books
are 30 years of my life, I have poured my blood into them :D

jak

Aug 16, 2023, 3:24:33 PM

MarioCCCP wrote:

Hi Mario,
absolutely no problem. Maybe I misunderstood the thread and I thought
you were referring to an XML document useful for debugging: practically
a single document that contains all the characteristics, peculiarities
and exceptions of XML, which would allow exhaustive debugging of a
parser. If I had sensed that you were referring to your personal
document, I would not have allowed myself to make this request.
Excuse me.

MarioCCCP

Aug 18, 2023, 3:05:20 PM

On 16/08/23 21:24, jak wrote:
> Hi Mario,
> absolutely no problem. Maybe I misunderstood the thread and I thought
> you were referring to an XML document useful for debugging: practically
> a single document that contains all the characteristics, peculiarities
> and exceptions of XML, which would allow exhaustive debugging of a
> parser. If I had sensed that you were referring to your personal
> document, I would not have allowed myself to make this request.
> Excuse me.
>

Don't even mention it! My English is poor at times, and I
cannot always explain myself very well :D

jak

Aug 19, 2023, 3:18:31 AM

MarioCCCP wrote:

> On 16/08/23 21:24, jak wrote:
>> Hi Mario,
>> absolutely no problem. Maybe I misunderstood the thread and I thought
>> you were referring to an XML document useful for debugging: practically
>> a single document that contains all the characteristics, peculiarities
>> and exceptions of XML, which would allow exhaustive debugging of a
>> parser. If I had sensed that you were referring to your personal
>> document, I would not have allowed myself to make this request.
>> Excuse me.
>
> Don't even mention it! My English is poor at times, and I cannot always
> explain myself very well :D
>

Then we'll just have to talk the way we eat XDD