How to fix up the xml file?

Why Tea

Dec 20, 2009, 5:16:01 AM
to SemWare
While trying out swish-e (Simple Web Indexing System for Humans -
Enhanced), I used a conversion tool that turned the index file into an XML
file of more than 8 million lines of text. Unfortunately, some of the
attribute values contain the invalid characters '<' and '>'.

Example:
...
<path freq="4" title="Upgrade of Memory > 4GB on ...">xxx/yy//ID245.html</path>
<path freq="6" title="Windows <2003> Upgrade...">xxx/zz/ID246.html</path>
<path freq="8" title="Backup procedure">xxx/yy/ID247.html</path>
...

To fix the problem, every '<' and '>' inside the double quotes has to be
replaced with &lt; and &gt;, i.e.:
...
<path freq="4" title="Upgrade of Memory &gt; 4GB on ...">xxx/yy//ID245.html</path>
<path freq="6" title="Windows &lt;2003&gt; Upgrade...">xxx/zz/ID246.html</path>
<path freq="8" title="Backup procedure">xxx/yy/ID247.html</path>
...

I ran the following TSE macro to do the replacements:
proc main()
    BegFile()
    while lFind('title=".*>.*"', "ix")
        MarkBlockBegin()
        Find('">.*$', "x")
        MarkBlockEnd()
        Replace(">", "&gt;", "lgn") // replaces >
        unmarkBlock()
    endwhile
    BegFile()
    while lFind('title=".*<.*"', "ix")
        MarkBlockBegin()
        Find('">.*$', "x")
        MarkBlockEnd()
        Replace("<", "&lt;", "lgn") // replaces <
        unmarkBlock()
    endwhile
End

It's quick-and-dirty and uses brute force. It took forever to loop
over 8 million lines once, let alone twice. I would like to know
how the TSE gurus out there would do it.

One more thing, lFind('title=".*>.*"', "ix") in the macro fails to
find the following line (i.e. '>' in ">100000"):
<path freq="8" title=">100000 lines of text">xxx/yy/ID247.html</path>
If I remember correctly, '*' means zero or more and ".*" means zero or
more of any character. '+' means one or more. Can anyone explain?

Have a Merry X'mas!

/Why Tea

Carlo Hogeveen

Dec 20, 2009, 7:06:52 AM
to sem...@googlegroups.com

Part of the answer is: change your remaining Find() and Replace() commands
to lFind() and lReplace() as well. Find() and Replace() cause the macro to
continuously "pause" to update the screen, which slows your macro way down.

Ross B

Dec 20, 2009, 7:21:36 AM
to sem...@googlegroups.com
What's happening is that the inner Find() positions the cursor on the
first character of the found text, so the marked block does not include
the character you need to replace.

Try this. Note that the \c in the search string positions the cursor
after the matched ">, which ensures that the marked block includes the
< and > characters that you wish to replace.

proc main()
    BegFile()
    while lFind('title=".*>.*"', "ix")
        MarkBlockBegin()
        lFind('">\c.*$', "x")
        MarkBlockEnd()
        Replace(">", "&gt;", "lgn") // replaces >
        unmarkBlock()
    endwhile
    BegFile()
    while lFind('title=".*<.*"', "ix")
        MarkBlockBegin()
        lFind('">\c.*$', "x")
        MarkBlockEnd()
        Replace("<", "&lt;", "lgn") // replaces <
        unmarkBlock()
    endwhile
End

If we can assume that each string occurs on a single line, then it should
be possible to write two simple replaces, i.e.:
lReplace('{title=".*}{>}{.*"}',"\1\&gt;\3","ix")
lReplace('{title=".*}{<}{.*"}',"\1\&lt;\3","ix")

Perhaps this is more bulletproof, since it forces the replaces to occur
only between the quotes:
lReplace('{title="[~"]*}{>}{[~"]*"}',"\1\&gt;\3","ix")
lReplace('{title="[~"]*}{<}{[~"]*"}',"\1\&lt;\3","ix")
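
For comparison outside TSE, the same idea can be sketched in Python. This is only a minimal sketch, assuming each title attribute sits on a single line and its value contains no double quote; the names `TITLE_ATTR` and `escape_title` are made up for this example. A replacement function is used so that every '<' and '>' inside one title value is escaped in a single pass.

```python
import re

# Matches a title attribute whose value contains no double quote
# (assumption: the attribute never spans more than one line).
TITLE_ATTR = re.compile(r'title="([^"]*)"')


def escape_title(line: str) -> str:
    """Escape '<' and '>' inside title="..." values; leave the rest of the line alone."""

    def fix(match: re.Match) -> str:
        value = match.group(1).replace("<", "&lt;").replace(">", "&gt;")
        return f'title="{value}"'

    return TITLE_ATTR.sub(fix, line)


if __name__ == "__main__":
    sample = '<path freq="6" title="Windows <2003> Upgrade...">xxx/zz/ID246.html</path>'
    print(escape_title(sample))
    # prints: <path freq="6" title="Windows &lt;2003&gt; Upgrade...">xxx/zz/ID246.html</path>
```

The tag's own closing '>' is untouched because it lies outside the quoted value. Characters such as '&' are not handled here.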


Hope this helps...
Ross


Carlo Hogeveen

Dec 20, 2009, 7:36:58 AM
to sem...@googlegroups.com

> From: sem...@googlegroups.com [mailto:sem...@googlegroups.com] On behalf of
> Why Tea
> Sent: Sunday, December 20, 2009 11:16
> To: SemWare
> Subject: [TSE] How to fix up the xml file?
>
> ...
>
> I ran the following Tse macro to do the replacements:
> proc main()
> BegFile()
> while lFind('title=".*>.*"', "ix")
> MarkBlockBegin()
> Find('">.*$', "x")
> MarkBlockEnd()
> Replace(">", "&gt;", "lgn") // replaces >
> unmarkBlock()
> endwhile
>
> ...
> End
>
> ...
>
> One more thing, lFind('title=".*>.*"', "ix") in the macro fails to
> find the following line (i.e. '>' in ">100000"):
> <path freq="8" title=">100000 lines of text">xxx/yy/ID247.html</path>
> If I remember correctly, '*' means zero or more and ".*" means zero or
> more of any characters. '+' means one or more. Can anyone explain?


Your lFind('title=".*>.*"', "ix") is in itself correct, so the most likely
explanation is an unexpected format of the data, which causes Find('">.*$',
"x") to find the ">100000" before lFind('title=".*>.*"', "ix") does.

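As a rough cross-check of the pattern itself, here is a minimal Python sketch. It assumes Python's `re` semantics, which are not identical to TSE's regular expressions, so it only illustrates that "zero or more of any character" allows the '>' to come directly after the opening quote. The sample line is the one from the question, joined onto a single line.

```python
import re

# Sample line from the question, joined onto one line (an assumption about the real data).
line = '<path freq="8" title=">100000 lines of text">xxx/yy/ID247.html</path>'

# Same pattern text as in the macro; re.IGNORECASE stands in for TSE's "i" option.
match = re.search(r'title=".*>.*"', line, re.IGNORECASE)

print(match.group(0) if match else "no match")
# prints: title=">100000 lines of text"
```
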
knud van eeden

Dec 20, 2009, 10:41:43 AM
to sem...@googlegroups.com
TSE is (also) very fast because, by design, it usually handles the file entirely in memory.

On a 64-bit machine, the maximum addressable memory is by design about 10^10 gigabytes (2^64 = 18446744073709551616 bytes in total).
On a 32-bit machine, the maximum addressable memory is by design between 2 and 4 gigabytes (2^32 = 4294967296 bytes in total).
The particular machine might have only, say, 512 megabytes in total.
So very large files (say 1 to 4 gigabytes), which are close to or even larger than the available memory, will be handled by swapping to disk.
RAM is usually much faster (say 10 to 100 times) than disk.

If this scenario applies and the parts of the file are largely independent of each other, it might run faster to split the original file (e.g. into 2, 4, 8, ..., N subfiles), apply the search/replace algorithm(s) to each subfile, and then merge the N subfiles back into one big file.
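
A variation on this idea, sketched in Python rather than TSE: instead of writing separate subfiles, the file is read in fixed-size batches of lines, each batch is transformed and appended to the output, so only one batch is in memory at a time. This is a minimal sketch under the assumption that every line can be fixed independently; `process_in_chunks`, `transform` and `lines_per_chunk` are hypothetical names chosen for the example.

```python
from itertools import islice
from typing import Callable


def process_in_chunks(
    src: str,
    dst: str,
    transform: Callable[[str], str],
    lines_per_chunk: int = 100_000,
    encoding: str = "utf-8",
) -> int:
    """Copy src to dst, applying transform to each line, one batch of lines at a time.

    Returns the number of batches processed.
    """
    chunks = 0
    # newline="" keeps the original line endings untouched.
    with open(src, encoding=encoding, newline="") as fin, \
         open(dst, "w", encoding=encoding, newline="") as fout:
        while True:
            # Read at most lines_per_chunk lines; an empty list means end of file.
            chunk = list(islice(fin, lines_per_chunk))
            if not chunk:
                break
            fout.writelines(transform(line) for line in chunk)
            chunks += 1
    return chunks
```

Any per-line function can be passed as `transform`, for example one that escapes '<' and '>' inside the title attribute.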

with friendly greetings,
Knud van Eeden

Why Tea

Dec 20, 2009, 7:52:57 PM
to SemWare
On Dec 20, 11:06 pm, "Carlo Hogeveen" <Carlo.Hogev...@xs4all.nl>
wrote:

> Part of the answer is: change your remaining Find() and Replace() commands
> to lFind() and lReplace() as well. Find() and Replace() cause the macro to
> continuously "pause" to update the screen, which slows your macro way down.

Thanks for the explanation, Carlo. I ran it on a 4GB Celeron
netbook and got a lot of "no response" hangs before the job
was done. At one stage, I thought it would hang forever.

Why Tea

Dec 20, 2009, 7:56:20 PM
to SemWare

Thanks, Ross. I was wondering how to do it the way your last
two lReplace calls do.
