Example:
...
<path freq="4" title="Upgrade of Memory > 4GB on ...">xxx/yy//
ID245.html</path>
<path freq="6" title="Windows <2003> Upgrade...">xxx/zz/
ID246.html</path>
<path freq="8" title="Backup procedure">xxx/yy/ID247.html</path>
...
To fix up the problem, '<' and '>' inside the double quotes have to be
replaced with &lt; and &gt;, i.e.
...
<path freq="4" title="Upgrade of Memory &gt; 4GB on ...">xxx/yy//
ID245.html</path>
<path freq="6" title="Windows &lt;2003&gt; Upgrade...">xxx/zz/
ID246.html</path>
<path freq="8" title="Backup procedure">xxx/yy/ID247.html</path>
...
I ran the following Tse macro to do the replacements:
proc main()
    BegFile()
    while lFind('title=".*>.*"', "ix")
        MarkBlockBegin()
        Find('">.*$', "x")
        MarkBlockEnd()
        Replace(">", "&gt;", "lgn") // replaces > inside the marked block
        unmarkBlock()
    endwhile
    BegFile()
    while lFind('title=".*<.*"', "ix")
        MarkBlockBegin()
        Find('">.*$', "x")
        MarkBlockEnd()
        Replace("<", "&lt;", "lgn") // replaces < inside the marked block
        unmarkBlock()
    endwhile
end
It's quick-and-dirty and uses brute force. It took forever to loop
over 8 million lines once, let alone twice. I am interested in knowing
how the Tse gurus out there would do it.
One more thing, lFind('title=".*>.*"', "ix") in the macro fails to
find the following line (i.e. '>' in ">100000"):
<path freq="8" title=">100000 lines of text">xxx/yy/ID247.html</path>
If I remember correctly, '*' means zero or more and ".*" means zero or
more of any characters. '+' means one or more. Can anyone explain?
Have a Merry X'mas!
/Why Tea
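The reading of '*' in the question above can be checked outside the editor. The following is a minimal sketch in Python, not a TSE macro; Python's re module is assumed to treat ".*" the same way for this pattern (zero or more of any character), although its syntax differs from TSE's in other respects. It suggests that the pattern itself does match a '>' sitting directly after the opening quote.

```python
import re

# Same idea as the TSE pattern title=".*>.*" with the "i" option
# (written in Python's re syntax, which is an assumption, not TSE's).
PATTERN = re.compile(r'title=".*>.*"', re.IGNORECASE)

line = '<path freq="8" title=">100000 lines of text">xxx/yy/ID247.html</path>'

match = PATTERN.search(line)

# ".*" may match the empty string, so a '>' placed immediately after
# the opening quote is still found.
found = match is not None
matched_text = match.group(0) if match else ""

print(found)
print(matched_text)
```

If this holds for TSE as well, the missed line is more likely due to where the cursor is left by the other commands in the loop than to the pattern.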
Try this...
Note the cursor is being positioned after the targeted < or > by use of
the \c command.
This ensures that the marked block includes the < and > characters that
you wish to replace.
proc main()
    BegFile()
    while lFind('title=".*>.*"', "ix")
        MarkBlockBegin()
        lFind('">\c.*$', "x")
        MarkBlockEnd()
        Replace(">", "&gt;", "lgn") // replaces > inside the marked block
        unmarkBlock()
    endwhile
    BegFile()
    while lFind('title=".*<.*"', "ix")
        MarkBlockBegin()
        lFind('">\c.*$', "x")
        MarkBlockEnd()
        Replace("<", "&lt;", "lgn") // replaces < inside the marked block
        unmarkBlock()
    endwhile
end
If we can assume that the strings all occur on one line then it should
be possible to write 2 simple replaces.
ie.
lReplace('{title=".*}{>}{.*"}',"\1\&gt;\3","ix")
lReplace('{title=".*}{<}{.*"}',"\1\&lt;\3","ix")
Perhaps this is more bulletproof, forcing the replacements to occur only
between the quotes:
lReplace('{title="[~"]*}{>}{[~"]*"}',"\1\&gt;\3","ix")
lReplace('{title="[~"]*}{<}{[~"]*"}',"\1\&lt;\3","ix")
Hope this helps...
Ross
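For comparison, here is a minimal Python sketch of the "only between quotes" idea from the last two lReplace calls. It is not TSE code: [^"]* is Python's spelling of [~"]*, and the helper name escape_title is made up for illustration. A pattern that captures a single '>' per match may need repeating when a title holds several (the thread does not say how TSE behaves there), so the sketch rewrites the whole attribute value in one go instead.

```python
import re

# Matches one complete title="..." attribute on a line.
# [^"]* keeps the match from running past the closing quote.
TITLE_ATTR = re.compile(r'(title=")([^"]*)(")', re.IGNORECASE)


def escape_title(line):
    """Escape every '<' and '>' found inside title="..." on one line."""
    def fix(m):
        value = m.group(2).replace("<", "&lt;").replace(">", "&gt;")
        return m.group(1) + value + m.group(3)
    return TITLE_ATTR.sub(fix, line)


sample = '<path freq="6" title="Windows <2003> Upgrade...">xxx/zz/ID246.html</path>'
fixed = escape_title(sample)
print(fixed)
```

Because '&' is left alone, running it again over an already fixed line changes nothing, and the '>' that closes the path tag is never touched.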
> From: sem...@googlegroups.com [mailto:sem...@googlegroups.com] On behalf of
> Why Tea
> Sent: Sunday, 20 December 2009 11:16
> To: SemWare
> Subject: [TSE] How to fix up the xml file?
>
>
> ...
>
> I ran the following Tse macro to do the replacements:
> proc main()
> BegFile()
> while lFind('title=".*>.*"', "ix")
> MarkBlockBegin()
> Find('">.*$', "x")
> MarkBlockEnd()
> Replace(">", "&gt;", "lgn") // replaces >
> unmarkBlock()
> endwhile
>
> ...
> End
>
> ...
>
> One more thing, lFind('title=".*>.*"', "ix") in the macro fails to
> find the following line (i.e. '>' in ">100000"):
> <path freq="8" title=">100000 lines of text">xxx/yy/ID247.html</path>
> If I remember correctly, '*' means zero or more and ".*" means zero or
> more of any characters. '+' means one or more. Can anyone explain?
Your lFind('title=".*>.*"', "ix") is in itself correct, so the most likely
explanation is an unexpected format of the data, which causes Find('">.*$',
"x") to find the ">10000" before lFind('title=".*>.*"', "ix") does.
If you are working on a 64-bit machine, the maximum addressable memory is by design about 10^10 gigabytes (on the order of 2^64 = 18446744073709551616 bytes in total).
If you are working on a 32-bit machine, the maximum is by design between 2 and 4 gigabytes (on the order of 2^32 = 4294967296 bytes in total).
The particular machine might have only, say, 512 megabytes available in total.
Very large files (say on the order of 1 to 4 gigabytes) that are close to or even larger than the available memory will then be handled by swapping to disk.
RAM is usually much faster (say 10 to 100 times) than disk.
If this scenario applies, and the parts of the file are independent enough to be split, it might run faster to split the original file (e.g. into 2, 4, 8, ..., N subfiles), apply the search/replace algorithm(s) to each, and then merge the N subfiles back into one big file.
with friendly greetings,
Knud van Eeden
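The split/apply/merge suggestion above can be sketched without temporary subfiles by reading the input in batches of lines and appending each fixed batch to the output. This is a Python sketch under assumptions, not a TSE macro: the names process_in_chunks, escape_title and chunk_lines are hypothetical, and it relies on each title attribute sitting on a single line.

```python
import io
import re
from itertools import islice

# Matches one complete title="..." attribute on a line.
TITLE_ATTR = re.compile(r'(title=")([^"]*)(")', re.IGNORECASE)


def escape_title(line):
    """Escape '<' and '>' inside title="..." on one line."""
    def fix(m):
        value = m.group(2).replace("<", "&lt;").replace(">", "&gt;")
        return m.group(1) + value + m.group(3)
    return TITLE_ATTR.sub(fix, line)


def process_in_chunks(src, dst, chunk_lines=100_000):
    """Read src in batches of chunk_lines lines, fix each batch, append to dst.

    Only one batch is held in memory at a time; returns the line count.
    """
    total = 0
    while True:
        chunk = list(islice(src, chunk_lines))  # next batch of lines
        if not chunk:
            break
        dst.writelines(escape_title(line) for line in chunk)
        total += len(chunk)
    return total


# Demo with in-memory files; real use would pass two open file handles.
src = io.StringIO(
    '<path freq="8" title=">100000 lines of text">xxx/yy/ID247.html</path>\n'
    '<path freq="8" title="Backup procedure">xxx/yy/ID247.html</path>\n'
)
dst = io.StringIO()
count = process_in_chunks(src, dst, chunk_lines=1)
result = dst.getvalue()
print(count)
print(result, end="")
```

Memory use is then bounded by the batch size rather than the file size, which is the point of the splitting idea; whether it is actually faster depends on the machine.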
Thanks for the explanation, Carlo. I ran it on a 4GB Celeron
netbook and I got a lot of "no response" hangs before the job
was done. At some stage, I thought it would hang forever.
Thanks Ross. I was wondering how to do it like the last
two lReplace you suggested.