Smarter breaking of excessively long lines

Fred H Olson

May 5, 2013, 2:53:06 PM

to Semware

HTML files, particularly those generated by software, can have
very long lines. I've been playing with such a file that is 29 Mbytes
in length and has numerous lines that exceed the 32000 byte limit
that this version of TSE seems to have (TSE - Linux v4.40.79).
BTW the TSE help says under
>Basic Concepts >Introduction >The Editor Features
"Edit lines up to 16,000 characters in length"

If one loads a file with long lines in TSE it simply breaks the
line at 32000 bytes often in the middle of a string that should
not be broken.

My file has basically a very repetitive format:
<h2 ... </h2> sequences followed by one or more
<tr> ... </tr> sequences
where ... represents up to about 300 characters of other text.
There are some other HTML tags etc. but the above is the vast
majority of the file. It is the sequences of <tr> ... </tr>
that get to be greater than 32000 bytes. What I'd like to do is
have each <tr> ... </tr> sequence be a separate line.

Is there a way to add newlines with TSE BEFORE long lines are
arbitrarily broken at 32000 bytes? Or does anyone have a
reliable macro or algorithm for breaking up long lines at
selected places AND putting lines back together at breaks?

If "Remove Trailing Whitespace" is on when long lines are broken up,
the first part of a broken line may be shorter than 32000 bytes if there
happen to be spaces at the break point.

Given that, is there another way to know where lines were broken?
Turn "Remove Trailing Whitespace" off before loading long lines?
What if there happened to be a line exactly 32000 bytes long?

It seems like breaking lines at selected points before they were
broken at 32000 bytes would be a lot cleaner. Would doing it with a
streaming editor like SED be the best approach?
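
One way to find out ahead of time which lines would be affected is to scan the file outside the editor before loading it. The following is only a minimal Python sketch, not anything provided by TSE; the function name find_long_lines is made up for illustration, and the 32000-byte default is simply the limit observed above. It assumes that a line of exactly 32000 bytes still fits, which would need to be checked against the editor.

```python
import io


def find_long_lines(stream, limit=32000):
    """Return (line_number, byte_length) for every line longer than `limit`.

    `stream` is any binary file-like object, e.g. open(path, "rb").
    Lengths exclude the trailing CR/LF characters.
    """
    long_lines = []
    for number, raw in enumerate(stream, start=1):
        length = len(raw.rstrip(b"\r\n"))
        if length > limit:  # assumption: exactly `limit` bytes still fits
            long_lines.append((number, length))
    return long_lines


if __name__ == "__main__":
    # Small in-memory stand-in for a real file.
    sample = io.BytesIO(b"short\n" + b"x" * 40000 + b"\n" + b"y" * 32000 + b"\n")
    print(find_long_lines(sample))  # prints [(2, 40000)]
```

If the list comes back empty, the file should load without any forced breaks, so the question of recognizing them afterwards does not arise.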

Fred

--
Fred H. Olson Minneapolis,MN 55411 USA (near north Mpls)
Email: fholson at cohousing.org 612-588-9532
My Link Pg: http://fholson.cohousing.org My org:
Communications for Justice -- Free, superior listserv's w/o ads

knud van eeden

May 5, 2013, 3:38:23 PM

to sem...@googlegroups.com

As a first approach:

At this moment I think the best approach in general would be to use an HTML parser (e.g. one of the beautifiers), because it looks in detail at the (key)words found.

E.g. Tidy
http://goo.gl/p9XQ9

In general writing parsers is non-trivial (you do not write e.g. an HTML parser in 5 minutes in TSE; more likely it takes days, weeks, months, ...).

Example of HTML syntax
http://goldparser.org/grammars/files/html.zip

Creating a parser is usually a huge, time-consuming job because of all the possible exceptions and constraints you have to take into consideration.

E.g. HTML is usually also interspersed with JavaScript and other script languages, complicating that job further, as 2 or more (different) languages are present in the same file.

===

Global search/replace, as used by programs like SED and also TSE, is in general more of a 'close your eyes' and 'brute force' approach.

If the requested end result after replacement is non-critical (e.g. one does not care if it does not work exactly the same anymore, as long as e.g. the graphics are about the same in that HTML page), it might be a method to use.

Global search/replace gives in general pretty good results, but no 'guarantee' that the program before and after works the same. Even *one* character change could make the difference between working and not working (and that will mean usually debugging and other (usually time-consuming) activities to repair it again). So manual editing afterwards might (almost always) be necessary.
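
A partial check on whether a replacement left the page intact can be sketched with the HTML parser from the Python standard library. This is only an illustration: the names TokenCollector, tokens and same_structure are made up, and the check assumes that whitespace-only differences between tags do not matter (which is not true inside e.g. <pre>). It compares the sequence of tags and text before and after, so it catches a break that landed inside a word or a tag, but it is no proof that the page renders the same.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collect start tags, end tags and non-blank text in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs such as inserted newlines
            self.tokens.append(("data", text))


def tokens(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


def same_structure(before, after):
    """True if both strings yield the same tags and text, ignoring blank runs."""
    return tokens(before) == tokens(after)


if __name__ == "__main__":
    original = "<tr><td>1</td></tr><tr><td>2</td></tr>"
    print(same_structure(original, "<tr><td>1</td></tr>\n<tr><td>2</td></tr>\n"))
    print(same_structure("<td>hello</td>", "<td>hel\nlo</td>"))
```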

---

With such relatively large files as e.g. 29 megabytes, doing the job manually is usually too time consuming, depending on how many lines are involved. But in some cases, if the right tools are not available out of the box, it might be the only way. Some partial automation (e.g. using TSE) might be possible, e.g. if it is about rather fixed patterns like <tr>...</tr>, typically using regular expressions.
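
As an illustration of the regular expression idea for such fixed patterns, here is a minimal Python sketch (outside TSE, so without its line length limit). The name split_rows and the choice of tags are assumptions taken from the file format described in the question. It inserts a newline after every </tr> or </h2> that is not already followed by one, so running it twice gives the same result as running it once.

```python
import re

# Closing </tr> or </h2> tag (any letter case) that is not already followed
# by a line break.
CLOSING_TAG = re.compile(r"(</(?:tr|h2)\s*>)(?!\r?\n)", re.IGNORECASE)


def split_rows(html: str) -> str:
    """Put a newline after each closing </tr> and </h2> tag."""
    return CLOSING_TAG.sub(r"\1\n", html)


if __name__ == "__main__":
    print(split_rows("<h2>A</h2><tr>1</tr><tr>2</tr>"), end="")
```

Because the pattern only looks at closing tags, it does not care what is between <tr> and </tr>, which keeps it simple but also means it will happily split inside a script or comment that happens to contain the same text.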

---

Programs like Perl (see e.g. the CPAN module 'HTML::Parser',
http://search.cpan.org/dist/HTML-Parser/Parser.pm),
possibly SED, ... on the other hand allow lines of unlimited length (restricted only by available memory).
That makes the parse job 'line based' instead of the more general multi-line or file-based job, which is in general more difficult.

===

Conclusion:

So using a beautifier (like Tidy, PolyStyle, ...) would probably be the way to go, or at least to try.

with friendly greetings,
Knud van Eeden




  
  From: Fred H Olson <fho...@cohousing.org>
 To: Semware <sem...@googlegroups.com> 
 Sent: Sunday, May 5, 2013 8:53 PM
 Subject: [TSE] Smarter breaking of excessively long lines
  

-- 

--- 
You received this message because you are subscribed to the Google Groups "SemWare TSE Pro text editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semware+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Michael Boyd

May 5, 2013, 5:36:37 PM

to sem...@googlegroups.com

Hello,

I tried writing an sgml/html/xml parser in TWE and it took months. Pretty good but not perfect.

Michael

knud van eeden

May 5, 2013, 5:52:36 PM

to sem...@googlegroups.com

> I tried writing an sgml/html/xml parser in TWE and it took months

Can you please tell me what TWE is?

Thanks,

with friendly greetings,
Knud van Eeden

    From: Michael Boyd <michae...@verizon.net>
 To: sem...@googlegroups.com 
 Sent: Sunday, May 5, 2013 11:36 PM

Klaus Hartnegg

May 5, 2013, 3:52:16 PM

to sem...@googlegroups.com

On 05.05.2013 20:53, Fred H Olson wrote:
> If one loads a file with long lines in TSE it simply breaks the
> line at 32000 bytes often in the middle of a string that should
> not be broken.
>
> My file has basically a very repetitive format:
> <h2 ... </h2> sequences followed by one or more

I would load it in binary mode and
replace </h2> with </h2>+chr(13)+chr(10)
Then save it and load it normally.
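
Outside the editor, the same replacement can be sketched in a few lines of Python working on raw bytes, so no line length limit applies. This is an illustration only; the names add_breaks and convert_file are made up, and the defaults just mirror the </h2> + chr(13)+chr(10) suggestion above. On Linux one would probably pass eol=b"\n" instead.

```python
def add_breaks(data: bytes, tag: bytes = b"</h2>", eol: bytes = b"\r\n") -> bytes:
    """Append `eol` after every occurrence of `tag` in `data`."""
    return data.replace(tag, tag + eol)


def convert_file(src_path, dst_path, tag=b"</h2>", eol=b"\r\n"):
    """Read src_path as bytes, add the breaks, and write the result to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(add_breaks(data, tag, eol))


if __name__ == "__main__":
    print(add_breaks(b"<h2>A</h2><tr>x</tr>"))
```

Note that this plain replacement is not repeatable: running it a second time on its own output adds a second line break after each tag.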

knud van eeden

May 5, 2013, 7:36:08 PM

to sem...@googlegroups.com

So I wonder whether TWE is TSE or not.

If it is TSE, I am certainly interested in more information.

    From: knud van eeden <knud_va...@yahoo.com>
 To: "sem...@googlegroups.com" <sem...@googlegroups.com> 
 Sent: Sunday, May 5, 2013 11:52 PM

Fred H Olson

May 5, 2013, 8:20:44 PM

to sem...@googlegroups.com

Thanks Klaus,

I had read a bit about binary mode which said:
> If you specify a value for LineLength that exceeds
> the editor's maximum line length, the editor's maximum line length is
> assumed.

I assumed that this meant there would still be a problem with lines
breaking. But with your encouragement I experimented. It seems that as long
as -b32000 is not used, binary lines are not limited to the initial length specified.
That is, insertions make the lines longer. When the file is closed all
the lines are concatenated.

I did puzzle a bit over how to have the replacement string include a
linefeed (I'm using Linux) interactively where your notation does not work.
The easiest I found was to make a column block over a <lf> in binary mode,
copy it and insert it in the replacement string.

Knud, yes, parsing html can certainly get complicated but I find that
simple html can be manipulated some with TSE.

Michael Boyd

May 5, 2013, 10:17:27 PM

to sem...@googlegroups.com

TSE. Sorry.

knud van eeden

May 6, 2013, 5:00:08 AM

to sem...@googlegroups.com

> I tried writing an sgml/html/xml parser in TWE and it took months

Interesting.

There exist of course tools (e.g. Lex, Flex, Yacc, Bison, Gold, JavaCC, ...)

http://en.wikipedia.org/wiki/Compiler-compiler

that automatically create (lexical analyzers and) parsers, which should allow one to create a starting point fairly fast (e.g. within a day), given that you have a suitable Backus-Naur Form grammar for that language (which usually can be found on the Internet) in the correct format for the parser generator.

But the output is usually in C, C++, Java, ... and thus not in TSE SAL, so much more work would be expected after that.

Certainly SGML (from which the simpler XML and even simpler HTML are derived) is a heavy-weight language.

Any information about your trying to create such a parser in TSE would be interesting to know. E.g. Bottom up, top down, automated, non-automated, hand-coded, how restricted, what can it handle, compiler or interpreter, how far did you get, documentation, ...

Thanks,

with friendly greetings,
Knud van Eeden

 From: Michael Boyd <michae...@verizon.net>
 To: sem...@googlegroups.com
 Sent: Monday, May 6, 2013 4:17 AM

Klaus Hartnegg

May 12, 2013, 11:19:58 AM

to sem...@googlegroups.com

On 06.05.2013 02:20, Fred H Olson wrote:
> I did puzzle a bit over how to have the replacement string include a
> linefeed (I'm using Linux)

In DOS: first press Ctrl-P, then Ctrl-J.
Will be shown as circle in a rectangle.
I don't know if this also works in Linux.
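
For reference, the reason Ctrl-J stands for a linefeed is the usual ASCII rule that Ctrl plus a letter gives the letter's code with the upper bits cleared. A tiny Python sketch (the helper name ctrl is made up) shows that Ctrl-J is chr(10) and Ctrl-M is chr(13), the two characters used in the chr(13)+chr(10) notation earlier in this thread. Whether a given terminal on Linux passes the keystroke through to the editor is a separate question.

```python
def ctrl(letter: str) -> int:
    """Return the ASCII code produced by Ctrl+<letter>."""
    return ord(letter.upper()) & 0x1F


if __name__ == "__main__":
    print(ctrl("J"), ctrl("M"))  # prints 10 13
```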

Klaus
