
document markup: HTML, LaTeX, etc.


Ivan Shmakov

Nov 4, 2016, 1:30:25 PM
>>>>> Rich <ri...@example.invalid>:
>>>>> Marko Rauhamaa <ma...@pacujo.net> wrote:

[I took the liberty of adding news:comp.infosystems.www.misc to
Newsgroups:, as the discussion would seem quite on-topic there.]

[...]

>> * At the bottom is HTML. It is extremely crude in its expressive
>> power. It lists a handful of element types, but documents need
>> more. In fact, different web sites need different element types.

> Well, I wouldn't describe HTML's "expressive power" as crude. HTML's
> expressive power is quite strong. What is "crude" for HTML is the
> default visual presentation of the underlying expressive power of
> HTML. The default visual rendering for basic HTML is what is quite
> crude.

Which is quite understandable. First of all, unlike LaTeX,
HTML has multiple /independent/ implementations in existence.
The author of a LaTeX document can request that it be processed
with "LaTeX 2\epsilon"; with HTML, one gets no such luxury^1.

And it wouldn't pay off to standardize any "fancy" rendering as
part of a future HTML version, either: as authors would be
unable to request a renderer implementing (at least) that
specific version, they would have to supply their own "fancy"
CSS anyway. And if they wouldn't benefit from a "fancy"
built-in style, why introduce one in the first place? Keeping
the default rendering plain is good for "forward compatibility",
too.

There's another reason, however. Also unlike LaTeX,
HTML is expected to be rendered in a wide variety of forms, such
as computer screens of varying dimensions, paper, by means of
speech synthesis, and so on^2. Indeed, the standard could
prescribe that the document's title is rendered in a
"17pt Roman" font^3, but how useful that would be if I read that
document on a character-cell terminal with Lynx^4? And trying
to specify /all/ the possible ways to render the title doesn't
look like a sensible approach, either.


Notes

^1 "Best viewed with" fad of the bad old days aside.

^2 Not to mention the processing by various "robot" software.

^3 Default \title formatting per classes.dtx.

^4 As I often do.

[...]

--
FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A

Ivan Shmakov

Nov 4, 2016, 2:12:38 PM
>>>>> Marko Rauhamaa <ma...@pacujo.net> writes:

[I took the liberty of adding news:comp.infosystems.www.misc to
Newsgroups:, as the discussion would seem quite on-topic there.]

[...]

> I mean HTML only has a handful of element types, regardless of
> visuals. Compare it with (La)TeX, which allows you to define new
> element types with semantics and all. In HTML, you do it by defining
> a div and a class name and externalizing the semantics outside HTML.

> There's no <aphorism>, <adage>, <saw>, <definition>, <theorem>,
> <conjecture>, <joke>, <guess>, <lie>, <rumor> etc...

But of course you can define new HTML elements, much as you
do with TeX-based systems. As a "prior art" example, the
Wayback Machine routinely uses <wb_p /> in place of <p />
(among others) for its own markup on archived pages^1.
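By way of a sketch (the element name is taken from the wishlist
quoted above; the CSS is merely one plausible rendering, not
anything a standard prescribes):

```html
<!-- Browsers parse unknown element names into the DOM just
     fine; only a default rendering is missing, so the
     stylesheet must supply everything, starting with
     'display' (unknown elements default to inline). -->
<style>
  aphorism { display: block; font-style: italic; margin: 1em 2em; }
</style>

<aphorism>Premature optimization is the root of all evil.</aphorism>
```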

There are, of course, a couple of drawbacks to this approach --
which, arguably, are not all that different to what one gets
with TeX.

First of all, such HTML has little chance of passing "validity"
tests^2. That said, TeX-based systems do not introduce the
concept of "validity" at all; the document is deemed "good" as
long as it renders the way it's intended. Or at least, I'm
not aware of any "LaTeX validators" currently in wide use.

Also, while CSS makes it possible to specify the rendering^3,
the semantics remain undefined. On the TeX side, a definition
like \def\foo#1{\mathbf{#1}} doesn't convey semantics, either --
only the presentation; and that much is already covered by CSS.
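For contrast, a LaTeX sketch (the macro names here are mine): a
purely presentational macro next to one whose /name/ at least
suggests semantics -- though even the latter encodes meaning
only for readers of the source, not for the engine:

```latex
\documentclass{article}
% Purely presentational: says nothing about what \foo means.
\newcommand{\foo}[1]{\mathbf{#1}}
% Nominally semantic: the name carries the meaning; the
% definition still only describes presentation.
\newcommand{\term}[1]{\emph{#1}}
\begin{document}
A \term{monoid} is a set with an associative operation
and an identity element, often written $\foo{1}$.
\end{document}
```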

And of course, CSS-wise, using such new elements is hardly any
different to using the standard "blank" <div /> and <span />
elements with a 'class' attribute. A more sensible approach is
to use some standard element (thus "inheriting" its semantics
for any third-party processors of said document) -- along with
suitable 'class' and 'role' attributes, RDFa, etc.
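For instance (the class name is ad-hoc, and the schema.org
vocabulary is only one possible choice):

```html
<!-- A standard element keeps its built-in semantics for any
     third-party consumer; 'class' and RDFa attributes then
     refine, rather than replace, those semantics. -->
<blockquote class="aphorism"
            typeof="http://schema.org/Quotation">
  <p property="text">Premature optimization
     is the root of all evil.</p>
</blockquote>
```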

Now, as an aside, while I can imagine specific documents that
would benefit from the elements above, I fail to see their
utility to the Web at large. Should a search engine really
treat <joke /> any differently to <theorem />, for instance?
How would <saw /> be any different to <definition /> when
interpreted by a Web user agent? (Other than for their
presentation -- but we've got that covered with CSS, right?)
If anything, it feels like over-engineering to me, alas.


Ultimately, however, I'd like to note that the flexibility of
TeX comes from it being a full-weight programming language --
contrary to HTML and CSS, which are merely data languages.

Then, it was already noted that the modern Web ecosystem employs
both data languages, such as HTML and CSS (but also SVG, MathML,
RDFa, various "microformats", etc.), -- and JavaScript for the
programming language. And honestly, I'm not entirely sure that
comparing a data language with a programming language quite
makes sense. (So, if anything, shouldn't we rather be comparing
TeX to JavaScript here? Instead of HTML and CSS, that is.)

Hence, I claim that the power of TeX is also its weakness.
Yes, one can implement a seemingly-declarative "markup" language
in TeX (such as LaTeX), but will it be much different to
implementing such a language in JavaScript? Yes, one can
perform static analysis of a TeX document -- but will the
results /always/ be more useful than performing that same static
analysis on a pure JavaScript-based Web page^4? And no, one
does not "process" a TeX document to get a PDF -- one has to
"run" it instead. "Here be the halting problem."

Somewhat less importantly, TeX code is even less isolated (by
default) from the underlying system than JavaScript^5. One can
easily \input /etc/passwd -- or write to any file the user is
permitted to write to. And given that there're users that tend
to be wary of running arbitrary JavaScript, how should they
feel about running arbitrary TeX code?


Now, to speak of the bright side: HTML5 possesses decent
"expressive power" and can be "specialized" as necessary by
means of (generally ad-hoc) 'class' values and (more
standardized) "microformats", RDFa, etc. The standardization of
such elements as <article />, <nav /> and <time /> in HTML5
allows for easier extraction of the "payload" content and
metadata from the compliant documents. The inclusion of the DOM
interface specification makes it possible to provide a uniform
interface to "HTML objects" across programming languages.

There're some developments (such as RASH^6) aimed at making HTML
a suitable format for authoring scientific papers in.

An even older project, MathJax^7, allows one to include quality
mathematics in HTML documents. It supports several formats for
both "input" (MathML, TeX, ASCIImath) and "output" (HTML and
CSS, SVG, MathML.) The formulae are rendered on the user's
side, which means that the user has a degree of control over the
final presentation. When the math is written in the TeX
notation, the user of a browser not implementing JavaScript, or
having it disabled, sees the unprocessed TeX -- which can be as
readable as the author manages to write it.
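A minimal sketch of such a page (the script URL and
configuration name follow the MathJax 2.x convention of the
time; the exact path varies between releases, so consult the
MathJax documentation rather than copying this verbatim):

```html
<!-- TeX input stays readable as-is in non-JavaScript browsers;
     with JavaScript enabled, MathJax typesets it client-side. -->
<script async
  src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML">
</script>

<p>By Euler's identity, \(e^{i\pi} + 1 = 0\).</p>
```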

While the use of "client-side" JavaScript is questionable at
times, its /omnipresence/ can be regarded as an opportunity.
Frankly, I don't recall there ever being a development
environment covering computers ranging from something one can
carry in one's palm, to desktops, to supercomputers^8.


Notes

^1 Presumably to avoid possible clashes with the archived pages' own
styling.

^2 Unless, of course, the newly introduced elements become so
commonly used by other parties as to warrant inclusion into
whatever new HTML TR the W3C decides to publish.

^3 Most commonly visual; but, though frequently overlooked,
CSS 2.1 also offers properties to describe the /aural/
presentation of the document -- think of speech
synthesizers' users, for instance. I'm unaware of any
similar facility for TeX-based publishing systems.

^4 http://circuits.im/ comes to mind.

^5 The isolation JavaScript implementations offer is also
stronger than, say, that implemented in Ghostscript
(-dSAFER) for PostScript -- which happens to be another
common "document programming language".

^6 http://rawgit.com/essepuntato/rash/master/documentation/
RASH: Research Articles in Simplified HTML

^7 http://mathjax.org/

^8 Disclaimer: I do not advocate in favor of portable computers in
general, and even less so for any and all devices running
non-free software, or implementing cellular network protocols.
Also, I really hope that one wouldn't actually use JavaScript
for any "number crunching", but will rely on something like C
instead. That said, should I ever have to choose between
JavaScript and, say, Python -- I'd go with JavaScript, sure.

Marko Rauhamaa

Nov 4, 2016, 4:24:21 PM
Ivan Shmakov <iv...@siamics.net>:

> Ultimately, however, I'd like to note that the flexibility of TeX
> comes from it being a full-weight programming language -- contrary to
> HTML and CSS, which are merely data languages.
>
> Then, it was already noted that the modern Web ecosystem employs both
> data languages, such as HTML and CSS (but also SVG, MathML, RDFa,
> various "microformats", etc.), -- and JavaScript for the programming
> language. And honestly, I'm not entirely sure that comparing a data
> language with a programming language quite makes sense. (So, if
> anything, shouldn't we rather be comparing TeX to JavaScript here?
> Instead of HTML and CSS, that is.)

The point is, is there a reason to bother with "data languages" when
what you really need is a programming language? Data is code, and code
is data.

Note, for example, how iptables are giving way to BPF, which in turn are
finding more and expanded uses.

Also, note how PostScript handles rendering beautifully in the printing
world and how elisp is used to "configure" emacs.

> Hence, I claim that the power of TeX is also its weakness. Yes, one
> can implement a seemingly-declarative "markup" language in TeX (such
> as LaTeX), but will it be much different to implementing such a
> language in JavaScript?

Well, no, it won't. That's the point. All you need is <div> and JS, but
that's also the minimum you need.

> Yes, one can perform static analysis of a TeX document -- but will the
> results /always/ be more useful than performing that same static
> analysis on a pure JavaScript-based Web page^4?

What do you need static analysis for? Ok, Google needs to analyze, but
leave that to them.

Point is, formal semantics needs a full-fledged programming language.
And by semantics, I'm not referring (mostly) to the visual layout but to
the structure and function of the parts of the document/web site.

> And no, one does not "process" a TeX document to get a PDF -- one has
> to "run" it instead.

Way to go!

> "Here be the halting problem."

Mostly, there will be problems with security and DoS, but I suppose
those can be managed. JavaScript and PostScript (among others) have had
to go through those stages.


Marko

Ivan Shmakov

Nov 4, 2016, 6:25:31 PM
>>>>> Marko Rauhamaa <ma...@pacujo.net> writes:
>>>>> Ivan Shmakov <iv...@siamics.net>:

>> Ultimately, however, I'd like to note that the flexibility of TeX
>> comes from it being a full-weight programming language -- contrary
>> to HTML and CSS, which are merely data languages.

>> Then, it was already noted that the modern Web ecosystem employs
>> both data languages, such as HTML and CSS (but also SVG, MathML,
>> RDFa, various "microformats", etc.), -- and JavaScript for the
>> programming language. And honestly, I'm not entirely sure that
>> comparing a data language with a programming language quite makes
>> sense. (So, if anything, shouldn't we rather be comparing TeX to
>> JavaScript here? Instead of HTML and CSS, that is.)

> The point is, is there a reason to bother with "data languages" when
> what you really need is a programming language. Data is code, and
> code is data.

I beg to differ. Otherwise, I wouldn't be using Lynx (which
supports rendering HTML, but not running JavaScript -- or any
other programming language for that matter; lynxcgi:// aside)
or that wonderful NoScript extension for Firefox.

> Note, for example, how iptables are giving way to BPF,

Do they?

> which in turn are finding more and expanded uses.

> Also, note how PostScript handles rendering beautifully in the
> printing world

I haven't used PostScript for years; I choose to rely on PDF.
If anything, being a data language, it feels much "safer".

> and how elisp is used to "configure" emacs.

Given that Emacs is first and foremost a programming language
implementation, the term "configuration" feels nearly as
applicable to it as it would be to GNU Libc. That said, it
does indeed make sense to customize a programming language
using a program written in it.

As an aside, would you fancy reading messages here on Usenet
that happened to be actual code, instead of ASCII data?
(And if so, what language would you prefer?)

... Not that it's something unheard of. Back in the BBS days,
the demogroups and tech-savvy individuals were not necessarily
satisfied with posting a plain text message, opting instead
to wrap one into a binary for the platform of their choice.
When run, the binary would render said message -- with music,
video effects, and so on.

Of course, that meant no luck for those who use different
hardware, but then again: those would probably not be interested
in the message in the first place.

FTR, the relevant software was known as "noters". I was able
to locate several mentions on the Web (say, ^1, ^2), but no
"encyclopedic" description of the concept so far.

>> Hence, I claim that the power of TeX is also its weakness. Yes, one
>> can implement a seemingly-declarative "markup" language in TeX (such
>> as LaTeX), but will it be much different to implementing such a
>> language in JavaScript?

> Well, no, it won't. That's the point. All you need is <div> and JS,
> but that's also the minimum you need.

That way, one's essentially using a format of one's own make,
while also requiring that those who happen to be interested in
the actual "payload" use the software also of one's own make.

Well, thanks, but no thanks; when the site tells me "best viewed
with our custom browser" (or "only viewed", for that matter),
it's one big red "STOP" sign to me. I prefer to stick to the
software of /my/ choice -- not theirs.

>> Yes, one can perform static analysis of a TeX document -- but will
>> the results /always/ be more useful than performing that same static
>> analysis on a pure JavaScript-based Web page?

> What do you need static analysis for?

Why, I may want to extract the table of contents of a document,
or make a list of the titles of some or all my documents, etc.

> Ok, Google needs to analyze, but leave that to them.

Indeed, Web search engines benefit the most from the use of data
languages on the modern Web. But in fact, anyone can join.
(But if you manage to convince Google to run the code you've put
on your site in order to index it -- I'd be very much interested
in the details.)

Also worth mentioning is that "data format conversion" becomes
ill-defined procedure once it's no longer "data" we speak of.

Say, subject to formats' color depth limitations, it's always
possible to convert a raster image from one lossless format to
another (say, PNG to PNM to BMP to...) with no loss of data.

Is it similarly possible to convert Forth into Perl? Or, more
relevant to this discussion, LaTeX into JavaScript -- and back
again?

> Point is, formal semantics needs a full-fledged programming language.
> And by semantics, I'm not referring (mostly) to the visual layout but
> to the structure and function of the parts of the document/web site.

I do not see how a programming language could help in
/conveying/ semantics -- as opposed to /implementing/ it.

Suppose, for example, that we are to define the C 'printf'
function semantics for a new C standard. What programming
language do we use, and how do we do that?

[...]

>> "Here be the halting problem."

> Mostly, there will be problems with security and DoS, but I suppose
> those can be managed. JavaScript and PostScript (among others) have
> had to go through those stages.

Off the top of my head, a bug in the Firefox implementation of
JavaScript has led to the fall of Silk Road (say, ^3).
Also, it was shown that it's possible to exploit the
"row hammer"^4 hardware vulnerability from JavaScript.

^1 http://commodorefree.com/magazine/vol2/issue20.htm

^2 https://duckduckgo.com/html/?q=demomaker+"noter"

^3 https://daniweb.com/hardware-and-software/networking/news/460484/

^4 https://en.wikipedia.org/wiki/Row_hammer

Doc O'Leary

Nov 5, 2016, 1:25:54 PM
For your reference, records indicate that
Marko Rauhamaa <ma...@pacujo.net> wrote:

> The point is, is there a reason to bother with "data languages" when
> what you really need is a programming language. Data is code, and code
> is data.

No. Things scale in natural ways. Everything is data *first*, and
only then can we look at it and see if the data could be code, and
then *how* the data and code can best be presented together. When
all you want to do is say “Hello, World!”, the more simply a system
can do that, the better.

> Point is, formal semantics needs a full-fledged programming language.
> And by semantics, I'm not referring (mostly) to the visual layout but to
> the structure and function of the parts of the document/web site.

It *may* require that for some uses, but it is a mistake to impose
the most complex system possible at all scales. Too great a learning
curve will drive new people away. Tools that are too specialized
will get very few users.

And also keep in mind that semantics *will* differ between the
producer of a document and the consumer of the document. The more
“code” you embed in a document, the more you force a particular
meaning on its content. That is not always the right thing to do.

--
"Also . . . I can kill you with my brain."
River Tam, Trash, Firefly

