Re: Simple conversions from HTML to simple markups are disappointing

rtr

Jan 23, 2022, 7:37:54 AM
On Sun, 23 Jan 2022 13:25:29 +0100
Luca Saiu <lu...@ageinghacker.net> wrote:

>
> [...]
>
> Now, it is possible to obtain a better conversion by spending more
> effort: in particular lynx (which of course was never designed for
> this task) is inadequate in preserving markup information. It is
> possible to parse HTML instead, and start from an AST. On the other
> hand some fault lies in the HTML source document as well: The
> document could have used, for example, CSS for icons instead of <img>
> elements when the content was not significant enough to deserve
> translation. However some style information only encoded in CSS
> would be significant for translation: had I used CSS in the place of
> old-style <tt> elements, recognising “code”-type elements would have
> been an issue. My html-to-gemini or html-to-gopher conversion would
> need a lot of the complexity I want to avoid.
>
> I have come to believe that the only really practical solution is
> translating in the opposite direction: starting from a simple and
> clean markup (I would say Gemini) and from that generating other
> simple markups (Gopher) and the legacy system (HTML). This can and
> should handle relative, intra-server links.
>

Interesting. I also think that gemini/gopher -> html is easier to
deal with than the other way around. When I was first starting to
get into gemini I also dabbled with the idea of just converting my HTML
pages to gemtext. I figured that it's just easier to strip everything
of formatting, start with plaintext and convert that to gemtext.

Granted, I don't have that many posts to mess with, so that probably
played into my decision-making process.

--
Give them an inch and they will take a mile.

bunburya

Jan 23, 2022, 9:03:34 AM
On 23/01/2022 12:25, Luca Saiu wrote:
> I have come to believe that the only really practical solution is
> translating in the opposite direction: starting from a simple and clean
> markup (I would say Gemini) and from that generating other simple
> markups (Gopher) and the legacy system (HTML). This can and should
> handle relative, intra-server links.

I believe this is correct, because the features supported by HTML are a
superset of those supported by gemtext. So going from HTML -> gemtext
almost always results in a loss of some information, which means a
choice must be made as to how to handle the loss of information. I
suspect the optimal solution to the problem will depend heavily on the
context, so it is hard to create a perfect, generalised HTML -> gemtext
converter.
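
The superset relationship can be made concrete: each gemtext line type
has a direct HTML counterpart, so nothing needs to be dropped in this
direction. A minimal sketch in Python, assuming the usual gemtext line
types (headings, links, list items, quotes, preformatted toggles, plain
text). The function name gemtext_to_html is hypothetical and the choice
of HTML elements is only one reasonable option:

```python
import html


def gemtext_to_html(source: str) -> str:
    """Translate gemtext to an HTML fragment, one line type at a time."""
    out = []
    in_pre = False   # inside a preformatted block
    in_list = False  # inside a run of "* " list items
    for line in source.splitlines():
        if in_pre:
            # Inside a preformatted block only the closing toggle is special.
            if line.startswith("```"):
                out.append("</pre>")
                in_pre = False
            else:
                out.append(html.escape(line))
            continue
        is_item = line.startswith("* ")
        if in_list and not is_item:
            out.append("</ul>")
            in_list = False
        if line.startswith("```"):
            out.append("<pre>")
            in_pre = True
        elif is_item:
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append(f"<li>{html.escape(line[2:])}</li>")
        elif line.startswith("=>"):
            # Link line: URL first, then an optional human-readable label.
            parts = line[2:].split(maxsplit=1)
            if parts:
                url = parts[0]
                label = parts[1] if len(parts) > 1 else url
                out.append(f'<p><a href="{html.escape(url)}">{html.escape(label)}</a></p>')
        elif line.startswith("#"):
            # Gemtext has three heading levels; extra "#" stay in the text.
            level = min(len(line) - len(line.lstrip("#")), 3)
            out.append(f"<h{level}>{html.escape(line[level:].strip())}</h{level}>")
        elif line.startswith(">"):
            out.append(f"<blockquote>{html.escape(line[1:].strip())}</blockquote>")
        elif line.strip():
            out.append(f"<p>{html.escape(line)}</p>")
    if in_list:
        out.append("</ul>")
    if in_pre:
        out.append("</pre>")
    return "\n".join(out)
```

The reverse function would have to decide what to do with every HTML
element that has no entry in this mapping, which is where the
context-dependent choices come in.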

Alternatively, you could consider starting in markdown, which lies
somewhere between gemtext and HTML in terms of features. Markdown ->
HTML is easy and commonly done. In principle, markdown -> gemtext
suffers from the same issues as HTML -> gemtext (loss of information due
to moving to a format with fewer supported features), but much less
information is lost as markdown is much closer to gemtext to begin with.
There are some tools out there already to convert from markdown to
gemtext, such as https://pypi.org/project/md2gemini/

I'm guessing the main thing you are missing in gemtext is inline links.
If you start from markdown, these can be preserved perfectly when
converting to HTML, and handled sensibly when converting to gemtext
(there are a few common ways to do this, such as converting to
footnotes, which the above tool supports).
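
To illustrate the footnote approach, here is a rough sketch in Python.
It is not how md2gemini is implemented; the function name
links_to_footnotes and the output layout are assumptions made for the
example, and it only handles simple [label](url) links within a single
paragraph:

```python
import re

# Matches Markdown inline links of the form [label](url). Images
# (![alt](src)) and reference-style links are ignored in this sketch.
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def links_to_footnotes(paragraph: str) -> str:
    """Replace inline links with 'label[n]' and append gemtext link lines."""
    links = []  # (url, label) pairs in order of appearance

    def replace(match):
        links.append((match.group(2), match.group(1)))
        return f"{match.group(1)}[{len(links)}]"

    body = INLINE_LINK.sub(replace, paragraph)
    if not links:
        return body
    # One gemtext link line per footnote, keeping the original label.
    link_lines = [
        f"=> {url} [{n}] {label}" for n, (url, label) in enumerate(links, 1)
    ]
    return body + "\n\n" + "\n".join(link_lines)
```

For example, "See [the spec](gemini://example.org/spec.gmi) first."
becomes "See the spec[1] first." followed by a blank line and the link
line "=> gemini://example.org/spec.gmi [1] the spec".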

Luca Saiu

Jan 23, 2022, 11:58:22 AM
Hello bunburya.

On 2022-01-23 at 14:03 +0000, bunburya wrote:

> In principle, markdown -> gemtext suffers from the same issues as HTML
> -> gemtext (loss of information due to moving to a format with fewer
> supported features), but much less information is lost as markdown is
> much closer to gemtext to begin with.

Agreed.

> I'm guessing the main thing you are missing in gemtext is inline links.

To me the lack of control on preformatted text is more serious than the
lack of inline links, possibly because of the technical topics I
normally write about.

Not being able to display source code clearly is a fatal flaw for me;
and notice how Gopher is less flawed than Gemini in this sense, by
virtue of being less abstract.

I do not particularly mind non-inline links; in fact they may promote a
clear style. However numbered footnote-style links obtained by
conversion, without descriptive labels, are difficult to follow without
interrupting the flow of reading: see the end of my conversion example,
which I believe is representative in terms of clutter.

=> gemini://ageinghacker.net/test-conversion.gmi

One has to write, from the beginning, in the new “minimal markup” style.

Incidentally I am not saying that Gemini is perfect. In fact I miss a
comment syntax, which I would use to encode my own information
(examples: list of keywords, tags, priority in a site-wide page map).
But what I am thinking is a set of semantic extensions, the kind which
Gemini is designed to prevent; that is fair.

In the same way I also miss italic and bold -- call them “emphasis” if
you will -- but here it might be healthy to let them go altogether.
Part of this entire exercise is detoxing from the overabundance of
irrelevant information. For decades I have been planning never to use
smileys again and write text in a dignified style, with meaning conveyed
through words instead of some flashy semi-literate replacement for them.
In the end once in a while laziness wins.


The more I consider the issue the more I lean towards defining my own
source format from which to machine-generate even simple formats like
Gemini. In the longer term it would be something powerful and
extensible, like M4 without the awful quoting mechanism. For the time
being it can be a trivial system.

I would keep my public site source tree, with markup files identified by
a specific extension linking each other, along with data files such as
images. A script would generate copies of the entire tree, with
symbolic links where appropriate, with notes translated into Gemini,
Gopher or HTML.
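
A first approximation of such a script, sketched in Python under several
assumptions: the name build_tree, the ".note" extension used in the
usage note, and the translate callback are all hypothetical, data files
are symlinked rather than copied, and links between notes are not
rewritten yet:

```python
from pathlib import Path
from typing import Callable


def build_tree(
    src: Path,
    dst: Path,
    source_ext: str,
    target_ext: str,
    translate: Callable[[str], str],
) -> None:
    """Mirror src into dst, translating markup files and linking the rest.

    Assumes dst lies outside src, so the walk never visits its own output.
    """
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.rglob("*")):
        rel = path.relative_to(src)
        if path.is_dir():
            (dst / rel).mkdir(parents=True, exist_ok=True)
        elif path.suffix == source_ext:
            # Markup file: translate it and change the extension.
            out = (dst / rel).with_suffix(target_ext)
            out.parent.mkdir(parents=True, exist_ok=True)
            text = path.read_text(encoding="utf-8")
            out.write_text(translate(text), encoding="utf-8")
        else:
            # Data file (image, archive, ...): symlink to the original.
            link = dst / rel
            link.parent.mkdir(parents=True, exist_ok=True)
            if link.is_symlink() or link.exists():
                link.unlink()
            link.symlink_to(path.resolve())
```

Calling it once per target, for instance with ".note" as the source
extension and ".gmi" or ".html" as the target extension, would give one
generated tree per protocol from the same source tree.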


If I get to write this tool and make it even vaguely usable by others I
will announce it here.

How do you, and other people here, solve the problem? Do you write
sites accessible only to Gemini or only to Gopher?

Thanks for the conversation.

--
Luca Saiu -- http://ageinghacker.net
I support everyone's freedom of mocking any opinion or belief, no
matter how deeply held, with open disrespect and the same unrelented
enthusiasm of a toddler who has just learned the word "poo".

meff

Jan 23, 2022, 3:02:54 PM
On 2022-01-23, Luca Saiu <lu...@ageinghacker.net> wrote:
> Disgusted by the web with its anti-features, its enormous gratuitous
> complexity and its essentially proprietary nature (the effort of
> re-implementing a significant component from scratch is unrealistic for
> single developers), I have recently opened Gemini and Gopher services

Hm, I didn't think Jehovah's Witnesses made it from the Web onto the net
also...

> After experimenting for one day or two I have to admit that the result
> is disappointing. The conversion is unnatural and I find that at the
> same time some important information is lost (<tt> and <pre>) while some
> which is irrelevant is preserved (icons). Having out-of-line links does
> not help readability when references are numerous.

Indeed it's hard to move from HTML to more "lean" markup formats when
you are, at least somewhat, relying on the semantic information that
HTML is providing.

> I have come to believe that the only really practical solution is
> translating in the opposite direction: starting from a simple and clean
> markup (I would say Gemini) and from that generating other simple
> markups (Gopher) and the legacy system (HTML). This can and should
> handle relative, intra-server links.

HTML offers semantic information in its markup and browsers can take
that semantic information and make sense of it. There's obviously a
lot of conflation between visual and semantic information in HTML, but
the semantic information present makes it hard to translate to
non-semantic markup formats. Markdown, Gemtext, etc. are mostly just
visual formats (where the browser is usually fairly "dumb" about how
to display the format). It might make sense to format your information
in a non-semantic way first and then add semantic niceties, like
<pre>, afterward.

bunburya

Jan 24, 2022, 4:24:15 PM

On 23/01/2022 16:58, Luca Saiu wrote:
> To me the lack of control on preformatted text is more serious than the
> lack of inline links, possibly because of the technical topics I
> normally write about.
>
> Not being able to display source code clearly is a fatal flaw for me;
> and notice how Gopher is less flawed than Gemini in this sense, by
> virtue of being less abstract.

What is the issue you are having with pre-formatted text? Line numbering
and syntax highlighting are the main things that come to mind - I think
these could be achieved on the client side, though I'm not aware of any
client that currently does so. (I know there was some discussion of
syntax highlighting in pre-formatted text on the mailing list a while
ago; the majority view seemed to be that the alt text part of the
pre-formatted text block could indicate the language, though some
disagreed with using alt text in that way).
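
For what it's worth, reading the alt text on the client side is simple.
A small sketch in Python, assuming that whatever follows the opening
toggle is the alt text and that a client chooses to treat it as a
language hint (that interpretation is the disputed convention, not
something the format mandates); the function name preformatted_blocks
is made up for the example:

```python
from typing import Iterator, List, Optional, Tuple


def preformatted_blocks(gemtext: str) -> Iterator[Tuple[str, str]]:
    """Yield (alt_text, content) for each preformatted block."""
    alt: Optional[str] = None  # None means "not inside a block"
    lines: List[str] = []
    for line in gemtext.splitlines():
        if line.startswith("```"):
            if alt is None:
                # Opening toggle: the rest of the line is the alt text.
                alt = line[3:].strip()
                lines = []
            else:
                # Closing toggle: anything after it is ignored.
                yield alt, "\n".join(lines)
                alt = None
        elif alt is not None:
            lines.append(line)
```

A client could then pass the content to a highlighter when the alt text
names a language it knows, and fall back to plain display otherwise.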


> How do you, and other people here, solve the problem? Do you write
> sites accessible only to Gemini or only to Gopher?

Personally I publish only to Gemini; however, I write very little anyway
(really just my gemlog, which is not updated all that often) so I don't
claim to be any kind of example to follow.

Luca Saiu

Jan 25, 2022, 7:11:03 PM
On 2022-01-24 at 21:24 +0000, bunburya wrote:

> On 23/01/2022 16:58, Luca Saiu wrote:
>> To me the lack of control on preformatted text is more serious than the
>> lack of inline links, possibly because of the technical topics I
>> normally write about.
>> Not being able to display source code clearly is a fatal flaw for me;
>> and notice how Gopher is less flawed than Gemini in this sense, by
>> virtue of being less abstract.
>
> What is the issue you are having with pre-formatted text?

On Gemini we have to clearly indicate what is pre-formatted and what is
not, because the default is that whitespace can be congealed and lines
broken and moved in order to fill paragraphs; Gopher does not do it, but
that means that paragraphs may end up displayed too narrow or too wide
for the client.
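
The difference can be sketched in Python: when generating plain text for
Gopher from gemtext, paragraphs have to be wrapped ahead of time at a
fixed width, while preformatted blocks are passed through untouched.
The function name gemtext_to_gopher_text and the default width of 67
columns are assumptions for illustration, and link lines are simply
left as they are:

```python
import textwrap


def gemtext_to_gopher_text(gemtext: str, width: int = 67) -> str:
    """Wrap paragraph lines at a fixed width, keep preformatted lines."""
    out = []
    in_pre = False
    for line in gemtext.splitlines():
        if line.startswith("```"):
            # Toggle lines only mark the block; they are not emitted.
            in_pre = not in_pre
            continue
        if in_pre or not line.strip() or line.startswith("=>"):
            # Preformatted, blank and link lines are copied verbatim.
            out.append(line)
        else:
            # A gemtext paragraph is one long line: wrap it here, since
            # the Gopher client will not reflow it.
            out.append(textwrap.fill(line, width=width))
    return "\n".join(out)
```

The width is fixed when the text is generated, which is exactly why the
result may look too narrow or too wide on a given client.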

If I convert from HTML my quick hack based on Lynx fails because the
information on what was pre-formatted is lost. Converting *well* from
HTML requires analysing CSS as well.

For new text, not obtained by conversion, the Gemini solution works
well.

> Line numbering and syntax highlighting are the main things that come
> to mind - I think these could be achieved on the client side, though
> I'm not aware of any client that currently does so.

Yes. I am not against these features as long as line numbering does not
interfere with cut-and-paste.

> (I know there was some discussion of syntax highlighting in
> pre-formatted text on the mailing list a while ago; the majority view seemed to
> be that the alt text part of the pre-formatted text block could indicate the
> language, though some disagreed with using alt text in that way).

The alt text is a good feature by itself (example: this kind of
colour-coding for different programming languages or abstraction layers:
http://ageinghacker.net/projects/jitter-tutorial/ ), but in my opinion
not very philosophically coherent with the rest of Gemini which is
otherwise so minimalistic. And the alt text, again, lends itself to
semantic extension.

>> How do you, and other people here, solve the problem? Do you write
>> sites accessible only to Gemini or only to Gopher?
>
> Personally I publish only to Gemini; however, I write very little
> anyway (really just my gemlog, which is not updated all that often) so
> I don't claim to be any kind of example to follow.

I see. Thanks.

I think I will write some simple translator to generate Gemini, Gopher
and HTML from the same source.