
HTML to LaTeX


Fastian

Jun 13, 2013, 7:20:47 AM
I am in search of a good program that can convert an HTML file to an equivalent tex file. My use case is that I take input from the user on the web in an HTML editor and store it in an HTML file. I want this input file to be converted into its equivalent tex file. I tried to use the Java-based 'htmltolatex' (http://htmltolatex.sourceforge.net/) but was not able to run it properly.

Do you know of any good, stable program that can help me convert an HTML file to a TeX file?

superpollo

Jun 13, 2013, 7:53:43 AM
Fastian wrote:
> I am in search of a good program that can convert an HTML file to an equivalent tex file.

what do you mean by "equivalent"?

> Do you know of any good, stable program that can help me convert an HTML file to a TeX file?

assuming you are trolling, i'd suggest to make a few screenshots of the
web page and then \includegraphics...

\bye

--
The Tunze works in all dimensions, the Standard only
in squares.

Robin Fairbairns

Jun 13, 2013, 2:30:28 PM
superpollo <super...@tznvy.pbz> writes:

> Fastian wrote:
>> I am in search of a good program that can convert an HTML file to an
>> equivalent tex file.
>
> what do you mean by "equivalent"?

one that compiles to pages with the same semantics? it's too much to
hope that the result will be paginated and laid out the same.

>> Do you know of any good, stable program that can help me convert an
>> HTML file to a TeX file?
>
> assuming you are trolling, i'd suggest to make a few screenshots of
> the web page and then \includegraphics...

ho ho, very droll, mr chicken.

there are things that will _compile_ html (at some iteration of html's
ambition to be all things to all men). i doubt they would be much use
on modern web pages (though the uk faq pages are generated on the fly,
so a transformation of such pages won't be far from my original tortured
latex...)
--
Robin Fairbairns, Cambridge

jon

Jun 13, 2013, 4:21:28 PM
On Jun 13, 7:20 am, Fastian <abdulbasit.f...@gmail.com> wrote:
> I am in search of a good program that can convert an HTML file to an equivalent tex file. My use case is that I take input from the user on the web in an HTML editor and store it in an HTML file. I want this input file to be converted into its equivalent tex file. I tried to use the Java-based 'htmltolatex' (http://htmltolatex.sourceforge.net/) but was not able to run it properly.
>
> Do you know of any good, stable program that can help me convert an HTML file to a TeX file?

'equivalent' is unclear to me in this context, but you could try
pandoc <http://johnmacfarlane.net/pandoc/>, which can read
html (among other things) and write latex and context (among
other things -- even beamer slides, apparently). it is actively
developed and maintained.
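
A minimal sketch of driving that conversion from a script, assuming pandoc is installed and on the PATH. The helper names and the file names `input.html` and `output.tex` are placeholders for illustration, not anything from the original question.

```python
import shutil
import subprocess


def pandoc_command(src, dest, standalone=True):
    """Build the pandoc argument list for an HTML -> LaTeX conversion."""
    cmd = ["pandoc", "--from", "html", "--to", "latex", "--output", dest]
    if standalone:
        # --standalone emits a complete document (preamble included)
        # rather than a fragment to be pulled into another file.
        cmd.append("--standalone")
    cmd.append(src)
    return cmd


def convert(src, dest):
    """Run pandoc; raises CalledProcessError if pandoc reports failure."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(src, dest), check=True)


if __name__ == "__main__":
    # Placeholder file names, chosen only for illustration.
    print(" ".join(pandoc_command("input.html", "output.tex")))
```

Dropping `--standalone` may suit the web-editor case better, since the fragment can then be included in a preamble you control.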

cheers,
jon.

Robin Fairbairns

Jun 14, 2013, 5:25:49 AM
i didn't know that; pandoc seems a good thing, and i once tried to get
the author (authors?) to submit it to ctan ... with little success.
--
Robin Fairbairns, Cambridge

Axel Berger

Jun 14, 2013, 6:17:02 AM
superpollo wrote:
> assuming you are trolling, i'd suggest to make a few screenshots
> of the web page and then \includegraphics...

Not necessarily so. Some sites are impossible to print acceptably. So
the only thing to do is copy and paste the main content. At least the
semantic bits, <Hx>, <EM>, <STRONG>, are easy and obvious to convert. I have
done some of that as a macro for my editor, but a comprehensive solution
covering most eventualities would be nice.
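
To illustrate how small the "easy and obvious" part is, here is a rough sketch in Python (not the editor macro mentioned above). It handles only <EM> and <STRONG> and escapes LaTeX special characters in the text. The mapping to `\emph` and `\textbf` is an assumption, and headings are left out because they need a policy decision about sectioning levels.

```python
from html.parser import HTMLParser

# Characters that are special in LaTeX text mode and their escaped forms.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

# Assumed tag-to-command mapping; HTMLParser lowercases tag names,
# so <EM> and <em> are treated alike.
INLINE = {"em": r"\emph", "strong": r"\textbf"}


class SemanticToLatex(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in INLINE:
            self.out.append(INLINE[tag] + "{")

    def handle_endtag(self, tag):
        if tag in INLINE:
            self.out.append("}")

    def handle_data(self, data):
        # Escape character by character so nothing is escaped twice.
        self.out.append("".join(LATEX_ESCAPES.get(c, c) for c in data))


def html_to_latex(fragment):
    parser = SemanticToLatex()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

For example, `html_to_latex("<EM>50%</EM> of <strong>A&amp;B</strong>")` gives `\emph{50\%} of \textbf{A\&B}`. Unknown tags are silently dropped and their text kept, which is one of many choices a comprehensive solution would have to make.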

Josef Kleber

Jun 14, 2013, 7:29:30 AM
That's true for simple documents. For a complex LaTeX document, the
conversion to HTML is somewhat disappointing. You really can't blame
pandoc for that because it is complex! Maybe HTML -> LaTeX works better,
as HTML is defined and more limited.

OTOH, pandoc works quite well if your starting point is a quite simple
markup like markdown. Conversion to LaTeX, HTML or others is then quite
good.

Josef

Lee Rudolph

Jun 14, 2013, 9:32:59 AM
Josef Kleber <josef....@nurfuerspam.de> writes:

>On 13.06.2013 22:21, jon wrote:
>> On Jun 13, 7:20 am, Fastian <abdulbasit.f...@gmail.com> wrote:
>>> Do you know of any good, stable program that can help me convert an HTML file to a TeX file?
>>
>> 'equivalent' is unclear to me in this context, but you could try
>> pandoc <http://johnmacfarlane.net/pandoc/>, which can read
>> html (among other things) and write latex and context (among
>> other things -- even beamer slides, apparently). it is actively
>> developed and maintained.
>
>That's true for simple documents. For a complex LaTeX document, the
>conversion to HTML is somewhat disappointing. You really can't blame
>pandoc for that because it is complex! Maybe HTML -> LaTeX works better,
>as HTML is defined and more limited.

On the other hand, much (I would hazard, very much) of the
"HTML" found on the web fails in some (often, many) ways
to accord with the definitions of HTML. Thanks to (or maybe
better, merely "due to") the way browsers are written,
non-conformant HTML often *appears* to be conformant (to
something; which may differ between browsers) when displayed.
What is an HTML -> LaTeX engine to do in such cases? Make
whatever guesses some fixed choice of browser makes, so as
to produce LaTeX that compiles to *something*, plus a log
full of LaTeX warnings? Make no guesses, and produce
non-compilable code plus a log full of LaTeX error
messages? Either situation leaves an engine-user
who does not have control of the original HTML (which
I assume is the original poster's situation) with an
enormous amount of post-engine manual labour, and
even then with no guarantee of having captured the
(badly implemented) intent of the author of the original
HTML. The situation is somewhat different if the same
person wrote the HTML (first), then converted it to
TeX; but I don't think all that different in quality
(just, with luck, different in quantity of manual
labour).

Lee Rudolph

Axel Berger

Jun 14, 2013, 11:43:48 AM
Lee Rudolph wrote:
> What is an HTML -> LaTeX engine to do in such cases?

The same thing ALL browsers ought to have done right from the start:
Print "Syntax Error" and stop. If only they had agreed to that from the
beginning the Web would be a much better place today.
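
A sketch of that policy for a converter front end, assuming the input claims to be XHTML so that a plain XML parser is a fair judge. This checks well-formedness only, not validity against any HTML DTD, and the exception name is made up for the example.

```python
import xml.etree.ElementTree as ET


class HtmlSyntaxError(ValueError):
    """Hypothetical error type: raised instead of guessing at broken markup."""


def parse_strict(xhtml_text):
    """Return the parsed root element, or refuse outright."""
    try:
        return ET.fromstring(xhtml_text)
    except ET.ParseError as err:
        line, column = err.position
        raise HtmlSyntaxError(
            f"Syntax Error at line {line}, column {column}: {err}"
        ) from err
```

`parse_strict("<p>Hello <em>world</em></p>")` returns the `p` element, while `parse_strict("<p>Hello <em>world</p>")` raises. Note that unclosed `<br>` tags and named entities such as `&nbsp;` are also rejected by an XML parser, so ordinary HTML 4 would need cleaning up first.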

Axel

Lee Rudolph

Jun 14, 2013, 12:13:51 PM
Of course I agree without reservations. However, my
guesses about what the OP wants to do lead me to suppose
that that wouldn't be satisfactory for him/her.

Lee Rudolph

Robert Heller

Jun 14, 2013, 1:46:22 PM
This is just a *guess* on my part, but I *suspect* that the OP might actually
be trying to convert from (shudder) MS-Word or OO to LaTeX and is using HTML as
an intermediate format.

One of the *major* problems with the HTML to LaTeX conversion (whether from
proper HTML, improper HTML, or HTML from a WP doc) is that HTML and LaTeX are
*very* different animals. HTML is all about *visual* markup and LaTeX is all
about *logical* markup. Yes, one can do something like map

<h1>Hello World</h1>

to

{\Huge Hello World}

But there is no way to tell that really this should be:

\chapter{Hello World}

E.g., the original *intent* of the author of the HTML was that h1 tags were for
chapter headings.



--
Robert Heller -- 978-544-6933 / hel...@deepsoft.com
Deepwoods Software -- http://www.deepsoft.com/
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments



Axel Berger

Jun 14, 2013, 4:53:57 PM
Robert Heller wrote:
> HTML is all about *visual* markup and LaTeX is all
> about *logical* markup.

Actually if anything it's the other way round. TeX at least is all
visual, LaTeX cloaks a semantic layer around that to a degree and tools
like hyperref do more, but behind it all there's nothing but
typesetting.
HTML on the other hand was designed to be all semantic with only the
browser and its user deciding how those semantics ought best to be
displayed.

In practice LaTeX users mostly care about maintainability and quality
while HTML "programmers" care about nothing much at all (except conning
the clueless client) so the real world end result tends to be as you
say, but that's not inherent.

Axel

Manuel Collado

Jun 15, 2013, 6:37:12 AM
On 14/06/2013 22:53, Axel Berger wrote:
> Robert Heller wrote:
>> HTML is all about *visual* markup and LaTeX is all
>> about *logical* markup.
>
> Actually if anything it's the other way round. TeX at least is all
> visual, LaTeX cloaks a semantic layer around that to a degree and tools
> like hyperref do more, but behind it all there's nothing but
> typesetting.
> HTML on the other hand was designed to be all semantic with only the
> browser and its user deciding how those semantics ought best to be
> displayed.

Agreed.

>
> In practice LaTeX users mostly care about maintainability and quality
> while HTML "programmers" care about nothing much at all (except conning
> the clueless client) so the real world end result tends to be as you
> say, but that's not inherent.

I have developed an HTML to LaTeX converter for myself [1]. It mostly
works, but with serious unavoidable limitations:

1.- Most web pages just contain garbage HTML code. They need intensive
clean-up before trying to convert the markup to anything else.

2.- The document model of HTML is quite different from the LaTeX one. So
there are HTML constructs that cannot be mapped to LaTeX in an effective way.

[1] http://lml.ls.fi.upm.es/~mcollado/xhtmlatex/
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

jon

Jun 15, 2013, 11:27:28 AM
On Jun 15, 6:37 am, Manuel Collado <m.coll...@domain.invalid> wrote:
> On 14/06/2013 22:53, Axel Berger wrote:
>
> > Robert Heller wrote:
> >> HTML is all about *visual* markup and LaTeX is all
> >> about *logical* markup.
>
> > Actually if anything it's the other way round. TeX at least is all
> > visual, LaTeX cloaks a semantic layer around that to a degree and tools
> > like hyperref do more, but behind it all there's nothing but
> > typesetting.
> > HTML on the other hand was designed to be all semantic with only the
> > browser and its user deciding how those semantics ought best to be
> > displayed.
>
> Agreed.

i'm no expert in either, but the problem always seemed to me to be
the author, whether of html or *tex. maybe <h1>some words</h1>
wasn't meant to be \chapter{some words}, but on the latex side
how many of us have seen people using \begin{center} ...
\end{center} to give their document a title, or ending paragraphs
with \\ (etc., etc.)?

cheers,
jon.

Nasser M. Abbasi

Jun 15, 2013, 10:37:20 PM
On 6/13/2013 6:20 AM, Fastian wrote:
> I am in search of a good program that can convert an HTML file to an equivalent tex file.

One option to try:

1. Using Chrome, select Print and choose PDF as the output.
2. Use a PDF-to-LaTeX conversion tool.

Googling "pdf to latex" turns up a few that can be tried.

Chrome pdf saving of web pages is very good btw. You can
try it.

--Nasser

Peter Flynn

Jun 17, 2013, 3:13:02 PM
Yes, XSLT2. But it's a programming language, so you have to specify what
you want the HTML converted to. Most of it is straightforward if the
HTML has been used conventionally, but more difficult if it has been
used for decoration. If you post a small example, I can show you more.

///Peter

Peter Flynn

Jun 17, 2013, 3:16:19 PM
On 06/14/2013 06:46 PM, Robert Heller wrote:
> HTML is all about *visual* markup and LaTeX is all
> about *logical* markup.

HTML *can* be all about logical markup: that is, after all, what it was
originally designed for. But you are correct that most HTML nowadays is
just pretty-printing.

> Yes, one can do something like map
>
> <h1>Hello World</h1>
>
> to
>
> {\Huge Hello World}
>
> But there is no way to tell that really this should be:
>
> \chapter{Hello World}
>
> E.g., the original *intent* of the author of the HTML was that h1 tags were for
> chapter headings.

You just have to make that assumption (although IMHO chapter is
unlikely), and translate H1 to \section, H2 to \subsection, etc.

If that's not the original author's intent, then either the OP must find
out what that was, or tough cookie.
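
One way to keep that assumption adjustable is to make the top sectioning level a parameter, sketched here in Python. The function names are invented for the example, and the regular expression is a shortcut that only suits already-tidied input whose headings contain plain text with no LaTeX special characters.

```python
import re

# LaTeX sectioning commands, outermost first (\chapter needs book/report).
SECTIONING = ["chapter", "section", "subsection", "subsubsection",
              "paragraph", "subparagraph"]

HEADING_RE = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>",
                        re.IGNORECASE | re.DOTALL)


def heading_command(level, top="section"):
    """Map an HTML heading level (1-6) to a sectioning command name."""
    index = SECTIONING.index(top) + level - 1
    # Headings deeper than LaTeX can express are clamped to the innermost.
    return SECTIONING[min(index, len(SECTIONING) - 1)]


def convert_headings(html, top="section"):
    """Replace <hN>...</hN> with the matching LaTeX sectioning command."""
    def repl(match):
        command = heading_command(int(match.group(1)), top)
        return "\\%s{%s}" % (command, match.group(2).strip())
    return HEADING_RE.sub(repl, html)
```

With the default, `convert_headings("<h1>Hello World</h1>")` yields `\section{Hello World}`; passing `top="chapter"` yields `\chapter{Hello World}` instead, so whoever knows the author's intent can pick.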

///Peter

Peter Flynn

Jun 17, 2013, 3:17:24 PM
On 06/15/2013 11:37 AM, Manuel Collado wrote:
[...]
> I have developed an HTML to LaTeX converter for myself [1]. It mostly
> works, but with serious unavoidable limitations:
>
> 1.- Most web pages just contain garbage HTML code. They need intensive
> clean-up before trying to convert the markup to anything else.

Pass the HTML through Tidy first, and then work with the XHTML.
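
A sketch of that first step as a Python wrapper, assuming the `tidy` command-line program is installed. The function and file names are placeholders. `-numeric` asks for numeric rather than named entities, which helps if the XHTML is later read by a plain XML parser.

```python
import subprocess


def tidy_command(src, dest):
    """Argument list asking HTML Tidy for XHTML with numeric entities."""
    return ["tidy", "-quiet", "-asxhtml", "-numeric", "-output", dest, src]


def tidy_to_xhtml(src, dest):
    """Run Tidy; it exits 0 when clean, 1 on warnings, 2 on errors."""
    result = subprocess.run(tidy_command(src, dest))
    if result.returncode >= 2:
        raise RuntimeError(f"tidy reported errors in {src}")
    return dest
```

Warnings (exit status 1) are let through here on the assumption that Tidy has already repaired what it complained about; a stricter pipeline could reject them too.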

///Peter

Fastian

Jun 18, 2013, 3:13:16 AM
On Thursday, June 13, 2013 4:20:47 PM UTC+5, Fastian wrote:
> I am in search of a good program that can convert an HTML file to an equivalent tex file. My use case is that I take input from the user on the web in an HTML editor and store it in an HTML file. I want this input file to be converted into its equivalent tex file. I tried to use the Java-based 'htmltolatex' (http://htmltolatex.sourceforge.net/) but was not able to run it properly.
>
> Do you know of any good, stable program that can help me convert an HTML file to a TeX file?

I have tried htmltolatex (http://htmltolatex.sourceforge.net/) and it has worked.