The output is:
> 0.
<to be read again>
!
<*> \showthe\count25!
!!
Look at the last two lines. They point to the character '!' after
\count25 which is an error because '!' is not part of the <internal
quantity> eaten by \showthe. Correct lines would be:
<*> \showthe\count25
!!!
It works correctly for single variables (\showthe\endlinechar) and for
countdef tokens (\countdef\x25\showthe\x), but not for registers with
indexes. Can I ask Knuth for the legendary $327.68? :-)
I think the real error is reporting "count25" instead of
"count10"! :-)
I suspect you have sorted it out now, but the messages
are entirely correct and detailed enough to be clear.
TeX had to read the first "!" to find that it was not
another digit for the register index. It has put it aside
"to be read again". Two more exclamations are still
pending in the buffer and have not been read.
> It correctly works for single variables (\showthe\endlinechar)
It works correctly in both cases, but for \endlinechar it is
as you expected because TeX did not look for additional
digits.
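A minimal sketch in Python of the lookahead described above, assuming a
simplified model in which the input line is a plain string and every
character is one token. The name scan_digits and its return shape are
hypothetical, and the sketch ignores the optional space that TeX absorbs
after a number.

```python
def scan_digits(line, pos):
    """Read a decimal number starting at line[pos].

    Returns (value, new_pos, set_aside). new_pos is how far the buffer
    has been read; set_aside is the non-digit that ended the number,
    which has been read but is kept "to be read again".
    """
    value = 0
    while pos < len(line):
        ch = line[pos]
        pos += 1                      # the character is now read from the buffer
        if ch in "0123456789":
            value = value * 10 + int(ch)
        else:
            return value, pos, ch     # set aside, to be read again
    return value, pos, None           # ran off the end of the line


line = r"\showthe\count25!!!"
start = len(r"\showthe\count")
value, pos, set_aside = scan_digits(line, start)
print(value, repr(set_aside), repr(line[pos:]))
```

In this model the register index is 25, the first '!' has been read and
set aside, and only the remaining two are still unread in the buffer.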
Donald Arseneau
The reason for suspecting a bug was that \showthe is executed after
the number was parsed and TeX already knows where it stops. Currently
in my implementation of a TeX parser (http://code.google.com/p/texpp/)
the \showthe execution routine gets only the tokens which form the number
itself, so emulating this TeX behavior requires additional effort.
ks.vladimir wrote:
> The reason for suspecting a bug was that \showthe is executed after
> the number was parsed and TeX already knows where it stops. Currently
> in my implementation of a TeX parser (http://code.google.com/p/texpp/)
> the \showthe execution routine gets only the tokens which form the number
> itself, so emulating this TeX behavior requires additional effort.
Please tell us - or at least me - more about this project. I'm
particularly interested in knowing why you wish to write it (not that I
think it is a bad idea).
--
Jonathan
Currently TeXpp parses TeX documents into a simple form of document tree
(which should be especially convenient when using the python interface).
The commands are actually executed, so TeX's self-modifying features
will work. Of course it still requires a huge amount of work to be
completed...
Thank you. I have some suggestions which might help you. But first,
could you give some examples of 'difficult features' in the documents
you are processing.
For example, do they change catcodes? (Most maths documents don't.)
> I've tested some of the
> already existing projects (LaTeX::TOM, plasTeX) but they do not meet
> this requirement. I've also spent some time digging into the TeX source
> code, but re-using it does not seem to be an easy task for me.
I didn't know about LaTeX::TOM. Thank you.
--
Jonathan
Yes, catcodes are almost never touched (except for @). The most
problematic feature is macros and their arguments. Another problem is
the huge number of packages which define their own curious macros. I've
tested LaTeX::TOM, plasTeX and detex on a huge number of random articles
from arxiv.org and the failure rate was very high, for various
reasons.
Actually, due to time limits, my initial plan was to implement only
the most essential subset of TeX commands in the library and stubs for
other TeX and LaTeX commands in the python program on top of the
library. The direction for further work will be determined by testing
the program on real articles.
I get the same result, using
\message{\number`\zzz}
Note that 48 is the ASCII code for '0'. So it seems as though TeX is
inserting a '0' character as error recovery, and then applying \zzz.
I've also tried this:
===
$ tex
This is TeX, Version 3.141592 (MiKTeX 2.4)
**\def\zzz{123}
*\message{\number`\zzz}
! Improper alphabetic constant.
<to be read again>
\zzz
<*> \message{\number`\zzz
}
?
48123
*
===
I'd have to look at the source of TeX (which I don't have to hand) to
know if this is the intended behaviour. It does look a bit strange.
--
Jonathan
So is it arXiv articles that you want to process? There are others
doing this sort of thing.
http://kwarc.eecs.iu-bremen.de/projects/arXMLiv/
They use Bruce Miller's LaTeXML, and their agenda might be similar to yours.
--
Jonathan
It always assigns 48 (= the digit zero):
~$ tex
This is TeX, Version 3.1415926 (Web2C 7.5.7)
**\showthe\count0
> 1.
<*> \showthe\count0
?
*\count0=0
*\count0=`\par
! Improper alphabetic constant.
<to be read again>
\par
<*> \count0=`\par
?
*\showthe\count0
> 48.
<*> \showthe\count0
?
This is defined in tex.web:
if cur_val>255 then
begin print_err("Improper alphabetic constant");
@.Improper alphabetic constant@>
help2("A one-character control sequence belongs after a ` mark.")@/
("So I'm essentially inserting \0 here.");
cur_val:="0"; back_error;
end
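
The recovery above can be sketched in Python, assuming a simplified model
in which the token after the left quote is given as a string such as "b",
"\\b" or "\\par". The function name and the (value, pushed_back) return
shape are hypothetical, and active characters are not modelled.

```python
def scan_alphabetic_constant(token):
    """Return (value, pushed_back) for the token following a left quote.

    A character token or a one-character control sequence gives its
    character code. Anything else is an "Improper alphabetic constant":
    the value becomes the code of the digit zero and the offending
    token is pushed back so that it is read again afterwards.
    """
    if len(token) == 1:
        return ord(token), None                 # `b
    if len(token) == 2 and token[0] == "\\":
        return ord(token[1]), None              # `\b
    return ord("0"), token                      # error recovery, as in tex.web


for tok in ["b", "\\b", "\\par", "\\zzz"]:
    print(tok, scan_alphabetic_constant(tok))
```

Because the offending token is pushed back rather than dropped, it is
still executed after the 48 has been used, which is consistent with the
"48123" and the later "Undefined control sequence" error in the logs in
this thread.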
--
Replace “READ-MY-SIG” by “tcalveu” to answer by mail.
> *\count0=`\par
> ! Improper alphabetic constant.
> <to be read again>
> \par
> <*> \count0=`\par
>
> ?
>
> *\showthe\count0
>> 48.
> <*> \showthe\count0
>
> ?
>
>
> This is defined in tex.web:
>
> if cur_val>255 then
> begin print_err("Improper alphabetic constant");
> @.Improper alphabetic constant@>
> help2("A one-character control sequence belongs after a ` mark.")@/
> ("So I'm essentially inserting \0 here.");
> cur_val:="0"; back_error;
> end
It seems to me that Don Knuth gets to keep his money. Inserting \0 is a
reasonable thing to do (although not the only reasonable thing, and
perhaps not the best).
I didn't think of using 'help2' to find out what was going on. Well done.
--
Jonathan
Yes, I understand. Thanks for pointing me to the code, I should have
thought about using help2 :)
Thanks a lot for pointing me to this project, looks very interesting.
The agenda is very similar, but not the same. I don't need to convert
articles but to change them while keeping them suitable for
further editing by the author, which requires preserving the original
formatting (i.e. keeping all characters that are normally thrown away
by the TeX input processor). On the other hand my task is a lot easier
because I don't have to attach any meaning to commands except a small
subset which is interesting to me - most commands just have to be
parsed but not converted.
Another note is that the TeXpp approach is more general: I'm trying to
make it fully compatible with TeX itself. For example, I have
automated tests that compare the behavior of TeXpp and TeX
(http://code.google.com/p/texpp/source/browse/#svn/trunk/tests/tex). In the
future this will make TeXpp able to load any TeX package.
> I've found one more strange behavior in TeX: the code \count0=`\zzz
> assigns value 48 to the count0.
You should study the .log file and press h for more error
information:
\count0=`\zzz
\message{<\the\count0>}
\end
The result:
| This is TeX, Version 3.14159 (Web2C 7.5.2)
| (./test.tex
| ! Improper alphabetic constant.
| <to be read again>
| \zzz
| l.1 \count0=`\zzz
|
| ? h
| A one-character control sequence belongs after a ` mark.
| So I'm essentially inserting \0 here.
|
| ?
| ! Undefined control sequence.
| <recently read> \zzz
|
| l.1 \count0=`\zzz
|
| ? h
| The control sequence at the end of the top line
| of your error message was never \def'ed. If you have
| misspelled it (e.g., `\hobx'), type `I' and the correct
| spelling (e.g., `I\hbox'). Otherwise just continue,
| and I'll forget about whatever was undefined.
|
| ?
| <48> )
| No pages of output.
| Transcript written on test.log.
The TeXbook writes:
| But TeX actually provides another kind of <number> that makes it
| unnecessary for you to know ASCII at all! The token `_{12} (left quote),
| when followed by any character token or by any control sequence token
| whose name is a single character, stands for TeX's internal code for the
| character in question. For example, \char`b and \char`\b are also
| equivalent to \char98.
Yours sincerely
Heiko <ober...@uni-freiburg.de>
>> They use Bruce Miller's LaTeXML, and their agenda might be similar to yours.
>
> Thanks a lot for pointing me to this project, looks very interesting.
> The agenda is very similar, but not the same. I don't need to convert
> articles but to change them while keeping them suitable for
> further editing by the author, which requires preserving the original
> formatting (i.e. keeping all characters that are normally thrown away
> by the TeX input processor). On the other hand my task is a lot easier
> because I don't have to attach any meaning to commands except a small
> subset which is interesting to me - most commands just have to be
> parsed but not converted.
It seems to me that you want something like a pretty-printer that might
also be able to do some macro expansion (and perhaps contraction). In
particular, you'd like to keep author comments.
Is this correct? I ask because I would like something like this also.
--
Jonathan
Essentially yes. I want to replace some words and constructions in the
document but only in certain contexts and when it will not break
anything. As a simple example consider replacing all occurrences of
the word "enumerate" in the text by "\href{http://something}{enumerate}".
It should not touch "\begin{enumerate}" but should still replace it in
"\section{About enumerate}". Another requirement is that the resulting
document should still be easily editable by the author, which means
preserving comments, newlines, etc.
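
A rough sketch in Python of that simple example, using only a regular
expression and no real TeX parsing. The function name link_word is
hypothetical and "http://something" is the placeholder URL from the
example. Known limitation: it knows nothing about macros, so it would
also rewrite the word inside a comment or inside a macro argument where
that is not wanted.

```python
import re

# Either the environment delimiter (kept as it is) or the bare word.
# The lookbehind keeps a control sequence such as \enumerate untouched.
PATTERN = re.compile(r"\\(?:begin|end)\{enumerate\}|(?<!\\)\benumerate\b")


def link_word(source, url="http://something"):
    """Wrap the bare word "enumerate" in \\href, leaving all else as is."""
    def repl(match):
        text = match.group(0)
        if text.startswith("\\"):
            return text                          # \begin{enumerate}: untouched
        return "\\href{%s}{%s}" % (url, text)    # bare word: add the link
    return PATTERN.sub(repl, source)


print(link_word(r"\section{About enumerate}"))
print(link_word(r"\begin{enumerate}"))
```

Every character that is not part of a match is copied through unchanged,
so comments and newlines survive.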
And what is your use case?
>> It seems to me that you want something like a pretty-printer, that might
>> also be able to do some macro expansion (and perhaps contraction). In
>> particular, you'd like to keep author comments.
>>
>> Is this correct? I ask because I would like something like this also.
>
> Essentially yes. I want to replace some words and constructions in the
> document but only in certain contexts and when it will not break
> anything. As a simple example consider replacing all occurrences of
> the word "enumerate" in the text by "\href{http://something}{enumerate}".
> It should not touch "\begin{enumerate}" but should still replace it in
> "\section{About enumerate}". Another requirement is that the resulting
> document should still be easily editable by the author, which means
> preserving comments, newlines, etc.
Your use case is interesting. Here's my understanding of it. You
want to automatically create metadata, and use that metadata, and allow
the author to edit the metadata.
My use case is rather simpler. I've got quite a few TeX files with
rather long lines, which I'd like to reformat.
In addition, I'm looking for better tools for parsing and processing TeX
source files.
I'll post soon about what I've been doing in this area.
--
Jonathan
Indeed, it does know. However that ! had to have been read to
know where the number stops, and the context lines, showing
the token stream, aren't under the control of \showthe at all.
The context lines, with the line-breaks to indicate what has
been read, are displayed just as for all error messages,
under the (minimal) control of \errorcontextlines even though
\showthe isn't really an error.
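
As a rough illustration in Python of how such a two-line context display
can be produced from nothing but the input line and the position reached
by the reader (the function name and the "<*> " default are assumptions
for this sketch, and real TeX also truncates long lines):

```python
def context_lines(line, pos, prefix="<*> "):
    """Show what has been read on the first line, the rest on the second.

    The second line is indented so that the unread text starts in the
    column where reading stopped.
    """
    already_read = prefix + line[:pos]
    return already_read + "\n" + " " * len(already_read) + line[pos:]


# The first '!' has been read (it ended the number), so pos is 17.
print(context_lines(r"\showthe\count25!!!", 17))
```

The display depends only on how far the buffer has been read, not on
which command happens to be executing.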
Donald Arseneau
>> My use case is rather simpler. I've got quite a few TeX files with
>> rather long lines, which I'd like to reformat.
>>
>> In addition, I'm looking for better tools for parsing and processing TeX
>> source files.
> TeXpp could be such a tool when it is finished :)
> Currently I've already implemented all TeX data types and all internal
> parameters and registers. I hope very soon TeXpp will reach a stage
> where it can parse and load the plain.tex format.
I'm interested in solving relatively simple special cases of this
problem, such as well-behaved mathematics papers.
>> I'll post soon about what I've been doing in this area.
I'm a bit late on this. If you look at
http://pytex.svn.sourceforge.net/viewvc/pytex/trunk/pytex/sandbox/jfine/macroload/compile.py?revision=58&view=markup
you'll see that it's fairly easy to tokenize an input stream, provided
you can write the regular expressions.
I'd make things like
r'\begin{center}'
a single token.
Once that tokenizing is done, I think pretty-printing, and also the
transformation you want to do, will be straightforward.
I'd also make
r'''\makeatletter
% macro definitions
\makeatother'''
a single token.
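
A minimal sketch in Python of this kind of regular-expression tokenizer.
It is not the code at the URL above; the token classes and the name
tokenize are assumptions made for illustration, and catcode changes
other than the \makeatletter block are not handled.

```python
import re

TOKEN_RE = re.compile(r"""
    \\makeatletter.*?\\makeatother   # whole at-letter block as one token
  | \\(?:begin|end)\{[^{}]*\}        # environment delimiter as one token
  | \\[A-Za-z]+                      # control word
  | \\.                              # control symbol
  | %[^\n]*                          # comment up to the end of the line
  | \s+                              # run of whitespace
  | [{}]                             # a single brace
  | [^\\%\s{}]+                      # run of ordinary characters
  | \\                               # lone backslash at end of input
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Split source into tokens; joining them gives the source back."""
    return TOKEN_RE.findall(source)


src = "\\begin{center}\nHello % note\n\\end{center}\n"
for tok in tokenize(src):
    print(repr(tok))
```

Since every character ends up in exactly one token, including comments
and newlines, the original file can be reproduced by joining the tokens,
which is what a pretty-printer or a careful source transformation needs.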
--
Jonathan