First of all, there's the Inform compiler, which produces either Z-code
or Glulx (which is, sort-of, kind-of, a 32-bit version of Z-code,
although its author would disagree). You can find it here:
http://www.inform-fiction.org/
Given that, if you still want to write your own compiler then you need
"The Z-Machine Standards Document" found at:
http://www.inform-fiction.org/zmachine/standards/index.html
> On that note: has anyone ever done a program that could take a story file
> and generate a basic program (or any other language) from it?
Not off the top of my head. Back in '97-'98, there was a big effort to
decipher storyfiles, as part of the project to figure out how storyfiles
were put together and exactly what all of the Z-machine opcodes were
supposed to do. All of the Ztools were created at that time; you can
download them here:
http://www.inform-fiction.org/zmachine/ztools.html
InfoDump will display all of the strings in a storyfile, while txd will
give you just about every byte formatted for easy viewing. Nothing will
look at a procedure and create high-level code; the best you can ever
hope for is Inform assembler op-codes. You are especially out of luck
if you want a file that you can modify and re-compile, since storyfiles
use a compressed format to store pointers to objects, strings and
procedures. This means that deciding which variables and object
properties are pointers to objects and which are just numbers would be
especially uncertain. Any changes to the file that caused things to
move would invalidate any pointers that had been misidentified as
numbers (and vice versa), causing interesting errors at run time.
http://ifarchive.jmac.org/if-archive/infocom/tools/Language-Zcode-0.8.tar.gz
I believe it's built so that you should be able to translate Zcode into
languages other than Perl too, if you're willing to do some coding of
course.
Inform 1 was released to the public on 10 May 1993, and I recall Graham
Nelson claiming that he didn't decode the Zcode story file format
himself - others had already done that before him.
You have that problem, yes.
A bigger problem is distinguishing addresses from constants. If you
see a z-code opcode "load $4A8E into local variable 1", you have to
figure out that $4A8E actually refers to a string at that address,
rather than a numeric constant. This is important for a human reading
the code. It's even more important if you plan to modify the code and
recompile it, because in the new binary that string may be at a
different address. If you guess wrong about what that instance of
$4A8E means, you'll either print garbage or change an important
numeric constant in the program.
You can use some reasonable heuristics. For example, if local variable
1 is immediately printed, it's a very good bet that $4A8E is a string
address. If there *is* a string with address $4A8E, then again it's a
fair bet. But none of these heuristics is perfectly reliable, so any
true disassembling effort will require some going-over with human eyes.
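Such a heuristic pass is easy to sketch. Everything below is invented
scaffolding (the function, the set of known string addresses, the
"printed soon" flag), not real Ztools code:

```python
# Hypothetical sketch: guess whether an operand like $4A8E is a string
# address or a plain number, using the heuristics described above.
# Opcode tracking and the set of known string addresses are illustrative.

def classify_operand(value, known_string_addrs, printed_soon):
    """Return a guess ('string-addr', 'number', 'unknown') plus a reason."""
    if printed_soon:
        # The value flows straight into a print instruction: very good bet.
        return ("string-addr", "immediately printed")
    if value in known_string_addrs:
        # A string really does live at this address: a fair bet, not a proof.
        return ("string-addr", "matches a known string address")
    return ("unknown", "no evidence either way")

known = {0x4A8E, 0x4B10}
print(classify_operand(0x4A8E, known, printed_soon=True))
print(classify_operand(0x1234, known, printed_soon=False))
```

Note that both branches only ever return a *guess*; as the post says,
a human still has to confirm each one.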
(If that problem seems easy, then consider property identifiers. These
have the same difficulty, but they're small numbers like 4 or 7, which
-- unlike $4A8E -- are as likely to be *intended* as numbers as they
are to be property ids.)
--Z
"And Aholibamah bare Jeush, and Jaalam, and Korah: these were the borogoves..."
*
I'm still thinking about what to put in this space.
>
> You can use some reasonable heuristics. For example, if local variable
> 1 is immediately printed, it's a very good bet that $4A8E is a string
> address. If there *is* a string with address $4A8E, then again it's a
> fair bet. But none of these heuristics is perfectly reliable, so any
> true disassembling effort will require some going-over with human eyes.
Ok, so a big problem is distinguishing addresses from constants. How then
does an interpreter distinguish them? A disassembler can see just as much of
the story file as an interpreter. So why wouldn't it be able to distinguish
the cases from each other? If an interpreter can do it, so can a
disassembler.
> (If that problem seems easy, then consider property identifiers. These
> have the same difficulty, but they're small numbers like 4 or 7, which
> -- unlike $4A8E -- are as likely to be *intended* as numbers as they
> are to be property ids.)
And again, same reasoning. If an interpreter can distinguish them, then why
couldn't a disassembler?
The interpreter doesn't have to distinguish a value's meaning until it
actually gets used by an instruction. When the interpreter sees the
opcode to print local variable 1, it can assume that the variable
contains the address of a string. (And if it doesn't, it can signal a
run-time error.) But the value may have been stored in that variable at
a completely different time. Also note that if the variable was
originally named 'temp', then it will likely have different types of
values at different times, which can further bollix up the disassembler.
And static analysis of the entire program doesn't help since there might
be dead code and/or undiscovered bugs. Just because there's an
instruction somewhere to print local variable 1 doesn't mean that it
will ever be executed.
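A toy interpreter loop (with invented opcode names and storage) makes
the point concrete: the word in local variable 1 is just a number until
a print instruction consumes it, and only then is a meaning imposed:

```python
# Sketch of why an interpreter never decides a value's type up front.
# The opcodes, the program encoding, and the strings table are invented.

strings = {0x4A8E: "You can see a brass lantern here."}

def run(program):
    local1 = 0
    output = []
    for op, arg in program:
        if op == "store_local1":    # the value is still just a 16-bit word
            local1 = arg
        elif op == "add_local1":    # the same word used arithmetically
            local1 = (local1 + arg) & 0xFFFF
        elif op == "print_paddr":   # meaning is imposed only at the use site
            if local1 not in strings:
                raise RuntimeError("not a string address: %#x" % local1)
            output.append(strings[local1])
    return output

print(run([("store_local1", 0x4A8E), ("print_paddr", None)]))
```

The disassembler, by contrast, has to decide what `0x4A8E` meant at the
`store_local1` site, before any use is visible.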
BTW, Infocom used a LISP-based language that is very different from
Inform. There are storyfiles (Seastalker comes immediately to mind)
that the Inform compiler would be unable to duplicate, because it won't
let you put procedures and strings in readable memory.
LISP, APL and Forth are all interpreted languages that are incredibly
powerful, but which have incredibly steep learning curves. LISP-like
languages are particularly suited for natural language applications,
which is why they are used so much in Artificial Intelligence and were
judged well suited for interactive fiction.
The excerpts that you've seen published are pretty much all that there
is, unless Activision has some code hidden away in a vault somewhere.
The circumstantial evidence is pretty strong, however, that all they
have are the binaries, the same as us. Someone posted last year that
they were going to try to recreate the Infocom compiler; nothing has
been heard on the subject since then.
The bits of Infocom code we have are not LISP. They're in a compiled
(not interpreted) language, which is imperative in form (not
functional or dynamic). It's a lot like early Inform, which is not a
surprise. Only the syntax is LISP-like.
The interpreter doesn't distinguish them. It just passes the value
through blindly, and assumes that the right thing will happen.
Note that I specified the tasks of "making human-readable code" and
"modifying and recompiling code". Those are tasks which an interpreter
doesn't have to do. They are tasks which the Z-code format is not
designed to make easy. Unsurprisingly, therefore, they're *not* easy.
My knowledge of decompilers has deteriorated a lot in the last 25 years, but
I'm fairly sure this is a "standard problem" solved by some variant of type
inference or some other form of dynamic code analysis. I think there are
similar issues with byte-code-to-native compilers. I'm not saying it's easy,
just that there is literature in the compiler community that addresses similar problems.
>You can use some reasonable heuristics. For example, if local variable
>1 is immediately printed, it's a very good bet that $4A8E is a string
>address. If there *is* a string with address $4A8E, then again it's a
>fair bet. But none of these heuristics is perfectly reliable, so any
>true disassembling effort will require some going-over with human eyes.
IIRC the above-mentioned amount to formalizing these kinds of heuristics and
verifying that the code really does treat the value that way.
>(If that problem seems easy, then consider property identifiers. These
>have the same difficulty, but they're small numbers like 4 or 7, which
>-- unlike $4A8E -- are as likely to be *intended* as numbers as they
>are to be property ids.)
The interesting trick is if some clever compiler treats them as property
identifiers in one place and simple integers in another. I have vague
memories of some peephole optimizers generating such multi-use situations.
--
"Yo' ideas need to be thinked befo' they are say'd" - Ian Lamb, age 3.5
http://www.cs.queensu.ca/~dalamb/ qucis->cs to reply (it's a long story...)
Quite so, but I was mostly discussing why a language was chosen that was
"ugly". I've used a few variants of LISP, including (in reverse order)
CommonLisp, Scheme and an IBM mainframe version that was probably a
close descendant of McCarthy's original implementation. A really
interesting (though very short) discussion of ZIL (Zork Implementation
Language) can be found here:
http://lambda-the-ultimate.org/node/view/429
It indicates that Activision *does* have the source code for both ZIL
and all of the Infocom games. "A lot of people might not realize it,
but as late as 1993, Activision's "Return to Zork" was still written in
a (heavily extended) version of ZIL, i.e. a dialect of Lisp."
That's what I mean, though. If you're talking about MDL (the original
Zork), then it *was* more or less LISP-like, and I'm sure they chose
it because they were MIT hackers and that was the era when LISP was
the hippest thing in the cool universe.
If you're talking about ZIL, then I imagine they designed it ugly
because they were MIT hackers. And even though they were effectively
inventing a form of compiled BASIC, they were at least going to
*pretend* it was LISP-like. Dammit. Regardless, it doesn't actually
have any of the properties ("powerful, suited to natural languages")
that you were talking about.
> A really
> interesting (though very short) discussion of ZIL (Zork Implementation
> Language) can be found here:
>
> http://lambda-the-ultimate.org/node/view/429
I'm not convinced by everything posted there. In particular, calling
ZIL "MDL with a lot of its genuine arcana removed" seems to be
way off-base. ZIL had no facility to allocate or free memory; it had
no cons, car, or cdr facilities. I'm sticking with "compiled BASIC
with a lingering addiction to parentheses".
Having been a beta-tester for "Return to Zork", I would have to
disagree with the quote from this page... RtZ was written in a language
that _looked_ Lispy, but it had zero to do with MDL or ZIL... no common
code, no common authorship, etc.
-ethan
> Are you saying that due to this
> compressed format, any meaningful variable or function/procedure names
> would be lost, and instead you would be left with A,B,C, functionA,
> functionB, functionC, etc.? Because if the z-machine can run it, a
> z-disassembler ought to be able to render source code from it (albeit
> without friendly var/function names).
No, but many things are referred to simply as numbers, and you can't
tell what those numbers mean unless you see how they're being used.
For example, suppose that I have the following code:
[ Foo direction;
    print "That way is ", (name) location.direction;
];

[ Bar obj;
    print "That is ", (the) obj;
];

[ Baz verb;
    if (verb == ##Jump)
        "You jump.";
    "You don't jump.";
];

[ Quux;
    Foo(n_to);
    Bar(platypus);
    Baz(##Climb);
];
Now, in Quux, you're passing a number to Foo, a number to Bar, and a
number to Baz. The compiler is kind enough to let you use n_to (a
property name), platypus (an object identifier), and ##Climb (a verb
identifier) to refer to their numbers, but in the compiled code,
they're just numbers-- and they overlap. (In fact, the standard Inform
library relies on the fact that n_to and n_obj have the same number--
something I found out, much to my dismay, when I violated this just
before putting my program in beta.) A disassembler examining Quux
wouldn't know which is which, that is, it wouldn't know if the number
for n_to was meant to represent a property, object, verb, etc.
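The overlap is easy to see with made-up numbers: a property, an object,
and a verb can all compile to the same small integer, so the three call
sites look identical in a disassembly (everything here is invented, not
real Inform output):

```python
# Illustration with invented numbers: three different namespaces all
# encode to the same bare integer in the compiled code.

properties = {"n_to": 7}        # a property number
objects    = {"platypus": 7}    # an object number
verbs      = {"Climb": 7}       # a verb/action number

disassembly = []
for routine, table in [("Foo", properties), ("Bar", objects), ("Baz", verbs)]:
    (value,) = table.values()
    disassembly.append("call %s %d" % (routine, value))

for line in disassembly:
    print(line)   # every call site is just "call <routine> 7"
```

Nothing in the three `call` instructions records which table the 7 came
from; only analysis of Foo, Bar, and Baz could recover that.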
"But can't it look at Foo, Bar, and Baz?" Well, kind of. That would
take a considerable amount of analysis. (It's not entirely impossible;
Spidey, a Scheme analyzer, does some such analysis. But I suspect it's
impossible to always perform such analysis). Also, the routines may
not clearly define what type of object they're expecting. In the
program I'm working on now, some subroutines can be passed dictionary
words or objects, for example.
There are decompilers for some languages, but they're also unreliable
for similar reasons. It's more of a problem for Inform, since it
relies on lots of types. Decompilers are most common for languages
like C, which is almost a macro assembler anyway.
Cheers,
Piquan
I can't argue with that. :-) I once played with a Lisp Machine that my
company was thinking about buying.
> If you're talking about ZIL, then I imagine they designed it ugly
> because they were MIT hackers. And even though they were effectively
> inventing a form of compiled BASIC, they were at least going to
> *pretend* it was LISP-like. Dammit. Regardless, it doesn't actually
> have any of the properties ("powerful, suited to natural languages")
> that you were talking about.
Well, the dictionary does implement LISP-like atoms. Other than that,
yeah, it is mostly a fancy BASIC.
>>A really
>>interesting (though very short) discussion of ZIL (Zork Implementation
>>Language) can be found here:
>>
>> http://lambda-the-ultimate.org/node/view/429
>
>
> I'm not convinced by everything posted there. In particular, calling
> ZIL "MDL with a lot of its genuine arcana removed" seems to be
> way off-base. ZIL had no facility to allocate or free memory; it had
> no cons, car, or cdr facilities. I'm sticking with "compiled BASIC
> with a lingering addiction to parentheses".
I suppose that it depends on the definition of "genuine arcana"; lots of
people would put all of LISP in that category. ZORK was probably
written while peeking at the FORTRAN source for Adventure and so got
written in a FORTRANish pidgin. It didn't need to allocate or free
memory; it required no cons, car, or cdr.
MDL, like a lot of LISPs, could be compiled (mostly to counter the
perception at the time that LISP was slower than C), and given one
version of LISP, it was pretty easy to derive new versions. I have no
proof, but I imagine that they took an existing version of MDL and
ported it to a new "hardware" architecture, albeit one that was
incredibly limited. Those portions of MDL that implemented "arcana"
didn't get the procedures written for compilation to the new hardware
and so got dropped. And once a working library is written, it tends to
freeze things, so the later games got written in the same pidgin.
I do agree with your final assessment. Once the axe was put away, what
was left did resemble BASIC more than it did LISP. I'll bet, though,
that the compiler did make lots of use of cons, car and cdr.
Quite so. Java is very strongly typed, and (to support the
reflection APIs) the class files keep a ton of data that would
normally be of interest only to a debugger. As a result, Java
decompilers are a dime a dozen, because you can get by with a purely
static analysis. The Z machine is the exact opposite of Java, looking
more like a Motorola 6502. The architecture even seems to support
self-modifying code, although no one seems to have ever tried writing any.
Another problem is that there are multiple compilers. Turning a Java
class file back into Java is aided by the fact that there's really only
been a couple of versions of Java, and those have only very minor
differences. The major versions of Inform are quite different from each
other, there is a C compiler (AFAIK only used for "Silicon Castles"),
and then there's ZIL, of which little is known but was used to create
"the canon". Turing a storyfile compiled using Inform 4 into Inform 5
source would be very difficult; turning something that started out as C
or ZIL would, I suspect, be almost impossible.
I'm surprised to hear that. Where does that happen?
The Ztools date back to 1992, and I believe they were the _second_
generation of Z-machine story file tools.
However, there was a later push to decode V6 and fill in
some of the gaps.
Certainly Z-code can be disassembled. But at present, only to a
low-level assembler. Nobody has made a totally successful decompiler,
though there have been partially successful attempts.
In theory, most Z-code shouldn't be all that hard to decompile. It
comes down to a matter of bookkeeping; some manual help would be
needed when information was totally lost and to fill in variable and
function names, but not as much as you might think. In practice, no
one has done it. It doesn't help that TXD isn't really suitable as
the starting point for a decompiling effort, so you'd have to re-write
the disassembler part before you got to the fun stuff.
Most recent byte-code to native work has been done on type-aware VMs
like Java. This simplifies the task enormously.
On the flip side, the Z-machine's paucity of types simplifies that
side quite a bit.
>IIRC the above-mentioned amount to formalizing these kinds of heuristics and
>verifying that the code really does treat the value that way.
Right. This is what I call "bookkeeping". If you can figure out that
parameter two of routine R_A562 must be a string (because it's used in a
print_paddr), then you can figure out that anything passed in as the
actual second parameter to R_A562 is a string also. You can work
things like that forward and backwards, and come up with the types of
a lot of things. With Infocom (but not Inform) code, you can also
assume that global variables have the same type throughout the code,
and that all properties of a given number have the same type.
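That propagation amounts to a small worklist pass. The routine names,
call sites, and seed fact below are all invented; a fuller version would
also propagate backwards from uses to definitions, as the post says:

```python
# Sketch of the "bookkeeping" idea: push a known parameter type out to
# every constant passed at that parameter position. All data is invented.

from collections import defaultdict, deque

# Seed fact: parameter 2 of R_A562 feeds a print_paddr, so it's a string.
param_types = defaultdict(set)
param_types[("R_A562", 2)].add("string")

# Call sites as (caller, callee, arg position, constant passed).
call_sites = [
    ("R_9000", "R_A562", 2, 0x4A8E),
    ("R_9000", "R_A562", 2, 0x4B10),
]

constant_types = defaultdict(set)
work = deque(param_types.keys())
while work:
    key = work.popleft()
    for caller, callee, pos, const in call_sites:
        if (callee, pos) == key:
            # Anything passed as that parameter inherits its type.
            constant_types[const] |= param_types[key]

for const, types in sorted(constant_types.items()):
    print(hex(const), sorted(types))
```

With the Infocom-only assumptions (globals and properties each have one
type throughout), the same facts could be merged per-global and
per-property as well.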
>The interesting trick is if some clever compiler treats them as property
>identifiers in one place and simple integers in another. I have vague
>memories of some peephole optimizers generating such multi-use situations.
Fortunately, the ZIL compiler does not do this.
>Right. This is what I call "bookkeeping". If you can figure out that
>parameter two of routine R_A562 must be a string (because it's used in a
>print_paddr), then you can figure out that anything passed in as the
>actual second parameter to R_A562 is a string also. You can work
>things like that forward and backwards, and come up with the types of
>a lot of things. With Infocom (but not Inform) code, you can also
>assume that global variables have the same type throughout the code,
>and that all properties of a given number have the same type.
But presumably, because of Inform/Z-machine's weak typing, there are many
situations where this could go wrong? Things like PrintOrRun() or others
where different types are passed in at different points in a game?
Regards,
Graham Holden (g-holden AT dircon DOT co DOT uk)
--
There are 10 types of people in the world;
those that understand binary and those that don't.
No, I don't believe it does. It's unfortunately easy for the *user* to
write code that relies on this assumption (writing "rm.n_obj" when you
meant "rm.n_to"), but the library doesn't contain that assumption out
of the box.
PrintOrRun is an Inform thing; I don't think Infocom used anything
like it (I could be mistaken, it's been a while since I disassembled
the story files). However, even if you allow that case you've got it
down to a packed address; knowing it's either a routine or a string is
probably enough for most purposes, particularly since any constant
with such a type can be disambiguated in Inform-generated code.
Undoubtedly, though, you wouldn't be able to determine every type, and
you'd have to call for user help.
> The major versions of Inform are quite different from each
> other, there is a C compiler (AFAIK only used for "Silicon Castles"),
> and then there's ZIL, of which little is known but was used to create
> "the canon". Turing a storyfile compiled using Inform 4 into Inform 5
^^^^^^
_Nice_ typo.
Richard
There are also properties which can hold either an object number,
or a string or a routine, like n_to. There can't be any clues
in the code as to what kind of values they may contain.
/Fredrik
>> (In fact, the standard Inform library relies on the fact that n_to and
>> n_obj have the same number--
> No, I don't believe it does. It's unfortunately easy for the *user* to
> write code that relies on this assumption (writing "rm.n_obj" when you
> meant "rm.n_to"), but the library doesn't contain that assumption out
> of the box.
You are, of course, correct. I had thought (for some silly reason)
that the DM4 specified door_to should hold n_obj, not n_to, and had
written my code that way. Later I realized that GoSub would require
them to have the same number for ENTER DOOR to work.
Upon checking my assumptions, I discover that I was quite mistaken, and
apologize to the standard library authors.
Piquan
Only in my imagination, as it turns out; see my reply to Zarf.
Piquan
Since no one has mentioned it yet, let me plug my Z-machine decompiler, Reform:
http://www.darkweb.com/~benrg/if-decompilers/
It relies on the user to specify the types of local variables, arrays, and
whatnot, and does some simple local type inference to figure out the types
of local constants. E.g. if the expression "local2-->0 == $1234" appears in
the code, and the user declared local2 to be a pointer to a word array of
string addresses, then $1234 will be tagged as a string address. It works
well in practice, but at the expense of a lot of manual effort to identify
the types of variables.
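In spirit (this is not Reform's actual code), that local inference might
look like the following, with the user declaration and the tagging rule
both invented for illustration:

```python
# Toy illustration of Reform-style local inference: if local2 is
# declared (by the user) as a pointer to a word array of string
# addresses, then a comparison "local2-->0 == $1234" lets us tag
# $1234 as a string address too.

user_decls = {"local2": ("array_of", "string-addr")}  # user-supplied
tags = {}

def infer_compare(lhs_var, indexed, rhs_const):
    decl = user_decls.get(lhs_var)
    if decl and decl[0] == "array_of" and indexed:
        # Comparing an array element against a constant: the constant
        # inherits the element type from the declaration.
        tags[rhs_const] = decl[1]

infer_compare("local2", indexed=True, rhs_const=0x1234)
print(tags)  # {4660: 'string-addr'}
```

The manual effort the post mentions is exactly the `user_decls` table:
the inference is cheap once a human has supplied the declarations.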
I was somewhat aware of research in decompilation when I started. I read
Cifuentes's thesis, and even implemented some of the algorithms, but none of
that made it into the released version; I settled for simpler ad-hoc
techniques that worked well enough. I'd love to return to it, but I don't
know if I'll ever have the time.
One problem I had is that high-level analysis conflicts with accurate
decompilation to some extent. Zilch and Inform are very simple compilers,
and simple decompilation preserves more of the original program structure
than fancy graph analysis would. If you just pattern-match on code fragments
produced by the Inform compiler, you'll do fine. If you try to get clever,
you can easily end up with a flowgraph that can't be expressed directly in
Inform.
>>You can use some reasonable heuristics. For example, if local variable
>>1 is immediately printed, it's a very good bet that $4A8E is a string
>>address.
The problem is that it's a very good bet, rather than a certainty. I
actually implemented global type inference for Reform, but never got it to
work properly. It always inferred that (nearly) everything had every type. I
never tracked down exactly why. Undoubtedly one reason is that I didn't do
flow analysis on local variables; if one got reused to hold a different
type, equivalence classes would be merged incorrectly. Global variables are
nastier. Inform likes to use g254 and g255 to hold temporaries. Memory
locations and properties are nastier still.
I definitely believe that it could be made to work, at least much better
than my primitive attempt. If anyone wants to pick up the torch, I'll help
you out as best I can. :-)
> The interesting trick is if some clever compiler treats them as property
> identifiers in one place and simple integers in another. I have vague
> memories of some peephole optimizers generating such multi-use situations.
Both Inform and Zilch do basically no optimization, so I don't think this
situation arises.
-- Ben
--
+--------------------------------------------------------------------+
| Charles and Francis Richmond The way feminists see oppression |
| everywhere, men see urinals. |
| richmond at plano dot net It's a design feature. -- Fred Reed |
+--------------------------------------------------------------------+
You know, I was going to say "the Moto 6800", then decided that it was
more like the 6502. I think I've programmed in too many different
assembly languages; they're all starting to blur together.
One approach that might help would be to build an interpreter into the
decompiler (or build a decompiler into an interpreter...) and allow the
game to be played through. If a suitable "walkthrough" is available, this
wouldn't need too much user-input, but even if it had to be played
manually, it probably wouldn't feel like the same level of user input as
trying to guess the meaning of lumps of code (and certainly wouldn't need
the same "hacker-level" user to do it).
The idea being that by noting how addresses, references, variables etc. are
_actually_ used (by seeing what the interpreter does in a real game
session), instead of how they _might_ be used (through code analysis), you
might get a better picture of what was originally intended.
Quite how you'd go about this I don't know yet... it's just an "off the top
of my head" idea at the moment.
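As a sketch (all the scaffolding below is invented), the instrumented
interpreter would simply record what each value is actually consumed
as, each time an instruction uses it:

```python
# Sketch of the play-through idea: wrap the interpreter's dispatch so
# that every consuming instruction records how a value was really used.

observed = {}   # value -> set of observed uses

def note(value, use):
    observed.setdefault(value, set()).add(use)

# Hooks inside the hypothetical interpreter's dispatch loop:
def exec_print_paddr(value):
    note(value, "string-addr")
    # ... then actually print the string at that packed address ...

def exec_call(value):
    note(value, "routine-addr")
    # ... then actually call the routine at that packed address ...

# A short play session exercises the hooks:
exec_print_paddr(0x4A8E)
exec_call(0x2F00)
exec_print_paddr(0x4A8E)
print(observed)
```

Coverage is the catch: a value is only tagged if the walkthrough
actually reaches the instruction that uses it.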
The only problem is that the old games were frequently buggy. Take a
look at http://www.google.com/search?q=infocom-bugs, but be warned that
there are a lot of spoilers on those lists. Inform has a lot of
safety-checks built in to prevent a large number of those problems. The
old games are respected today more for their writing and their ideas,
not their implementation.
There are some neat objects in some of the games, but I think that
every one of them has been re-created in Inform. Check out "Toyshop"
for details. Also, there is a re-implementation of "Deadline" (or is it
"Suspect") that was written entirely in Inform. However, I don't know
if the author worked from a disassembly of the original, or created
everything from scratch.
> I will definitely check out Toyshop. What would you and other users say
> are the top 3 games I can get the source code to, to study and learn
> good design? Or will the Inform manual be sufficient?
>
I learned from the DM4, and also picked up a few tips as to what to use
and when, after reading the Inform Beginners Guide (which has been
revamped in the last few months).
To email me, visit the site.
That would be a problem, but only for Inform code. It's Infocom code
I'm actually more interested in. The problem is similar to the problem of
unions in other languages, so there's probably been some work on it.
Now that's an interesting way of getting around a number of problems
-- particularly the function-pointer problem and the untyped-constant
problem. You could use this to collect data which would be fed into a
more traditional decompiler.
> If you're talking about ZIL, then I imagine they designed it ugly
> because they were MIT hackers. And even though they were effectively
> inventing a form of compiled BASIC, they were at least going to
> *pretend* it was LISP-like.
Or maybe they just didn't want to rewrite the existing Zork source
which was in MDL, and therefore retained the syntax even though the
actual language turned out to be very different.
--
Esa Peuha
student of mathematics at the University of Helsinki
http://www.helsinki.fi/~peuha/
> Alas if I only had a slightly bigger brain. I would think the value in
> decompiling z-code would mainly be to see how the masters wrote the
> classic Infocom games - more for learning and game design purposes. I'm
> sure a lot of that can be done from newer Inform games whose source is
> freely available, but I still would love to start by learning from the
> original Zork.
You might want to check out Dungeon; after all, it _is_ the "original
Zork".
I doubt very much that functional MDL code works as functional ZIL
code. Why would the original MDL have avoided list operations?
But then again, I'm the sort of person who'd probably do that for the
fun of it, and I've known quite a few programmers who'd feel the same.