Keep in mind that parrot may be in the position where it has to
ignore or mistrust the metadata, so be really cautious with things
you propose as required.
--
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk
Dear Dan,
I would like to see a powerful meta-data system made possible,
even if it is not implemented immediately. Semantic web researchers
like David Beckett and Tim Berners-Lee have been working on powerful
systems to support meta-data in general; maybe, as the parrot meta-data
is just getting started, we can borrow a bit of that work?
Take a look at the list at Diffuse MetaData Interchange [4] at the
bottom of this mail; you will see an overview of metadata systems.
Even if they are not specific to parrot, the goals are similar in many
cases.
Recently I have been making progress with RDF [1], specifically with
the Redland application framework [2]. With the simple concept of
triples of data, a triple being (subject, predicate, object), we are
able to capture the metadata of the gcc compiler, and I hope other
compilers and systems.
Redland is written in clean C, and supports meta-data storage in
memory and on disk in multiple formats: rdf/xml, rdf/ntriples, even
BerkeleyDB. It would be possible to create a new storage model to
store a packfile as well.
The subjects are the items in the program, the nodes, each getting a
number inside the system. The predicates are important; they represent
the meat of the system. The objects are either literal data or other
subjects.
Via the Redland API, you can add new statements about things, and
find all the statements about a subject, about an object, or all that
match a predicate.
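To make that concrete, here is a minimal sketch of the triple API as I
understand it (the URIs, node numbering, and storage configuration are
invented for illustration; consult the Redland docs for the details):

/* sketch: record and query introspector triples with Redland.
 * URIs and names here are invented for illustration. */
#include <stdio.h>
#include <redland.h>

int main(void)
{
    librdf_world *world = librdf_new_world();
    librdf_world_open(world);

    /* in-memory store; file- and BerkeleyDB-backed stores also exist */
    librdf_storage *storage =
        librdf_new_storage(world, "memory", NULL, NULL);
    librdf_model *model = librdf_new_model(world, storage, NULL);

    /* one (subject, predicate, object) triple:
       "node 42 has line number 17" */
    librdf_model_add(model,
        librdf_new_node_from_uri_string(world,
            (const unsigned char *)"http://example.org/node/42"),
        librdf_new_node_from_uri_string(world,
            (const unsigned char *)"http://example.org/line-number"),
        librdf_new_node_from_literal(world,
            (const unsigned char *)"17", NULL, 0));

    /* find all statements about node 42; NULL fields are wildcards */
    librdf_statement *query = librdf_new_statement_from_nodes(world,
        librdf_new_node_from_uri_string(world,
            (const unsigned char *)"http://example.org/node/42"),
        NULL, NULL);
    librdf_stream *results = librdf_model_find_statements(model, query);
    while (!librdf_stream_end(results)) {
        librdf_statement_print(librdf_stream_get_object(results), stdout);
        librdf_stream_next(results);
    }

    librdf_free_stream(results);
    librdf_free_statement(query);
    librdf_free_model(model);
    librdf_free_storage(storage);
    librdf_free_world(world);
    return 0;
}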
I tell you this because maybe you want to provide this sort of
flexible meta-data API in parrot. For example, here are some predicates
that we extract that you might find interesting:
* Filename of the node
* Line number of the node (the column number is not supported yet)
* Internal type of the node (variable declaration, type, integer
constant, etc.), as opposed to the data type of the node
* Name of the node (the identifier)
* Type of the node (if it is a variable or constant); this is a
pointer to another node
* Unsigned variant of a type: if a type supports being unsigned,
here it is
* Comments are supported but not used yet; capturing them would be a
good idea
Now we get into more specific types of predicates:
* Parameters of an expression
* Variables in a block
* Size of a variable
* Alignment of a variable
* Constant flag
* Volatile flag
Then we have:
* Fields of a struct
* Parameters of a function
* Return type of a function
* Body block of a function
So, with this idea of meta-data, by adding more predicates you can
support the capturing and storage of all the source code in an
abstract form, or just the basic function data.
You will probably think that this is overkill for parrot, but I think
that it would give you an extensible system to add in new forms of
meta-data as languages are added. Via OWL [3] the users will be able to
define the meaning and the classes of metadata as well.
mike
[1] RDF http://www.w3.org/RDF/
[2] Redland http://www.redland.opensource.ac.uk/
[3] OWL http://www.w3.org/TR/owl-absyn/
[4] Diffuse MetaData Interchange standards
http://www.diffuse.org/meta.html
=====
James Michael DuPont
http://introspector.sourceforge.net/
I do think that, whatever "native" (i.e. understood by Parrot) metadata
we support, we *must* allow for extensibility, both for future native
metadata and for third-party tools. Moreover, this must not be
implemented with a special type of metadata block, or by using
sequentially-increasing numbers. (The first means that any metadata we
decide to add in the future will be slower than the metadata we add now;
the second has problems with several third-party tools picking the same
number.)
--Brent Dax <bren...@cpan.org>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)
>How do you "test" this 'God' to "prove" it is who it says it is?
"If you're God, you know exactly what it would take to convince me. Do
that."
--Marc Fleury on alt.atheism
> Since it looks like it's time to extend the packfile format and the
> in-memory bytecode layout, this would be the time to start discussing
> metadata. What sorts of metadata do people think are useful to have
> in either the packfile (on disk) or in the bytecode (in memory).
Comments, if a disassembler is to be able to reconstruct the original source
sufficiently well[1].
-- c
1) for the various values of "well" that include "semantic equivalence"
After quite a long time away from the keyboard, and after fighting
through a huge backlog of mail, I'm (hopefully) back again.
Dan Sugalski <d...@sidhe.org> writes:
> Since it looks like it's time to extend the packfile format and the
> in-memory bytecode layout, this would be the time to start discussing
> metadata. What sorts of metadata do people think are useful to have in
> either the packfile (on disk) or in the bytecode (in memory).
My current idea for the in-memory format of the bytecode is this:
One bytecode segment is a PMC consisting of three parts: the actual
bytecode (a flat array of opcode_t), the associated constants which
don't fit into an opcode_t (floats and strings), and a scratch area
for the JITed code. All other metadata will be attached as
properties (or maybe as elements of an aggregate). This will be an
easy way to allow future extension. The invoke call to this PMC would
simply start the bytecode from the first instruction.
To support inter-segment jumps, a kind of symbol table is also
necessary. All externally reachable codepoints need some special
markup; this could be a special opcode extlabel_sc or an entry in a
symbol table. Also needed is a fixup of the outgoing calls, either via
modification of the bytecode or via a jump table. Both have their pros
and cons: the bytecode modification prohibits a read-only mmap of the
data on disk, and the fixup needs to be done at load time, but once
this is done the impact on runtime speed is minimal, whereas the jump
table is one extra indirection. But as stated somewhere else, the
typical inter-segment jump will be call/tailcall/callmethod/invoke,
which are at least two indirections anyway.
The on-disk version is a matter of serializing and deserializing this
PMC.
> Keep in mind that parrot may be in the position where it has to ignore
> or mistrust the metadata, so be really cautious with things you
> propose as required.
Ok, to summarize:
ByteCodeSegment = {
    bytecode  => required;
    constants => only necessary if string or num constants;
    fixup     => (or jumptable) only necessary if outgoing jumps;
    symbols   => all possible incoming branch points, optional;
    JIT       => will be filled when bytecode is invoked;
    source    => surely optional;
    debuginfo => also optional;
    ...
}
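In rough C terms, purely as illustration (the field types are
placeholders, since the real thing would hang off a PMC as described
above):

#include <stddef.h>
#include <stdint.h>

typedef intptr_t opcode_t;   /* stand-in for parrot's definition */

/* hypothetical C rendering of the summary above */
typedef struct ByteCodeSegment {
    opcode_t *bytecode;      /* required: flat array of ops */
    size_t    n_ops;
    void     *constants;     /* only necessary if string/num constants */
    void     *fixup;         /* or jumptable; only if outgoing jumps */
    void     *symbols;       /* incoming branch points, optional */
    void     *jit;           /* filled when the bytecode is invoked */
    char     *source;        /* surely optional */
    void     *debuginfo;     /* also optional */
} ByteCodeSegment;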
bye
boe.
--
Juergen Boemmels boem...@physik.uni-kl.de
Fachbereich Physik Tel: ++49-(0)631-205-2817
Universitaet Kaiserslautern Fax: ++49-(0)631-205-3906
PGP Key fingerprint = 9F 56 54 3D 45 C1 32 6F 23 F6 C7 2F 85 93 DD 47
I would strongly urge that any file-based byte-code format be arranged
in such a way that it (or most of it) can simply be mmap-ed in (RO),
analogously to executables.
This means that a Perl server that relies on a lot of modules, and which
forks for each connection (imagine a Perl-based web server), doesn't
consume acres of swap space just to have an in-memory image per Perl
process, of all the modules.
This is a real problem that's hitting me hard with Perl 5 in my day job.
Dave.
--
Any [programming] language that doesn't occasionally surprise the
novice will pay for it by continually surprising the expert.
- Larry Wall
Yes!
Deparsing, that would be great.
mike
Sounds good.
Could that be seen as similar to shared-memory communication with the
compiler, via mem-mapped file interfaces?
mike
> This is a real problem that's hitting me hard with Perl 5 in my day
> job.
>
> Dave.
>
> --
> Any [programming] language that doesn't occasionally surprise the
> novice will pay for it by continually surprising the expert.
> - Larry Wall
Why yes, yes I do. On the other hand, when we hand people bazookas to
deal with their fly problems, we often find they start in on the
elephant problems as well.
The proposal in general interests me--it looks like a general
annotation system we can attach to the bytecode. (I admit, I haven't
read the page you pointed at) I will admit, though, that I was
thinking more about metadata that the engine could use itself, or
would provide to programs running on it, but the scheme you've
outlined may be useful for that.
'Swhat I get for asking a too-general question. :)
"Must" is an awfully strong word, there. We don't really "must" do
anything, though I do realize the feature is useful, hence my
question.
> Moreover, this must not be
>implemented with a special type of metadata block, or by using
>sequentially-increasing numbers. (The first means that any metadata we
>decide to add in the future will be slower than the metadata we add now;
>the second has problems with several third-party tools picking the same
>number.)
I'm afraid extensible metadata is going to live in its own chunk
unless someone can come up with a way to embed it without penalty.
(And I'm generally considering using separate chunks for the metadata
the engine does understand)
> On Thu, Jan 23, 2003 at 09:21:45PM +0100, Juergen Boemmels wrote:
> > My current idea for the in memory format of the bytecode is this:
>
> I would strongly urge that any file-based byte-code format be arranged
> in such a way that it (or most of it) can simply be mmap-ed in (RO),
> analogously to executables.
>
> This means that a Perl server that relies on a lot of modules, and which
> forks for each connection (imagine a Perl-based web server), doesn't
> consume acres of swap space just to have an in-memory image per Perl
> process, of all the modules.
This might be possible if the byte order, word size, default encoding,
etc. are the same in the file on disk and on the host.
bye
boe
Which will generally be the case, I expect. Tell a sysadmin that they
can reduce the memory footprint of mod_parrot by 50% by running a
utility (that we provide in the parrot kit) over the library and I
expect you'll see smoke from the keyboard as he/she whips off the
command at supersonic speeds... :)
> >This might be possible if the byteorder, wordsize, defaultencoding
> >etc. are the same in the file on disk and the host.
>
> Which will generally be the case, I expect. Tell a sysadmin that they
> can reduce the memory footprint of mod_parrot by 50% by running a
> utility (that we provide in the parrot kit) over the library and I
> expect you'll see smoke from the keyboard as he/she whips off the
> command at supersonic speeds... :)
It might even be possible to dump the jitted code. This would speed up
startup. Then strip the bytecode to reduce the size of the file
and TADA: Yet another new binary format.
I'm really not sure if I'm serious here.
This is the way the bytecode currently works, and we will *not*
switch to any bytecode format that doesn't at least allow the
executable code to be mmapped in.
A strong word for a strong opinion. :^) Besides, I did qualify it with
an "I do think", which is another way to say IMO.
# > Moreover, this must not be implemented with a special type of
# > metadata block, or by using sequentially-increasing numbers. (The
# > first means that any metadata we decide to add in the future will be
# > slower than the metadata we add now; the second has problems with
# > several third-party tools picking the same number.)
#
# I'm afraid extensible metadata is going to live in its own chunk
# unless someone can come up with a way to embed it without penalty.
# (And I'm generally considering using separate chunks for the metadata
# the engine does understand)
Are you expecting to have chunk type determined by order? If so, what
will you do if a future restructuring means you either don't need chunk
type X or you need a new, highly incompatible version? Will you leave
in an "empty" ghost chunk?
I would suggest (roughly) the following format for a chunk:
TYPE: One 32-bit number
VERSION: One 32-bit number; suggested usage is as four eight-bit
components
SIZE: One 32-bit number of bytes (or maybe 64-bit)
DATA: arbitrary length
For C-heads, think of it like this:
struct Chunk {
opcode_t type;
opcode_t version;
opcode_t size;
void data[];
};
Type IDs less than 256 would be reserved to Parrot (so we have plenty of
room for future expansion); all third-party tools would use some sort of
cryptographic checksum of the tool's name and the data structure's name,
making sure (of course) that their type ID was greater than 255.
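Purely by way of illustration, an ID could be derived like this (a toy
hash; a real scheme would use a proper cryptographic digest such as
MD5, and the function name here is invented):

#include <stdint.h>

/* toy example: derive a third-party chunk type ID from the tool and
 * structure name, forcing it out of the 0..255 range reserved for
 * Parrot.  A real scheme would use a cryptographic digest. */
static uint32_t chunk_type_id(const char *name)
{
    uint32_t h = 5381;                  /* djb2 string hash */
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h | 0x100;                   /* guarantees an ID > 255 */
}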
If there's a directory of some sort, it should record the type ID and
the offset to the beginning of the chunk. This should allow for a
fairly quick lookup by type. If you think that there might be a demand
for multiple instances of the same type of metadata, you may want to add
a chunk ID of some sort.
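A directory lookup over such records could be as simple as this sketch
(all names are hypothetical):

#include <stddef.h>
#include <stdint.h>

typedef int32_t opcode_t;     /* stand-in for the real definition */

struct DirEntry {             /* hypothetical directory record */
    opcode_t type;            /* chunk type ID */
    opcode_t offset;          /* offset of the chunk in the file */
};

/* return the offset of the first chunk of the given type, or -1 */
static opcode_t find_chunk(const struct DirEntry *dir, size_t n,
                           opcode_t type)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (dir[i].type == type)
            return dir[i].offset;
    return -1;
}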
Noted. I can see problems with multiline comments across multiline
code, but that's probably rare enough to not really care much about.
I LIKE IT.
Bytecodes have a type? Each bytecode has meta-data?
Here are the metadata I have collected from the parrot source code so
far. It should be a set of predicates to define all the other meta-data
needed.
First, this is the core meta-data for storing perl code, in order of
simplicity:

identifier_node
    Name of things
boolean_type, integer_type, real_type
    Types of things that are simple
    (all *_decls have a type that is a type_*;
     all *_decls have a name that is a type_decl or identifier_node)
const_decl
    Constant values
var_decl
    Variable values

The rest of the more complex types need a tree_list:

tree_list
function_decl
    parm_decl           # list of
array_type
    integer_cst         # list of
enumeral_type
    integer_cst         # list of
record_type, union_type
    field_decl          # list of
void_type               # a void is very special

The following are derived types:

pointer_type, reference_type
function_type           # function types allow for linkage
    type_*              # we have a list of
type_decl               # here the user defines its own
complex_type            # this is a commonly defined user type
Heh. I try to avoid absolute statements. This is all
engineering, and engineering is applied economics--you juggle
features and make compromises to get the thing that meets your needs
as best as possible at a cost you can manage. Allowing extensibility
is Really Keen, but it has an associated cost that has to be balanced
against everything else.
Having said that, I think we can do this, but I want a better feel
for what we need, what we want, and what it'll cost before we make a
decision.
>Are you expecting to have chunk type determined by order?
Yes and no. Yes in that I want the first few chunks, the ones that
are required, to be at fixed offsets. Following that will be a
directory, and from there we can index off to wherever we need to.
Cool!
That means we can use opcodes to store the introspector data!
We need to have the meta-data paired with the opcodes.
Basically this means storing the source code in some AST form in the
meta-data, for full reflection and introspection at the expression
level.
mike
> Since it looks like it's time to extend the packfile format and the
> in-memory bytecode layout, this would be the time to start discussing
> metadata. What sorts of metadata do people think are useful to have in
> either the packfile (on disk) or in the bytecode (in memory).
I'm currently simplifying the whole packfile routines. They still
read the old format, but the compat code is now centralized in one place.
The main change is now this structure:
struct PackFile_funcs {
PackFile_Segment_new_func_t new_seg;
PackFile_Segment_destroy_func_t destroy;
PackFile_Segment_packed_size_func_t packed_size;
PackFile_Segment_pack_func_t pack;
PackFile_Segment_unpack_func_t unpack;
PackFile_Segment_dump_func_t dump;
};
All registered types define these functions to make pack/unpack/dump
work for their type.
Registered types are consecutively numbered, unknown types still get
unpacked or dumped:
typedef enum {
PF_DIR_SEG,
PF_UNKNOWN_SEG,
PF_FIXUP_SEG,
PF_CONST_SEG,
PF_BYTEC_SEG,
PF_DEBUG_SEG,
PF_MAX_SEG
} pack_file_flags;
All packfile sizes/offsets are in opcode_t, not bytes, for simplicity -
though this might need a conversion (but we don't seem to handle
wordsize transforms now anyway).
leo
> On Thu, Jan 23, 2003 at 09:21:45PM +0100, Juergen Boemmels wrote:
>
>>My current idea for the in memory format of the bytecode is this:
>>
>
> I would strongly urge that any file-based byte-code format be arranged
> in such a way that it (or most of it) can simply be mmap-ed in (RO),
> analogously to executables.
How many mmap's can $arch have for one program, and in total?
Could we hit some limits here if every module loaded gets (and stays)
mmap()ed?
> Dave.
leo
We certainly could, which I suppose would argue for building in
sufficient smarts to the bytecode loader to switch to file reading if
an mmap fails. It'll be slower, but working is generally a good thing.
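A sketch of that fallback (error handling trimmed; the loader name and
interface are invented for this sketch):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* map a bytecode file read-only, falling back to a plain read if
 * mmap() fails */
static void *load_bytecode(const char *path, size_t *len)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0) {
        if (fd >= 0) close(fd);
        return NULL;
    }
    *len = (size_t)st.st_size;

    p = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        /* slower, but working is generally a good thing */
        p = malloc(*len);
        if (p && read(fd, p, *len) != (ssize_t)*len) {
            free(p);
            p = NULL;
        }
    }
    close(fd);
    return p;
}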
> Dan Sugalski <d...@sidhe.org> writes:
> It might even be possible to dump the jitted code. This would speed
> up startup. Then strip the bytecode to reduce the size of the file
> and TADA: Yet another new binary format.
When you then are able to get the same memory layout for a newly
created interpreter, it might even run ;-)
> I'm really not sure if I'm serious here
> boe
leo
> At 8:39 PM +0000 1/23/03, Dave Mitchell wrote:
>> in such a way that it (or most of it) can simply be mmap-ed in (RO),
>> analogously to executables.
>
>
> This is the way the bytecode currently works, and we will *not* switch
> to any bytecode format that doesn't at least allow the executable code
> to be mmapped in.
s/works/should work/
The file gets mmap()ed if possible, then the bytecode gets memcpy'd and
the map is munmap'd.
leo
> I'm currently simplifying the whole packfile routines. It still does
> read the old format, but the compat code is centralized now in one place.
> Registered types are consecutively numbered, unknown types still get
> unpacked or dumped:
>
> typedef enum {
> PF_DIR_SEG,
> PF_UNKNOWN_SEG,
> PF_FIXUP_SEG,
> PF_CONST_SEG,
> PF_BYTEC_SEG,
> PF_DEBUG_SEG,
>
> PF_MAX_SEG
> } pack_file_flags;
Here is a sample dump of a packfile with file/line info generated by
$ imcc -d -o eval.pbc eval.pasm
$ pdump eval.pbc
DIRECTORY => { # 3 segments
type 3 name CONSTANT offs 0x1c length 35
type 4 name BYTECODE offs 0x40 length 14
type 5 name BYTECODE_DB offs 0x4f length 17
}
CONST => [
### snipped (as old) ###
],
BYTECODE => [ # 14 ops at ofs 0x40
0041: 00000349 00000001 00000003 00000057 00000001 00000002 00000347
0048: 00000000 00000001 00000000 00000345 0000001a 00000001 00000000
]
BYTECODE_DB => [ # 17 ops at ofs 0x4f
0050: 6c617665 7361702e 0000006d 00000001 00000002 00000003 00000004 00000005
0058: 00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0060: 00000000
]
(the line array is currently too big (per opcode not per ins ;-))
Anyway, packing/unpacking and dumping above packfile data is working now.
Does anybody want to have a look at the patch?
Should I check in - or send it to the list?
$ diffstat packf.diff
TODO | 5
debug.c | 2
include/parrot/packfile.h | 129 +++---
languages/imcc/TestCompiler.pm | 6
languages/imcc/imclexer.c | 2
languages/imcc/main.c | 2
languages/imcc/pbc.c | 6
languages/imcc/t/harness | 15
languages/imcc/t/syn/eval.t | 61 ++
packdump.c | 4
packfile.c | 848 ++++++++++++++++++++++-------------------
packout.c | 156 +++----
pdump.c | 23 -
13 files changed, 715 insertions(+), 544 deletions(-)
leo
Linux is not the universe, though. And what it'll do depends on the
version. We have to worry about Windows and a half-zillion other
flavors of Unix, at the very least. IIRC, some versions of BSD
weren't too thrilled about a lot of mmaps.
>Note that in Perl5 we already (indirectly) rely on the OS's ability to
>mmap in the library code for any XS-based modules.
No, we use dlopen, which isn't the same thing at all. It can be, but
doesn't have to be.
I just wrote a quick C program that successfully mmap-ed in all 1639
files in my Linux box's /usr/share/man/man1 directory.
Note that in Perl5 we already (indirectly) rely on the OS's ability to
mmap in the library code for any XS-based modules.
--
"But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged." Lady Croom - Arcadia
> Are you expecting to have chunk type determined by order? If so, what
> will you do if a future restructuring means you either don't need chunk
> type X or you need a new, highly incompatible version? Will you leave
> in an "empty" ghost chunk?
>
> I would suggest (roughly) the following format for a chunk:
>
> TYPE: One 32-bit number
> VERSION: One 32-bit number; suggested usage is as four eight-bit
> components
> SIZE: One 32-bit number of bytes (or maybe 64-bit)
> DATA: arbitrary length
>
> For C-heads, think of it like this:
>
> struct Chunk {
> opcode_t type;
> opcode_t version;
> opcode_t size;
> void data[];
> };
I agree with the "roughly" bit, but I'd suggest ensuring that you put
in enough bits to get data[] 64 bit aligned. Mainly because at least 1
architecture exists that has no 32 bit types (Crays I know about; others
may exist. I can't remember if perl 5.8 passes 100% of tests on Crays.
We certainly tried)
> If there's a directory of some sort, it should record the type ID and
> the offset to the beginning of the chunk. This should allow for a
> fairly quick lookup by type. If you think that there might be a demand
> for multiple instances of the same type of metadata, you may want to add
> a chunk ID of some sort.
It might be useful for making "portable" fat bytecode.
On Thu, Jan 23, 2003 at 01:39:03PM -0500, Dan Sugalski wrote:
> At 10:29 PM -0800 1/22/03, James Michael DuPont wrote:
> >You will probably think that this is overkill for parrot,
>
> Why yes, yes I do. On the other hand, when we hand people bazookas to
> deal with their fly problems, we often find they start in on the
> elephant problems as well.
No wonder the rolls of sticky elephant paper never sold.
> The proposal in general interests me--it looks like a general
> annotation system we can attach to the bytecode. (I admit, I haven't
> read the page you pointed at) I will admit, though, that I was
> thinking more about metadata that the engine could use itself, or
> would provide to programs running on it, but the scheme you've
> outlined may be useful for that.
I'm thinking that register usage information from imcc could be of use
to the JIT, as that would save it having to work out things again. So that
probably needs a segment.
Also some way of storing a cryptographic signature in the file, so that you
could compile a parrot that automatically refuses to load code that isn't
signed by you.
On Thu, Jan 23, 2003 at 05:05:54PM -0500, Dan Sugalski wrote:
> Which will generally be the case, I expect. Tell a sysadmin that they
> can reduce the memory footprint of mod_parrot by 50% by running a
> utility (that we provide in the parrot kit) over the library and I
> expect you'll see smoke from the keyboard as he/she whips off the
> command at supersonic speeds... :)
Followed by writs for claims for supersonic RSI addressed to p6i
On Fri, Jan 24, 2003 at 07:59:13AM +0100, Leopold Toetsch wrote:
> Juergen Boemmels wrote:
>
> >Dan Sugalski <d...@sidhe.org> writes:
>
>
> >It might even be possible to dump the jitted code. This would speed
> >up startup. Then strip the bytecode to reduce the size of the file
> >and TADA: Yet another new binary format.
>
>
> When you then are able to get the same memory layout for a newly
> created interpreter, it might even run ;-)
So the JITted code contains lots of hard references to addresses in the
running interpreter? It's not just dependent on that particular binary's layout?
I guess in future once the normal JIT works, and we've got the pigs flying
nicely then it would be possible to write a Not Just In Time compiler that
saves out assembly code and relocation instructions.
Bah. That's "parrot -o foo.o foo.pmc" isn't it?
Nicholas Clark
> On Thu, Jan 23, 2003 at 02:48:38PM -0800, Brent Dax wrote:
>> struct Chunk {
>> opcode_t type;
>> opcode_t version;
>> opcode_t size;
>> void data[];
>> };
>>
>
> I agree with the "roughly" bit, but I'd suggest ensuring that you put
> in enough bits to get data[] 64 bit aligned.
>>If there's a directory of some sort, it should record the type ID and
>>the offset to the beginning of the chunk.
Putting this together, and inserting an "Id" field above, would give
alignment on a 64 bit boundary for data in PBC - assuming the strings,
data, ... are also N*64 bit wide.
> It might be useful for making "portable" fat bytecode.
As I stated, I changed all sizes/offsets to be opcode_t. Of course this
breaks reading 32 bit PBC on machines with 64 bit opcode_t - but this
was already broken before, e.g.:
header->magic = PackFile_fetch_op(self, cursor++);
If we want this portable, it probably should look like
header->magic = PackFile_fetch_op(self, &cursor);
where the _fetch_xx has to advance the cursor by the PBC defined wordsize.
A _fetch_cstring and a _fetch_n_opcodes would also be handy. And for the
latter, if the packfile is mmap()ed, it shouldn't fetch anything, but
just set up the code pointer, advance the cursor, and remember, that the
code_segment->code field should better not be freed at destroy time.
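A rough shape for such a fetch (assuming matching byte order and a
4- or 8-byte PBC wordsize; the names are invented):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int32_t opcode_t;    /* host opcode_t; stand-in definition */

/* fetch one op encoded with the PBC file's wordsize and advance the
 * cursor by that wordsize (no byteswapping in this sketch) */
static opcode_t pf_fetch_op(size_t pbc_wordsize,
                            const unsigned char **cursor)
{
    opcode_t op;

    if (pbc_wordsize == 4) {
        int32_t w;
        memcpy(&w, *cursor, 4);
        op = (opcode_t)w;
    } else {                 /* assume 8-byte words otherwise */
        int64_t w;
        memcpy(&w, *cursor, 8);
        op = (opcode_t)w;
    }
    *cursor += pbc_wordsize; /* the point: advance by the PBC wordsize */
    return op;
}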
> I'm thinking that register usage information from imcc could be of use
> to the JIT, as that would save it having to work out things again. So that
> probably needs a segment.
Yep. imcc does the whole CFG and life analysis, which JIT is doing
again. At least basic blocks and register usage could be passed. Though
register life range in JIT is different and depends on $arch: calling
(JIT-)external functions ends a register's life, so it must be saved
before calling and restored after.
> Also some way of storing a cryptographic signature in the file, so that you
> could compile a parrot that automatically refuses to load code that isn't
> signed by you.
The palladium parrot :)
>>Juergen Boemmels wrote:
>>>It might even be possible to dump the jitted code.
>>When you then are able to get the same memory layout for a newly
>>created interpreter, it might even run ;-)
> So the JITted code contains lots of hard references to address in running
> interpreter? It's not just dependent on that particular binary's layout?
JIT/i386 does call parrot functions directly, e.g. pmc_new_noinit or
string_make, so these would need relocation - or, probably slightly
slower but simpler to handle, a jump table. We (all JIT $arch?) have at
least one register pointing to parrot data. Including a jump table there
for the parrot functions used would do it.
> I guess in future once the normal JIT works, and we've got the pigs flying
> nicely then it would be possible to write a Not Just In Time compiler that
> saves out assembly code and relocation instructions.
>
> Bah. That's "parrot -o foo.o foo.pmc" isn't it?
*g*
> Nicholas Clark
leo
> >Also some way of storing a cryptographic signature in the file, so that you
> >could compile a parrot that automatically refuses to load code that isn't
> >signed by you.
>
>
> The palladium parrot :)
naa. I said "signed by you", not "signed by the RIAA^WMPAA^WMicrosoft"
Nicholas Clark
> At 5:32 PM +0000 1/24/03, Dave Mitchell wrote:
>
>> I just wrote a quick C program that successfully mmap-ed in all 1639
>> files in my Linux box's /usr/share/man/man1 directory.
>
>
> Linux is not the universe, though.
I have changed it to mmap() the bytecode (other segments, which have a
similar layout (i.e. size and opcode_t[size]), will be mmaped too).
If mmap'ing the packfile fails, a fallback to IO reading is there.
leo
Yes, of course. I would do this with a personalized version of
fingerprint.c and generate a separate executable.
> Nicholas Clark
leo
How true. On Solaris, for example, mmap's are aligned on 64k boundaries,
which leads to horrible virtual address space consumption when you map
lots of small things. If we're mmap()ing things, we want to be sure
they're fairly large.
/s
> This means that a Perl server that relies on a lot of modules, and which
> forks for each connection (imagine a Perl-based web server), doesn't
> consume acres of swap space just to have an in-memory image per Perl
> process, of all the modules.
Are you sure the swap space allocation isn't mostly attributable to the
poor locality in the Perl process's data structures?
--
Jason
Okay, I just ran a program on a Solaris machine that mmaps in each
of 571 man files 20 times (a total of 11420 mmaps). The process size
was 181Mb, but the total system swap available only decreased by 1.2Mb
(since files mmapped in RO effectively don't consume swap).
I think Solaris and Linux can both cut this. If other OSes can't, then
we fall back to reading in the file when necessary.
--
Lady Nancy Astor: If you were my husband, I would flavour your coffee
with poison.
Churchill: Madam - if I were your husband, I would drink it.
I was using swap space as a loose term to mean virtual memory consumption
- i.e. that resource which necessitates buying more RAM and/or swap disks.
The locality wasn't a problem.
--
A walk of a thousand miles begins with a single step...
then continues for another 1,999,999 or so.
There's always NetBSD if Linux won't run on your hardware :-)
<ducks>
> > How true. On Solaris, for example, mmap's are aligned on 64k boundaries,
> > which leads to horrible virtual address space consumption when you map
> > lots of small things. If we're mmap()ing things, we want to be sure
> > they're fairly large.
>
> Okay, I just ran a program on a Solaris machine that mmaps in each
> of 571 man files 20 times (a total of 11420 mmaps). The process size
> was 181Mb, but the total system swap available only decreased by 1.2Mb
> (since files mmapped in RO effectively don't consume swap)
11420 simultaneous mmaps in the same process? (just checking that I
understand you)
> I think Solaris and Linux can both cut this. If other OSes can't, then
> we fall back to reading in the file when necessary.
Maybe I'm paranoid (or even plain wrong) but we (parrot) can handle it
if an mmap fails - we just automatically fall back to plain file loading.
Can dlopen() cope if an mmap fails? Or on a platform which can only
do a limited number of mmaps do we run the danger of exhausting them early
with all our bytecode segments, and then the first time someone attempts
a require POSIX; it fails because the perl6 DynaLoader can't dlopen
POSIX.so? (And by then we've done our could-have-been-plain-loaded
mmaps, so it's too late to adapt)
Nicholas Clark
Yep, exactly that. Source code included below.
> Maybe I'm paranoid (or even plain wrong) but we (parrot) can handle it
> if an mmap fails - we just automatically fall back to plain file loading.
> Can dlopen() cope if an mmap fails? Or on a platform which can only
> do a limited number of mmaps do we run the danger of exhausting them early
> with all our bytecode segments, and then the first time someone attempts
> a require POSIX; it fails because the perl6 DynaLoader can't dlopen
> POSIX.so? (And by then we've done our could-have-been-plain-loaded
> mmaps, so it's too late to adapt)
If there's such a platform, then presumably we don't bother mmap at all
for that platform.
to run: cd to a man directory, then C</tmp/foo *>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int i, j;
    int fd;
    off_t size;
    void *p;
    struct stat st;

    for (j = 0; j < 20; j++) {
        for (i = 1; i < argc; i++) {
            fd = open(argv[i], O_RDONLY);
            if (fd == -1) {
                perror("open"); exit(1);
            }
            if (fstat(fd, &st) == -1) {
                perror("fstat"); exit(1);
            }
            size = st.st_size;
            /* printf("%d %5d %s\n", i, (int)size, argv[i]); */
            p = mmap(0, size, PROT_READ, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                perror("mmap"); exit(1);
            }
            close(fd);  /* the mapping survives the close */
        }
        printf("done loop %d\n", j);
    }
    sleep(1000);        /* linger so the process size can be inspected */
    return 0;
}
The problem's actually _virtual_ memory use/fragmentation, not physical
memory or swap. Say you map in 10k small files -- that's 640M virtual
memory, just over a fourth of what's available. Now let's say you're also
using mmap() in your webserver to send large (10M) files quickly over the
network. The small files, if they're long-lived get scattered all over
VA-space, so there's a non-trivial chance that the OS won't be able to
find a 10MB chunk of free addresses at some point.
To see it, you might try changing your program to map and unmap a large
file periodically while mapping the man pages. Then take a look at the
process's address space with /usr/proc/bin/pmap to see what the OS is
doing with the maps.
Weird, I know, but that's why it stuck in my mind. You have to map quite
a few files to get this to happen, but it's a real possibility with a
32-bit address space and a long-running process that does many small
mmap()s and some large ones.
Anyways...
/s
> How true. On Solaris, for example, mmap's are aligned on 64k boundaries,
> which leads to horrible virtual address space consumption when you map
> lots of small things. If we're mmap()ing things, we want to be sure
> they're fairly large.
Is one PBC file a small thing? Or in other words, should we have a size
limit below which we go back to malloc'ing and copying PBC files?
Configure option? Command-line switch?
> /s
leo
Yeah, but in practice most, if not all, the small files will be mapped
in at startup. It's no different than the situation at the moment on
Solaris, where XS modules require the .so object to be mmapped in.
> Weird, I know, but that's why it stuck in my mind. You have to map quite
> a few files to get this to happen, but it's a real possibility with a
> 32-bit address space and a long-running process that does many small
> mmap()s and some large ones.
But we'll all be using 64-bit processors by the time parrot's released :-)
--
This email is confidential, and now that you have read it you are legally
obliged to shoot yourself. Or shoot a lawyer, if you prefer. If you have
received this email in error, place it in its original wrapping and return
for a full refund. By opening this email, you accept that Elvis lives.
Maybe a config option? The app I'm thinking of was pathological, in that
it mapped in thousands of 20-byte files. Now that I think about it,
unless someone implements something very strangely (or has absolutely
enormous numbers of threads) this shouldn't be an issue.
/s
In particular, it would be nice to be able to trust code written by
myself and people I personally trust, run CPAN code in checked mode, run
code submitted by users without access to create IO PMCs, and not run
Microsoft code at all.
A code signing standard would enable that. It's defining a trust model
that doesn't let the user know what's actually going on that we have to
be wary of. (Even authenticating the host is potentially useful...
though I can't think of a good use.)
-=- James Mastros
[...]
> > struct Chunk {
> > opcode_t type;
> > opcode_t version;
> > opcode_t size;
> > void data[];
> > };
Will this ever compile?
void data[] is not allowed, and even char data[] is an incomplete
type, so it's not allowed in a structure definition. A void *data
pointer seems more appropriate. This way it's possible to have one
table of contents for the whole bytecode file, and every chunk of the
file can be accessed in constant time (no need to scan over the complete
file to reach the last chunk).
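(For completeness: C99 does allow a trailing array as the last member
if it has a complete element type, the so-called flexible array member;
pre-C99 compilers used a data[1] hack instead. A sketch, with a
stand-in opcode_t:)

#include <stdint.h>
#include <stdlib.h>

typedef int32_t opcode_t;    /* stand-in definition */

struct Chunk {
    opcode_t type;
    opcode_t version;
    opcode_t size;           /* number of opcode_t's in data[] */
    opcode_t data[];         /* C99 flexible array member */
};

/* allocate the header plus n ops in one block */
static struct Chunk *chunk_new(opcode_t type, opcode_t n)
{
    struct Chunk *c = calloc(1, sizeof *c + n * sizeof(opcode_t));
    if (c) {
        c->type = type;
        c->size = n;
    }
    return c;
}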
> I agree with the "roughly" bit, but I'd suggest ensuring that you put
> in enough bits to get data[] 64 bit aligned. Mainly because at least 1
> architecture exists that has no 32 bit types (Crays I know about; others
> may exist. I can't remember if perl 5.8 passes 100% of tests on Crays.
> We certainly tried)
opcode_t will be 64 bit on these architectures.
[...]
> I'm thinking that register usage information from imcc could be of use
> to the JIT, as that would save it having to work out things again. So that
> probably needs a segment.
>
> Also some way of storing a cryptographic signature in the file, so that you
> could compile a parrot that automatically refuses to load code that isn't
> signed by you.
These ideas show clearly one thing:
The typecode must be extendible.
[...]
> > >It might even be possible to dump the jitted code. This would speed
> > >up startup. Then strip the bytecode to reduce the size of the file
> > >and TADA: Yet another new binary format.
> > When you then are able to get the same memory layout for a newly
> > created interpreter, it might even run ;-)
>
> So the JITted code contains lots of hard references to address in running
> interpreter? It's not just dependent on that particular binary's
> layout?
And if there are two interpreters in the same process (isn't that the
supposed way of doing multiple threads?), does each one have to compile
the same code again?
> I guess in future once the normal JIT works, and we've got the pigs flying
> nicely then it would be possible to write a Not Just In Time compiler that
> saves out assembly code and relocation instructions.
>
> Bah. That's "parrot -o foo.o foo.pmc" isn't it?
And if we make C a parrot supported language we can even build parrot
with parrot?
bye
boe
--
Juergen Boemmels boem...@physik.uni-kl.de
Fachbereich Physik Tel: ++49-(0)631-205-2817
Universitaet Kaiserslautern Fax: ++49-(0)631-205-3906
PGP Key fingerprint = 9F 56 54 3D 45 C1 32 6F 23 F6 C7 2F 85 93 DD 47
> Nicholas Clark <ni...@unfortu.net> writes:
> > > struct Chunk {
> > > opcode_t type;
> > > opcode_t version;
> > > opcode_t size;
> > > void data[];
> > > };
> > I agree with the "roughly" bit, but I'd suggest ensuring that you put
> > in enough bits to get data[] 64 bit aligned. Mainly because at least 1
> > architecture exists that has no 32 bit types (Crays I know about; others
> > may exist. I can't remember if perl 5.8 passes 100% of tests on Crays.
> > We certainly tried)
>
> opcode_t will be 64 bit on this architectures.
For native bytecode, yes. However, consider a bytecode file generated on
a machine with a 32-bit opcode_t that is now being read on a machine with
a 64-bit opcode_t. In that case, it would be helpful if the data were
aligned on a 64-bit boundary.
--
Andy Dougherty doug...@lafayette.edu
> Nicholas Clark <ni...@unfortu.net> writes:
>>> struct Chunk {
>>> opcode_t type;
>>> opcode_t version;
>>> opcode_t size;
>>> void data[];
>>> };
> will this ever compile?
It's similar to "opcode_t *data". If size == 0, no data follow in byte
stream. byte_code_{un,}pack is implemented like this now.
>>I agree with the "roughly" bit, but I'd suggest ensuring that you put
>>in enough bits to get data[] 64 bit aligned.
> opcode_t will be 64 bit on this architectures.
PBC segments and above data are aligned on 4*sizeof(opcode_t) boundary.
> The typecode must be extendible.
Not if it follows the above conventions; only a unique name would be
necessary. But yes, in the long run.
> And if there are two interpreters in the same process (isn't that the
> supposed way of multiple threads) each one has to compile the same
> code again?
No: interpreter->code of both points to the same data and JIT code
already lives in the packfile now.
> And if we make C a parrot supported language we can even build parrot
> with parrot?
And if it runs, yes.
> bye
> boe
leo
> > I guess in future once the normal JIT works, and we've got the pigs
> flying
> > nicely then it would be possible to write a Not Just In Time
> compiler that
> > saves out assembly code and relocation instructions.
> >
> > Bah. That's "parrot -o foo.o foo.pmc" isn't it?
>
> And if we make C a parrot supported language we can even build parrot
> with parrot?
I was just thinking that myself. There are two issues here:
1. The gcc: I have 99% of the information about the function bodies of
the parrot C source code in rdf/xml. That could be fed to parrot.
2. The pnet/C: there has been work done by Rhys to make a managed C
compiler for pnet. Gopal has been working on a parrot bytecode emitter
for Pnet.
mike
=====
James Michael DuPont
http://introspector.sourceforge.net/
I'm battling with this in another file format at the moment; if possible can
we please *not* have it sensitive to its own location in the file?
For example, an auto-dearchive zip-file has its index at the end of the
file, so that the code can go at the front. It would be nice if the whole
archive could be signed, rather than just the dearchiving code.
My suggestion: make the signature define which other parts of the file it
applies to, say with a list of region boundaries as byte addresses in the
file; that way signature manipulation remains fairly simple, and it's not
too hard to check that a given section is spanned by a signature. And you
could have multiple signatures applying to different parts of the file (one
to the zip archive, another to the unarchiver).
And how is this going to interact with "-T" or whatever we're going to use?
Under my suggested scheme, the data would be untainted if it's covered by a
verified signature, and tainted if not.
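In struct form, the kind of record I have in mind (all names and sizes
invented for the sketch):

#include <stdint.h>

/* a signature that names the byte regions of the file it covers,
 * instead of depending on its own position in the file */
struct SigRegion {
    uint64_t start;                /* byte offset into the file */
    uint64_t length;
};

struct SignatureChunk {
    uint32_t n_regions;
    struct SigRegion regions[8];   /* fixed cap, for the sketch only */
    uint32_t sig_len;
    unsigned char sig[256];        /* e.g. an RSA/DSA signature blob */
};

/* a verifier would hash the named regions in order and check the
 * signature over that digest; regions covered by a verified
 * signature could then be treated as untainted under -T */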
-Martin
Hmmm... bootstrapping ....
> 1. The gcc : I have %99 of the information about the function bodies of
> parrot c source code in rdf/xml. That could be fed to parrot.
That would only be part of the issue ... generating stuff out of
RDF AST's is only half of our trouble ... In fact , I think it would
be much easier if someone managed to convert RTL into Parrot (a gcc
backend) ...
It won't be a "hack" like egcs-jvm since Parrot is already a register
machine and has pointer instructions already ...
> 2. The pnet/C : there has been work done by Rhys to make a managed c
> compiler for pnet. Gopal has been working on a parrot bytecode emitter
> for Pnet.
Well ... I haven't been "working" on parrot bytecode emitter ...
It's just that we have a "pm_codegen.tc" treecc handler inside parrot
which is stubbed up ... until I can do some kind of class generation
for parrot, the C# AST cannot be used for anything useful.
Also Dan was not so hot about having a C compiler for Parrot when he
met Rhys on IRC...
So I'm sticking to compiling C# to Parrot as an aim , and *maybe* Java
as well ...
Gopal
--
The difference between insanity and genius is measured by success
Here are the things I would like to know about a given bytecode:
what line (maybe column) it comes from
possible comments about it
if it is a method call, what is the method name, signature, location
if it is a variable or constant, what is the variable name, type, size
if it is an expression, what is its type and size
For a given type, the name and size would be great to store.
Is it going to be possible to store this data in the meta-data? It does
not have to be all there at once, but will the framework handle it?
Hopefully you have answered this already, and you can just say: rtfm.
Thanks for your patience, I am a bit slow today.
> Dear All,
> I just wanted to ask about a conclusion on the bytecode metadata.
>
> Here are the things I would like to know about a given bytecode :
> what line (maybe column) it comes from
File/line information is already there (imcc -d -o...) and working.
> If it is a method call, what is the method name, signature, location
> If it is a variable or constant, what is the variable name, type, size
> If it is an expression, what is its type and size
> For a given type, the name and size would be great to store.
>
> Is it going to be possible to store this data in the meta-data,
> it does not have to be all there at once, but will the framework handle
> it?
Yep. The framework can now handle all kinds of information in the PBC,
though the details have to be determined. For actually doing something
useful with these data we probably need a PBC PMC class, which can do
something with this data at the PASM level, and if possible routines
like those in jit_debug.c which hand such information over to the
debugger.
> Hopefully you have answered this already, and you can just say, rtfm.
Some minutes ago, I did check in a major update of docs/parrotbyte.pod.
> mike
leo
That's a lot of metadata. It sounds like maybe the metadata is primary
and the bytecode is secondary, in which case perhaps what you
really want is a (metadata) tree decorated with bytecode rather than
a (bytecode) array decorated with metadata.
Of course, the most natural candidate for the metadata would be the
annotated (file & line, etc.) parse tree, or some approximation to it
after compilation-related transformations.
I can imagine a process that loads the tree, and linearizes the
bytecode with the metadata consisting of backpointers to nodes of
the tree, either in band as escaped noop-equivalent bytecode or
out of band in an offset-pointer table.
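The out-of-band variant could look roughly like this (a sketch; names
invented, table kept sorted by offset so a debugger can binary-search
it):

#include <stddef.h>
#include <stdint.h>

struct TreeNode;                 /* metadata tree node, defined elsewhere */

/* table mapping bytecode offsets back to metadata tree nodes */
struct BackPtr {
    size_t           pc_offset;  /* offset into the bytecode array */
    struct TreeNode *node;       /* tree node that emitted this span */
};

/* find the node covering a given pc: last entry with pc_offset <= pc */
static struct TreeNode *node_for_pc(const struct BackPtr *tab,
                                    size_t n, size_t pc)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {            /* binary search (upper bound) */
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].pc_offset <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? tab[lo - 1].node : NULL;
}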
With a suitable amount of forethought on the tree representation,
you should be able to have good flexibility while still having enough
standardization on how tree-emitting compilers represent typical
debug-related metadata (file, line, etc.) that debuggers and other
tools could be generic.
Regards,
-- Gregor
> Mike --
>
> That's a lot of metadata. It sounds like maybe the metadata is primary
> and the bytecode is secondary, in which case perhaps what you
> really want is a (metadata) tree decorated with bytecode rather than
> a (bytecode) array decorated with metadata.
The bytecode is primary. This is what gets executed; this is what
needs to be fast (both in startup time and runtime). Some kind of
data is necessary for the bytecode, such as the string
constants. These also need to be accessed fast (I don't know if this is
called metadata; this is more data). The metadata is only needed in
rare cases, e.g. debugging, so it doesn't need to be as fast (but even
here speed is nice).
> Of course, the most natural candidate for the metadata would be the
> annotated (file & line, etc.) parse tree, or some approximation to it
> after compilation-related transformations.
>
> I can imagine a process that loads the tree, and linearizes the
> bytecode with the metadata consisting of backpointers to nodes of
> the tree, either in band as escaped noop-equivalent bytecode or
> out of band in an offset-pointer table.
Bytecode reading must be fast. Ideally it is mmap and start.
Tree walking for bytecode generation should be done by the compiler.
> With a suitable amount of forethought on the tree representation,
> you should be able to have good flexibility while still having enough
> standardization on how tree-emitting compilers represent typical
> debug-related metadata (file, line, etc.) that debuggers and other
> tools could be generic.
The tree metadata can surely be some kind of intermediate output of the
compiler (the output of the compiler front end), but normally this
should be fed into a backend which generates fast-running bytecode or
even native code.
bye
b.
I agree that under normal circumstances the bytecode is primary.
I was observing that as more and more metadata is considered,
eventually its quantity (measured, say, in bytes) could approach
or even exceed that of the raw bytecode. In cases where one
would feel such a quantity of metadata is needed, it may not
always be necessary to get greased-weasel speed-of-loading
(but, see below).
I understand the mmap-and-go idea, although it doesn't
always work out even when mmap is available (for example,
prederef requires a side pointer-array to store its prederef
results). Sometimes it's mmap-mumble-go (but, see below).
Certainly, there is nothing to prevent one from having
the linearized bytecode pregenerated in the PBC file even
when a metadata tree is also present (the tree could reference
contiguous chunks of that bytecode by offset-size pairs). If
you don't care about the tree, you don't process it. If you do
process it, you probably produce an index data structure mapping
byte code offsets to tree nodes for the debugger. I believe
we retain high speed with this approach.
We do need to consider how the metadata makes it from the
compiler *through* IMCC to land in the PBC file. The compiler
needs to be able to produce annotated input to IMCC, and IMCC
needs to be able to retain the annotations while it makes its
code improvements and rendering (register mapping, etc.).
I'm thinking that, too, could possibly be a tree. IMCC can pick out
the chunks of IMC, generate bytecode, and further annotate the
tree with the offset and size of the generated PBC chunk. The
tree can be retained as the metadata segment in the PBC file.
Regards,
-- Gregor
I completely agree with you. For my needs, the meta-data does not have
to be loaded at the same time at all. It can be in a different file for
all I care. I just want to know where we can put it. The Microsoft IL
has a whole section on meta-data, and one wonders what Parrot might be
doing to address the same issues. Excuse my ignorance, I am sure you
addressed this, and no, I have not read the new pods yet.
> Bytecode reading must be fast. Ideally it is mmap and start.
> Treewalking for bytecodegeneration should be done by the compiler.
Yes, I agree. I just want to be able to reconstruct the tree for
debugging or reverse engineering (if the compiler that produced the
bytecode wants to produce this).
I would like to prototype some meta-data storage of my gcc
> The tree metadata can sure be some kind of intermediate output of the
> compiler (the output of the compiler front end), but normaly this
> should be fed into a backend which generates fast running bytecode or
> even native code.
That sounds great.
Normally you don't need this information; I just want to know how I can
store it if I *do* need it.
The metadata from the C++ that I am extracting even exceeds the size of
the source code itself.
--- gre...@focusresearch.com wrote:
> b. --
>
> I agree that under normal circumstances the bytecode is primary.
> I was observing that as more and more metadata is considered,
> eventually its quantity (measured, say, in bytes) could approach
> or even exceed that of the raw bytecode. In cases where one
> would feel such a quantity of metadata is needed, it may not
> always be necessary to get greased-weasel speed-of-loading
> (but, see below).
>
> I understand the the mmap-and-go idea, although it doesn't
> always work out even when mmap is available (for example,
> prederef requires a side pointer-array to store its prederef
> results). Sometimes its mmap-mumble-go (but, see below).
>
>
> Certainly, there is nothing to prevent one from having
> the linearized bytecode pregenerated in the PBC file even
> when a metadata tree is also present (the tree could reference
> contiguous chunks of that bytecode by offset-size pairs). If
> you don't care about the tree, you don't process it. If you do
> process it, you probably produce an index data structure mapping
> byte code offsets to tree nodes for the debugger. I believe
> we retain high speed with this approach.
Yeah, that is the idea. Reflection and the introspector require the
meta-data, which can be read by special reflection operations.
>
>
> We do need to consider how the metadata makes it from the
> compiler *through* IMCC to land in the PBC file. The compiler
> needs to be able to produce annotated input to IMCC, and IMCC
> needs to be able to retain the annotations while it makes its
> code improvements and rendering (register mapping, etc.).
> I'm thinking that, too, could possibly be a tree. IMCC can pick out
> the chunks of IMC, generate bytecode, and further annotate the
> tree with the offset and size of the generated PBC chunk. The
> tree can be retained as the metadata segment in the PBC file.
Sounds good to me. For me, it could also be a graph in triples formats
(subject,predicate,object), and not a tree. This is what I wanted to
know, what is defined, and what needs to be defined.
Regards,
AFAIK, that just holds the offset, line number and filename. IIRC the
JVM had a LineNumberTable and VarNameTable for debugging which were
declared as ``attributes'' to each method in the .class tree.
I suppose VarNameTable is totally irrelevant for Parrot ...
> yes I agree, I just want to be able to reconstruct the tree for
> debugging or reverse engineering (if the compiler that produced the
> bytecode whats to produce this).
Optimisations ? ... (bang, there goes the line numbers ;)
> Normally you dont need this information, I just want to know how I can
> store it if I *do* need it.
>
> The metadata from the c++ that i am extracting even exceeds the size of
> the sourcecode itself.
>
> yeah, that is the idea. Reflection and introspector require the
> meta-data, that can be read by special reflection operations.
I think Parrot is going to *need* reflection operations :) ...
You might be able to extract information like you do with C# ,
with reflection looping over the methods.
Btw, your RDF stuff wouldn't be what I call "metadata" :) .. it's
data itself in a pre-processed format.
> > IMCC can pick out
> > the chunks of IMC, generate bytecode,
.line 42 "life.fubar"
?
Gopal
PS: don't look at me like that , I don't know anything about debugging
eval()...
Fair enough. Good point!
> Of course, the most natural candidate for the metadata would be the
> annotated (file & line, etc.) parse tree, or some approximation to it
> after compilation-related transformations.
OK that sounds fine. My current problems with the graphs of meta-data
are the speed of loading. I would like to use something like what you
are talking about with the mmap. Also, dot.net IL has tons of
meta-data, very very much of it.
>
> I can imagine a process that loads the tree, and linearizes the
> bytecode with the metadata consisting of backpointers to nodes of
> the tree, either in band as escaped noop-equivalent bytecode or
> out of band in an offset-pointer table.
Sure, a zipper (Reissverschluss ;) concept.
>
> With a suitable amount of forethought on the tree representation,
> you should be able to have good flexibility while still having enough
> standardization on how tree-emitting compilers represent typical
> debug-related metadata (file, line, etc.) that debuggers and other
> tools could be generic.
OK. Well, the current RDF format that I have is ok, so that brings me
back to the idea of using RDF.... Redland supports a bdb store, which
also supports fast loading, but is not platform independent.
--- Gopal V <gopa...@symonds.net> wrote:
> If memory serves me right, James Michael DuPont wrote:
> > I just want to know how where we can put it. The Microsoft IL
> > has a whole section on meta-data,
>
> AFAIK, that just holds the offset, line number and filename. IIRC the
>
> JVM had a LineNumberTable and VarNameTable for debugging which were
> declared as ``attributes'' to each method in the .class tree.
>
> I suppose VarNameTable is totally irrelevant for Parrot ...
I don't know that; what is it? A variable name table? If so, I think it
might be good for debugging.
>
> > yes I agree, I just want to be able to reconstruct the tree for
> > debugging or reverse engineering (if the compiler that produced the
> > bytecode whats to produce this).
>
> Optimisations ? ... (bang, there goes the line numbers ;)
If you want to debug, you don't want optimizations. When you compile
for debugging with gcc, it produces DWARF debug data; that is the type
of meta-data I am talking about.
>
> > Normally you dont need this information, I just want to know how I
> can
> > store it if I *do* need it.
> >
> > The metadata from the c++ that i am extracting even exceeds the
> size of
> > the sourcecode itself.
> >
> > yeah, that is the idea. Reflection and introspector require the
> > meta-data, that can be read by special reflection operations.
>
> I think Parrot is going to *need* reflection operations :) ...
> You might be able to extract information like you do with C# ,
> with reflection looping over the methods.
You might want to run C# in parrot, then you need it.
>
> Btw, your RDF stuff wouldn't be what I call "metadata" :) .. it's
> data itself in a pre-processed format.
Well, read my first answer to the original meta-data post, with all the
links.
I think it is meta-data: all information about the bytecode that might
be interesting to a person, to understand what a given bytecode is and
means, its context, meaning, usage, and all that. It is all meta-data.
The bytecode is what is needed to run.
>
> > > IMCC can pick out
> > > the chunks of IMC, generate bytecode,
>
> .line 42 "life.fubar"
Again, nice to meet you again Mr. Victory,
Mike
------------------------------------------------------------------------
The first ten million years were the worst, said Marvin, and the second
ten million years were the worst too. The third ten million I didn't
enjoy at all. After that I went into a bit of a decline......
The best conversation I had was over forty million years ago, continued
Marvin.....And that was with a coffee machine.
- Marvin complaining about being left alone for years.
> --- gre...@focusresearch.com wrote:
>
>>Mike --
>>
>>That's a lot of metadata.
>>
>
> OK that sounds fine. My current problems with the graphs of meta-data
> are the speed of loading.
When you arrange the meta-data as a single opcode stream, you have ~zero
load time for the mmap()ed case.
This means you delay parsing of this stream to the time when you are
actually using it.
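Sketched in C (all names invented; the parser is left abstract):

#include <stddef.h>
#include <stdint.h>

typedef int32_t opcode_t;            /* stand-in definition */

/* the mmap()ed metadata segment stays a raw opcode stream until
 * someone asks for it; parse on first use only */
struct MetaSeg {
    const opcode_t *raw;             /* points into the mmap()ed file */
    size_t          n_ops;
    void           *parsed;          /* NULL until first access */
};

extern void *parse_meta_stream(const opcode_t *raw, size_t n_ops);

static void *meta_get(struct MetaSeg *seg)
{
    if (!seg->parsed)                /* ~zero cost at load time */
        seg->parsed = parse_meta_stream(seg->raw, seg->n_ops);
    return seg->parsed;
}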
> mike
leo
Great. I will review the code and see how this can be done. That would
be great!!!
mike
Mapping between local vars and variable names ... because the JVM has
unlimited (well, virtually unlimited) local vars, this works for the
JVM. But since Parrot has only 32 registers, they get re-used for local
vars. I think using IMCC will in general mess around with the register
numbers for the temporaries. So it doesn't make sense for Parrot to have
a VarNameTable.
> > I think Parrot is going to *need* reflection operations :) ...
> > You might be able to extract information like you do with C# ,
> > with reflection looping over the methods.
>
> You might want to run C# in parrot, then you need it.
Not really for C# support only ... Dynamic invocation will need it...
<?php
$a="foobar";
$a(); // get method name "foobar" from global scope, run
?>
Something of this sort will need to occur .. C# is much easier, as you
already know what types/methods there might be and there's no dynamic
member lookup :).
> I think it is meta-data, all information about the bytecode that might
> be interesting to a person, to understand what a give bytecode is and
> means, it's context, meaning, usage, and all that. it is all meta-data.
You could argue about that , but IMHO except the basic Reflection stuff
and debug information , all the other itty-bitty details are useless for
the engine . Better keep it in a seperate file ?. (and build a cool
bytecode analysis tool as well :)
> > .line 42 "life.fubar"
>
> Again, nice to meet you again Mr. Victory,
That's quoted off my fubar-to-IL compiler ... (that's what we call
the f'uped beyond recognition pascal clone we ``implement'' in our lab).
Cheerio,
Gopal