name shortening, shareable images, AUTODIN II, and such


Craig A. Berry

Oct 14, 2010, 6:38:40 PM
I need to predict the shortened name the C compiler will generate when
invoked with /NAMES=SHORTENED, and I need it so I can put it in a linker
options file for a shareable image. I can't use the (now superseded)
Java name shortener because, as its docs say, it doesn't produce the
same names as the C compiler (plus its sources are not provided), and I
can't use CXXDEMANGLE/MANGLE as I can't depend on everyone having C++
installed. I could conceivably build my own CXXDEMANGLE/MANGLE
replacement to root through the demangler database and see what's in
there, but, well, yuck.

There has been a fair amount written about this, but as far as I can
tell nothing that actually solves the problem. There is Ed Vogel's post
here:

<http://groups.google.com/group/comp.os.vms/msg/cc78b4903bcfb2f0?>

which made it almost verbatim into the on-line help for CC/NAMES
(thanks, Ed), the most pertinent bit of which says:

---
A shortened name consists of the first 23 characters of the name
followed by a 7-character Cyclic Redundancy Check (CRC) computed by
looking at the full name, and then a "$".
---

and then it gives an excerpt of a call to LIB$CRC, which it says
provides this checksum. There are references to the MACRO-32 CRC
instruction documented at:

<http://h71000.www7.hp.com/doc/73final/4515/4515pro_026.html#16_cyclicredundancycheckinstruc>

which provides some background but of course says nothing about how
the C and C++ compilers use it.

Being lazy, I didn't bother to code up my own call to LIB$CRC, but
borrowed Hoff's from:

<http://hoffmanlabs.org/vmsfaq/vmsfaq_017.html#progautodin2>

which he also provides, with some discussion, at:

<http://labs.hoffmanlabs.com/node/616#comment-334>

the most encouraging part of which is the comment, "this C code could be
modified into the C (de)mangling scheme."

I took this code and added a wrapper so I could pass an argument to it:

$ type hoff_crc.c
#include <lib$routines.h>
#include <descrip.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>


static int CreateCRC32( struct dsc$descriptor_s *InputDataDesc )
{
    unsigned int AUTODIN2;
    unsigned int Seed = ~0UL;                   /* initial CRC: all ones */
    unsigned int Coefficient = 0x0EDB88320UL;   /* AUTODIN II polynomial */
    unsigned int CRCArray[16];

    lib$establish( lib$sig_to_ret );

    lib$crc_table( (void *) &Coefficient, (void *) CRCArray );
    AUTODIN2 = lib$crc( (void *) CRCArray, (void *) &Seed, InputDataDesc );
    AUTODIN2 ^= Seed;                           /* final inversion */

    return AUTODIN2;
}

int
main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: hoff_crc symbol\n");
        return EXIT_FAILURE;
    }

    /* $DESCRIPTOR takes the length from sizeof, so set it explicitly. */
    $DESCRIPTOR(input_str, argv[1]);
    input_str.dsc$w_length = strlen(argv[1]);
    unsigned int sum = (unsigned int) CreateCRC32(&input_str);
    printf ("crc is %x\n", sum);

    return EXIT_SUCCESS;
}

$ cc hoff_crc
$ link hoff_crc
$ mcr []hoff_crc "Please_forgive_this_absurdly_long_symbol_name"
crc is 80b9108

Brilliant. I now have a CRC. I even reproduced these results by using
two other entirely independent CRC32 implementations found in the wild.
However, this CRC is a 32-bit value, and my goal is to reproduce the
"7-character" CRC the compiler purportedly uses for name shortening.

What I was expecting was to take the 8 hex digits of the 32-bit CRC and
throw away the least significant nibble. But that doesn't give me the
answer I'm looking for, which is obtainable by doing:

$ cc/names=short sys$input
int Please_forgive_this_absurdly_long_symbol_name;
^Z
$ type [.cxx_repository]cxx$demangler_db.

PLEASE_FORGIVE_THIS_ABS1ARO4QU$Please_forgive_this_absurdly_long_symbol_name

So the "7-character Cyclic Redundancy Check" it's using is "1ARO4QU",
which bears no obvious relationship to 80b9108 that I can see. It
doesn't take a grizzled propeller-head to realize that several of the
characters in "1ARO4QU" are not even valid hex digits, so that theory
goes out the window.
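
To make the pieces easier to compare, here is a rough sketch that splits a line like the one above into its parts. The layout is an assumption inferred from that single line (23 characters of prefix, 7 checksum characters, a "$", then the original name); the struct and function names are hypothetical, and nothing here is a documented description of the database format.

```c
#include <assert.h>
#include <string.h>

#define SHORT_LEN   31  /* length of a shortened external name */
#define PREFIX_LEN  23
#define CRC_LEN      7

/* Hypothetical record layout, inferred from the one line shown above. */
struct demangler_record {
    char prefix[PREFIX_LEN + 1];
    char crc[CRC_LEN + 1];
    const char *original;    /* points into the caller's line */
};

/* Returns 1 on success, 0 if the line does not fit the assumed layout. */
static int split_record(const char *line, struct demangler_record *out)
{
    assert(line != NULL && out != NULL);
    if (strlen(line) <= SHORT_LEN || line[SHORT_LEN - 1] != '$')
        return 0;
    memcpy(out->prefix, line, PREFIX_LEN);
    out->prefix[PREFIX_LEN] = '\0';
    memcpy(out->crc, line + PREFIX_LEN, CRC_LEN);
    out->crc[CRC_LEN] = '\0';
    out->original = line + SHORT_LEN;
    return 1;
}
```

Applied to the line above, this yields the prefix "PLEASE_FORGIVE_THIS_ABS", the checksum field "1ARO4QU", and the original mixed-case name.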

Clearly there is some additional transformation of the CRC in order to
come up with the 7 characters that are used as part of the shortened
name. Anyone know what it is?

Jose Baars

Oct 14, 2010, 7:08:54 PM
On 15 Oct, 00:38, "Craig A. Berry" <craigbe...@nospam.mac.com> wrote:
> Clearly there is some additional transformation of the CRC in order to
> come up with the 7 characters that are used as part of the shortened
> name. Anyone know what it is?

No. 1ARO4QU couldn't be the Base-64 encoding of 80b9108?

As the mangle/demangle CRC routine is undocumented, I suppose it is
unsupported, and I didn't feel too confident reverse engineering this
reliably.

So I decided to just read cxx$demangler_db.

You could go to libssh2.org and download libssh2; in the vms
subdirectory, libssh2_make_lib.dcl does exactly this to build an options
file and a shareable library. I myself am not completely convinced of the
elegance of it all, but at least it works.

Good luck

Craig A. Berry

Oct 14, 2010, 8:29:09 PM

Jose Baars wrote:
> On 15 okt, 00:38, "Craig A. Berry" <craigbe...@nospam.mac.com> wrote:
>> Clearly there is some additional transformation of the CRC in order to
>> come up with the 7 characters that are used as part of the shortened
>> name. Anyone know what it is?
>
> No. 1ARO4QU couldn't be the Base-64 encoding of 80b9108?

Good guess, but apparently not:

$ perl -"MMIME::Base64" -e "$x=0x80b9108; print encode_base64($x);"
MTM0OTc1NzUy

Nor is it uuencode:

$ perl -e "$x=0x80b9108; print pack('u*', $x);"
),3,T.3<U-S4R

Whatever it is, it will obviously have to do something to make sure the
result consists of only characters that are valid in symbol names.

> As the mangle/demangle CRC routine is undocumented,

Ah, but it *is* documented, sort of. It says right in the on-line help
for CC/NAMES what CRC they use and even provides sample code for
calculating it -- it just doesn't tell the whole story.

> So I decided to just read cxx$demangler.db

Yes, that is always an option. Thanks for the reply.

Bob Gezelter

Oct 15, 2010, 5:47:31 AM
On Oct 14, 8:29 pm, "Craig A. Berry" <craigbe...@nospam.mac.com>

Craig,

I also recommend a high degree of caution, particularly in name
handling in C/C++. C++ names include the types of the arguments
because of name overloading (e.g., there can be multiple cases of f(x)
depending on the declared type of x). This is emphatically a C++
feature, not a part of C. If one is setting up a library, one needs to
make sure that the names being generated by C++ are defined so that
they are proper in both ways, and are not likely to change over
versions of the compiler (which is an entirely different compatibility
issue going forward).

Consider the example of the following, admittedly trivial functions:

int thisisareallylongname_nokiddingthistime(int x) {return 1;}
int thisisareallylongname_nokiddingthistime(int *x){return 0;}

Doing a CXX/LIST/MACHINE_CODE is instructive.

I recommend a careful reading of the manual with regards to "external
names".

- Bob Gezelter, http://www.rlgsc.com

Ed Vogel

Oct 15, 2010, 10:45:16 AM

"Craig A. Berry" <craig...@nospam.mac.com> wrote in message
news:dM2dnQioxdfsGyrR...@speakeasy.net...

>I need to predict the shortened name the C compiler will generate when
> invoked with /NAMES=SHORTENED, and I need it so I can put it in a linker
> <remainder removed>

Invoke the compiler with /WARN=ENABLE=NAMESHORTENED

This will cause the compiler to emit an informational each time it shortens a
name. The informational will give both the old and the shortened name. I
think this will give you all you need.

Ed Vogel

There are lots of interesting messages the compiler can output. I did not
even know for sure that it had this message (the code was added long
ago)....so I tried /WARN=ENABLE=ALL, and sure enough...

Craig A. Berry

Oct 15, 2010, 11:58:21 AM

Ed Vogel wrote:
> "Craig A. Berry" <craig...@nospam.mac.com> wrote in message
> news:dM2dnQioxdfsGyrR...@speakeasy.net...
>> I need to predict the shortened name the C compiler will generate when
>> invoked with /NAMES=SHORTENED, and I need it so I can put it in a linker
>> <remainder removed>
>
> Invoke the compiler with /WARN=ENABLE=NAMESHORTENED
>
> This will cause the compile to emit an informational each time it shortens a
> name. The informational will give both the old and the shortended name. I
> think this will give you all you need.

Indeed the message has the relevant information:

$ cc/names=short/WARN=ENABLE=NAMESHORTENED sys$input
int Please_forgive_this_absurdly_long_symbol_name;
^Z
int Please_forgive_this_absurdly_long_symbol_name;
....^
%CC-I-NAMESHORTENED, The external identifier or module name
"PLEASE_FORGIVE_THIS_ABSURDLY_LONG_SYMBOL_NAME" exceeds 31 characters.
The name has been shortened to "PLEASE_FORGIVE_THIS_ABS1ARO4QU$".
at line number 1 in file SYS$INPUT:.;

Thanks for the suggestion, but I think it will be less reliable to parse
the output of the message than to read through cxx$demangler_db.

I should emphasize that I'm not interested in manually identifying the
shortened name on a case-by-case basis -- I already know how to do that.
I need an automated way to predict the shortened name for any arbitrary
symbol from a rapidly changing list of symbols over which I have no
control.

Craig A. Berry

Oct 15, 2010, 12:26:34 PM
Bob Gezelter wrote:

> I also recommend a high degree of caution. Particularly in name
> handling in C/C++. C++ names include the types of the arguments,
> because of name overloading (e.g., there can be multiple cases of f(x)
> depending on the declared types of x. This is emphatically a C++
> feature, not a part of C.

You're quite right to toss out the reminder that C name shortening is
the little brother of C++ name mangling, and I have seen your suggestion
elsewhere to use aliases in the linker options file to handle that
larger problem. And for the C++ case, the docs hint that there are
differences between Alpha and Itanium, which would make this even more fun.

Luckily in my case the code is pure C, and extern "C" declarations in
appropriate places ensure it will be treated as such even if thrown up
against a C++ compiler.

I do need to worry about /NAMES=(AS_IS,SHORTENED) as well as
/NAMES=SHORTENED, and I didn't do that in the example I posted earlier.
The question is when upper casing is done, i.e., before or after the
checksum is computed, before or after one of the missing steps after the
checksum is computed, etc.

Jose Baars

Oct 16, 2010, 7:11:08 AM
On 15 Oct, 18:26, "Craig A. Berry" <craigbe...@nospam.mac.com> wrote:
>and I have seen your suggestion
> elsewhere to use aliases in the linker options file to handle that
> larger problem.

Where?

> The question is when upper casing is done, i.e., before or after the
> checksum is computed, before or after one of the missing steps after the
> checksum is computed, etc.

/NAMES=(AS_IS,SHORTENED) gives different shortened names than
/NAMES=(SHORTENED). Although the point is moot, I guess that means
the checksum is calculated before upper casing.

Jose Baars

Oct 16, 2010, 7:12:10 AM
On 16 Oct, 13:11, Jose Baars <peutba...@googlemail.com> wrote:
> /NAMES=(SHORTENED). Although the point is moot, I guess that means
> the checksum is calculated before upper casing.
After.

Craig A. Berry

Oct 16, 2010, 10:23:24 AM

Jose Baars wrote:
> On 15 okt, 18:26, "Craig A. Berry" <craigbe...@nospam.mac.com> wrote:
>> and I have seen your suggestion
>> elsewhere to use aliases in the linker options file to handle that
>> larger problem.
>
> Where?

Couldn't remember, so I typed "Gezelter alias C++" into the Google and
the first hit was:

<http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1243740>

To be clear, I'm referring to linker aliases specified in the linker
options file, not any other meaning of the word "alias".

>
>> The question is when upper casing is done, i.e., before or after the
>> checksum is computed, before or after one of the missing steps after the
>> checksum is computed, etc.
>
> /NAMES=(AS_IS,SHORTENED) gives different shortened names than
> /NAMES=(SHORTENED). Although the point is moot, I guess that means

> the checksum is calculated [after] upper casing.

So you'd think, but HELP CXX/NAMES has the following note:

<<<
The I64 C++ compiler has some additional encoding rules that
are applied to symbol names after the ABI name mangling
is determined. All symbols with C++ linkage have CRC
encodings added to the name, are uppercased and shorten to
31 characters if necessary. Since the CRC is computed before
the name is uppercased, the symbol name is case-sensitive
even though the final name is in uppercase. /NAMES=AS_IS and
/NAMES=UPPER are not applicable to these symbols.

All symbols without C++ linkage will have CRC encodings
added if they are longer then 31 characters and
/NAMES=SHORTEN is specified. Global variables with C++
linkage are treated as if they have non-C++ linkage for
compatibility with C and older compilers.
>>>

So for symbols with C++ linkage, the checksum is computed before case
leveling. Of course what I'm interested in are the symbols *without* C++
linkage. I guess they could do things in a different order, but it seems
more likely that the difference comes in whatever step or steps follow
the computation of the checksum.

Jose Baars

Oct 16, 2010, 1:22:02 PM
On 16 Oct, 16:23, "Craig A. Berry" <craigbe...@nospam.mac.com> wrote:

> I guess they could do things in a different order, but it seems
> more likely that the difference comes in whatever step or steps follow
> the computation of the checksum.

Yes, like I said, the point is moot, as the shortening algorithm is not
known. For C symbols the shortened AS_IS names are different
from the upper cased symbols. Why and how that is achieved would
only be interesting if the shortening algorithm was documented,
supported and thus reliably reproducible.

Craig A. Berry

Oct 18, 2010, 11:37:15 PM

Jose Baars wrote:
> On 16 okt, 16:23, "Craig A. Berry" <craigbe...@nospam.mac.com> wrote:
>
>> I guess they could do things in a different order, but it seems
>> more likely that the difference comes in whatever step or steps follow
>> the computation of the checksum.
>
> Yes, like I said, the point is moot, as the shortening algorithm is
> not known.

Oh it's definitely known, or at least knowable; it's just that no one
possessing that knowledge or capable of finding it has (so far) been
willing to share it.

> For C symbols the shortened AS_IS names are different
> from the upper cased symbols. Why and how that is achieved would
> only be interesting if the shortening algorithm was documented,
> supported and thus reliably reproducible.

Well, it's more documented than any of the alternatives. The internal
structure of the name mangler database (simple though it is) could
change. The various methods for rooting through object code are
architecture-specific and vulnerable to a variety of changes. Parsing
the output of informational messages is similarly fraught with peril.
And even if these methods were more reliable, they wouldn't accommodate
the case where I know a particular set of automatically collected
symbols that I want to export (and only those symbols) but I just need
to know what names the compiler and linker will know them by.

hb

Oct 19, 2010, 10:14:54 AM
On Oct 15, 5:58 pm, "Craig A. Berry" <craigbe...@nospam.mac.com>
wrote:

> $ cc/names=short/WARN=ENABLE=NAMESHORTENED sys$input
> int Please_forgive_this_absurdly_long_symbol_name;
> ^Z
> int Please_forgive_this_absurdly_long_symbol_name;
> ....^
> %CC-I-NAMESHORTENED, The external identifier or module name
> "PLEASE_FORGIVE_THIS_ABSURDLY_LONG_SYMBOL_NAME" exceeds 31 characters.
> The name has been shortened to "PLEASE_FORGIVE_THIS_ABS1ARO4QU$".
> at line number 1 in file SYS$INPUT:.;

The compiler message shows that the uppercased name is too long and
shortened, so do the CRC on the uppercase name. Use the one Ed showed,
it differs from Hoff's: there is no final XOR. Print the result with
base 32. That seems to do it.
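
The recipe above can be sketched in portable C as follows. This is an illustration rather than the compiler's actual code: the function names are invented here, the lowercase digit set for the AS_IS case is inferred from the shortened names that appear in this thread, and the handling of names of 31 characters or fewer (case-folded but otherwise untouched) is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#define MAX_SYMBOL  31
#define PREFIX_LEN  23
#define CRC_DIGITS   7

/* AUTODIN II CRC seeded with all ones and with no final XOR. */
static uint32_t crc32_no_final_xor(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (; *s != '\0'; s++) {
        crc ^= (unsigned char)*s;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc;
}

/* Writes the shortened form of name into out (at least 32 bytes).
 * as_is != 0 mimics /NAMES=(AS_IS,SHORTENED); 0 mimics /NAMES=SHORTENED. */
static void shorten_symbol(const char *name, int as_is, char out[MAX_SYMBOL + 1])
{
    const char *digits = as_is ? "0123456789abcdefghijklmnopqrstuv"
                               : "0123456789ABCDEFGHIJKLMNOPQRSTUV";
    char folded[256];
    size_t len;

    assert(name != NULL && strlen(name) < sizeof folded);
    len = strlen(name);
    for (size_t i = 0; i <= len; i++)   /* copies the terminator too */
        folded[i] = as_is ? name[i] : (char)toupper((unsigned char)name[i]);

    if (len <= MAX_SYMBOL) {            /* short enough: no checksum needed */
        strcpy(out, folded);
        return;
    }

    uint32_t crc = crc32_no_final_xor(folded);  /* CRC of the case-folded name */
    memcpy(out, folded, PREFIX_LEN);
    for (int i = CRC_DIGITS - 1; i >= 0; i--) { /* least significant digit last */
        out[PREFIX_LEN + i] = digits[crc & 31u];
        crc >>= 5;
    }
    out[PREFIX_LEN + CRC_DIGITS] = '$';
    out[MAX_SYMBOL] = '\0';
}
```

With the symbol used throughout this thread, shorten_symbol(name, 0, buf) gives PLEASE_FORGIVE_THIS_ABS1ARO4QU$, which is the name the compiler reported. Seven base-32 digits hold 35 bits, so the leading digit of a 32-bit CRC is always in the range 0 to 3.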

Craig A. Berry

Oct 22, 2010, 12:08:29 AM

Ah, good catch. I suppose there is richer data (i.e. not all zeros) in
the higher bits without the inversion.

> Print the result with base 32. That seems to do it.

I would never, ever have guessed that. Thanks for pointing it out. And
there are several different base32 encodings, but I found the right one
eventually.
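
As a sanity check on which base-32 variant is involved, the checksum field can be decoded back into a number. The sketch below assumes the field is an ordinary base-32 numeral using the digits 0-9 followed by A-V, which happens to be the digit set the C library's strtoull accepts for base 32; the RFC 4648 alphabet (A-Z, 2-7) cannot be the one in use, since it has no "1" or "8". The function name decode_crc_field is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Decodes the 7-character checksum field as a base-32 number with the
 * digit set 0-9 then A-V (case-insensitive). Returns 1 on success and
 * stores the value in *crc; returns 0 on malformed input or overflow. */
static int decode_crc_field(const char *field, uint32_t *crc)
{
    char *end = NULL;
    unsigned long long value;

    assert(field != NULL && crc != NULL);
    value = strtoull(field, &end, 32);
    if (end == field || *end != '\0' || value > 0xFFFFFFFFull)
        return 0;
    *crc = (uint32_t)value;
    return 1;
}
```

Decoding "1ARO4QU" this way gives 0x55BC135E. Decoding "3rv8rnn" gives 0xF7F46EF7, which is the bitwise complement of the 0x080b9108 printed by the earlier test program, consistent with the hint that the compiler skips the final XOR.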

With those hints I was at last able to reproduce what the compiler does.
My program is attached and includes routines for name shortening if you
don't want to use it as a standalone program. It probably needs more
testing, error handling, and general robustification, but it seems to be
working nicely and produces the same output as what goes in the name
mangler database for ease of comparison. Released under the MIT license.

$ mcr []vms_shorten_symbol "Please_forgive_this_absurdly_long_symbol_name"
PLEASE_FORGIVE_THIS_ABS1ARO4QU$Please_forgive_this_absurdly_long_symbol_name
$ mcr []vms_shorten_symbol "Please_forgive_this_absurdly_long_symbol_name" 1
Please_forgive_this_abs3rv8rnn$Please_forgive_this_absurdly_long_symbol_name


vms_shorten_symbol.c

Jose Baars

Oct 22, 2010, 11:32:48 AM
On 22 Oct, 06:08, "Craig A. Berry" <craigbe...@nospam.mac.com> wrote:

Great! Did some quick tests and it really works!
I missed out on the exact way you came to this, but your tenacity is
much appreciated.

Jose Baars

Oct 22, 2010, 12:43:36 PM
On 22 Oct, 17:32, Jose Baars <peutba...@googlemail.com> wrote:

Just out of interest, on VMS, the crc32 function can be rewritten like
this (based on Hoff's example earlier, and the fact that the AUTODIN table is
already available):

UINT32 crc32( const char *inputdata )
{
    UINT32 crc32;
    UINT32 seed = ~0UL;
    UINT32 Coefficient = 0x0EDB88320UL; /* AUTODIN II */
    UINT32 CRCArray[16];
    struct dsc$descriptor inputdatad;

    inputdatad.dsc$w_length = strlen( inputdata );
    inputdatad.dsc$b_dtype = DSC$K_DTYPE_T;
    inputdatad.dsc$b_class = DSC$K_CLASS_S;
    inputdatad.dsc$a_pointer = (char *)inputdata;

    lib$crc_table( (void *) &Coefficient, (void *) CRCArray );

    crc32 = lib$crc( (void *) CRCArray, (void *) &seed, &inputdatad );

    return ~crc32;
}

Craig A. Berry

Oct 22, 2010, 1:42:09 PM

Jose Baars wrote:
> On 22 okt, 17:32, Jose Baars <peutba...@googlemail.com> wrote:
>
> Just out of interest, on VMS, the crc32 function can be rewritten like
> this ( based of Hoff's example earlier, and the fact the AUTODIN table is
> already available. );


Yep, that certainly works and I'd already confirmed I got the same
checksum using that method and the one I posted. And there are
variations. You can store the precomputed table, or you can retrieve it
from lib$crc_table. Regardless of how you got the table, you can use it
with lib$crc, or with other implementations.

I ended up using a larger table (256 elements instead of the 16 supplied
by lib$crc_table), which allows a faster, simpler calculation to be
used. Speed probably doesn't matter for name shortening like it would
for network packets or other uses, but portability might. By not using
any VMS-specific routines, my code can run on just about anything (and I
tested it on Mac OS X). So a cross-platform package that needs to
include a linker options file in the distribution can do so correctly
regardless of what platform the maintenance work is being done on.
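
For anyone curious what the 256-entry approach looks like, here is a minimal portable sketch. It is not the attached program: it builds the table at first use instead of storing a precomputed one, and the names crc_table, crc_table_init and crc32_table_driven are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t crc_table[256];
static int crc_table_ready = 0;

/* Fills the 256-entry table for the reflected polynomial 0xEDB88320. */
static void crc_table_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[n] = c;
    }
    crc_table_ready = 1;
}

/* One table lookup per input byte; seeded with all ones, no final XOR,
 * which is the variant needed for name shortening per this thread. */
static uint32_t crc32_table_driven(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(s != NULL);
    if (!crc_table_ready)
        crc_table_init();
    for (size_t i = 0, n = strlen(s); i < n; i++)
        crc = crc_table[(crc ^ (unsigned char)s[i]) & 0xFFu] ^ (crc >> 8);
    return crc;
}
```

With a 16-entry table each byte takes two lookups (one per nibble); the larger table trades 1 KB of memory for a single lookup per byte. Inverting the return value gives the conventional CRC-32.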

Thanks once again to "hb" (Becker Ismaning?) who provided the essential
clues to get this working.
