Decompressing content-encoding="asc-gzip"


Bryan W. Taylor

Apr 17, 2003, 6:24:57 PM
I have a file made by a vendor's tool that looks like it gives a MIME
type:
application/x-xfdl;content-encoding="asc-gzip"

It would seem to be some kind of gzip-compressed XML dialect called
XFDL. I want to extract the underlying XML. How do I actually
uncompress this without using vendor tools? I have tried gzip,
but it doesn't recognize the file as a gzip file.
It seems that "asc-gzip" is not the same as "gzip": the file
appears to be in a base-64 ASCII encoding, since it displays as plain
text (no weird characters).

The first three lines of the file are:
--------------start---------------
application/x-xfdl;content-encoding="asc-gzip"
EBHqYHic7V1tc+K2Fv68d+b+B08+tdMSMO+0lDsOmCxNAgmQvc3OzuwYcIizDuZik2T7669sgyRb
L5apA8lGaU82lnT86BwdyZL8IJr/eX6wlUdz5VrO4o8j9bhw9J/Wv//V/KvbOUfJlWM/A6R/aN46
---------------end----------------

The rest of the file is more lines like the last two.

Anybody have any ideas?

liang

Apr 17, 2003, 9:41:22 PM
I guess it can be decoded in this way: first, base-64 decoding; then gzip
decoding.
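
A minimal Python sketch of that two-step idea, assuming the first line of the file is the MIME header and everything after it is base64 text. The function name `naive_decode` is made up for illustration. As later posts in this thread show, the real files fail at the gzip step, so this is the hypothesis being tested rather than a working decoder.

```python
import base64
import gzip


def naive_decode(text: str) -> bytes:
    """Try the literal reading of asc-gzip: base64 first, then gzip."""
    # Assumption: line 1 is the MIME header, the rest is base64.
    _header, _, body = text.partition("\n")
    raw = base64.b64decode(body)  # newlines in the body are ignored
    return gzip.decompress(raw)   # raises an OSError subclass if not gzip


if __name__ == "__main__":
    # Build a file that really is base64(gzip(xml)) to show the happy path.
    payload = base64.encodebytes(gzip.compress(b"<XFDL/>")).decode("ascii")
    sample = 'application/x-xfdl;content-encoding="asc-gzip"\n' + payload
    print(naive_decode(sample))
```

If the decoded bytes are not a real gzip stream, `gzip.decompress` raises instead of returning data, which mirrors the "not in gzip format" complaint from the command-line tool.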

"Bryan W. Taylor" <bryan_w...@yahoo.com> wrote in message
news:11d78c87.0304...@posting.google.com...

Bryan W. Taylor

Apr 18, 2003, 10:37:47 AM
What utilities are there for dealing with base-64?

I know about the perl module MIME::Base64, but it seems rather low
level. I've been trying to convert the individual lines and to
construct a presumably gzip file out of it, but I haven't been able to
get it to work. It seems like I should strip off everything but the
base-64 characters, including the header and the whitespace and
newlines and convert it.

Unfortunately I haven't been able to make this work. I'm wondering if
the gzip format has header bytes that are stripped off before the
base-64'ing.
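
One way to check that hunch is to decode the start of the sample line quoted in the first post and compare it with the gzip magic bytes 0x1f 0x8b from RFC 1952. A small Python sketch (variable names are illustrative):

```python
import base64

# First 12 characters of the first data line quoted earlier in the thread.
sample = "EBHqYHic7V1t"
raw = base64.b64decode(sample)

print(raw.hex())                    # 1011ea60789ced5d6d
print(raw.startswith(b"\x1f\x8b"))  # False: no gzip magic at offset 0
```

The decoded bytes do not begin with 1f 8b, which is consistent with gzip refusing the file. The pair 78 9c at offset 4 matches a common zlib (RFC 1950) stream header, which may be a hint about the real layout.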

By the way, there are lots of these compressed XFDL files out there,
because PureEdge sells products that create them. The US Air Force
uses these. For example, the AF70 is a form for Flight Plans that is
available at http://www.e-publishing.af.mil/formfiles/af/af70/af70.xfd

The challenge is to convert the above file to readable XML.

"liang" <leo19...@hotmail.com> wrote

Kyle Jones

Apr 18, 2003, 1:13:47 PM
Bryan W. Taylor <bryan_w...@yahoo.com> wrote:
> What utilities are there for dealing with base-64?

Here's a C program. It takes base64 on stdin and produces the
decoded bytes on stdout.

/* public domain */

/* BASE64 on stdin -> converted data on stdout */

#include <stdio.h>
#include <stdlib.h>   /* exit() */

#ifdef _WIN32
#ifndef WIN32
#define WIN32
#endif
#endif

#ifdef WIN32
#include <io.h>
#include <fcntl.h>
#endif

unsigned char alphabet[64] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int
main()
{
    static char inalphabet[256], decoder[256];
    int i, bits, c, char_count, errors = 0;

#ifdef WIN32
    /* keep Windows from translating newlines in the binary output */
    _setmode( _fileno(stdout), _O_BINARY);
#endif

    /* build lookup tables: which bytes are base64 digits, and their values */
    for (i = (sizeof alphabet) - 1; i >= 0 ; i--) {
        inalphabet[alphabet[i]] = 1;
        decoder[alphabet[i]] = i;
    }

    char_count = 0;
    bits = 0;
    while ((c = getchar()) != EOF) {
        if (c == '=')
            break;
        /* skip whitespace and anything else outside the alphabet */
        if (c > 255 || ! inalphabet[c])
            continue;
        bits += decoder[c];
        char_count++;
        if (char_count == 4) {
            /* four 6-bit digits collected: emit three bytes */
            putchar((bits >> 16));
            putchar(((bits >> 8) & 0xff));
            putchar((bits & 0xff));
            bits = 0;
            char_count = 0;
        } else {
            bits <<= 6;
        }
    }
    if (c == EOF) {
        if (char_count) {
            fprintf(stderr, "base64 encoding incomplete: at least %d bits truncated\n",
                    ((4 - char_count) * 6));
            errors++;
        }
    } else { /* c == '=' */
        switch (char_count) {
        case 1:
            fprintf(stderr, "base64 encoding incomplete: at least 2 bits missing\n");
            errors++;
            break;
        case 2:
            /* two digits before padding carry one whole byte */
            putchar((bits >> 10));
            break;
        case 3:
            /* three digits before padding carry two whole bytes */
            putchar((bits >> 16));
            putchar(((bits >> 8) & 0xff));
            break;
        }
    }
    exit(errors ? 1 : 0);
}

Mike Marshall

Apr 18, 2003, 2:13:36 PM

>Bryan W. Taylor <bryan_w...@yahoo.com> wrote:
> > What utilities are there for dealing with base-64?

kyle_...@wonderworks.com (Kyle Jones) writes:
>Here's a C program. It takes base64 on stdin and produces the
>decoded bytes on stdout.

Below is one I wrote to help me figure out base64. Do a Google search for
"mpack" to find a production-ready tool...

/* decode(char *quad, char *out)*/
/* notes at end of file */
#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <unistd.h>     /* read(), close() */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
int main(void){
    /* accumulates up to four 6-bit values (24 bits) */
    unsigned int buffer;

    static char b64chars[64] = {
        'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P',
        'Q','R','S','T','U','V','W','X','Y','Z','a','b','c','d','e','f',
        'g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v',
        'w','x','y','z','0','1','2','3','4','5','6','7','8','9','+','/'
    };

    int fd,i;
    unsigned int index;
    char buf[1];

    if ((fd = open("base64file",O_RDONLY,0)) == -1) {
        fprintf(stderr,"can't open input.\n");
        exit(1);
    }

    buffer=0;
    i=0;
    while (read(fd,buf,1) > 0){
        if (buf[0] != '\n') {
            for (index=0;index<64;index++) {
                if (buf[0] == b64chars[index]) break;
            }

            if (buf[0] == '=') {
                /* padding: the i digits seen so far carry i-1 whole bytes */
                int nbytes = i - 1;
                while(i<3) {
                    buffer = buffer<<6;
                    i++;
                }
                if (nbytes >= 1) putchar((buffer >> 16) & 0xff);
                if (nbytes >= 2) putchar((buffer >> 8) & 0xff);
                break;
            }

            if (index > 63) {
                fprintf(stderr,"index out of bounds\n");
                exit(1);
            }

            buffer+=index;
            if (i < 3) buffer = buffer<<6;
            i++;

            if (i>3) {
                /* extract the bytes with shifts so the output does not
                   depend on the machine's byte order */
                putchar((buffer >> 16) & 0xff);
                putchar((buffer >> 8) & 0xff);
                putchar(buffer & 0xff);
                buffer=0;
                i=0;
            }
        }
    }
    close(fd);
    return 0;
}
/*
base64 is a mapping of arbitrary bytes into a representation
that can safely be pumped through all the myriad of gateways,
EBCDIC to ASCII converters and whatever else a mail message
might encounter on the Internet on its way from point a to
point b.

A base64 encoder outputs 4 bytes for each 3 bytes it reads in.

"The encoding process represents 24-bit groups of input bits as
output strings of 4 encoded characters. Proceeding from left to
right across a 24-bit input group, each 6-bit group is used as an
index into an array of 64 safe characters."

*/
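
The 3-bytes-in, 4-characters-out mapping described in those notes can be sketched in a few lines of Python. This is an illustration only (the helper name `encode_group` is hypothetical), and it handles a single full 3-byte group with no padding:

```python
import base64

ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")


def encode_group(three_bytes: bytes) -> str:
    """Encode exactly 3 bytes as 4 base64 characters."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly 3 bytes")
    # Pack the 3 bytes into one 24-bit integer.
    group = int.from_bytes(three_bytes, "big")
    # Walk the 24 bits left to right, 6 bits at a time.
    return "".join(ALPHABET[(group >> shift) & 0x3F]
                   for shift in (18, 12, 6, 0))


if __name__ == "__main__":
    print(encode_group(b"Man"))  # TWFu
    # Cross-check against the standard library encoder.
    print(encode_group(b"Man") == base64.b64encode(b"Man").decode("ascii"))
```

Decoding is the same walk in reverse: look up each character's 6-bit value, accumulate four of them, and split the 24 bits into three bytes.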

Bryan W. Taylor

Apr 18, 2003, 5:35:31 PM
kyle_...@wonderworks.com (Kyle Jones) wrote:
> Bryan W. Taylor <bryan_w...@yahoo.com> wrote:
> > What utilities are there for dealing with base-64?
>
> Here's a C program. It takes base64 on stdin and produces the
> decoded bytes on stdout.

Well, still no joy, although I'm convinced the problem is not the
base64 conversion. Here's what I've tried:

1) Get the file http://www.e-publishing.af.mil/formfiles/af/af70/af70.xfd
2) Remove the header line and put in a file called base64file
3A) Convert base64file to output.xfd.gz using the C utility posted
3B) Convert base64file to output2.xfd.gz using perl code based on
MIME::Base64
4) Check md5sum: 4cd2134b23ae1e41fee6403ececdbd03 for both output
files [this tells me that I'm likely getting the base64 conversion
correct]
5) gunzip the resulting files to get
gunzip: out.xfd.gz: not in gzip format

So I question whether this is really just a simple base64'd gzip file.
I'm not sure how to make progress at this point. Some possibilities
are:
1) They are applying some obfuscating transformation after gzipping
2) They are stripping gzip header information before base64 encoding
3) It's not really base64, it just appears that way

I guess I'm stuck unless anybody else has some ideas. Thanks already
for the help

Kjetil Torgrim Homme

Apr 18, 2003, 11:49:45 PM
[Bryan W. Taylor]:

>
> So I question whether this is really just a simple base64'd gzip file.
> I'm not sure how to make progress at this point. Some possibilities
> are:
> 1) They are applying some obfuscating transformation after gzipping
> 2) They are stripping gzip header information before base64 encoding
> 3) It's not really base64, it just appears that way
>
> I guess I'm stuck unless anybody else has some ideas. Thanks already
> for the help

hi, I tacked four bytes of GZIP magic at the top, and gzip gave me:

<?xml version="1.0"?>
<XFDL version="5.1.0">
<vfd_date>14/11/2002</vfd_date>
<formid content="array">

I got a CRC error, eventually. I guess you'll get better results if
you write your own small program using zlib.

--
Kjetil T. | read and make up your own mind
| http://www.cactus48.com/truth.html
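
For readers who want to try the "small program using zlib" route, here is a rough Python sketch. It rests on an assumption not confirmed at this point in the thread: that raw deflate data starts 6 bytes into the base64-decoded content, which is where a gzip reader would begin if 4 prepended magic bytes (with FLG = 0) completed a 10-byte gzip header. The function name and the default offset are hypothetical.

```python
import zlib


def inflate_first_piece(raw: bytes, offset: int = 6):
    """Inflate one raw deflate stream starting at `offset`.

    Returns (decompressed_bytes, leftover_bytes).
    """
    # Negative wbits tells zlib to expect raw deflate: no header, no checksum.
    d = zlib.decompressobj(-zlib.MAX_WBITS)
    data = d.decompress(raw[offset:])
    # Whatever follows the end of the deflate stream lands in unused_data.
    return data, d.unused_data


if __name__ == "__main__":
    # Synthetic stand-in for decoded file contents: 4 placeholder bytes,
    # one zlib stream, then trailing bytes that belong to something else.
    z = zlib.compress(b'<?xml version="1.0"?>')
    raw = b"\x00\x00\x00\x00" + z + b"NEXT"
    xml, rest = inflate_first_piece(raw)
    print(xml)
    print(len(rest))
```

Non-empty leftover data after the first stream would be one possible reason a plain gzip run produces some output and then ends in an error.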

Bryan W. Taylor

Apr 19, 2003, 5:46:18 PM
Kjetil Torgrim Homme <kjet...@haey.ifi.uio.no> wrote
> [Bryan W. Taylor]:
> >
> > So I question whether this is really just a simple base64'd gzip file.
> > I'm not sure how to make progress at this point. Some possibilities
> > are:
> > 1) They are applying some obfuscating transformation after gzipping
> > 2) They are stripping gzip header information before base64 encoding
> > 3) It's not really base64, it just appears that way
> >
> > I guess I'm stuck unless anybody else has some ideas. Thanks already
> > for the help
>
> hi, I tacked four bytes of GZIP magic at the top, and gzip gave me:
>
>
> <?xml version="1.0"?>
> <XFDL version="5.1.0">
> <vfd_date>14/11/2002</vfd_date>
> <formid content="array">
>
> I got a CRC error, eventually. I guess you'll get better results if
> you write your own small program using zlib.

That's it! I guess you've demonstrated that it's case #2.

Just to confirm, I think you are saying you did the following:
1) remove the first line
2) decode the rest as base64
3) Add four bytes to the front of the raw binary output
4) decompress with gzip

I suppose you omitted saying what those four bytes are so that I'd
have to go read the gzip RFC. Google takes me to RFC1952 at
http://www.ietf.org/rfc/rfc1952.txt

It says the file format starts:
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
where ID1 = \x1f
      ID2 = \x8b
      CM  = the compression method, usually \x08 (deflate)
  and FLG = a bit-mapped flag byte:
      bit 0 FTEXT
      bit 1 FHCRC
      bit 2 FEXTRA
      bit 3 FNAME
      bit 4 FCOMMENT

Care to say what value of FLG you tried?
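
To make the header layout concrete, here is a small Python sketch that unpacks the fixed 10-byte gzip header and names the FLG bits listed above. The function and dictionary names are illustrative; the field order and bit assignments follow RFC 1952.

```python
import gzip
import struct

FLAG_BITS = {0: "FTEXT", 1: "FHCRC", 2: "FEXTRA", 3: "FNAME", 4: "FCOMMENT"}


def parse_gzip_header(data: bytes) -> dict:
    """Unpack the fixed 10-byte header at the start of a gzip member."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # '<' = little-endian with no padding; MTIME is stored LSB first.
    id1, id2, cm, flg, mtime, xfl, os_ = struct.unpack("<BBBBIBB", data[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip header")
    flags = [name for bit, name in FLAG_BITS.items() if flg & (1 << bit)]
    return {"cm": cm, "flags": flags, "mtime": mtime, "xfl": xfl, "os": os_}


if __name__ == "__main__":
    print(parse_gzip_header(gzip.compress(b"hello")))
```

Running this on the first bytes of any ordinary gzip file shows which FLG value that file carries, which matters if those bytes are being borrowed as a prefix.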

Kjetil Torgrim Homme

Apr 19, 2003, 9:41:47 PM
[Bryan W. Taylor]:

>
> Care to say what value of FLG you tried?

I just copied the first four bytes of a gzip file I had lying around.
that may be why I got a CRC after a while :-)

Bryan W. Taylor

Apr 21, 2003, 1:32:12 AM
Kjetil Torgrim Homme <kjet...@haey.ifi.uio.no> wrote :

> I just copied the first four bytes of a gzip file I had lying around.
> that may be why I got a CRC after a while :-)

No, it's gzipped in pieces. The Perl code below will decompress a file
whose type is listed as application/x-xfdl;content-encoding="asc-gzip"

After the base64 is unwound, the result is a sequence of gzipped pieces
with non-standard headers and trailers, as compared to RFC 1952. The
actual data encoding appears to be standard deflate per RFC 1951, which
means that standard zlib-based tools will work on the individual pieces.
I use PerlIO::gzip below, which does the real work.

The first two bytes of each piece give the length of the compressed
piece. The next two bytes give the length of the uncompressed piece
(which is generally 60000 = 0xEA60 for all but the last piece). Then
there are four bytes that I haven't quite figured out (usually 0x789C
for the first two bytes and often 0xED5D or 0xED9D for the last
two). Next comes the compressed data block. After each data block is
a 4-byte trailer. I don't understand this trailer either, but it
doesn't matter because the code below seems to work.

-------------------------------------------------------
#! /usr/bin/perl
# This program is licenced under the GNU Public licence (GPL)
# and/or the Perl Artistic Licence

use MIME::Base64;
use PerlIO::gzip;

# Process command line arguments
#Set to -D0 for quiet. Set to -D2 or -D3 for more information
print "Usage: gunzip_xfd.pl [-h] [-d(0|1|2|3)] infile [outfile] \n"
and exit
if (uc(substr($ARGV[0],0,2)) eq "-H") or not $ARGV[0];
$debug = (uc(substr($ARGV[0],0,2)) eq "-D") ? substr(shift,2) : 1;
$infile = shift or die "Give filename to decompress\n";
$outfile = shift or $outfile = "output.xfd";
unlink $outfile; #destroy previous copy

open(INFILE, "< $infile") or die "can't open $infile: $!";
print "Decompressing file: $infile to $outfile\n";

# Drop first line: application/x-xfdl;content-encoding="asc-gzip"
my $firstline = <INFILE>;
$base64 .= $_ while (<INFILE>);
my $rawbytes = decode_base64($base64);

print "decoded ". length($rawbytes). " bytes from base-64 encoding\n"
if $debug;
$start = 0; # this is the starting byte of the current piece
$start = process_piece($start, $rawbytes) until $start >=
length($rawbytes) ;
print "gunzipped $start bytes in $cnt pieces\n" if $debug;

close(INFILE);


sub process_piece {

my $start = shift;
my $rawbytes = shift;

my $header1 = "\x1f\x8b\x08\x00";
my $piecelengthhex = substr($rawbytes, $start, 2);
my $header2 = substr($rawbytes, $start+2, 6);
my $header = $header1 . $piecelengthhex . $header2;

my $datastart = $start + length($piecelengthhex . $header2);
my $end = $start + unpack("n", $piecelengthhex);

my $piecebytes = substr($rawbytes, $datastart, $end - $datastart);

print $cnt . " gunzip_piece at=$start hex=" .
unpack("H4",pack("n",$start))
if $debug >= 2; $cnt++;
gunzip_piece($outfile, $header, $piecebytes);
return $end + 4; # skip four byte trailer
}

sub gunzip_piece {
my $outfile = shift;
my $gzfile = "$outfile.gz";
my $header = shift;
my $piecebytes = shift;

print " Header: " . unpack("H8", substr($header,0,4)) .
"-" . unpack("H8", substr($header,4,4)) .
"-" . unpack("H8", substr($header,8,4)) . "\n" if
$debug >= 3;
print "Piece startbytes: " .
unpack("H8", substr($piecebytes,0,4)) if $debug >= 3;
print " Piece endbytes: " .
unpack("H8", substr($piecebytes,-4)) if $debug >= 3;

open(GZFILE, "> $gzfile");
print GZFILE $header . $piecebytes;
# Piecelength doesn't count the 4 magic gzip bytes
$piecelength = length($header . $piecebytes) - 4;
print " Decompressed $piecelength bytes\n" if $debug >= 2;
close(GZFILE);

# Now un-gzip this piece to the output file
open(GZFILE, "<:gzip(autopop)", $gzfile) or die $!;
open(OUTFILE, ">> $outfile");

$piecestr = "";
$piecestr .= $line while ($line = <GZFILE>);
print OUTFILE $piecestr;

close(GZFILE);
unlink $gzfile;
close(OUTFILE);
}
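
The same piece layout can also be sketched in Python with only the standard library. This is an illustration that has not been run against real XFDL files, not a port of the script above, and it makes assumptions beyond what the post establishes: that the 0x78 0x9C bytes are an ordinary zlib (RFC 1950) stream header, that the 4-byte trailer is that stream's Adler-32 checksum, and that the 2-byte compressed length counts the whole zlib stream including both. The function name `decode_asc_gzip` is made up for this example.

```python
import base64
import struct
import zlib


def decode_asc_gzip(text: str) -> bytes:
    """Decode the text of an asc-gzip file into the underlying bytes."""
    # Drop the first line: application/x-xfdl;content-encoding="asc-gzip"
    _header, _, body = text.partition("\n")
    raw = base64.b64decode(body)

    out = []
    pos = 0
    while pos + 4 <= len(raw):
        # Two big-endian 16-bit lengths: compressed size, uncompressed size.
        comp_len, plain_len = struct.unpack_from(">HH", raw, pos)
        piece = raw[pos + 4 : pos + 4 + comp_len]
        # Assumption: each piece is a complete zlib stream, so zlib also
        # verifies the trailing checksum here.
        data = zlib.decompress(piece)
        if len(data) != plain_len:
            raise ValueError(
                f"piece at {pos}: expected {plain_len} bytes, got {len(data)}")
        out.append(data)
        pos += 4 + comp_len
    return b"".join(out)
```

Usage might look like reading the whole file as text, passing it to `decode_asc_gzip`, and writing the returned bytes to output.xfd in binary mode. If the assumptions are wrong for a given file, `zlib.decompress` or the length check should fail loudly rather than produce truncated output.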

Bharat K

Dec 28, 2004, 9:21:24 AM
I have a document that is about 70 KB. The Perl script above extracts
only about 30% of the file. My PureEdge document is about 3 pages long,
and I need to create a script that takes data from a DB, fills out
about 10000 applications, and sends them in a batch.
So I need to get a readable XML format from the PureEdge XFDL before I
can go ahead with my script.
I also get the following error:

Decompressing file: Filled.xfd to output.xfd
decoded 56732 bytes from base-64 encoding
No such file or directory at gunzip_xfd.pl line 101, <INFILE> line 997.

The file Filled.xfd is about 70 KB. When I open output.xfd, it
contains only 14 of the roughly 50 columns in the form.
Could someone help me out here?

Bharat K

Jan 4, 2005, 6:50:42 AM
If someone can do it on their machines, then that would be OK too. I
can send the file over.