My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)

robic0

unread,

Dec 21, 2005, 2:59:06 AM12/21/05

to

This post is in response to someone who asked for help trying to
parse xml into a data structure. The poster couldn't install
XML::Parser or XML::Simple. I replied a few times with some
partial code. Good to my word, here is the core of a cut & paste
non-Perl-module based, raw, robust data xml parser into Perl
data structures. Its about 140 lines of code. I imagine its
about 3 times faster than the XML parsers out there, didn't time
it. It doesen't use the overhead of SAX or nodes.

This installment is released prematurely without the fancy
XML::Simple options yet. This is a typical "force array"
version (see the sub's below). I wanted to wait until tommorow
to post this but, I already know how to do it but don't have the
time tonight, however this is fairly final, and so I release
it with the understanding that its shortcomings will be fixed
in a day or so.

I've spent 4 days on this. You have to read between the lines
to insert your xml file open or just cut and paste your xml
to $gabage1. I've left that part up to you. The output
and data are legitimate. It won't look like XML:Simple
in the default settings. I maintaine a root here and some other
things. However, I will post a mod tommorow. The output and
parsing is completely legitimate. The parsing is probably
much faster than the modules on CPAN.

Let me know if you have any suggestions for improvement.
I want to keep it under 200 lines for a complete cut & paste
solution. It doesen't use any parser out there. Its parser
is built in. I don't think this method is used anywhere
in the XML world, you may want to check for possible multiple
speed enhancement.

Posting changes tommorow on this.
Contact info:
email: robic0-AT-yahoo.com

========================================================
use strict;
use warnings;
use Data::Dumper;

open DATA "datafile" or die "can't open datafile...";
my $gabage1 = <DATA>;
close DATA;

my @xml_files = ($gabage1);

my $debug = 0;
my $rmv_white_space = 1;

## -- XML start & end regexp substitution delimeter chars --
## match side , substitution side
## -----------------------/-------------------------
my @S_dlim = ('\[' , '['); # use these for reading (debug)
my @E_dlim = ('\]' , ']');
#my @S_dlim = (chr(140) , chr(140)); # use these for production
#my @E_dlim = (chr(141) , chr(141));

for (@xml_files)
{
if ($rmv_white_space) {
s/>[\s]+</></g;
s/[\s]+</</g;
s/>[\s]+/>/g;
}
print "\n",'='x30,"\n$_\n\n" if ($debug);

my $ROOT = {}; # container
my ($last_cnt, $cnt, $i) = (-1, 1, 0);

# should only need 2 iterations max, but wth
while ($cnt != $last_cnt && $i < 20)
{
$last_cnt = $cnt;

## <?XML-Version ?> , have to check the format of '<?'
while (s/<\?([^<>]*)\?>//i) {} # to void xml
versioning
# while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = \n" if ($debug); $cnt++}

## 
# while (s///i) {} # to void comments
while (s//$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt  = $1\n" if ($debug);
$ROOT->{$cnt} = { comment => $1 };
$cnt++;
}
# Comments, need to have "anything but 
here" (revisit)
# while
(s/)]*)-->/$S_dlim[1]$cnt$E_dlim[1]/i) { print "$cnt
 = $1\n" if ($debug); $cnt++}

## <Tag/> , no content
while
(s/<([0-9a-zA-Z]+)\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = \n" if ($debug);
$ROOT->{$cnt} = { $1 => '' };
$cnt++;
}
## <Tag Attributes/> , no content
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[
]*"[^<]*")+[ ]*\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2\n" if ($debug);
$ROOT->{$cnt} = { $1 => getAttrHash($2) };
$cnt++;
}
## <Tag> Content </Tag>
while
(s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = $2\n" if ($debug);
my $unknown = '';
if (length($2) > 0) {
my ($key); my $hcontent =
getContentHash($2, $ROOT);
if (keys (%{$hcontent}) > 1) {
$unknown = $hcontent;
}
else { ($key,$unknown) = each
(%{$hcontent}); }
}
$ROOT->{$cnt} = { $1 => $unknown };
$cnt++;
}
## <Tag Attributes> Content </Tag>
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[
]*"[^<]*")+[ ]*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2, content: $3\n" if
($debug);
my $hattrib = getAttrHash($2);
my $hcontent = getContentHash($3, $ROOT);

while (my ($key,$val) = each (%{$hcontent})) {
$hattrib->{$key} = $val;
}
$ROOT->{$cnt} = { $1 => $hattrib };
$cnt++;
}
$i++ if ($last_cnt != $cnt);
}
if (/<|>/) {
print "($i) XML problem, malformed, syntax or tag
closure:\n$_";
} else {
print "$i itterations\n\n";
#print Dumper($ROOT);
my $outer_element = $cnt-1;
if (exists $ROOT->{$outer_element}) {
my $tmp = {};
%{$tmp} = %{$ROOT->{$outer_element}};
print Dumper($tmp);
}
}
}
##
sub getAttrHash
{
my $attstr = shift;
my $ahref = {};
return $ahref unless (defined $attstr);
while ($attstr =~ s/[ ]*([0-9a-zA-Z]+)[ ]*=[ ]*"([^=]*)"[
]*//i) {
$ahref->{$1} = $2;
}
return $ahref;
}
##
sub getContentHash
{
my ($attstr,$hStore) = @_;
my $ahref = {};
return $ahref unless (defined $attstr && defined $hStore);
my @ary = ();
while ($attstr =~
s/([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//i) {
if (defined $1) {
push (@ary, $1);
}
elsif (defined $2 && exists $hStore->{$2}) {
my ($key,$val) = each (%{$hStore->{$2}});

# here, force array is in effect (aka: simple)
# (this will be modified in a day or so)
################
if (exists $ahref->{$key})
{
#print "getChash - $key\n";
push (@{$ahref->{$key}}, $val);

} else {
$ahref->{$key} = [$val];
# $ahref->{$key} = $val;
}
################
}
}
if (scalar(@ary) == 1) {
$ahref->{'content'} = $ary[0];
} elsif (scalar(@ary) > 1) {
$ahref->{'content'} = [@ary];
}
return $ahref;
}

__END__

$VAR1 = {
'document' => {
'WMSNameSpaceVersion' => '2.0',
'comment' => [
' Control Protocol ',
' Data Protocol ',
' Feedback Protocol ',
' Network Source '
],
'node' => [
{
'opcode' => 'create',
'comment' => [
' Object Store
'
],
'name' => 'Control Protocol',
'node' => [
{
'opcode' =>
'create',
'comment' => [
'
RTSP ',
'
Sessionless Multicast '
],
'name' =>
'Object Store',
'node' => [
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTSP',

'node' => [

{

'opcode' => 'create',

'value' => '{308786f0-8b15-11d2-b25f-006097d2e41e}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'RTSP,RTSPA,RTSPT,RTSPU,RTSPM',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'Sessionless Multicast',

'node' => [

{

'opcode' => 'create',

'value' => '{f9377800-f38d-11d2-b26c-006097d2e41e}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'MCAST,RTP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
}
]
},
{
'opcode' =>
'create',
'name' =>
'Shared Properties'
}
]
},
{
'opcode' => 'create',
'comment' => [
' Object Store
'
],
'name' => 'Data Protocol',
'node' => [
{
'opcode' =>
'create',
'comment' => [
'
RTP ',
'
RTP/ASF ',
'
RTP/AVP ',
'
RTP/FEC ',
'
RTP/WMS-FEC '
],
'name' =>
'Object Store',
'node' => [
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP',

'node' => [

{

'opcode' => 'create',

'value' => '{cbfb2e20-ab7b-11d2-b261-006097d2e41e}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'x-asf-pf',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP/ASF',

'node' => [

{

'opcode' => 'create',

'value' => '{149a44be-dc14-4e94-9cb0-c0268e77df9e}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'x-asfv2-pf,x-asfv2-grp-pf,x-asfv2-frag-pf',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP/AVP',

'node' => [

{

'opcode' => 'create',

'value' => '{d7335e2e-62eb-4ad0-96cd-b31c9d0f9f85}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'PCMU,L8,L16,MPA,G726-24,G726-40',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP/FEC',

'node' => [

{

'opcode' => 'create',

'value' => '{02DEFE42-F8FC-11d2-8670-00C04F6890ED}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'parityfec',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTP/WMS-FEC',

'node' => [

{

'opcode' => 'create',

'value' => '{EDAB8E6B-746C-40db-A885-9E4A9EEF27A2}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'wms-fec',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
}
]
},
{
'opcode' =>
'create',
'name' =>
'Shared Properties'
}
]
},
{
'opcode' => 'create',
'comment' => [
' Object Store
'
],
'name' => 'Feedback Protocol',
'node' => [
{
'opcode' =>
'create',
'comment' => [
'
RTCP '
],
'name' =>
'Object Store',
'node' => [
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'RTCP',

'node' => [

{

'opcode' => 'create',

'value' => '{ecfddc81-184e-11d3-ae84-00a0c95ec3f0}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'x-wms-rtx',

'name' => 'Format',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => 'RTP/AVP',

'name' => 'Protocol',

'type' => 'string'

}

]

}

]
}
]
},
{
'opcode' =>
'create',
'name' =>
'Shared Properties'
}
]
},
{
'opcode' => 'create',
'comment' => [
' Object Store
',
' Shared
Properties '
],
'name' => 'Network Source',
'node' => [
{
'opcode' =>
'create',
'comment' => [
'
WMS Http Network Source ',
'
WMS Mms Network Source ',
'
WMS Msbd Network Source ',
'
WMS Network Source '
],
'name' =>
'Object Store',
'node' => [
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'WMS Http Network Source',

'node' => [

{

'opcode' => 'create',

'value' => '{566A2EFF-5651-4020-AC1A-EB48E4571EA3}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'HTTP',

'name' => 'Source Type',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x50',

'name' => 'DefaultHttpServerPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1bb',

'name' => 'DefaultHttpServerSSLPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x8',

'name' => 'PacketBuffers',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'EnableHTTP1_1',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1e',

'name' => 'OpenTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x64',

'name' => 'SecondSegmentTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ControlAdapter',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x55',

'name' => 'PercentBWUsageForAccelStreaming',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3',

'name' => 'Proxy Setting',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ProxyHostName',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x50',

'name' => 'ProxyPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'ProxyBypassForLocal',

'type' => 'int32'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'WMS Mms Network Source',

'node' => [

{

'opcode' => 'create',

'value' => '{DCF6C8B2-F6C0-461b-82DA-35945EADF54A}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'MMS,MMST,MMSU',

'name' => 'Source Type',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x6db',

'name' => 'DefaultServerPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x4',

'name' => 'MaxReadHeaderRetries',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x8',

'name' => 'PacketBuffers',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'FirstDropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropBurstDuration',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'PacketPairDropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x2',

'name' => 'NackAlgorithm',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'NackRateMultiplier',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x5dc',

'name' => 'NackBurst',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3e8',

'name' => 'NackTraceInterval',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'NackRetry',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'IgnoreServerVersion',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'EnableMmsDistribution',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'AssertStrangeErrors',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x5a',

'name' => 'InactivityTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x20',

'name' => 'OpenTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x55',

'name' => 'PercentBWUsageForAccelStreaming',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'FunnelAdapter',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ControlAdapter',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'Proxy Setting',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ProxyHostName',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x6db',

'name' => 'ProxyPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'ProxyBypassForLocal',

'type' => 'int32'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'WMS Msbd Network Source',

'node' => [

{

'opcode' => 'create',

'value' => '{FB74F625-7D25-4455-B840-7B870B5B9322}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'ASFM',

'name' => 'Source Type',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x8',

'name' => 'PacketBuffers',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'FirstDropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropBurstDuration',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3a98',

'name' => 'McastTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'EnableIGMPv3',

'type' => 'int32'

}

]

}

]
},
{

'opcode' => 'create',

'comment' => [

' Properties '

],

'name' => 'WMS Network Source',

'node' => [

{

'opcode' => 'create',

'value' => '{ad763fa6-3b90-41ab-bd44-4f832beee55f}',

'name' => 'CLSID',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'Enabled',

'type' => 'int32'

},

{

'opcode' => 'create',

'name' => 'Properties',

'node' => [

{

'opcode' => 'create',

'value' => 'RTSP,XSDP,RTP,RTSPA,RTSPT,RTSPU,RTSPM',

'name' => 'Source Type',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'EnableATM',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'MaximumMTU',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x14',

'name' => 'FirewallTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1e',

'name' => 'OpenTimeout',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'RtxDropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'FirstDropGracePeriod',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'DropBurstDuration',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'PacketPairDropProb',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x2',

'name' => 'NackAlgorithm',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'NackRateMultiplier',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x5dc',

'name' => 'NackBurst',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3e8',

'name' => 'NackTraceInterval',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x1',

'name' => 'NackRetry',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'BurstProtection',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'EmulateNetworkDisconnect',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'AssertStrangeErrors',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x55',

'name' => 'PercentBWUsageForAccelStreaming',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'Proxy Setting',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '',

'name' => 'ProxyHostName',

'type' => 'string'

},

{

'opcode' => 'create',

'value' => '0x22a',

'name' => 'ProxyPort',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x0',

'name' => 'ProxyBypassForLocal',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x3e8',

'name' => 'PktGracePeriodAtEOSForBPP',

'type' => 'int32'

},

{

'opcode' => 'create',

'value' => '0x9c4',

'name' => 'PktGracePeriodAtEOSForODP',

'type' => 'int32'

}

]

}

]
}
]
},
{
'opcode' =>
'create',
'name' =>
'Shared Properties',
'node' => [
{

'opcode' => 'create',

'name' => 'Local'
}
]
}
]
}
]
}
};

__DATA__

</node>

</node>

</node>

</node>

</node>

</node>

</node>

</node>

</node>

</node>

</node>

</node>

Tad McClellan

unread,

Dec 21, 2005, 9:10:56 AM12/21/05

to

robic0 <> wrote:

> ## -- XML start & end regexp substitution delimeter chars --

delimeter: noun, scale used to weigh and price cold cuts.
also the unit of length for salamis. -- Uri Guttman

(Message-ID: <x74quoa...@mail.sysarch.com>)

--
Tad McClellan SGML consulting
ta...@augustmail.com Perl programming
Fort Worth, Texas

mirod

unread,

Dec 21, 2005, 4:01:21 PM12/21/05

to

robic0 wrote:
> This post is in response to someone who asked for help trying to
> parse xml into a data structure. The poster couldn't install
> XML::Parser or XML::Simple. I replied a few times with some
> partial code. Good to my word, here is the core of a cut & paste
> non-Perl-module based, raw, robust data xml parser into Perl
> data structures. Its about 140 lines of code. I imagine its
> about 3 times faster than the XML parsers out there, didn't time
> it. It doesen't use the overhead of SAX or nodes.

This does not seem to be an XML parser. For example a (very!) cursory
glance seems to indicate that it considers [0-9a-zA-Z]+ to be a NAME
(tag or attribute name), where the XML spec shows it is a tad more
complex (see http://www.xml.com/axml/target.html#NT-Name).

Writing a complete XML parser is fairly hard, indeed a lot harder than
writing a quasi-XML parser, like what you wrote.

You could have refered the OP to SOAP::Lite
(http://search.cpan.org/dist/SOAP-Lite/), which includes a pure-perl
XML::Parser replacement (with some explicit limitations).

As it is I think your code is a bit dangerous, as it risks being
re-used by people who will not understand its limitations

--
mirod

Matt Garrish

unread,

Dec 21, 2005, 4:11:23 PM12/21/05

to

"mirod" <mi...@xmltwig.com> wrote in message
news:1135198880.9...@z14g2000cwz.googlegroups.com...

> robic0 wrote:
>> This post is in response to someone who asked for help trying to
>> parse xml into a data structure. The poster couldn't install
>> XML::Parser or XML::Simple. I replied a few times with some
>> partial code. Good to my word, here is the core of a cut & paste
>> non-Perl-module based, raw, robust data xml parser into Perl
>> data structures. Its about 140 lines of code. I imagine its
>> about 3 times faster than the XML parsers out there, didn't time
>> it. It doesen't use the overhead of SAX or nodes.
>
> This does not seem to be an XML parser. For example a (very!) cursory
> glance seems to indicate that it considers [0-9a-zA-Z]+ to be a NAME
> (tag or attribute name), where the XML spec shows it is a tad more
> complex (see http://www.xml.com/axml/target.html#NT-Name).
>
> Writing a complete XML parser is fairly hard, indeed a lot harder than
> writing a quasi-XML parser, like what you wrote.
>

It's always good to point out garbage when one sees it, but it's well known
(proven through numerous posts) that rob knows nothing about xml or markup
languages in general. He's probably just looking for an excuse to swear and
call himself a code god (or whatever he's into these days), so don't be
surprised if that's what you get (i.e., don't bother responding).

Matt

robic0

unread,

Dec 21, 2005, 8:09:19 PM12/21/05

to

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

>Posting changes tommorow on this.
>Contact info:
>email: robic0-AT-yahoo.com
>

Alot of bug fixes and modifications.
The first version had many problems.
This is clean version (.9) with options:
ForceArray
Keeproot.
Keepcomments

This works exceptionally well... Let me know
if you try it.
I'm so burned out on this there probably won't
be any updates for along time unless otherwise
if'n I change my mind.

See ya

print <<EOM;
# XML Regex Parser
# Version .9
# 12/21/05
# Copyright 2005,
# by robic0-At-yahoo.com
# -----------------------
EOM

use strict;
use warnings;
use Data::Dumper;

#open DATA, "datafile" or die "can't open datafile...";
#my $gabage1 = <DATA>;
#close DATA;

my $gabage2 = '

<XMLDATA>
<Submission SubmissionID="688904">
<Category CategoryName="Storage/Adapter or Controller">
<Driver FolderName="driver000">
<Language LanguageName="English">
<PackageCreationLocation
FolderName="G:\truyen\WHQL\Athena\raid\driver" />
</Language>
</Driver>
</Category>
</Submission>
</XMLDATA>
';

my $gabage3 = '

<big name="asdf" date="33" >
asdf
<in1>

<in2>jjjj</in2>

<in3>asbefas</in3>
</in1>
asdfb
</big>

';

my @xml_strings = ($gabage2, $gabage3);

my $VERSION = .9;

my $debug = 0;
my $rmv_white_space = 1;

my $ForceArray = 0;
my $KeepRoot = 0;
my $KeepComments = 1;

## -- XML, start & end regexp substitution delimiter chars --

## match side , substitution side
## -----------------------/-------------------------
my @S_dlim = ('\[' , '['); # use these for reading (debug)
my @E_dlim = ('\]' , ']');
#my @S_dlim = (chr(140) , chr(140)); # use these for production
#my @E_dlim = (chr(141) , chr(141));

for (@xml_strings)
{

print "\n",'='x30,"\n$_\n\n";

if ($rmv_white_space) {
s/>[\s]+</></g;
s/[\s]+</</g;
s/>[\s]+/>/g;
}

my $ROOT = {}; # container
my ($last_cnt, $cnt, $i) = (-1, 1, 0);

# should only need 2 iterations max, but wth
while ($cnt != $last_cnt && $i < 20)
{
$last_cnt = $cnt;

## <?XML-Version ?> , have to check the format of '<?'
while (s/<\?([^<>]*)\?>//i) {} # to void xml
versioning
# while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = \n" if ($debug); $cnt++}

## 
if (!$KeepComments) {

while (s///i) {} # to void
comments

} else {

my $hcontent = getContentHash($2,
$ROOT);
$unknown = $hcontent;

if (keys (%{$hcontent}) > 1) {

if (!$ForceArray) {
adjustForSingleItemArrays ($hcontent); }
} elsif (exists $hcontent->{'content'}
&& scalar(@{$hcontent->{'content'}}) == 1) {

if ($ForceArray ) {
$unknown =
$hcontent->{'content'};
} else {
$unknown =
${$hcontent->{'content'}}[0];

}
}
}
$ROOT->{$cnt} = { $1 => $unknown };
$cnt++;
}
## <Tag Attributes> Content </Tag>
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[
]*"[^<]*")+[ ]*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2, content: $3\n" if
($debug);
my $hattrib = getAttrHash($2);

if (length($3) > 0) {

my $hcontent = getContentHash($3,
$ROOT);

if (keys (%{$hcontent}) > 1) {

if (!$ForceArray) {
adjustForSingleItemArrays ($hcontent); }

}
while (my ($key,$val) = each
(%{$hcontent})) {
$hattrib->{$key} = $val;
}
}
$ROOT->{$cnt} = { $1 => $hattrib };
$cnt++;
}

if ($last_cnt != $cnt) {
$i++ ; print "** End pass $i\n" if ($debug);
}
}
if (/<|>/) {
print "($i) XML problem: malformed, syntax or tag
closure:\n$_";
} else {
print "\n** Itterations = $i\n** ForceArray =
$ForceArray\n** KeepRoot = $KeepRoot\n** KeepComments =
$KeepComments\n\n";

#print Dumper($ROOT);
my $outer_element = $cnt-1;
if (exists $ROOT->{$outer_element}) {

my $htodump = $ROOT->{$outer_element};
if (!$KeepRoot && keys (%{$htodump}) == 1) {
my ($key,$val) = each (%{$htodump});
$htodump = $val;
}
my $tmp = {};
%{$tmp} = %{$htodump};
print Dumper($tmp);
} else {print "nothing to output!\n";}
}
}
##
sub adjustForSingleItemArrays
{
my $href = shift;
## if $val is an array ref and has one element
## set $href->{$key} equal to the element
while (my ($key,$val) = each (%{$href})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1) {
$href->{$key} = $val->[0];

}
}
}
}
##
sub getAttrHash
{
my $attstr = shift;
my $ahref = {};
return $ahref unless (defined $attstr);
while ($attstr =~ s/[ ]*([0-9a-zA-Z]+)[ ]*=[ ]*"([^=]*)"[
]*//i) {
$ahref->{$1} = $2;
}
return $ahref;
}
##
sub getContentHash
{
my ($attstr,$hStore) = @_;
my $ahref = {};
return $ahref unless (defined $attstr && defined $hStore);
my @ary = ();
while ($attstr =~
s/([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//i) {
if (defined $1) {
push (@ary, $1);
}
elsif (defined $2 && exists $hStore->{$2}) {
my ($key,$val) = each (%{$hStore->{$2}});

if (exists $ahref->{$key}) {

push (@{$ahref->{$key}}, $val);
} else {
$ahref->{$key} = [$val];
}
}
}

if (scalar(@ary) > 0) { $ahref->{'content'} = [@ary]; }
## if $val is an array ref and has one element and it
## is a hash ref, set {$key} equal to hash ref
if (!$ForceArray) {
while (my ($key,$val) = each (%{$ahref})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1 &&
ref($val->[0]) eq "HASH") {
$ahref->{$key} = $val->[0];
}
}
}
}
return $ahref;
}

__END__

# XML Regex Parser
# Version .9
# 12/21/05
# Copyright 2005,
# by robic0-At-yahoo.com
# -----------------------

==============================

** Itterations = 2
** ForceArray = 0
** KeepRoot = 0
** KeepComments = 1

$VAR1 = {
'Submission' => {
'SubmissionID' => '688904',
'Category' => {
'Driver' => {
'Language'
=> {

'LanguageName' => 'English',

'PackageCreationLocation' => {

'FolderName' => 'G:\\truyen\\WHQL\\Athena\\raid\\driver'

}

},
'FolderName'
=> 'driver000'
},
'CategoryName' =>
'Storage/Adapter or Controller'
}
}
};

==============================

<big name="asdf" date="33" >
asdf
<in1>

<in2>jjjj</in2>

<in3>asbefas</in3>
</in1>
asdfb
</big>

** Itterations = 1
** ForceArray = 0
** KeepRoot = 0
** KeepComments = 1

$VAR1 = {
'date' => '33',
'name' => 'asdf',
'content' => [
'asdf',
'asdfb'
],
'in1' => {
'small' => {
'ueue' => 'second',
'biz' => 'wefwf'
},
'in2' => 'jjjj',
'comment' => ' howdy folks ',
'in3' => 'asbefas'
}
};

robic0

unread,

Dec 21, 2005, 8:57:25 PM12/21/05

to

Hey, I don't know how but you started a new "Re:" thread.
I just posted up on the original thread midly reworked code.
If you would like to try it out feel free.

This is indeed xml parser framework logic. There is nothing left now
but incidentals to bring it up to the XML spec like tag naming,
special character escape sequences ("&amp",...). Its not made the
same as XML::Parser or SAX. This is something entirely different.
The thrust was to parse the xml into a valid data structure.

The direction this could take is anybodys guess but I have alot
of imagination. I don't think writing a complete xml parser is
fairly hard. I wrote this framework in 4 days and I've used xml
parsers before. The parsing is done purely with regexp however
pulling out the data is real-time as the substitution progresses.
As the substitution moves forward, the xml string shrinks so the
subsequent regex searches get exponentially short resulting in
an extremely efficient and fast parse.

I welcome you to try it out. Perhaps do some time comparisons
with any other parser out there. I may do some more on it
in the next few days.

Post to the thread I'm posting the code to so I can get your
feedback. That is where I will post the next version.

And pay no attention to Matt Garish or Tad McClelan... my
underlings!

robic0
--------------------------------
"AMERICAN" and proud of it!

robic0

unread,

Dec 23, 2005, 1:42:08 AM12/23/05

to

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

>This post is in response to someone who asked for help trying to
>parse xml into a data structure.

This will fix the final issues with "ForceArray".
Comments have an issue with enclosed "<" or ">" in this
version, other than that they will process normally.
Its a regex issue (shortcoming in my opinion) that can't
match a "not" string. Where I need .
Where (.*)(?!<!--) won't work in an expression. But I'll
work around that.

This is version .901 from 12-22-05 is the one you want.
This is close to the last post as far as this newsgroup.
Sorry, but I had to get it stable. I've run this on every
big and wierd xml file I could get my hands on. I'm
satisfied with it.

See ya...

print <<EOM;

# XML Regex Parser
# Version .901 - 12/22/05

# Copyright 2005,
# by robic0-At-yahoo.com
# -----------------------
EOM

use strict;

use warnings;
use Data::Dumper;

#open DATA, "sumfile.xml" or die "can't open datafile...";
#my $gabage1 = join ('', <DATA>);
#close DATA;

my $gabage3 = '

<bar><inside>asgfasdf<insF>2</insF>sdfb</inside></bar>
</in1>
<in2>some in3 content</in2>
asdfb
</big>

';

my @xml_strings = ($gabage3);

my $VERSION = .901;
my $debug = 1;

my $rmv_white_space = 1;
my $ForceArray = 0;
my $KeepRoot = 0;

my $KeepComments = 0;

## -- XML, start & end regexp substitution delimiter chars --

## match side , substitution side

## ----------------------/-------------------------------
my @S_dlim = ('\[' , '['); # use these for debug

my @E_dlim = ('\]' , ']');
#my @S_dlim = (chr(140) , chr(140)); # use these for production
#my @E_dlim = (chr(141) , chr(141));

## -- Process xml data --
##
for (@xml_strings)
{

print "\n",'='x30,"\n$_\n\n";

if ($rmv_white_space) {
s/>[\s]+</></g;
s/[\s]+</</g;
s/>[\s]+/>/g;
}

my $ROOT = {}; # container
my ($last_cnt, $cnt, $i) = (-1, 1, 0);

# should only need 2 iterations max, but wth
while ($cnt != $last_cnt && $i < 20)
{
$last_cnt = $cnt;

## <?XML-Version ?> , have to check the format of '<?'
while (s/<\?([^<>]*)\?>//i) {} # to void xml
versioning
# while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/i)
{ print "$cnt <$1> = \n" if ($debug); $cnt++}

## , nesting not processed,
## also comments can't have "<" or ">" this version.
if (!$KeepComments) {
while (s///s) {} # to void
comments
} else {
while
(s//$S_dlim[1]$cnt$E_dlim[1]/s) {
# while
(s//$S_dlim[1]$cnt$E_dlim[1]/s) {

print "$cnt  = $1\n" if
($debug);
$ROOT->{$cnt} = { comment => $1 };
$cnt++;
}
}

## <Tag/> , no content
while
(s/<([0-9a-zA-Z]+)\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = \n" if ($debug);
$ROOT->{$cnt} = { $1 => '' };
$cnt++;
}
## <Tag Attributes/> , no content
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[
]*"[^<]*")+[ ]*\/>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2\n" if ($debug);
$ROOT->{$cnt} = { $1 => getAttrHash($2) };
$cnt++;
}
## <Tag> Content </Tag>
while
(s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = $2\n" if ($debug);
my $unknown = '';
if (length($2) > 0) {

my $hcontent = getContentHash($2,
$ROOT);
$unknown = $hcontent;

if (keys (%{$hcontent}) > 1) {

if (!$ForceArray) {
adjustForSingleItemArrays ($hcontent); }
} else {

if (exists
$hcontent->{'content'} && scalar(@{$hcontent->{'content'}}) == 1) {

if (!$ForceArray ) {

$unknown =
${$hcontent->{'content'}}[0];

} else {$unknown =
$hcontent->{'content'}; }
}
if (!$ForceArray) {
adjustForSingleItemArrays ($hcontent); }

}
}
$ROOT->{$cnt} = { $1 => $unknown };
$cnt++;
}
## <Tag Attributes> Content </Tag>
while (s/<([0-9a-zA-Z]+)([ ]+[0-9a-zA-Z]+[ ]*=[
]*"[^<]*")+[ ]*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/i) {
print "$cnt <$1> = attr: $2, content: $3\n" if
($debug);
my $hattrib = getAttrHash($2);

if (length($3) > 0) {

my $hcontent = getContentHash($3,
$ROOT);

if (!$ForceArray) {
adjustForSingleItemArrays ($hcontent); }

while (my ($key,$val) = each
(%{$hcontent})) {
$hattrib->{$key} = $val;
}
}
$ROOT->{$cnt} = { $1 => $hattrib };
$cnt++;
}

if ($last_cnt != $cnt) {
$i++ ; print "** End pass $i\n" if ($debug);
}
}

if (/<|>/) {
print "($i) XML problem: malformed, syntax or tag
closure:\n$_";
} else {

print "\n** Itterations = $i\n** ForceArray =
$ForceArray\n** KeepRoot = $KeepRoot\n** KeepComments =

$KeepComments\n\n";

#print Dumper($ROOT);
my $outer_element = $cnt-1;
if (exists $ROOT->{$outer_element}) {

my $htodump = $ROOT->{$outer_element};
if (!$KeepRoot && keys (%{$htodump}) == 1) {
my ($key,$val) = each (%{$htodump});
$htodump = $val;
}

my $tmp = {};

%{$tmp} = %{$htodump};
print Dumper($tmp);
} else {print "nothing to output!\n";}
}
}
##
sub adjustForSingleItemArrays
{
my $href = shift;
## if $val is an array ref and has one element
## set $href->{$key} equal to the element
while (my ($key,$val) = each (%{$href})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1) {
$href->{$key} = $val->[0];
}
}
}
}

##
sub getAttrHash
{
my $attstr = shift;
my $ahref = {};
return $ahref unless (defined $attstr);
while ($attstr =~ s/[ ]*([0-9a-zA-Z]+)[ ]*=[ ]*"([^=]*)"[
]*//i) {
$ahref->{$1} = $2;
}
return $ahref;
}
##
sub getContentHash
{
my ($attstr,$hStore) = @_;
my $ahref = {};
return $ahref unless (defined $attstr && defined $hStore);
my @ary = ();
while ($attstr =~
s/([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//i) {
if (defined $1) {
push (@ary, $1);
}
elsif (defined $2 && exists $hStore->{$2}) {
my ($key,$val) = each (%{$hStore->{$2}});

if (exists $ahref->{$key}) {

push (@{$ahref->{$key}}, $val);
} else {
$ahref->{$key} = [$val];
}
}
}

if (scalar(@ary) > 0) { $ahref->{'content'} = [@ary]; }
## if $val is an array ref and has one element and it
## is a hash ref, set {$key} equal to hash ref
if (!$ForceArray) {
while (my ($key,$val) = each (%{$ahref})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1 &&
ref($val->[0]) eq "HASH") {
$ahref->{$key} = $val->[0];
}
}
}
}
return $ahref;
}

__END__

# XML Regex Parser
# Version .901 - 12/22/05

# Copyright 2005,
# by robic0-At-yahoo.com
# -----------------------

==============================

<bar><inside>asgfasdf<insF>2</insF>sdfb</inside></bar>
</in1>
<in2>some in3 content</in2>
asdfb
</big>

1 = attr: biz="wefwf" ueue = "second"
2 <in2> = jjjj
3 <insF> = 2
4 <inside> = asgfasdf[3]sdfb
5 <bar> = [4]
6 <in1> = [2][1][5]
7 <in2> = some in3 content
8 <big> = attr: name="asdf" date="33", content: asdf[6][7]asdfb
** End pass 1

** Itterations = 1
** ForceArray = 0
** KeepRoot = 0

** KeepComments = 0

$VAR1 = {
'in2' => 'some in3 content',

'date' => '33',
'name' => 'asdf',
'content' => [
'asdf',
'asdfb'
],
'in1' => {
'small' => {
'ueue' => 'second',
'biz' => 'wefwf'
},

'bar' => {
'inside' => {
'insF' => '2',
'content' => [

'asgfasdf',
'sdfb'
]
}
},
'in2' => 'jjjj'
}
};

Tad McClellan

unread,

Dec 23, 2005, 7:53:28 AM12/23/05

to

robic0 <> wrote:

> Comments have an issue with enclosed "<" or ">" in this
> version, other than that they will process normally.
> Its a regex issue (shortcoming in my opinion)

Then you do not understand the mathematics underpinning
regular expressions (ie. set theory).

> that can't
> match a "not" string. Where I need .

If you are processing XML, then you do not need that, as
Comment Declarations cannot be nested.

> This is version .901 from 12-22-05 is the one you want.

No sensible person will want XML processing code written by
someone who has demonstrated repeatedly that they do not
understand the data that is being processed.

Bart Van der Donck

unread,

Dec 23, 2005, 6:31:03 PM12/23/05

to

robic0 wrote:

> On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:
>
> >This post is in response to someone who asked for help trying to
> >parse xml into a data structure.
>
> This will fix the final issues with "ForceArray".
> Comments have an issue with enclosed "<" or ">" in this
> version, other than that they will process normally.
> Its a regex issue (shortcoming in my opinion) that can't
> match a "not" string. Where I need .
> Where (.*)(?!<!--) won't work in an expression. But I'll
> work around that.
>
> This is version .901 from 12-22-05 is the one you want.
> This is close to the last post as far as this newsgroup.
> Sorry, but I had to get it stable. I've run this on every
> big and wierd xml file I could get my hands on. I'm
> satisfied with it.

[ code snipped ]

It's very hard to run your code. You are messing up the line ends in
your post. I 've uploaded a corrected version to
www.dotinternet.be/temp/code.txt.

Your software produces errors when using namespaces:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:html="http://www.w3.org/TR/REC-html-4.0">
<mytag>content</mytag>
<html:br/>
</root>

Your software produces errors when using a DOCTYPE:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<root>
<mytag>content</mytag>
</root>

Your software produces errors when argument values are enclosed by `` '
´´ instead of `` " ´´:

<?xml version='1.0' encoding='UTF-8'?>
<root>
<mytag myargument='argvalue'>content</mytag>
</root>

XML is case sensitive; your program doesn't seem to bother:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<mYTag myargument="argvalue">content</mytag>
</root>

I'm using Microsoft XP's XML parser to check the XML well-formedness.

Your program has many shortcomings.

--
Bart

robic0

unread,

Dec 23, 2005, 6:56:10 PM12/23/05

to

On 23 Dec 2005 15:31:03 -0800, "Bart Van der Donck" <ba...@nijlen.com>
wrote:

>
>It's very hard to run your code. You are messing up the line ends in
>your post. I 've uploaded a corrected version to
>www.dotinternet.be/temp/code.txt.
>

Please don't correct and post code I've written on this.
I'm taking it to a higher level every day. My thoughts on
this won't take it where you want to go. Its my idea
and I'll do just about anything I want with it! The code
strain emminates from my creativity, I gave it birth and
I will progress it. Email me, or post code on specific xml
that doesen't work. Either you get a exception bail out
or you get my general error. Not all xml constucts are
implemented. !DOCTYPE not done yet. Its an infant now,
just the basics. Trust me, I'm gonna do it all.

If you got a host for me that would be great!
I'm going to expand this to every xml construct out there.

robic0

unread,

Dec 23, 2005, 7:19:25 PM12/23/05

to

On 23 Dec 2005 15:31:03 -0800, "Bart Van der Donck" <ba...@nijlen.com>
wrote:

>robic0 wrote:

>
>> On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:
>>
>> >This post is in response to someone who asked for help trying to
>> >parse xml into a data structure.

[snip]

>It's very hard to run your code. You are messing up the line ends in
>your post.

I'm not messing up "line ends"..

> I 've uploaded a corrected version to
>www.dotinternet.be/temp/code.txt.

You didn't write the code, you can't correct it..

>
>Your software produces errors when using namespaces:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root xmlns:html="http://www.w3.org/TR/REC-html-4.0">
> <mytag>content</mytag>
> <html:br/>
> </root>
>

Uh, namespaces? wha where?

>Your software produces errors when using a DOCTYPE:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <root>
> <mytag>content</mytag>
> </root>
>

"<!DOCTYPE..." is not implemented, don't use that xml

>Your software produces errors when argument values are enclosed by `` '

>创 instead of `` " 创:

>
> <?xml version='1.0' encoding='UTF-8'?>
> <root>
> <mytag myargument='argvalue'>content</mytag>
> </root>
>

Ok, I'l give you that, if '|" is ok for attribute's then I'll put it
in

>XML is case sensitive; your program doesn't seem to bother:

Thought that was the case, I turned off case sensitivity, I'll
put it back on

>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
> <mYTag myargument="argvalue">content</mytag>
> </root>
>
>I'm using Microsoft XP's XML parser to check the XML well-formedness.
>
>Your program has many shortcomings.
>

My program has a solid framework I wrote in 4 days. I've run it on
every single MShit OS xml on my machine. It works perfect ...

Don't know what you want. Either you want what I wrote your you just
want to bust balls of a software designer. Can't figure out which you
want. One more comment like the one above and I won't post a personal
reply like this one!
if ever you should

robic0

unread,

Dec 23, 2005, 7:34:02 PM12/23/05

to

On Fri, 23 Dec 2005 16:19:25 -0800, robic0 wrote:

>On 23 Dec 2005 15:31:03 -0800, "Bart Van der Donck" <ba...@nijlen.com>
>wrote:
>
>>robic0 wrote:
>>
>>> On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:
>>>
>>> >This post is in response to someone who asked for help trying to
>>> >parse xml into a data structure.
>[snip]
>>It's very hard to run your code. You are messing up the line ends in
>>your post.
>I'm not messing up "line ends"..
>> I 've uploaded a corrected version to
>>www.dotinternet.be/temp/code.txt.
>You didn't write the code, you can't correct it..
>>
>>Your software produces errors when using namespaces:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <root xmlns:html="http://www.w3.org/TR/REC-html-4.0">
>> <mytag>content</mytag>
>> <html:br/>
>> </root>
>>
>Uh, namespaces? wha where?

<html:br/>
^
Only \w are allowed in tag names now.
This character can be allowed.
I won't do it until the ramifications of a ":" are clear.
Send me the spec on tags, delimeters that runnon without space
within tags.
I'll see what I can do.

Matt Garrish

unread,

Dec 23, 2005, 7:29:45 PM12/23/05

to

<robic0> wrote in message news:on2pq15rrr0jo0nef...@4ax.com...

> On 23 Dec 2005 15:31:03 -0800, "Bart Van der Donck" <ba...@nijlen.com>
> wrote:
>>
>>It's very hard to run your code. You are messing up the line ends in
>>your post. I 've uploaded a corrected version to
>>www.dotinternet.be/temp/code.txt.
>>
> Please don't correct and post code I've written on this.
> I'm taking it to a higher level every day. My thoughts on
> this won't take it where you want to go. Its my idea
> and I'll do just about anything I want with it! The code
> strain emminates from my creativity, I gave it birth and
> I will progress it. Email me, or post code on specific xml
> that doesen't work.

I don't think anyone wants your garbage.

Now how about the part where you start dealing with the fact that xml is not
constrained to single lines. Your little toy has a lot of trouble with:

and also:

Maybe you should learn XML *before* trying to write this parser of yours.

Matt

robic0

unread,

Dec 23, 2005, 8:08:26 PM12/23/05

to

On Fri, 23 Dec 2005 19:29:45 -0500, "Matt Garrish"
<matthew...@sympatico.ca> wrote:
>Now how about the part where you start dealing with the fact that xml is not
>constrained to single lines. Your little toy has a lot of trouble with:
>

Huh, constrained to single lines?
Wha, where?

>
>

Comments are a problem for now. I have a workaround
for the near future. I've posted a general complaint
about this Regex problem to the general forum.

>and also:
>
><myplace
> city="here"
> province="there"/>
>

"white space" is not considered as a seperator yet, only " ". If its
xml complieant I will enact it.

>Maybe you should learn XML *before* trying to write this parser of yours.

Maybe you should not get or use any my software. If I find out you did
I will sue you!!!!
>
>Matt
>

Tad McClellan

unread,

Dec 24, 2005, 9:45:36 AM12/24/05

to

robic0 <> wrote:

> I will sue you!!!!

I doubt it.

You'd have to stop cowering behind anonymity to sue.

You don't have the guts for it.

Matt Garrish

unread,

Dec 24, 2005, 11:57:13 AM12/24/05

to

<robic0> wrote in message news:kf7pq15pfhs754mu2...@4ax.com...

> On Fri, 23 Dec 2005 19:29:45 -0500, "Matt Garrish"
> <matthew...@sympatico.ca> wrote:
>>Now how about the part where you start dealing with the fact that xml is
>>not
>>constrained to single lines. Your little toy has a lot of trouble with:
>>
> Huh, constrained to single lines?
> Wha, where?
>
>>
>>
> Comments are a problem for now. I have a workaround
> for the near future. I've posted a general complaint
> about this Regex problem to the general forum.
>
>>and also:
>>
>><myplace
>> city="here"
>> province="there"/>
>>
> "white space" is not considered as a seperator yet, only " ". If its
> xml complieant I will enact it.
>

Exactly my point. The last XML processor I built took three weeks just to
write the design for and another 1.5 months to build. And I didn't write my
own parsers; I used a combination of DOM and SAX parsing. You don't know XML
and are proud that you've spent four days designing and writing on the fly
this parser of yours. Are you beginning to see why we don't take you
seriously.

>
> Maybe you should not get or use any my software. If I find out you did
> I will sue you!!!!
>

Maybe you should consider the legal ramifications of what you've done. You
posted the code here asking for help fixing it on the premise that it is
free and open code. By doing so, you've entered an agreement with everyone
on clpm who responds in any way to your code that this will always be the
case. Though I don't believe you could ever make a cent off it, bear in mind
that I have a real cause for legal action if I find out you use this code in
any commercial product (and that includes reproducing it for an employer).

By the way, have you put any thought into the public interface for this
thing? It's nice that it runs line-by-line and uses regexes to find tags,
but that's totally useless for XML parsing. Does it handle events like a SAX
parser? (Not that I see.) Does it build a parent/child tree? (Again, I don't
see anywhere that you can tell what the relationship is between any set of
tags.) Or is this just an exercise in writing regular expressions?

Matt

robic0

unread,

Dec 26, 2005, 10:39:30 PM12/26/05

to

On Sat, 24 Dec 2005 11:57:13 -0500, "Matt Garrish"
<matthew...@sympatico.ca> wrote:

Man you make me laff!

>By the way, have you put any thought into the public interface for this
>thing? It's nice that it runs line-by-line and uses regexes to find tags,
>but that's totally useless for XML parsing. Does it handle events like a SAX
>parser? (Not that I see.) Does it build a parent/child tree? (Again, I don't
>see anywhere that you can tell what the relationship is between any set of
>tags.) Or is this just an exercise in writing regular expressions?
>
>Matt
>

Since its out of sequence, its totally useless for event driven SAX.
However, in-line handling of contents could be re-directed for
special character handling.
Specific accumulation of special "tag" data could be handled too.
You have to think outside the box on this. Definetly the data
structure indenture is right on the money. To modify that data
in-line or pull off just the data you want is no problem.

To tell you the truth, there's a bunch this can do.
You better try to stay off the "negative" machine a little more.
Try the "positive" machine for a while. And oh well, if it flops
who cares, but it punches out some awsome timed data right now.
The technique is new, in my opinion its worth the effort.

Keep the comments coming... I don't care if its negative,
it leads me in the right direction. If I have to swear to get
some feedback so be it.

robic0

unread,

Dec 26, 2005, 10:52:26 PM12/26/05

to

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

I'm back on the job.
I'm going to post some new code this week that
complies with XML spec.

This is the solution for the Comment/CDATA paradigm
that will be incorporated in the new version:

use strict;
use warnings;

$_ = '
<![CDATA[  some text ]]>

';

#### This section of parser deals with
#### circular non-markup imbedding issues.
#### (one inside the other, and so forth)
#### So far just comments & cdata.
#### Use the general substitution magic.
#### This is valid because nesting of
#### comments nor cdata is allowed.

my $cnt = 1;
my %root = ();
my %cdata_elements = ();

print "\n";

# -- Comments (done first) --
while (s/()/[$cnt]/s) {
$root{$cnt} = $1;
print "$cnt = Questionable comment: $1\n"; $cnt++;
}
print "\n\n",'='x60,"\n\nThe \"Real\" Stuff -->\n\n";
# -- CDATA (done second) --
while (s/<!\[CDATA\[(.*?)\]\]>/[$cnt]/s)
{
# reconstitute cdata element contents
my $cdata_contents = $1;
my $str = '';
while ( $cdata_contents =~ s/([^\[\]]+)|\[([\d]+)\]//i )
{
if (defined $1)
{
$str .= $1;
}
elsif (defined $2 && exists $root{$2})
{
$str .= $root{$2};
delete $root{$2};
}
else {
my $j = 0; # shouldn't get here
}
}
$root{$cnt} = $str;
$cdata_elements{$cnt} = '';

print "\n$cnt = REAL CDATA: $root{$cnt}\n"; $cnt++;
}
# -- Process leftover comments that are real --
while (my ($key,$val) = each (%root)) {
if (!defined $cdata_elements{$key}) {
# This $root re-assignment is not really necessary
# since $1 will contain the processing text that
# will be processed here, then never used again.
$root{$key} =~ s//$1/s;
print "\n$key = REAL COMMENT: $root{$key}\n"; # Or $1
}
}

__END__

1 = Questionable comment: 
2 = Questionable comment: 
3 = Questionable comment: 
4 = Questionable comment:

============================================================

The "Real" Stuff -->

5 = REAL CDATA:  some text

4 = REAL COMMENT: This is a real comment

3 = REAL COMMENT:
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>

Bart Van der Donck

unread,

Dec 27, 2005, 7:19:29 AM12/27/05

to

robic0 wrote:

> I'm back on the job.
> I'm going to post some new code this week that
> complies with XML spec.

There is more than meets the eye.

An XML file may be well-formed, but invalid if it doesn't comply with
its DTD. Would your program complain about that ?

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE root [
<!ELEMENT root ((mytag|mytag2),myothertag+,notrequiredtag?)>
<!ELEMENT mytag (#PCDATA)>
<!ELEMENT myothertag (#PCDATA)>
]>
<root>
<mytag>content 1</mytag>
<myothertag>content 2</myothertag>
</root>

What about the declaration of entities ?

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE root [
<!ENTITY my_entity "this content was set by !ENTITY">
]>
<root>
<mytag>&my_entity;</mytag>
<myothertag>content 2</myothertag>
</root>

What about an ATTLIST ?

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE root [
<!ATTLIST mytag
att1 CDATA #REQUIRED
att2 CDATA #IMPLIED>
<!ATTLIST myothertag att3 CDATA #FIXED
"this content was set by !ATTLIST">
]>
<root>
<mytag att1="attvalue1" att2="attvalue2">content 1</mytag>
<myothertag>content 2</myothertag>
</root>

What you gonna do with specific XSL tags ?

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<root>
<xsl:sort select="@ID" order="ascending" />
<mytag>
<xsl:attribute name='{name()}'>
<xsl:value-of select="." />
</xsl:attribute>
</mytag>
</root>
</xsl:stylesheet>

What about the rules from an XML schema ?

<?xml version="1.0" encoding="UTF-8"?>

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="root">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="mytag" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>

It would be a good idea to decode numeric character references:

<?xml version="1.0" encoding="UTF-8"?>
<root>

Same for the non-numeric ones:

<?xml version="1.0" encoding="UTF-8"?>
<root>

I would recommend "Perl & XML - XML Processing with Perl" by Erik T.
Ray & Jason McIntosh (edited by O'Reilly). Very good book. See
http://www.oreilly.com/catalog/perlxml/.

You need to learn more about XML:

http://www.w3.org/XML/
http://www.xml.com/
http://www.w3schools.com/xml/default.asp (tip!)

--
Bart

Matt Garrish

unread,

Dec 27, 2005, 9:37:12 AM12/27/05

to

<robic0> wrote in message news:evc1r1d8jt63q25vm...@4ax.com...

> On Sat, 24 Dec 2005 11:57:13 -0500, "Matt Garrish"
> <matthew...@sympatico.ca> wrote:
>
> Man you make me laff!
>

Well, at least you're getting as much out of this as I am. It would be nice
if you could drop the script-kiddie talk and write proper English sentences
in the future, though.

>
>>By the way, have you put any thought into the public interface for this
>>thing? It's nice that it runs line-by-line and uses regexes to find tags,
>>but that's totally useless for XML parsing. Does it handle events like a
>>SAX
>>parser? (Not that I see.) Does it build a parent/child tree? (Again, I
>>don't
>>see anywhere that you can tell what the relationship is between any set of
>>tags.) Or is this just an exercise in writing regular expressions?
>>
>>

> Since its out of sequence, its totally useless for event driven SAX.

That's exactly my point. What is this thing supposed to do? The (very
simple) point of an XML parser is to verify the integrity of the document
(validation: either well-formedness or compliance to a dtd or schema) and/or
allow you to access the content.

Your parser has no appreciation of nesting beyond the very trivial, so there
is no way that it can check well-formedness. It (you) also doesn't
understand dtds or schemas, and don't realize how nearly impossible it's
going to be for your parser to validate against one.

To get back to my original point, however, your parser does not build a
tree, so that makes it useless for half the applications of a parser. It
also doesn't handle events like a SAX parser, which makes it useless for the
other half. I'm honestly curious what real world application you think this
is going to have?

Oh, and when are you going to start handling xpath queries?

Matt

Matt Garrish

unread,

Dec 27, 2005, 9:39:29 AM12/27/05

to

"Bart Van der Donck" <ba...@nijlen.com> wrote in message
news:1135685969....@o13g2000cwo.googlegroups.com...

> robic0 wrote:
>
>> I'm back on the job.
>> I'm going to post some new code this week that
>> complies with XML spec.
>

>
> You need to learn more about XML:
>

True, but I for one am learning much by seeing a code god in action... : )

Matt

Jürgen Exner

unread,

Dec 27, 2005, 2:44:43 PM12/27/05

to

Matt Garrish wrote:
> <robic0> wrote in message

>> Man you make me laff!
>

> It would
> be nice if you could drop the script-kiddie talk and write proper
> English sentences in the future, though.

Well, actually I think it was just a typo and he meant to write "luff". It's
a common mistake for non-native English speakers to write vowels the way
they are pronounced.

Of course the semantic of that sentence in the context of Perl is still a
mystery.

jue

PS: Am I glad that my kill file is working fine.

Matt Garrish

unread,

Dec 27, 2005, 2:52:36 PM12/27/05

to

"Jürgen Exner" <jurg...@hotmail.com> wrote in message
news:LQgsf.173$3Y3.92@trnddc02...

> Matt Garrish wrote:
>> <robic0> wrote in message
>>> Man you make me laff!
>>
>> It would
>> be nice if you could drop the script-kiddie talk and write proper
>> English sentences in the future, though.
>
> Well, actually I think it was just a typo and he meant to write "luff".
> It's a common mistake for non-native English speakers to write vowels the
> way they are pronounced.
>

But he's a proud American, you forget. It's still a long way from the
*laugh* I'm getting from him... : )

Matt

Jürgen Exner

unread,

Dec 27, 2005, 3:24:49 PM12/27/05

to

Matt Garrish wrote:
> "Jürgen Exner" <jurg...@hotmail.com> wrote in message
> news:LQgsf.173$3Y3.92@trnddc02...
>> Matt Garrish wrote:
>>> <robic0> wrote in message
>>>> Man you make me laff!
>>>
>>> It would
>>> be nice if you could drop the script-kiddie talk and write proper
>>> English sentences in the future, though.
>>
>> Well, actually I think it was just a typo and he meant to write
>> "luff". It's a common mistake for non-native English speakers to
>> write vowels the way they are pronounced.
>>
>
> But he's a proud American, you forget.

Well, unfortunately many of those do qualify as non-native English speakers
;-(

jue

John Bokma

unread,

Dec 27, 2005, 3:38:08 PM12/27/05

to

"Matt Garrish" <matthew...@sympatico.ca> wrote:

> But he's a proud American, you forget.

And an IQ of 170 (or there about)

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
I ploink googlegroups.com :-)

robic0

unread,

Dec 29, 2005, 12:42:09 AM12/29/05

to

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

Update on the code. Lots of changes:

v.902
- Fixed white space issues surrounding attributes
- Fixed \"\' issue delimiting attribute content
- Fixed root container issues
- Fixed CDATA in comments
- Fixed comments in CDATA
- Added more warnings related to root level

Will be working on the usage regarding the
remove white spaces flag. Ie: to be applied
to content only or not.

To be done -
Still havent incorporated ":" logic in attributes.
Still no special xml character conversions (easy though)
Will incorporate simple callbacks for content.
No doctype or others as of yet, will look into this.
Will make the parsing a function with error trapping
capability for the caller (down the road).

The framework is working out pretty good.
Let me know if you have any questions or comments.

Thanks

print <<EOM;

# -----------------------
# XML Regex Parser
# Version .902 - 12/28/05

use strict;

use warnings;
use Data::Dumper;

#open DATA, "sumfile.xml" or die "can't open datafile...";

#my $gabage1 = join ('', <DATA>);
#close DATA;

my $gabage4 = '

<big name="asdf" date="33" >
asdf
<in1>

<in2>jjjj</in2>


<bar><inside>asgfasdf<insF>2</insF>sdfb</inside></bar>
</in1>
<in2>some in3 content</in2>
asdfb
</big>
';

my $gabage5 = '
<root>

this is some content
<stag> some br stuffafter<t>some t

</t>

</stag>

<![CDATA[  some text ]]>
<![CDATA[  some text ]]>

</root>
';

my $gabage6 = "

";

my $gabage7 = "
<node1
tire = 'Michelan'
size = \"235 x 16\">
Recalled by factory</node1>

";

my $gabage8 = "
<node1
color = 'green'
vtype = \"truck'
/>

";

my $gabage9 = '
<node>this is a node</node>
<node>this is a different</node>
';

my $gabage10 = '
<Node>this is a node</node>
<node>this is a different</Node>
';

my @xml_strings = ($gabage4, $gabage5, $gabage6, $gabage7, $gabage9,
$gabage10);

my $VERSION = .902;
my $debug = 0;
my $rmv_white_space = 0;

my $ForceArray = 0;
my $KeepRoot = 0;
my $KeepComments = 0;

## -- XML, start & end regexp substitution delimiter chars --

## match side , substitution side

## -------------------------/-------------------------------
my (@S_dlim, @E_dlim);
if ($debug) {
@S_dlim = ('\[' , '['); # use these for debug

@E_dlim = ('\]' , ']');

} else {

@S_dlim = (chr(140) , chr(140)); # use these for production

@E_dlim = (chr(141) , chr(141));
}

## -- Process xml data --
##
for (@xml_strings)
{
print "\n",'*'x30,"\nXML
string:\n",'-'x15,"$_\n\nOutput:\n",'-'x15,"\n\n";

my $ROOT = {}; # container

my %cdata_elements = ();
my ($last_cnt, $cnt, $i, $attr_error) = (-1, 1, 0, 0);

## Comment/CDATA block ==================================
#### To be done first -
# -- Questionable Comments --
while (s/()/$S_dlim[1]$cnt$E_dlim[1]/s) {
#print "$cnt = Questionable comment: $1\n" if
($debug);
$ROOT->{$cnt} = $1;
$cnt++;
}
#### To be done second -
# -- Real CDATA --
while (s/<!\[CDATA\[(.*?)\]\]>/$S_dlim[1]$cnt$E_dlim[1]/s)
{
# reconstitute cdata contents

my $cdata_contents = $1;
my $str = '';
while ($cdata_contents =~

s/([^$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//) {

if (defined $1) {
$str .= $1;

} elsif (defined $2 && exists $ROOT->{$2}) {
$str .= $ROOT->{$2};
delete $ROOT->{$2};
} else {} # shouldn't get here
}
print "$cnt CDATA = $str\n" if ($debug);
$ROOT->{$cnt} = $str;
$cdata_elements{$cnt} = '';
$cnt++;
}
#### To be done third -
# -- Real Comments are left --
foreach my $key (sort {$a <=> $b} keys %{$ROOT}) {
if (!exists $cdata_elements{$key}) {
$ROOT->{$key} =~ s/^$/$1/s;
print "$key Comment = $1\n" if ($debug);
if ($KeepComments) {
$ROOT->{$key} = { comment => $1 };
} else {delete $ROOT->{$key};}
}
}
## End Comment/CDATA block ==============================

#### Non-tag markup go here -
# -- Versioning -- <?XML-Version ?> , have to check the format
of '<?'
while (s/<\?([^<>]*)\?>//) {} # void xml versioning for now
# while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/)
# { print "$cnt <$1> = \n" if ($debug); $cnt++}

#### White space removal before tags ? .. TBD -

if ($rmv_white_space) {
s/>[\s]+</></g;
s/[\s]+</</g;
s/>[\s]+/>/g;
}

#### Tags here - should only need 2 iterations max
my $finished = 0;

while ($cnt != $last_cnt && $i < 20)
{
$last_cnt = $cnt;

## <Tag/> , no content
while (s/<([0-9a-zA-Z]+)\/>/$S_dlim[1]$cnt$E_dlim[1]/)

{
print "$cnt <$1> = \n" if ($debug);
$ROOT->{$cnt} = { $1 => '' };
$cnt++;
}
## <Tag Attributes/> , no content
while

(s/<([0-9a-zA-Z]+)([\s]+[0-9a-zA-Z]+[\s]*=[\s]*["'][^<]*['"])+[\s]*\/>/$S_dlim[1]$cnt$E_dlim[1]/)

{
print "$cnt <$1> = attr: $2\n" if ($debug);

my $hattrib = getAttrHash($2);
if (ref($hattrib) ne "HASH") {
print "Invalid token in attribute
asignment:\n$hattrib\n"; $attr_error = 1; last;

}
$ROOT->{$cnt} = { $1 => $hattrib };
$cnt++;
}

## <Tag> Content </Tag>
while

(s/<([0-9a-zA-Z]+)>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/) {

print "$cnt <$1> = $2\n" if ($debug);
my $unknown = '';
if (length($2) > 0) {

my $hcontent = getContentHash($2,
$ROOT, \%cdata_elements);
$unknown = $hcontent;

if (keys (%{$hcontent}) > 1) {

if (!$ForceArray) {
adjustForSingleItemArrays ($hcontent); }
} else {
if (exists
$hcontent->{'content'} && scalar(@{$hcontent->{'content'}}) == 1) {
if (!$ForceArray ) {
$unknown =
${$hcontent->{'content'}}[0];
} else {$unknown =
$hcontent->{'content'}; }
}
if (!$ForceArray) {
adjustForSingleItemArrays ($hcontent); }
}
}

$ROOT->{$cnt} = { $1 => $unknown };
$cnt++;
}

last if ($attr_error);

## <Tag Attributes> Content </Tag>
while

(s/<([0-9a-zA-Z]+)([\s]+[0-9a-zA-Z]+[\s]*=[\s]*["'][^<]*['"])+[\s]*>([^<]*)<\/\1>/$S_dlim[1]$cnt$E_dlim[1]/)

{
print "$cnt <$1> = attr: $2, content: $3\n" if
($debug);
my $hattrib = getAttrHash($2);

if (ref($hattrib) ne "HASH") {
print "Invalid token in attribute
asignment:\n$hattrib\n"; $attr_error = 1; last;

}
if (length($3) > 0) {
my $hcontent = getContentHash($3,

$ROOT, \%cdata_elements);
if (!$ForceArray) {
adjustForSingleItemArrays ($hcontent); }

while (my ($key,$val) = each
(%{$hcontent})) {
$hattrib->{$key} = $val;
}
}
$ROOT->{$cnt} = { $1 => $hattrib };
$cnt++;
}

last if ($attr_error);

if ($last_cnt != $cnt) {
$i++; print "** End pass $i\n" if ($debug);

} else {
last if ($finished);
## Encapsulate the xml with a "root"
$_ = "<root>$_</root>";
$last_cnt--;
$finished = 1;
}
}
last if ($attr_error);
if (/<|>/) {
print "($i) XML problem: malformed, syntax or tag
closure:\n$_";
} else {

print "\n** Itterations = $i\n** ForceArray =
$ForceArray\n** KeepRoot = $KeepRoot\n** KeepComments =
$KeepComments\n\n";
#print Dumper($ROOT);

print "The remaining string is:\n$_\n\n" if ($debug);

## Strip off the outer element (our root) to
## examine the contents for errors.
## ---------------------------------------

my $outer_element = $cnt-1;
if (exists $ROOT->{$outer_element}) {

my $hroot = $ROOT->{$outer_element};
my ($key,$val) = each (%{$hroot});
my $htodump = $val;

# check for errors in root
if (ref($htodump) ne "HASH" || exists
$htodump->{'content'}) {
print "Error, bare content at root
level ..\n";
} else {
my $dmp_keys = keys (%{$htodump});

if ($dmp_keys > 1) {
print "Warning, multiple
elements at root level ..\n";
} else {

($key,$val) = each
(%{$htodump});

my $dmp_type = ref($val);

if ($dmp_keys == 0 || (exists
$htodump->{'comment'})) {
print "Warning, no
elements at root level ..\n";
}
if ($dmp_keys == 1) {
if ($dmp_type eq
"HASH") {
$htodump =
$val if (!$KeepRoot);
}
elsif ($dmp_type eq
"ARRAY") {
if
(!$ForceArray || scalar(@{$val}) > 1) {
print
"Warning, multiple elements at root level ..\n";
}
}
}
}
}
print "\n";
my $tmp = {};

%{$tmp} = %{$htodump};
print Dumper($tmp);
} else {
print "nothing to output!\n";
}
}
}
##
sub adjustForSingleItemArrays
{
my $href = shift;
## if $val is an array ref and has one element
## set $href->{$key} equal to the element
while (my ($key,$val) = each (%{$href})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1) {
$href->{$key} = $val->[0];
}
}
}
}
##
sub getAttrHash
{

my $attrstr = shift;
my $ahref = {};
return $ahref unless (defined $attrstr);
while ($attrstr =~
s/[\s]*([0-9a-zA-Z]+)[\s]*=[\s]*("|')([^=]*)\2[\s]*//i) {
$ahref->{$1} = $3;
}
if ($attrstr=~/=/) {
$attrstr =~ s/^\s+//s;
$attrstr =~ s/\s+$//s;
return $attrstr

}
return $ahref;
}
##
sub getContentHash
{

my ($contstr,$hStore,$hcdata_elements) = @_;
my $ahref = {};
return $ahref unless (defined $contstr && defined $hStore &&
defined $hcdata_elements);
my @ary = ();
my $append_flag = 0;

while ($contstr =~
s/^([^<$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//s) {
if (defined $1) {
my $tmp1 = $1;
# if flagged, append it to $ary[last]
if ($append_flag && scalar(@ary) > 0) {
my $size = scalar(@ary);
$ary[$size-1] .= $tmp1;
} else {
push (@ary, $1);
}
$append_flag = 0;
}
elsif (defined $2) {
# if it doesen't exist (Comments stripped?)
# turn on append flag.
if (!exists $hStore->{$2}) {
$append_flag = 1;
next;
}
# if its a CDATA, append it to $ary[last],
# turn on append flag.
if (exists $hcdata_elements->{$2}) {
my $size = scalar(@ary);
if ($size > 0) {
$ary[$size-1] .=
$hStore->{$2};
} else {push (@ary, $hStore->{$2});}
$append_flag = 1;
next;

}
my ($key,$val) = each (%{$hStore->{$2}});

if (exists $ahref->{$key}) {

push (@{$ahref->{$key}}, $val);
} else {
$ahref->{$key} = [$val];
}

$append_flag = 0;
}
else {} # shouldn't get here
}

# store contents, strip out
# pure whitespace text elements
my $hary = [];
for (@ary) {
next if (/^\s+$/s);
push (@{$hary}, $_);
}
if (scalar(@{$hary}) > 0) {
$ahref->{'content'} = $hary;

}
## if $val is an array ref and has one element and it
## is a hash ref, set {$key} equal to hash ref
if (!$ForceArray) {
while (my ($key,$val) = each (%{$ahref})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1 &&
ref($val->[0]) eq "HASH") {
$ahref->{$key} = $val->[0];
}
}
}
}
return $ahref;
}

__END__

# -----------------------
# XML Regex Parser
# Version .902 - 12/28/05

******************************
XML string:
---------------

<big name="asdf" date="33" >
asdf
<in1>

<in2>jjjj</in2>


<bar><inside>asgfasdf<insF>2</insF>sdfb</inside></bar>
</in1>
<in2>some in3 content</in2>
asdfb
</big>

Output:
---------------

** Itterations = 2

******************************
XML string:
---------------
<root>

this is some content
<stag> some br stuffafter<t>some t

</t>

</stag>

<![CDATA[  some text ]]>
<![CDATA[  some text ]]>

</root>

Output:
---------------

** Itterations = 2

** ForceArray = 0
** KeepRoot = 0
** KeepComments = 0

$VAR1 = {
'content' => [
'

this is some content
',
'

some text 
 some text

'
],
'stag' => {
'br' => 'some br stuff',
'content' => 'after',
't' => 'some t

'
}
};

******************************
XML string:
---------------

Output:
---------------

** Itterations = 2

** ForceArray = 0
** KeepRoot = 0
** KeepComments = 0

$VAR1 = {
'date' => '1/1/05',
'name' => 'Barney'
};

******************************
XML string:
---------------
<node1
tire = 'Michelan'
size = "235 x 16">
Recalled by factory</node1>

Output:
---------------

** Itterations = 2

** ForceArray = 0
** KeepRoot = 0
** KeepComments = 0

$VAR1 = {
'content' => '
Recalled by factory',
'tire' => 'Michelan',
'size' => '235 x 16'
};

******************************
XML string:
---------------
<node>this is a node</node>
<node>this is a different</node>

Output:
---------------

** Itterations = 2

** ForceArray = 0
** KeepRoot = 0
** KeepComments = 0

Warning, multiple elements at root level ..

$VAR1 = {
'node' => [
'this is a node',
'this is a different'
]
};

******************************
XML string:
---------------
<Node>this is a node</node>
<node>this is a different</Node>

Output:
---------------

(0) XML problem: malformed, syntax or tag closure:
<root>
<Node>this is a node</node>
<node>this is a different</Node>
</root>

robic0

unread,

Dec 29, 2005, 12:54:14 AM12/29/05

to

On Wed, 28 Dec 2005 21:42:09 -0800, robic0 wrote:

>On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:
>
>Update on the code. Lots of changes:
>
>v.902
>- Fixed white space issues surrounding attributes
>- Fixed \"\' issue delimiting attribute content
>- Fixed root container issues
>- Fixed CDATA in comments
>- Fixed comments in CDATA
>- Added more warnings related to root level
>

Oh and
- Fixed case sensitivity of tag names

-t

robic0

unread,

Dec 29, 2005, 1:05:34 AM12/29/05

to

On 27 Dec 2005 04:19:29 -0800, "Bart Van der Donck" <ba...@nijlen.com>
wrote:

I'll read when I get to XSL and schema tags.
This is very easy to do. Haven't got there yet.
Version .902 framework is a big jump as far as decoding.
Numeric and special xml characters are very simple to do. I'll
probably do that next to get it out of the way.

robic0

unread,

Dec 29, 2005, 1:35:26 AM12/29/05

to

On Tue, 27 Dec 2005 09:37:12 -0500, "Matt Garrish"
<matthew...@sympatico.ca> wrote:

In the past I've used XML::Xerces module to interface Apache's
Xerces-C Version 2.3.0 for Windows. What do you use? How do you like
its documentation?

>To get back to my original point, however, your parser does not build a
>tree, so that makes it useless for half the applications of a parser. It
>also doesn't handle events like a SAX parser, which makes it useless for the
>other half. I'm honestly curious what real world application you think this
>is going to have?

Parsers just parse and without thought extract monolithic chunks of
data, serially. Ever see a SAX parser do anything else?
Ever see a SAX parser build a bridge or maybe erect a pyrimid in
Egypt? Perhaps you think that all the XML:: modules you call parsers
that do any number of different things are all parsing. Unless you
call a PARSER directly, you aren't really parsing are you? So maybe
you think "parsing" is somehow reserved, a thing one shouldn't even
attempt. A thing so complicated to you that it makes your pecker
droop just when you think about it....

Bart Van der Donck

unread,

Dec 29, 2005, 5:41:10 AM12/29/05

to

robic0 wrote:

> I'll read when I get to XSL and schema tags.
> This is very easy to do.

Utter nonsense.

> Numeric and special xml characters are very simple to do.

Sure:

use HTML::Entities();
my $decoded = HTML::Entities::decode($encoded);

Or do you want to code out that part by hand as well ?

Oh, and... most (if not all) readers here won't run your code because
the line ends of your post are messed up. You still don't seem to
understand this. That happens when your IQ is 17,0.

Cut & paste the code from your last post and try to run that yourself.
You'll see what I mean.

--
Bart

Matt Garrish

unread,

Dec 29, 2005, 10:57:40 AM12/29/05

to

<robic0> wrote in message news:ndv6r1987tikpgevm...@4ax.com...

>
> Parsers just parse and without thought extract monolithic chunks of
> data, serially.
>

That ranks as one of the stupidest things I've seen written in 2005. Lucky
you squeaked that in before the end of the week.

So your "parser" doesn't do anything. It doesn't validate the data. It
doesn't allow it to be handled in any meaningful way. All it does is
randomly chunk up data for no use to no one. Pretty much what I figured you
were doing.

Matt

robic0

unread,

Dec 29, 2005, 4:55:43 PM12/29/05

to

On Tue, 27 Dec 2005 09:37:12 -0500, "Matt Garrish"
<matthew...@sympatico.ca> wrote:

Wheather or not I can use it to write a schema checker is something I
will consider when I feel like it.
You have some misconception about the ability of schema to do 100%
validation. It can't on every level, period! In fact between level
sets, all it can do is validate a range of parents vs. a range of
children. It can't validate a relationship between a single parent
and posible several children. So schema is whoafully inadequate alone.
To propagate all the possible permutations would make schema 100%
valid. It doesen't have that capability and never will. If all you
use is schema to validate your xml, you don't know xml..

>To get back to my original point, however, your parser does not build a
>tree,

Doesn't build a tree? Wtf are you drinking?

> so that makes it useless for half the applications of a parser. It
>also doesn't handle events like a SAX parser, which makes it useless for the
>other half. I'm honestly curious what real world application you think this
>is going to have?

To start, it blows the doors off all parsers out there. It uses a
substitution method that exponentially gets quicker. It starts from
the inner xml blocks and works out. It takes data off right away
and builds a tree without waiting for parent closure. Its 100%
accurate because the logic is flawless. It works on a micro as opposed
to macro idiom. Its out of order with discrete cells. This is the
fastest possible method to parse xml. I'm suprised no one ever did
this before. Its on the level of a node model but it can easily
hone in on patterns and discard whats not necessary and filter.
All while using an examination method that exponentionally gets
quicker as the search progresses.

I wan't your promise that when this idea takes off that you won't
have any part of it and continue to bury your head in the sand.

robic0

unread,

Dec 29, 2005, 5:05:13 PM12/29/05

to

Ok, its my news reader I never had to fine tune so much. I have my
post width set to 100 characters (it was 70). This should fix it.
I'm using Forte Agent btw.. I'll repost the code.

robic0

unread,

Dec 29, 2005, 5:24:18 PM12/29/05

to

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

For easier reading and after an adjustment to Agent, I have expanded the width of my posts to 200 chrs/line.
This is still version .902. I wll modularize this and add exception processing on the next go round.
-robic0-

Matt Garrish

unread,

Dec 29, 2005, 8:27:50 PM12/29/05

to

<robic0> wrote in message news:sil8r1hc6jv4lbv41...@4ax.com...

> On Tue, 27 Dec 2005 09:37:12 -0500, "Matt Garrish"
> <matthew...@sympatico.ca> wrote:
>
>>
>>Your parser has no appreciation of nesting beyond the very trivial, so
>>there
>>is no way that it can check well-formedness. It (you) also doesn't
>>understand dtds or schemas, and don't realize how nearly impossible it's
>>going to be for your parser to validate against one.
>>
> Wheather or not I can use it to write a schema checker is something I
> will consider when I feel like it.
> You have some misconception about the ability of schema to do 100%
> validation.

I have no misconcenptions about what Schemas can do.
I also have no misconceptions about the ability of a DTD to specify
a document's structure.

I do, however, have no faith in your ability to validate against either.

>
> To start, it blows the doors off all parsers out there.

No, it doesn't. Quick-and-dirty regular expression parsers have been around
for a long time. Google it if you really think you're doing something new.

You are still missing fundemental concepts of XML, namely order. Using your
code (what is
"gabage"?):

my $gabage4 = <<TEST;
<problem>
<elem>i see a problem here with inline elements</elem>
</problem>
TEST

Nets me the following wonderful output:

$VAR1 = {
'elem' => {
'b' => 'here',
'content' => [
' see a problem ',
' with inline elements'
],
'i' => 'i'
}
};

Anyway, this is really growing tiresome, so when you release your module to
CPAN be sure to make an announcement so we can all be awed.

Matt

John Bokma

unread,

Dec 29, 2005, 9:35:57 PM12/29/05

to

robic0 wrote:

> To start, it blows the doors off all parsers out there. It uses a
> substitution method that exponentially gets quicker. It starts from
> the inner xml blocks and works out. It takes data off right away
> and builds a tree without waiting for parent closure. Its 100%
> accurate because the logic is flawless. It works on a micro as opposed
> to macro idiom. Its out of order with discrete cells. This is the
> fastest possible method to parse xml. I'm suprised no one ever did
> this before. Its on the level of a node model but it can easily
> hone in on patterns and discard whats not necessary and filter.
> All while using an examination method that exponentionally gets
> quicker as the search progresses.

Anyone has a free position at the marketing department?

Samwyse

unread,

Dec 29, 2005, 9:39:40 PM12/29/05

to

Thank you, Bart, for pointing this out to him.

rob...@yahoo.com

unread,

Dec 30, 2005, 1:36:06 PM12/30/05

to

Ok, inline mixed content is an issue that will produce a different form
IF it is
taken into account. The default (above) is where inline non-nested tags
are guaranteed
available at the same level.

I will add a flag to keep the ordering of inline mixed content. Ie:

$KeepInlineOrder = 1;

would change the output (html?) to something like this:

$VAR1 = {
'elem' => {
'content' => [
{'i' => 'i'},
' see a problem ',
{'b' => 'here'},
' with inline elements'
]
}
};

or, with extended root stripping:

$VAR1 = [{'i' => 'i'},' see a problem ',{'b' => 'here'},' with inline
elements'];

Tad McClellan

unread,

Dec 30, 2005, 8:54:24 PM12/30/05

to

Matt Garrish <matthew...@sympatico.ca> wrote:
><robic0> wrote in message news:sil8r1hc6jv4lbv41...@4ax.com...

[snip]

> You are still missing fundemental concepts of XML, namely order. Using your
> code (what is
> "gabage"?):
>
> my $gabage4 = <<TEST;
> <problem>
> <elem>i see a problem here with inline elements</elem>
> </problem>
> TEST
>
> Nets me the following wonderful output:
>
> $VAR1 = {
> 'elem' => {
> 'b' => 'here',
> 'content' => [
> ' see a problem ',
> ' with inline elements'
> ],
> 'i' => 'i'
> }
> };

I think you have discovered a use for this wizard's code.

It is a Yoda-speak generator!

-------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $VAR1 = {

'elem' => {
'b' => 'here',
'content' => [
' see a problem ',
' with inline elements'
],
'i' => 'i'
}
};

speak_yoda($VAR1);

sub speak_yoda {
foreach ( @_ ) {
if ( ref $_ eq 'HASH' ) { speak_yoda( values %$_ ) }
elsif ( ref $_ eq 'ARRAY' ) { speak_yoda( @$_ ) }
else { print "$_\n" }
}
}
-------------------------

Nets me the following wonderful output:

here
see a problem
with inline elements
i

:-)

Matt Garrish

unread,

Dec 30, 2005, 10:49:09 PM12/30/05

to

"Tad McClellan" <ta...@augustmail.com> wrote in message
news:slrndrbp6g...@magna.augustmail.com...

I wondered how long until someone spotted that! It wasn't intentional
on my part, but that was my first thought too when I reread the
output after posting!

Maybe Lucas will buy it and make him rich. Damn! : )

Matt

robic0

unread,

Dec 31, 2005, 6:23:19 PM12/31/05

to

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

Version .903
Added a new flag:

- KeepContentOrder

Attributes are not kept in order. I could
easily do it if its really usefull, however
it would extend the containing tag out another
array level.

The output is truncated to save space, it is
ActiveState's Perl 5.6 release html file.

Let me know if you have any questions.
-robic0-

print <<EOM;

# -----------------------
# XML Regex Parser

# Version .903 - 12/31/05

use strict;
use warnings;
use Data::Dumper;

open DATA, "CHANGES56.html" or die "can't open CHANGES56.html...";

my $gabage1 = join ('', <DATA>);

close DATA;

my @xml_strings = ($gabage1);

my $alt_debug = 0;
my $VERSION = .903;

my $debug = 0;
my $rmv_white_space = 0;
my $ForceArray = 0;

my $KeepRoot = 1;
my $KeepComments = 1;
my $KeepContentOrder = 1;

## -- XML, start & end regexp substitution delimiter chars --
## match side , substitution side
## -------------------------/-------------------------------
my (@S_dlim, @E_dlim);
if ($debug) {
@S_dlim = ('\[' , '['); # use these for debug
@E_dlim = ('\]' , ']');
} else {
@S_dlim = (chr(140) , chr(140)); # use these for production
@E_dlim = (chr(141) , chr(141));
}

## -- Process xml data --
##
for (@xml_strings)
{

print "\n",'*'x30,"\nXML string:\n",'-'x15,"\n$_\n\nOutput:\n",'-'x15,"\n\n";
if ($alt_debug) {
ProcessAltDebugInfo ($_) ;
print "\n";

#### Non-tag markups go here -
####

# -- Versioning -- <?XML-Version ?> - Placeholder, voided

while (s/<\?([^<>]*)\?>//) {

#while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/) {
print "$cnt <? ?> = $1\n" if ($debug);
$ROOT->{$cnt} = { 'XMLV' => $1 };
$cnt++;
}
# -- DOCTYPE -- <!DOCTYPE info> - Placeholder, voided
while (s/<!DOCTYPE([^<>]*)>//) {
#while (s/<!DOCTYPE([^<>]*)>/$S_dlim[1]$cnt$E_dlim[1]/) {
print "$cnt DOCTYPE = $1\n" if ($debug);
$ROOT->{$cnt} = { 'DOCTYPE' => $1 };

if (exists $hcontent->{'content'})
{
my ($key);
if (!$ForceArray ) {
if (ref($hcontent->{'content'}) eq "ARRAY" && scalar(@{$hcontent->{'content'}}) == 1) {

next if ($attr_error);

if (/<|>/) {
print "($i) XML problem: malformed, syntax or tag closure:\n$_";
} else {

print "** Itterations = $i\n".
"** Debug = $debug\n".
"** Rmv white space = $rmv_white_space\n".
"** ForceArray = $ForceArray\n".
"** KeepRoot = $KeepRoot\n".
"** KeepComments = $KeepComments\n".
"** KeepContentOrder = $KeepContentOrder\n";

#print Dumper($ROOT);
print "The remaining string is:\n$_\n\n" if ($debug);

## Strip off the outer element (our root) to
## examine the contents for errors.
## ---------------------------------------
my $outer_element = $cnt-1;
if (exists $ROOT->{$outer_element})
{
my $hroot = $ROOT->{$outer_element};
my ($key,$val) = each (%{$hroot});
my $htodump = $val;

# check for errors in root

if (ref($htodump) ne "HASH" || (!$KeepContentOrder && exists $htodump->{'content'})) {
my $msg = 'Error';
$msg = 'Warning' if ($KeepContentOrder);
print "$msg, bare content at root level ..\n";

} else {
my $dmp_keys = keys (%{$htodump});

if ($dmp_keys > 1) {
print "Warning, multiple elements at root level ..\n";
} else {
($key,$val) = each (%{$htodump});

my $val_type = ref($val);

if ($dmp_keys == 0 || (exists $htodump->{'comment'})) {
print "Warning, no elements at root level ..\n";
}
if ($dmp_keys == 1) {

if ($val_type eq "HASH") {

$htodump = $val if (!$KeepRoot);
}

elsif ($val_type eq "ARRAY") {
$htodump = $val if (!$KeepRoot && $KeepContentOrder);

if (!$ForceArray || scalar(@{$val}) > 1) {
print "Warning, multiple elements at root level ..\n";
}
}
}
}
}
print "\n";

my $tmp = undef;
if (ref($htodump) eq "HASH") {

$tmp = {};
%{$tmp} = %{$htodump};

} elsif (ref($htodump) eq "ARRAY") {
$tmp = [];
@{$tmp} = @{$htodump};
} else {
print "Not a hash or array!\n";
}
print Dumper($tmp) if (defined $tmp);

## -- $1 is text contents --

if (defined $1) {
my $tmp1 = $1;
# if flagged, append it to $ary[last]
if ($append_flag && scalar(@ary) > 0) {
my $size = scalar(@ary);
$ary[$size-1] .= $tmp1;
} else {
push (@ary, $1);
}
$append_flag = 0;
}

## -- $2 is substitution index --
elsif (defined $2) {
## Exist check (Comments stripped?),

# turn on append flag.

# -----------------------------------

if (!exists $hStore->{$2}) {
$append_flag = 1;
next;
}

## CDATA check, append it to $ary[last]
# and turn on append flag.
# ---------------------------------------

if (exists $hcdata_elements->{$2}) {
my $size = scalar(@ary);
if ($size > 0) {
$ary[$size-1] .= $hStore->{$2};
} else {push (@ary, $hStore->{$2});}
$append_flag = 1;
next;
}

$append_flag = 0;

## Substitution of in-line content,
# push it to @ary
# ----------------------------------
if ($KeepContentOrder) {

push (@ary, $hStore->{$2});

next;
}
## Substitution of same level here (normal),
# just store it to $ahref
# -----------------------------------------

my ($key,$val) = each (%{$hStore->{$2}});
if (exists $ahref->{$key}) {
push (@{$ahref->{$key}}, $val);
} else {
$ahref->{$key} = [$val];
}
}

else {} # shouldn't get here
}

# Store contents, strip out

# pure whitespace text elements
my $hary = [];
for (@ary) {
next if (/^\s+$/s);
push (@{$hary}, $_);
}
if (scalar(@{$hary}) > 0) {
$ahref->{'content'} = $hary;
}
## if $val is an array ref and has one element and it
## is a hash ref, set {$key} equal to hash ref
if (!$ForceArray) {
while (my ($key,$val) = each (%{$ahref})) {
if (ref($val) eq "ARRAY") {
if (scalar(@{$val}) == 1 && ref($val->[0]) eq "HASH") {
$ahref->{$key} = $val->[0];
}
}
}
}
return $ahref;
}

sub ProcessAltDebugInfo
{
}

__END__

# -----------------------
# XML Regex Parser

# Version .903 - 12/31/05

******************************
XML string:
---------------
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>ActivePerl 5.6 Change Log</title>
<link rel="stylesheet" href="Active.css" type="text/css" />
<link rev="made" href="mailto:" />
</head>

<body>

<a name="__index__"></a>

<!--

<ul>

<li><a href="#activeperl_5_6_change_log">ActivePerl 5.6 Change Log</a></li>
<ul>

<li><a href="#build_638_thursday__apr_15__2004">Build 638 Thursday, Apr 15, 2004</a></li>
<li><a href="#build_635_thursday__feb_6__2003">Build 635 Thursday, Feb 6, 2003</a></li>
<li><a href="#build_633_monday__june_17__2002">Build 633 Monday, June 17, 2002</a></li>
<li><a href="#build_632_monday__june_3__2002">Build 632 Monday, June 3, 2002</a></li>
<li><a href="#build_631_monday__december_31__2001">Build 631 Monday, December 31, 2001</a></li>
<li><a href="#build_630_wednesday__october_30__2001">Build 630 Wednesday, October 30, 2001</a></li>
<li><a href="#build_629_thursday__august_23__2001">Build 629 Thursday, August 23, 2001</a></li>
<li><a href="#build_628_thursday__july_5__2001">Build 628 Thursday, July 5, 2001</a></li>
<li><a href="#build_626_thursday__may_1__2001">Build 626 Thursday, May 1, 2001</a></li>
<li><a href="#build_623_sunday__december_12__2000">Build 623 Sunday, December 12, 2000</a></li>
<li><a href="#build_622_sunday__november_5__2000">Build 622 Sunday, November 5, 2000</a></li>
<li><a href="#build_620_sunday__october_29__2000">Build 620 Sunday, October 29, 2000</a></li>
<li><a href="#build_618_tuesday__september_12__2000">Build 618 Tuesday, September 12, 2000</a></li>
<li><a href="#build_617_thursday__august_31__2000">Build 617 Thursday, August 31, 2000</a></li>
<li><a href="#build_616_friday__july_14__2000">Build 616 Friday, July 14, 2000</a></li>
<li><a href="#build_615_thursday__june_29__2000">Build 615 Thursday, June 29, 2000</a></li>
<li><a href="#build_613_thursday__march_23__2000">Build 613 Thursday, March 23, 2000</a></li>
<li><a href="#build_612_wednesday__march_22__2000">Build 612 Wednesday, March 22, 2000</a></li>
<li><a href="#build_611_wednesday__march_15__2000">Build 611 Wednesday, March 15, 2000</a></li>
<li><a href="#build_609_wednesday__march_1__2000">Build 609 Wednesday, March 1, 2000</a></li>
<li><a href="#build_607_friday__february_11__2000">Build 607 Friday, February 11, 2000</a></li>
<li><a href="#build_606_friday__february_4__2000">Build 606 Friday, February 4, 2000</a></li>
<li><a href="#build_604_friday__november_26__1999">Build 604 Friday, November 26, 1999</a></li>
<li><a href="#build_603_tuesday__november_23__1999">Build 603 Tuesday, November 23, 1999</a></li>
<li><a href="#build_602_thursday__august_5__1999">Build 602 Thursday, August 5, 1999</a></li>
<li><a href="#build_601_tuesday__july_13__1999">Build 601 Tuesday, July 13, 1999</a></li>
<li><a href="#what_s_new_in_the_600_series">What's new in the 600 Series</a></li>
</ul>

</ul>
-->

<h1><a name="activeperl_5_6_change_log">ActivePerl 5.6 Change Log</a></h1>
For the latest information on ActivePerl, please see:
<pre>
<a href="http://www.ActiveState.com/ActivePerl/">http://www.ActiveState.com/ActivePerl/</a></pre>


<h2><a name="build_638_thursday__apr_15__2004">Build 638 Thursday, Apr 15, 2004</a></h2>
PPM2 and PPM3
PPM3 has not been updated to the latest version PPM 3.1 as shipped
with the ActivePerl 5.8 series. PPM 3.1 assumes that PPM 2.x is no
longer installed and doesn't synchronize package information with it.
Since PPM2 is the default PPM version in ActivePerl 5.6, PPM3 has been
kept at version 3.0.
Bug Fixes and Changes
<ul>
<li></li>
On Windows, a potential buffer overrun in the <code>stat()</code> function has been
fixed.

<li></li>
On Windows, a handle leak in <code>kill()</code> has been fixed.

<li></li>
On Windows, a memory leak in <code>fork()</code> has been fixed.

<li></li>
On Windows NT and later, subprocesses are now started via ``cmd /x/d/c''
instead of ``cmd /x/c''. This disables execution of AutoRun command
specified in the registry.

<li></li>
On Windows, the four-argument form of <code>select()</code> did not report the
$! (errno) value properly after errors. This has been corrected.

<li></li>
Win32::GetOSVersion() returns additional information about the system
(when available, Windows NT SP6 and later).

<li></li>
Perl for ISAPI would sometimes close a filehandle twice. This leads
to a race condition where another thread could have reused the
filehandle before the second close would be executed. This usually
happens in high load scenarios. Typical symptoms include error
messages that Perl could not load standard modules, even though they
are installed on the server.
Perl for ISAPI no longer closes filehandles implicitly and relies now
on the application to properly clean up file and socket handle
resources.

<li></li>
Perl for ISAPI now avoids closing the special handles STDIN, STDOUT
and STDERR, even if the script asked for that explicitly.

<li></li>
The following bundled modules have been updated to their latest
versions:
<pre>
Archive-Tar
Compress-Zlib
Digest
Digest-MD2
Digest-MD5
Digest-SHA1
File-CounterFile
HTML-Parser
HTML-Tree
libnet
libwin32
libwww-perl
MD5
MIME-Base64
Storable
Test-Harness
URI</pre>
The following modules have been added to ActivePerl:
<pre>
Data-Dump
IO-Zlib
Test-Simple</pre>

<li></li>
Other minor bug fixes and documentation updates.
</ul>


<h2><a name="build_635_thursday__feb_6__2003">Build 635 Thursday, Feb 6, 2003</a></h2>
Fixes for Security Issues
<ul>
<li></li>
On Linux, the <code>crypt()</code> builtin did not return consistent results.
This has been corrected.

***** cut off here ******

Output:
---------------

** Itterations = 3
** Debug = 0
** Rmv white space = 0
** ForceArray = 0
** KeepRoot = 1
** KeepComments = 1
** KeepContentOrder = 1

$VAR1 = {
'html' => {
'xmlns' => 'http://www.w3.org/1999/xhtml',
'content' => [
{
'head' => [
{
'title' => 'ActivePerl 5.6 Change Log'
},
{
'link' => {
'rel' => 'stylesheet',
'href' => 'Active.css',
'type' => 'text/css'
}
},
{
'link' => {
'href' => 'mailto:',
'rev' => 'made'
}
}
]
},
{
'body' => [
{
'p' => {
'a' => {
'name' => '__index__'
}
}
},
{
'comment' => ' INDEX BEGIN '
},
{
'comment' => '

<ul>

<li><a href="#activeperl_5_6_change_log">ActivePerl 5.6 Change Log</a></li>
<ul>

<li><a href="#build_638_thursday__apr_15__2004">Build 638 Thursday, Apr 15, 2004</a></li>
<li><a href="#build_635_thursday__feb_6__2003">Build 635 Thursday, Feb 6, 2003</a></li>
<li><a href="#build_633_monday__june_17__2002">Build 633 Monday, June 17, 2002</a></li>
<li><a href="#build_632_monday__june_3__2002">Build 632 Monday, June 3, 2002</a></li>
<li><a href="#build_631_monday__december_31__2001">Build 631 Monday, December 31, 2001</a></li>
<li><a href="#build_630_wednesday__october_30__2001">Build 630 Wednesday, October 30, 2001</a></li>
<li><a href="#build_629_thursday__august_23__2001">Build 629 Thursday, August 23, 2001</a></li>
<li><a href="#build_628_thursday__july_5__2001">Build 628 Thursday, July 5, 2001</a></li>
<li><a href="#build_626_thursday__may_1__2001">Build 626 Thursday, May 1, 2001</a></li>
<li><a href="#build_623_sunday__december_12__2000">Build 623 Sunday, December 12, 2000</a></li>
<li><a href="#build_622_sunday__november_5__2000">Build 622 Sunday, November 5, 2000</a></li>
<li><a href="#build_620_sunday__october_29__2000">Build 620 Sunday, October 29, 2000</a></li>
<li><a href="#build_618_tuesday__september_12__2000">Build 618 Tuesday, September 12, 2000</a></li>
<li><a href="#build_617_thursday__august_31__2000">Build 617 Thursday, August 31, 2000</a></li>
<li><a href="#build_616_friday__july_14__2000">Build 616 Friday, July 14, 2000</a></li>
<li><a href="#build_615_thursday__june_29__2000">Build 615 Thursday, June 29, 2000</a></li>
<li><a href="#build_613_thursday__march_23__2000">Build 613 Thursday, March 23, 2000</a></li>
<li><a href="#build_612_wednesday__march_22__2000">Build 612 Wednesday, March 22, 2000</a></li>
<li><a href="#build_611_wednesday__march_15__2000">Build 611 Wednesday, March 15, 2000</a></li>
<li><a href="#build_609_wednesday__march_1__2000">Build 609 Wednesday, March 1, 2000</a></li>
<li><a href="#build_607_friday__february_11__2000">Build 607 Friday, February 11, 2000</a></li>
<li><a href="#build_606_friday__february_4__2000">Build 606 Friday, February 4, 2000</a></li>
<li><a href="#build_604_friday__november_26__1999">Build 604 Friday, November 26, 1999</a></li>
<li><a href="#build_603_tuesday__november_23__1999">Build 603 Tuesday, November 23, 1999</a></li>
<li><a href="#build_602_thursday__august_5__1999">Build 602 Thursday, August 5, 1999</a></li>
<li><a href="#build_601_tuesday__july_13__1999">Build 601 Tuesday, July 13, 1999</a></li>
<li><a href="#what_s_new_in_the_600_series">What\'s new in the 600 Series</a></li>
</ul>

</ul>
'
},
{
'comment' => ' INDEX END '
},
{
'p' => {}
},
{
'h1' => {
'a' => {
'content' => 'ActivePerl 5.6 Change Log',
'name' => 'activeperl_5_6_change_log'
}
}
},
{
'p' => 'For the latest information on ActivePerl, please see:'
},
{
'pre' => {
'a' => {
'href' => 'http://www.ActiveState.com/ActivePerl/',
'content' => 'http://www.ActiveState.com/ActivePerl/'
}
}
},
{
'p' => {}
},
{
'h2' => {
'a' => {
'content' => 'Build 638 Thursday, Apr 15, 2004',
'name' => 'build_638_thursday__apr_15__2004'
}
}
},
{
'p' => {
'em' => 'PPM2 and PPM3'
}
},
{
'p' => [
'PPM3 has ',
{
'strong' => 'not'
},
' been updated to the latest version PPM 3.1 as shipped
with the ActivePerl 5.8 series. PPM 3.1 assumes that PPM 2.x is no
longer installed and doesn\'t synchronize package information with it.
Since PPM2 is the default PPM version in ActivePerl 5.6, PPM3 has been
kept at version 3.0.'
]
},
{
'p' => {
'em' => 'Bug Fixes and Changes'
}
},
{
'ul' => [
{
'li' => ''
},
'
On Windows, a potential buffer overrun in the ',
{
'code' => 'stat()'
},
' function has been
fixed.
',
{
'p' => ''
},
{
'li' => ''
},
'
On Windows, a handle leak in ',
{
'code' => 'kill()'
},
' has been fixed.
',
{
'p' => ''
},
{
'li' => ''
},
'
On Windows, a memory leak in ',
{
'code' => 'fork()'
},
' has been fixed.
',
{
'p' => ''
},
{
'li' => ''
},
'
On Windows NT and later, subprocesses are now started via ``cmd /x/d/c\'\'
instead of ``cmd /x/c\'\'. This disables execution of AutoRun command
specified in the registry.
',
{
'p' => ''
},
{
'li' => ''
},
'
On Windows, the four-argument form of ',
{
'code' => 'select()'
},
' did not report the
$! (errno) value properly after errors. This has been corrected.
',
{
'p' => ''
},
{
'li' => ''
},
'
Win32::GetOSVersion() returns additional information about the system
(when available, Windows NT SP6 and later).
',
{
'p' => ''
},
{
'li' => ''
},
'
Perl for ISAPI would sometimes close a filehandle twice. This leads
to a race condition where another thread could have reused the
filehandle before the second close would be executed. This usually
happens in high load scenarios. Typical symptoms include error
messages that Perl could not load standard modules, even though they
are installed on the server.
',
{
'p' => 'Perl for ISAPI no longer closes filehandles implicitly and relies now
on the application to properly clean up file and socket handle
resources.'
},
{
'p' => ''
},
{
'li' => ''
},
'
Perl for ISAPI now avoids closing the special handles STDIN, STDOUT
and STDERR, even if the script asked for that explicitly.
',
{
'p' => ''
},
{
'li' => ''
},
'
The following bundled modules have been updated to their latest
versions:
',
{
'pre' => '
Archive-Tar
Compress-Zlib
Digest
Digest-MD2
Digest-MD5
Digest-SHA1
File-CounterFile
HTML-Parser
HTML-Tree
libnet
libwin32
libwww-perl
MD5
MIME-Base64
Storable
Test-Harness
URI'
},
{
'p' => 'The following modules have been added to ActivePerl:'
},
{
'pre' => '
Data-Dump
IO-Zlib
Test-Simple'
},
{
'p' => ''
},
{
'li' => ''
},
'
Other minor bug fixes and documentation updates.
',
{
'p' => ''
}
]
},

********** cut off here ************

robic0

unread,

Jan 1, 2006, 7:01:26 PM1/1/06

to

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

Final edition .904
This is the last, it will not be continued. Further processing could
have been done for all tag entities and alot of specialized handling.
However due to the recent awareness of the latencies involved in regexp
(re)searches, the "inner" substitution method is not viable in this context.

Thats all I will say on this. Content is not examined for xml reserved
characters (ie: &amp, etc..), although its easily added. Kind of too bad,
this method has alot of potential.

Anyway, for now its just an exercise in xml/html form and structure.
This status could change in the future as I do a more detailed post-mortum.
Should it change significantly, you won't hear about it in the forums.
I don't anticipate that being the case, but who knows.

-robic0-

Changes:
- Added to regexp, white space to account for <tag />

print <<EOM;

# -----------------------
# XML Regex Parser

# Version .904 - 12/31/05

use strict;
use warnings;
use Data::Dumper;

open DATA, "your.html" or die "can't open your.html...";

my $gabage1 = join ('', <DATA>);
close DATA;

print "here\n";
my @xml_strings = ($gabage1);

my $alt_debug = 0;
my $VERSION = .904;

my $debug = 0;
my $rmv_white_space = 0;
my $ForceArray = 0;

my $KeepRoot = 0;
my $KeepComments = 0;

my $KeepContentOrder = 0;

## -- XML, start & end regexp substitution delimiter chars --
## match side , substitution side
## -------------------------/-------------------------------
my (@S_dlim, @E_dlim);
if ($debug) {
@S_dlim = ('\[' , '['); # use these for debug
@E_dlim = ('\]' , ']');
} else {
@S_dlim = (chr(140) , chr(140)); # use these for production
@E_dlim = (chr(141) , chr(141));
}

## -- Process xml data --
##
for (@xml_strings)
{

#print "\n",'*'x30,"\nXML string:\n",'-'x15,"\n$_\n\nOutput:\n",'-'x15,"\n\n" if ($debug);

if ($alt_debug) {
ProcessAltDebugInfo ($_) ;
print "\n";
}
my $ROOT = {}; # container
my %cdata_elements = ();
my ($last_cnt, $cnt, $i, $attr_error) = (-1, 1, 0, 0);

## Comment/CDATA block ==================================
#### To be done first -
# -- Questionable Comments --

while (s/()/$S_dlim[1]$cnt$E_dlim[1]/) {

# -- META -- <META info> - Placeholder, voided
while (s/<META([^<>]*)>//) {
#while (s/<META([^<>]*)>/$S_dlim[1]$cnt$E_dlim[1]/) {
print "$cnt META = $1\n" if ($debug);
$ROOT->{$cnt} = { 'META' => $1 };

$cnt++;
}
#### White space removal before tags ? .. TBD -
if ($rmv_white_space) {
s/>[\s]+</></g;
s/[\s]+</</g;
s/>[\s]+/>/g;
}

#### Tags here - should only need 2 iterations max
my $finished = 0;

while ($cnt != $last_cnt && $i < 20)
{
$last_cnt = $cnt;

## <Tag/> , no content

while (s/<([0-9a-zA-Z]+)[\s]*\/>/$S_dlim[1]$cnt$E_dlim[1]/) {

print "$cnt <$1> = \n" if ($debug);
$ROOT->{$cnt} = { $1 => '' };
$cnt++;
}
## <Tag Attributes/> , no content
while (s/<([0-9a-zA-Z]+)([\s]+[0-9a-zA-Z]+[\s]*=[\s]*["'][^<]*['"])+[\s]*\/>/$S_dlim[1]$cnt$E_dlim[1]/) {
print "$cnt <$1> = attr: $2\n" if ($debug);
my $hattrib = getAttrHash($2);
if (ref($hattrib) ne "HASH") {
print "Invalid token in attribute asignment:\n$hattrib\n"; $attr_error = 1; last;
}
$ROOT->{$cnt} = { $1 => $hattrib };
$cnt++;
}
## <Tag> Content </Tag>

while (s/<([0-9a-zA-Z]+)>([^<]*)<[\s]*\/\1>/$S_dlim[1]$cnt$E_dlim[1]/) {

print "$cnt <$1> = $2\n" if ($debug);
my $unknown = '';
if (length($2) > 0) {
my $hcontent = getContentHash($2, $ROOT, \%cdata_elements);
$unknown = $hcontent;
if (keys (%{$hcontent}) > 1) {
if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
} else {
if (exists $hcontent->{'content'}) {
my ($key);
if (!$ForceArray ) {
if (ref($hcontent->{'content'}) eq "ARRAY" && scalar(@{$hcontent->{'content'}}) == 1) {
$unknown = ${$hcontent->{'content'}}[0];
}
else {$unknown = $hcontent->{'content'}; }
}
}
if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
}
}
$ROOT->{$cnt} = { $1 => $unknown };
$cnt++;
}
last if ($attr_error);
## <Tag Attributes> Content </Tag>

while (s/<([0-9a-zA-Z]+)([\s]+[0-9a-zA-Z]+[\s]*=[\s]*["'][^<]*['"][\s]*)+[\s]*>([^<]*)<[\s]*\/\1>/$S_dlim[1]$cnt$E_dlim[1]/) {

while ($attrstr =~ s/[\s]*([0-9a-zA-Z]+)[\s]*=[\s]*("|')([^=]*)\2[\s]*//) {

Matt Garrish

unread,

Jan 1, 2006, 9:30:02 PM1/1/06

to

<robic0> wrote in message news:96qgr1p7h24qb3nlg...@4ax.com...

> On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:
>
> Final edition .904
> This is the last, it will not be continued. Further processing could
> have been done for all tag entities and alot of specialized handling.
> However due to the recent awareness of the latencies involved in regexp
> (re)searches, the "inner" substitution method is not viable in this
> context.
>
> Thats all I will say on this. Content is not examined for xml reserved
> characters (ie: &amp, etc..), although its easily added. Kind of too bad,
> this method has alot of potential.
>

Well, at least you had the good sense to abandon the sinking ship...

Matt

robic0

unread,

Jan 2, 2006, 5:58:04 PM1/2/06

to

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

Post-Mortum conclusions:

I have cut out 3 test examples on the post-mortum.
1. Substitution data store is a anonymous hash.
2. Data store is bypassed, just null regexp substitution is done.
3. Substitution data store is a named array.

I'm posting #3 here for reference and the final code v .905
A note of interest between #1 and #3 on a test of 15,000
substitutions (large html file), there was no discernable time
difference using hash lookups as opposed to array indexes.
This shows that hash lookups involve just a pointer + offset
math as is done with array index addressing.

The time killer here was #2, not #1 or #3.
This makes sense when doing inner substitution since the search
position has to be reset at every pass.

This code might be a usefull to debug the inner data structure
of xml/html beyond just validation (by itself). Since at the
time of failure or possibly no failure but bad structure, the inner
structures grown up until then are retained for post-mortum.
This would be in contrast to an outter to inner parse.
In that case, since the data structure is build from the outside in,
there wouldn't be much of a starting point for analysis.

Other than that and from what was mentioned about the #2 time killer,
about the only use given the limitations here would be as a debug
tool. The code is readily alterred to filter needed structures.

Let me know if you have any questions.
-robic0-

print <<EOM;

# -----------------------
# XML Regex Parser

# Version .905 - 12/31/05

use strict;
use warnings;
use Data::Dumper;

open DATA, "config.html" or die "can't open config.html...";

my $gabage1 = join ('', <DATA>);
close DATA;

my @xml_strings = ($gabage1);

my $alt_debug = 0;
my $VERSION = .905;

my $debug = 0;
my $rmv_white_space = 0;
my $ForceArray = 0;
my $KeepRoot = 0;

my $KeepComments = 1;
my $KeepContentOrder = 1;

## -- XML, start & end regexp substitution delimiter chars --

## match side , substitution side
## -------------------------/-------------------------------
my (@S_dlim, @E_dlim);
if ($debug) {
@S_dlim = ('\[' , '['); # use these for debug
@E_dlim = ('\]' , ']');
} else {
@S_dlim = (chr(140) , chr(140)); # use these for production
@E_dlim = (chr(141) , chr(141));
}

## -- Process xml data --
##
for (@xml_strings)
{
#print "\n",'*'x30,"\nXML string:\n",'-'x15,"\n$_\n\nOutput:\n",'-'x15,"\n\n" if ($debug);
if ($alt_debug) {
ProcessAltDebugInfo ($_) ;
print "\n";
}

my (@ROOT, %cdata_elements);
@ROOT = (); # container
%cdata_elements = ();

my ($last_cnt, $cnt, $i, $attr_error) = (-1, 1, 0, 0);

## Comment/CDATA block ==================================
#### To be done first -
# -- Questionable Comments --

while (s/()/$S_dlim[1]$cnt$E_dlim[1]/s) {

print "$cnt = Questionable comment: $1\n" if ($debug);

$ROOT[$cnt] = $1;

$cnt++;
}
#### To be done second -
# -- Real CDATA --
while (s/<!\[CDATA\[(.*?)\]\]>/$S_dlim[1]$cnt$E_dlim[1]/s)
{
# reconstitute cdata contents
my $cdata_contents = $1;
my $str = '';
while ($cdata_contents =~ s/([^$S_dlim[0]$E_dlim[0]]+)|$S_dlim[0]([\d]+)$E_dlim[0]//) {
if (defined $1) {
$str .= $1;

} elsif (defined $2 && defined $ROOT[$2]) {
$str .= $ROOT[$2];
delete $ROOT[$2];

} else {} # shouldn't get here
}
print "$cnt CDATA = $str\n" if ($debug);

$ROOT[$cnt] = $str;

$cdata_elements{$cnt} = '';
$cnt++;
}
#### To be done third -
# -- Real Comments are left --

for (my $ndx = 1; $ndx < @ROOT; $ndx++) {
if (!exists $cdata_elements{$ndx}) {
$ROOT[$ndx] =~ s/^$/$1/s;
print "$ndx Comment = $1\n" if ($debug);
if ($KeepComments) {
$ROOT[$ndx] = { comment => $1 };
} else {delete $ROOT[$ndx];}

}
}
## End Comment/CDATA block ==============================

#### Non-tag markups go here -
####

# -- Versioning -- <?XML-Version ?> - Placeholder, voided
while (s/<\?([^<>]*)\?>//) {
#while (s/<\?([^<>]*)\?>/$S_dlim[1]$cnt$E_dlim[1]/) {
print "$cnt <? ?> = $1\n" if ($debug);

$ROOT[$cnt] = { 'XMLV' => $1 };

$cnt++;
}
# -- DOCTYPE -- <!DOCTYPE info> - Placeholder, voided
while (s/<!DOCTYPE([^<>]*)>//) {
#while (s/<!DOCTYPE([^<>]*)>/$S_dlim[1]$cnt$E_dlim[1]/) {
print "$cnt DOCTYPE = $1\n" if ($debug);

$ROOT[$cnt] = { 'DOCTYPE' => $1 };

$cnt++;
}
# -- META -- <META info> - Placeholder, voided
while (s/<META([^<>]*)>//) {
#while (s/<META([^<>]*)>/$S_dlim[1]$cnt$E_dlim[1]/) {
print "$cnt META = $1\n" if ($debug);

$ROOT[$cnt] = { 'META' => $1 };

$cnt++;
}
#### White space removal before tags ? .. TBD -
if ($rmv_white_space) {
s/>[\s]+</></g;
s/[\s]+</</g;
s/>[\s]+/>/g;
}
#### Tags here - should only need 2 iterations max
my $finished = 0;

while ($cnt != $last_cnt && $i < 20)
{
$last_cnt = $cnt;

## <Tag/> , no content

while (s/<([0-9a-zA-Z]+)[\s]*\/>/$S_dlim[1]$cnt$E_dlim[1]/s) {

print "$cnt <$1> = \n" if ($debug);

$ROOT[$cnt] = { $1 => '' };

$cnt++;
}
## <Tag Attributes/> , no content

while (s/<([0-9a-zA-Z]+)([\s]+[0-9a-zA-Z]+[\s]*=[\s]*["'][^<]*['"])+[\s]*\/>/$S_dlim[1]$cnt$E_dlim[1]/s) {

print "$cnt <$1> = attr: $2\n" if ($debug);
my $hattrib = getAttrHash($2);
if (ref($hattrib) ne "HASH") {
print "Invalid token in attribute asignment:\n$hattrib\n"; $attr_error = 1; last;
}

$ROOT[$cnt] = { $1 => $hattrib };

$cnt++;
}
## <Tag> Content </Tag>

while (s/<([0-9a-zA-Z]+)>([^<]*)<[\s]*\/\1>/$S_dlim[1]$cnt$E_dlim[1]/s) {

print "$cnt <$1> = $2\n" if ($debug);
my $unknown = '';
if (length($2) > 0) {

my $hcontent = getContentHash($2, \@ROOT, \%cdata_elements);

$unknown = $hcontent;
if (keys (%{$hcontent}) > 1) {
if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
} else {
if (exists $hcontent->{'content'}) {
my ($key);
if (!$ForceArray ) {
if (ref($hcontent->{'content'}) eq "ARRAY" && scalar(@{$hcontent->{'content'}}) == 1) {
$unknown = ${$hcontent->{'content'}}[0];
}
else {$unknown = $hcontent->{'content'}; }
}
}
if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
}
}

$ROOT[$cnt] = { $1 => $unknown };

$cnt++;
}
last if ($attr_error);
## <Tag Attributes> Content </Tag>

while (s/<([0-9a-zA-Z]+)([\s]+[0-9a-zA-Z]+[\s]*=[\s]*["'][^<]*['"][\s]*)+[\s]*>([^<]*)<[\s]*\/\1>/$S_dlim[1]$cnt$E_dlim[1]/s) {

print "$cnt <$1> = attr: $2, content: $3\n" if ($debug);
my $hattrib = getAttrHash($2);
if (ref($hattrib) ne "HASH") {
print "Invalid token in attribute asignment:\n$hattrib\n"; $attr_error = 1; last;
}
if (length($3) > 0) {

my $hcontent = getContentHash($3, \@ROOT, \%cdata_elements);

if (!$ForceArray) { adjustForSingleItemArrays ($hcontent); }
while (my ($key,$val) = each (%{$hcontent})) {
$hattrib->{$key} = $val;
}
}

$ROOT[$cnt] = { $1 => $hattrib };

if (defined $ROOT[$outer_element])
{
my $hroot = $ROOT[$outer_element];

while ($attrstr =~ s/[\s]*([0-9a-zA-Z]+)[\s]*=[\s]*("|')([^=]*)\2[\s]*//s) {

if (!defined $hStore->[$2]) {

sub ProcessAltDebugInfo
{
}

__END__

robic0

unread,

Jan 6, 2006, 1:04:15 AM1/6/06

to

On Tue, 20 Dec 2005 23:59:06 -0800, robic0 wrote:

I'm posting v906 with ':' in the regex so "xsd" can
and should be interpreted.
I did this before but I can't find it anywhere.
It was tested at one time with that code but can't gurantee
it now cause I just pushed that char in the regex and nobody
gives a shit. Btw, if you think this is mental masturbation,
this is nothing compared to this forum, the "cream" of the crop!
Should work, if it doesent, don't blama me. hahaha
As if you give a rats ass..

Here it is. I may post some grease, regex code with a new
substitution paradigm. So far, it looks as though you could
care less if I live or die. Maybe I'll chunk it down to
10 assembly lines. Hahaha. There u go .... to the bottom of
my list.

print <<EOM;

# -----------------------
# XML Regex Parser

# by robic0-At-yahoo.com
# -----------------------
EOM

use strict;
use warnings;
use Data::Dumper;

open DATA, "config.html" or die "can't open config.html...";
my $gabage1 = join ('', <DATA>);
close DATA;

my @xml_strings = ($gabage1);

my $alt_debug = 0;
my $VERSION = .906;