[coldsync-hackers] New Perl question; Do I need to keep both a hash and an array?


Izzy Blacklock

Jun 7, 2003, 3:32:37 PM
to coldsync...@googlegroups.com
First off, I discovered my bug of using @_ not $_ in my previous post. Oops!
Other than that, it seems to work like a charm. ;)

Here's another area in my code that's been bugging me though.

# XXX This is where I parse the TitraxNamesDB data for further use
# Below. It'd be nice not to need to keep both an array and a hash,
# but I can't think of how at the moment. The hash makes most sense
# for extracting data from Datebook, but an array is needed to
# match the project name with the notes out of the TitraxNoteDB.
my %NamesHash;
my @NamesArray;
my $record;
my $i=0;
foreach $record (@{$pdbs{TitraxNameDB}->{records}})
{
    my $text = $record->{data};
    $text =~ s/\\/\\\\/g;
    $NamesHash{$text} = 1;
    $NamesArray[$i] = $text;
    $i++;
}

Is there a way around needing to populate both an array and a hash? Here's
how each is being used:

foreach $record (@{$pdbs{DatabookDB}->{records}})
{
    next unless $NamesHash{ $record->{"description"} };

    .... Do stuff ...

}

and
$i=0;
foreach $record (@{$pdbs{TitraxNoteDB}->{records}})
{
    $description = $NamesArray[$i];

    ..... Do more stuff ....

    $i++;
}

Is there a way to do this without using both an array and a hash?

...Izzy

--
This message was sent through the coldsync-hackers mailing list. To remove
yourself from this mailing list, send a message to majo...@thedotin.net
with the words "unsubscribe coldsync-hackers" in the message body. For more
information on Coldsync, send mail to coldsync-ha...@thedotin.net.

Christophe Beauregard

Jun 7, 2003, 7:45:48 PM
to coldsync...@googlegroups.com
On Saturday 07 June 2003 15:32, Izzy Blacklock wrote:
> Is there a way around needing to populate both an array and a hash?

Populating, yes. Using, maybe.

If order matters, a hash by itself won't quite work. For your first usage
where you're essentially just doing a search against a known value,
something like this works fine:

my %NamesHash;
foreach my $record (@{$pdbs{TitraxNameDB}->{records}}) {
    my $text = $record->{data};
    $text =~ s/\\/\\\\/g;

    $NamesHash{$text} = $text;
}

All you'd need to do now is use the "exists" function:

foreach my $record (@{$pdbs{DatabookDB}->{records}}) {
    next unless exists $NamesHash{ $record->{"description"} };
    ... stuff ...
}
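For anyone who wants to try the lookup in isolation, here is a minimal sketch. It swaps the real %pdbs data for a couple of made-up records, and the helper names (build_name_set, matching_records) are invented for the example.

```perl
use strict;
use warnings;

# Build a set of project names; only the keys matter.
sub build_name_set {
    my ($records) = @_;
    my %names;
    for my $record (@$records) {
        my $text = $record->{data};
        $text =~ s/\\/\\\\/g;    # same backslash doubling as in the thread
        $names{$text} = 1;
    }
    return \%names;
}

# Keep only the records whose description is a known project name.
sub matching_records {
    my ($names, $records) = @_;
    return grep { exists $names->{ $_->{description} } } @$records;
}

# Hypothetical sample data standing in for the real databases.
my $names = build_name_set([ { data => 'WAM' }, { data => 'WAM2' } ]);
my @hits  = matching_records($names, [
    { description => 'WAM' },
    { description => 'Lunch' },
]);
print scalar(@hits), " matching record(s)\n";
```

With a set like this the stored value is irrelevant, so 1 works as well as the name itself.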

For your second example, things are a little more complicated. First,
however, I have to ask a question... Are you absolutely positive that the
record ordering between two different databases is going to be the same?
You seem to be assuming that record $i in TitraxNameDB is the same as
record $i in TitraxNoteDB. I'm not sure how Titrax organizes its databases,
so I have to ask. Record ordering isn't always under the control of one
app. Sync, for example, could mess with it.

Assuming you _can_ make the assumption, you need to modify the creation code
as follows:

my %NamesHash;
my $i = 0;
foreach my $record (@{$pdbs{TitraxNameDB}->{records}}) {
    my $text = $record->{data};
    $text =~ s/\\/\\\\/g;

    $NamesHash{$text} = $i++;
}

In order to use it, you're almost certainly going to need to make some kind
of temporary array similar to what you already had because you need to
preserve the ordering. Something like:

my @nhSorted =
    sort { $NamesHash{$a} <=> $NamesHash{$b} } (keys %NamesHash);

my $i = 0;
foreach my $record (@{$pdbs{TitraxNoteDB}->{records}}) {
    my $description = $nhSorted[$i++];

    ... stuff ...
}
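A runnable sketch of that round trip, with a plain list of names standing in for the TitraxNameDB records. The helper names are made up, and it assumes project names are unique: a repeated name would overwrite the earlier index, and the rebuilt array would no longer line up with the record numbers.

```perl
use strict;
use warnings;

# Map each name to its position in the input list.
sub index_names {
    my (@names) = @_;
    my %index;
    my $i = 0;
    $index{$_} = $i++ for @names;
    return \%index;
}

# Rebuild the ordered list by sorting the keys on their stored index.
sub names_in_order {
    my ($index) = @_;
    my @sorted = sort { $index->{$a} <=> $index->{$b} } keys %$index;
    return @sorted;
}

my $index = index_names(qw(WAM WAM2 Admin));
my @order = names_in_order($index);
print join(',', @order), "\n";
```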

The gist is that you don't need to build two separate objects that have
essentially the same data. You can generate the array from the hash. You
could also do it the other way around if you needed to.
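Going the other way is shorter still: keep only the ordered array and derive the lookup hash with map when it is needed. A small sketch, with lookup_from_list as an invented helper name:

```perl
use strict;
use warnings;

# Turn an ordered list of names into a set for membership tests.
sub lookup_from_list {
    my (@names) = @_;
    my %seen = map { $_ => 1 } @names;
    return \%seen;
}

my @names = qw(WAM WAM2 Admin);
my $seen  = lookup_from_list(@names);
print "WAM2 is a project\n" if exists $seen->{WAM2};
print "Lunch is not\n"      unless exists $seen->{Lunch};
```

The array stays the single source of truth, and the hash can be thrown away and rebuilt at any time.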

I would, however, question the ordering. If I were doing something where I
was creating two separate cross-referenced databases, I'd probably be using
some kind of record identifier. Palm records all have unique id's, so that
would be my first guess. I don't suppose TitraxNoteDB records have a field
something like "nameid" or some such, huh?

c.

Izzy Blacklock

Jun 10, 2003, 12:23:18 AM
to coldsync...@googlegroups.com
On June 7, 2003 05:45 pm, Christophe Beauregard wrote:

> All you'd need to do now is use the "exists" function:
>
> foreach my $record (@{$pdbs{DatabookDB}->{records}}) {
> next unless exists $NamesHash{ $record->{"description"} };
> ... stuff ...
> }

Currently, I'm not using the exists function. ie, my next statement is as
follows:

next unless $NamesHash{ $record->{"description"} };

It seems to be working fine. Is there an advantage to the exists function?
Am I likely to trigger a bug by not using it?

> For your second example, things are a little more complicated. First,
> however, I have to ask a question... Are you absolutely positive that the
> record ordering between two different databases is going to be the same?
> You seem to be assuming that record $i in TitraxNameDB is the same as
> record $i in TitraxNoteDB. I'm not sure how Titrax organizes its databases,
> so I have to ask. Record ordering isn't always under the control of one
> app. Sync, for example, could mess with it.

See below...

> Assuming you _can_ make the assumption, you need to modify the creation
> code as follows:
>
> my %NamesHash;
> my $i = 0;
> foreach my $record (@{$pdbs{TitraxNameDB}->{records}}) {
> my $text = $record->{data};
> $text =~ s/\\/\\\\/g;
> $NamesHash{$text} = $i ++;
> }

I had tried this myself, but couldn't figure out how to use it like an indexed
array.

> In order to use it, you're almost certainly going to need to make some kind
> of temporary array similar to what you already had because you need to
> preserve the ordering. Something like:
>
> my @nhSorted =
> sort { $NamesHash{$a} <=> $NamesHash{$b} } (keys %NamesHash);

This is what I was missing! :) This technique may come in handy for something
else one day, but I don't think it provides any advantage for my current
problem. This method delays the need to reserve memory for the array up
front, at the expense of needing extra CPU time to generate it later. In the
end, I still end up with an array and a hash. :( Not that it should be much
of a problem either way. By default, there shouldn't be more than 50 entries
for project names. I don't know what the upper limit is that Titrax can
handle, but I'm sure the Palm's limited memory will keep it within reason! ;)

> I would, however, question the ordering. If I were doing something where I
> was creating two separate cross-referenced databases, I'd probably be using
> some kind of record identifier. Palm records all have unique id's, so that
> would be my first guess. I don't suppose TitraxNoteDB records have a field
> something like "nameid" or some such, huh?
>
> c.

As far as I can tell, the names and notes do use the same record ordering.
But you may have a point. I checked the ID of the records themselves, and
there doesn't seem to be any correlation between the Name and the Note DBs.
There are however Record index entries in each of the three DBs (TitraxData
is the third). I wasn't sure what they were for before, but now I'm thinking
they are for cross-referencing the databases.

Each of the Record index entries has a 6 digit hex ID that isn't an exact
match between the three files, but may be an encoded index. The first three
digits seem to be the same within a file, but different between the three.
I'm thinking it indicates the record type (ie, name, note, data). The last
three digits could be the record number. It's an incrementing value that IS
the same between the three files.
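To eyeball that pattern, the IDs can be split into their upper and lower three hex digits. This is only a sketch of the guess above: the 12/12-bit split point is an assumption, and split_record_id is a made-up name. Note that the decimal id values pdbdump prints are the same numbers as the hex IDs (9228289 is 0x8cd001).

```perl
use strict;
use warnings;

# Split a 24-bit record ID into its upper and lower 12 bits
# (three hex digits each). The split point is only a guess.
sub split_record_id {
    my ($id) = @_;
    return ( ($id >> 12) & 0xfff, $id & 0xfff );
}

for my $id (0x8cd001, 0x8cd002, 0x378001, 0x378002) {
    my ($prefix, $seq) = split_record_id($id);
    printf "id=0x%06x (%d) prefix=0x%03x seq=%d\n", $id, $id, $prefix, $seq;
}
```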

I'll have to take a closer look at these later. For now, the records do seem
to be indexed by record number as well. I should look over the source code
for Titrax to see if I can make heads or tails of it one day. If you or
anyone else has some insight as to how I should decode this info, please let
me know. Below I've provided sample output of pdbdump. I've snipped all but
the first two Record index and Record entries from each file.

Here's a snip of the pdbdump output for TitraxNamesDB.pdb

Record/resource index header:
00 00 00 00 00 32 |.....2 |

Next index: 0
# records: 50

Record index:
Record index entry 0
00 00 01 e0 00 8c d0 01 |........ |

Offset: 480
Attributes: 0x00
Category: 0
ID: 0x8cd001

Record index entry 1
00 00 01 e4 00 8c d0 02 |........ |

Offset: 484
Attributes: 0x00
Category: 0
ID: 0x8cd002

[....snip.....]

Records:
Record 0
57 41 4d 00 |WAM. |

data -> [WAM]
category -> [0]
offset -> [480]
attributes:
id -> [9228289]

Record 1
57 41 4d 32 00 |WAM2. |

data -> [WAM2]
category -> [0]
offset -> [484]
attributes:
id -> [9228290]

And from TitraxNoteDB.pdb

Record/resource index header:
00 00 00 00 00 32 |.....2 |

Next index: 0
# records: 50

Record index:
Record index entry 0
00 00 01 e0 00 37 80 01 |.....7.. |

Offset: 480
Attributes: 0x00
Category: 0
ID: 0x378001

Record index entry 1
00 00 11 fa 00 37 80 02 |.....7.. |

Offset: 4602
Attributes: 0x00
Category: 0
ID: 0x378002

[ ......snip.......]

Records:
Record 0
data -> [ SNIPPED ]
category -> [0]
offset -> [480]
attributes:
id -> [3637249]

Record 1
data -> [ SNIPPED ]
category -> [0]
offset -> [4602]
attributes:
id -> [3637250]
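The 8 bytes of each record index entry can be decoded by hand. The sketch below assumes the layout the dump itself suggests: a 4-byte big-endian offset, one attribute byte, then a 3-byte ID. parse_index_entry is an invented helper, not part of pdbdump or ColdSync.

```perl
use strict;
use warnings;

# Decode one 8-byte record index entry: 4-byte big-endian offset,
# 1 attribute byte, 3-byte big-endian ID (layout inferred from the dump).
sub parse_index_entry {
    my ($bytes) = @_;
    my ($offset, $attributes, @id) = unpack 'N C C3', $bytes;
    my $id = ($id[0] << 16) | ($id[1] << 8) | $id[2];
    my %entry = (offset => $offset, attributes => $attributes, id => $id);
    return \%entry;
}

# Bytes of "Record index entry 0" from the TitraxNamesDB dump.
my $entry = parse_index_entry(pack 'H*', '000001e0008cd001');
printf "offset=%d attributes=0x%02x id=0x%06x\n",
    @{$entry}{qw(offset attributes id)};
```

Read this way, the index entries carry no data of their own. They only say where each record starts, which matches the offset and id fields shown under Records.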

Christophe Beauregard

Jun 10, 2003, 8:05:46 AM
to coldsync...@googlegroups.com
On Tuesday 10 June 2003 00:23, Izzy Blacklock wrote:

> It seems to be working fine. Is there an advantage to the exists
> function?

Yes, especially if you don't need the value.

my %hash;
$hash{$id} = 0;

exists $hash{$id} evaluates to true.
$hash{$id}, however, is false.

If you're storing, say, array indices as hash values (or anything else where
zero or an empty string is a legit value) exists is the only way to test
for hash membership.
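A small sketch of the difference, using an index-valued hash like the one discussed earlier. key_status is a made-up helper. Strictly speaking, defined also tells zero apart from a missing key, but it cannot tell a key stored with undef from no key at all, so exists is the one test that answers "is this key in the hash?".

```perl
use strict;
use warnings;

# Report how one key looks under three different tests.
sub key_status {
    my ($hash, $key) = @_;
    my %status = (
        exists  => (exists  $hash->{$key}) ? 1 : 0,
        defined => (defined $hash->{$key}) ? 1 : 0,
        true    => $hash->{$key}           ? 1 : 0,
    );
    return \%status;
}

my %index = (WAM => 0, WAM2 => 1, Empty => '');
for my $key (qw(WAM WAM2 Empty Missing)) {
    my $s = key_status(\%index, $key);
    printf "%-8s exists=%d defined=%d true=%d\n",
        $key, @{$s}{qw(exists defined true)};
}
```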

Also, exists has no side effects. A plain read such as $hash{'random text'}
leaves the hash alone, but reading through a missing key into a nested
structure, as in $hash{'random text'}{'x'}, causes the entry
$hash{'random text'} to be added to the hash (autovivification). This will
create problems should you, say, call keys on it.
exists $hash{'random text'} will never cause the entry to be created if it
wasn't already there.
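It is worth checking which kinds of access really create entries. As far as I know, a plain read of a missing key leaves the hash alone, and it is reading through a missing key into a nested structure that creates the intermediate entry (autovivification). A sketch to run on your own Perl, with keys_after as a made-up helper:

```perl
use strict;
use warnings;

# Run one kind of access against a fresh empty hash and
# return how many keys the hash has afterwards.
sub keys_after {
    my ($access) = @_;
    my %hash;
    $access->(\%hash);
    return scalar keys %hash;
}

my $plain_read  = keys_after(sub { my $v = $_[0]{'random text'} });
my $exists_test = keys_after(sub { my $v = exists $_[0]{'random text'} });
my $nested_read = keys_after(sub { my $v = $_[0]{'random text'}{x} });

print "plain read: $plain_read, exists: $exists_test, ",
      "nested read: $nested_read\n";
```

Only the nested read should leave a key behind.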

> Am I likely to trigger a bug by not using it?

The way you're using it now, probably not. But you definitely want to get
into the habit before you start writing stuff where it _will_ bite you in
the ass.

> > my @nhSorted =
> > sort { $NamesHash{$a} <=> $NamesHash{$b} } (keys %NamesHash);
>
> This is what I was missing! :) This technique may come in handy for
> something else one day, but I don't think it provides any advantage for
> my current problem. This method delays the need to reserve memory for
> the array up front, at the expense of needing extra CPU time to generate
> it later. In the end, I still end up with an array and a hash.

As I said in my original message, "if order matters". For something the size
and complexity of a conduit it may not be worth bothering, but as
complexity increases it starts to make a lot of sense to just manage one
data structure and create any derivatives on-the-fly.

> > I would, however, question the ordering. If I were doing something
> > where I was creating two separate cross-referenced databases, I'd
> > probably be using some kind of record identifier. Palm records all have
> > unique id's, so that would be my first guess. I don't suppose
> > TitraxNoteDB records have a field something like "nameid" or some such,
> > huh?
>

> As far as I can tell, the names and notes do use the same record
> ordering. But you may have a point. I checked the ID of the records
> themselves, and there doesn't seem to be any correlation between the Name
> and the Note DBs. There are however Record index entries in each of the
> three DBs (TitraxData is the third). I wasn't sure what they were for
> before, but now I'm thinking they are for cross-referencing the
> databases.

Each record will have its own unique record id. The question, however, is
whether somewhere in the application specific data structure there might be
the id's of _other_ records.

The incrementing low digits probably have more to do with the order in which
entries are created than anything else. The record id is just that... an
increasing number.
It's assigned by the PalmOS database subsystem for every record that's
created. There's no meaning beyond that, I think.

> I'll have to take a closer look at these later. For now, the records do
> seem to be indexed by record number as well. I should look over the
> source code for Titrax to see if I can make heads or tails of it one day.

I might be giving the Titrax developers too much credit, you know. For me,
it would seem nuts to create three databases and just assume that they're
all going to be completely synchronized. PalmOS is a remarkably stable
system, but still...

> If you or anyone else has some insight as to how I should decode this
> info, please let me know. Below I've provided sample output of pdbdump.

data, category, attributes and id are all managed by the PalmOS database
layer. They don't have any specific meaning to Titrax. The data is where
the app-specific stuff is. Normally, data corresponds to a C structure of
some kind. Well, a C structure with variable length strings tacked onto the
end. I'm starting to get the feeling that maybe the Titrax folks decided to
have a separate database for what would normally be just different fields
in a structure in a single database?
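In the TitraxNamesDB dump the data looks like nothing more than a NUL-terminated string, with no struct in front of it. Under that assumption, a sketch of pulling the name out with unpack (record_name is an invented helper):

```perl
use strict;
use warnings;

# Pull a NUL-terminated string off the front of a record's data.
sub record_name {
    my ($data) = @_;
    my ($name) = unpack 'Z*', $data;
    return $name;
}

# Bytes of Record 1 from the TitraxNamesDB dump: "WAM2" plus a NUL.
my $data = pack 'H*', '57414d3200';
print record_name($data), "\n";
```

For a record with fixed fields ahead of the string, the same unpack call would take a longer template, with the fixed-width fields first and Z* last.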

c.

David A. Desrosiers

Jun 10, 2003, 8:14:16 AM
to coldsync...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> As I said in my original message, "if order matters". For something the
> size and complexity of a conduit it may not be worth bothering, but as
> complexity increases it starts to make a lot of sense to just manage one
> data structure and create any derivatives on-the-fly.

There's always Tie::IxHash, if you need ordered hashes.
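A minimal sketch of what that might look like for the project names. Tie::IxHash comes from CPAN rather than core Perl, so the example checks for it first. ordered_names is a made-up helper, and the sample names are placeholders.

```perl
use strict;
use warnings;

# Build an insertion-ordered hash of project names. Dies if the CPAN
# module Tie::IxHash is not installed (it is not part of core Perl).
sub ordered_names {
    my (@names) = @_;
    require Tie::IxHash;
    tie my %names, 'Tie::IxHash';
    $names{$_} = 1 for @names;
    return \%names;
}

if (eval { require Tie::IxHash; 1 }) {
    my $names = ordered_names(qw(WAM WAM2 Admin));
    my $tied  = tied %$names;

    print "has WAM2\n" if exists $names->{WAM2};    # hash-style lookup
    my ($second) = $tied->Keys(1);                  # array-style lookup
    print "name at index 1: $second\n";
    print join(',', keys %$names), "\n";            # insertion order
}
else {
    print "Tie::IxHash not installed\n";
}
```

This gives one structure that answers both "is this a project name?" and "what is name number i?", at the cost of a non-core dependency and the overhead of a tied hash.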

d.

perldoc -qa.j | perl -lpe '($_)=m("(.*)")'

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+5cujkRQERnB1rkoRAvciAJ9X4E7CIR2yfWCgpzY+IuRgi2SkowCg1KuR
zil11u2Y0PJxlAaN0ywdjw0=
=ljIN
-----END PGP SIGNATURE-----

Izzy Blacklock

Jun 10, 2003, 4:54:47 PM
to coldsync...@googlegroups.com
On June 10, 2003 06:05 am, Christophe Beauregard wrote:
> On Tuesday 10 June 2003 00:23, Izzy Blacklock wrote:
> > It seems to be working fine. Is there an advantage to the exists
> > function?
>
> Yes, especially if you don't need the value.
>
> my %hash;
> $hash{$id} = 0;
>
> exists $hash{$id} evaluates to true.
> $hash{$id}, however, is false.
>
> If you're storing, say, array indices as hash values (or anything else
> where zero or an empty string is a legit value) exists is the only way to
> test for hash membership.
>
> Also, exists has no side effects. A plain read such as $hash{'random text'}
> leaves the hash alone, but reading through a missing key into a nested
> structure, as in $hash{'random text'}{'x'}, causes the entry
> $hash{'random text'} to be added to the hash (autovivification). This will
> create problems should you, say, call keys on it.
> exists $hash{'random text'} will never cause the entry to be created if it
> wasn't already there.
>
> > Am I likely to trigger a bug by not using it?
>
> The way you're using it now, probably not. But you definitely want to get
> into the habit before you start writing stuff where it _will_ bite you in
> the ass.

Thanks. I understand the difference now. I'll be sure to add the exists and
get in the habit of using it. In this case, the value should never be 0 or
undef, but you never know what a user might do. Better to be safe than
sorry! :)

>
> > > my @nhSorted =
> > > sort { $NamesHash{$a} <=> $NamesHash{$b} } (keys %NamesHash);
> >
> > This is what I was missing! :) This technique may come in handy for
> > something else one day, but I don't think it provides any advantage for
> > my current problem. This method delays the need to reserve memory for
> > the array up front, at the expense of needing extra CPU time to generate
> > it later. In the end, I still end up with an array and a hash.
>
> As I said in my original message, "if order matters". For something the
> size and complexity of a conduit it may not be worth bothering, but as
> complexity increases it starts to make a lot of sense to just manage one
> data structure and create any derivatives on-the-fly.

Yes, I could see doing this being useful in a large program where the data is
needed in different forms at different times. In this case, I don't see an
advantage. I was hoping there was a way I could use a hash as an indexed
array without needing to build one. Obviously not. No matter. What I've
done works just fine.

Thanks for the feedback though.

> > > I would, however, question the ordering. If I were doing something
> > > where I was creating two separate cross-referenced databases, I'd
> > > probably be using some kind of record identifier. Palm records all have
> > > unique id's, so that would be my first guess. I don't suppose
> > > TitraxNoteDB records have a field something like "nameid" or some such,
> > > huh?
> >
> > As far as I can tell, the names and notes do use the same record
> > ordering. But you may have a point. I checked the ID of the records
> > themselves, and there doesn't seem to be any correlation between the Name
> > and the Note DBs. There are however Record index entries in each of the
> > three DBs (TitraxData is the third). I wasn't sure what they were for
> > before, but now I'm thinking they are for cross-referencing the
> > databases.
>
> Each record will have its own unique record id. The question, however, is
> whether somewhere in the application specific data structure there might be
> the id's of _other_ records.
>
> That probably has more to do with the order in which entries are created
> than anything else. The record id is just that... an increasing number.
> It's assigned by the PalmOS database subsystem for every record that's
> created. There's no meaning beyond that, I think.
>

Ok, so it's not likely of any value to me. Any idea what the Record Index
entries are? They don't have any data, and I haven't seen them in other DBs
I've peeked at. Maybe I haven't looked at enough?

> > I'll have to take a closer look at these later. For now, the records do
> > seem to be indexed by record number as well. I should look over the
> > source code for Titrax to see if I can make heads or tails of it one day.
>
> I might be giving the Titrax developers too much credit, you know. For me,
> it would seem nuts to create three databases and just assume that they're
> all going to be completely synchronized. PalmOS is a remarkably stable
> system, but still...

Sadly, it looks this way. I don't see anything except the note/name data in
the database records. Mind you, I haven't decoded the TitraxDataDB yet.
It's possible it holds some sort of indexing entries. There isn't really a
lot of data in them though. I had assumed it was just the total times that
are kept for each project. I'll have to read the source one of these days to
see what I can learn. Here's a sample of the records.

Record 1
00 01 30 a5 ba d8 8e 35 00 01 00 00 |..0....5.... |

data -> [binary; see the hex dump above]
category -> [0]
offset -> [492]
attributes:
id -> [2240514]

Record 2
00 01 18 d9 ba dd 6d 25 00 01 00 00 |......m%.... |

data -> [binary; see the hex dump above]
category -> [0]
offset -> [504]
attributes:
id -> [2240515]
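One way to start poking at these 12-byte records is to unpack them as big-endian integers and look for patterns. The layout below is pure guesswork, and the field names are placeholders rather than anything from the Titrax source. The middle four bytes happen to fall in the right range for a Palm OS timestamp (seconds since 1904-01-01), which would fit a time tracker, but nothing in this thread confirms that.

```perl
use strict;
use warnings;

# Seconds between the Palm epoch (1904-01-01) and the Unix epoch (1970-01-01).
use constant PALM_EPOCH_OFFSET => 2_082_844_800;

# Guess at a 12-byte layout: two 16-bit words, one 32-bit word, two 16-bit
# words, all big-endian. The field names are placeholders, not Titrax's.
sub explore_data_record {
    my ($data) = @_;
    my ($w0, $w1, $long, $w2, $w3) = unpack 'n n N n n', $data;
    my %fields = (
        words      => [ $w0, $w1, $w2, $w3 ],
        long       => $long,
        maybe_unix => $long - PALM_EPOCH_OFFSET,
    );
    return \%fields;
}

# The two sample records from the dump, as hex strings.
for my $hex ('000130a5bad88e3500010000', '000118d9badd6d2500010000') {
    my $r = explore_data_record(pack 'H*', $hex);
    printf "words=%s long=%u as-date=%s\n",
        join(',', @{ $r->{words} }), $r->{long},
        scalar gmtime $r->{maybe_unix};
}
```

If the printed dates line up with when the projects were last timed, the guess is probably right. If not, the same unpack approach works with a different template.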

> > If you or anyone else has some insight as to how I should decode this
> > info, please let me know. Below I've provided sample output of pdbdump.
>
> data, category, attributes and id are all managed by the PalmOS database
> layer. They don't have any specific meaning to Titrax. The data is where
> the app-specific stuff is. Normally, data corresponds to a C structure of
> some kind. Well, a C structure with variable length strings tacked onto the
> end. I'm starting to get the feeling that maybe the Titrax folks decided to
> have a separate database for what would normally be just different fields
> in a structure in a single database?

Yeah, the three databases should have been made into one. There are a number
of other issues I have with the program. One of these days, I'll spend some
time learning how to program for the Palm so I can fix them. Another project
for another day...

Thanks for all your help/feedback.

...Izzy
