Merging the four MS Access (*.MDB) PUIDs into two...

196 views
Skip to first unread message

ross-spencer

unread,
Mar 18, 2013, 5:33:16 PM3/18/13
to pro...@googlegroups.com

I have been working on an Microsoft Access Validator tool – at least something that could potentially turn into one: https://github.com/exponential-decay/MDBValidator

I have also been keeping notes and updating the Archive Team wiki as I go: http://fileformats.archiveteam.org/wiki/Access

In my work I've come to the conclusion that PRONOM lists two signatures for Access 95 and 97 and two for Access 2000 and 2002 that might want to be merged, respectively. This is because the actual format used between the two versions, 95 and 97, and then 2000/2002 respectively is the same

  • Access 95/97 uses a database called Jet 3 - v 3.0.
  • Access 2000/2002 used a database called Jet 4 - v 4.0.

The difference between the two is essentially how big the database can be and the size of variables that can be held within that newly sized data structure.

The current signatures PRONOM adopts read the Jet version at the beginning of file and then a variably positioned field later on in the database that relates to the version of access that created it. The variable part of the signature for 95/95 for example, begins ASCII: 'AccessVersion' and then the version number within {0-1024} bytes.

The information being read is simply stored by Access in what is termed a ‘Data Page’ when it is created. This data page can be located anywhere within the data stream and is one of many throughout the stream.

What I discovered through the research is that the version of the database doesn’t change between Microsoft Access versions. Jet 3 and 4 are coupled closely with their respective access versions. What changes between access versions is the DLL that reads and writes to the database. I’ve documented more on the archive team wiki. While this is important information it is distinct from the ‘file format’ alone which is what PRONOM should be recording.

The change might also result in modifying the signature to remove the search for the Access strings: 'AccessVersion' and 'A.c.c.e.s.s.V.e.r.s.i.o.n'. DROID might simply look for the Jet DB version.

The advantage of this (but by no means a sole reason for change) is that the current search for the Access version and build number is so variable it could potentially be found up to the very end of a maximum 2GB file (If I understand the max size of a 2kl DB. The Access signature is currently one of the slowest – if not the slowest signature search in the whole of the PRONOM database.

A concern about this is that we might then want to change the records to reflect Jet DB instead of Access. This is probably a more extreme change but perhaps more accurate. What would happen at this point is the two signatures would begin to pick up instances of Microsoft Money and Visual Basic databases – users would need to be informed about this – probably via the PRONOM record. My feeling is that it is more accurate but understand some users would be uncomfortable with such a radical change.


So, two questions:

  • Does PRONOM merge the records for Access 95 and 97 and then 2000 and 2002 to accurately reflect that they are the same ‘file format’?
  • Does PRONOM go further and modify the signature to simply discover if it is a Jet DB? Resulting also in a change to the record name – perhaps maintaining aliases for Access, Microsoft Money and Visual Basic.

Message has been deleted

dclipsham

unread,
Mar 25, 2013, 2:16:01 PM3/25/13
to pro...@googlegroups.com
Thanks Ross,

We are pleased to receive your suggestion, however we do have concerns about making the changes as suggested:

This proposal involves deprecating existing records that are in active use for identification. 

Historically, we have merged records when either there hasn’t been a signature file at all and the signature research hasn’t given us enough specificity (e.g. early versions of MS Works – so a previously unidentified file now gets a PUID), or where multiple PUIDs have shared one signature, to remove the multiple ID conflict (e.g. Excel 5.0/95, as identified by your skeleton corpus – so a file previously identified with two PUIDs, now only has one).

We accept that your proposal would potentially more accurately identify the format - being Jet DB 3 or 4 rather than any one of four MS Access 'formats,' however we have two concerns:

* We would lose a degree of specifity - given that we can currently distinguish between, say files generated by either MS Access 95/97, is there any harm in keeping it that way?
* Deprecating, for example, x-fmt/239 (Access 97) in favour of x/fmt-238 (Access 95), would mean that existing system users who have formulated policy decisions based on specific PUIDs would have to revise these policy decisions where a digital object currently identified as one thing, would change to another. We're conscious that previous changes to our identification of TIFF and related formats previously caused problems for some institutions, so we do not wish to take such changes lightly.

This said, our position is not immutable. We're keen to understand what the community thinks, and whether they share, or are quite comfortable with these concerns. We encourage anybody with a stake or an opinion to share their thoughts within this discussion.

David

ross-spencer

unread,
Mar 27, 2013, 2:50:22 AM3/27/13
to pro...@googlegroups.com
Thank you for the reply. I think what it is missing, and perhaps is missing a little in my own opening remarks is a request for a bit more of a technical discussion from the group. And I appreciate TNA might not be the ones to provide that discussion.  

Asking 'what harm' something might do if left alone is not the strongest of technical positions from TNA. And actually, I think if I read the response correctly there is a potential here to confuse two streams of discussion:

1. Whether these formats need collapsing and how that should look
2. PRONOM's coupling with existing digital repositories and its ability to evolve and correct entries if necessary

The second point I suggest warrants another thread. I'd be happy to kick start that but suggest that might be one for you guys to start? - For me that boils down to two things, encapsulation of repositories and their use of Postel's law in a potentially 'not always scientific' discipline of format identification. And how much PRONOM should concern itself with that coupling - how can it realistically second guess its own evolution and commitment to 'format identification' if it has to be concerned with a function of the digital preservation community outside of its own scope. - It's not feasible to keep everyone happy; but it is possible to continue to keep improving as a resource to help repository managers be better informed. 

As for my original point of discussion, the technical question I would like to raise is over my own concerns; and that is: Beyond the core database format, is anyone out there experiencing backward compatibility issues between Access 97 and 95 and 2002 and 2000. Particularly in more complex database objects containing Visual Basic components, and maybe the simplest of Access Forms.

I would like that to be answered because then it would be much clearer whether there is any potential "harm" in collapsing the four records into two.

To clarify the position of there being only two formats, then the Microsoft documentation does support this quite clearly (I believe):


This chart shows that Microsoft Access 95 and Access 97 create a Microsoft Jet 3.0 format database file. If you are using a Microsoft Jet 3.5 client, it is best to use a Microsoft Jet 3.0 format database.

Here is a list of commonly used Microsoft Jet database engine clients and what associated Jet engine version they use:

   Client Application          Jet Engine Used    Recommended Jet DB Format
   -------------------------   ---------------    -------------------------
    Access 2.0                       Jet 2.0             Jet 2.0
    Access 95                        Jet 3.0             Jet 3.0
    Access 97                        Jet 3.5             Jet 3.0
    Access 2000                      Jet 4.0             Jet 4.0
    Access 2002                      Jet 4.0             Jet 4.0
    DAO 3.0                          Jet 3.0             Jet 3.0
    DAO 3.5                          Jet 3.5             Jet 3.0
    DA0 3.6                          Jet 4.0             Jet 4.0
    Microsoft.JET.OLEDB.3.51         Jet 3.5             Jet 3.0
    Microsoft.Jet.OLEDB.4.0          Jet 4.0             Jet 4.0
    Access ODBC Driver*              Jet 4.0             Jet 4.0

I would suggest the current distinction in PRONOM is between 'clients' and not versions of formats. (Again another point of discussion, but perhaps best not visited until further clarification of the problem).

On top of the 'format' Microsoft purports to use in each of its clients I am aware that Access database objects can contain other complex objects as previously alluded to, and my concern is whether those objects create a preservation and access problem where it isn't enough to say One is looking at Jet 3 or Jet 4 and whether the extra characterization information on top is not only useful but crucial. 

Unfortunately I haven't been able to set up a more appropriate testing environment and more complex set of database objects to test this hypothesis more thoroughly. 

(Note: My own development work so far processes the two categories of database I am distinguishing between identically - the question is to what degree do they remain identical as the objects become more complex?)

If this was something that could be cleared up and the records modified to reflect this work then the narrowing of scope, conversely, could open up the number of options for accessing  data in an Access DB and then potentially preserving it. It would allow me to inform users better and allow them to make better, more intelligent digital preservation decisions. 

So, I guess to conclude, my call for discussion in the PRONOM Group is a call to the community to help me understand what they're seeing with Access databases, and something I will follow up on in my own research to try and solidify the position. At the minute, I believe it is clear that there is a theoretical inaccuracy in the PRONOM records, what will be interesting to discover is whether that is a convenient and useful inaccuracy, or if it is something that we can benefit from correcting.

David Clipsham

unread,
Apr 22, 2013, 8:13:19 AM4/22/13
to pro...@googlegroups.com
Hi Ross, with apologies for the delay,

 I’ve now had a chance to conduct further research into these issues and I’ve found the following:
 
Using emulated Windows 98 environments, I created a very basic, 1 table .mdb file in Access 95, and another, as identical as possible, in Access 97. The files themselves have similar headers as expected (just 2 bytes different), however the Access 95 database is 92KB, whilst the Access 97 database is just 70KB and aside from the header, the binary between the two has many differences.
 
I attempted to open the Access 97 database in Access 95, and it refused to open. I hit the following error messages:

 

 
 
I then opened the Access 95 database in Access 97, and was immediately prompted to convert the file:
 


 
 
I have not yet repeated the exercise with Access 2000/2002, however I’ve come across a Microsoft page that indicates incompatibility between files created in each version:
 
File formatCompatible with Access 97?Compatible with Access 2000?Compatible with Access 2002?
Access 97 format (created with Access 97)YesNoNo
Access 2000 file format (created with Access 2000)NoYesYes
Access 2000 file format (created with Access 2002)NoYes (Access 2000 features only)Yes
Access 2002 file format (created with Access 2002)NoNoYes
Note   While using Access 2002, the features available to files in Access 2000 file format are exactly the same as those available to Access 2002 files. The Access 2002-specific features are unavailable when any file in Access 2000 file format is opened in Access 2000.
 
 
I think this shows that there is a practical distinction between each .mdb version regardless of the underlying database engine, and it therefore wouldn’t be right to merge the records as suggested.
 
I can't seem to upload the files I created here, so I've posted them on the OPF File formats corpus:

I hope you’ll find them useful in your research.
 
Please let me know if I can provide anything further. 
 
David

ross-spencer

unread,
Apr 25, 2013, 10:40:36 PM4/25/13
to pro...@googlegroups.com
Thanks David. 

An excellent response, it provides some good technical insight that backs up some of my thoughts/findings/fears. It seems that this is a good example of a blurred line between purported 'format' and implementation of that format, at least, the effect of the producing software/engine to skew more rigid definitions of format/specification.

When I have time, I'll get back to the Just Solve It Wiki and try and write a bit of this up. 

Q. Can I have permission to use the images you've provided on the Wiki? - they will become Creative Commons 0

Also, (wondering if we can open it up to the thread), but maybe there is some PRONOM description text we can come up with at this point to highlight this situation better? - if you guys haven't done so already that is!

I'll have to come back to this. 

Thanks again for the valuable input. 

Ross

dclipsham

unread,
Apr 26, 2013, 5:17:23 AM4/26/13
to pro...@googlegroups.com
Thank you Ross,

Yes, feel free to reuse these images under CC0.

I'll try to add some descriptive text in the next release; if any kind soul would like to write a suitable description by Monday, I'll happily include it (with attribution of course).

David
Reply all
Reply to author
Forward
0 new messages