Data conversion between various schemas

Peter Skvarenina

unread,

Jul 15, 2002, 2:31:19 PM7/15/02

to

I would like to ask, if there exist some papers with the topic of general
DB conversion according to different source and target schemas.

I need to create system that converts data between two different schemas.
I have fixed target schema and various source schemas, that represent
some subset (not always) of information that can be expressed by the target
schema.

I know I can't guarantee non-lossy conversion, i.e. when there are different
subsets of attributes, non-compatible functional dependencies and so on.
But has someone thought about some subset of operations that is inevitable
to implement to provide tool, that can be helpful in any case that should arise
during conversion process ?

Regards,

Peter Skvarenina

Paul Vernon

unread,

Jul 16, 2002, 5:51:10 AM7/16/02

to

> I would like to ask, if there exist some papers with the topic of
general
> DB conversion according to different source and target schemas.

I don't know of any such work, but would be v interested in any leads
people have.
Obviously books on data warehousing sort of cover this, but I find that
most DW book don't have a lot of theory in them ;-)

>I need to create system that converts data between two different schemas.
>I have fixed target schema and various source schemas, that represent
>some subset (not always) of information that can be expressed by the
target
>schema.

>I know I can't guarantee non-lossy conversion, i.e. when there are
different
>subsets of attributes, non-compatible functional dependencies and so on.
>But has someone thought about some subset of operations that is
inevitable
>to implement to provide tool, that can be helpful in any case that should
arise
>during conversion process ?

Very briefly, my thoughts here are.

1. Get each source system into 1NF

2. Use Relational Algebra (or SQL if that is all you have to hand..) to
normalise and constrain each source individually
2a. Add temporal attributes to you source models, then normalise and
constrain again. This will let you store history in the DW.

3. For each normalised, temporal source model, use Views (i.e. Relational Algebra again) to:
3a. Join to multiple 'lookup' tables to map source domains to common
(target) domains
3b. Rename source attributes to common target names

4. Select a common subset of attributes from each source model and UNION
them all together
4a. Use aggregation over the source models if the data cannot be
consolidated at the detail level

That's the basic outline. Life gets more complex when you get different
sources stating contradictory facts.
Note that there is no information loss as the data from each source is
stored in individual normalised source models.

Regards
Paul Vernon
Business Intelligence, IBM Global Services

Joe Novella

unread,

Jul 16, 2002, 8:34:59 AM7/16/02

to

> Very briefly, my thoughts here are.
>
> 1. Get each source system into 1NF
>
> 2. Use Relational Algebra (or SQL if that is all you have to hand..) to
> normalise and constrain each source individually
> 2a. Add temporal attributes to you source models, then normalise and
> constrain again. This will let you store history in the DW.
>
> 3. For each normalised, temporal source model, use Views (i.e. Relational Algebra again) to:
> 3a. Join to multiple 'lookup' tables to map source domains to common
> (target) domains
> 3b. Rename source attributes to common target names
>
> 4. Select a common subset of attributes from each source model and UNION
> them all together
> 4a. Use aggregation over the source models if the data cannot be
> consolidated at the detail level
>
> That's the basic outline. Life gets more complex when you get different
> sources stating contradictory facts.
> Note that there is no information loss as the data from each source is
> stored in individual normalised source models.

There are newer data profiling products which not only integrate multiple databases into a
consolidated normalized model, they do so based on analyzing the data below the metadata.
Through inferring dependencies and redundancies across different sources, the analyst can model
the source data or map to a pre-defined fixed target.

From there, you modify the model and add structures for new requirements or denormalize for
performance.

The idea is to identify the actual content, structure and quality of the data before migration.
Of course, the data profiling products won't qualify or prioritize the problems or concerns,
they do quantify them.

There are also numerous papers on inferring dependencies. Here's a query against citeseer.com.

http://citeseer.nj.nec.com/cs?q="functional+dependencies"&cs=1

Hope this helps.

--
Joseph Novella
Anthem Consulting, LLC
"Good ideas are not adopted automatically. They must be driven into practice with courageous
patience." - Admiral Hyman Rickover

Paul Vernon

unread,

Jul 16, 2002, 2:06:46 PM7/16/02

to

> There are newer data profiling products which not only integrate
multiple databases into a
> consolidated normalized model, they do so based on analyzing the data
below the metadata.
> Through inferring dependencies and redundancies across different
sources, the analyst can model
>the source data or map to a pre-defined fixed target.

Indeed there are tools to help automate the normalisation process on
individual data sources (my step 2).
I'm not sure how well they work when attempting to integrate separate
models together however. This is more a matter of matching entity
predicates between sources rather than finding hidden functional
dependences

Joe Novella

unread,

Jul 16, 2002, 3:32:40 PM7/16/02

to

Paul Vernon wrote:

Agreed. Though at least one of the data profiling tools on the market can
infer, what they call, "cross-file dependencies."

Other products do not infer cross-file dependencies. Rather, they compare
value sets of individual pairs of attributes and provide overlap
percentages. For example, identifying how many (and which) values in
cust_id in one file exist in customer_num in another file and vice versa.
Of course, it's only one step in the redundancy process. The analyst has
to decide whether cust_id and customer_num really should be the same
attribute in the model. If so, the analyst creates a "synonym" between the
attributes. If not, the analyst moves to the next attribute pair.

Likewise, if there is little overlap between the two attributes, and there
really should be a synonym between the two, you should be able to still
create the synonym. Of course, you have another problem on your hands.
:-)

Finally, other profiling tools can perform orphan analysis on sets of
attributes. The purpose of redundancy profiling is to identify potential
foreign keys and duplicate columns across files, tables, or systems, using
the data itself, not the metadata or code.

Keep in mind, the data profiling tools do not make the normalization
decisions. The analyst has to determine which functional dependencies to
include in the model and which redundancies are true foreign keys (or
denormalized attributes). The normalization decisions still need to be
made by a user. The difference is that you're making these decisions based
on the underlying data. Once these decisions are made, though, it's very
easy for the profiling tool to normalize the sources.

From personal experience, I've developed integrated models incorporating
data sources of flat files, RDBMS tables, VSAM files, and IMS segments,
using these tools. It still is a lengthy process, but it is manageable,
iterative, and repeatable.

Barry

unread,

Jul 23, 2002, 8:13:42 AM7/23/02

to

An approach which I have used with some success in the past is :-
1) Define schema conversion as specifications for metadata
integration, using
a suitable product.
2) Use a data integration tool, such as SeeBeyond, to retrieve,
convert and load
the data, using the specs from Step 1.

This is a rather expensive solution, suitable for enterprise-level
budgets.
If you do not have that kind of money available,($250K+), then a
sensible alternative is to follow the same kind of methodology, and
then use a small database for holding a Data Dictionary with mappings,
and either hand-code some SQL, or use Embarcadero's product at about
$25,000.

This page on my Database Answers Web Site has some useful links,
including 5
products reviewed in a recent evaluation report of MetaData
Integration products :-
http://www.databaseanswers.com/metadata_management.htm

Barry Williams
Principal Consultant
Database Answers

Joe Novella <jnov...@voicenet.com> wrote in message news:<3D3474DD...@voicenet.com>...

Ariel Fuxman

unread,

Sep 26, 2002, 12:37:55 AM9/26/02

to

On Mon, 15 Jul 2002 14:31:19 -0400, Peter Skvarenina wrote:

> I would like to ask, if there exist some papers with the topic of
> general DB conversion according to different source and target schemas.

We have a research project called Clio, that deals with precisely this
problem. It is a joint project between University of Toronto and IBM
Almaden. We have a number of papers on this subject (called "schema
mapping"). You can find them, and also more info on the project at

http://www.cs.toronto.edu/db/clio/

Also, is there any possibility that you share some examples of the
source and target schemas you were trying to map? We would be very
interested in applying our tool to examples from people who have already
faced the problem of schema mapping.

Let me know if you need any additional information,

Ariel Fuxman.

Jeff Tan

unread,

Oct 6, 2002, 11:37:20 PM10/6/02

to

Clio is indeed a leading product/technology in this field.

I'm curious to know, however, if the data conversion is once and for all
or if it must be via a dynamic system on top of the existing databases.

In the absence of sufficient budget (wasn't clear from the query if this
was an issue at all), my hacking instincts would make me try using a
static application that connects to the databases via JDBC and converts
the data via SQL. I'm talking about abandoning the original database,
though, and there would be wide assumptions about the conversions that
the application is capable of. In fact the application may have to be
rewritten for every pair of databases to convert between.

Full-blown prototypes that are a lot more dynamic do exist. Try
searching the web for the following keywords:

multidatabases, federated databases, schema mapping, schema integration,
intelligent integration of information

Some major projects I've heard of, aside from Clio, are:

AURORA (Uni of Alberta, Canada)
TSIMMIS (Stanford Uni, IBM)
Ariadne and SIMS (Uni. of South California)
CARNOT and Infosleuth (Microelectronics and Computer Technology Corp.)
OBSERVER (EHU, Spain)

You can find papers for them at:

http://citseer.nj.nec.com/cs

Good luck!

cheers,
Jeff

--
Jeff Tan Ph :(61 3) 99031405
Computer Science & Software Engineering Fax:(61 3) 99031077
Monash University Mob:(61 409) 567113

Joseph R. Hudicka

unread,

Oct 7, 2002, 9:39:15 AM10/7/02

to

I have published a series of white papers this year entitled, "Setting
the Record Straight". Data migration is not a technical problem - it
is a business problem, and should be handled as such.

ETL's are not the answer, but they can be a component of the solution.
We have documented a thorough methodology for data migration, that
focuses exclusively on a linear project time line, eliminating the
widely-known and substantively flawed, "Spiral Methodology".

Please let me know if you would like me to send you a copy of the
series.

kalpana...@gmail.com

unread,

Sep 27, 2012, 11:53:07 PM9/27/12

to

Informatics Outsourcing is an Offshore Data Management service company. Data Management Service includes all types of Data Conversion, File Conversion, XML Conversion, HTML Conversion,SGML Conversion, Document Conversion,Data Entry, Data Extraction and Validation,OCR and ICR Services with affordable price. Our team to give the solution quickly and given requirements.

Erwin

unread,

Oct 1, 2012, 5:24:33 AM10/1/12

to

Op dinsdag 16 juli 2002 14:38:54 UTC+2 schreef Joe Novella het volgende:
>
> There are newer data profiling products ...

The question was for papers, not for products.

Erwin

unread,

Oct 1, 2012, 5:29:23 AM10/1/12

to

Op donderdag 26 september 2002 06:48:31 UTC+2 schreef Ariel Fuxman het volgende:
>
>
> http://www.cs.toronto.edu/db/clio/
>

Are you aware you're pointing us at nothing with this ?

ian.t...@gmail.com

unread,

Jul 19, 2014, 12:46:48 PM7/19/14

to

FWIW, I think the following is the new link:

http://dblab.cs.toronto.edu/project/clio/