Guidance for using EDG for data integration?

88 views
Skip to first unread message

Fan Li

unread,
Apr 7, 2020, 7:45:22 AM4/7/20
to TopBraid Suite Users
Hi TopBraid Community,

I have a use case where I need to map data sources (spreadsheets) of different formats into a single schema. I was wondering how I should use EDG on the data modeling aspect of this task.
  • Should I use "Data Assets" to model each data source?
  • Should I use "Crosswalks" for schema mapping?
  • Is there a concrete example I can follow?
Any guidance is appreciated!

Fan Li

unread,
Apr 7, 2020, 7:51:12 AM4/7/20
to TopBraid Suite Users
I should have added that the immediate objectives are:
  • Give domain experts a visual tool to create data mapping
  • Use SHACL to describe & validate the harmonized data structure

Irene Polikoff

unread,
Apr 7, 2020, 10:13:21 AM4/7/20
to topbrai...@googlegroups.com
Hi Fan Li.

On Apr 7, 2020, at 7:51 AM, Fan Li <lifa...@gmail.com> wrote:

I should have added that the immediate objectives are:
  • Give domain experts a visual tool to create data mapping
  • Use SHACL to describe & validate the harmonized data structure


On Tuesday, April 7, 2020 at 7:45:22 AM UTC-4, Fan Li wrote:
Hi TopBraid Community,

I have a use case where I need to map data sources (spreadsheets) of different formats into a single schema. I was wondering how I should use EDG on the data modeling aspect of this task.
  • Should I use "Data Assets" to model each data source?
What kind of information are you planning to import into EDG? Is it some data in spreadsheets e.g., the actual information about lets say products or companies? Or do these spreadsheets contain information about data sources e.g., what datasets you have, what are the fields in each dataset, how many records in each dataset, etc.?

It would be useful if you could provide an example.
  • Should I use "Crosswalks" for schema mapping?
  • Is there a concrete example I can follow?
Any guidance is appreciated!

--
You received this message because you are subscribed to the Google Groups "TopBraid Suite Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to topbraid-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/topbraid-users/80a1e323-0d3e-4a1d-bb8b-33898253242a%40googlegroups.com.

Fan Li

unread,
Apr 7, 2020, 12:12:09 PM4/7/20
to TopBraid Suite Users
Hi Irene,

Each spreadsheet represents data we received from a different customer. I would like to capture its metadata/descriptors such as column names, data types, number of records etc in EDG. As customers use slightly different terminologies, I also need to map the column names to a single schema so I can merge the data for reporting purpose.


On Tuesday, April 7, 2020 at 10:13:21 AM UTC-4, Irene Polikoff wrote:
Hi Fan Li.

On Apr 7, 2020, at 7:51 AM, Fan Li <lifa...@gmail.com> wrote:

I should have added that the immediate objectives are:
  • Give domain experts a visual tool to create data mapping
  • Use SHACL to describe & validate the harmonized data structure


On Tuesday, April 7, 2020 at 7:45:22 AM UTC-4, Fan Li wrote:
Hi TopBraid Community,

I have a use case where I need to map data sources (spreadsheets) of different formats into a single schema. I was wondering how I should use EDG on the data modeling aspect of this task.
  • Should I use "Data Assets" to model each data source?
What kind of information are you planning to import into EDG? Is it some data in spreadsheets e.g., the actual information about lets say products or companies? Or do these spreadsheets contain information about data sources e.g., what datasets you have, what are the fields in each dataset, how many records in each dataset, etc.?

It would be useful if you could provide an example.
  • Should I use "Crosswalks" for schema mapping?
  • Is there a concrete example I can follow?
Any guidance is appreciated!

--
You received this message because you are subscribed to the Google Groups "TopBraid Suite Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to topbrai...@googlegroups.com.

Irene Polikoff

unread,
Apr 8, 2020, 3:37:51 PM4/8/20
to topbrai...@googlegroups.com
Hi Fan Li,

In this case, you would represent each spreadsheet as a dataset in a data asset collection. Each column will become a dataset element.

For 6.4, we have added an Import feature that will create this information from a spreadsheet. It will perform some profiling of data to populate metadata.


If you are interested, may be we can arrange for a way for you to test it and provide input. 

Without this feature, you could create dataset instances for each spreadsheet, then use the plain spreadsheet importer to create data elements for each dataset. Creating input for the importer will require manipulating your data. Take the first row of your spreadsheet that lists the column names and turn it into columns.

With respect to connecting the data elements from different datasets, I would not use crosswalks. Crosswalks are primarily about mapping different reference data, different glossaries, taxonomies, etc. You would not create a separate data asset collection for each dataset. At least, I do not think you would. You would most likely use a single data asset collection.

Then is a question of whether you would map the similar data elements to each other or whether you would map all of them to a common business term. For example, you may have different data elements (spreadsheet columns) capturing gender information. It would make sense to create a business term Gender and map all of them to it - in the hub and spoke type of approach. EDG has some capabilities suggesting such mappings based on the available data and rules about a business term. And, yes, SHACL is used for this.

Mapping one element to another make sense if you are trying to capture lineage e.g., data from one dataset is used for another dataset and your goal is to capture this.

Coming back to your question on “giving domain experts a visual tool to create data mapping”, I would probably organize the editor UI for data assets to display data elements and drag and drop to map. I am showing an example in the screenshot below. My first panel contains business terms. My second panel contains data elements. I use it to select a data element to be shown on a form. Then, I could drag and drop from the business term table the relevant term to map the data element two. If you were doing mapping between data elements, you could have in the first panel data elements from one dataset and in another panel data elements from another dataset. 


Alternatively or additionally, you could also do batch editing. For example, you could select data elements from different datasets that represent let’s say gender and batch edit all of them in one step to connect them to the same business term - as opposed to doing one by one mapping. There are various ways to accomplish this. For example, you could use the Asset List panel. You could drag and drop different data elements (using search to find them) into a list in order to assemble everything you want to edit as a group. Then select all of them for editing. If you are familiar with Basket in TBC, Asset Lists are similar to TBC baskets, but they are collaborative. Users can name them and store them on the server to share with other users for collaborative work and discussion.

Hopefully, this gives you some useful information.

Regards,

Irene

To unsubscribe from this group and stop receiving emails from it, send an email to topbraid-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/topbraid-users/358116ab-b2eb-4a2f-be3f-213c77253725%40googlegroups.com.

Fan Li

unread,
Apr 14, 2020, 11:40:23 AM4/14/20
to TopBraid Suite Users
Hi Irene, sorry for my delayed response. I have finally had an opportunity to try it on the new 6.3 version. (By the way I really like the new user experience which is fantastic.)

Thanks so much for pointing me to the right direction and I think the UI layout you suggested is very intuitive and would work great with domain experts.

One remaining question:
  • Suppose we receive a new data file every month (same schema), do we create a new "File Table" instance each time? If we do, can we associate the new instance with the same set of "Table Columns" or we need to create new column instances each time?

Irene

Irene Polikoff

unread,
Apr 14, 2020, 2:19:20 PM4/14/20
to topbrai...@googlegroups.com
Hi Fan Li,

I think it depends on your context and the goals for using this information.

Will you be capturing more than just simple column metadata such as the name, datatype and what it maps to? 
For example, you could include data profiling e.g., number of null values, min and max values for each data element, etc. 

If so, then these will probably be changing from one month to another. If it is important for users to easily see what the table/dataset was like at each month, then you may consider creating new resources for each column/data element. If you are not very interested in readily available information about the dataset evolution and mainly want to know what is in the dataset today, then you definitely do not need new instances for data elements. Simply update them. The historical info could always be queried from the change history should you need it. Even if you are interested in making the history easily available to users, you could create a query for them to run against the change history.

In our experience, at least with databases, people are primarily interested in how it looks like today. By default, when we do the subsequent (second, third, etc.,) import of a data source, it is a delta of the previous import. The same tables and columns are kept if they continue to exist. Only profiling and sampling is updated for them. Any new schema elements are added. Elements that no longer exist are removed. Some of it is done automatically and some requires attention of curators because if you see one new column and one deleted column, EDG can’t know for sure if it is indeed one new and one deleted or an update to the previously existing one.

With this approach you would not need to do the remapping of data elements unless the data in them changed fundamentally.

If you are only interested in knowing that a given table has some specific data elements (columns) which map to something and no data profiling details are of interest, then are you getting monthly updates because the schema may change e.g., may be a column got added or deleted? In this case, I do not see even a small reason to create each month new resources for the data elements. 

If you need very easy access to the info about what elements were part of the table as of January, February, etc., you may consider creating a new instance of a table/dataset for each month. Otherwise, simply update the instance of the table you created originally. As per above, the historical info could be queried from the change history.


Regards,

Irene

To unsubscribe from this group and stop receiving emails from it, send an email to topbraid-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/topbraid-users/417843e4-80fa-4aeb-8ad4-42bfd3488b85%40googlegroups.com.

Fan Li

unread,
Jun 25, 2020, 9:20:53 AM6/25/20
to TopBraid Suite Users
Hi Irene, I have a follow-up question. In your example, you used the "maps to term" property to connect Data Columns with Business Terms. How can I be more specific in terms of the mapping relation, for example the mapping involves a conversion of the unit of measure (from Euro to US dollar) or more complex transformation?

Irene Polikoff

unread,
Jun 25, 2020, 9:46:40 AM6/25/20
to topbrai...@googlegroups.com
Hi Fan Li,

Such detail typically is not required or expected in the context of connecting a data element to a business term as organizations would not normally have Amount in Dollars and Amount in Euros as separate business terms. Connection to business term is expected to be a general business connection without any deeper technical details. If, for some reason, you decide to make it into something more, then decision on how to extend the EDG models is up to you.

This type of details would make sense in a context of capturing mappings between data elements when describing data flows. In that context instead of using simple mapping properties you could, if desired, use constructs that allow specifying these details.

The EDG Models have “Mappings” classes intended to capture details of mapping transformations performed by software - as shown in the diagram below 

So, basically, you would create instances of Mapping classes and a mapping will have inputs and outputs that are data elements - something like below: Mapping is a piece of code and will typically be connected to some program that contains it.




Fan Li

unread,
Jun 25, 2020, 1:50:16 PM6/25/20
to TopBraid Suite Users
Hi Irene. Thanks for the explanation. Is it possible to share the sample data you used for the screenshot? I feel I need more details to understand what the derivative map represents.

Irene Polikoff

unread,
Jun 25, 2020, 10:11:10 PM6/25/20
to topbrai...@googlegroups.com
Hi Fan Li,

I can’t send you data I used in the screenshot, but I added a small example to the lineage sample file.

To use it, create a Lineage Asset collection and load this RDF into it. You will see an example of a Functional Mapping.

lineage-model_topbankcorp-fry9c.ttl
PastedGraphic-3.png

Fan Li

unread,
Jun 29, 2020, 2:46:42 PM6/29/20
to TopBraid Suite Users
Hi Irene,

I see that in the lineage model we can specify the "use database" relation and the database would appear in the lineage model. I attempted to expand the lineage ontology model to add a "use file table" relation for file data tables. However this relation is not displayed in the lineage graph. Do you have any insight?

Thanks,
Fan





Irene Polikoff

unread,
Jun 29, 2020, 3:54:50 PM6/29/20
to topbrai...@googlegroups.com
Hi Fan,

LineageGram is a sophisticated visualization. Simply extending an ontology will not change it. It is more complex than that.

If you go into the Lineage Tutorial, there is some documentation on the properties used for the diagram


There are ways to extend and modify the diagram’s behavior, but this type of customization will require direct mentoring from TQ and can’t be provided through e-mails in the user group forum.

Regards,

Irene

--
You received this message because you are subscribed to the Google Groups "TopBraid Suite Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to topbraid-user...@googlegroups.com.

Fan Li

unread,
Jun 29, 2020, 4:10:10 PM6/29/20
to TopBraid Suite Users
Hi Irene, thanks for the clarification. For now I will stay within the boundary of the lineage model and perhaps look into customization in the future.

Ralph TQ [Gmail]

unread,
Jun 29, 2020, 4:14:21 PM6/29/20
to topbrai...@googlegroups.com
Hello Fran,

I have been following your discussion on lineage. As the lead developer for the LineagGram capabilities, I am interested in how you are wanting to use the capabilities.  With regard to the use of other properties for lineage, in release 6.4 there will be the ability to have other properties inferred for generating lineage. These properties will have to be specified as relevant to computing lineage. In a similar way for the derivation maps the composite structure of a node is inferred from properties that specify “partOf” (or “hasPart”) composition. Details of how to do this kind of modeling will be provided.

Regards,

Ralph

--
You received this message because you are subscribed to the Google Groups "TopBraid Suite Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to topbraid-user...@googlegroups.com.

Ralph TQ [Gmail]

unread,
Jun 29, 2020, 4:32:02 PM6/29/20
to topbrai...@googlegroups.com
Hello Fan not Fran - sorry :-)

Fan Li

unread,
Jun 29, 2020, 9:43:43 PM6/29/20
to TopBraid Suite Users
Thank you, Ralph. I look forward to the release of EDG 6.4.

Fan Li

unread,
Aug 13, 2020, 11:14:18 AM8/13/20
to TopBraid Suite Users
Hi Irene, regarding the mapping from data asset columns to business term, you have shown how it is done manually. Is it possible to automate the process, for example, using the fuzzy matching algorithm in the data quality report? Thanks!  

Irene Polikoff

unread,
Aug 13, 2020, 1:18:26 PM8/13/20
to topbrai...@googlegroups.com
Hi Fan Li,

You can associate a SHACL Shape with a term describing its values and EDG will use results of data profiling and sampling to suggest matches. This is done by creating a data value rule - see below for an example.




A data value rule is a node shape and you can specify in it everything you can specify for a shape e.g., min and max values, datatype, a regex pattern, etc.

There are also some pre-build custom constraint components you can use to detect e-mails, URIs, credit card numbers, etc. These are available in in the String Kind SHACL Library


This constraint library is located in the workspace under SHACL directory in strkind.shapes.ttl


There is webinar where this was demonstrated https://www.topquadrant.com/project/applied-data-governance-2/. Go to around 17:40 into the webinar. This was recorded 2 years ago, so you will see differences in the UI, but the concept is the same.

If column names are sufficiently meaningful, then you could also try using a crosswalk, cross walking instances of data elements with instances of terms. 

Crosswalk has an auto-suggestion feature. It uses fuzzy matching on labels only. You can customize (on the Manage tab), the crosswalk predicate to be “maps to term” and then, after matches are established, include the crosswalk into your data asset collection.

Reply all
Reply to author
Forward
0 new messages