Test framework for pygrametl to write tests for ETL pipeline


Ray Goel

Dec 8, 2019, 1:20:16 PM
to pygrametl-dev
Hi all,

I am an SDET who is trying to figure out some integration and system tests for an ETL pipeline at my company. I want to write functional tests for the different Transform functions, and I came across pygrametl. However, there is no documentation on using this library for testing.

I would like to know if there is an existing framework which I can use with this library to write these tests or if I will have to work on one from scratch. Any help will be appreciated. 

Thanks,
Ray 

Christian Thomsen

Dec 9, 2019, 8:32:26 AM
to pygrametl-dev
Hi,

Thank you for your interest in pygrametl.

The current version of pygrametl has no direct support for testing, but since it is code-based, it can be used with existing Python unittest frameworks.
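
As a minimal sketch of that approach, the test below uses the standard library's unittest module on a transform function that operates on a row dictionary. The function normalize_browser and the column names are hypothetical and are not part of pygrametl; they only stand in for whatever transformation the ETL code applies.

```python
import unittest


def normalize_browser(row):
    """Hypothetical transform: trim and title-case the browser name in place."""
    value = (row.get("browser") or "").strip()
    row["browser"] = value.title() if value else "Unknown"
    return row


class NormalizeBrowserTest(unittest.TestCase):
    def test_trims_and_capitalizes(self):
        row = {"bid": 1, "browser": "  firefox ", "os": "Linux"}
        self.assertEqual(normalize_browser(row)["browser"], "Firefox")

    def test_missing_value_becomes_unknown(self):
        row = {"bid": 2, "browser": None, "os": "Linux"}
        self.assertEqual(normalize_browser(row)["browser"], "Unknown")
```

Saved in a file, these tests can be run with python -m unittest, or collected by pytest without changes.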

We are, however, working on a simple framework where you can easily define the pre- and post-conditions of the database tables and test that the post-condition holds after the ETL process finishes.

Is it something like this you are looking for? If you have ideas about how programmatic ETL testing should be done, we would be happy to hear about them.

Best regards,
Søren Kejser Jensen and Christian Thomsen

Ray Goel

Dec 9, 2019, 10:24:03 AM
to pygram...@googlegroups.com
Thanks for the quick response. 
 
Since ETL pipeline testing is a comparatively new field, we are also figuring out the kind of tests we want to write for the pipeline.
However, some of the areas we identified are as follows:

  1. Unit tests for each component (function) of the Transform stage - This is handled by the devs using Pytest
  2. Integration tests to make sure the different components interact with each other as expected. Each transform component we have writes data either to a BQ table or a MySQL table. We need tests to make sure the schema of the table being written to is correct, plus a quick sanity check of the data in these tables. (We are trying to implement DBT for this; however, a Python library which allows testers to assert expected data here would help. This is where I was thinking of putting pygrametl to use.)
  3. Feed all these tests to a continuous testing pipeline which will mimic the ETL process as it is in production (Airflow with the tests integrated at each node).
If pygrametl can be used to develop a framework to incorporate tests like these, it would be the first of its kind and would be really helpful!
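
To make item 2 concrete, here is a rough sketch of a schema check followed by a row-count sanity check. It uses an in-memory SQLite database only as a stand-in for the MySQL or BQ tables; the helper names table_schema and check_table are made up for illustration, and PRAGMA table_info is SQLite-specific, so a real MySQL or BigQuery check would need to query that system's own metadata instead.

```python
import sqlite3


def table_schema(conn, table):
    """Return {column name: declared type} using SQLite's PRAGMA table_info."""
    # Each result row is (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated directly, so it must be trusted test input.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {row[1]: row[2].upper() for row in rows}


def check_table(conn, table, expected_schema, min_rows=1):
    """Assert the table has the expected columns and at least min_rows rows."""
    actual = table_schema(conn, table)
    assert actual == expected_schema, f"schema mismatch: {actual}"
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    assert count >= min_rows, f"{table} has only {count} rows"
    return count


# Stand-in for the table written by one transform component.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE browser (bid INTEGER PRIMARY KEY, browser TEXT, os TEXT)")
conn.execute("INSERT INTO browser VALUES (1, 'Firefox', 'Linux')")

row_count = check_table(
    conn, "browser", {"bid": "INTEGER", "browser": "TEXT", "os": "TEXT"}
)
```

A check like this could run after each pipeline step, failing the step when the schema drifts or the table comes out empty.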

Let me know if you guys need any more information. And it would be great to be a part of the development if there is a framework being developed.

Thanks,
Ray

--
You received this message because you are subscribed to the Google Groups "pygrametl-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pygrametl-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pygrametl-dev/a111c777-2e21-4152-ba4b-d47d3ea19c4a%40googlegroups.com.


--
Ray Goel | Software Engineer in Test

Christian Thomsen

Dec 19, 2019, 10:16:47 AM
to pygrametl-dev
Hi,

Sorry about the late reply - we have been (too) busy the last couple of days.

We are working on finalizing our implementation for easy definition of pre- and post-conditions for ETL testing. The basic idea is that you can easily specify a table in a string, as shown here:
tbl = dtt.Table("browser",
"""
| bid:int (pk) | browser:text | os:text |
-----------------------------------------
| -1           | Unknown      | Unknown |
| 1            | Firefox      | Linux   |
"""
)
tbl.ensure()

The method ensure will create and/or populate the table. By default, the test database is in SQLite, but the user can also provide a PEP 249 connection to another database. This serves as the pre-condition.
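
To illustrate the idea of creating and populating a table from a drawn string, here is a rough standalone sketch. It is not the implementation described in this message: the functions split_cells, parse_drawn_table and ensure are hypothetical helpers written only for this example, and they handle just the int and text types shown above.

```python
import sqlite3

DRAWN = """
| bid:int (pk) | browser:text | os:text |
-----------------------------------------
| -1           | Unknown      | Unknown |
| 1            | Firefox      | Linux   |
"""

CONVERTERS = {"int": int}  # any other drawn type is kept as text in this sketch


def split_cells(line):
    """Split '| a | b |' into ['a', 'b']."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_drawn_table(text):
    """Return (columns, rows); each column is (name, type, is_primary_key)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    # Skip separator lines, which consist only of dashes, pipes and spaces.
    body = [ln for ln in lines[1:] if not set(ln.strip()) <= set("-| ")]
    columns = []
    for cell in split_cells(lines[0]):
        name, _, rest = cell.partition(":")
        ctype = rest.replace("(pk)", "").strip()
        columns.append((name.strip(), ctype, "(pk)" in rest))
    rows = [
        [CONVERTERS.get(col[1], str)(cell) for col, cell in zip(columns, split_cells(ln))]
        for ln in body
    ]
    return columns, rows


def ensure(conn, table, text):
    """Create the table if needed and insert the drawn rows (trusted input only)."""
    columns, rows = parse_drawn_table(text)
    defs = ", ".join(
        f"{name} {ctype}{' PRIMARY KEY' if is_pk else ''}"
        for name, ctype, is_pk in columns
    )
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({defs})")
    marks = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
ensure(conn, "browser", DRAWN)
loaded = conn.execute("SELECT bid, browser, os FROM browser ORDER BY bid").fetchall()
```

Because only the standard PEP 249 calls execute, executemany and commit are used after connecting, the same helper could in principle be pointed at another database, although the parameter marker and type names would need adjusting there.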

It is then possible to specify changes, e.g., as shown below:
etl.runETL() # this could be pygrametl or any other ETL tool
expected = tbl + "| 2 | Chrome | Windows |"
expected.assertEqual()

With the post-condition defined in expected, it can be asserted that the table in the database has the same content (other asserts are also supported: assertNotEqual, assertSubset).
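
The difference between an equality assert and a subset assert can be sketched with plain PEP 249 calls. This is only an illustration under assumptions, not the framework's code: assert_equal, assert_subset and fetch_rows are hypothetical names, the INSERT stands in for an ETL run, and rows are compared as sets, so duplicate rows and row order are ignored.

```python
import sqlite3


def fetch_rows(conn, table):
    """Read all rows of a table through a PEP 249 cursor as a set of tuples."""
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table}")  # table name is trusted test input
    return set(cursor.fetchall())


def assert_equal(conn, table, expected):
    """Fail unless the table holds exactly the expected rows."""
    actual = fetch_rows(conn, table)
    missing, unexpected = set(expected) - actual, actual - set(expected)
    assert not missing and not unexpected, (
        f"missing={missing}, unexpected={unexpected}"
    )


def assert_subset(conn, table, expected):
    """Fail unless every expected row is present; extra rows are allowed."""
    missing = set(expected) - fetch_rows(conn, table)
    assert not missing, f"rows not found in {table}: {missing}"


# Pre-condition: the table as it exists before the ETL run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE browser (bid INTEGER PRIMARY KEY, browser TEXT, os TEXT)")
before = [(-1, "Unknown", "Unknown"), (1, "Firefox", "Linux")]
conn.executemany("INSERT INTO browser VALUES (?, ?, ?)", before)

# Stand-in for the ETL run, which is expected to add one row.
conn.execute("INSERT INTO browser VALUES (2, 'Chrome', 'Windows')")

expected = before + [(2, "Chrome", "Windows")]
assert_equal(conn, "browser", expected)
assert_subset(conn, "browser", [(2, "Chrome", "Windows")])
```

Reporting the missing and unexpected rows separately makes a failing test easier to diagnose than a bare True/False comparison.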

Does this fit your needs? If not, what is missing? How would you like to write the tests?

Best regards,
Søren Kejser Jensen and Christian Thomsen
