Dimension 404 Streaming

0 views

Skip to first unread message

Rosangela Pinkard

unread,

Aug 5, 2024, 2:24:20 PM8/5/24

to clicimnate

Processingfacts and dimensions is the core of data engineering. Fact and dimension tables appear in what is commonly known as a Star Schema. Its purpose is to keep tables denormalized enough to write simpler and faster SQL queries. In a data warehouse, a dimension is more like an entity that represents an individual, a non-overlapping data element where the facts are behavioral data produced as a result of an action on or by a dimension. A fact table is surrounded by one or more dimension tables as it holds a reference to dimension natural or surrogate keys.

Late-arriving transactions (facts) aren't as annoying as late-arriving dimensions. In order to ensure the accuracy of the data, usually dimensions are processed first as they need to be looked up while processing the facts for enrichment or validation. In this blog post, we are going to look at a few use cases of late-arriving dimensions and potential solutions to handle it in Apache Spark pipelines.

Even if other dimensions were streamed, it would still be possible for corresponding facts to arrive in close proximity.. Think of an automated test generating synthetic data in a lower environment. Another example that can be often seen is organizational migration. Private schools sometimes migrate under a district. This entity migration triggers a wave of Slowly Changing Dimensions and the facts streamed afterward should use the updated dimensions. In such cases, when attempting to join facts and dimensions, it's possible that the join will fail due to late-arriving dimensions.

The margin of error here can be reduced by scheduling ETL jobs efficiently to ensure new dimensions are processed before processing the new facts. At McGraw Hill, this is not an option because significant delays do occasionally occur in our source systems.

In this approach, even if dimension lookup fails, facts are dumped into fact tables with default values in the dimension columns and have a hydration process to periodically hydrate the missing dimension data in the facts table. You can filter out data with no or default dimension keys while querying that table to ensure that you aren't returning the bad data. The caveat of this approach is that it does not work where those dimension keys are the foreign keys in fact tables. Another limitation is that if facts are written to multiple destinations, the hydration process has to update missing dimension columns in all those destinations for the sake of data consistency.

We can detect the early-arriving facts instead (facts for which the corresponding dimension has not arrived yet), put them on hold, and retry them after a period until either they get processed or exhaust retries. This ensures data quality and consistency in the target tables. At McGraw Hill, we have many such streaming pipelines that read facts from Kafka, lookup multiple dimension tables, and write to multiple destinations. To handle such late-arriving dimensions, we built an internal framework that easily plugs into the streaming pipelines. The framework is built around a common pattern that all streaming pipelines use:

A mechanism was needed to process retryable data along with the new data without having to send it back to Kafka. Spark allows you to UNION two or more data frames as long as they have the same schema. Retryable records can be unioned with the new data from Kafka and process it all together. That's how the reconciliation pattern was designed.

Since the reconciliation table has the `retry_count`, each pipeline can control how many retries are allowed by filtering out reconciliation records that exceed the configured number of retries. If a certain event exhausts the max retry count, the reconciliation job updates its status as DEAD and is not retried anymore.

A spike in records processed here indicates that more and more retryable records were getting accumulated faster than they were getting resolved. It indicates a potential issue in a batch job that loads dimensions. Once retries are exhausted or all events are processed, this spike will reduce. The best way to optimize this is to partition the reconciliation table on the `status` column so that you are only reading unresolved records.

This reconciliation pattern becomes easy enough to plug into the existing pipelines once a tiny boilerplate framework is added on top of it. It works with both streaming and batch ETLs. In streaming, it relies on the checkpoint state to read new data from the reconciliation table. Whereas in batch mode, it has to rely on the latest timestamp to keep track of new data. The best way of doing this would be to use streams with Trigger--once.

This automated reconciliation saves a lot of manual effort. You can also add alerting and monitoring on top of it. The volume of retryable data can become a performance overhead as it grows over time. It's better to periodically clean those tables to get rid of older unwanted error logs and keep their size minimal.

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Precise and contact-free manipulation of physical and biological objects is highly desirable in a wide range of fields that include nanofabrication, micro- and nano-robotics, drug delivery, and cell and tissue engineering. To this end, acoustic tweezers serve as a fast-developing platform for precise manipulation across a broad object size range1,2. There are two primary types of acoustic tweezers under development at present: radiation force tweezers and acoustic-streaming tweezers.

Radiation force tweezers, in which the acoustic radiation force acts as the trap, can be divided into standing-wave tweezers and traveling-wave tweezers. To date, most demonstrated acoustic tweezers are standing wave tweezers that use counter-propagating waves to create a mesh of standing-wave nodes and antinodes where the particles are trapped3,4,5,6,7,8,9,10,11,12,13,14. Such systems are particularly suitable for manipulating groups of particles, but the chessboard-like node network precludes object selectivity. In addition, standing wave trapping typically requires multiple transducers that surround the trapping region, which adds complexity and makes it incompatible with some application scenarios, especially those that involve fixed object inside the trapping region.

Both standing-wave tweezers and traveling-wave tweezers rely on acoustic radiation force to directly manipulate particles, whereas acoustic-streaming tweezers take advantage of the nonlinear Rayleigh streaming induced fluid flows34, and thus handle particles indirectly in fluids by creating streaming vortices35 with oscillating bubbles36 or rigid structures37,38. These devices tend to be simple devices that are easy to operate, but offer low degree of spatial resolution, because microbubble and microstructure-based phenomena are nonlinear and difficult to control2. Fluid manipulation has been demonstrated using controlled pumping39, but is limited to 2D, and requires sophisticated control over the source array.

Here we propose a hybrid 3D single beam acoustic tweezer by combining the radiation force and acoustic streaming. We exploit the Eckart streaming34 and demonstrate that, instead of being a nuisance, carefully designed acoustic streaming can be embedded in the focused acoustic vortex to create a fully 3D trap. As a proof of concept, we generated a focused acoustic vortex with a single piezoelectric transducer and a passive polydimethylsiloxane (PDMS) lens. The experimental levitation force provided by streaming reaches 3 orders magnitude larger than previously reported33, and allows a wider range of particle size, shape, and material properties. We demonstrate this three-dimensional acoustic tweezer first by simulation and experimental measurement of the acoustic field. Then the acoustic streaming flow field is measured with particle image velocimetry (PIV). Finally, levitation, trapping and 3D manipulation of a particle is demonstrated in a fluid environment.

a Creating a focused acoustic vortex for in-plane particle trapping, and the localized gradient streaming field levitates the particle, providing trap in the third dimension. Inset shows a photo of the fabricated device. b Evolution of the intensity and phase fields across different cut planes along z axis. The field is gradually focused as it propagates, keeping the spiral phase profile in the central region.

The acoustic streaming flow was simulated using another open-source tool, OpenFOAM41. The streaming effect modeling was performed in three stages as previously suggested in ref. 42: (i) simulation of the wave propagation in time domain using the compressible flow computational fluid dynamics(CFD) solver, (ii) time-averaging of the effective non-linear equation term to calculate the body force driving the acoustic streaming flow, and (iii) using the incompressible steady-state CFD solver to calculate the streaming velocity field by adding the effective external force equation term calculated in step (ii). All the required solvers are included in the default OpenFOAM distribution with minor additional code modifications required.

The hybrid acoustic tweezer relies on the balance of drag force and gravity, Therefore, it is designed to work in the upward orientation. Nevertheless, the trap stays stable even when the setup is tilted. From the experimental results shown in Supplementary Movie 5, we can see that the tweezer can stably trap the particle at a tilting angle as large as 21.

Compared with the 3D trap demonstrated by Baresch et al.33 where the trapping force along z direction relies on the dipole mode in a sphere, levitating particles with drag in the streaming flow offers several advantages. First, it is able to levitate heavier particles since the drag in streaming flow can provide larger upward force than using radiation forces. Second, it removes the dependence of trapping on the shape and material properties of the levitated particle. Third, the wave field is generated by a single transducer and passive lens instead of the transducer array, which provides an inexpensive and reliable route for contact-free particle and fluid manipulation.