Beam Programming Guide


Marcelene Vasconez

Aug 5, 2024, 5:49:03 AM
to preeminopter
The Beam Programming Guide is intended for Beam users who want to use the Beam SDKs to create data processing pipelines. It provides guidance for using the Beam SDK classes to build and test your pipeline. The programming guide is not intended as an exhaustive reference, but as a language-agnostic, high-level guide to programmatically building your Beam pipeline. As the programming guide is filled out, the text will include code samples in multiple languages to help illustrate how to implement Beam concepts in your pipelines.

To use Beam, you need to first create a driver program using the classes in one of the Beam SDKs. Your driver program defines your pipeline, including all of the inputs, transforms, and outputs; it also sets execution options for your pipeline (typically passed in using command-line options). These include the Pipeline Runner, which, in turn, determines what back-end your pipeline will run on.


The Beam SDKs provide a number of abstractions that simplify the mechanics of large-scale distributed data processing. The same Beam abstractions work with both batch and streaming data sources. When you create your Beam pipeline, you can think about your data processing task in terms of these abstractions. They include:


Pipeline: A Pipeline encapsulates your entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline. When you create the Pipeline, you must also specify the execution options that tell the Pipeline where and how to run.


PCollection: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection from in-memory data within your driver program. From there, PCollections are the inputs and outputs for each step in your pipeline.


PTransform: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects.
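To make these abstractions concrete, here is a minimal sketch using the Beam Python SDK; the input file name, output prefix, and DirectRunner choice are assumptions for illustration, not part of the guide.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Execution options, including the runner, are typically parsed from command-line arguments.
options = PipelineOptions(runner='DirectRunner')

with beam.Pipeline(options=options) as pipeline:                      # the Pipeline
    lines = pipeline | 'Read' >> beam.io.ReadFromText('input.txt')    # a PCollection
    lengths = lines | 'Length' >> beam.Map(len)                       # a PTransform step
    lengths | 'Write' >> beam.io.WriteToText('lengths')               # another PTransform, writing output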


Use the pipeline options to configure different aspects of your pipeline, such as the pipeline runner that will execute your pipeline and any runner-specific configuration required by the chosen runner. Your pipeline options will potentially include information such as your project ID or a location for storing files.
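As a hedged sketch of how pipeline options might be defined in the Python SDK, the custom option names (--input, --output) and their values below are illustrative assumptions.

from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Custom options are declared like argparse arguments.
        parser.add_argument('--input', default='input.txt',
                            help='Path of the file to read from')
        parser.add_argument('--output', default='output',
                            help='Path prefix for output files')

# In a real driver program the flags usually come from sys.argv.
options = MyOptions(['--input', 'data.txt', '--output', 'results/out'])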


To read from an external source, you use one of the Beam-provided I/O adapters. The adapters vary in their exact usage, but all of them read from some external data source and return a PCollection whose elements represent the data records in that source.
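For example, in the Python SDK a read might be sketched like this; the file path is a placeholder assumption.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    # Each element of `lines` is one data record (here, one line of text) from the source.
    lines = pipeline | beam.io.ReadFromText('gs://my-bucket/input.txt')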


A PCollection is owned by the specific Pipeline object for which it is created; multiple pipelines cannot share a PCollection. In some respects, a PCollection functions like a Collection class. However, a PCollection can differ in a few key ways:


The elements of a PCollection may be of any type, but must all be of the same type. However, to support distributed processing, Beam needs to be able to encode each individual element as a byte string (so elements can be passed around to distributed workers). The Beam SDKs provide a data encoding mechanism that includes built-in encoding for commonly-used types as well as support for specifying custom encodings as needed.
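The sketch below shows one way a custom encoding could be registered in the Python SDK; MyRecord and MyRecordCoder are hypothetical names used only for illustration.

from apache_beam import coders

class MyRecord(object):
    def __init__(self, name, count):
        self.name = name
        self.count = count

class MyRecordCoder(coders.Coder):
    def encode(self, value):
        # Encode each element as a byte string so it can be shipped between workers.
        return ('%s|%d' % (value.name, value.count)).encode('utf-8')

    def decode(self, encoded):
        name, count = encoded.decode('utf-8').split('|')
        return MyRecord(name, int(count))

    def is_deterministic(self):
        return True

# Tell Beam to use this coder whenever a PCollection holds MyRecord elements.
coders.registry.register_coder(MyRecord, MyRecordCoder)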


In many cases, the element type in a PCollection has a structure that can be introspected. Examples are JSON, Protocol Buffer, Avro, and database records. Schemas provide a way to express types as a set of named fields, allowing for more expressive aggregations.
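A hedged sketch of schema-style aggregation in the Python SDK follows; the user_id and amount fields and the sample rows are illustrative assumptions.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    totals = (
        pipeline
        | beam.Create([
            beam.Row(user_id='alice', amount=10.0),
            beam.Row(user_id='bob', amount=2.5),
            beam.Row(user_id='alice', amount=4.0)])
        # With named fields, aggregations can refer to fields by name.
        | beam.GroupBy('user_id').aggregate_field('amount', sum, 'total_amount'))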


A PCollection is immutable. Once created, you cannot add, remove, or change individual elements. A Beam Transform might process each element of a PCollection and generate new pipeline data (as a new PCollection), but it does not consume or modify the original input collection.


A PCollection can be either bounded or unbounded in size. A bounded PCollection represents a data set of a known, fixed size, while an unbounded PCollection represents a data set of unlimited size. Whether a PCollection is bounded or unbounded depends on the source of the data set that it represents. Reading from a batch data source, such as a file or a database, creates a bounded PCollection. Reading from a streaming or continuously-updating data source, such as Pub/Sub or Kafka, creates an unbounded PCollection (unless you explicitly tell it not to).


The bounded (or unbounded) nature of your PCollection affects how Beam processes your data. A bounded PCollection can be processed using a batch job, which might read the entire data set once, and perform processing in a job of finite length. An unbounded PCollection must be processed using a streaming job that runs continuously, as the entire collection can never be available for processing at any one time.
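The following sketch contrasts the two cases in the Python SDK; the file path and Pub/Sub topic are placeholder assumptions.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Bounded: reading a file yields a PCollection of known, fixed size (a batch job).
with beam.Pipeline() as batch_pipeline:
    bounded = batch_pipeline | beam.io.ReadFromText('input.txt')

# Unbounded: reading from a topic yields a continuously growing PCollection,
# so the pipeline must run as a streaming job.
streaming_options = PipelineOptions(streaming=True)
with beam.Pipeline(options=streaming_options) as streaming_pipeline:
    unbounded = streaming_pipeline | beam.io.ReadFromPubSub(
        topic='projects/my-project/topics/my-topic')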


Each element in a PCollection has an associated intrinsic timestamp. The timestamp for each element is initially assigned by the Source that creates the PCollection. Sources that create an unbounded PCollection often assign each new element a timestamp that corresponds to when the element was read or added.


Note: Sources that create a bounded PCollection for a fixed data set also automatically assign timestamps, but the most common behavior is to assign every element the same timestamp (Long.MIN_VALUE).


Timestamps are useful for a PCollection that contains elements with an inherent notion of time. If your pipeline is reading a stream of events, like Tweets or other social media messages, each element might use the time the event was posted as the element timestamp.
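As a rough sketch in the Python SDK, event timestamps can also be attached explicitly; the sample records and their ts field are assumptions for illustration.

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

with beam.Pipeline() as pipeline:
    events = (
        pipeline
        | beam.Create([
            {'text': 'hello', 'ts': 1609459200},
            {'text': 'world', 'ts': 1609459260}])
        # Replace the source-assigned timestamp with the event's own time.
        | beam.Map(lambda e: TimestampedValue(e, e['ts'])))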


To invoke a transform, you must apply it to the input PCollection. Each transform in the Beam SDKs has a generic apply method (or pipe operator). Invoking multiple Beam transforms is similar to method chaining, but with one slight difference: you apply the transform to the input PCollection, passing the transform itself as an argument, and the operation returns the output PCollection. In Beam YAML, transforms are applied by listing their inputs.
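A minimal sketch of applying and chaining transforms with the Python SDK's pipe operator; the sample data and step names are assumptions.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    words = pipeline | beam.Create(['to', 'be', 'or', 'not', 'to', 'be'])
    # Each application takes the input PCollection on the left and returns a new
    # output PCollection, so applications can be chained.
    counts = (
        words
        | 'PairWithOne' >> beam.Map(lambda w: (w, 1))
        | 'SumPerKey' >> beam.CombinePerKey(sum))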


You can also build your own composite transforms that nest multiple transforms inside a single, larger transform. Composite transforms are particularly useful for building a reusable sequence of simple steps that get used in a lot of different places.
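For example, a composite transform in the Python SDK might be sketched like this; CountWords is a hypothetical name.

import apache_beam as beam

class CountWords(beam.PTransform):
    # Nests several simple steps inside one reusable transform.
    def expand(self, lines):
        return (
            lines
            | 'Split' >> beam.FlatMap(lambda line: line.split())
            | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum))

with beam.Pipeline() as pipeline:
    counts = (
        pipeline
        | beam.Create(['the quick brown fox', 'the lazy dog'])
        | CountWords())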


In such roles, ParDo is a common intermediate step in a pipeline. You might use it to extract certain fields from a set of raw input records, or convert raw input into a different format; you might also use ParDo to convert processed data into a format suitable for output, like database table rows or printable strings.


In the example, our input PCollection contains string values. We apply a ParDo transform that specifies a function (ComputeWordLengthFn) to compute the length of each string, and outputs the result to a new PCollection of integer values that stores the length of each word.
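In the Python SDK, that example can be sketched roughly as follows; the sample input words are assumptions.

import apache_beam as beam

class ComputeWordLengthFn(beam.DoFn):
    # A DoFn that computes the length of each input string.
    def process(self, element):
        # A DoFn may emit zero or more outputs per element; here, exactly one.
        yield len(element)

with beam.Pipeline() as pipeline:
    words = pipeline | beam.Create(['start', 'the', 'pipeline'])
    # Apply ParDo with the DoFn to get a PCollection of word lengths.
    word_lengths = words | beam.ParDo(ComputeWordLengthFn())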


Note: When you create your DoFn, be mindful of the Requirements for writing user code for Beam transforms and ensure that your code follows them. You should avoid time-consuming operations such as reading large files in DoFn.Setup.


Note: If the elements in your input PCollection are key/value pairs, your process element method must have two parameters, for each of the key and value, respectively. Similarly, key/value pairs are also output as separate parameters to a single emitter function.


If your function is relatively straightforward, you can simplify your use of ParDo by providing a lightweight DoFn in-line: depending on the SDK, this can be an anonymous inner class instance, a lambda function, an anonymous function, or a function passed to PCollection.map or PCollection.flatMap.
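For instance, a sketch of the lightweight form in the Python SDK, using the same word-length example:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    words = pipeline | beam.Create(['start', 'the', 'pipeline'])
    # A lambda passed to FlatMap (or Map) stands in for a full DoFn class.
    word_lengths = words | beam.FlatMap(lambda word: [len(word)])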


Here is a sequence diagram that shows the lifecycle of the DoFn during the execution of the ParDo transform. The comments give useful information to pipeline developers such as the constraints that apply to the objects or particular cases such as failover or instance reuse. They also give instantiation use cases.


GroupByKey gathers up all the values with the same key and outputs a new pair consisting of the unique key and a collection of all of the values that were associated with that key in the input collection. Applying GroupByKey to a key/value collection therefore pairs each unique key with an iterable of its values, as in the sketch below.
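A small sketch in the Python SDK; the sample keys and values are illustrative assumptions.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    pairs = pipeline | beam.Create([
        ('cat', 1), ('dog', 5), ('and', 1), ('jump', 3), ('cat', 5), ('and', 2)])
    # Each output element pairs a unique key with an iterable of its values,
    # e.g. ('cat', [1, 5]) and ('and', [1, 2]).
    grouped = pairs | beam.GroupByKey()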


If you are using unbounded PCollections, you must use either non-global windowing or an aggregation trigger in order to perform a GroupByKey or CoGroupByKey. This is because a bounded GroupByKey or CoGroupByKey must wait for all the data with a certain key to be collected, but with unbounded collections, the data is unlimited. Windowing and/or triggers allow grouping to operate on logical, finite bundles of data within the unbounded data streams.


If you do apply GroupByKey or CoGroupByKey to a group of unbounded PCollections without setting either a non-global windowing strategy, a trigger strategy, or both for each collection, Beam generates an IllegalStateException error at pipeline construction time.


When using GroupByKey or CoGroupByKey to group PCollections that have a windowing strategy applied, all of the PCollections you want to group must use the same windowing strategy and window sizing. For example, all of the collections you are merging must use (hypothetically) identical 5-minute fixed windows, or 4-minute sliding windows starting every 30 seconds.
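A hedged sketch of windowing unbounded key/value data before grouping, in the Python SDK; the 60-second window size is an arbitrary choice and pairs stands in for an unbounded keyed PCollection.

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows

def window_then_group(pairs):
    return (
        pairs
        | 'Window' >> beam.WindowInto(FixedWindows(60))  # 1-minute fixed windows
        | 'Group' >> beam.GroupByKey())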


Combine is a Beam transform for combining collections of elements or values in your data. Combine has variants that work on entire PCollections, and some that combine the values for each key in PCollections of key/value pairs.
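As a sketch in the Python SDK (the sample data is an assumption):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    # Combine an entire PCollection into a single value.
    numbers = pipeline | 'Nums' >> beam.Create([1, 10, 100, 1000])
    total = numbers | beam.CombineGlobally(sum)

    # Combine the values for each key in a PCollection of key/value pairs.
    scores = pipeline | 'Scores' >> beam.Create([('a', 3), ('b', 5), ('a', 7)])
    max_per_key = scores | beam.CombinePerKey(max)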
