A PCollection is a multi-element dataset. Every data processing pipeline needs at least one PCollection, and most contain several, which hold the pipeline's input and output data.
PCollection
There are two ways to create a PCollection: by reading from an external source, or from in-memory data.
External sources are typically used in production. In contrast, in-memory sources are used for debugging and testing.
To read data from external sources, you need Beam I/O adapters. The adapters differ in usage, but each of them reads from an external data source and returns a PCollection.
Let’s look at an example using the ReadFromText adapter.
lines = pipeline | 'ReadFile' >> beam.io.ReadFromText('gs://some/input.txt')
Here, we pass the file location as an argument. In this case, gs://some/input.txt is a Google Cloud Storage location. Each adapter has a Read transform. To read, you apply that transform to the pipeline object itself.
To create a PCollection from in-memory data, you use the Create transform. This transform can be applied directly to the pipeline object, and you pass the data in the code.
Let’s look at an example.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # Create builds a PCollection from the in-memory list passed to it.
    lines = (
        pipeline
        | beam.Create([
            'Welcome to the SHOT: ',
            'This is in-memory data ',
            'it is used in testing ',
            'This is never used in prod ',
        ]))
PCollection features
Before we begin, it is essential to know that PCollections cannot be shared between different pipelines.
You cannot access a random element from a PCollection. Instead, your code should read elements one by one.
The data in a PCollection can be of any type (integer, string, etc.), but all elements must be of the same type.
Beam encodes every element in a PCollection as a byte string to support distributed processing.
The element type in a PCollection often has a structure, e.g., JSON, Protocol Buffers, Avro, or database records. Describing that structure with a schema allows us to perform more complex operations.
PCollections are immutable. Once created, a PCollection cannot be modified in any way (no elements can be added, removed, or changed).
When a transformation is applied to a PCollection, its elements are read one by one, transformed, and stored in a new PCollection.
Hence, the original is never modified.
There is no upper limit on the number of elements stored in a PCollection. The data may fit in memory on a single machine, or Beam can distribute it across several machines.
A PCollection can be either bounded or unbounded. A bounded PCollection represents a fixed amount of data, for example, files in Google Cloud Storage. An unbounded PCollection represents a potentially infinite amount of data; for instance, when reading from a streaming source (Kafka or Pub/Sub), we may not know how much data we will receive.
A timestamp is assigned to every element in a PCollection when it is created from the source.
Timestamps can be helpful when you want to filter or transform the data with a particular timestamp.