How to do serverless data processing with Dataflow

Serverless data processing is a powerful way to analyze, transform, and process data without the burden of managing infrastructure. Google Cloud Dataflow, powered by Apache Beam, is a managed service that simplifies data processing tasks, making it an ideal choice for beginners. In this Answer, we will learn how to set up and execute a basic data processing pipeline using Google Cloud Dataflow and the Apache Beam framework in a local development environment, including creating a virtual environment.

Dataflow pipeline prerequisites

  • Python must be installed on the local machine

  • Pip must be installed to manage Python packages

  • Basic knowledge of data processing concepts

Implementation of the Dataflow pipeline

The following steps show a basic implementation of a Dataflow pipeline using Apache Beam.

Step 1: Setting up the local development and virtual environments

To begin, we need to set up a local development environment to work with Apache Beam and Google Cloud Dataflow. We must also create a virtual environment to isolate our project dependencies, as shown below:

  1. We open a terminal and navigate to the directory where we want to create our project.

  2. We create a virtual environment named dataflow for our project as follows:

python -m venv dataflow

  3. We activate the virtual environment as follows:

source dataflow/bin/activate

Step 2: Installing Apache Beam in the virtual environment

Within our virtual environment, we install the Apache Beam Python SDK and the necessary dependencies using pip, as shown below:

pip install apache-beam
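
To confirm that the SDK installed correctly, we can import it and print its version; this quick sanity check is optional. (If you later plan to submit jobs to the Dataflow service, the GCP extras can also be installed with pip install 'apache-beam[gcp]'.)

# Optional sanity check: confirm the Apache Beam SDK is importable
import apache_beam as beam
print(beam.__version__)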

Step 3: Preparing our data

We use the following .txt file for this Answer:

This is a sample text file to test dataflow processing using Apache Beam.
It contains two lines of text for word counting.
Sample text file for processing
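
If we prefer to create this file from code rather than by hand, a short script like the one below does the job. It assumes the file name data.txt, which the pipeline in the next step reads from.

# Write the two-line sample file that the pipeline will read
sample_text = (
    "This is a sample text file to test dataflow processing using Apache Beam.\n"
    "It contains two lines of text for word counting.\n"
)
with open("data.txt", "w") as f:
    f.write(sample_text)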

Step 4: Developing our Dataflow pipeline

Now, it’s time to write the code for our data processing pipeline using Apache Beam by performing the following steps:

  1. We create a simple Python script (saved as local-dataflow.py, for example) to define our Dataflow pipeline as follows:

import apache_beam as beam

# Define a Dataflow pipeline
def run():
    with beam.Pipeline() as pipeline:
        data = (
            pipeline
            | 'ReadData' >> beam.io.ReadFromText('data.txt')
            | 'TransformData' >> beam.Map(lambda line: line.upper())
            | 'WriteData' >> beam.io.WriteToText('output.txt')
        )

if __name__ == '__main__':
    run()

Note: In the code, replace 'data.txt' with the path to your local data file and 'output.txt' with the path where you want to store the processed data.
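
Once the pipeline works, the same code can be submitted to the Dataflow service by supplying pipeline options such as the runner, project, region, and a Cloud Storage staging location. The following is a minimal sketch rather than a definitive setup: it assumes the apache-beam[gcp] extras are installed and uses placeholder values (my-project, us-central1, gs://my-bucket) that you would replace with your own.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Placeholder project, region, and bucket values; replace them with
    # values from your own Google Cloud project before running
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/temp',
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | 'ReadData' >> beam.io.ReadFromText('gs://my-bucket/data.txt')
            | 'TransformData' >> beam.Map(lambda line: line.upper())
            | 'WriteData' >> beam.io.WriteToText('gs://my-bucket/output')
        )

if __name__ == '__main__':
    run()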

Step 5: Testing locally

  1. We open a terminal and navigate to the directory containing our Python script.

  2. We run the pipeline locally using the following command:

python local-dataflow.py
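
Because WriteToText treats the given path as a prefix and shards its output, the processed lines end up in a file such as output.txt-00000-of-00001 rather than output.txt itself. A short snippet like the following can be used to inspect whatever shards were produced:

import glob

# Print the contents of every output shard written by the pipeline
for path in glob.glob('output.txt-*'):
    with open(path) as f:
        print(f.read())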

The following widget shows a variant of the pipeline that counts the words in the same file; we execute it to see Apache Beam in action:

import apache_beam as beam

# Define a Dataflow pipeline
def run():
    with beam.Pipeline() as pipeline:
        # Read a text file
        lines = pipeline | "ReadFromText" >> beam.io.ReadFromText("data.txt")
        
        # Count the number of words in each line
        word_counts = (
            lines
            | "SplitWords" >> beam.FlatMap(lambda line: line.split())
            | "CountWords" >> beam.combiners.Count.PerElement()
        )

        # Print the results
        word_counts | "PrintResults" >> beam.Map(print)

if __name__ == "__main__":
    run()
Example of Apache Beam pipeline to count words in a text file

The output lists each word with its count (for example, ('text', 2), because the word text appears on both lines of the sample file), confirming that the pipeline executed successfully.
