What is batch processing with MapReduce and how does it work?

Batch processing is a technique where data is collected and processed as a group. It is used to periodically complete large, recurring data jobs. Certain data processing tasks, such as backups, counting, and sorting, can be compute-intensive and slow to run on individual dataset entries. So, instead of devoting all the available resources to just one data entry at a time and underusing a processing unit’s capacity, batch processing utilizes the full potential of a processing unit by processing data in groups. The optimal size of these groups depends on the number of data entries a processing unit can handle at a given time and the number of available processing units.
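For instance, here is a minimal sketch of how a dataset can be divided into fixed-size groups before processing (the make_batches helper is illustrative, not a standard library function):

def make_batches(entries, batch_size):
    # Yield consecutive fixed-size groups of entries
    for start in range(0, len(entries), batch_size):
        yield entries[start:start + batch_size]

# A hypothetical dataset of 10 entries split into batches of 4
for batch in make_batches(list(range(10)), batch_size=4):
    print(batch)  # [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]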

Batch processing with MapReduce

MapReduce is a framework developed by Google to handle large amounts of data in a timely and efficient manner. Its advantage is that it executes a single task with the help of multiple processing units: it takes as many input batches of data as there are processing units and processes them in one MapReduce job. (A single iteration of MapReduce processing on one part of a dataset is called a MapReduce job.) Batch processing with MapReduce is a distributed data processing model that efficiently processes large datasets across a cluster of machines.

How does it work?

Map and reduce are both user-defined functions, so it is entirely up to users what operations they perform. The idea is that a task can be divided into two phases: first, mapping the input data to intermediate values, and then reducing those values to a final result. Here’s the basic workflow of batch processing with MapReduce (a minimal code sketch follows the workflow below):

  • First, the data is split into multiple parts called splits.

  • A map() function is applied to each split independently. It takes a split and produces a set of key-value pairs, which is the intermediate data.

  • The intermediate data generated by the map function is sorted and grouped. This brings all the values with the same keys closer to each other.

  • A reduce() function is applied to each group independently. The reduce function takes a key and all of the values associated with it and produces a summarized output for that key.

  • The output of the reduce() function is stored in the output directory.

Workflow of MapReduce
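To make these steps concrete, here is a minimal, single-machine sketch of the workflow in plain Python (the map_function and reduce_function names are illustrative, not part of any framework’s API):

from collections import defaultdict

def map_function(split):
    # Step 2: emit a (word, 1) key-value pair for every word in the split
    return [(word, 1) for word in split.split()]

def reduce_function(key, values):
    # Step 4: summarize all the values emitted for one key
    return key, sum(values)

splits = ["to be or", "not to be"]  # Step 1: split the data
intermediate = []
for split in splits:
    intermediate.extend(map_function(split))  # Step 2: map each split

grouped = defaultdict(list)
for key, value in sorted(intermediate):  # Step 3: sort and group by key
    grouped[key].append(value)

results = [reduce_function(key, values) for key, values in grouped.items()]
print(results)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]

In a real MapReduce deployment, the map and reduce calls run on different machines, and the sorted intermediate data is shuffled between them over the network.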

Example

Let’s take a CSV file that contains a Netflix movies dataset with the following header row:

index,type,title,director,country,date_added,release_year,rating,duration,listed_in

We need to get the number of movies that are available in Italy. If we start looking for the movies sequentially, we will have to traverse the whole dataset one entry at a time, which is not an efficient way to get the data we need. Instead, we use batch processing with MapReduce, which divides the data into splits, counts the movies in Italy in each split simultaneously, and then summarizes the outputs obtained from the different processing units to get the total count of movies in Italy.

main.py
import concurrent.futures

import pandas as pd


class MapReduce:
    def __init__(self, data):
        self.dataset = data

    # Map function
    def map(self, batch):
        # Count movies in Italy for a specific batch
        counts = 0
        for _, row in batch.iterrows():
            if row['country'] == 'Italy':
                counts += 1
        return counts

    # Reduce function
    def reduce(self, batch_size=2, workers=2):
        total_counts = 0
        # Split the dataset into batches based on the row index
        batches = [group for _, group in self.dataset.groupby(lambda x: x // batch_size)]
        # Process the batches simultaneously with a thread pool
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
            # Call the map function on every batch
            map_results = [executor.submit(self.map, batch) for batch in batches]
            # Aggregate the per-batch counts as they complete
            for map_result in concurrent.futures.as_completed(map_results):
                total_counts += map_result.result()
        return total_counts


if __name__ == '__main__':
    # Read the dataset into a pandas DataFrame
    data = pd.read_csv('Netflix_movies.csv')
    # Define an object of the MapReduce class
    job = MapReduce(data)
    # We can play around with the batch size and the number of workers below
    counts = job.reduce(batch_size=100, workers=3)
    print(f"Number of Movies in Italy: {counts}")

Code explanation

The explanation of the code above is as follows:

  • Lines 7–8: We define the constructor of the MapReduce class, which takes in the dataset and defines the self.dataset attribute of the class.

  • Lines 10–17: We define the map() function. It takes a batch of data and goes through the country column in each of its rows to check whether the movie is available in Italy. If it is, the counts variable is incremented.

  • Lines 19–31: We define the reduce() function, which performs the following tasks:

    • It takes the number of workers and batch size as input.

    • It splits the dataset into batches based on the row index and uses the concurrent.futures module to process those batches simultaneously with a thread pool.

    • It executes the map function on each batch of data in parallel.

    • It uses the map_results futures with concurrent.futures.as_completed() to track progress and obtain the results of the parallel computations.

    • In the end, it aggregates the counts from all the completed map results and returns the total count of movies in Italy.

  • Lines 34–41: We define the main block. It reads the CSV data into a pandas DataFrame, creates an object of the MapReduce class, and calls the reduce() function with a defined batch size and number of workers.

Conclusion

While MapReduce has many advantages, newer frameworks, such as Apache Spark, offer performance improvements for certain types of batch-processing workloads.
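For comparison, here is a hedged sketch of the same Italy count written with PySpark’s DataFrame API, assuming the same Netflix_movies.csv file with a header row:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("NetflixItalyCount").getOrCreate()

# Read the CSV file into a distributed DataFrame
df = spark.read.csv("Netflix_movies.csv", header=True)

# The filter and count are distributed across the cluster's workers
count = df.filter(df.country == "Italy").count()
print(f"Number of Movies in Italy: {count}")

spark.stop()

Spark keeps intermediate data in memory between stages, which is the main source of its speedup over disk-based MapReduce for iterative workloads.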

