Batch processing is a technique where data is collected and processed as a group. It is used to periodically complete large, recurring, repetitive data jobs. Certain data processing tasks, such as backups, counting, and sorting, are compute-intensive and slow to run on individual dataset entries. So, instead of dedicating all the available resources to just one data entry at a time and underusing a processing unit’s capacity, batch processing utilizes the full potential of a processing unit by processing data in groups. The optimal size of these groups depends on the number of data entries a processing unit can handle at a given time and the number of available processing units.
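As a rough sketch of this idea, the snippet below processes a list of records in fixed-size groups rather than one entry at a time. The `make_batches` and `process_batch` helpers are hypothetical, introduced only for illustration:

```python
# A minimal sketch of batch processing: instead of handling one record at a
# time, records are grouped into fixed-size batches and each batch is
# processed as a unit. process_batch is a stand-in for any real batch job
# (a backup, a count, a sort, and so on).
def make_batches(records, batch_size):
    # Split the records into consecutive groups of at most batch_size items
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def process_batch(batch):
    # Stand-in workload: count the entries in the batch
    return len(batch)

records = list(range(10))  # ten dummy records
batches = make_batches(records, batch_size=4)
print([process_batch(b) for b in batches])  # [4, 4, 2]
```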
MapReduce is a framework developed by Google to handle large amounts of data in a timely and efficient manner. It has the advantage of executing a single task with the help of multiple processing units: it takes as many input batches of data as there are processing units and processes them all in one go.
Map and reduce are both user-defined functions, so it is entirely up to the user what operations they perform. The idea is that the task can be divided into two parts: mapping the input data to intermediate key-value pairs, and reducing those pairs to a summarized output. Here’s the basic workflow of batch processing with MapReduce:
First, the data is split into multiple parts called splits.
A map() function is applied to each split independently. It takes a split and produces a set of key-value pairs, which is the intermediate data.
The intermediate data generated by the map function is sorted and grouped. This brings all the values with the same key closer to each other.
A reduce() function is applied to each group independently. The reduce() function takes a key and all of the values against it and produces a summarized output for that key.
The output of the reduce() function is stored in the output directory.
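The steps above can be sketched in plain Python. The following toy word-count example, a common MapReduce illustration separate from the Netflix example in this article, walks through the map, sort-and-group, and reduce phases on two small splits (all names in it are made up):

```python
from itertools import groupby

# Map phase: each split independently produces (word, 1) key-value pairs
def map_split(split):
    return [(word, 1) for word in split.split()]

# Two splits of the input data
splits = ["the cat sat", "the dog sat"]
intermediate = [pair for split in splits for pair in map_split(split)]

# Shuffle phase: sort the intermediate pairs so equal keys sit next to
# each other, then group the values by key
intermediate.sort(key=lambda pair: pair[0])
groups = {key: [value for _, value in group]
          for key, group in groupby(intermediate, key=lambda pair: pair[0])}

# Reduce phase: summarize the values for each key (here, sum the counts)
word_counts = {key: sum(values) for key, values in groups.items()}
print(word_counts)  # {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}
```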
Let’s take a CSV file that contains a Netflix movies dataset with the following columns:
Netflix_data = {'index', 'type', 'title', 'director', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in'}
We need to get the number of movies that are available in Italy. If we search sequentially, we have to traverse the whole dataset one entry at a time, which is not an efficient way to get the data we need. Instead, we use MapReduce-style batch processing, which divides the data, counts the movies in Italy in each split simultaneously, and then summarizes the outputs obtained from the different processing units to get the total count of movies in Italy.
```python
import concurrent.futures

import pandas as pd


class MapReduce:
    def __init__(self, data):
        self.dataset = data

    # Map function
    def map(self, batch):
        # Count the movies available in Italy within a single batch
        counts = 0
        for i, eachrow in batch.iterrows():
            if eachrow['country'] == 'Italy':
                counts += 1
        return counts

    # Reduce function
    def reduce(self, batch_size=2, workers=2):
        total_counts = 0
        # Split the dataset into batches based on the row index
        batches = [group for i, group in self.dataset.groupby(lambda x: x // batch_size)]
        # Process the batches simultaneously with a thread pool
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
            # Apply the map function to each batch in parallel
            map_results = {executor.submit(self.map, batch): batch for batch in batches}
            # Aggregate the counts returned by the parallel map calls
            for map_result in concurrent.futures.as_completed(map_results):
                total_counts += map_result.result()
        return total_counts


if __name__ == '__main__':
    # Read the CSV data into a pandas DataFrame
    data = pd.read_csv('Netflix_movies.csv')
    # Create an object of the MapReduce class
    job = MapReduce(data)
    # We can play around with the batch size and the number of workers
    counts = job.reduce(batch_size=100, workers=3)
    print(f"Number of Movies in Italy: {counts}")
```
The explanation of the above code is as follows:
Lines 7–8: We define the constructor of the MapReduce class, which takes in the dataset and stores it in the class’s self.dataset attribute.
Lines 10–17: We define the map() function. It takes a batch of data and goes through the country column of each row to check whether the movie is available in Italy. If it is, the counts variable is incremented.
Lines 19–31: We define the reduce() function, which performs the following tasks:
It takes the batch size and the number of workers as input.
It splits the dataset into batches and uses the concurrent.futures module to process them simultaneously with a thread pool.
It executes the map() function on each batch of data in parallel.
It uses the map_results dictionary to track the parallel computations and obtain their results.
In the end, it aggregates all the results found in map_results and returns the total count of movies in Italy.
Lines 34–41: We define the main section. It reads the CSV data into a pandas DataFrame, creates an object of the MapReduce class, and calls the reduce() function with the chosen batch size and number of workers.
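If the Netflix_movies.csv file is not at hand, the same counting logic can be tried on a small in-memory DataFrame. The rows below are made up purely for illustration, and this sketch inlines the batching and thread-pool steps rather than reusing the class:

```python
import concurrent.futures

import pandas as pd

# A tiny, made-up stand-in for the Netflix dataset
data = pd.DataFrame({
    'title':   ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
    'country': ['Italy', 'France', 'Italy', 'Italy', 'Spain'],
})

def map_batch(batch):
    # Map step: count the rows whose country is Italy in one batch
    return int((batch['country'] == 'Italy').sum())

# Split the DataFrame into batches of two rows based on the row index
batches = [group for _, group in data.groupby(lambda x: x // 2)]

# Reduce step: run the map calls in a thread pool and sum their results
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    total = sum(executor.map(map_batch, batches))

print(f"Number of Movies in Italy: {total}")  # Number of Movies in Italy: 3
```

Here executor.map is used instead of submit plus as_completed; it is a simpler choice when we only need the results in input order and don’t need to track each future individually.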
While MapReduce offers many advantages, newer frameworks, such as Apache Spark, offer performance improvements for certain types of batch-processing workloads.