Batch processing is a technique where data is collected and processed as a group. It is used to periodically complete large, recurring, repetitive data jobs. Certain data processing tasks, such as backups, counting, and sorting, are compute-intensive and slow to run on individual dataset entries. So, instead of dedicating all the available resources to just one data entry at a time and underusing a processing unit’s capacity, batch processing utilizes the full potential of a processing unit by processing data in groups. The optimal size of these groups depends on the number of data entries a processing unit can handle at a given time and the number of available processing units.
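As a rough sketch of this idea, the snippet below processes a list of records in fixed-size groups rather than one entry at a time. The `make_batches` and `process_batch` helpers are hypothetical, introduced only for illustration:

```python
# A minimal sketch of batch processing: instead of handling one record at a
# time, records are grouped into fixed-size batches and each batch is
# processed as a unit. process_batch is a stand-in for any real batch job
# (a backup, a count, a sort, and so on).
def make_batches(records, batch_size):
    # Split the records into consecutive groups of at most batch_size items
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def process_batch(batch):
    # Stand-in workload: count the entries in the batch
    return len(batch)

records = list(range(10))  # ten dummy records
batches = make_batches(records, batch_size=4)
print([process_batch(b) for b in batches])  # [4, 4, 2]
```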
MapReduce is a framework developed by Google to handle large amounts of data in a timely and efficient manner. It has the advantage of executing a single task with the help of multiple processing units: it takes as many input batches of data as there are processing units and processes them all in one go.
Map and reduce are both user-defined functions, so it is entirely up to the user what operations they perform. The idea is that the task can be divided into two parts: mapping the input data to intermediate key-value pairs, and reducing those pairs to a summarized output. Here’s the basic workflow of batch processing with MapReduce:
First, the data is split into multiple parts called splits.
A map() function is applied to each split independently. It takes a split and produces a set of key-value pairs, which is the intermediate data.
The intermediate data generated by the map function is sorted and grouped. This brings all the values with the same key closer to each other.
A reduce() function is applied to each group independently. The reduce() function takes a key and all of the values against it and produces a summarized output for that key.
The output of the reduce() function is stored in the output directory.
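The steps above can be sketched in plain Python. The following toy word-count example, a common MapReduce illustration separate from the Netflix example in this article, walks through the map, sort-and-group, and reduce phases on two small splits (all names in it are made up):

```python
from itertools import groupby

# Map phase: each split independently produces (word, 1) key-value pairs
def map_split(split):
    return [(word, 1) for word in split.split()]

# Two splits of the input data
splits = ["the cat sat", "the dog sat"]
intermediate = [pair for split in splits for pair in map_split(split)]

# Shuffle phase: sort the intermediate pairs so equal keys sit next to
# each other, then group the values by key
intermediate.sort(key=lambda pair: pair[0])
groups = {key: [value for _, value in group]
          for key, group in groupby(intermediate, key=lambda pair: pair[0])}

# Reduce phase: summarize the values for each key (here, sum the counts)
word_counts = {key: sum(values) for key, values in groups.items()}
print(word_counts)  # {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}
```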
Let’s take a CSV file that contains a Netflix movies dataset with the following columns:
Netflix_data = {'index', 'type', 'title', 'director', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in'}
We need to get the number of movies that are available in Italy. If we search sequentially, we have to traverse the whole dataset one entry at a time, which is not an efficient way to get the data we need. Instead, we use MapReduce-style batch processing, which divides the data, counts the movies in Italy in each split simultaneously, and then summarizes the outputs obtained from the different processing units to get the total count of movies in Italy.
```python
import concurrent.futures

import pandas as pd


class MapReduce:
    def __init__(self, data):
        self.dataset = data

    # Map function
    def map(self, batch):
        # Count the movies available in Italy within a single batch
        counts = 0
        for i, eachrow in batch.iterrows():
            if eachrow['country'] == 'Italy':
                counts += 1
        return counts

    # Reduce function
    def reduce(self, batch_size=2, workers=2):
        total_counts = 0
        # Split the dataset into batches based on the row index
        batches = [group for i, group in self.dataset.groupby(lambda x: x // batch_size)]
        # Process the batches simultaneously with a thread pool
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
            # Apply the map function to each batch in parallel
            map_results = {executor.submit(self.map, batch): batch for batch in batches}
            # Aggregate the counts returned by the parallel map calls
            for map_result in concurrent.futures.as_completed(map_results):
                total_counts += map_result.result()
        return total_counts


if __name__ == '__main__':
    # Read the CSV data into a pandas DataFrame
    data = pd.read_csv('Netflix_movies.csv')
    # Create an object of the MapReduce class
    job = MapReduce(data)
    # We can play around with the batch size and the number of workers
    counts = job.reduce(batch_size=100, workers=3)
    print(f"Number of Movies in Italy: {counts}")
```
The explanation of the above code is as follows:
Lines 7–8: We define the constructor of the MapReduce class, which takes in the dataset and stores it in the class’s self.dataset attribute.
Lines 10–17: We define the map() function. It takes a batch of data and goes through the country column of each row to check whether the movie is available in Italy. If it is, the counts variable is incremented.
Lines 19–31: We define the reduce() function, which performs the following tasks:
It takes the batch size and the number of workers as input.
It splits the dataset into batches and uses the concurrent.futures module to process them simultaneously with a thread pool.
It executes the map() function on each batch of data in parallel.
It uses the map_results dictionary to track the parallel computations and obtain their results.
In the end, it aggregates all the results found in map_results and returns the total count of movies in Italy.
Lines 34–41: We define the main section. It reads the CSV data into a pandas DataFrame, creates an object of the MapReduce class, and calls the reduce() function with the chosen batch size and number of workers.
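If the Netflix_movies.csv file is not at hand, the same counting logic can be tried on a small in-memory DataFrame. The rows below are made up purely for illustration, and this sketch inlines the batching and thread-pool steps rather than reusing the class:

```python
import concurrent.futures

import pandas as pd

# A tiny, made-up stand-in for the Netflix dataset
data = pd.DataFrame({
    'title':   ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
    'country': ['Italy', 'France', 'Italy', 'Italy', 'Spain'],
})

def map_batch(batch):
    # Map step: count the rows whose country is Italy in one batch
    return int((batch['country'] == 'Italy').sum())

# Split the DataFrame into batches of two rows based on the row index
batches = [group for _, group in data.groupby(lambda x: x // 2)]

# Reduce step: run the map calls in a thread pool and sum their results
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    total = sum(executor.map(map_batch, batches))

print(f"Number of Movies in Italy: {total}")  # Number of Movies in Italy: 3
```

Here executor.map is used instead of submit plus as_completed; it is a simpler choice when we only need the results in input order and don’t need to track each future individually.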
While MapReduce offers many advantages, newer frameworks, such as Apache Spark, offer performance improvements for certain types of batch-processing workloads.