Apache Lucene, an open-source search library, was released in 1999, and Hadoop later grew out of work built on top of it. Today, Apache Hadoop and Apache Spark are two of the most widely used big data frameworks.
Spark was originally developed in 2009 by Matei Zaharia while he was a graduate student at the University of California, Berkeley. His key contribution was organizing data and computation, keeping intermediate results in memory, so that processing scales efficiently across the nodes of a distributed cluster. Spark processes large amounts of data by dividing the workload across nodes and, for many workloads, does so faster than Hadoop. It is also more general than Hadoop, handling interactive queries and iterative algorithms that MapReduce alone handles poorly.
The following are the main components of Apache Spark:
Spark Core is the foundation of the entire project. It handles scheduling, task distribution, input and output operations, and fault recovery, and it serves as the base on which the other components are built.
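Below is a minimal sketch of the RDD API that Spark Core exposes, assuming a local master (`local[*]`); the application name, partition count, and numbers are purely illustrative. Spark Core splits the collection into partitions, schedules one task per partition, and combines the partial results.

```scala
import org.apache.spark.sql.SparkSession

object CoreSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; a real job would point at a cluster master.
    val spark = SparkSession.builder().appName("core-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Spark Core distributes this collection as an RDD with 8 partitions,
    // runs one task per partition, and reduces the partial sums into a total.
    val numbers = sc.parallelize(1 to 1000, numSlices = 8)
    val total = numbers.map(_ * 2).reduce(_ + _)
    println(s"total = $total")

    spark.stop()
  }
}
```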
Spark SQL allows users to perform optimized structured data processing by executing SQL queries directly or by using the Spark Dataset API to access the SQL execution engine.
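As a rough sketch of those two routes into the same engine, the snippet below registers a small, made-up DataFrame as a temporary view and aggregates it once with a SQL string and once with the DataFrame API; column names and values are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny made-up dataset registered as a temporary view.
    val sales = Seq(("books", 12.0), ("games", 30.0), ("books", 8.5)).toDF("category", "amount")
    sales.createOrReplaceTempView("sales")

    // The same aggregation expressed as raw SQL and as the DataFrame/Dataset API;
    // both are planned and optimized by the same SQL execution engine.
    spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category").show()
    sales.groupBy("category").sum("amount").show()

    spark.stop()
  }
}
```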
Spark Streaming ingests data from multiple streaming sources and divides it into micro-batches, so applications see a smooth, uninterrupted stream of processed results.
Structured Streaming is the newer streaming API, built on the Spark SQL engine, which reduces latency and simplifies programming.
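A small Structured Streaming sketch, assuming a text socket source on localhost:9999 (for example, fed by `nc -lk 9999`); the source and query are illustrative. Spark treats the incoming lines as an unbounded table and updates the word counts as each micro-batch arrives.

```scala
import org.apache.spark.sql.SparkSession

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read lines from a local socket as an unbounded, continuously growing table.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Running word count, recomputed incrementally for every micro-batch.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```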
MLlib is Spark's built-in machine learning library. It provides common machine learning algorithms, tools for feature extraction and selection, and utilities for building machine learning pipelines. Its primary API is DataFrame-based and is uniform across programming languages such as Python, Java, and Scala.
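The sketch below strings a feature transformer and a classifier into a single MLlib Pipeline over a DataFrame; the four-row training set and column names are invented purely for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Invented training data: a binary label and two numeric features.
    val training = Seq(
      (0.0, 1.1, 0.1),
      (1.0, 2.0, 1.0),
      (0.0, 1.3, -0.5),
      (1.0, 2.2, 1.2)
    ).toDF("label", "f1", "f2")

    // Assemble the feature columns into a vector, then fit a logistic regression;
    // both stages are chained into one Pipeline that runs over DataFrames.
    val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

    model.transform(training).select("label", "prediction").show()

    spark.stop()
  }
}
```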
GraphX is Spark's engine for scalable, graph-structured data: it lets users interactively build, transform, and analyze graphs, and it ships with common graph algorithms.
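GraphX's API is exposed through Scala. The sketch below builds a tiny, made-up follower graph and runs the built-in PageRank algorithm over it; vertex names, edge labels, and the tolerance value are illustrative.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graphx-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // A tiny directed graph: vertices carry user names, edges carry a relationship label.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")
    ))
    val graph = Graph(vertices, edges)

    // PageRank is one of the graph algorithms GraphX ships with.
    val ranks = graph.pageRank(tol = 0.001).vertices
    ranks.join(vertices).collect().foreach {
      case (_, (rank, name)) => println(f"$name%-6s $rank%.3f")
    }

    spark.stop()
  }
}
```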
Hadoop was initially developed in 2006 by Doug Cutting and Mike Cafarella to process large volumes of data using the MapReduce model. Hadoop distributes a large computing problem across many machines, performs the calculations locally on each one, and then combines the results.
HDFS (the Hadoop Distributed File System) handles the distribution, storage, and retrieval of data across multiple separate servers, providing high-throughput data access and high fault tolerance.
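A brief sketch of talking to HDFS through Hadoop's FileSystem API, assuming a namenode reachable at the placeholder address `hdfs://namenode:9000`; the file path and contents are illustrative. Writing a file hands it to HDFS, which splits it into blocks and replicates them across datanodes; the last call reports which hosts hold those blocks.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder namenode address; substitute the cluster's real fs.defaultFS.
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:9000")
    val fs = FileSystem.get(conf)

    // Write a small file; HDFS splits it into blocks and replicates them across datanodes.
    val path = new Path("/tmp/hello.txt")
    val out = fs.create(path)
    out.writeBytes("hello hdfs\n")
    out.close()

    // Read back its metadata and print which hosts store each block.
    val status = fs.getFileStatus(path)
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
      println(block.getHosts.mkString(", "))
    }

    fs.close()
  }
}
```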
MapReduce breaks a large data processing job into smaller tasks, distributes them across separate nodes, and then executes each task close to where its data lives.
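The classic word count job below illustrates that split: the mapper emits (word, 1) pairs from each line, the framework shuffles the pairs by key, and each reducer sums the counts for its words. This is a sketch written in Scala against Hadoop's Java MapReduce API; input and output paths come from the command line, and the class names are our own.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.jdk.CollectionConverters._

// Map phase: split each input line into words and emit (word, 1).
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Reduce phase: sum the counts emitted for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word-count")
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```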
YARN (Yet Another Resource Negotiator) is the cluster resource manager: it schedules jobs, launches their tasks, and allocates computing resources such as CPU and memory.
Hadoop Common is a collection of utilities and libraries shared by the other Hadoop modules.
Spark is typically faster because it works with data in random access memory (RAM) rather than reading and writing intermediate results to disk.
Hadoop, by contrast, gathers data from various sources and processes it in batches through MapReduce, writing intermediate results to disk.
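The difference shows up most clearly when the same data is reused: the sketch below pins a small, made-up DataFrame in executor memory with `persist`, so the second action reads it from RAM instead of recomputing it from its source. Column names and values are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up data; in practice this would come from files, Kafka, a database, etc.
    val events = Seq(("alice", "ok"), ("bob", "error"), ("alice", "ok"))
      .toDF("user", "status")
      .filter($"status" === "ok")

    // Keep the filtered result in memory so later actions avoid recomputing it.
    events.persist(StorageLevel.MEMORY_ONLY)

    println(events.count())               // first action materializes the cached data
    events.groupBy("user").count().show() // reuses the in-memory copy

    events.unpersist()
    spark.stop()
  }
}
```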
On security, Hadoop has the edge: it supports multiple authentication and access control mechanisms, including Kerberos authentication and HDFS file permissions.
Spark improves its security with shared-secret authentication between its processes and with event logging.
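A minimal sketch of turning those two knobs on through configuration; the secret value and event log directory are placeholders, not recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SecureSessionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.authenticate", "true")              // require a shared secret between Spark processes
      .set("spark.authenticate.secret", "change-me")  // placeholder secret
      .set("spark.eventLog.enabled", "true")          // record application events for auditing/replay
      .set("spark.eventLog.dir", "file:///tmp/spark-events") // placeholder; directory must exist

    val spark = SparkSession.builder()
      .appName("secure-sketch")
      .master("local[*]")
      .config(conf)
      .getOrCreate()

    spark.stop()
  }
}
```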
For machine learning, Spark is the better fit because it ships with MLlib, which runs iterative machine learning computations in memory.