Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from sources such as Kafka, Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions like map, reduce, join, and window. Processed data can then be pushed out to filesystems, databases, and live dashboards. Spark Streaming natively supports both batch and streaming workloads.
Spark Streaming receives live input data streams and divides them into batches. Its key abstraction is the Discretized Stream, or DStream, which represents a data stream as a sequence of small batches. DStreams are built on RDDs, Spark’s core data abstraction, which lets Spark Streaming integrate seamlessly with other Spark components like MLlib and Spark SQL. The Spark engine processes these batches of input data to generate the final stream of results, also in batches.
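For example, here is a minimal Scala sketch of the classic streaming word count: it reads text from a TCP socket (the localhost:9999 address is an assumption for illustration), groups the input into one-second micro-batches, and counts words with the same operators you would use on RDDs.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Group incoming data into 1-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of text lines arriving on a TCP socket (address is illustrative).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each micro-batch is an RDD, so the familiar RDD-style operators apply.
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()      // print the first elements of each batch of results

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // run until stopped or failed
  }
}
```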
Spark Streaming brings Apache Spark’s language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala, and Python.
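As an illustration of that parity, the sketch below (function name, file path, and socket address are all hypothetical) defines one RDD-level function and applies it unchanged to a historical batch and, via transform, to every micro-batch of a live stream.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchAndStream {
  // Business logic written once, against plain RDDs.
  def hashtagCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" "))
         .filter(_.startsWith("#"))
         .map(tag => (tag, 1))
         .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchAndStream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Batch mode: run the function over a historical file (illustrative path).
    val historical = ssc.sparkContext.textFile("data/old-tweets.txt")
    hashtagCounts(historical).take(10).foreach(println)

    // Streaming mode: run the identical function over each micro-batch.
    val live = ssc.socketTextStream("localhost", 9999)
    live.transform(rdd => hashtagCounts(rdd)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```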
Spark Streaming recovers lost work and operator state (e.g., sliding windows) out of the box without any extra code on your part.
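This recovery relies on checkpointing. A minimal sketch, assuming an illustrative checkpoint path, socket source, and window sizes: StreamingContext.getOrCreate rebuilds the context, including the sliding-window state, from the checkpoint after a driver restart.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableWordCount {
  val checkpointDir = "/tmp/streaming-checkpoint"  // illustrative path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("RecoverableWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)  // required for stateful/windowed operators

    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      // Sliding 60-second window; its state is checkpointed automatically.
      .reduceByKeyAndWindow(
        (a: Int, b: Int) => a + b,  // add counts entering the window
        (a: Int, b: Int) => a - b,  // subtract counts leaving the window
        Seconds(60), Seconds(10))
      .print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a failure, the driver restores the context (including window
    // state) from the checkpoint instead of starting from scratch.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```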
Because it runs on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. It also integrates natively with Spark’s advanced processing libraries (SQL, machine learning, graph processing).
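For instance, a stream can be enriched against a historical dataset with an ordinary RDD join inside transform. The file path, record formats, and socket address below are assumptions for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamStaticJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamStaticJoin")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Historical data loaded once as a static RDD
    // (illustrative path; "userId,country" lines assumed).
    val userCountries = ssc.sparkContext
      .textFile("data/users.csv")
      .map { line => val Array(id, country) = line.split(","); (id, country) }

    // Live events keyed the same way ("userId,action" lines assumed).
    val events = ssc.socketTextStream("localhost", 9999)
      .map { line => val Array(id, action) = line.split(","); (id, action) }

    // transform exposes each micro-batch as an RDD, so a plain RDD join
    // against the historical dataset works unchanged.
    events.transform(rdd => rdd.join(userCountries)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```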