Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from sources such as Kafka, Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions like map, reduce, join, and window. Processed data can then be pushed out to filesystems, databases, and live dashboards. Spark Streaming natively supports both batch and streaming workloads.
Spark Streaming receives live input data streams and divides them into batches. Its key abstraction is the Discretized Stream, or DStream, which represents a data stream as a sequence of small batches. DStreams are built on RDDs, Spark’s core data abstraction, which lets Spark Streaming integrate seamlessly with other Spark components like MLlib and Spark SQL. The Spark engine processes these batches of input data to generate the final stream of results, also in batches.
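For example, here is a minimal Scala sketch of the classic streaming word count: it reads text from a TCP socket (the localhost:9999 address is an assumption for illustration), groups the input into one-second micro-batches, and counts words with the same operators you would use on RDDs.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Group incoming data into 1-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of text lines arriving on a TCP socket (address is illustrative).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each micro-batch is an RDD, so the familiar RDD-style operators apply.
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()      // print the first elements of each batch of results

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // run until stopped or failed
  }
}
```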
Spark Streaming brings Apache Spark’s language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala, and Python.
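As an illustration of that parity, the sketch below (function name, file path, and socket address are all hypothetical) defines one RDD-level function and applies it unchanged to a historical batch and, via transform, to every micro-batch of a live stream.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchAndStream {
  // Business logic written once, against plain RDDs.
  def hashtagCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" "))
         .filter(_.startsWith("#"))
         .map(tag => (tag, 1))
         .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchAndStream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Batch mode: run the function over a historical file (illustrative path).
    val historical = ssc.sparkContext.textFile("data/old-tweets.txt")
    hashtagCounts(historical).take(10).foreach(println)

    // Streaming mode: run the identical function over each micro-batch.
    val live = ssc.socketTextStream("localhost", 9999)
    live.transform(rdd => hashtagCounts(rdd)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```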
Spark Streaming recovers lost work and operator state (e.g., sliding windows) out of the box without any extra code on your part.
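This recovery relies on checkpointing. A minimal sketch, assuming an illustrative checkpoint path, socket source, and window sizes: StreamingContext.getOrCreate rebuilds the context, including the sliding-window state, from the checkpoint after a driver restart.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableWordCount {
  val checkpointDir = "/tmp/streaming-checkpoint"  // illustrative path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("RecoverableWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)  // required for stateful/windowed operators

    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      // Sliding 60-second window; its state is checkpointed automatically.
      .reduceByKeyAndWindow(
        (a: Int, b: Int) => a + b,  // add counts entering the window
        (a: Int, b: Int) => a - b,  // subtract counts leaving the window
        Seconds(60), Seconds(10))
      .print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a failure, the driver restores the context (including window
    // state) from the checkpoint instead of starting from scratch.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```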
Because it runs on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. It also integrates natively with Spark’s advanced processing libraries (SQL, machine learning, graph processing).
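For instance, a stream can be enriched against a historical dataset with an ordinary RDD join inside transform. The file path, record formats, and socket address below are assumptions for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamStaticJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamStaticJoin")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Historical data loaded once as a static RDD
    // (illustrative path; "userId,country" lines assumed).
    val userCountries = ssc.sparkContext
      .textFile("data/users.csv")
      .map { line => val Array(id, country) = line.split(","); (id, country) }

    // Live events keyed the same way ("userId,action" lines assumed).
    val events = ssc.socketTextStream("localhost", 9999)
      .map { line => val Array(id, action) = line.split(","); (id, action) }

    // transform exposes each micro-batch as an RDD, so a plain RDD join
    // against the historical dataset works unchanged.
    events.transform(rdd => rdd.join(userCountries)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```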