14 January 2026

#Spark Streaming

Key Concepts


| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | Spark Streaming | What is Spark Streaming, Real-time data, Micro-batch processing, Advantages, Use cases |
| 2 | Spark Streaming Architecture | Driver, Receiver, DStream, Scheduler, Executors |
| 3 | DStream Basics | Definition, Creation, Operations, RDDs, Transformations |
| 4 | Creating DStreams | From sources: Kafka, Flume, TCP sockets, File streams, Custom receivers |
| 5 | Transformations on DStreams | map(), flatMap(), filter(), reduceByKey(), window() |
| 6 | Window Operations | window(), slideDuration, reduceByKeyAndWindow(), countByValueAndWindow(), Examples |
| 7 | Stateful Transformations | updateStateByKey(), mapWithState(), Example, Use cases, Performance |
| 8 | Actions on DStreams | print(), count(), saveAsTextFiles(), foreachRDD(), Examples |
| 9 | Data Sources Integration | Kafka, Flume, HDFS, Socket, Custom sources |
| 10 | Sinks / Output Operations | print(), saveAsTextFiles(), saveAsObjectFiles(), foreachRDD(), write to DB |
| 11 | Checkpointing | Definition, Directory setup, Purpose, Examples, Fault tolerance |
| 12 | Receiver Types | Reliable receiver, Unreliable receiver, Custom receiver, Receiver lifecycle, Examples |
| 13 | Transformations: map vs flatMap | map(), flatMap(), Use cases, Examples, Differences |
| 14 | Transformations: reduceByKey | reduceByKey(), reduceByKeyAndWindow(), Examples, Use cases, Performance |
| 15 | Transformations: join in streaming | join(), leftOuterJoin(), rightOuterJoin(), fullOuterJoin(), Example |
| 16 | Transformations: union & transform | union(), transform(), Example, Use cases, Combining multiple streams |
| 17 | Handling Late Data | Watermarks, Window operations, State management, Dropping late data, Examples |
| 18 | Kafka Integration | DirectStream vs ReceiverStream, Kafka parameters, Offset management, Example, Best practices |
| 19 | Flume Integration | Spark Streaming + Flume, Push vs Pull, Receiver setup, Example, Best practices |
| 20 | File Stream Source | HDFS integration, Local files, Monitoring new files, Examples, Performance considerations |
| 21 | Structured Streaming Introduction | Differences from DStream, High-level API, DataFrames & Datasets, Fault-tolerance, Example |
| 22 | Structured Streaming Sources | Kafka, File, Socket, Rate source, Custom sources |
| 23 | Structured Streaming Sinks | Console, File, Kafka, foreachBatch, Memory |
| 24 | Event Time & Watermarks | Definition, Handling late data, withWatermark(), Examples, Use cases |
| 25 | Window Operations in Structured Streaming | window(), slideDuration, groupBy window(), Examples, Performance tips |
| 26 | Stateful Operations in Structured Streaming | mapGroupsWithState(), flatMapGroupsWithState(), Examples, Use cases, Performance |
| 27 | Performance Tuning | Batch interval, Partitioning, Backpressure, Checkpointing, Resource tuning |
| 28 | Fault Tolerance & Reliability | Checkpointing, Write-ahead logs, Replay, Receiver reliability, Structured Streaming guarantees |
| 29 | Monitoring & Debugging | Spark UI, Streaming metrics, Logs, Executor monitoring, Performance tuning |
| 30 | Real-world Examples | Log analytics, IoT data processing, Real-time dashboards, Clickstream analysis, Recommendations |
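
The micro-batch idea from topic 1 can be illustrated without a cluster. What follows is a minimal sketch in plain Python, not the Spark API: the helper names `micro_batches` and `word_count`, the sample records, and the 2-second batch interval are all assumptions made for illustration. It cuts a timestamped stream into fixed intervals and processes each interval as one small batch, which is roughly how a DStream can be pictured as a sequence of RDDs.

```python
from collections import Counter, defaultdict

def micro_batches(records, batch_interval):
    """Group (timestamp_seconds, line) records into fixed-size batches.

    Hypothetical helper: mimics how a streaming engine cuts a stream
    into one batch per interval. Returns {batch_start: [lines]}.
    """
    batches = defaultdict(list)
    for ts, line in records:
        batch_start = (ts // batch_interval) * batch_interval
        batches[batch_start].append(line)
    return dict(sorted(batches.items()))

def word_count(lines):
    """Count words across all lines of a single batch."""
    return Counter(word for line in lines for word in line.split())

# Illustrative input: (arrival time in seconds, text line).
records = [(0, "spark streaming"), (1, "spark"), (2, "kafka spark"), (5, "flume")]

# One independent word count per 2-second batch.
batch_counts = {
    start: word_count(lines)
    for start, lines in micro_batches(records, batch_interval=2).items()
}
```

Each batch is processed on its own, so a larger batch interval means fewer, bigger batches (higher throughput, higher latency), and a smaller one means the opposite.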

Interview Questions

Basic

  • What is Spark Streaming?
  • Explain real-time data processing.
  • What is a micro-batch in Spark Streaming?
  • Difference between batch and streaming.
  • What is a DStream?
  • How is a DStream created?
  • What are the basic DStream transformations?
  • What are the basic DStream actions?
  • Explain map() transformation in streaming.
  • Explain flatMap() transformation in streaming.
  • Explain filter() transformation in streaming.
  • Explain reduceByKey() transformation in streaming.
  • Explain count() action in streaming.
  • Explain print() action in streaming.
  • How to read from a socket stream?
  • How to read from a file stream?
  • Difference between reliable and unreliable receivers.
  • What is the role of the driver in Spark Streaming?
  • What is the role of executors in streaming?
  • How is batch interval configured?
  • What is the default checkpointing mechanism?
  • How do you stop a streaming context?
  • Explain foreachRDD() action.
  • What is the Spark Streaming UI?
  • Explain the use cases of Spark Streaming.
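
Several of the basic questions above turn on how map(), flatMap(), filter(), and reduceByKey() differ. The snippet below is a hedged sketch in plain Python that imitates their behavior on a single batch; it does not call Spark, and `reduce_by_key` and the sample `batch` are hypothetical names introduced only for illustration.

```python
from itertools import chain

# One illustrative batch of input lines.
batch = ["spark streaming", "kafka"]

# map(): exactly one output element per input element (here, a list per line).
mapped = [line.split() for line in batch]

# flatMap(): each input may yield zero or more outputs, flattened into one list.
flat_mapped = list(chain.from_iterable(line.split() for line in batch))

# filter(): keep only the elements that satisfy a predicate.
filtered = [word for word in flat_mapped if word != "kafka"]

def reduce_by_key(pairs, func):
    """Hypothetical stand-in for reduceByKey(): merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Word count: pair each word with 1, then sum the values per key.
counts = reduce_by_key(
    [(word, 1) for word in flat_mapped + ["spark"]],
    lambda a, b: a + b,
)
```

The visible difference is the shape of the result: `mapped` keeps one nested list per line, while `flat_mapped` is a single flat list of words.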

Intermediate

  • Explain window operations in Spark Streaming.
  • What is slide interval?
  • Difference between window duration and slide duration.
  • Explain reduceByKeyAndWindow().
  • Explain countByValueAndWindow().
  • What are stateful transformations?
  • Explain updateStateByKey().
  • Explain mapWithState().
  • How do you integrate Spark Streaming with Kafka?
  • What is a direct Kafka stream (createDirectStream)?
  • What is a receiver-based Kafka stream?
  • How do you handle offsets in Kafka?
  • Explain Spark Streaming integration with Flume.
  • Explain push-based vs pull-based Flume integration.
  • How to read from HDFS in streaming?
  • How to read from S3 in streaming?
  • Explain streaming file source options.
  • Explain output operations: saveAsTextFiles().
  • Explain output operations: saveAsObjectFiles().
  • Explain output operations: foreachRDD() to database.
  • Explain fault tolerance in Spark Streaming.
  • What are write-ahead logs (WAL)?
  • Explain receiver reliability.
  • Explain backpressure mechanism in Spark Streaming.
  • What is the role of batch scheduling?
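
The window-duration and slide-duration questions above are easier to reason about with a small simulation. This is a sketch in plain Python under stated assumptions, not Spark code: `sliding_windows` is a hypothetical helper, and both parameters are expressed as a number of batches, whereas Spark expresses them as durations that are multiples of the batch interval.

```python
from collections import Counter

def sliding_windows(batches, window_len, slide):
    """Merge per-batch counts over a sliding window.

    `batches` is a list of per-batch Counters. `window_len` is how many
    batches each window covers; `slide` is how many batches pass between
    two consecutive window results. Returns [(end_index, merged_counts)].
    """
    results = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        merged = Counter()
        for batch in batches[start:end]:
            merged.update(batch)  # adds counts key by key
        results.append((end, dict(merged)))
    return results

# Four illustrative batches of already-counted keys.
batches = [Counter(a=1), Counter(a=2, b=1), Counter(b=3), Counter(a=1)]

# Window of 3 batches, evaluated every 2 batches.
windows = sliding_windows(batches, window_len=3, slide=2)
```

With a window of 3 and a slide of 2, consecutive windows overlap by one batch, so that batch's counts contribute to two results.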

Advanced

  • Explain Structured Streaming.
  • Difference between DStream API and Structured Streaming API.
  • What are the sources in Structured Streaming?
  • What are the sinks in Structured Streaming?
  • Explain event-time processing.
  • Explain watermarks in streaming.
  • How to handle late data using watermarks?
  • Explain streaming aggregation.
  • Explain window aggregation in Structured Streaming.
  • Explain stateful aggregations.
  • Explain mapGroupsWithState().
  • Explain flatMapGroupsWithState().
  • Explain join operations in streaming.
  • Explain stream-stream join vs stream-static join.
  • Explain stream-stream outer joins.
  • Explain checkpointing in Structured Streaming.
  • Explain exactly-once semantics in streaming.
  • Explain output modes: append, complete, update.
  • Explain processing-time triggers.
  • Explain continuous processing mode.
  • Explain schema inference in streaming.
  • Explain custom sources in Structured Streaming.
  • Explain foreachBatch() in Structured Streaming.
  • Explain streaming aggregation with watermarking.
  • Explain performance tuning for Structured Streaming.
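
For the watermark and late-data questions above, the core rule can be sketched in a few lines. This is a simplified plain-Python model, not Spark's implementation: `apply_watermark` is a hypothetical helper, and it updates the watermark after every event, whereas Structured Streaming advances the watermark between micro-batches and only applies it to stateful operations.

```python
def apply_watermark(events, delay):
    """Split events into accepted and dropped using an event-time watermark.

    `events` is a list of (event_time, value) in arrival order. The
    watermark is the maximum event time seen so far minus `delay`; an
    event older than the current watermark is treated as too late.
    """
    max_event_time = float("-inf")
    accepted, dropped = [], []
    for event_time, value in events:
        watermark = max_event_time - delay
        if event_time < watermark:
            dropped.append((event_time, value))
        else:
            accepted.append((event_time, value))
        max_event_time = max(max_event_time, event_time)
    return accepted, dropped

# Illustrative stream: events "d" and "e" arrive out of order.
events = [(10, "a"), (12, "b"), (20, "c"), (11, "d"), (16, "e")]
accepted, dropped = apply_watermark(events, delay=5)
```

After the event at time 20, the watermark is 15, so the event at time 11 is dropped while the event at time 16 is still accepted even though it arrived out of order.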

Expert

  • Explain the state store in Structured Streaming.
  • Explain recovery from failures in streaming.
  • Explain backpressure in Structured Streaming.
  • Explain memory and executor tuning for streaming.
  • Explain shuffle optimization in streaming joins.
  • Explain handling skewed streaming data.
  • Explain checkpointing and lineage recovery.
  • Explain streaming aggregation optimizations.
  • Explain watermarks with multiple streams.
  • Explain latency vs throughput trade-offs.
  • Explain using Kafka offsets with checkpointing.
  • Explain exactly-once vs at-least-once delivery.
  • Explain stateful streaming performance tuning.
  • Explain streaming joins with large datasets.
  • Explain stream-stream join optimization.
  • Explain integrating streaming with machine learning.
  • Explain handling late-arriving events.
  • Explain multi-window aggregations.
  • Explain event time vs processing time in Structured Streaming.
  • Explain monitoring streaming jobs with Spark UI.
  • Explain streaming metrics and logs.
  • Explain resource allocation and dynamic scaling.
  • Explain memory spill and disk management in streaming.
  • Explain streaming ETL pipelines.
  • Explain real-world streaming applications and case studies.
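
The exactly-once vs at-least-once question above often comes down to making the sink idempotent, so that a replayed batch has no additional effect. The toy class below is a plain-Python sketch of that idea under stated assumptions: `IdempotentSink` and the `deliveries` list are hypothetical, and a real sink would need to store the rows and the committed batch id atomically.

```python
class IdempotentSink:
    """Toy sink that ignores a batch it has already committed."""

    def __init__(self):
        self.committed = set()  # batch ids already written
        self.rows = []          # data that reached the sink

    def write(self, batch_id, rows):
        """Write a batch once; return False if it was a replay."""
        if batch_id in self.committed:
            return False  # replayed batch: skip to avoid duplicates
        self.rows.extend(rows)
        self.committed.add(batch_id)
        return True

sink = IdempotentSink()

# At-least-once delivery: batch 1 is replayed after a simulated failure.
deliveries = [(0, ["x"]), (1, ["y", "z"]), (1, ["y", "z"]), (2, ["w"])]
results = [sink.write(batch_id, rows) for batch_id, rows in deliveries]
```

The source delivers batch 1 twice, yet the sink holds each row once: at-least-once delivery combined with an idempotent write gives an effectively exactly-once result.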

Related Topics