14 January 2026

#Transformations & Actions

Key Concepts


| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | Transformations & Actions | Definition, Lazy Evaluation, DAG concept, Execution flow, Why separation matters |
| 2 | Narrow vs Wide Transformations | Definition, Examples, Shuffle impact, Performance difference, Use cases |
| 3 | map() | Syntax, One-to-one mapping, Use cases, Performance, Examples |
| 4 | flatMap() | One-to-many mapping, Differences from map, Use cases, Examples, Performance |
| 5 | filter() | Predicate logic, Data reduction, Optimization tips, Examples, Use cases |
| 6 | select() / withColumn() | Column selection, Column creation, Expressions, Performance tips, Examples |
| 7 | union() & distinct() | Combining datasets, Removing duplicates, Shuffle behavior, Use cases, Examples |
| 8 | groupBy() | Grouping logic, Aggregation basics, Shuffle impact, Examples, Best practices |
| 9 | reduceByKey() | Key-based reduction, Map-side aggregation, Performance benefits, Examples, Comparison |
| 10 | groupByKey() | Working principle, Memory impact, Comparison with reduceByKey, Examples, When to avoid |
| 11 | sortBy() & orderBy() | Sorting logic, Asc/Desc order, Shuffle cost, Examples, Optimization tips |
| 12 | join() Basics | Inner join, Join condition, Execution flow, Examples, Common issues |
| 13 | Advanced Join Types | Left, Right, Full, Semi, Anti joins, Use cases, Examples |
| 14 | Broadcast Join | Concept, When to use, Memory impact, SQL hint, Examples |
| 15 | repartition() & coalesce() | Partition control, Shuffle behavior, Performance impact, Use cases, Examples |
| 16 | cache() & persist() | Storage levels, Memory vs disk, When to cache, Examples, Pitfalls |
| 17 | count() | Action trigger, Job creation, Performance considerations, Examples, Use cases |
| 18 | collect() | Driver memory risk, Small data usage, Examples, Best practices, Alternatives |
| 19 | show() & take() | Preview data, Execution behavior, Limit handling, Examples, Usage tips |
| 20 | save() & write() | Output formats, File systems, Partition output, Modes, Examples |
| 21 | foreach() & foreachPartition() | Side effects, External systems, Performance difference, Examples, Best practices |
| 22 | Window Functions | Over clause, Partition by, Order by, Use cases, Examples |
| 23 | Actions vs Transformations | Comparison, Execution timing, DAG role, Interview questions, Examples |
| 24 | Shuffle Internals | When shuffle occurs, Cost factors, Optimization, Examples, Debugging |
| 25 | Performance Optimization | Avoid wide ops, Partition sizing, Caching strategy, Examples, Tips |
| 26 | Error Handling | Bad records, Null handling, Try-catch logic, Data validation, Examples |
| 27 | Spark UI Analysis | Jobs tab, Stages tab, Task metrics, Shuffle read/write, Debugging |
| 28 | Real-world ETL Flow | Transform chain design, Action placement, Optimization, Examples, Best practices |
| 29 | Interview Scenarios | Common questions, Tricky cases, Performance questions, Sample answers, Tips |
| 30 | Hands-on Mini Project | End-to-end pipeline, Transformations usage, Actions usage, Optimization, Review |

Interview Questions

Basic

  • What is a transformation in Spark?
  • What is an action in Spark?
  • Difference between transformations and actions?
  • What is lazy evaluation?
  • What is an RDD?
  • How do you create an RDD?
  • What is parallelize() in Spark?
  • What is textFile() in SparkContext?
  • Explain map() transformation.
  • Explain filter() transformation.
  • What is flatMap()?
  • Explain distinct() transformation.
  • What does union() do?
  • Explain intersection() transformation.
  • Explain subtract() transformation.
  • What is cartesian() transformation?
  • What is collect() action?
  • Explain count() action.
  • Explain first() action.
  • Explain take(n) action.
  • Explain reduce() action.
  • Explain fold() action.
  • Explain aggregate() action.
  • Explain takeOrdered() action.
  • Explain top() action.
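
As a study aid for the questions on transformations, actions, and lazy evaluation, the following is a minimal pure-Python sketch of the idea. It does not use Spark: `LazyDataset`, `double`, and `calls` are hypothetical names invented for this illustration, and the class only imitates how transformations record work while an action triggers it.

```python
class LazyDataset:
    """Toy stand-in for an RDD (not Spark's API): transformations are
    recorded, and nothing is computed until an action is called."""

    def __init__(self, data, ops=()):
        self._data = list(data)
        self._ops = tuple(ops)  # recorded lineage of (kind, function) pairs

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    # Actions: walk the recorded lineage and produce a result.
    def collect(self):
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter predicate rejected the item
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

    def count(self):
        return len(self.collect())


calls = []  # records every element the mapped function actually touches


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only the lineage was built
result = ds.collect()             # the action triggers execution
calls_after_action = len(calls)   # 5: every element was mapped once
```

In this toy model, `map` and `filter` play the role of transformations and `collect` and `count` play the role of actions; real Spark additionally builds a DAG and splits it into stages and tasks.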

Intermediate

  • Explain groupByKey() transformation.
  • Explain reduceByKey() transformation.
  • Difference between groupByKey() and reduceByKey().
  • Explain aggregateByKey() transformation.
  • Explain combineByKey() transformation.
  • What are pair RDDs?
  • Explain mapValues() transformation.
  • Explain flatMapValues() transformation.
  • Explain keys() and values() transformations.
  • Explain lookup() action on pair RDDs.
  • Explain joins: join() (inner join), leftOuterJoin(), rightOuterJoin(), fullOuterJoin().
  • Explain cogroup() transformation.
  • Explain sortByKey() transformation.
  • Explain sortBy() transformation.
  • Difference between sortByKey() and sortBy().
  • Explain repartition() transformation.
  • Explain coalesce() transformation.
  • Difference between repartition() and coalesce().
  • Explain sample() transformation.
  • Explain sampleByKey() and sampleByKeyExact().
  • What is takeSample()?
  • Explain randomSplit() transformation.
  • Explain the cache() and persist() methods.
  • Explain unpersist() method.
  • Explain checkpointing and its use case.
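
For the question on groupByKey() versus reduceByKey(), the difference in shuffle volume can be sketched without a cluster. The following is a rough pure-Python simulation, not Spark code: the helper names (`shuffle`, `group_by_key_style`, `reduce_by_key_style`) and the sample data are assumptions made for this illustration. It counts how many records would cross the shuffle with and without map-side combining.

```python
from collections import defaultdict


def shuffle(partitions, num_reducers):
    """Route (key, value) records to reducers by key hash.
    Returns the reducer buckets and the number of records moved."""
    buckets = [[] for _ in range(num_reducers)]
    moved = 0
    for part in partitions:
        for key, value in part:
            buckets[hash(key) % num_reducers].append((key, value))
            moved += 1
    return buckets, moved


def sum_buckets(buckets):
    """Reduce-side aggregation: sum the values for each key."""
    totals = defaultdict(int)
    for bucket in buckets:
        for key, value in bucket:
            totals[key] += value
    return dict(totals)


def group_by_key_style(partitions, num_reducers=2):
    # Every input record crosses the shuffle; summing happens afterwards.
    buckets, moved = shuffle(partitions, num_reducers)
    return sum_buckets(buckets), moved


def reduce_by_key_style(partitions, num_reducers=2):
    # Map-side combine: pre-aggregate inside each partition first,
    # so at most one record per key per partition is shuffled.
    combined = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        combined.append(list(local.items()))
    buckets, moved = shuffle(combined, num_reducers)
    return sum_buckets(buckets), moved


partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]
grouped_totals, grouped_moved = group_by_key_style(partitions)
reduced_totals, reduced_moved = reduce_by_key_style(partitions)
```

Both approaches produce the same totals, but in this sample the group-then-sum path moves 8 records while the pre-aggregated path moves 5. The gap grows with the number of repeated keys per partition.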

Advanced

  • Explain narrow vs wide transformations.
  • Difference between narrow and wide dependencies.
  • What is shuffle in Spark?
  • How does shuffle affect performance?
  • Explain partitioning and its importance.
  • Explain HashPartitioner and RangePartitioner.
  • Explain repartitionAndSortWithinPartitions().
  • Explain the role of DAG scheduler in transformations.
  • Explain stages and tasks in Spark execution.
  • Explain lazy evaluation and lineage.
  • How are transformations optimized internally?
  • Explain the difference between map() and mapPartitions().
  • Explain foreach() and foreachPartition() actions.
  • How to handle skewed data in transformations?
  • Explain broadcast variables and their usage.
  • Explain accumulators and their usage.
  • Explain foldByKey() transformation.
  • Explain subtractByKey() transformation.
  • Explain join optimizations for pair RDDs.
  • Explain caching strategies for iterative algorithms.
  • Explain checkpointing vs caching.
  • Explain transformations on DataFrames compared to RDDs.
  • Explain mapPartitionsWithIndex() transformation.
  • Explain wide transformations and task parallelism.
  • Explain best practices for memory management during transformations.
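
For the question on map() versus mapPartitions(), one commonly cited difference is how often per-call setup work runs. The sketch below is a pure-Python imitation under stated assumptions, not Spark code: `open_connection`, `map_style`, `map_partitions_style`, and the `counter` dictionary are hypothetical names, and the "connection" is a stand-in for any expensive resource.

```python
counter = {"setups": 0}  # counts how many times the expensive setup runs


def open_connection():
    """Hypothetical expensive setup, e.g. opening a database connection."""
    counter["setups"] += 1
    return lambda x: x + 1  # stand-in for work that needs the connection


def map_style(partitions):
    # Per-element function: setup runs once for every record.
    return [[open_connection()(x) for x in part] for part in partitions]


def map_partitions_style(partitions):
    # Per-partition function: setup runs once, then all records
    # in that partition reuse the same resource.
    out = []
    for part in partitions:
        conn = open_connection()
        out.append([conn(x) for x in part])
    return out


partitions = [[1, 2, 3], [4, 5]]

counter["setups"] = 0
per_element = map_style(partitions)
per_element_setups = counter["setups"]    # one setup per record

counter["setups"] = 0
per_partition = map_partitions_style(partitions)
per_partition_setups = counter["setups"]  # one setup per partition
```

The outputs are identical, but the setup count drops from the number of records to the number of partitions. The same reasoning applies to foreach() versus foreachPartition() when writing to external systems.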

Expert

  • Explain spark.sql.shuffle.partitions and its impact.
  • Explain narrow dependency scheduling optimizations.
  • How does Spark handle task failures during actions?
  • Explain lineage and recomputation during RDD failure.
  • Explain transformations in Structured Streaming.
  • Explain actions in Structured Streaming.
  • How to optimize joins on large datasets?
  • Explain partition tuning for large-scale RDDs.
  • Explain avoiding shuffles with map-side reductions.
  • Explain caching strategies for MLlib pipelines.
  • Explain the difference between cache() and persist(StorageLevel.MEMORY_AND_DISK).
  • Explain how Spark plans tasks for wide transformations.
  • Explain the difference between reduceByKey() and aggregateByKey() performance.
  • How to handle skewed keys in joins?
  • Explain Spark UI metrics related to transformations and actions.
  • Explain how stage boundaries are created in wide transformations.
  • Explain partition coalescing to reduce shuffle.
  • Explain RDD lineage graph and fault tolerance.
  • Explain advanced join strategies: broadcast join, shuffle hash join.
  • Explain the difference between DataFrame and RDD transformations.
  • Explain transformation optimization by Catalyst (for DataFrames).
  • Explain whole-stage code generation for DataFrame transformations.
  • Explain streaming aggregation and its fault tolerance.
  • Explain stateful transformations in DStreams (Spark Streaming).
  • Explain advanced tuning techniques for actions on huge datasets.

Related Topics