Prime_Questions: #Transformations & Actions

Key Concepts

S.No	Topic	Sub-Topics
1	Transformations & Actions	Definition, Lazy Evaluation, DAG concept, Execution flow, Why separation matters
2	Narrow vs Wide Transformations	Definition, Examples, Shuffle impact, Performance difference, Use cases
3	map()	Syntax, One-to-one mapping, Use cases, Performance, Examples
4	flatMap()	One-to-many mapping, Differences from map, Use cases, Examples, Performance
5	filter()	Predicate logic, Data reduction, Optimization tips, Examples, Use cases
6	select() / withColumn()	Column selection, Column creation, Expressions, Performance tips, Examples
7	union() & distinct()	Combining datasets, Removing duplicates, Shuffle behavior, Use cases, Examples
8	groupBy()	Grouping logic, Aggregation basics, Shuffle impact, Examples, Best practices
9	reduceByKey()	Key-based reduction, Map-side aggregation, Performance benefits, Examples, Comparison
10	groupByKey()	Working principle, Memory impact, Comparison with reduceByKey, Examples, When to avoid
11	sortBy() & orderBy()	Sorting logic, Asc/Desc order, Shuffle cost, Examples, Optimization tips
12	join() Basics	Inner join, Join condition, Execution flow, Examples, Common issues
13	Advanced Join Types	Left, Right, Full, Semi, Anti joins, Use cases, Examples
14	Broadcast Join	Concept, When to use, Memory impact, SQL hint, Examples
15	repartition() & coalesce()	Partition control, Shuffle behavior, Performance impact, Use cases, Examples
16	cache() & persist()	Storage levels, Memory vs disk, When to cache, Examples, Pitfalls
17	count()	Action trigger, Job creation, Performance considerations, Examples, Use cases
18	collect()	Driver memory risk, Small data usage, Examples, Best practices, Alternatives
19	show() & take()	Preview data, Execution behavior, Limit handling, Examples, Usage tips
20	save() & write()	Output formats, File systems, Partition output, Modes, Examples
21	foreach() & foreachPartition()	Side effects, External systems, Performance difference, Examples, Best practices
22	Window Functions	Over clause, Partition by, Order by, Use cases, Examples
23	Actions vs Transformations	Comparison, Execution timing, DAG role, Interview questions, Examples
24	Shuffle Internals	When shuffle occurs, Cost factors, Optimization, Examples, Debugging
25	Performance Optimization	Avoid wide ops, Partition sizing, Caching strategy, Examples, Tips
26	Error Handling	Bad records, Null handling, Try-catch logic, Data validation, Examples
27	Spark UI Analysis	Jobs tab, Stages tab, Task metrics, Shuffle read/write, Debugging
28	Real-world ETL Flow	Transform chain design, Action placement, Optimization, Examples, Best practices
29	Interview Scenarios	Common questions, Tricky cases, Performance questions, Sample answers, Tips
30	Hands-on Mini Project	End-to-end pipeline, Transformations usage, Actions usage, Optimization, Review
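
To make the lazy evaluation topic in row 1 concrete, here is a minimal sketch in plain Python. It is not the Spark API: `LazyDataset` and its methods are hypothetical names that only model the idea that transformations record a plan and actions run it.

```python
from itertools import chain


class LazyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable source, e.g. a list
        self._steps = tuple(steps)   # the recorded plan (a toy "lineage")

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def flat_map(self, fn):
        return LazyDataset(self._source, self._steps + (("flat_map", fn),))

    def _run(self):
        # Replay the recorded steps over the source as a lazy stream.
        stream = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                stream = map(fn, stream)
            elif kind == "filter":
                stream = filter(fn, stream)
            else:
                stream = chain.from_iterable(map(fn, stream))
        return stream

    # Actions: trigger execution of the whole plan.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element the map function actually touches


def traced_double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = ds.collect()              # [6, 8]
size = ds.count()                  # 2, and the plan is replayed from the source
```

After both actions, `calls` holds eight entries because each action replays the full plan. In this toy model that is the cost that caching an intermediate result is meant to avoid.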

Interview Questions

Basic

What is a transformation in Spark?
What is an action in Spark?
Difference between transformations and actions?
What is lazy evaluation?
What is an RDD?
How do you create an RDD?
What is parallelize() in Spark?
What is textFile() in SparkContext?
Explain map() transformation.
Explain filter() transformation.
What is flatMap()?
Explain distinct() transformation.
What does union() do?
Explain intersection() transformation.
Explain subtract() transformation.
What is cartesian() transformation?
What is collect() action?
Explain count() action.
Explain first() action.
Explain take(n) action.
Explain reduce() action.
Explain fold() action.
Explain aggregate() action.
Explain takeOrdered() action.
Explain top() action.
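
For the reduce(), fold() and aggregate() questions above, the following plain-Python sketch models how a zero value is applied once per partition and once more when partition results are merged. The helper names `fold` and `aggregate` and the list-of-lists partition layout are assumptions for illustration, not Spark calls.

```python
from functools import reduce
from operator import add


def fold(partitions, zero, op):
    # Fold each partition starting from zero, then merge the partials from zero again.
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


def aggregate(partitions, zero, seq_op, comb_op):
    # seq_op folds records into an accumulator; comb_op merges accumulators.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


partitions = [[1, 2], [3, 4]]  # two partitions holding the values 1..4

neutral_sum = fold(partitions, 0, add)  # 10
biased_sum = fold(partitions, 1, add)   # 13: the zero is added three times

# (sum, count) accumulator, the usual building block for a mean.
total, n = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = total / n  # 2.5
```

The `biased_sum` case shows why the zero value should be neutral for the operation: with a non-neutral zero, the result depends on the number of partitions.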

Intermediate

Explain groupByKey() transformation.
Explain reduceByKey() transformation.
Difference between groupByKey() and reduceByKey().
Explain aggregateByKey() transformation.
Explain combineByKey() transformation.
What are pair RDDs?
Explain mapValues() transformation.
Explain flatMapValues() transformation.
Explain keys() and values() transformations.
Explain lookup() action on pair RDDs.
Explain joins: join() (inner), leftOuterJoin(), rightOuterJoin(), fullOuterJoin().
Explain cogroup() transformation.
Explain sortByKey() transformation.
Explain sortBy() transformation.
Difference between sortByKey() and sortBy().
Explain repartition() transformation.
Explain coalesce() transformation.
Difference between repartition() and coalesce().
Explain sample() transformation.
Explain sampleByKey() and sampleByKeyExact().
What is takeSample()?
Explain randomSplit() transformation.
Explain cache() and persist().
Explain unpersist() method.
Explain checkpointing and its use case.
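
The groupByKey() versus reduceByKey() comparison above usually comes down to how many records cross the shuffle. The sketch below is a plain-Python model of that difference, not Spark code. The function names and the `moved` counter are hypothetical.

```python
from collections import defaultdict


def shuffle(partitions):
    """Bring all values for a key together; also count the records moved."""
    grouped = defaultdict(list)
    moved = 0
    for part in partitions:
        for key, value in part:
            grouped[key].append(value)
            moved += 1
    return grouped, moved


def group_by_key_then_sum(partitions):
    # Every raw record is shuffled, then summed on the receiving side.
    grouped, moved = shuffle(partitions)
    return {k: sum(vs) for k, vs in grouped.items()}, moved


def reduce_by_key_sum(partitions):
    # Map-side combine: sum within each partition first, then shuffle the partials.
    combined = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        combined.append(list(local.items()))
    grouped, moved = shuffle(combined)
    return {k: sum(vs) for k, vs in grouped.items()}, moved


pairs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

grouped_totals, grouped_moved = group_by_key_then_sum(pairs)  # 6 records moved
reduced_totals, reduced_moved = reduce_by_key_sum(pairs)      # 4 records moved
```

Both paths produce the same totals. Only the number of shuffled records differs, and the gap grows with the number of repeated keys per partition.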

Advanced

Explain narrow vs wide transformations.
Difference between narrow and wide dependencies.
What is shuffle in Spark?
How does shuffle affect performance?
Explain partitioning and its importance.
Explain HashPartitioner and RangePartitioner.
Explain repartitionAndSortWithinPartitions().
Explain the role of DAG scheduler in transformations.
Explain stages and tasks in Spark execution.
Explain lazy evaluation and lineage.
How are transformations optimized internally?
Explain the difference between map() and mapPartitions().
Explain foreach() and foreachPartition() actions.
How to handle skewed data in transformations?
Explain broadcast variables and their usage.
Explain accumulators and their usage.
Explain foldByKey() transformation.
Explain subtractByKey() transformation.
Explain join optimizations for pair RDDs.
Explain caching strategies for iterative algorithms.
Explain checkpointing vs caching.
Explain transformations on DataFrames compared to RDDs.
Explain mapPartitionsWithIndex() transformation.
Explain wide transformations and task parallelism.
Explain best practices for memory management during transformations.
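
For the map() versus mapPartitions() question above, one common talking point is per-record versus per-partition setup cost. Here is a small plain-Python sketch of that idea. `open_connection` is a hypothetical expensive resource, and the two functions only imitate the calling pattern, not the Spark API.

```python
setup_log = []  # one entry per expensive setup call


def open_connection():
    """Hypothetical expensive setup, e.g. opening a database connection."""
    setup_log.append("opened")
    return lambda record: record.upper()


def map_style(partitions):
    # Setup runs once per record, as when the setup sits inside the map function.
    return [[open_connection()(rec) for rec in part] for part in partitions]


def map_partitions_style(partitions):
    # Setup runs once per partition and is reused for every record in it.
    out = []
    for part in partitions:
        handle = open_connection()
        out.append([handle(rec) for rec in part])
    return out


partitions = [["a", "b", "c"], ["d", "e"]]

setup_log.clear()
per_record = map_style(partitions)
per_record_setups = len(setup_log)      # 5, one per record

setup_log.clear()
per_partition = map_partitions_style(partitions)
per_partition_setups = len(setup_log)   # 2, one per partition
```

The outputs are identical. The same reasoning is usually given for preferring foreachPartition() over foreach() when writing to an external system.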

Expert

Explain spark.sql.shuffle.partitions and its impact.
Explain narrow dependency scheduling optimizations.
How does Spark handle task failures during actions?
Explain lineage and recomputation during RDD failure.
Explain transformations in Structured Streaming.
Explain actions in Structured Streaming.
How to optimize joins on large datasets?
Explain partition tuning for large-scale RDDs.
Explain avoiding shuffles with map-side reductions.
Explain caching strategies for MLlib pipelines.
Explain the difference between cache() and persist(StorageLevel.MEMORY_AND_DISK).
Explain how Spark plans tasks for wide transformations.
Explain the difference between reduceByKey() and aggregateByKey() performance.
How to handle skewed keys in joins?
Explain Spark UI metrics related to transformations and actions.
Explain how stage boundaries are created in wide transformations.
Explain partition coalescing to reduce shuffle.
Explain RDD lineage graph and fault tolerance.
Explain advanced join strategies: broadcast join, shuffle hash join.
Explain the difference between DataFrame and RDD transformations.
Explain transformation optimization by Catalyst (for DataFrames).
Explain whole-stage code generation for DataFrame transformations.
Explain streaming aggregation and its fault tolerance.
Explain stateful transformations in Spark Streaming (DStreams).
Explain advanced tuning techniques for actions on huge datasets.
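
For the broadcast join questions above, the core idea can be sketched in plain Python: the small side becomes an in-memory lookup that every partition of the large side probes locally, so the large side is never regrouped by key. `broadcast_hash_join` and the sample data are hypothetical and only model the technique.

```python
def broadcast_hash_join(large_partitions, small_table):
    """Inner join where the small side is copied to every partition as a dict."""
    # Build the lookup once; in a cluster this copy would be shipped to each executor.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    joined = []
    for part in large_partitions:
        out = []
        for key, left in part:
            # Probe locally; keys missing from the small side are dropped.
            for right in lookup.get(key, []):
                out.append((key, left, right))
        joined.append(out)  # the partition layout of the large side is unchanged
    return joined


orders = [
    [(1, "o1"), (2, "o2")],
    [(1, "o3"), (3, "o4")],
]
countries = [(1, "US"), (2, "DE")]

result = broadcast_hash_join(orders, countries)
# [[(1, 'o1', 'US'), (2, 'o2', 'DE')], [(1, 'o3', 'US')]]
```

The trade-off this model leaves out is memory: the lookup must fit on every worker, which is why the technique is reserved for a small side.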

14 January 2026

#Transformations & Actions

Key Concepts

Interview question

Basic

Intermediate

Advanced

Expert

Related Topics