09 January 2026

#DataFrames

Key Concepts


S.No | Topic | Sub-Topics
1 | Apache Spark & DataFrames | Spark overview, RDD vs DataFrame, Spark architecture, Lazy evaluation, Use cases
2 | Spark Setup & Environment | Local mode, Cluster mode, SparkSession, spark-submit, Configuration basics
3 | SparkSession & Entry Points | SparkSession creation, SQLContext, HiveContext, Config options, Best practices
4 | Creating DataFrames | From files, From RDD, From collections, Schema inference, Explicit schema
5 | DataFrame Schema & Data Types | StructType, StructField, Primitive types, Complex types, Schema evolution
6 | Reading Data Sources | CSV, JSON, Parquet, ORC, Avro basics
7 | Writing DataFrames | Save modes, Partitioning, Bucketing, File formats, Compression
8 | DataFrame Basic Operations | select, withColumn, drop, filter, where
9 | Column Operations | Column expressions, alias, cast, when/otherwise, lit
10 | Row Operations & Actions | show, collect, take, count, first
11 | DataFrame Functions | Built-in functions, String functions, Date functions, Math functions, Null handling
12 | Filtering & Conditional Logic | filter vs where, isin, like, rlike, case when
13 | Sorting & Deduplication | orderBy, sort, distinct, dropDuplicates, Sorting optimization
14 | Aggregation & Grouping | groupBy, agg, count, sum, avg
15 | Joins in DataFrames | Inner join, Left/Right join, Full join, Semi/Anti join
16 | Join Optimization | Broadcast join, Shuffle join, Join hints, Skew handling, AQE
17 | Handling Missing & Bad Data | dropna, fillna, replace, Null checks, Data validation
18 | Window Functions | Window spec, row_number, rank, lead/lag, Running totals
19 | UDF & UDAF | UDF creation, Performance impact, Pandas UDF, Serialization, Best practices
20 | DataFrame Caching & Persistence | cache, persist, Storage levels, Memory vs disk, When to cache
21 | Spark SQL with DataFrames | Temp views, Global views, SQL queries, Mixing SQL & DF, Optimization
22 | Partitioning & Repartitioning | repartition, coalesce, Partition pruning, File partitioning, Performance tuning
23 | Performance Optimization Basics | Catalyst optimizer, Tungsten, Predicate pushdown, Column pruning, AQE
24 | DataFrame Execution Plan | Logical plan, Physical plan, explain(), DAG, Stage breakdown
25 | Handling Large Datasets | Skew issues, Sampling, Checkpointing, Memory tuning, Spill handling
26 | Integration with Hive | Hive tables, External tables, Metastore, Partitioned tables, Hive SQL
27 | Streaming DataFrames (Structured Streaming) | Streaming sources, Sinks, Watermarking, Windowed aggregations, Triggers
28 | Error Handling & Debugging | Common errors, Serialization issues, Logging, Debug tools, Retry strategies
29 | Best Practices & Design Patterns | Code structure, Reusability, Performance patterns, Anti-patterns, Testing
30 | Real-world Use Cases & Projects | ETL pipelines, Data lake processing, Analytics workloads, Reporting, Optimization review

Interview Questions

Basic Level

  1. What is a DataFrame in Apache Spark?
  2. How is a DataFrame different from an RDD?
  3. What are the advantages of DataFrames?
  4. Is a DataFrame immutable?
  5. What is SparkSession?
  6. How do you create a DataFrame in Spark?
  7. What are the different data sources supported by DataFrames?
  8. What is a schema in DataFrame?
  9. How can you infer schema automatically?
  10. How do you define a custom schema?
  11. What is show() in DataFrame?
  12. What is printSchema()?
  13. What is select() in DataFrame?
  14. What is withColumn()?
  15. Difference between withColumn() and select()?
  16. What is filter() / where() in DataFrame?
  17. Difference between filter() and where()?
  18. What is limit()?
  19. What is collect()?
  20. What is count()?
  21. What is distinct()?
  22. What is drop()?
  23. What is dropDuplicates()?
  24. What is alias()?
  25. How do you rename a column in DataFrame?

Intermediate Level

  1. What are DataFrame transformations?
  2. What are DataFrame actions?
  3. What is lazy evaluation in DataFrames?
  4. What is explain() in DataFrame?
  5. What is logical plan?
  6. What is physical plan?
  7. What is Catalyst Optimizer?
  8. What is Tungsten execution engine?
  9. What is column expression?
  10. What are built-in Spark SQL functions?
  11. Difference between UDF and built-in functions?
  12. What is UDF?
  13. Performance impact of UDFs?
  14. What is groupBy()?
  15. What is agg()?
  16. Difference between groupBy and window functions?
  17. What is orderBy()?
  18. Difference between orderBy() and sort()?
  19. What is join() in DataFrame?
  20. What are the types of joins in Spark?
  21. What is inner join?
  22. What is left outer join?
  23. What is right outer join?
  24. What is full outer join?
  25. What is cross join?
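
Questions 1 to 3 of this level (transformations, actions, lazy evaluation) can be pictured with a small toy model. The sketch below is plain Python, not Spark code: the `LazyFrame` class and its methods are hypothetical names invented here for illustration. It only shows the general idea that transformations record a plan while an action runs it, under the assumption that rows are simple dictionaries.

```python
# Toy model of lazy evaluation; plain Python, NOT the Spark API.
# LazyFrame and its methods are hypothetical names used only for illustration.

class LazyFrame:
    def __init__(self, rows, plan=None):
        self._rows = rows          # source data: a list of dicts
        self._plan = plan or []    # recorded transformations, not yet executed

    # Transformations: return a new LazyFrame and do no work on the data.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + [("filter", predicate)])

    def select(self, *columns):
        return LazyFrame(self._rows, self._plan + [("select", columns)])

    # Internal helper that executes the recorded plan step by step.
    def _execute(self):
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

    # Actions: trigger execution of the plan.
    def collect(self):
        return self._execute()

    def count(self):
        return len(self._execute())


calls = []  # records each time the predicate is actually evaluated


def is_adult(row):
    calls.append(row["name"])
    return row["age"] >= 18


people = LazyFrame([
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": 12},
    {"name": "Cy", "age": 51},
])

adults = people.filter(is_adult).select("name")  # only builds the plan
calls_before_action = len(calls)                 # predicate has not run yet
result = adults.collect()                        # the action executes the plan
```

In this model, `calls_before_action` stays at zero because building `adults` never touches the data; the predicate runs only once `collect()` is called. Real Spark additionally optimizes the recorded plan before running it, which this sketch does not attempt.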

Advanced Level

  1. What is broadcast join?
  2. When does Spark automatically use broadcast join?
  3. What is shuffle join?
  4. How does Spark handle join optimization?
  5. What is partitioning in DataFrame?
  6. Difference between repartition() and coalesce()?
  7. What is bucketing in Spark?
  8. Difference between bucketing and partitioning?
  9. What is window function?
  10. Explain row_number(), rank(), dense_rank().
  11. What is caching in DataFrame?
  12. Difference between cache() and persist()?
  13. What storage levels are available?
  14. What is checkpointing in DataFrame?
  15. Difference between cache and checkpoint?
  16. What is skewed data?
  17. How to handle data skew in joins?
  18. What is salting technique?
  19. What is adaptive query execution (AQE)?
  20. What are Spark SQL hints?
  21. What is explain(true)?
  22. How to optimize wide transformations?
  23. What is column pruning?
  24. What is predicate pushdown?
  25. What is vectorized reader?
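
For question 10 (row_number(), rank(), dense_rank()), the difference is easiest to see on a small ordered list with a tie. The following is a plain-Python model of the numbering rules, not Spark code; the function names mirror the Spark functions only for readability, and the input is assumed to be already sorted in the window's ordering.

```python
# Plain-Python model of the three ranking rules; NOT Spark code.
# Input: values already sorted in the window's ORDER BY order.

def row_number(values):
    # Consecutive numbers; tied values still get distinct numbers.
    return list(range(1, len(values) + 1))


def rank(values):
    # Tied values share a rank; the next distinct value skips ahead (gaps).
    ranks = []
    for i, v in enumerate(values):
        if i > 0 and v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return ranks


def dense_rank(values):
    # Tied values share a rank; the next distinct value gets the next integer.
    ranks = []
    for i, v in enumerate(values):
        if i == 0:
            ranks.append(1)
        elif v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks


scores = [95, 90, 90, 85]      # descending order, with one tie
print(row_number(scores))      # [1, 2, 3, 4]
print(rank(scores))            # [1, 2, 2, 4]
print(dense_rank(scores))      # [1, 2, 2, 3]
```

In Spark itself these are applied over a window specification with a partition and an ordering; the model above covers a single partition only.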

Expert Level

  1. Why are DataFrames faster than RDDs?
  2. How does Catalyst optimize DataFrame queries?
  3. How does Tungsten improve performance?
  4. What happens internally when a DataFrame action is triggered?
  5. How does Spark generate optimized bytecode?
  6. What is whole-stage code generation?
  7. What is off-heap memory?
  8. How does Spark handle memory management for DataFrames?
  9. How does AQE change execution plans at runtime?
  10. How do you debug slow DataFrame jobs?
  11. How do you analyze Spark UI for DataFrame jobs?
  12. What are common DataFrame performance anti-patterns?
  13. Why is excessive use of withColumn() discouraged?
  14. How do you design efficient Spark SQL pipelines?
  15. What are the limitations of DataFrames?
  16. How do DataFrames handle schema evolution?
  17. How does DataFrame support semi-structured data?
  18. What is explode() function?
  19. What is from_json() and to_json()?
  20. How to handle nested columns efficiently?
  21. What is Delta Lake DataFrame integration?
  22. How does DataFrame handle ACID properties?
  23. What is cost-based optimization (CBO)?
  24. How do you tune Spark SQL configurations?
  25. Explain real-time production use cases of DataFrames.
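
For question 18 (explode()), a small model may help when preparing an answer. The sketch below is plain Python, not Spark code: `explode_rows` and its `keep_empty` flag are hypothetical names invented here. It assumes rows are dictionaries and models the usual behaviour that explode emits one output row per array element and drops rows whose array is null or empty, while the outer variant keeps such rows with a null value.

```python
# Plain-Python model of explode-style flattening; NOT Spark code.
# explode_rows and keep_empty are hypothetical names for illustration.

def explode_rows(rows, array_col, keep_empty=False):
    out = []
    for row in rows:
        items = row.get(array_col)
        if not items:  # None or an empty list
            if keep_empty:
                # Models the "outer" variant: keep the row with a null value.
                out.append({**row, array_col: None})
            continue
        for item in items:
            # One output row per array element; other columns are repeated.
            out.append({**row, array_col: item})
    return out


orders = [
    {"order_id": 1, "items": ["pen", "ink"]},
    {"order_id": 2, "items": []},
    {"order_id": 3, "items": None},
]

exploded = explode_rows(orders, "items")
exploded_outer = explode_rows(orders, "items", keep_empty=True)
```

Here `exploded` contains only the two rows for order 1, while `exploded_outer` also keeps orders 2 and 3 with `items` set to `None`.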

Related Topics