09 January 2026

#Spark SQL

Key Concepts


| S.No | Topic | Sub-Topics |
|------|-------|------------|
| 1 | Spark SQL | What is Spark SQL, SQL vs DataFrame API, Use cases, Architecture overview, Components |
| 2 | Spark SQL Architecture | Catalyst optimizer, Logical plan, Physical plan, Tungsten engine, Execution flow |
| 3 | SparkSession & SQL Entry Points | SparkSession.sql(), SQLContext, HiveContext, Configurations, Best practices |
| 4 | Creating Tables & Views | Managed tables, External tables, Temporary views, Global views, CTAS |
| 5 | Data Types & Schema | Primitive types, Complex types, Struct, Array, Map |
| 6 | Reading Data Sources | CSV, JSON, Parquet, ORC, Avro basics |
| 7 | Writing Data using SQL | INSERT INTO, INSERT OVERWRITE, Save modes, Partition writes, Bucketing |
| 8 | Basic SELECT Queries | Select columns, Expressions, Aliases, DISTINCT, LIMIT |
| 9 | Filtering & WHERE Clause | WHERE conditions, AND/OR, BETWEEN, IN, LIKE |
| 10 | Sorting & Ordering | ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY, Performance impact |
| 11 | Aggregation Functions | COUNT, SUM, AVG, MIN, MAX |
| 12 | GROUP BY & HAVING | Grouping rules, Multiple group keys, HAVING clause, Aggregation filters, Optimization |
| 13 | Join Types in Spark SQL | Inner join, Left join, Right join, Full join, Cross join |
| 14 | Join Optimization Techniques | Broadcast joins, Shuffle joins, Join hints, Skew handling, AQE |
| 15 | Subqueries & CTEs | Scalar subqueries, Correlated subqueries, WITH clause, Nested queries, Optimization |
| 16 | Window Functions | OVER clause, PARTITION BY, ORDER BY, Ranking functions, Analytical functions |
| 17 | String & Date Functions | String manipulation, Date arithmetic, Timestamp functions, Formatting, Parsing |
| 18 | Handling NULLs | IS NULL, IS NOT NULL, COALESCE, NVL, NULLIF |
| 19 | Complex Data Types | Arrays, Maps, Structs, explode, lateral view, JSON functions |
| 20 | User Defined Functions (UDF) | Creating UDFs, Registering UDFs, Performance impact, When to avoid UDFs, Alternatives |
| 21 | Partitioning & Bucketing | Table partitioning, Static vs dynamic partitions, Bucketing concepts, Query pruning, Performance |
| 22 | Performance Optimization | Predicate pushdown, Column pruning, Caching tables, AQE, Cost-based optimizer |
| 23 | Execution Plans & Debugging | EXPLAIN, Logical vs physical plans, DAG stages, Common bottlenecks, Tuning |
| 24 | Integration with Hive | Hive metastore, HiveQL support, External tables, SerDe, Compatibility issues |
| 25 | Transactional Tables | ACID tables, Delta Lake basics, MERGE, UPDATE, DELETE, Time travel |
| 26 | Structured Streaming with SQL | Streaming tables, Continuous queries, Watermarking, Window aggregations, Triggers |
| 27 | Security & Access Control | Table permissions, Column masking, Row-level security, Auditing, Best practices |
| 28 | Error Handling & Data Quality | Bad records handling, Schema mismatch, Try-catch patterns, Data validation, Logging |
| 29 | Best Practices & SQL Standards | Naming conventions, Query readability, Anti-patterns, Reusability, Testing |
| 30 | Real-world Use Cases & Projects | ETL pipelines, Data warehousing, Reporting, Optimization review, End-to-end project |

Interview Questions

Basic Level

  1. What is Spark SQL?
  2. How is Spark SQL different from traditional SQL?
  3. What are the main components of Spark SQL?
  4. What is a SparkSession?
  5. How do you create a DataFrame using Spark SQL?
  6. What is a temporary view?
  7. What is a global temporary view?
  8. What is the difference between a temporary view and a global temporary view?
  9. What is a table in Spark SQL?
  10. What is a managed table?
  11. What is an external table?
  12. What is the difference between managed and external tables?
  13. What is a database in Spark SQL?
  14. How do you show all databases?
  15. How do you use the USE database command?
  16. What is the SHOW TABLES command?
  17. What is the SELECT statement in Spark SQL?
  18. What is the WHERE clause?
  19. What is the ORDER BY clause?
  20. What is the GROUP BY clause?
  21. What is the HAVING clause?
  22. What is the LIMIT clause?
  23. What is the DISTINCT keyword?
  24. What is the count() function?
  25. What is null handling in Spark SQL?

Intermediate Level

  1. What is lazy evaluation in Spark SQL?
  2. What is Catalyst Optimizer?
  3. What are logical plans in Spark SQL?
  4. What are physical plans in Spark SQL?
  5. What is the explain() command?
  6. What is the difference between explain() and explain(true)?
  7. What is schema inference in Spark SQL?
  8. What is SQLContext?
  9. What is the difference between SQLContext and SparkSession?
  10. What are built-in SQL functions?
  11. What is a CASE WHEN expression?
  12. What is cast() in Spark SQL?
  13. What are date and time functions?
  14. What is a join in Spark SQL?
  15. What are the different types of joins?
  16. What is an inner join?
  17. What is a left outer join?
  18. What is a right outer join?
  19. What is a full outer join?
  20. What is a cross join?
  21. What is a subquery in Spark SQL?
  22. What is a correlated subquery?
  23. What is the difference between UNION and UNION ALL?
  24. What is the difference between a view and a table?
  25. What is a window function?

Advanced Level

  1. What is a broadcast join in Spark SQL?
  2. How does Spark decide join strategies?
  3. What is a shuffle in Spark SQL?
  4. What is adaptive query execution (AQE)?
  5. What is the cost-based optimizer (CBO)?
  6. How does Spark SQL handle data skew?
  7. What is the salting technique in Spark SQL?
  8. What is partition pruning?
  9. What is dynamic partition pruning?
  10. What is bucketing in Spark SQL?
  11. What is the difference between partitioning and bucketing?
  12. What is column pruning?
  13. What is predicate pushdown?
  14. What is vectorized execution?
  15. What is whole-stage code generation?
  16. What is the Tungsten engine?
  17. How does Spark SQL manage memory?
  18. What is caching in Spark SQL?
  19. What is the difference between cache and persist?
  20. What is checkpointing?
  21. What is a Spark SQL UDF?
  22. What is the difference between UDFs and built-in functions?
  23. What is the performance impact of UDFs?
  24. What is map-side aggregation?
  25. How do you optimize aggregation queries?

Expert Level

  1. Why is Spark SQL faster than Hive?
  2. How does Catalyst apply rule-based optimization?
  3. How does Catalyst apply cost-based optimization?
  4. Explain the Spark SQL query execution lifecycle.
  5. What happens internally when a Spark SQL query is executed?
  6. How does Spark SQL generate Java bytecode?
  7. What is off-heap memory in Spark SQL?
  8. How does AQE modify execution plans at runtime?
  9. How do you debug slow Spark SQL queries?
  10. How do you analyze the Spark UI for SQL queries?
  11. What are common Spark SQL performance anti-patterns?
  12. Why is SELECT * discouraged?
  13. How does Spark SQL handle schema evolution?
  14. How does Spark SQL work with semi-structured data?
  15. How does Spark SQL handle nested data?
  16. What is explode() in Spark SQL?
  17. What are from_json() and to_json()?
  18. What is Delta Lake integration with Spark SQL?
  19. How does Spark SQL support ACID transactions?
  20. What is time travel in Spark SQL (with Delta Lake)?
  21. What is Z-Ordering in Delta Lake?
  22. What is the Delta Lake OPTIMIZE command?
  23. How do you tune Spark SQL configurations?
  24. Explain a real-world Spark SQL optimization scenario.
  25. When should Spark SQL not be used?

Related Topics