Prime_Questions: #Spark SQL

Key Concepts

S.No	Topic	Sub-Topics
1	Spark SQL	What is Spark SQL, SQL vs DataFrame API, Use cases, Architecture overview, Components
2	Spark SQL Architecture	Catalyst optimizer, Logical plan, Physical plan, Tungsten engine, Execution flow
3	SparkSession & SQL Entry Points	SparkSession.sql(), SQLContext, HiveContext, Configurations, Best practices
4	Creating Tables & Views	Managed tables, External tables, Temporary views, Global views, CTAS
5	Data Types & Schema	Primitive types, Complex types, Struct, Array, Map
6	Reading Data Sources	CSV, JSON, Parquet, ORC, Avro basics
7	Writing Data using SQL	INSERT INTO, INSERT OVERWRITE, Save modes, Partition writes, Bucketing
8	Basic SELECT Queries	Select columns, Expressions, Aliases, DISTINCT, LIMIT
9	Filtering & WHERE Clause	WHERE conditions, AND/OR, BETWEEN, IN, LIKE
10	Sorting & Ordering	ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY, Performance impact
11	Aggregation Functions	COUNT, SUM, AVG, MIN, MAX
12	GROUP BY & HAVING	Grouping rules, Multiple group keys, HAVING clause, Aggregation filters, Optimization
13	Join Types in Spark SQL	Inner join, Left join, Right join, Full join, Cross join
14	Join Optimization Techniques	Broadcast joins, Shuffle joins, Join hints, Skew handling, AQE
15	Subqueries & CTEs	Scalar subqueries, Correlated subqueries, WITH clause, Nested queries, Optimization
16	Window Functions	OVER clause, PARTITION BY, ORDER BY, Ranking functions, Analytical functions
17	String & Date Functions	String manipulation, Date arithmetic, Timestamp functions, Formatting, Parsing
18	Handling NULLs	IS NULL, IS NOT NULL, COALESCE, NVL, NULLIF
19	Complex Data Types	Arrays, Maps, Structs, explode, lateral view, JSON functions
20	User Defined Functions (UDF)	Creating UDFs, Registering UDFs, Performance impact, When to avoid UDFs, Alternatives
21	Partitioning & Bucketing	Table partitioning, Static vs dynamic partitions, Bucketing concepts, Query pruning, Performance
22	Performance Optimization	Predicate pushdown, Column pruning, Caching tables, AQE, Cost-based optimizer
23	Execution Plans & Debugging	EXPLAIN, Logical vs physical plans, DAG stages, Common bottlenecks, Tuning
24	Integration with Hive	Hive metastore, HiveQL support, External tables, SerDe, Compatibility issues
25	Transactional Tables	ACID tables, Delta Lake basics, MERGE, UPDATE, DELETE, Time travel
26	Structured Streaming with SQL	Streaming tables, Continuous queries, Watermarking, Window aggregations, Triggers
27	Security & Access Control	Table permissions, Column masking, Row-level security, Auditing, Best practices
28	Error Handling & Data Quality	Bad records handling, Schema mismatch, Try-catch patterns, Data validation, Logging
29	Best Practices & SQL Standards	Naming conventions, Query readability, Anti-patterns, Reusability, Testing
30	Real-world Use Cases & Projects	ETL pipelines, Data warehousing, Reporting, Optimization review, End-to-end project

Interview Questions

Basic Level

What is Spark SQL?
How is Spark SQL different from traditional SQL?
What are the main components of Spark SQL?
What is a SparkSession?
How do you create a DataFrame using Spark SQL?
What is a temporary view?
What is a global temporary view?
What is the difference between a temporary view and a global temporary view?
What is a table in Spark SQL?
What is a managed table?
What is an external table?
What is the difference between managed and external tables?
What is a database in Spark SQL?
How do you show all databases?
How do you use the USE database command?
What does the SHOW TABLES command do?
What is the SELECT statement in Spark SQL?
What is the WHERE clause?
What is the ORDER BY clause?
What is the GROUP BY clause?
What is the HAVING clause?
What is the LIMIT clause?
What is the DISTINCT keyword?
What is the count() function?
What is NULL handling in Spark SQL?
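
Several of the basic questions above (WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, count(), NULL handling) can be practiced in one query. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in so it runs without a Spark installation, and the orders table and its columns are made-up names. The query text itself uses only standard SQL that Spark SQL also accepts, so under that assumption the same string could be passed to spark.sql().

```python
import sqlite3

# SQLite is a stand-in for Spark here (assumption); "orders" is a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("alice", 30.0), ("bob", 5.0), ("bob", None), ("carol", 50.0)],
)

query = """
SELECT customer,
       COUNT(*)                 AS order_count,   -- counts rows, including NULL amounts
       SUM(COALESCE(amount, 0)) AS total_amount   -- NULL amounts are treated as 0
FROM orders
WHERE customer <> 'carol'                         -- row filter, applied before grouping
GROUP BY customer
HAVING SUM(COALESCE(amount, 0)) > 10              -- group filter, applied after aggregation
ORDER BY total_amount DESC
LIMIT 5
"""

rows = conn.execute(query).fetchall()
print(rows)
```

WHERE removes rows before grouping, while HAVING removes whole groups after aggregation, which is why bob's group (total 5) is dropped even though his rows pass the WHERE filter.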

Intermediate Level

What is lazy evaluation in Spark SQL?
What is the Catalyst optimizer?
What are logical plans in Spark SQL?
What are physical plans in Spark SQL?
What is the explain() command?
What is the difference between explain() and explain(true)?
What is schema inference in Spark SQL?
What is SQLContext?
What is the difference between SQLContext and SparkSession?
What are built-in SQL functions?
What is a CASE WHEN expression?
What is cast() in Spark SQL?
What are date and time functions?
What is a join in Spark SQL?
What are the different types of joins?
What is an inner join?
What is a left outer join?
What is a right outer join?
What is a full outer join?
What is a cross join?
What is a subquery in Spark SQL?
What is a correlated subquery?
What is the difference between UNION and UNION ALL?
What is the difference between a view and a table?
What is a window function?
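
The join, CASE WHEN, and window function questions above can be tried together in a single query. This is a minimal sketch under the same assumption as any local practice setup: sqlite3 (SQLite 3.25 or newer, needed for window functions) stands in for Spark, and the employees and departments tables are hypothetical. The SQL uses only standard syntax that Spark SQL also supports.

```python
import sqlite3

# SQLite is a stand-in for Spark here (assumption); both tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept_id INTEGER, salary INTEGER);
CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
INSERT INTO employees VALUES ('ann', 1, 100), ('ben', 1, 80), ('cat', 2, 90), ('dan', NULL, 70);
INSERT INTO departments VALUES (1, 'eng'), (2, 'ops');
""")

query = """
SELECT e.name,
       d.dept_name,                                                  -- NULL when no department matches
       CASE WHEN e.salary >= 90 THEN 'high' ELSE 'standard' END AS band,
       RANK() OVER (PARTITION BY e.dept_id ORDER BY e.salary DESC) AS dept_rank
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id                     -- keeps employees without a department
ORDER BY e.name
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The left outer join keeps dan even though his dept_id is NULL, and the window function ranks salaries within each department without collapsing rows the way GROUP BY would.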

Advanced Level

What is a broadcast join in Spark SQL?
How does Spark decide join strategies?
What is a shuffle in Spark SQL?
What is adaptive query execution (AQE)?
What is the cost-based optimizer (CBO)?
How does Spark SQL handle data skew?
What is the salting technique in Spark SQL?
What is partition pruning?
What is dynamic partition pruning?
What is bucketing in Spark SQL?
What is the difference between partitioning and bucketing?
What is column pruning?
What is predicate pushdown?
What is vectorized execution?
What is whole-stage code generation?
What is the Tungsten engine?
How does Spark SQL manage memory?
What is caching in Spark SQL?
What is the difference between cache and persist?
What is checkpointing?
What is a Spark SQL UDF?
What is the difference between a UDF and built-in functions?
What is the performance impact of UDFs?
What is map-side aggregation?
How do you optimize aggregation queries?
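
For the data skew and salting questions above, the idea can be sketched without Spark at all. The snippet below is a plain-Python simulation, not Spark's implementation: the row data, the NUM_SALTS value, and the round-robin salt (chosen for determinism; a random salt is the more common choice in practice) are all assumptions made for illustration. It shows a two-stage aggregation in which a hot key is first split across several salted keys and the partial results are then combined.

```python
from collections import Counter

NUM_SALTS = 4  # hypothetical number of salt values

# Skewed input: one "hot" key dominates the data set.
rows = [("hot", 1)] * 1000 + [("cold", 1)] * 10

# Without salting, every row for a key falls into one group (one task's worth of work).
unsalted_group_sizes = Counter(key for key, _ in rows)

# Stage 1: append a salt to each key and aggregate per (key, salt).
partial = Counter()
salted_group_sizes = Counter()
for i, (key, value) in enumerate(rows):
    salted_key = (key, i % NUM_SALTS)  # round-robin salt keeps the example deterministic
    salted_group_sizes[salted_key] += 1
    partial[salted_key] += value

# Stage 2: drop the salt and combine the partial sums per original key.
final = Counter()
for (key, _salt), subtotal in partial.items():
    final[key] += subtotal

print("largest group without salting:", max(unsalted_group_sizes.values()))
print("largest group with salting:", max(salted_group_sizes.values()))
print("final totals:", dict(final))
```

The final totals match what a single-stage aggregation would produce, but the largest group processed in stage 1 is a quarter of the original size, which is the point of salting.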

Expert Level

Why is Spark SQL faster than Hive?
How does Catalyst apply rule-based optimization?
How does Catalyst apply cost-based optimization?
Explain Spark SQL query execution lifecycle.
What happens internally when a Spark SQL query is executed?
How does Spark SQL generate Java bytecode?
What is off-heap memory in Spark SQL?
How does AQE modify execution plans at runtime?
How do you debug slow Spark SQL queries?
How do you analyze the Spark UI for SQL queries?
What are common Spark SQL performance anti-patterns?
Why is SELECT * discouraged?
How does Spark SQL handle schema evolution?
How does Spark SQL work with semi-structured data?
How does Spark SQL handle nested data?
What is explode() in Spark SQL?
What are from_json() and to_json()?
What is Delta Lake integration with Spark SQL?
How does Spark SQL support ACID transactions?
What is time travel in Spark SQL?
What is Z-Ordering?
What is the OPTIMIZE command?
How do you tune Spark SQL configurations?
Explain a real-world Spark SQL optimization scenario.
When should Spark SQL not be used?

09 January 2026