What is Apache Spark? How is it different from Hadoop MapReduce?
Explain the components of Spark architecture.
What are RDDs in Spark?
How does Spark achieve fault tolerance?
What are the advantages of using Spark?
Difference between Spark RDD, DataFrame, and Dataset?
How is Spark different from Hadoop?
What is lineage in Spark?
What is lazy evaluation in Spark?
What is DAG (Directed Acyclic Graph) in Spark?
What are transformations and actions in Spark?
Explain map(), flatMap(), and reduceByKey().
What is the difference between reduceByKey() and groupByKey()?
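A minimal PySpark sketch of these RDD operations, using a tiny in-memory dataset so it runs locally; the app name and sample lines are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is distributed"])

# map(): exactly one output element per input element
line_lengths = lines.map(lambda line: len(line))

# flatMap(): zero or more output elements per input element
words = lines.flatMap(lambda line: line.split(" "))

# reduceByKey(): combines values per key on each partition before the shuffle (map-side combine)
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# groupByKey(): ships every (key, value) pair across the network first, then aggregates -- usually slower
counts_slow = words.map(lambda w: (w, 1)).groupByKey().mapValues(sum)

print(counts.collect())
spark.stop()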
What is a SparkContext?
Explain the purpose of the SparkSession.
What are accumulators and broadcast variables?
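A short sketch of both shared-variable types; the lookup map, counter, and sample rows are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: read-only lookup data shipped once to each executor
country_codes = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: tasks can only add to it; the driver reads the aggregated value
bad_records = sc.accumulator(0)

def parse(line):
    fields = line.split(",")
    if len(fields) != 2:
        bad_records.add(1)          # counts malformed rows across all tasks
        return None
    code, amount = fields
    return (country_codes.value.get(code, "Unknown"), float(amount))

rdd = sc.parallelize(["IN,10.5", "US,3.0", "broken_row"])
parsed = rdd.map(parse).filter(lambda x: x is not None)

parsed.collect()                    # an action triggers the accumulator updates
print("Bad records:", bad_records.value)
spark.stop()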
Explain narrow vs wide transformations.
What is checkpointing? When is it needed?
What is Spark SQL?
Difference between DataFrame and RDD?
What are the advantages of DataFrames over RDDs?
How can you create a DataFrame in Spark?
What is the Catalyst optimizer?
Explain the Tungsten execution engine.
How do you run SQL queries on a DataFrame?
What is the role of the schema in Spark SQL?
What is a temporary view?
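A minimal sketch tying the last few DataFrame questions together: creating a DataFrame with an explicit schema, registering a temporary view, and querying it with Spark SQL (column names and data are made up):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("df-sql").getOrCreate()

# Explicit schema instead of relying on inference
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], schema)

# Temporary view: visible only to this SparkSession, not persisted anywhere
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()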
What is Spark Streaming?
What are DStreams?
Difference between DStreams and RDDs?
How does Spark Streaming handle fault tolerance?
What is a windowed operation in Spark Streaming?
How does backpressure work in Spark Streaming?
What is Structured Streaming?
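A hedged Structured Streaming sketch using the built-in rate source and a tumbling window; the source, rate, and window size are arbitrary choices for the demo:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("structured-streaming").getOrCreate()

# The rate source generates (timestamp, value) rows, handy for demos and tests
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Tumbling 1-minute window over the event timestamp
counts = events.groupBy(window(events.timestamp, "1 minute")).count()

query = (counts.writeStream
         .outputMode("complete")     # complete mode re-emits the full aggregate table each trigger
         .format("console")
         .start())

query.awaitTermination()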
What is MLlib?
What are the main algorithms provided in MLlib?
Difference between transformers and estimators in Spark ML?
What is a pipeline in Spark MLlib?
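An illustrative MLlib Pipeline combining transformers (Tokenizer, HashingTF) with an estimator (LogisticRegression); the training rows and column names are made up for the example:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline").getOrCreate()

training = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop mapreduce", 0.0)],
    ["text", "label"],
)

# Transformers add new columns; the estimator's fit() produces a fitted model
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training)        # returns a PipelineModel, itself a transformer

model.transform(training).select("text", "prediction").show()
spark.stop()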
How do you tune Spark jobs?
What are partitions in Spark?
How can you control the number of partitions?
What are coalesce() and repartition()?
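A small sketch contrasting the two; the partition counts are arbitrary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions").getOrCreate()

df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())

# repartition(): full shuffle; can increase or decrease the partition count and rebalances data evenly
rebalanced = df.repartition(200)

# coalesce(): narrow transformation that only merges existing partitions (no full shuffle),
# so it can only reduce the count and may leave partition sizes uneven
merged = df.coalesce(4)

print(rebalanced.rdd.getNumPartitions(), merged.rdd.getNumPartitions())
spark.stop()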
Explain broadcast join vs shuffle join.
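A hedged example of forcing a broadcast (map-side) join instead of a shuffle join; the table contents and the threshold value are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

orders = spark.createDataFrame([(1, "IN"), (2, "US")], ["order_id", "country_code"])
countries = spark.createDataFrame([("IN", "India"), ("US", "United States")],
                                  ["country_code", "country_name"])

# broadcast() hints Spark to ship the small table to every executor,
# avoiding a shuffle of the large table
joined = orders.join(broadcast(countries), "country_code")
joined.explain()                      # the plan should show a BroadcastHashJoin
joined.show()

# Spark also broadcasts automatically for tables below this threshold (default ~10 MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
spark.stop()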
How do you handle skewed data in Spark?
Which Spark configurations do you often tune?
How does Spark run on YARN?
Difference between client and cluster mode in Spark?
What is the role of the driver and executor?
What is the difference between local, standalone, and cluster modes?
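For reference, illustrative spark-submit commands for YARN; the resource numbers and the application file name are placeholders:

# Cluster mode: the driver runs inside a YARN container (typical for production jobs)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  my_job.py

# Client mode: the driver runs on the submitting machine (handy for shells and debugging)
spark-submit --master yarn --deploy-mode client my_job.py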
What is dynamic allocation in Spark?
How does Spark handle data locality?
How do you debug a failed Spark job?
What strategies do you use to optimize joins in Spark?
Have you used caching and persistence in Spark? When?
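A brief caching sketch; the data and filters are made up, and cache() is simply persist() with the default storage level:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching").getOrCreate()

df = spark.range(0, 10_000_000).withColumnRenamed("id", "user_id")

# cache() is shorthand for persist() with the default storage level
hot = df.filter("user_id % 2 = 0").cache()
hot.count()        # the first action materializes the cache
hot.count()        # later actions reuse the cached partitions

# persist() lets you choose the storage level explicitly
warm = df.filter("user_id % 2 = 1").persist(StorageLevel.MEMORY_AND_DISK)
warm.count()

hot.unpersist()    # release executor memory once the data is no longer reused
warm.unpersist()
spark.stop()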
How do you monitor Spark jobs?
How do you handle out-of-memory issues in Spark?
How does Spark SQL handle schema evolution?
Explain the difference between inner, left, right, and full joins in Spark SQL.
How can you improve performance of Spark SQL queries?
What is predicate pushdown?
How does partition pruning work?
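An illustrative sketch of writing partitioned Parquet and reading it back with filters that Spark can exploit; the path and column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pruning").getOrCreate()

df = spark.createDataFrame(
    [("2024-01-01", "IN", 10), ("2024-01-02", "US", 20)],
    ["event_date", "country", "clicks"],
)

# Partition the files on disk by event_date (hypothetical local path)
df.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events")

# The filter on the partition column lets Spark skip whole directories (partition pruning);
# the filter on clicks is pushed down into the Parquet reader (predicate pushdown)
pruned = (spark.read.parquet("/tmp/events")
          .filter("event_date = '2024-01-01' AND clicks > 5"))

pruned.explain()   # look for PartitionFilters and PushedFilters in the plan
pruned.show()
spark.stop()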
What is a broadcast hash join and when should you use it?
How does Spark integrate with Hive?
How can you cache a table in Spark SQL?
How does Spark handle null values in joins and filters?
What are the triggers available in Structured Streaming?
How does Structured Streaming guarantee exactly-once semantics?
Explain watermarking in Structured Streaming.
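A hedged watermarking sketch: the watermark bounds how late events may arrive before their window's state is finalized and dropped; the source and thresholds are arbitrary demo values:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Accept events up to 10 minutes late, then close the window and drop its state
windowed = (events
            .withWatermark("timestamp", "10 minutes")
            .groupBy(window(col("timestamp"), "5 minutes"))
            .count())

query = (windowed.writeStream
         .outputMode("append")       # append emits each window only once the watermark closes it
         .format("console")
         .start())

query.awaitTermination()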
How can you perform joins in Structured Streaming?
What are the sink types supported in Structured Streaming?
What is the difference between append, update, and complete output modes?
What is stateful streaming in Spark?
How do you manage state size in Structured Streaming?
How do you find the stage and task breakdown of a job in the Spark UI?
What is speculative execution in Spark?
What is a shuffle? How does it impact performance?
Explain persist() vs cache().
How do you avoid OOM (Out of Memory) errors in Spark?
How to optimize large joins in Spark?
How can skew in joins be identified and solved?
When would you use salting in Spark?
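A hedged salting sketch for a join key where one value dominates: the large side gets a random salt appended to the key, and the small side is replicated once per salt value so every salted key still matches. The bucket count N and all names are arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, concat_ws, explode, floor, lit, rand

spark = SparkSession.builder.appName("salting").getOrCreate()

# One key ("hot") carries almost all of the rows on the large side
large = spark.createDataFrame([("hot", i) for i in range(1000)] + [("rare", 1)], ["k", "v"])
small = spark.createDataFrame([("hot", "A"), ("rare", "B")], ["k", "label"])

N = 8  # number of salt buckets (tune to the observed skew)

# Large side: append a random salt 0..N-1 so the hot key spreads over N tasks
large_salted = large.withColumn(
    "salted_k", concat_ws("_", col("k"), floor(rand() * N).cast("string")))

# Small side: replicate each row once per salt value so every salted key finds a match
small_salted = (small
                .withColumn("salt", explode(array(*[lit(i) for i in range(N)])))
                .withColumn("salted_k", concat_ws("_", col("k"), col("salt").cast("string")))
                .drop("k", "salt"))

joined = large_salted.join(small_salted, "salted_k").drop("salted_k")
joined.show(5)
spark.stop()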
How would you process a 1 TB log file with Spark?
How do you handle schema evolution in a Data Lake with Spark?
Describe a time you handled performance issues in a production Spark pipeline.
How would you process millions of real-time events per minute using Spark?
You observe frequent job failures; what steps would you take to diagnose them?
How would you improve performance for a job that takes 5+ hours to run?
How would you manage data skew for a column with 95% nulls?
What is Databricks? How is it different from open-source Spark?
How does Delta Lake enhance Apache Spark?
What are the key benefits of Delta Lake over Parquet?
Explain ACID transactions in Delta Lake.
How does schema enforcement work in Delta Lake?
What is OPTIMIZE ZORDER in Databricks?
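A hedged sketch of the Delta Lake side of these questions, assuming a Databricks cluster or a Spark session configured with the delta-spark package; the path, table contents, and column names are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.createDataFrame([(1, "click"), (2, "view")], ["user_id", "event"])

# Writes are ACID: readers never see a partially written version
df.write.format("delta").mode("append").save("/tmp/delta/events")

# Schema enforcement: appending a frame with extra or mismatched columns fails
# unless mergeSchema is explicitly enabled
df.withColumn("source", df.event).write.format("delta") \
    .option("mergeSchema", "true").mode("append").save("/tmp/delta/events")

# OPTIMIZE ... ZORDER BY compacts small files and co-locates rows on the chosen column
# (Databricks; also available in recent open-source Delta releases)
spark.sql("OPTIMIZE delta.`/tmp/delta/events` ZORDER BY (user_id)")

# Time travel: read an earlier version of the table
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events").show()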
How do you schedule jobs in Databricks?
How do you use Git with Databricks notebooks?
What is Unity Catalog?
How do you secure data in Databricks (RBAC, data masking, etc.)?
How do you use Spark with Kafka?
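A hedged Kafka source/sink sketch, assuming the spark-sql-kafka connector is on the classpath; the broker addresses, topic names, and checkpoint path are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Read a Kafka topic as a stream; key and value arrive as binary and must be cast
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
       .option("subscribe", "events")                        # placeholder topic
       .option("startingOffsets", "latest")
       .load())

messages = raw.select(col("key").cast("string"), col("value").cast("string"))

# Write back to another topic; a checkpoint location is required for fault tolerance
query = (messages.writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("topic", "events_out")
         .option("checkpointLocation", "/tmp/checkpoints/events_out")
         .start())

query.awaitTermination()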
How do you use Spark with Cassandra / MongoDB?
|
|
|