08 November 2020

#Apache_Spark

What is Apache Spark? How is it different from Hadoop MapReduce?
Explain the components of Spark architecture.
What are RDDs in Spark?
How does Spark achieve fault tolerance?
What are the advantages of using Spark?
Difference between Spark RDD, DataFrame, and Dataset?
How is Spark different from Hadoop?
What is lineage in Spark?
What is lazy evaluation in Spark?
What is DAG (Directed Acyclic Graph) in Spark?
What are transformations and actions in Spark?
Explain map(), flatMap(), and reduceByKey().
What is the difference between reduceByKey() and groupByKey()?
What is a SparkContext?
Explain the purpose of the SparkSession.
What are accumulators and broadcast variables?
Explain narrow vs wide transformations.
What is checkpointing? When is it needed?
What is Spark SQL?
Difference between DataFrame and RDD?
What are the advantages of DataFrames over RDDs?
How can you create a DataFrame in Spark?
What is Catalyst optimizer?
Explain Tungsten execution engine.
How do you run SQL queries on a DataFrame?
What is the role of the Schema in Spark SQL?
What is a temporary view?
What is Spark Streaming?
What are DStreams?
Difference between DStreams and RDDs?
How does Spark Streaming handle fault tolerance?
What is windowed operation in Spark Streaming?
How does backpressure work in Spark Streaming?
What is structured streaming?
What is MLlib?
What are the main algorithms provided in MLlib?
Difference between transformers and estimators in Spark ML?
What is a pipeline in Spark MLlib?
How do you tune Spark jobs?
What are partitions in Spark?
How can you control the number of partitions?
What are coalesce() and repartition()?
Explain broadcast join vs shuffle join.
How do you handle skewed data in Spark?
What are Spark configurations you often tune?
How does Spark run on YARN?
Difference between client and cluster mode in Spark?
What is the role of the driver and executor?
What is the difference between local, standalone, and cluster modes?
What is dynamic allocation in Spark?
How does Spark handle data locality?
How do you debug a failed Spark job?
What strategies do you use to optimize joins in Spark?
Have you used caching and persistence in Spark? When?
How do you monitor Spark jobs?
How do you handle out-of-memory issues in Spark?
How does Spark SQL handle schema evolution?
Explain the difference between inner, left, right, and full joins in Spark SQL.
How can you improve performance of Spark SQL queries?
What is predicate pushdown?
How does partition pruning work?
What is a broadcast hash join and when should you use it?
How does Spark integrate with Hive?
How can you cache a table in Spark SQL?
How does Spark handle null values in joins and filters?
What are the triggers available in structured streaming?
How does structured streaming guarantee exactly-once semantics?
Explain watermarking in structured streaming.
How can you perform joins in structured streaming?
What are the sink types supported in structured streaming?
What is the difference between append, update, and complete output modes?
What is stateful streaming in Spark?
How do you manage state size in structured streaming?
How do you find the stage and task breakdown of a job in Spark UI?
What is speculative execution in Spark?
What is the shuffle? How does it impact performance?
Explain persist() vs cache().
How do you avoid OOM (Out of Memory) errors in Spark?
How to optimize large joins in Spark?
How can skew in joins be identified and solved?
When would you use salting in Spark? (see the broadcast join and salting sketch after this list)
How would you process a 1 TB log file with Spark?
How do you handle schema evolution in a Data Lake with Spark?
Describe a time you handled performance issues in a production Spark pipeline.
How would you process millions of real-time events per minute using Spark?
You observe frequent job failures; what steps would you take to diagnose them?
How would you improve performance for a job that takes 5+ hours to run?
How would you manage data skew for a column with 95% nulls?
What is Databricks? How is it different from open-source Spark?
How does Delta Lake enhance Apache Spark?
What are the key benefits of Delta Lake over Parquet?
Explain ACID transactions in Delta Lake.
How does schema enforcement work in Delta Lake?
What is OPTIMIZE ZORDER in Databricks?
How do you schedule jobs in Databricks?
How do you use Git with Databricks notebooks?
What is Unity Catalog?
How do you secure data in Databricks (RBAC, data masking, etc.)?
How do you use Spark with Kafka?
How do you use Spark with Cassandra / MongoDB?
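
Several of the join and skew questions above revolve around the same two techniques, so here is a minimal PySpark sketch of a broadcast join and a simple salting approach; the paths, column names, and salt factor are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-examples").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table, skewed on customer_id (hypothetical path)
customers = spark.read.parquet("/data/customers")  # small dimension table (hypothetical path)

# Broadcast join: copy the small table to every executor so the large table is not shuffled.
broadcast_join = orders.join(F.broadcast(customers), "customer_id")

# Salting: split each hot key into N buckets, replicate the small side per bucket, then join.
N = 16
salted_orders = orders.withColumn("salt", (F.rand() * N).cast("int"))
salted_customers = customers.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt"))
salted_join = salted_orders.join(salted_customers, ["customer_id", "salt"]).drop("salt")
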
  • Core Components - Spark Core, Spark SQL, Spark Streaming, MLlib
  • GraphX - Graph processing
  • Spark Execution - Driver, Executors, Cluster Manager
  • RDD (Resilient Distributed Dataset)
  • DataFrame & Dataset
  • Lazy transformations: map(), filter(), flatMap().
  • Actions: collect(), count(), reduce(), save() - see the RDD sketch after this list.
  • Spark SQL (see the DataFrame and temporary view sketch after this list)
  • Catalyst Optimizer, Tungsten Engine
  • Spark Streaming - DStreams and Receivers
  • Structured Streaming (see the streaming sketch after this list)
  • Fault Tolerance
  • spark.executor.memory and spark.driver.memory
  • spark.sql.shuffle.partitions
  • coalesce() and repartition() - see the tuning sketch after this list
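
A minimal PySpark sketch of the lazy transformations and actions listed above; the input lines are made-up sample data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is lazy"])   # toy in-memory data

# Transformations are lazy: nothing executes until an action is called.
words = lines.flatMap(lambda line: line.split(" "))          # flatMap: one line -> many words
short = words.filter(lambda w: len(w) <= 5)                  # filter: keep words of 5 characters or fewer
pairs = short.map(lambda w: (w, 1))                          # map: word -> (word, 1)
counts = pairs.reduceByKey(lambda a, b: a + b)               # reduceByKey: sum the 1s per word

# Actions trigger execution of the whole lineage.
print(words.count())      # 6
print(counts.collect())   # e.g. [('spark', 2), ('is', 2), ('fast', 1), ('lazy', 1)]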
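
A minimal sketch of creating a DataFrame, registering a temporary view, and querying it with Spark SQL; the rows and schema are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-basics").getOrCreate()

# Build a DataFrame from in-memory rows (hypothetical data).
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Register a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()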
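
A minimal structured-streaming sketch with a watermark, a windowed aggregation, an output mode, and a processing-time trigger; the built-in rate source, window sizes, and intervals are arbitrary choices for illustration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-basics").getOrCreate()

# The rate source generates (timestamp, value) rows - handy for demos without Kafka.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Windowed count with a watermark so state for late data is eventually dropped.
counts = (events
          .withWatermark("timestamp", "1 minute")
          .groupBy(F.window("timestamp", "30 seconds"))
          .count())

query = (counts.writeStream
         .outputMode("update")                    # other modes: append, complete
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())

query.awaitTermination()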
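
A minimal sketch of the tuning knobs in the last few bullets: shuffle partitions, repartition() vs coalesce(), and where executor/driver memory is usually set; the sizes, partition counts, paths, and the event_date column are placeholders.

# Memory is normally set at submit time, for example:
#   spark-submit --executor-memory 4g --driver-memory 2g my_job.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuning-basics")
         .config("spark.sql.shuffle.partitions", "200")  # partitions produced by shuffles
         .getOrCreate())

df = spark.read.parquet("/data/events")    # hypothetical input path

print(df.rdd.getNumPartitions())           # current partition count

wide = df.repartition(400, "event_date")   # full shuffle; can increase partitions or rebalance by key
narrow = df.coalesce(50)                   # narrow dependency; only reduces the count, avoids a full shuffle

narrow.write.mode("overwrite").parquet("/data/events_compact")   # hypothetical output path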
