09 January 2026

#Deep Learning

Key Concepts


| S.No | Topic | Sub-Topics |
|---|---|---|
| 1 | Deep Learning | What is deep learning, History, DL vs ML, Applications, Challenges |
| 2 | Linear Algebra Refresher | Vectors, Matrices, Dot product, Eigenvalues, Matrix operations |
| 3 | Probability & Statistics | Random variables, Probability distributions, Mean & variance, Bayes theorem, Expectation |
| 4 | Optimization Basics | Loss functions, Cost functions, Convex vs non-convex, Gradient descent, Learning rate |
| 5 | Neural Network Basics | Perceptron, Neurons, Weights & bias, Activation functions, Forward propagation |
| 6 | Activation Functions | Sigmoid, Tanh, ReLU, Leaky ReLU, Softmax |
| 7 | Backpropagation | Chain rule, Gradient computation, Weight updates, Vanishing gradients, Exploding gradients |
| 8 | Training Deep Neural Networks | Epochs, Batch size, Initialization, Convergence, Overfitting |
| 9 | Regularization Techniques | L1, L2, Dropout, Early stopping, Data augmentation |
| 10 | Optimizers | SGD, Momentum, RMSProp, Adam, AdamW |
| 11 | Loss Functions | MSE, MAE, Cross-entropy, Hinge loss, KL divergence |
| 12 | Deep Learning Frameworks | TensorFlow basics, Keras API, PyTorch basics, Autograd, Model training loop |
| 13 | Convolutional Neural Networks (CNN) | Convolution layers, Pooling, Padding, Stride, Feature maps |
| 14 | CNN Architectures | LeNet, AlexNet, VGG, ResNet, Inception |
| 15 | Image Classification | Dataset preparation, Transfer learning, Fine-tuning, Evaluation metrics, Deployment basics |
| 16 | Sequence Modeling | Time series, Sequential data, Tokenization, Padding, Masking |
| 17 | Recurrent Neural Networks (RNN) | RNN architecture, BPTT, Vanishing gradients, Use cases, Limitations |
| 18 | LSTM & GRU | Cell states, Gates, LSTM vs GRU, Applications, Training tips |
| 19 | Attention Mechanism | Why attention, Self-attention, Encoder-decoder attention, Scaled dot-product, Benefits |
| 20 | Transformers | Transformer architecture, Positional encoding, Multi-head attention, Encoder-decoder, Training |
| 21 | Natural Language Processing with DL | Word embeddings, RNN for NLP, Transformers for NLP, Text classification, NER |
| 22 | Autoencoders | Basic autoencoder, Sparse AE, Denoising AE, Variational AE, Use cases |
| 23 | Generative Models | GAN basics, Generator & discriminator, Training instability, DCGAN, Applications |
| 24 | Advanced CNN Applications | Object detection, Image segmentation, Face recognition, Medical imaging, OCR |
| 25 | Transfer Learning & Fine-tuning | Pretrained models, Layer freezing, Domain adaptation, Benefits, Limitations |
| 26 | Model Evaluation | Accuracy, Precision, Recall, F1-score, ROC-AUC |
| 27 | Hyperparameter Tuning | Grid search, Random search, Bayesian optimization, Learning rate schedules, Batch tuning |
| 28 | Model Optimization | Pruning, Quantization, Knowledge distillation, Mixed precision, Inference optimization |
| 29 | Deployment of DL Models | REST APIs, Model serving, Cloud deployment, Edge deployment, Monitoring |
| 30 | Advanced & Emerging Topics | Self-supervised learning, Multimodal models, Foundation models, Ethical AI, Future trends |
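
Topic 4 above lists loss functions, gradient descent, and learning rate. As a rough illustration of how these fit together, the sketch below fits a straight line by minimising mean squared error with plain gradient descent. The function name `gradient_descent`, the toy data, and the default `learning_rate` and `epochs` values are assumptions chosen for this example, not part of the syllabus.

```python
def gradient_descent(xs, ys, learning_rate=0.05, epochs=1000):
    """Fit y ~ w * x + b by minimising mean squared error (MSE)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) with respect to w and b
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Step against the gradient; the learning rate controls the step size
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

On this toy data the loss is convex, so a small enough learning rate converges towards the generating slope and intercept; a learning rate that is too large would make the updates diverge instead.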

Interview Questions

Basic Level

  1. What is Deep Learning?
  2. What is the difference between Machine Learning and Deep Learning?
  3. What is a Neural Network?
  4. What is a perceptron?
  5. What is an activation function?
  6. What are weights and biases?
  7. What is forward propagation?
  8. What is backpropagation?
  9. What is a loss function?
  10. What is gradient descent?
  11. What is a learning rate?
  12. What are epochs and batches?
  13. What is overfitting?
  14. What is underfitting?
  15. What is regularization?
  16. What is dropout?
  17. What is a convolutional neural network (CNN)?
  18. What is pooling in CNN?
  19. What is a recurrent neural network (RNN)?
  20. What is an LSTM?
  21. What is a GRU?
  22. What is batch normalization?
  23. What is data augmentation?
  24. What is feature extraction?
  25. What is a deep neural network (DNN)?

Intermediate Level

  1. Explain how backpropagation works.
  2. What is the vanishing gradient problem?
  3. What is the exploding gradient problem?
  4. What is weight initialization?
  5. What is Xavier/Glorot initialization?
  6. What is He initialization?
  7. What are optimizers (SGD, Adam, RMSProp)?
  8. What is model capacity?
  9. What is early stopping?
  10. What is cross-entropy loss?
  11. What is mean squared error?
  12. What are CNN kernels/filters?
  13. What are padding and stride?
  14. What is a fully connected layer?
  15. What is transfer learning?
  16. What are embeddings?
  17. What is a sequence-to-sequence (Seq2Seq) model?
  18. What is teacher forcing?
  19. What is the attention mechanism?
  20. What is self-attention?
  21. What is multi-head attention?
  22. What is positional encoding?
  23. What is a transformer?
  24. What are residual connections?
  25. What is a skip connection?

Advanced Level

  1. Explain the architecture of a CNN end-to-end.
  2. Explain the architecture of an RNN end-to-end.
  3. Explain the architecture of LSTM and GRU in detail.
  4. What is the receptive field in CNN?
  5. What is dilated convolution?
  6. What is depthwise separable convolution?
  7. What is the difference between batch, layer, and group normalization?
  8. What is label smoothing?
  9. How are attention scores calculated?
  10. What is cross-attention?
  11. What are encoder and decoder blocks in transformers?
  12. What is beam search decoding?
  13. What is scheduled sampling?
  14. What is gradient clipping?
  15. What is gradient checkpointing?
  16. Explain vanishing gradient mitigation methods.
  17. What are VAEs (Variational Autoencoders)?
  18. What are GANs (Generative Adversarial Networks)?
  19. What is mode collapse in GANs?
  20. What is contrastive learning?
  21. What is self-supervised learning?
  22. What is metric learning?
  23. What is a Siamese network?
  24. What is cosine similarity in embeddings?
  25. What is knowledge distillation?

Expert Level

  1. Explain the transformer architecture at scale.
  2. What are Mixture-of-Experts (MoE) models?
  3. What is sparse attention?
  4. What is FlashAttention?
  5. What is rotary positional embedding (RoPE)?
  6. What is ALiBi (Attention with Linear Biases)?
  7. What is diffusion modeling?
  8. Explain the U-Net architecture used in diffusion.
  9. What is reinforcement learning in deep learning?
  10. What is Deep Q-Learning?
  11. What is policy gradient?
  12. What is PPO (Proximal Policy Optimization)?
  13. What is distributed training?
  14. What is the difference between data parallelism and model parallelism?
  15. What is pipeline parallelism?
  16. What is tensor parallelism?
  17. What is quantization-aware training?
  18. What is pruning in neural networks?
  19. What is model compression?
  20. What is federated learning?
  21. What is edge deployment for deep learning models?
  22. What are safety and alignment concerns in deep models?
  23. What are adversarial attacks?
  24. What are adversarial defense techniques?
  25. What are current research trends in deep learning?


#Spark SQL

Key Concepts


| S.No | Topic | Sub-Topics |
|---|---|---|
| 1 | Spark SQL | What is Spark SQL, SQL vs DataFrame API, Use cases, Architecture overview, Components |
| 2 | Spark SQL Architecture | Catalyst optimizer, Logical plan, Physical plan, Tungsten engine, Execution flow |
| 3 | SparkSession & SQL Entry Points | SparkSession.sql(), SQLContext, HiveContext, Configurations, Best practices |
| 4 | Creating Tables & Views | Managed tables, External tables, Temporary views, Global views, CTAS |
| 5 | Data Types & Schema | Primitive types, Complex types, Struct, Array, Map |
| 6 | Reading Data Sources | CSV, JSON, Parquet, ORC, Avro basics |
| 7 | Writing Data using SQL | INSERT INTO, INSERT OVERWRITE, Save modes, Partition writes, Bucketing |
| 8 | Basic SELECT Queries | Select columns, Expressions, Aliases, DISTINCT, LIMIT |
| 9 | Filtering & WHERE Clause | WHERE conditions, AND/OR, BETWEEN, IN, LIKE |
| 10 | Sorting & Ordering | ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY, Performance impact |
| 11 | Aggregation Functions | COUNT, SUM, AVG, MIN, MAX |
| 12 | GROUP BY & HAVING | Grouping rules, Multiple group keys, HAVING clause, Aggregation filters, Optimization |
| 13 | Join Types in Spark SQL | Inner join, Left join, Right join, Full join, Cross join |
| 14 | Join Optimization Techniques | Broadcast joins, Shuffle joins, Join hints, Skew handling, AQE |
| 15 | Subqueries & CTEs | Scalar subqueries, Correlated subqueries, WITH clause, Nested queries, Optimization |
| 16 | Window Functions | OVER clause, PARTITION BY, ORDER BY, Ranking functions, Analytical functions |
| 17 | String & Date Functions | String manipulation, Date arithmetic, Timestamp functions, Formatting, Parsing |
| 18 | Handling NULLs | IS NULL, IS NOT NULL, COALESCE, NVL, NULLIF |
| 19 | Complex Data Types | Arrays, Maps, Structs, explode, lateral view, JSON functions |
| 20 | User Defined Functions (UDF) | Creating UDFs, Registering UDFs, Performance impact, When to avoid UDFs, Alternatives |
| 21 | Partitioning & Bucketing | Table partitioning, Static vs dynamic partitions, Bucketing concepts, Query pruning, Performance |
| 22 | Performance Optimization | Predicate pushdown, Column pruning, Caching tables, AQE, Cost-based optimizer |
| 23 | Execution Plans & Debugging | EXPLAIN, Logical vs physical plans, DAG stages, Common bottlenecks, Tuning |
| 24 | Integration with Hive | Hive metastore, HiveQL support, External tables, SerDe, Compatibility issues |
| 25 | Transactional Tables | ACID tables, Delta Lake basics, MERGE, UPDATE, DELETE, Time travel |
| 26 | Structured Streaming with SQL | Streaming tables, Continuous queries, Watermarking, Window aggregations, Triggers |
| 27 | Security & Access Control | Table permissions, Column masking, Row-level security, Auditing, Best practices |
| 28 | Error Handling & Data Quality | Bad records handling, Schema mismatch, Try-catch patterns, Data validation, Logging |
| 29 | Best Practices & SQL Standards | Naming conventions, Query readability, Anti-patterns, Reusability, Testing |
| 30 | Real-world Use Cases & Projects | ETL pipelines, Data warehousing, Reporting, Optimization review, End-to-end project |

Interview Questions

Basic Level

  1. What is Spark SQL?
  2. How is Spark SQL different from traditional SQL?
  3. What are the main components of Spark SQL?
  4. What is a SparkSession?
  5. How do you create a DataFrame using Spark SQL?
  6. What is a temporary view?
  7. What is a global temporary view?
  8. Difference between temporary view and global temporary view?
  9. What is a table in Spark SQL?
  10. What is a managed table?
  11. What is an external table?
  12. Difference between managed and external tables?
  13. What is a database in Spark SQL?
  14. How do you show all databases?
  15. How do you use the USE database command?
  16. What is the SHOW TABLES command?
  17. What is the SELECT statement in Spark SQL?
  18. What is the WHERE clause?
  19. What is the ORDER BY clause?
  20. What is the GROUP BY clause?
  21. What is the HAVING clause?
  22. What is the LIMIT clause?
  23. What is the DISTINCT keyword?
  24. What is the count() function?
  25. What is null handling in Spark SQL?

Intermediate Level

  1. What is lazy evaluation in Spark SQL?
  2. What is the Catalyst Optimizer?
  3. What are logical plans in Spark SQL?
  4. What are physical plans in Spark SQL?
  5. What is the explain() command?
  6. Difference between explain() and explain(true)?
  7. What is schema inference in Spark SQL?
  8. What is SQLContext?
  9. Difference between SQLContext and SparkSession?
  10. What are built-in SQL functions?
  11. What is a CASE WHEN expression?
  12. What is cast() in Spark SQL?
  13. What are date and time functions?
  14. What is a join in Spark SQL?
  15. What are the different types of joins?
  16. What is an inner join?
  17. What is a left outer join?
  18. What is a right outer join?
  19. What is a full outer join?
  20. What is a cross join?
  21. What is a subquery in Spark SQL?
  22. What is a correlated subquery?
  23. What is the difference between UNION and UNION ALL?
  24. What is the difference between a view and a table?
  25. What is a window function?

Advanced Level

  1. What is a broadcast join in Spark SQL?
  2. How does Spark decide join strategies?
  3. What is shuffle in Spark SQL?
  4. What is adaptive query execution (AQE)?
  5. What is the cost-based optimizer (CBO)?
  6. How does Spark SQL handle data skew?
  7. What is the salting technique in Spark SQL?
  8. What is partition pruning?
  9. What is dynamic partition pruning?
  10. What is bucketing in Spark SQL?
  11. Difference between partitioning and bucketing?
  12. What is column pruning?
  13. What is predicate pushdown?
  14. What is vectorized execution?
  15. What is whole-stage code generation?
  16. What is the Tungsten engine?
  17. How does Spark SQL manage memory?
  18. What is caching in Spark SQL?
  19. Difference between cache and persist?
  20. What is checkpointing?
  21. What is a Spark SQL UDF?
  22. Difference between UDF and built-in functions?
  23. Performance impact of UDFs?
  24. What is map-side aggregation?
  25. How to optimize aggregation queries?

Expert Level

  1. Why is Spark SQL faster than Hive?
  2. How does Catalyst apply rule-based optimization?
  3. How does Catalyst apply cost-based optimization?
  4. Explain the Spark SQL query execution lifecycle.
  5. What happens internally when a Spark SQL query is executed?
  6. How does Spark SQL generate Java bytecode?
  7. What is off-heap memory in Spark SQL?
  8. How does AQE modify execution plans at runtime?
  9. How to debug slow Spark SQL queries?
  10. How to analyze Spark UI for SQL queries?
  11. What are common Spark SQL performance anti-patterns?
  12. Why is SELECT * discouraged?
  13. How does Spark SQL handle schema evolution?
  14. How does Spark SQL work with semi-structured data?
  15. How does Spark SQL handle nested data?
  16. What is explode() in Spark SQL?
  17. What are from_json() and to_json()?
  18. What is Delta Lake integration with Spark SQL?
  19. How does Spark SQL support ACID transactions?
  20. What is time travel in Spark SQL?
  21. What is Z-Ordering?
  22. What is the OPTIMIZE command?
  23. How do you tune Spark SQL configurations?
  24. Explain a real-world Spark SQL optimization scenario.
  25. When should Spark SQL not be used?


#RDD

Key Concepts


| S.No | Topic | Sub-Topics |
|---|---|---|
| 1 | Apache Spark & RDD | What is Spark, What is RDD, RDD characteristics, Use cases, Spark ecosystem |
| 2 | Spark Architecture | Driver, Executors, Cluster manager, Jobs, Stages & Tasks |
| 3 | SparkContext & RDD Creation | SparkContext, parallelize(), textFile(), wholeTextFiles(), makeRDD() |
| 4 | RDD Types & Characteristics | Parallel collections, Hadoop RDDs, Typed RDDs, Immutable nature, Partitioning |
| 5 | RDD Transformations Basics | map, flatMap, filter, distinct, union |
| 6 | RDD Actions Basics | collect, count, first, take, reduce |
| 7 | RDD Key-Value Pairs | Pair RDDs, mapToPair, reduceByKey, groupByKey, combineByKey |
| 8 | Key-Value Aggregations | reduceByKey, foldByKey, aggregateByKey, countByKey, sortByKey |
| 9 | RDD Joins | join, leftOuterJoin, rightOuterJoin, fullOuterJoin, Cartesian |
| 10 | RDD Set Operations | union, intersection, subtract, distinct, zip |
| 11 | Partitioning in RDDs | HashPartitioner, RangePartitioner, Custom partitioner, Repartition, Coalesce |
| 12 | Shuffle Operations | Shuffle concept, Wide vs narrow transformations, Optimization, Skew issues, Tuning |
| 13 | RDD Persistence & Caching | cache, persist, Storage levels, Memory management, When to cache |
| 14 | Fault Tolerance & Lineage | RDD lineage, DAG, Re-computation, Checkpointing, Recovery |
| 15 | Shared Variables | Broadcast variables, Accumulators, Use cases, Performance benefits, Limitations |
| 16 | RDD with Files | Text files, Sequence files, Object files, Hadoop input formats, Output formats |
| 17 | RDD vs DataFrame vs Dataset | API differences, Performance, Type safety, Optimization, When to use RDD |
| 18 | Custom Transformations | mapPartitions, mapPartitionsWithIndex, foreachPartition, Efficiency, Use cases |
| 19 | RDD Sorting & Ordering | sortBy, sortByKey, takeOrdered, top, Ordering logic |
| 20 | Error Handling & Debugging | Common errors, Serialization issues, Logging, Debug techniques, Best practices |
| 21 | Performance Optimization | Partition sizing, Avoiding shuffles, Serialization tuning, Memory tuning, Best practices |
| 22 | Handling Large Datasets | Skew handling, Sampling, Checkpointing, Spill management, Resource tuning |
| 23 | RDD Execution Model | Job submission, Stage creation, Task execution, DAG scheduling, Execution flow |
| 24 | RDD with Hive & HDFS | HDFS integration, Hive tables, InputFormat, OutputFormat, Metadata handling |
| 25 | Advanced Key-Value Operations | combineByKey internals, map-side aggregation, Custom combiners, Performance tuning, Examples |
| 26 | Checkpointing & Reliability | Checkpoint types, Reliable storage, Performance trade-offs, Use cases, Configuration |
| 27 | Security & Access Control | Authentication, Authorization, Secure clusters, Kerberos basics, Best practices |
| 28 | Testing RDD Code | Local mode testing, Unit testing, Test data, Assertions, Debug runs |
| 29 | Best Practices & Anti-patterns | Common mistakes, Performance anti-patterns, Code structure, Reusability, Guidelines |
| 30 | Real-world Use Cases & Projects | Log processing, ETL pipelines, Graph processing, Analytics workloads, Review |

Interview Questions

Basic Level

  1. What is an RDD in Apache Spark?
  2. What does RDD stand for?
  3. What are the main characteristics of RDDs?
  4. How is RDD different from a DataFrame?
  5. Is RDD immutable? Why?
  6. How do you create an RDD in Spark?
  7. What are the two ways to create an RDD?
  8. What is parallelize() in Spark?
  9. What is the textFile() method?
  10. What is lazy evaluation in RDD?
  11. What are transformations in RDD?
  12. What are actions in RDD?
  13. Give examples of RDD transformations.
  14. Give examples of RDD actions.
  15. What is the map() transformation?
  16. What is flatMap()?
  17. Difference between map() and flatMap()?
  18. What is filter() in RDD?
  19. What is the collect() action?
  20. What is the count() action?
  21. What is the take() action?
  22. What is foreach() in RDD?
  23. What is SparkContext?
  24. What is an RDD partition?
  25. Why is RDD fault tolerant?

Intermediate Level

  1. What is lineage in RDD?
  2. How does RDD achieve fault tolerance?
  3. What is a DAG in Spark?
  4. What is a narrow transformation?
  5. What is a wide transformation?
  6. Difference between narrow and wide transformations?
  7. What is a shuffle in Spark?
  8. Which RDD operations cause shuffle?
  9. What is reduceByKey()?
  10. Difference between reduceByKey() and groupByKey()?
  11. What is aggregateByKey()?
  12. What is combineByKey()?
  13. What is cache() in RDD?
  14. What is persist()?
  15. Difference between cache() and persist()?
  16. What storage levels are supported in RDD?
  17. What is the MEMORY_ONLY storage level?
  18. What happens if a cached RDD does not fit in memory?
  19. What is checkpointing?
  20. Difference between cache and checkpoint?
  21. What is coalesce()?
  22. Difference between coalesce() and repartition()?
  23. What is the union() operation?
  24. What is intersection()?
  25. What is subtract() in RDD?

Advanced Level

  1. What is a Pair RDD?
  2. How do you create a Pair RDD?
  3. What is sortByKey()?
  4. How does partitioning work in RDD?
  5. What is HashPartitioner?
  6. What is RangePartitioner?
  7. Why is partitioning important?
  8. What is mapPartitions()?
  9. Difference between map() and mapPartitions()?
  10. What is foreachPartition()?
  11. What is zip() in RDD?
  12. What are broadcast variables?
  13. What are accumulators?
  14. Can RDDs be shared between Spark applications?
  15. What is a task in Spark?
  16. What is a stage in Spark?
  17. How are stages created from RDD operations?
  18. What is speculative execution?
  19. What is memory spill in Spark?
  20. How does Spark handle executor memory?
  21. What is a long lineage problem?
  22. How does checkpoint help performance?
  23. What is serialization in RDD?
  24. Java serializer vs Kryo serializer?
  25. How to optimize RDD performance?

Expert Level

  1. When should you use RDD over DataFrame?
  2. Why are RDDs considered low-level APIs?
  3. How does RDD differ from Dataset?
  4. How does Spark recover lost RDD partitions?
  5. What happens if a node fails during RDD computation?
  6. How does Spark schedule RDD tasks?
  7. What is locality level in Spark?
  8. Explain the process of RDD execution internally.
  9. How does Spark avoid recomputation?
  10. Explain RDD execution with an example job.
  11. How do you debug RDD performance issues?
  12. How does garbage collection affect RDD performance?
  13. What is an out-of-memory error in RDD jobs?
  14. How to tune Spark for large RDD jobs?
  15. What are common RDD anti-patterns?
  16. Why is groupByKey discouraged?
  17. How do you manage skewed data in RDD?
  18. Explain custom partitioner usage.
  19. What is RDD persistence across stages?
  20. Can RDD operations be optimized by Catalyst?
  21. Why does Spark recommend DataFrames over RDDs?
  22. How does RDD handle schema-less data?
  23. What are the limitations of RDD?
  24. How does RDD support iterative algorithms?
  25. Explain RDD usage in real-time production scenarios.


#DataFrames

Key Concepts


| S.No | Topic | Sub-Topics |
|---|---|---|
| 1 | Apache Spark & DataFrames | Spark overview, RDD vs DataFrame, Spark architecture, Lazy evaluation, Use cases |
| 2 | Spark Setup & Environment | Local mode, Cluster mode, SparkSession, spark-submit, Configuration basics |
| 3 | SparkSession & Entry Points | SparkSession creation, SQLContext, HiveContext, Config options, Best practices |
| 4 | Creating DataFrames | From files, From RDD, From collections, Schema inference, Explicit schema |
| 5 | DataFrame Schema & Data Types | StructType, StructField, Primitive types, Complex types, Schema evolution |
| 6 | Reading Data Sources | CSV, JSON, Parquet, ORC, Avro basics |
| 7 | Writing DataFrames | Save modes, Partitioning, Bucketing, File formats, Compression |
| 8 | DataFrame Basic Operations | select, withColumn, drop, filter, where |
| 9 | Column Operations | Column expressions, alias, cast, when/otherwise, lit |
| 10 | Row Operations & Actions | show, collect, take, count, first |
| 11 | DataFrame Functions | Built-in functions, String functions, Date functions, Math functions, Null handling |
| 12 | Filtering & Conditional Logic | filter vs where, isin, like, rlike, case when |
| 13 | Sorting & Deduplication | orderBy, sort, distinct, dropDuplicates, Sorting optimization |
| 14 | Aggregation & Grouping | groupBy, agg, count, sum, avg |
| 15 | Joins in DataFrames | Inner join, Left/Right join, Full join, Semi/Anti join |
| 16 | Join Optimization | Broadcast join, Shuffle join, Join hints, Skew handling, AQE |
| 17 | Handling Missing & Bad Data | dropna, fillna, replace, Null checks, Data validation |
| 18 | Window Functions | Window spec, row_number, rank, lead/lag, Running totals |
| 19 | UDF & UDAF | UDF creation, Performance impact, Pandas UDF, Serialization, Best practices |
| 20 | DataFrame Caching & Persistence | cache, persist, Storage levels, Memory vs disk, When to cache |
| 21 | Spark SQL with DataFrames | Temp views, Global views, SQL queries, Mixing SQL & DF, Optimization |
| 22 | Partitioning & Repartitioning | repartition, coalesce, Partition pruning, File partitioning, Performance tuning |
| 23 | Performance Optimization Basics | Catalyst optimizer, Tungsten, Predicate pushdown, Column pruning, AQE |
| 24 | DataFrame Execution Plan | Logical plan, Physical plan, explain(), DAG, Stage breakdown |
| 25 | Handling Large Datasets | Skew issues, Sampling, Checkpointing, Memory tuning, Spill handling |
| 26 | Integration with Hive | Hive tables, External tables, Metastore, Partitioned tables, Hive SQL |
| 27 | Streaming DataFrames (Structured Streaming) | Streaming sources, Sinks, Watermarking, Windowed aggregations, Triggers |
| 28 | Error Handling & Debugging | Common errors, Serialization issues, Logging, Debug tools, Retry strategies |
| 29 | Best Practices & Design Patterns | Code structure, Reusability, Performance patterns, Anti-patterns, Testing |
| 30 | Real-world Use Cases & Projects | ETL pipelines, Data lake processing, Analytics workloads, Reporting, Optimization review |

Interview Questions

Basic Level

  1. What is a DataFrame in Apache Spark?
  2. How is a DataFrame different from an RDD?
  3. What are the advantages of DataFrames?
  4. Is DataFrame immutable?
  5. What is SparkSession?
  6. How do you create a DataFrame in Spark?
  7. What are the different data sources supported by DataFrames?
  8. What is a schema in DataFrame?
  9. How can you infer schema automatically?
  10. How do you define a custom schema?
  11. What is show() in DataFrame?
  12. What is printSchema()?
  13. What is select() in DataFrame?
  14. What is withColumn()?
  15. Difference between withColumn() and select()?
  16. What is filter() / where() in DataFrame?
  17. Difference between filter() and where()?
  18. What is limit()?
  19. What is collect()?
  20. What is count()?
  21. What is distinct()?
  22. What is drop()?
  23. What is dropDuplicates()?
  24. What is alias()?
  25. How do you rename a column in DataFrame?

Intermediate Level

  1. What are DataFrame transformations?
  2. What are DataFrame actions?
  3. What is lazy evaluation in DataFrames?
  4. What is explain() in DataFrame?
  5. What is a logical plan?
  6. What is a physical plan?
  7. What is the Catalyst Optimizer?
  8. What is the Tungsten execution engine?
  9. What is a column expression?
  10. What are built-in Spark SQL functions?
  11. Difference between UDF and built-in functions?
  12. What is a UDF?
  13. Performance impact of UDFs?
  14. What is groupBy()?
  15. What is agg()?
  16. Difference between groupBy and window functions?
  17. What is orderBy()?
  18. Difference between orderBy() and sort()?
  19. What is join() in DataFrame?
  20. What are the types of joins in Spark?
  21. What is an inner join?
  22. What is a left outer join?
  23. What is a right outer join?
  24. What is a full outer join?
  25. What is a cross join?

Advanced Level

  1. What is a broadcast join?
  2. When does Spark automatically use a broadcast join?
  3. What is a shuffle join?
  4. How does Spark handle join optimization?
  5. What is partitioning in DataFrame?
  6. Difference between repartition() and coalesce()?
  7. What is bucketing in Spark?
  8. Difference between bucketing and partitioning?
  9. What is a window function?
  10. Explain row_number(), rank(), dense_rank().
  11. What is caching in DataFrame?
  12. Difference between cache() and persist()?
  13. What storage levels are available?
  14. What is checkpointing in DataFrame?
  15. Difference between cache and checkpoint?
  16. What is skewed data?
  17. How to handle data skew in joins?
  18. What is the salting technique?
  19. What is adaptive query execution (AQE)?
  20. What are Spark SQL hints?
  21. What is explain(true)?
  22. How to optimize wide transformations?
  23. What is column pruning?
  24. What is predicate pushdown?
  25. What is a vectorized reader?

Expert Level

  1. Why are DataFrames faster than RDDs?
  2. How does Catalyst optimize DataFrame queries?
  3. How does Tungsten improve performance?
  4. What happens internally when a DataFrame action is triggered?
  5. How does Spark generate optimized bytecode?
  6. What is whole-stage code generation?
  7. What is off-heap memory?
  8. How does Spark handle memory management for DataFrames?
  9. How does AQE change execution plans at runtime?
  10. How do you debug slow DataFrame jobs?
  11. How do you analyze Spark UI for DataFrame jobs?
  12. What are common DataFrame performance anti-patterns?
  13. Why is excessive use of withColumn() discouraged?
  14. How do you design efficient Spark SQL pipelines?
  15. What are the limitations of DataFrames?
  16. How do DataFrames handle schema evolution?
  17. How does DataFrame support semi-structured data?
  18. What is the explode() function?
  19. What are from_json() and to_json()?
  20. How to handle nested columns efficiently?
  21. What is Delta Lake DataFrame integration?
  22. How does DataFrame handle ACID properties?
  23. What is cost-based optimization (CBO)?
  24. How do you tune Spark SQL configurations?
  25. Explain real-time production use cases of DataFrames.


#Vector Databases

Key Concepts


| S.No | Topic | Sub-Topics |
|---|---|---|
| 1 | Vector Databases | What is a vector, embeddings basics, similarity search, use cases, traditional DB vs vector DB |
| 2 | Mathematics Behind Vectors | Linear algebra basics, cosine similarity, Euclidean distance, dot product, normalization |
| 3 | Embeddings Fundamentals | Text embeddings, image embeddings, dimensionality, dense vs sparse vectors, embedding quality |
| 4 | Embedding Models | Word2Vec, GloVe, FastText, Sentence Transformers, OpenAI embeddings |
| 5 | Vector Similarity Metrics | Cosine similarity, L2 distance, inner product, Hamming distance, trade-offs |
| 6 | Indexing Techniques | Flat index, inverted index, IVF, HNSW, PQ |
| 7 | Approximate Nearest Neighbor (ANN) | Why ANN, accuracy vs speed, recall, latency, scalability |
| 8 | HNSW Algorithm | Graph layers, insertion, search process, parameters, performance tuning |
| 9 | IVF and PQ Indexes | Clustering, centroids, compression, memory optimization, search flow |
| 10 | Index Optimization | Re-indexing, shard sizing, dimension reduction, caching, pruning |
| 11 | Popular Vector Databases | Pinecone, Weaviate, Milvus, Qdrant, FAISS overview |
| 12 | Pinecone Architecture | Indexes, namespaces, pods, metadata filtering, scaling |
| 13 | Weaviate Architecture | Schema, classes, modules, GraphQL API, hybrid search |
| 14 | Milvus Architecture | Collections, partitions, segments, coordinators, storage layers |
| 15 | Open Source vs Managed Vector DBs | Cost, scalability, maintenance, performance, use cases |
| 16 | Data Modeling in Vector DBs | Vector schema, metadata fields, hybrid models, versioning, updates |
| 17 | CRUD Operations | Insert vectors, update vectors, delete vectors, batch operations, upserts |
| 18 | Metadata Filtering | Structured filters, boolean logic, range queries, hybrid search, performance impact |
| 19 | Hybrid Search | Keyword + vector search, BM25, re-ranking, score fusion, use cases |
| 20 | Vector Search APIs | REST APIs, SDKs, query parameters, pagination, result tuning |
| 21 | Scalability & Sharding | Horizontal scaling, replicas, partitioning, load balancing, fault tolerance |
| 22 | Performance Tuning | Latency optimization, recall tuning, batch queries, memory usage, caching |
| 23 | Consistency & Durability | Replication, WAL, backups, recovery, data integrity |
| 24 | Security in Vector Databases | Authentication, authorization, encryption, network security, compliance |
| 25 | Monitoring & Observability | Metrics, logging, tracing, alerts, capacity planning |
| 26 | Vector DB + LLM Integration | RAG pattern, prompt injection, context windowing, retrieval tuning, pipelines |
| 27 | Semantic Search Applications | Search engines, document retrieval, FAQs, chatbots, recommendation systems |
| 28 | Multimodal Vector Search | Text-image search, audio embeddings, video embeddings, fusion strategies, use cases |
| 29 | Production Best Practices | Schema evolution, re-embedding, cost optimization, SLA planning, testing |
| 30 | Advanced & Future Trends | Vector + graph DBs, real-time embeddings, edge deployment, AI agents, research trends |
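
Topics 2, 5, and 6 above mention cosine similarity and the flat index. As a minimal sketch of what a flat (exhaustive) similarity search does, the example below scores every stored vector against a query and keeps the best matches. The helper names `cosine_similarity` and `top_k`, the document keys, and the three-dimensional toy vectors are assumptions made for illustration; real vector databases use ANN indexes such as HNSW or IVF rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Brute-force search: score every stored vector, return the k best."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "index" mapping document keys to embeddings
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], docs, k=2)
```

A full scan like this is exact but costs time proportional to the number of stored vectors, which is the trade-off that motivates the approximate indexes listed in the table.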


#RAG

Key Concepts


| S.No | Topic | Sub-Topics |
|---|---|---|
| 1 | RAG | What is RAG, Why RAG, RAG vs LLM-only, RAG use cases, RAG limitations |
| 2 | LLM Fundamentals for RAG | Transformer basics, Context window, Tokens, Prompt-response flow, Hallucinations |
| 3 | Text Embeddings | What are embeddings, Vector representation, Embedding models, Dimensionality, Similarity meaning |
| 4 | Embedding Models | OpenAI embeddings, SentenceTransformers, Multilingual embeddings, Trade-offs, Model selection |
| 5 | Vector Databases Basics | Vector DB concept, ANN search, Indexing basics, Metadata storage, Vector lifecycle |
| 6 | Vector DB Tools | FAISS, Pinecone, Weaviate, Milvus, ChromaDB |
| 7 | Distance Metrics | Cosine similarity, Dot product, Euclidean distance, Trade-offs, Metric selection |
| 8 | Chunking Strategies | Fixed chunking, Semantic chunking, Chunk size, Overlap, Parent-child chunks |
| 9 | Document Ingestion | PDF ingestion, Text files, HTML ingestion, Cleaning text, Normalization |
| 10 | Indexing Pipeline | Embedding generation, Batch indexing, Metadata tagging, Versioning, Index updates |
| 11 | Retrieval Basics | Top-k retrieval, Similarity threshold, Recall vs precision, Retrieval latency, Query flow |
| 12 | Hybrid Search | Dense search, Sparse search, Keyword search, BM25, Hybrid ranking |
| 13 | Metadata Filtering | Structured filters, Access control, User-based filtering, Time filters, Security filters |
| 14 | Prompt Engineering for RAG | Prompt templates, Context injection, Instructions, Citations, Answer formatting |
| 15 | Naive RAG Architecture | Single retriever, Single prompt, Context stuffing, Limitations, Failure cases |
| 16 | Advanced RAG Architecture | Multi-retriever, Reranking, Compression, Query rewriting, Modular design |
| 17 | Reranking Techniques | Cross-encoders, Relevance scoring, Latency trade-off, Top-n rerank, Quality boost |
| 18 | Context Optimization | Token limits, Context pruning, Compression, Redundancy removal, Ordering chunks |
| 19 | Multi-hop Retrieval | Complex queries, Query decomposition, Iterative retrieval, Chain-of-thought, Examples |
| 20 | Agentic RAG | LLM agents, Tool calling, Planner-executor, Memory, Autonomous retrieval |
| 21 | Structured Data RAG | SQL integration, CSV data, APIs, Knowledge graphs, Hybrid retrieval |
| 22 | RAG with LangChain | Retrievers, Chains, Vector stores, Memory, RAG pipelines |
| 23 | RAG with LlamaIndex | Indexes, Query engines, Node parsing, Storage context, Tools |
| 24 | Evaluation of RAG | Retrieval metrics, Answer quality, Faithfulness, Relevance, Latency |
| 25 | RAGAS Framework | Faithfulness score, Context recall, Answer relevance, Ground truth, Automation |
| 26 | Security in RAG | Prompt injection, Data leakage, RBAC, PII handling, Secure retrieval |
| 27 | Scalability & Performance | Index sharding, Caching, Async retrieval, Load balancing, Cost control |
| 28 | Production Deployment | API design, Model hosting, Vector DB hosting, Monitoring, Logging |
| 29 | Monitoring & Feedback | User feedback, Drift detection, Retrieval errors, Continuous improvement, Alerts |
| 30 | Enterprise RAG Use Cases | Chatbots, Search engines, Knowledge assistants, Analytics, Decision support |

Interview question

Basic Level

  1. What is Retrieval-Augmented Generation (RAG)?
  2. Why is RAG needed for LLM applications?
  3. What problems does RAG solve?
  4. What are the core components of a RAG system?
  5. What is retrieval in RAG?
  6. What is generation in RAG?
  7. How is RAG different from fine-tuning?
  8. How is RAG different from prompt engineering?
  9. What is a knowledge base in RAG?
  10. What type of data can RAG consume?
  11. What are embeddings?
  12. Why are embeddings used in RAG?
  13. What is a vector database?
  14. What are some examples of vector databases?
  15. What is semantic search?
  16. What is similarity search?
  17. What distance metrics are commonly used?
  18. What is cosine similarity?
  19. What is text chunking?
  20. Why is chunking important in RAG?
  21. What is a context window?
  22. What is prompt grounding?
  23. What is hallucination in LLMs?
  24. How does RAG reduce hallucinations?
  25. What are common RAG use cases?

Intermediate Level

  1. Explain the end-to-end RAG workflow.
  2. How are embeddings generated?
  3. Which embedding models are commonly used?
  4. What is embedding dimensionality?
  5. How does chunk size affect retrieval?
  6. What is chunk overlap?
  7. What is metadata filtering?
  8. What is hybrid search?
  9. What is the difference between sparse and dense retrieval?
  10. How does keyword search differ from vector search?
  11. What is top-k retrieval?
  12. How do you decide the value of k?
  13. What is reranking?
  14. Why is reranking important?
  15. What is prompt templating in RAG?
  16. How is retrieved context injected into prompts?
  17. What is the latency challenge in RAG?
  18. How do you improve RAG response speed?
  19. What is document indexing?
  20. How do you update knowledge base data?
  21. What is FAISS?
  22. What is Pinecone?
  23. What is Weaviate?
  24. What is ChromaDB?
  25. What role does LangChain play in RAG?

Advanced Level

  1. What are the different RAG architectures?
  2. What is naive RAG?
  3. What is advanced RAG?
  4. What is agentic RAG?
  5. What is multi-hop retrieval?
  6. What is query rewriting?
  7. What is a self-query retriever?
  8. What is parent-child chunking?
  9. What is the difference between document-level and chunk-level retrieval?
  10. What is contextual compression?
  11. How do you handle long documents in RAG?
  12. How does RAG integrate with structured data?
  13. How can SQL databases be used in RAG?
  14. What is retrieval evaluation?
  15. What metrics are used to evaluate RAG?
  16. What is recall vs precision in RAG?
  17. What is MMR (Maximal Marginal Relevance)?
  18. How does MMR help improve answer quality?
  19. What is data skew in retrieval?
  20. How do you handle stale data?
  21. How do you implement real-time RAG?
  22. How is access control handled in RAG?
  23. How do you secure sensitive documents?
  24. How does multilingual RAG work?
  25. What are common RAG failure patterns?

Expert Level

  1. How do you design a production-grade RAG system?
  2. How does RAG scale to millions of documents?
  3. What are the trade-offs between RAG and fine-tuning?
  4. How do you optimize RAG for low latency?
  5. How do you debug poor RAG responses?
  6. What causes irrelevant retrieval?
  7. How do you improve retrieval accuracy?
  8. How do context limits impact RAG?
  9. What strategies help reduce token usage?
  10. How do you prevent prompt injection in RAG?
  11. How do you measure answer faithfulness?
  12. What is the RAGAS evaluation framework?
  13. How do you monitor RAG systems in production?
  14. How do you build feedback loops?
  15. What is continuous indexing?
  16. How do you version embeddings?
  17. How do you migrate vector databases safely?
  18. How do you control RAG operational costs?
  19. How do you handle LLM upgrades?
  20. How does RAG enable explainability?
  21. What is citation-based RAG?
  22. How does RAG work with AI agents?
  23. What are emerging RAG patterns?
  24. What are the limitations of RAG?
  25. Explain enterprise-level RAG use cases.

Related Topics