18 January 2026

#Databricks

Key Concepts


S.No | Topic | Sub-Topics
1 | Databricks | What is Databricks, Lakehouse concept, Databricks vs Hadoop, Use cases, Architecture overview
2 | Databricks Workspace | Workspace UI, Notebooks, Clusters, Jobs, Repos
3 | Databricks Architecture | Control plane, Data plane, Workspace components, Security layers, Execution flow
4 | Clusters in Databricks | All-purpose clusters, Job clusters, Autoscaling, Cluster policies, Init scripts
5 | Databricks Runtime | DBR versions, Photon engine, ML runtime, GPU runtime, Performance tuning
6 | Notebooks | Languages supported, Notebook workflows, Magic commands, Versioning, Collaboration
7 | Databricks Utilities (dbutils) | File system ops, Secrets, Widgets, Notebook workflows, FS mounts
8 | Data Ingestion | Batch ingestion, Streaming ingestion, Auto Loader, File formats, Schema inference
9 | Delta Lake Fundamentals | ACID transactions, Delta log, Schema enforcement, Time travel, File compaction
10 | Delta Lake Advanced | OPTIMIZE, Z-ORDER, Vacuum, Delta constraints, Change Data Feed
11 | Spark SQL in Databricks | SQL editor, ANSI SQL, Views, CTEs, Query optimization
12 | DataFrames & Datasets | API overview, Transformations, Actions, Lazy evaluation, Performance tips
13 | Databricks SQL Warehouses | Serverless SQL, Query execution, Dashboards, Alerts, Access control
14 | Jobs & Workflows | Job types, Task dependencies, Scheduling, Retries, Monitoring
15 | Databricks Repos | Git integration, Branching, CI/CD basics, Repo permissions, Best practices
16 | Security & Access Control | Users & groups, IAM integration, Table ACLs, Cluster policies, Secrets
17 | Unity Catalog | Metastore, Catalogs & schemas, Data lineage, Fine-grained access, Auditing
18 | Streaming with Databricks | Structured Streaming, Triggers, Watermarking, Stateful ops, Fault tolerance
19 | Auto Loader | CloudFiles, Incremental ingestion, Schema evolution, Notifications, Performance tuning
20 | Databricks ML Overview | ML workspace, ML runtime, Experiment tracking, Feature store, Model registry
21 | MLflow in Databricks | Tracking, Projects, Models, Model registry, Deployment
22 | Feature Store | Feature tables, Offline features, Online features, Reusability, Governance
23 | Model Training | Distributed training, Hyperparameter tuning, AutoML, GPUs, Evaluation metrics
24 | Model Deployment | Batch inference, Real-time serving, Model endpoints, A/B testing, Monitoring
25 | Performance Optimization | Partitioning, Caching, Broadcast joins, Skew handling, Photon usage
26 | Monitoring & Logging | Spark UI, Ganglia, Job metrics, Logs, Alerts
27 | Cost Optimization | Cluster sizing, Spot instances, Autoscaling, Job clusters, Usage reports
28 | Databricks on Cloud | AWS architecture, Azure architecture, GCP basics, Networking, Storage integration
29 | CI/CD & DevOps | Repos + pipelines, Databricks CLI, Asset bundles, Environment promotion, Automation
30 | Real-world Use Cases | ETL pipelines, Streaming analytics, ML pipelines, Lakehouse design, Interview prep

Interview Questions

Basic

  • What is Databricks?
  • What problems does Databricks solve?
  • What is Apache Spark?
  • How is Databricks different from Apache Spark?
  • What are Databricks Workspaces?
  • What is a Databricks Cluster?
  • Types of clusters in Databricks?
  • What is a Notebook in Databricks?
  • Which languages does Databricks support?
  • What is DBFS?
  • Difference between DBFS and HDFS?
  • What is Delta Lake?
  • What are the advantages of Delta Lake?
  • What is a Databricks Job?
  • What is autoscaling?
  • What is auto-termination?
  • What is the Databricks Runtime?
  • Difference between Standard and ML Runtime?
  • What is a DataFrame?
  • What is a SparkSession?
  • What is cell execution?
  • What is notebook revision history?
  • What is a magic command?
  • What is %sql in Databricks?
  • What is Unity Catalog?

Intermediate

  • What is a Delta table?
  • What is ACID compliance in Delta Lake?
  • What is schema enforcement?
  • What is schema evolution?
  • What is OPTIMIZE in Delta Lake?
  • What is Z-ORDER?
  • Difference between Managed and External tables?
  • What is Time Travel in Delta Lake?
  • How do you handle duplicate data in Databricks?
  • What is Databricks SQL?
  • Difference between Spark SQL and Databricks SQL?
  • Difference between a job cluster and an interactive cluster?
  • How does Databricks handle fault tolerance?
  • What is caching in Databricks?
  • What is a broadcast join?
  • What is a shuffle?
  • What is lazy evaluation?
  • Difference between RDD and DataFrame?
  • What is checkpointing?
  • How does Databricks integrate with cloud storage?
  • What is Structured Streaming?
  • Difference between batch and streaming?
  • What is watermarking?
  • What is MLflow?
  • What is the Feature Store?

Advanced

  • Explain the Databricks Lakehouse architecture
  • How does Delta Lake handle concurrent writes?
  • What is VACUUM in Delta Lake?
  • Explain the Delta log (_delta_log)
  • How does Z-ORDER improve performance?
  • What is the Photon engine?
  • How does Photon improve query performance?
  • Explain cluster sizing strategy
  • How do you optimize Spark jobs in Databricks?
  • Explain adaptive query execution
  • What is cost optimization in Databricks?
  • What is data skipping?
  • Explain file compaction
  • How does Databricks handle skewed data?
  • What is the Unity Catalog security model?
  • Difference between table ACLs and Unity Catalog?
  • What is lineage in Databricks?
  • How do you manage secrets in Databricks?
  • What is the Databricks REST API?
  • How do you deploy code using Databricks Repos?
  • What is CI/CD in Databricks?
  • What is MLflow tracking?
  • What is the model registry?
  • Explain a real-time pipeline in Databricks
  • How do you handle late-arriving data?

Expert

  • Design an end-to-end Lakehouse architecture
  • How does Databricks ensure data governance at scale?
  • Explain the multi-hop architecture (Bronze, Silver, Gold)
  • How do you design CDC pipelines in Databricks?
  • Explain Delta Live Tables (DLT)
  • Difference between DLT and normal pipelines?
  • How do you handle schema drift in production?
  • Explain exactly-once processing
  • How do you tune Spark for petabyte-scale data?
  • How does Photon compare with Spark Tungsten?
  • Explain Databricks serverless SQL
  • How do you secure PII data in Databricks?
  • How do you implement row-level and column-level security?
  • Explain workload isolation
  • How do you migrate from Hive to Databricks?
  • How do you monitor Databricks jobs?
  • Explain cost vs performance trade-offs
  • How do you manage large joins efficiently?
  • Explain Lakehouse vs Data Warehouse
  • What is the future roadmap of the Databricks platform?
  • How does Databricks support AI workloads?
  • Explain vector search in Databricks
  • How do you handle model versioning at scale?
  • Explain MLOps in Databricks
  • How do you design an enterprise-grade Databricks solution?

Related Topics