In the rapidly evolving world of data science and big data analytics, Databricks has emerged as a pivotal platform that bridges the gap between massive datasets and actionable insights. For the uninitiated and the experienced alike, understanding the terminology unique to Databricks can be the difference between merely using a tool and mastering it.
Here’s a short glossary that clarifies the core Databricks terminology and can help improve your proficiency in this dynamic environment.
Think of the Databricks Workspace as your virtual lab. This is where you develop code, create data visualizations, and manage all your computational assets like clusters and jobs. It’s designed to foster collaboration, allowing multiple users to interact with notebooks, libraries, and data all in one space.
At the heart of Databricks is the Databricks Runtime: the set of core components that run on the clusters managed by Databricks. It’s an enhanced version of Apache Spark, with performance and security optimizations that can handle your heaviest data loads.
The Notebook is a web-based interface within Databricks where you can write and execute code and visualize results. It supports multiple programming languages (Python, Scala, SQL, and R) and allows for mixed-language notebooks.
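As a sketch of mixed-language use (cell contents are illustrative; `spark` is the SparkSession Databricks provides automatically), a notebook whose default language is Python can switch a single cell to SQL with a magic command:

```
# Cell 1 — default language: Python
df = spark.range(5)
df.createOrReplaceTempView("numbers")

# Cell 2 — the %sql magic runs just this cell as SQL
%sql
SELECT id * 2 AS doubled FROM numbers
```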
Clusters provide the computational horsepower needed to analyze data. They are fully customizable: you can specify the type of machines to use, autoscaling options, and more to match your workload.
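As an illustrative sketch (the instance type, runtime version, and name are assumptions, not recommendations), a cluster definition submitted through the Clusters API specifies the machine type and autoscaling bounds like this:

```json
{
  "cluster_name": "analytics-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}
```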
DBFS (Databricks File System)
The DBFS is a distributed file system mounted onto Databricks clusters. It’s your data playground, allowing you to easily store and move data around while integrating with various external data storage systems.
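A minimal sketch of moving data around with the `dbutils.fs` utilities (paths and contents are illustrative; this only runs inside a Databricks notebook, where `dbutils` and `display` are available automatically):

```python
# Create a directory, write a small file, and list it back
dbutils.fs.mkdirs("dbfs:/tmp/demo")
dbutils.fs.put("dbfs:/tmp/demo/hello.txt", "hello from DBFS", overwrite=True)
display(dbutils.fs.ls("dbfs:/tmp/demo"))
```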
Databricks Delta adds a layer of sophistication to data handling by introducing ACID transactions to big data workloads. It ensures data integrity and consistency, even when dealing with the most complex and large datasets.
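A hedged sketch of what those ACID guarantees look like in practice (the table path and the `updates` view are illustrative): a write either fully commits or fully rolls back, and a MERGE applies upserts as a single transaction.

```python
# Write a Delta table; readers never observe a partially written result
spark.range(100).write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Upsert new rows transactionally with MERGE
spark.sql("""
    MERGE INTO delta.`/tmp/delta/events` AS target
    USING updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```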
In Databricks, Jobs are scheduled or triggered tasks that run on Databricks clusters. These can be notebook workflows or JARs.
MLflow is an open-source platform for the complete machine learning lifecycle. It includes tools for tracking experiments, packaging code into reproducible runs, sharing and deploying models, bringing reproducibility and collaboration to your ML projects.
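A minimal tracking sketch (the parameter and metric names are illustrative); on Databricks, runs logged this way appear automatically in the workspace’s experiment UI:

```python
import mlflow

with mlflow.start_run(run_name="demo"):
    mlflow.log_param("alpha", 0.5)      # record a hyperparameter
    mlflow.log_metric("rmse", 0.87)     # record an evaluation metric
```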
The Spark UI is a web interface for monitoring Spark applications running on Databricks clusters. It’s a monitoring tool that provides details about task execution, helping you optimize and troubleshoot as necessary.
In the Databricks ecosystem, a Table is your structured dataset, ready for analysis. You can create tables from data residing in DBFS or pull in data from other remote sources.
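For example, registering a table over files already sitting in DBFS might look like this sketch (the table name and path are illustrative):

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales
    USING DELTA
    LOCATION 'dbfs:/tmp/delta/sales'
""")

spark.table("sales").limit(5).show()
```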
Libraries are the plugins of Databricks, providing additional functionality. Whether you’re using a data manipulation package in Python or a machine learning library in Scala, libraries can be installed on clusters to enhance your code’s capabilities.
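Besides attaching libraries to a cluster through the UI, you can install a notebook-scoped library with the %pip magic (the package choice is illustrative):

```
%pip install scikit-learn
```

The package then becomes importable in subsequent cells of that notebook.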
A term specific to Databricks Delta, a Shard refers to a chunk of data that is processed, read, or written in parallel with other chunks, enhancing the efficiency and speed of data operations.
Z-ordering is an optimization in Databricks Delta that co-locates related information in the same set of files, significantly reducing the time taken for data retrieval.
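The idea behind Z-ordering is the Z-order (Morton) curve: interleaving the bits of several column values produces a single sort key under which rows that are close in all of those dimensions land near each other, and therefore in the same files. A minimal pure-Python sketch of the bit interleaving (the function name and point values are illustrative):

```python
def morton_2d(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x fills the even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y fills the odd bit positions
    return key

# Sorting by the interleaved key clusters points that are close
# in BOTH dimensions, which is what lets Delta skip whole files.
points = [(0, 0), (7, 7), (1, 1), (0, 7)]
ordered = sorted(points, key=lambda p: morton_2d(*p))
```

In Databricks SQL you trigger this optimization with `OPTIMIZE my_table ZORDER BY (col1, col2)`.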
A set of features within Databricks, Databricks SQL allows you to execute SQL queries against your datasets and visualize the results.
Databricks Connect allows you to connect your local machine to a Databricks cluster and execute Spark code.
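A minimal sketch using the Databricks Connect client (this assumes the `databricks-connect` package is installed and workspace credentials are configured; nothing here runs without them):

```python
from databricks.connect import DatabricksSession

# Builds a SparkSession whose work is executed remotely on a Databricks cluster
spark = DatabricksSession.builder.getOrCreate()
print(spark.range(10).count())  # the count runs on the remote cluster
```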
Spark Driver is the program that declares the transformations and actions on your data and submits those requests to the cluster manager. In Databricks, the driver runs on a dedicated node of the cluster.
A critical component of the Spark runtime, Spark Executor is a process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage.
Fundamental to Databricks clusters, Worker Nodes are responsible for executing the tasks assigned by the Spark Driver and storing data as part of the computational resources.
Databricks Cluster types
All-Purpose Jobs
All-purpose jobs are designed for ad hoc and exploratory workloads. They allow you to run a notebook or a JAR with different parameters each time you execute the job. These jobs suit situations where you need to test different parameters frequently or where the job doesn’t run on a regular schedule, and they are especially handy for data engineers and data scientists during the data exploration and model development phases.
Workflow (Compute) Jobs
Workflow or compute jobs are designed for scheduled and predictable workloads. They let you execute a notebook or a JAR with the same set of parameters on a regular basis. They are typically used for production workloads where you need regular data processing, like ETL tasks that run at the same time every day, or for scheduled reporting.
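As a sketch, a scheduled job definition submitted through the Jobs API pins a nightly cron schedule to a notebook task like this (the name, path, cron expression, and timezone are illustrative):

```json
{
  "name": "nightly-etl",
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  },
  "tasks": [
    {
      "task_key": "etl",
      "notebook_task": {
        "notebook_path": "/Repos/etl/daily"
      }
    }
  ]
}
```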
Photon is Databricks’ native vectorized execution engine. It’s built to increase the performance of modern data workloads (like semi-structured data processing, AI, etc.) on CPU and GPU. Photon achieves this by vectorizing operations (processing data in batches instead of one item at a time) and by having a tight integration with Delta Lake.
- Performance: Photon offers significant performance improvements over the standard Spark engine for specific workloads.
- Flexibility: It seamlessly integrates with popular BI tools, ensuring that the enhanced performance isn’t just limited to tasks within Databricks but can also be leveraged for visualizations and reports generated using external tools.
- Cost-Efficient: By reducing the compute time, it can lead to cost savings especially if you’re using on-demand or spot instances.
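Photon is enabled per cluster; in a Clusters API definition this is controlled by the `runtime_engine` field (the other values shown are illustrative):

```json
{
  "cluster_name": "photon-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "runtime_engine": "PHOTON"
}
```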
Keep an eye on this space: there are many exciting developments ahead for Photon and the performant future of Spark SQL.
Now that you know the terminology for Databricks, it’s time to optimize your data workloads. Download The Guide to Databricks Optimization to start saving on data costs.