What Is Azure Databricks?
Azure Databricks is a cloud-based unified data analytics platform that is built on Apache Spark. It is a collaboration between Microsoft Azure and Databricks and provides a scalable and secure environment for big data processing, machine learning, and data visualization.
Azure Databricks offers a powerful set of tools for data science and engineering, including a collaborative workspace for data exploration and experimentation, as well as a managed cloud infrastructure for running large-scale workloads. It also provides seamless integration with other Azure services, making it easy to incorporate Azure Databricks into existing workflows and architectures.
In this article:
- Azure Databricks Pricing: Standard vs. Premium Tier Features
- Azure Databricks Pricing Models
- 9 Tips for Azure Databricks Cost Optimization
Azure Databricks Pricing: Standard vs. Premium Tier Features
Azure Databricks offers a standard tier and a premium tier, each with a different set of features for different workloads. Customers specify the tier when they create an Azure Databricks workspace, and they can switch to the other tier later if their needs change.
Standard tier features include:
- Apache Spark on Databricks: All-purpose, jobs, and light jobs compute
- Job scheduling with libraries: All-purpose, jobs, and light jobs compute
- Job scheduling with notebooks: All-purpose and jobs compute
- Autopilot clusters: All-purpose and jobs compute
- Databricks Delta: All-purpose and jobs compute
- Databricks Runtime for Machine Learning: All-purpose and jobs compute
- MLflow on Databricks Preview: All-purpose and jobs compute
- Interactive clusters: All-purpose compute
- Notebooks and collaboration: All-purpose compute
- Ecosystem integration: All-purpose compute
Premium tier features include:
- Role-based access control for clusters, tables, notebooks, and jobs: Interactive and automated workloads
- JDBC/ODBC endpoint authentication: Interactive and automated workloads
- Audit logs: Interactive and automated workloads
- All standard tier features: Interactive and automated workloads
- Credential passthrough (Azure AD): Interactive and automated workloads (no light jobs compute)
- Conditional authentication: Interactive workloads (no automated workloads or light jobs compute)
- IP access list: Preview available
- Cluster policies: Preview available
- Token management API: Preview available
Azure Databricks Pricing Models
Azure Databricks customers can pay under one of two pricing plans: pay-as-you-go or a DBU pre-purchase plan.
Under the pay-as-you-go model, Azure Databricks charges for the virtual machines (VMs) provisioned and for the Databricks Units (DBUs) consumed; prices differ based on the chosen VM instance. A DBU is a unit of processing capacity, and usage is billed per second. DBU consumption depends on the type and size of the instance that runs Azure Databricks.
Here is an example of the pricing for the Central US region:
- All-purpose compute: Standard = $0.40 per DBU/hour, Premium = $0.55 per DBU/hour
- Jobs compute: Standard = $0.15 per DBU/hour, Premium = $0.30 per DBU/hour
- Light jobs compute: Standard = $0.07 per DBU/hour, Premium = $0.22 per DBU/hour
- SQL compute: Premium = $0.22 per DBU/hour
- SQL pro compute: Premium = $0.33 per DBU/hour
- Serverless SQL: Premium = $0.42 per DBU/hour
- Serverless real-time inference: Premium = $0.079 per DBU/hour
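As a rough illustration of how these rates combine with VM charges, the sketch below estimates the pay-as-you-go cost of a single cluster run using the Central US rates listed above. The function name, the DBU-per-node figure, and the VM hourly price are hypothetical inputs chosen for the example; real values depend on the VM instance you select.

```python
# Central US pay-as-you-go rates from the list above, in USD per DBU-hour.
DBU_RATES = {
    ("all_purpose", "standard"): 0.40,
    ("all_purpose", "premium"): 0.55,
    ("jobs", "standard"): 0.15,
    ("jobs", "premium"): 0.30,
    ("light_jobs", "standard"): 0.07,
    ("light_jobs", "premium"): 0.22,
}


def estimate_cost(workload, tier, nodes, dbu_per_node_hour, vm_price_per_hour, seconds):
    """Estimate the total cost (DBU charge plus VM charge) of one cluster run.

    dbu_per_node_hour and vm_price_per_hour are assumed inputs; look them up
    for the VM instance you actually use.
    """
    hours = seconds / 3600  # usage is billed per second
    dbu_cost = DBU_RATES[(workload, tier)] * dbu_per_node_hour * nodes * hours
    vm_cost = vm_price_per_hour * nodes * hours
    return round(dbu_cost + vm_cost, 2)


# Hypothetical run: 4 nodes, 0.75 DBU per node-hour, $0.30 per VM-hour, 2 hours.
jobs_cost = estimate_cost("jobs", "premium", 4, 0.75, 0.30, 2 * 3600)
interactive_cost = estimate_cost("all_purpose", "premium", 4, 0.75, 0.30, 2 * 3600)
print(f"jobs compute: ${jobs_cost}, all-purpose compute: ${interactive_cost}")
```

Running the same hypothetical workload on jobs compute rather than all-purpose compute lowers only the DBU portion of the bill; the VM portion is unchanged.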
The DBU pre-purchase plan provides savings of up to 37% compared to the pay-as-you-go model. Customers pre-purchase DBUs as a set of Databricks Commit Units (DBCUs) for a one- or three-year period. A DBCU normalizes the customer’s Databricks usage into a single purchase that covers DBU usage across all tiers and workloads.
Here are the prices for DBCU purchases in the Central US region, showing the discounts compared to on-demand pricing:
For one-year and three-year DBU pre-purchase plans:
- 25,000 DBCUs: One-year plan = $23,500 (6% discount)
- 50,000 DBCUs: One-year plan = $46,000 (8% discount)
- 75,000 DBCUs: Three-year plan = $69,000 (8% discount)
- 100,000 DBCUs: One-year plan = $89,500 (11% discount)
- 150,000 DBCUs: Three-year plan = $135,000 (10% discount)
- 200,000 DBCUs: Three-year plan = $172,500 (14% discount)
- 300,000 DBCUs: One-year plan = $261,000 (13% discount)
- 350,000 DBCUs: One-year plan = $287,000 (18% discount)
- 500,000 DBCUs: One-year plan = $400,000 (20% discount)
- 600,000 DBCUs: Three-year plan = $504,000 (16% discount)
- 750,000 DBCUs: One-year plan = $577,500 (23% discount)
- 1,000,000 DBCUs: One-year plan = $730,000 (27% discount)
- 1,050,000 DBCUs: Three-year plan = $819,000 (22% discount)
- 1,500,000 DBCUs: One-year plan = $1,050,000 (30% discount), Three-year plan = $1,140,000 (24% discount)
- 2,000,000 DBCUs: One-year plan = $1,340,000 (33% discount)
- 2,250,000 DBCUs: Three-year plan = $1,642,500 (27% discount)
- 3,000,000 DBCUs: Three-year plan = $2,070,000 (31% discount)
- 4,500,000 DBCUs: Three-year plan = $2,970,000 (34% discount)
- 6,000,000 DBCUs: Three-year plan = $3,780,000 (37% discount)
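The listed discounts are consistent with treating each DBCU as one dollar of pay-as-you-go usage, which is an assumption inferred from the figures rather than something stated here. Under that assumption, the sketch below recomputes the discount for a few tiers and picks the largest tier that does not exceed an expected annual usage. The function names and the subset of tiers are illustrative only.

```python
# A few (DBCUs: price in USD) pairs copied from the Central US list above.
ONE_YEAR = {25_000: 23_500, 100_000: 89_500, 500_000: 400_000, 2_000_000: 1_340_000}
THREE_YEAR = {150_000: 135_000, 600_000: 504_000, 6_000_000: 3_780_000}


def discount_percent(dbcus, price):
    """Discount versus pay-as-you-go, assuming 1 DBCU buys $1 of usage."""
    return (1 - price / dbcus) * 100


def largest_tier_within(plan, expected_usage):
    """Return the largest tier in `plan` that does not exceed expected usage.

    Returns None when even the smallest tier is larger than expected usage,
    in which case pay-as-you-go may be the safer choice.
    """
    fitting = [dbcus for dbcus in plan if dbcus <= expected_usage]
    return max(fitting) if fitting else None


for dbcus, price in sorted(ONE_YEAR.items()):
    print(f"{dbcus:>9,} DBCUs: {discount_percent(dbcus, price):.1f}% discount")

tier = largest_tier_within(ONE_YEAR, expected_usage=120_000)
print("Suggested one-year tier:", tier)
```

The unrounded values differ slightly from the whole-number percentages in the list (for example, the 100,000 DBCU tier works out to 10.5%). Choosing a tier at or below expected usage is a conservative design choice that avoids committing to more than you expect to consume.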
9 Tips for Azure Databricks Cost Optimization
Azure Databricks can be a powerful tool for big data processing and machine learning workloads, but it’s essential to optimize costs while using the platform. Here are some strategies and best practices for cost optimization in Azure Databricks:
- Choose the right pricing tier: Assess your organization’s needs and choose between the Standard and Premium tiers of Azure Databricks. Select the tier that provides the necessary features and capabilities without incurring unnecessary costs.
- Autoscaling: Enable autoscaling for your Databricks clusters to adjust the number of worker nodes dynamically based on the workload. Autoscaling helps ensure you’re using the optimal number of resources for your tasks, reducing costs by not over-provisioning resources.
- Terminate idle clusters: Set up cluster termination policies to automatically terminate clusters after a period of inactivity. This prevents paying for unused resources when clusters are idle.
- Use spot instances: Leverage Azure Spot Virtual Machines for your Databricks clusters to reduce compute costs. Spot VMs are unused Azure capacity offered at a discounted rate, although Azure can reclaim them at short notice. They are suitable for fault-tolerant workloads or development and testing environments.
- Optimize data storage: Use efficient data formats like Parquet or Delta Lake, which offer built-in compression and reduce storage costs. Additionally, store your data in cost-effective storage services like Azure Blob Storage or Azure Data Lake Storage.
- Cache data: Cache frequently accessed data in memory to reduce the number of disk I/O operations and improve performance. Faster execution can lead to reduced cluster runtime and lower costs.
- Optimize queries and transformations: Write efficient queries and transformations to minimize data shuffling and reduce the execution time of your jobs. Use techniques like partition pruning, predicate pushdown, and broadcast joins to improve performance.
- Monitor and analyze usage: Regularly monitor and analyze your Databricks usage using Azure Cost Management, Azure Monitor, and Databricks usage metrics. Identify areas of inefficiency and opportunities for cost optimization.
- Use reserved instances: If you have predictable, long-term workloads, consider using Azure Reserved VM Instances to reserve compute capacity in advance at a discounted rate.
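Three of the tips above (autoscaling, terminating idle clusters, and spot instances) are typically applied in the cluster definition itself. The sketch below builds such a definition as a Python dictionary. The field names are intended to follow the Databricks Clusters API, but treat them as assumptions and check the current API reference before use. The helper name, runtime version, and VM size are placeholders.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, idle_minutes, use_spot=True):
    """Build a cluster definition with autoscaling, auto-termination, and spot VMs."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    spec = {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",  # placeholder VM size
        # Autoscaling: worker count varies between these bounds with the workload.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Terminate the cluster after this many minutes of inactivity.
        "autotermination_minutes": idle_minutes,
    }
    if use_spot:
        spec["azure_attributes"] = {
            "first_on_demand": 1,  # keep the first node (the driver) on demand
            "availability": "SPOT_WITH_FALLBACK_AZURE",  # fall back if spot is unavailable
            "spot_bid_max_price": -1,  # assumed to mean "no explicit price cap"
        }
    return spec


spec = build_cluster_spec("nightly-etl", min_workers=2, max_workers=8, idle_minutes=30)
print(json.dumps(spec, indent=2))
```

The printed JSON could serve as the body of a cluster-creation request or as a starting point for a cluster policy. Keeping the driver on an on-demand VM is one way to limit the impact of spot evictions on a running job.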
By following these best practices and regularly reviewing your Azure Databricks usage, you can optimize costs while maintaining high performance for your big data and machine learning workloads.