Phase 8: Design compute configuration

In this phase, you design compute resources and workspace settings to optimize performance, cost, and security.

Databricks recommends using serverless compute as the primary option. Serverless requires no configuration, is always available, and scales automatically with workloads in seconds. Only configure classic compute manually if serverless does not support your use case.

Design cluster sizing strategy

For classic compute workloads, follow cluster sizing best practices to find a reasonable starting point for your workloads.

Cluster sizing considerations

  • Workload type: Batch processing typically benefits from larger clusters; interactive workloads benefit from autoscaling.
  • Data volume: Size clusters based on expected data volume and parallelism requirements.
  • Performance requirements: Balance between cost and query latency.
  • Autoscaling: Enable autoscaling for workloads with variable demand.
  • Instance types: Choose instance types based on CPU, memory, and I/O requirements.

Cluster sizing patterns

  • Small clusters (2-8 nodes): Development, testing, and small datasets.
  • Medium clusters (8-32 nodes): Production ETL and analytics workloads.
  • Large clusters (32+ nodes): Large-scale batch processing and machine learning training.

Best practices for cluster sizing

  • Start with a baseline configuration and iterate based on performance metrics.
  • Use autoscaling to handle variable workloads efficiently.
  • Monitor cluster utilization metrics (for example, CPU, memory, I/O) to right-size clusters.
  • Use spot/preemptible instances for fault-tolerant workloads to reduce cost.
  • Document cluster sizing decisions and performance baselines.
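The sizing tiers and considerations above can be sketched as a simple starting-point heuristic. This is an illustrative assumption, not Databricks guidance: the gigabyte thresholds, the `suggest_cluster_size` function, and the workload labels are all hypothetical, and any real baseline should be refined against your own utilization metrics.

```python
def suggest_cluster_size(data_volume_gb: float, workload: str) -> dict:
    """Map expected data volume onto the small/medium/large tiers above.

    The GB thresholds and workload labels are illustrative assumptions;
    tune them against measured CPU, memory, and I/O utilization.
    """
    if data_volume_gb < 100:
        min_workers, max_workers = 2, 8    # small: dev, test, small datasets
    elif data_volume_gb < 1000:
        min_workers, max_workers = 8, 32   # medium: production ETL and analytics
    else:
        min_workers, max_workers = 32, 64  # large: large-scale batch and ML training
    # Interactive workloads have variable demand, so let them autoscale
    # across the range; batch jobs can pin closer to the upper bound.
    autoscale = workload == "interactive"
    return {"min_workers": min_workers,
            "max_workers": max_workers,
            "autoscale": autoscale}

print(suggest_cluster_size(500, "interactive"))
```

Treat the result as the baseline configuration to iterate on, not a final answer.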

For detailed cluster sizing guidance, see Compute configuration recommendations.

Design SQL warehouse sizing strategy

While data science teams are usually small, Business Intelligence (BI) use cases often involve thousands of analysts. To support this many users, Databricks SQL uses an auto-scaling mechanism that adds or removes clusters based on three main factors:

  • Query throughput: How many queries are currently running?
  • Queue size: How many queries are queued, waiting for capacity?
  • Predicted demand: The estimated workload for the next two minutes.

Essentially, Databricks adds clusters when it calculates that the current hardware cannot process the existing and upcoming queries quickly enough.

Choose the right SQL warehouse configuration

Selecting the right setup involves balancing compute size (power) and compute count (concurrency).

Picking the size (XS to XL)

The "T-shirt size" of your warehouse determines how much compute power is available per query.

  • Small tiers (XS, S): Best for simple, fast queries. Most cost-effective for basic dashboards.
  • Large tiers (L, XL): Necessary for complex, heavy queries that process massive datasets to prevent performance bottlenecks.

Picking the count (concurrency)

The number of clusters in the warehouse determines how many simultaneous users you can support.

  • Rule of thumb: Plan for roughly 10 concurrent queries per cluster.
  • High concurrency: If you have many users running small queries, use many small clusters.
  • Low concurrency: If you have a few users running massive queries, use a few large clusters.

Summary: Increase size to make single queries faster. Increase count to handle more users at the same time.
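The rule of thumb above (roughly 10 concurrent queries per cluster) gives a quick way to estimate a starting cluster count for a classic SQL warehouse. The helper below is a minimal sketch of that arithmetic; the function name and the example peak-concurrency figure are assumptions for illustration, and autoscaling will adjust the actual count at runtime.

```python
import math

def warehouse_cluster_count(peak_concurrent_queries: int,
                            queries_per_cluster: int = 10) -> int:
    """Estimate a starting cluster count from the ~10-queries-per-cluster
    rule of thumb. Always at least one cluster; real demand varies, so
    treat this only as an initial setting for autoscaling bounds."""
    return max(1, math.ceil(peak_concurrent_queries / queries_per_cluster))

# Example: ~35 queries running simultaneously at peak.
print(warehouse_cluster_count(35))
```

Pair this count with the T-shirt size chosen for query complexity: size for the heaviest common query, count for the number of users.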

Best practices for SQL warehouse sizing

  • Start with serverless SQL warehouses (no sizing required).
  • For classic SQL warehouses, begin with medium size and adjust based on query patterns.
  • Monitor query performance and queueing metrics.
  • Use multiple warehouses for different use cases (ad-hoc vs reporting).
  • Document performance SLAs for different user groups.

Design cluster policy strategy

With cluster policies, Databricks admins can control many aspects of the clusters that are spun up. Cluster policies are recommended for all organizations. Common patterns include:

Cluster policy use cases

  • Limit users to prescribed settings: Including available instance types, Databricks versions, and instance sizes.
  • Simplify user interface: By fixing and hiding some values.
  • Control costs: By limiting the maximum cost per cluster.
  • Enforce compliance: By requiring external metastores or specific cluster tags to comply with corporate policies.

Cluster policy patterns

  • Development policy: Small, cost-effective clusters for development and testing.
  • Production policy: Larger, more powerful clusters with specific instance types and tags.
  • ML policy: GPU-enabled clusters with ML runtimes.
  • Spot/preemptible policy: Use spot instances for fault-tolerant workloads.
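A development policy like the one listed above can be expressed as a policy definition document. The sketch below follows the general shape of the Databricks policy-definition format (attributes with `fixed`, `allowlist`, and `range` types), but the specific instance types, limits, and tag values are illustrative assumptions; substitute the instance families and bounds appropriate to your cloud and teams.

```python
import json

# Illustrative development cluster policy: small, cost-effective clusters
# with a fixed runtime, capped autoscaling, enforced auto-termination,
# and a required team tag for cost attribution.
dev_policy = {
    "spark_version": {"type": "fixed", "value": "auto:latest-lts"},
    "node_type_id": {"type": "allowlist",
                     "values": ["i3.xlarge", "i3.2xlarge"]},  # example instance types
    "autoscale.min_workers": {"type": "fixed", "value": 1},
    "autoscale.max_workers": {"type": "range", "maxValue": 4},
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
    "custom_tags.team": {"type": "fixed", "value": "dev"},
}

print(json.dumps(dev_policy, indent=2))
```

Fixing and hiding values such as `autotermination_minutes` is also how policies simplify the cluster-creation UI for end users.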

Best practices for cluster policies

  • Create separate policies for different teams or use cases.
  • Use cluster policies to enforce cost controls and resource limits.
  • Require tags on all clusters for cost attribution.
  • Simplify the UI by hiding unnecessary configuration options.
  • Document cluster policy purposes and restrictions.

For detailed cluster policy configuration, see Create and manage compute policies.

Design budget policy strategy

Budget policies consist of tags that are applied to any serverless compute activity incurred by a user assigned to the policy. The tags are logged in your billing records, allowing you to attribute select serverless usage to specific budgets.

Budget policy use cases

  • Attribute serverless compute costs to specific departments or projects.
  • Track costs for different environments (for example, dev, staging, production).
  • Monitor spending against budget limits.
  • Generate cost reports by business unit or cost center.

Best practices for budget policies

  • Create budget policies for each department or project.
  • Use consistent tagging schemes across all compute resources.
  • Monitor budget usage through system tables and dashboards.
  • Set up alerts for budget thresholds.
  • Review and adjust budget allocations quarterly.

After a policy is applied to a notebook, job, or Lakeflow Spark Declarative Pipelines pipeline, any tags contained in the policy are propagated to the system.billing.usage system table in the custom_tags column.
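Because those tags land in the custom_tags column, attributing serverless cost to a budget is a simple group-by over usage records. The sketch below works on in-memory rows standing in for query results from system.billing.usage; the sample figures and the `department` tag key are assumptions for illustration.

```python
from collections import defaultdict

# Stand-ins for rows queried from system.billing.usage; in practice
# these would come from a SQL query over the system table.
usage_rows = [
    {"usage_quantity": 12.5, "custom_tags": {"department": "finance"}},
    {"usage_quantity": 4.0,  "custom_tags": {"department": "marketing"}},
    {"usage_quantity": 7.5,  "custom_tags": {"department": "finance"}},
    {"usage_quantity": 3.0,  "custom_tags": {}},  # usage with no budget policy tag
]

# Sum usage per department, bucketing untagged usage separately so
# gaps in tagging coverage are visible in the report.
usage_by_department = defaultdict(float)
for row in usage_rows:
    dept = row["custom_tags"].get("department", "untagged")
    usage_by_department[dept] += row["usage_quantity"]

print(dict(usage_by_department))
```

Surfacing an explicit "untagged" bucket makes it easy to spot compute that is not yet covered by any budget policy.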

Monitor usage and costs

Import pre-built usage dashboards into your workspaces to monitor account- and workspace-level usage.

Usage monitoring best practices

  • Import and customize pre-built usage dashboards.
  • Monitor usage trends by workspace, user, and compute type.
  • Set up alerts for unusual spending patterns.
  • Review usage reports monthly with finance teams.
  • Use system tables for custom usage analysis.

For usage dashboard templates, see Monitor account activity with system tables.

Design access control strategy

In a default Databricks installation, all users can create and modify workspace objects until an administrator enables workspace access control. Enable it to segregate access to workspace resources between teams or roles.

Access control patterns

  • Permissive (default): All users can create clusters, jobs, and notebooks.
  • Restricted: Users require explicit permissions to create workspace objects.
  • Segregated: Different teams have access to different workspace resources.

Best practices for workspace access control

  • Enable workspace access control for production environments.
  • Use groups to manage permissions rather than individual users.
  • Implement least-privilege access (grant only necessary permissions).
  • Review and audit workspace permissions regularly.
  • Document access control policies in your runbook.

For detailed access control configuration, see Access control lists.

Review workspace settings

The workspace settings page in the admin console contains many important settings, a number of which are not covered by APIs (and thus cannot be automated). Review all of these settings before a workspace is made production-ready.

Download the Security Best Practices guide from https://www.databricks.com/trust/security-features/best-practices and implement the suggestions accordingly.

Critical workspace settings to review

  • Access/Visibility Control: Enabled by default. Controls workspace, cluster, pool, and jobs visibility.
  • Table Access Control: Disabled by default. Consider leaving this disabled and using Unity Catalog when fine-grained table ACLs are required.
  • Enforce User Isolation: Disabled by default. Enable to avoid using "No Isolation" clusters.
  • Container Services: Disabled by default. Enable to allow custom Docker containers.
  • Repos Git Allow Lists: Disabled by default. Consider enabling to limit repositories users can access.
  • Exfiltration Protections: Consider disabling the following features to reduce the risk of data exfiltration:
    • Download Button for Notebook Results
    • Upload Data using UI
    • Notebook Exporting
    • Notebook Table Clipboard Feature
    • MLflow Run Artifact Download
  • Interactive Notebook Results Storage: Enable to store notebook results in your own cloud account rather than in Databricks-managed storage.

Best practices for workspace settings

  • Review all workspace settings before production deployment.
  • Enable features based on security and compliance requirements.
  • Disable features that could enable data exfiltration for sensitive workspaces.
  • Document workspace settings and rationale in your runbook.
  • Use IaC (Terraform) to manage workspace settings where possible.

Workspace compute recommendations

Recommended

  • Use serverless compute as the primary option for SQL, notebooks, jobs, and Lakeflow Spark Declarative Pipelines pipelines.
  • Create initial configuration for clusters and SQL warehouses, then refine based on realistic loads.
  • Keep cost/performance tradeoff in mind when designing for capacity.
  • Use cluster policies to restrict permissions and enforce reasonable cluster sizes.
  • Use budget policies to attribute serverless costs to departments or projects.
  • Monitor usage through system tables and pre-built dashboards.
  • Enable workspace access control for production environments.
  • Carefully review workspace settings before production deployment.

Avoid these patterns

  • Do not manually create clusters without cluster policies in production.
  • Do not allow unlimited cluster sizes or instance types without controls.
  • Do not skip testing with realistic workloads before finalizing cluster sizes.
  • Avoid enabling features that could permit data exfiltration without a security review.
  • Do not deploy to production without reviewing all workspace settings.

Phase 8 outcomes

After completing Phase 8, you should have:

  • Compute strategy defined (serverless vs classic compute for different workloads).
  • Cluster sizing strategy designed for classic compute workloads.
  • SQL warehouse sizing strategy designed for BI and analytics workloads.
  • Cluster policy strategy designed with policies for different teams or use cases.
  • Budget policy strategy designed for serverless cost attribution.
  • Usage monitoring approach defined with dashboards and alerts.
  • Access control strategy designed for workspace objects.
  • Workspace settings reviewed and documented for production deployment.
  • Cost optimization strategy defined (for example, spot instances, autoscaling, right-sizing).

Next phase: Phase 9: Design observability strategy

Implementation guidance: For step-by-step instructions to implement your compute configuration, see Compute.