Databricks data science and ML capabilities
Databricks provides a unified platform for the full data science (DS) and machine learning (ML) lifecycle, from raw data ingestion through feature engineering, model training, deployment, and production monitoring. Databricks integrates with popular open-source ML frameworks and adds enterprise-grade governance, observability, and operational tooling, collectively known as MLOps.
This page lists major DS and ML capabilities, organized by workflow stage.
Exploratory data analysis
Databricks simplifies exploratory data analysis (EDA) by providing interactive, collaborative, and AI-assisted tools for data scientists. Data scientists can explore data using natural language chat, UIs, or code, and they can collaborate through real-time co-editing and Git-based code sharing. Genie Code can run fully automated EDA or act as an interactive assistant.
| Category | Features |
|---|---|
| User interface | |
| Collaboration | |
| AI assistants | |
Prepare and serve features
Databricks simplifies data preparation for ML by unifying governance of data and ML workloads. With all data managed under Unity Catalog with fine-grained access controls, you can adjust data engineering and ML boundaries to fit your organization. You can prepare data for ML using any data engineering tool, such as Lakeflow Spark Declarative Pipelines. Features are managed in a Feature Store, which acts as a single, governed source of truth for both batch and real-time serving.
Genie Code accelerates data discovery and preparation by browsing Unity Catalog to discover relevant tables, suggesting feature transformations, and generating code for ingestion and feature pipelines.
| Feature type | Features |
|---|---|
| Batch features | |
| Real-time features | Declarative features provide a new API for defining features, which can then be used for batch or real-time feature computation. |
| Unstructured data | AI Search allows serving unstructured data and running semantic search. |
Train ML models
Databricks provides flexible tools for training ML and deep learning models. Pre-configured and customizable environments let you use custom ML libraries, and serverless CPU and GPU-accelerated compute resources let you scale up and scale out on demand. Genie Code provides intelligent AutoML, taking natural language requests and building full multi-notebook workflows for featurization, training, tuning, evaluation, and deployment.
| Category | Features |
|---|---|
| Types of ML | Databricks supports all types of ML. For generative AI, see Databricks generative AI capabilities. |
| Compute | |
| Environments and libraries | |
| AI coding assistants | |
Track and manage experiments
Databricks-managed MLflow provides the foundation for reproducible, auditable ML development. Its integrations with Unity Catalog and Git provide tracking and lineage for data and code assets. Each model version in the registry links back to the training run, dataset, environment, and Git commit that produced it, providing a complete audit trail for any deployed model.
| Category | Features |
|---|---|
| Experiment tracking | MLflow tracking logs parameters, metrics, and artifacts for every training run. Compare runs in the MLflow UI to identify the best-performing configuration. |
| Model registry | Models in Unity Catalog provides an MLflow model registry integrated with Unity Catalog. Versioned model artifacts are governed with lifecycle aliases ( |
| Reproducibility | Notebooks and code can be versioned using Databricks Git folders and integrated with any Git provider. |
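
As a rough illustration of the experiment tracking row above, the following sketch logs one run's parameters and metrics with the open-source MLflow Python API, then picks the best run from a list of results. The helper names `log_run` and `best_run`, the `val_auc` metric, and the run records are hypothetical and use made-up numbers. `log_run` assumes MLflow is installed and a tracking location is configured, so it is defined but not called here.

```python
from typing import Dict, List


def log_run(params: Dict[str, float], metrics: Dict[str, float]) -> str:
    """Log one training run to MLflow and return its run ID (hypothetical helper)."""
    # Imported lazily so best_run() below can be tried without MLflow installed.
    import mlflow

    with mlflow.start_run() as run:
        mlflow.log_params(params)    # for example, hyperparameters
        mlflow.log_metrics(metrics)  # for example, validation scores
        return run.info.run_id


def best_run(runs: List[dict], metric: str, higher_is_better: bool = True) -> dict:
    """Return the run record with the best value for `metric`."""
    choose = max if higher_is_better else min
    return choose(runs, key=lambda r: r["metrics"][metric])


# Hypothetical results from three training runs (made-up numbers).
runs = [
    {"params": {"max_depth": 3}, "metrics": {"val_auc": 0.81}},
    {"params": {"max_depth": 5}, "metrics": {"val_auc": 0.86}},
    {"params": {"max_depth": 8}, "metrics": {"val_auc": 0.84}},
]

winner = best_run(runs, "val_auc")
print(winner["params"])
```

In practice the comparison would usually happen in the MLflow UI, as the table describes. The `best_run` helper only mirrors that idea in plain Python.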
Deploy and serve models
Databricks supports both batch inference and real-time serving. Batch inference applies models efficiently to large datasets, whereas real-time serving provides models as low-latency API endpoints. Genie Code can both generate code for model deployment and diagnose issues and performance for model serving endpoints.
| Serving pattern | Features |
|---|---|
| Batch inference | |
| Real-time serving | Model Serving provides low-latency, high-uptime managed REST endpoints with serverless autoscaling. It supports CPU and GPU serving for any ML framework, and you can use Genie Code to assess and troubleshoot serving endpoints. |
| SQL-native inference | |
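
To make the real-time serving row more concrete, here is a minimal sketch of how a client might build a scoring request for a managed REST endpoint. The workspace URL, endpoint name, token placeholder, and feature columns are all hypothetical. The `/serving-endpoints/<name>/invocations` path and the `dataframe_split` payload shape are assumptions about the request format, so check them against your endpoint's documentation. The request is built but not sent.

```python
import json
import urllib.request


def build_invocation_request(host: str, endpoint: str, token: str,
                             columns: list, rows: list) -> urllib.request.Request:
    """Build (but do not send) a scoring request for a serving endpoint."""
    # Tabular input expressed as column names plus rows of values (assumed format).
    payload = {"dataframe_split": {"columns": columns, "data": rows}}
    return urllib.request.Request(
        url=f"{host}/serving-endpoints/{endpoint}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace URL, endpoint name, token, and feature values.
req = build_invocation_request(
    host="https://example.cloud.databricks.com",
    endpoint="churn-model",
    token="<personal-access-token>",
    columns=["tenure_months", "monthly_spend"],
    rows=[[12, 49.9], [3, 19.0]],
)
print(req.get_method(), req.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(req)`. It is omitted so the sketch runs without a workspace.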
Evaluate and monitor
Databricks provides flexible evaluation during training and continuous monitoring in production. Real-time serving logs requests and responses to inference tables governed in Unity Catalog, and data quality monitoring adds custom metrics, dashboards, and alerts.
| Category | Features |
|---|---|
| Evaluation | |
| Prediction logging | Inference Tables log serving requests and responses, enabling monitoring, analytics, and training set construction. |
| Monitoring and alerts | |
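
As one way to picture a custom metric with an alert, the sketch below computes an error rate over logged serving requests and compares it with a threshold. The records are hypothetical stand-ins for rows read from an inference table. The `status_code` field name, the sample values, and the 5% threshold are assumptions for illustration, not a description of the built-in monitoring.

```python
from typing import Iterable, Mapping


def error_rate(records: Iterable[Mapping]) -> float:
    """Fraction of logged requests whose status code is not 200."""
    records = list(records)
    if not records:
        return 0.0  # no traffic, so nothing to report
    failures = sum(1 for r in records if r["status_code"] != 200)
    return failures / len(records)


def should_alert(records: Iterable[Mapping], threshold: float = 0.05) -> bool:
    """Return True when the error rate exceeds the (hypothetical) threshold."""
    return error_rate(records) > threshold


# Hypothetical logged requests (made-up values).
logged = [
    {"status_code": 200},
    {"status_code": 200},
    {"status_code": 500},
    {"status_code": 200},
]
print(error_rate(logged), should_alert(logged))
```

In a workspace, the same calculation would more likely be a query over the inference table, with the alert configured in the monitoring tooling rather than in application code.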
MLOps and governance
Databricks provides a full suite of tools for ML operations (MLOps) and governance. MLOps Stacks provides templates for automated, repeatable promotion from development to production using infrastructure-as-code. Data, features, models, and endpoints are fully governed by Unity Catalog and AI Gateway.
| Category | Features |
|---|---|
| CI/CD for ML | MLOps Stacks, built on Declarative Automation Bundles, provides code-based management and deployment of ML infrastructure and workflows. This includes CI/CD templates for automating training, evaluation, and deployment. |
| Workflow orchestration | Lakeflow Jobs orchestrates multi-step ML workflows as scheduled or triggered pipelines. |
| Data and model asset governance | Unity Catalog provides unified governance for data, features, and registered models. Fine-grained access controls, lineage tracking, and audit logs apply to all assets. |
| Model endpoint governance | AI Gateway provides centralized governance and monitoring for model endpoints, including rate limits, usage tracking, and payload logging. |
Open source support
Databricks provides full support for the open-source ML ecosystem.
You can use any open-source ML framework on Databricks: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, Hugging Face Transformers, Ray, and more. MLflow or your custom tools can store model artifacts in open formats that can be exported and run outside Databricks.
MLflow is an open-source project created by Databricks and used by 10,000+ organizations. Your experiment tracking data, model artifacts, and pipeline definitions are stored in open formats.
Data and AI governance are built on the open-source Unity Catalog APIs, and data storage is based on the open Delta Lake format. Your feature data and training datasets remain in open, portable files.