Databricks data science and ML capabilities

Databricks provides a unified platform for the full data science (DS) and machine learning (ML) lifecycle, from raw data ingestion through feature engineering, model training, deployment, and production monitoring. It integrates with popular open-source ML frameworks and adds enterprise-grade governance, observability, and operational tooling, collectively known as MLOps.

This page lists major DS and ML capabilities, organized by workflow stage.

Exploratory data analysis

Databricks simplifies exploratory data analysis (EDA) by providing interactive, collaborative, and AI-assisted tools for data scientists. Data scientists can explore data using natural language chat, UIs, or code, and they can collaborate using both real-time co-editing and Git-based code sharing. Genie Code can do fully automated EDA or act as an interactive assistant.

Category	Features
User interface	Notebooks provide collaborative spaces for exploration, visualization, and documentation for EDA. Dashboards provide SQL and visualization-based EDA. Genie Chat has a natural-language interface for asking data questions.
Collaboration	Notebooks, dashboards, and other workspace assets are all shareable and governed by workspace permissions. For example, see Collaborate using Databricks notebooks. Notebooks and Git folders allow Git-based versioning and collaboration.
AI assistants	Genie Code can perform fully automated EDA or act as an interactive assistant. Agent skills for AI coding assistants boost the performance of third-party assistants writing code for Databricks.

Prepare and serve features

Databricks simplifies data preparation for ML by unifying governance of data and ML workloads. With all data managed under Unity Catalog with fine-grained access controls, you can adjust data engineering and ML boundaries to fit your organization. You can prepare data for ML with any data engineering tool, such as Lakeflow pipelines. Features are managed in a Feature Store for both batch and real-time serving, which gives you a single, governed source of truth for features.

Genie Code accelerates data discovery and preparation by browsing Unity Catalog to discover relevant tables, suggesting feature transformations, and generating code for ingestion and feature pipelines.

Feature type	Features
Batch features	Feature tables in Unity Catalog store precomputed batch features with automatic lineage and governance. Teams discover and reuse existing features rather than rebuilding pipelines from scratch. Feature Views provide a new API for defining features that can then be used for batch or real-time feature computation.
Real-time features	Feature Views provide a new API for defining features that can then be used for batch or real-time feature computation.
Unstructured data	AI Search allows serving unstructured data and running semantic search.

Train ML models

Databricks has flexible tools for training ML and deep learning models. Pre-configured and customizable environments allow you to use custom ML libraries, and serverless CPU and GPU-accelerated compute resources allow scaling up and scaling out on demand. Genie Code provides intelligent AutoML, taking natural language requests and building full multi-notebook workflows for featurization, training, tuning, evaluation, and deployment.

Category	Features
Types of ML	Databricks supports all types of ML, including the following. Classic ML: supervised and unsupervised learning with scikit-learn, XGBoost, LightGBM, Apache Spark MLlib, and other ML frameworks. Deep learning: neural network training with PyTorch, TensorFlow, and Hugging Face Transformers, including distributed training across multiple GPUs. Hyperparameter tuning: automated search across algorithm and hyperparameter spaces using tools like Optuna and Ray. For generative AI, see Databricks generative AI capabilities.
Compute	Serverless compute starts instantly for interactive notebooks and scheduled workflows, with automatic scaling and no cluster management. It supports both CPU and GPU-accelerated workloads. Classic compute provides user-managed single-machine and multi-node clusters for both CPU and GPU workloads.
Environments and libraries	Serverless compute environments provide base environments that can be fully customized for ML. For classic compute, the Databricks Runtime for Machine Learning provides pre-configured cluster environments with major ML libraries pre-installed and tested together, for both CPU and GPU-accelerated clusters.
AI coding assistants	Genie Code can discover Unity Catalog data, generate ML notebooks, and troubleshoot pipelines. Agent skills for AI coding assistants boost the performance of third-party assistants writing code for Databricks.
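
To make the hyperparameter tuning row above more concrete, the following is a minimal sketch of random search in plain Python. It is a simplified stand-in for the kind of search that tools like Optuna and Ray automate, not their API. The objective `validation_loss`, the `SEARCH_SPACE` bounds, and the helper `random_search` are all hypothetical names chosen for illustration.

```python
import random

# Hypothetical search space: hyperparameter name -> (low, high) bounds.
SEARCH_SPACE = {"learning_rate": (0.001, 0.3), "subsample": (0.5, 1.0)}


def validation_loss(params):
    """Toy stand-in for training a model and returning its validation loss."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["subsample"] - 0.8) ** 2


def random_search(objective, space, n_trials=50, seed=0):
    """Sample n_trials configurations uniformly; return (best_params, best_loss, trials)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        trials.append((params, objective(params)))
    # The best trial is the one with the lowest loss.
    best_params, best_loss = min(trials, key=lambda trial: trial[1])
    return best_params, best_loss, trials


best_params, best_loss, trials = random_search(validation_loss, SEARCH_SPACE)
print(best_params, best_loss)
```

In practice, the objective would train and score a real model, and a dedicated tuning library would typically replace the uniform sampling with a smarter search strategy.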

Track and manage experiments

Databricks-managed MLflow provides the foundation for reproducible, auditable ML development. Its integrations with Unity Catalog and Git provide tracking and lineage for data and code assets. Each model version in the registry links back to the training run, dataset, environment, and Git commit that produced it, providing a complete audit trail for any deployed model.

Category	Features
Experiment tracking	MLflow tracking logs parameters, metrics, and artifacts for every training run. Compare runs in the MLflow UI to identify the best-performing configuration.
Model registry	Models in Unity Catalog provides an MLflow model registry integrated with Unity Catalog. Versioned model artifacts are governed with lifecycle aliases (`Staging`, `Production`), access control, lineage, and cross-workspace sharing.
Reproducibility	Notebooks and code can be versioned using Databricks Git folders and integrated with any Git provider.
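
As a rough illustration of experiment tracking, the sketch below logs parameters and metrics for a set of training runs with the open-source MLflow Python API and picks the best configuration locally. The `results` data and the helpers `pick_best` and `log_runs` are hypothetical; only `mlflow.set_experiment`, `mlflow.start_run`, `mlflow.log_params`, and `mlflow.log_metrics` are MLflow calls.

```python
def pick_best(results, metric="rmse"):
    """Return the (params, metrics) pair with the lowest value of `metric`."""
    return min(results, key=lambda result: result[1][metric])


def log_runs(results, experiment_name=None):
    """Log each (params, metrics) pair as its own MLflow run and return the run IDs."""
    import mlflow  # imported here so the rest of the sketch runs without MLflow installed

    if experiment_name:
        mlflow.set_experiment(experiment_name)
    run_ids = []
    for params, metrics in results:
        with mlflow.start_run() as run:
            mlflow.log_params(params)    # e.g. hyperparameters
            mlflow.log_metrics(metrics)  # e.g. validation scores
            run_ids.append(run.info.run_id)
    return run_ids


# Hypothetical results from two training runs.
results = [
    ({"max_depth": 4, "n_estimators": 100}, {"rmse": 0.92}),
    ({"max_depth": 8, "n_estimators": 200}, {"rmse": 0.81}),
]
best_params, best_metrics = pick_best(results)
print(best_params, best_metrics)
```

Calling `log_runs(results)` in an environment with MLflow configured would record one run per configuration, which you could then compare in the MLflow UI instead of with the local `pick_best` helper.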

Deploy and serve models

Databricks supports both batch inference and real-time serving. Batch inference applies models efficiently to large datasets, whereas real-time serving provides models as low-latency API endpoints. Genie Code can both generate code for model deployment and diagnose issues and performance for model serving endpoints.

Serving pattern	Features
Batch inference	`ai_query` provides efficient batch inference for custom models deployed as Model Serving endpoints. You can also use custom code with Apache Spark UDFs or `mlflow.pyfunc` for batch inference.
Real-time serving	Model Serving provides low-latency, high-availability managed REST endpoints with serverless autoscaling. It supports CPU and GPU serving for any ML framework, and you can use Genie Code to assess and troubleshoot serving endpoints.
SQL-native inference	AI functions provide SQL-accessible ML predictions for forecasting, anomaly detection, and driver analysis, with no Python or model deployment required. For custom models, the AI function `ai_query` provides efficient batch inference backed by Model Serving endpoints.
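
For real-time serving, a client typically sends a JSON scoring request to the endpoint's REST API. The sketch below builds such a request with the Python standard library but does not send it. It assumes the endpoint accepts the `dataframe_split` input format and bearer-token authentication; the workspace URL, endpoint name, token, and column names are placeholders, and `build_invocation_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_invocation_request(workspace_url, endpoint_name, token, columns, rows):
    """Build (but do not send) a scoring request for a Model Serving endpoint."""
    url = f"{workspace_url.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    # Assumed input format: column names plus one list of values per row.
    payload = {"dataframe_split": {"columns": columns, "data": rows}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: replace with your workspace URL, endpoint name, and token.
request = build_invocation_request(
    "https://example.cloud.databricks.com",
    "churn-model",
    "<personal-access-token>",
    columns=["tenure_months", "monthly_charges"],
    rows=[[12, 70.5], [48, 20.0]],
)
print(request.full_url)
# To send it, pass the request to urllib.request.urlopen and parse the JSON response body.
```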

Evaluate and monitor

Databricks provides flexible evaluation for training and continuous monitoring for production. Real-time serving logs to inference tables governed in Unity Catalog, and data quality monitoring adds custom metrics, dashboards, and alerts.

Category	Features
Evaluation	MLflow ML evaluation can be used to define metrics to log to MLflow, or MLflow tracking can log metrics computed using your custom framework. Genie Code can assist in selecting evaluation metrics and writing evaluation code.
Prediction logging	Inference Tables log serving requests and responses, enabling monitoring, analytics, and training set construction.
Monitoring and alerts	Data quality monitoring tracks data quality, drift, and custom metrics, with built-in anomaly detection and data profiling. Data quality monitoring provides a monitoring UI, and you can build custom dashboards from monitoring tables. You can set alerts for anomaly detection to escalate incidents quickly.
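
To give a sense of what a drift metric measures, the sketch below computes the population stability index (PSI) between a baseline sample and a current sample in plain Python. This is a generic illustration of one common drift statistic, not the implementation that Databricks data quality monitoring uses. The function name, bin edges, and sample values are hypothetical.

```python
import math


def population_stability_index(baseline, current, bin_edges, eps=1e-6):
    """Compute PSI between two numeric samples over shared bins.

    Values outside the bin edges are ignored. Empty bins are floored at `eps`
    so that the logarithm stays defined.
    """
    def proportions(values):
        counts = [0] * (len(bin_edges) - 1)
        for value in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                in_bin = bin_edges[i] <= value < bin_edges[i + 1]
                # The last bin also includes its right edge.
                if in_bin or (is_last and value == bin_edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts) or 1
        return [max(count / total, eps) for count in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


edges = [0.0, 0.5, 1.0]
baseline = [0.1, 0.2, 0.4, 0.5, 0.6, 0.9]
current = [0.6, 0.7, 0.8, 0.9, 0.9, 1.0]

psi_same = population_stability_index(baseline, baseline, edges)
psi_shifted = population_stability_index(baseline, current, edges)
print(psi_same, psi_shifted)
```

A PSI of zero means the two binned distributions match, and larger values indicate a larger shift. A metric like this could be logged as a custom metric and used as an alert threshold.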

MLOps and governance

Databricks provides a full suite of tools for ML operations (MLOps) and governance. MLOps Stacks provides templates for enabling automated, repeatable promotion from development to production using infrastructure-as-code. Data, features, models, and endpoints are fully governed by Unity Catalog and AI Gateway.

Category	Features
CI/CD for ML	MLOps Stacks, built on Declarative Automation Bundles, provides code-based management and deployment of ML infrastructure and workflows. This includes CI/CD templates for automating training, evaluation, and deployment.
Workflow orchestration	Lakeflow Jobs orchestrates multi-step ML workflows as scheduled or triggered pipelines.
Data and model asset governance	Unity Catalog provides unified governance for data, features, and registered models. Fine-grained access controls, lineage tracking, and audit logs apply to all assets.
Model endpoint governance	AI Gateway provides centralized governance and monitoring for model endpoints, including rate limits, usage tracking, and payload logging.

Open source support

Databricks provides full support for the open-source ML ecosystem.

You can use any open-source ML framework on Databricks: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, Hugging Face Transformers, Ray, and more. MLflow or your custom tools can store model artifacts in open formats that can be exported and run outside Databricks.

MLflow is an open-source project created by Databricks and used by 10,000+ organizations. Your experiment tracking data, model artifacts, and pipeline definitions are stored in open formats.

Data and AI governance are built upon the open-source Unity Catalog APIs, and data storage is based upon the open Delta Lake format. Your feature data and training datasets remain in open, portable files.