
Databricks data science and ML capabilities

Databricks has a unified platform for the full data science (DS) and machine learning (ML) lifecycle, from raw data ingestion through feature engineering, model training, deployment, and production monitoring. Databricks integrates with popular open-source ML frameworks, adding enterprise-grade governance, observability, and operational tooling, collectively known as MLOps.

This page lists major DS and ML capabilities, organized by workflow stage.

Exploratory data analysis

Databricks simplifies exploratory data analysis (EDA) by providing interactive, collaborative, and AI-assisted tools for data scientists. Data scientists can explore data using natural language chat, UIs, or code, and they can collaborate using both real-time co-editing and Git-based code sharing. Genie Code can do fully automated EDA or act as an interactive assistant.

Category

Features

User interface

  • Notebooks provide collaborative spaces for exploration, visualization, and documentation for EDA.
  • Dashboards provide SQL and visualization-based EDA.
  • Genie Chat has a natural-language interface for asking data questions.

Collaboration

AI assistants

Prepare and serve features

Databricks simplifies data preparation for ML by unifying governance of data and ML workloads. With all data managed under Unity Catalog with fine-grained access controls, you can adjust data engineering and ML boundaries to fit your organization. Data can be prepared for ML using any data engineering tool, such as Lakeflow Spark Declarative Pipelines. Features are managed in a Feature Store for both batch and real-time serving, which gives you a single, governed source of truth for features.

Genie Code accelerates data discovery and preparation by browsing Unity Catalog to discover relevant tables, suggesting feature transformations, and generating code for ingestion and feature pipelines.

Feature type

Features

Batch features

  • Feature tables in Unity Catalog store precomputed batch features with automatic lineage and governance. Teams discover and reuse existing features rather than rebuilding pipelines from scratch.
  • Declarative features provide a new API for defining features which can then be used for batch or real-time feature computation.

Real-time features

Declarative features provide a new API for defining features which can then be used for batch or real-time feature computation.
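
As a rough illustration of defining a feature once and reusing it for both batch and real-time computation, the sketch below uses plain Python rather than any Databricks API. The `FeatureDef` class, the `total_spend_7d` feature, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class FeatureDef:
    """Hypothetical declarative feature: an aggregation over a trailing time window."""
    name: str
    window: timedelta
    aggregate: Callable[[List[float]], float]

    def compute(self, events: Iterable[Tuple[datetime, float]], as_of: datetime) -> float:
        # Keep only events inside the half-open window (as_of - window, as_of].
        values = [v for ts, v in events if as_of - self.window < ts <= as_of]
        return self.aggregate(values)


# One definition, reused by both code paths below.
total_spend_7d = FeatureDef("total_spend_7d", timedelta(days=7), sum)

# Made-up sample data: (event time, amount) per user.
events_by_user = {
    "u1": [(datetime(2024, 1, 1), 10.0), (datetime(2024, 1, 5), 5.0)],
    "u2": [(datetime(2023, 12, 1), 99.0)],
}
as_of = datetime(2024, 1, 6)

# Batch path: compute the feature for every user at once.
batch_features = {u: total_spend_7d.compute(ev, as_of) for u, ev in events_by_user.items()}

# Real-time path: compute the same feature for a single user at request time.
online_value = total_spend_7d.compute(events_by_user["u1"], as_of)
```

Because both paths call the same definition, the batch and online values for a given user and timestamp cannot drift apart, which is the point of a single source of truth for features.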

Unstructured data

AI Search serves unstructured data and supports semantic search over it.

Train ML models

Databricks has flexible tools for training ML and deep learning models. Pre-configured and customizable environments allow you to use custom ML libraries, and serverless CPU and GPU-accelerated compute resources allow scaling up and scaling out on demand. Genie Code provides intelligent AutoML, taking natural language requests and building full multi-notebook workflows for featurization, training, tuning, evaluation, and deployment.

Category

Features

Types of ML

Databricks supports all types of ML, including:

  • Classic ML: Supervised and unsupervised learning with scikit-learn, XGBoost, LightGBM, Apache Spark MLlib, and other ML frameworks
  • Deep learning: Neural network training with PyTorch, TensorFlow, and Hugging Face Transformers, including distributed training across multiple GPUs
  • Hyperparameter tuning: Automated search across algorithm and hyperparameter spaces using tools like Optuna and Ray
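
To make the hyperparameter tuning idea concrete, here is a minimal random-search sketch in plain Python. It is not the Optuna or Ray API; the `objective` function is a toy stand-in for a validation loss, and the parameter names and ranges are assumptions chosen for illustration.

```python
import random


def objective(learning_rate: float, max_depth: int) -> float:
    """Toy stand-in for a validation loss; a real objective would train and score a model."""
    return (learning_rate - 0.1) ** 2 + 0.01 * abs(max_depth - 6)


def random_search(n_trials: int, seed: int = 0):
    """Sample hyperparameters at random and keep the configuration with the lowest loss."""
    rng = random.Random(seed)  # seeded so the search is reproducible
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform over [0.001, 1]
            "max_depth": rng.randint(2, 12),            # inclusive integer range
        }
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(n_trials=50)
```

Dedicated tuning libraries typically replace the uniform sampling here with smarter strategies and can run trials in parallel, but the loop of sample, evaluate, and keep the best is the same.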

For generative AI, see Databricks generative AI capabilities.

Compute

  • Serverless compute starts instantly for interactive notebooks and scheduled workflows, with automatic scaling and no cluster management. It supports both CPU and GPU-accelerated clusters.
  • Classic compute provides single-machine and multi-node cluster options that you manage yourself, for both CPU and GPU workloads.

Environments and libraries

AI coding assistants

Track and manage experiments

Databricks-managed MLflow provides the foundation for reproducible, auditable ML development. Its integrations with Unity Catalog and Git provide tracking and lineage for data and code assets. Each model version in the registry links back to the training run, dataset, environment, and Git commit that produced it, providing a complete audit trail for any deployed model.

Category

Features

Experiment tracking

MLflow tracking logs parameters, metrics, and artifacts for every training run. Compare runs in the MLflow UI to identify the best-performing configuration.
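
The sketch below shows the shape of the data that experiment tracking records and how run comparison works. It is plain Python, not the MLflow API; the `Run` class, the `best_run` helper, and all values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """Hypothetical record of one training run: parameters, metrics, and artifact paths."""
    run_id: str
    params: Dict[str, float] = field(default_factory=dict)
    metrics: Dict[str, float] = field(default_factory=dict)
    artifacts: List[str] = field(default_factory=list)


# Made-up runs, as a tracking server might store them.
runs = [
    Run("run-1", {"max_depth": 4}, {"val_accuracy": 0.81}, ["model.pkl"]),
    Run("run-2", {"max_depth": 8}, {"val_accuracy": 0.86}, ["model.pkl"]),
    Run("run-3", {"max_depth": 12}, {"val_accuracy": 0.84}, ["model.pkl"]),
]


def best_run(candidates: List[Run], metric: str, higher_is_better: bool = True) -> Run:
    """Return the run with the best value for `metric`, ignoring runs that did not log it."""
    scored = [r for r in candidates if metric in r.metrics]
    if not scored:
        raise ValueError(f"no run logged metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=lambda r: r.metrics[metric])


winner = best_run(runs, "val_accuracy")
```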

Model registry

Models in Unity Catalog provides an MLflow model registry integrated with Unity Catalog. Versioned model artifacts are governed with lifecycle aliases (Staging, Production), access control, lineage, and cross-workspace sharing.
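
One way to picture versioned models with lifecycle aliases is as numbered versions plus named pointers that can be moved. The following is a plain-Python sketch of that idea, not the Unity Catalog or MLflow API; the `ModelRegistry` class, the model name, and the artifact paths are hypothetical.

```python
from typing import Dict, List, Tuple


class ModelRegistry:
    """Hypothetical in-memory registry: numbered versions plus movable aliases."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}       # model name -> artifact URIs
        self._aliases: Dict[Tuple[str, str], int] = {}  # (model name, alias) -> version

    def register(self, name: str, artifact_uri: str) -> int:
        versions = self._versions.setdefault(name, [])
        versions.append(artifact_uri)
        return len(versions)  # version numbers start at 1

    def set_alias(self, name: str, alias: str, version: int) -> None:
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self._aliases[(name, alias)] = version

    def resolve(self, name: str, alias: str) -> str:
        return self._versions[name][self._aliases[(name, alias)] - 1]


registry = ModelRegistry()
model = "main.ml.churn"  # hypothetical three-level name
v1 = registry.register(model, "artifacts/churn/v1")
v2 = registry.register(model, "artifacts/churn/v2")
registry.set_alias(model, "Production", v1)
registry.set_alias(model, "Staging", v2)

before = registry.resolve(model, "Production")
registry.set_alias(model, "Production", v2)  # promote version 2 by moving the alias
after = registry.resolve(model, "Production")
```

Consumers that load a model by alias pick up the promoted version without any change to their own code, since only the pointer moved.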

Reproducibility

Notebooks and code can be versioned using Databricks Git folders and integrated with any Git provider.

Deploy and serve models

Databricks supports both batch inference and real-time serving. Batch inference applies models efficiently to large datasets, whereas real-time serving provides models as low-latency API endpoints. Genie Code can both generate code for model deployment and diagnose issues and performance for model serving endpoints.

Serving pattern

Features

Batch inference

Real-time serving

Model Serving provides low-latency, high-uptime managed REST endpoints with serverless autoscaling. Model Serving supports CPU and GPU serving for any ML framework, and you can use Genie to assess and troubleshoot serving endpoints.

SQL-native inference

  • AI functions provide SQL-accessible ML predictions for forecasting, anomaly detection, and driver analysis, with no Python or model deployment required.
  • For custom models, the AI function ai_query provides efficient batch inference backed by Model Serving endpoints.

Evaluate and monitor

Databricks provides flexible evaluation for training and continuous monitoring for production. Real-time serving logs to inference tables governed in Unity Catalog, and data quality monitoring adds custom metrics, dashboards, and alerts.

Category

Features

Evaluation

  • Use MLflow ML evaluation to define metrics and log them to MLflow, or use MLflow tracking to log metrics computed with your own framework.
  • Genie Code can assist in selecting evaluation metrics and writing evaluation code.

Prediction logging

Inference Tables log serving requests and responses, enabling monitoring, analytics, and training set construction.
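
The following sketch illustrates the logging pattern and how logged requests can later become training data. It uses an in-memory list as a stand-in for an inference table; the `toy_model` function, the record fields, and the labels are assumptions for illustration, not the actual inference table schema.

```python
import json
import time
import uuid
from typing import Dict, List, Tuple

inference_log: List[Dict[str, object]] = []  # stand-in for an inference table


def toy_model(features: Dict[str, float]) -> int:
    """Hypothetical model: flags amounts above 100."""
    return 1 if features["amount"] > 100 else 0


def serve(features: Dict[str, float]) -> Tuple[str, int]:
    """Score one request and append the request/response pair to the log."""
    request_id = str(uuid.uuid4())
    prediction = toy_model(features)
    inference_log.append({
        "request_id": request_id,
        "timestamp": time.time(),
        "request": json.dumps(features),
        "response": json.dumps(prediction),
    })
    return request_id, prediction


id_a, pred_a = serve({"amount": 250.0})
id_b, pred_b = serve({"amount": 20.0})

# Ground-truth labels that arrive later, keyed by request ID (made-up values).
labels = {id_a: 1}

# Training set construction: join logged requests with the labels that exist.
training_set = [
    (json.loads(row["request"]), labels[row["request_id"]])
    for row in inference_log
    if row["request_id"] in labels
]
```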

Monitoring and alerts

MLOps and governance

Databricks provides a full suite of tools for ML operations (MLOps) and governance. MLOps Stacks provides templates for enabling automated, repeatable promotion from development to production using infrastructure-as-code. Data, features, models, and endpoints are fully governed by Unity Catalog and AI Gateway.

Category

Features

CI/CD for ML

MLOps Stacks, built on Declarative Automation Bundles, provides code-based management and deployment of ML infrastructure and workflows. This includes CI/CD templates for automating training, evaluation, and deployment.

Workflow orchestration

Lakeflow Jobs orchestrates multi-step ML workflows as scheduled or triggered pipelines.
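
A multi-step workflow is essentially a dependency graph of tasks. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) to run tasks in dependency order; it is not the Lakeflow Jobs API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
workflow = {
    "featurize": {"ingest"},
    "train": {"featurize"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
    "report": {"evaluate"},
}

executed = []


def run_task(name: str) -> None:
    """Placeholder for real work such as running a notebook or script."""
    executed.append(name)


# static_order() yields every task only after all of its dependencies.
for task in TopologicalSorter(workflow).static_order():
    run_task(task)
```

Here `deploy` and `report` both depend only on `evaluate`, so an orchestrator is free to run them in either order or in parallel.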

Data and model asset governance

Unity Catalog provides unified governance for data, features, and registered models. Fine-grained access controls, lineage tracking, and audit logs apply to all assets.

Model endpoint governance

AI Gateway provides centralized governance and monitoring for model endpoints, including rate limits, usage tracking, and payload logging.
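
Rate limits and usage tracking can be pictured with a token bucket and a counter. This is a plain-Python sketch under stated assumptions, not how AI Gateway is implemented or configured; the `TokenBucket` class, the `handle_request` helper, and the limits are hypothetical. The clock is injectable so the behavior can be demonstrated deterministically.

```python
import time
from collections import Counter
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` requests, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


usage = Counter()  # (user, outcome) -> request count


def handle_request(user: str, bucket: TokenBucket) -> bool:
    allowed = bucket.allow()
    usage[(user, "allowed" if allowed else "rejected")] += 1
    return allowed


fake_now = [0.0]  # fake clock so the demo does not depend on wall time
bucket = TokenBucket(rate_per_sec=1.0, capacity=2, clock=lambda: fake_now[0])

burst = [handle_request("alice", bucket) for _ in range(3)]  # third call exceeds the burst
fake_now[0] += 1.0                                           # one second later: one token refilled
after_refill = handle_request("alice", bucket)
```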

Open source support

Databricks provides full support for the open-source ML ecosystem.

You can use any open-source ML framework on Databricks: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, Hugging Face Transformers, Ray, and more. MLflow or your custom tools can store model artifacts in open formats that can be exported and run outside Databricks.

MLflow is open source, created by Databricks and used by 10,000+ organizations. Your experiment tracking data, model artifacts, and pipeline definitions are stored in open formats.

Data and AI governance are built upon the open-source Unity Catalog APIs, and data storage is based upon the open Delta Lake format. Your feature data and training datasets remain in open, portable files.
