
Concepts: Data science and machine learning on Databricks

Data science and machine learning (DS and ML) extract insight and build predictive models from data. DS and ML include both interactive exploration and modeling and automated production systems. Classic ML includes techniques like classification, regression, anomaly detection, forecasting, and recommendation.

Modern deep learning and generative AI (GenAI) methods are technically types of ML. This section covers deep learning. For GenAI, see Concepts: Generative AI on Databricks.

The ML lifecycle

The ML lifecycle covers the end-to-end journey from raw data to a production model and back again through monitoring and retraining. Key stages include:

  1. Scope the use case by defining the prediction target, success metrics, and production requirements.
  2. Run exploratory data analysis (EDA) to understand data distributions, predictive signals, and data quality issues before modeling.
  3. Prepare data and features, and manage the features in a feature store.
  4. Train models and track experiments, logging experiment metadata for analysis and for deployment.
  5. Evaluate model quality against held-out data and stakeholder criteria.
  6. Register, stage, and test models before promoting to production.
  7. Deploy to production in real-time endpoints or batch inference jobs.
  8. Monitor and retrain to adapt models to changing data or user behavior.

See Machine learning lifecycle for a guide to each stage.

AI-assisted development and operations

Databricks provides Genie Code, an AI assistant integrated across notebooks and the workspace. Use it for development, debugging, and ongoing operations. It draws on specialized knowledge of your enterprise context. See Use Genie Code for data science.

You can use Genie Code at every step of your workflow:

You can also use third-party coding tools to develop and maintain ML pipelines on Databricks. See Agent skills for AI coding assistants.

What is an ML platform?

An ML platform is the combined infrastructure, tooling, and governance layer that supports the full ML lifecycle, from raw data to production models. A well-designed ML platform connects data engineering, interactive data science, and production ML in a single governed system.

Key components include:

  • Data assets such as files, tables, processing pipelines, and feature stores
  • Experimentation tools such as notebooks and visualizations, with simple collaboration and AI assistance
  • Training infrastructure with customizable environments and flexible compute resources
  • Deployment and monitoring infrastructure for batch and real-time serving, with production dashboards and alerts
  • MLOps and governance tools for orchestration, CI/CD, lineage, access management, and audit logging

Key governance capabilities include:

Also see Databricks data science and ML capabilities and Databricks architecture.

ML vs. deep learning vs. GenAI

The boundaries between machine learning (ML), deep learning (DL), and generative AI (GenAI) can be fuzzy. This guide focuses on ML and deep learning, but the following platform features support all three paradigms:

  • Model Serving supports classic ML, deep learning, and custom GenAI models for both real-time and batch inference.
  • ai_query supports SQL queries and batch inference workloads for all three paradigms.

Learn more