Log model dependencies

In this article, you learn how to log a model and its dependencies as model artifacts, so they are available in your environment for production tasks like model serving.

Log Python package model dependencies

MLflow has native support for some Python ML libraries, where MLflow can reliably log dependencies for models that use these libraries. See built-in model flavors.

For example, MLflow supports scikit-learn in the mlflow.sklearn module, and the command mlflow.sklearn.log_model logs the sklearn version. The same applies for autologging with those ML libraries. See the MLflow GitHub repository for additional examples.

For ML libraries that can be installed with pip install PACKAGE_NAME==VERSION, but do not have built-in MLflow model flavors, you can log those packages using the mlflow.pyfunc.log_model method. Be sure to log the requirements with the exact library version, for example, f"nltk=={nltk.__version__}" instead of just nltk.
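As a minimal sketch of exact pinning, the helpers below look up the installed version of a package with the standard library's importlib.metadata and build a name==version string. The helper names (pin, pinned_requirement, log_with_pinned_requirements) are hypothetical and not part of MLflow. The wrapper assumes the model is a custom pyfunc model passed as python_model, and that the pinned strings are supplied through extra_pip_requirements.

```python
from importlib.metadata import version


def pin(name: str, package_version: str) -> str:
    """Format an exact pip requirement such as 'nltk==3.8.1'."""
    return f"{name}=={package_version}"


def pinned_requirement(name: str) -> str:
    """Pin a package to the version installed in the current environment.

    Raises importlib.metadata.PackageNotFoundError if it is not installed.
    """
    return pin(name, version(name))


def log_with_pinned_requirements(python_model, artifact_path, packages):
    """Hypothetical wrapper: log a pyfunc model with exactly pinned packages."""
    # Imported lazily so the helpers above also work without MLflow installed.
    import mlflow.pyfunc

    return mlflow.pyfunc.log_model(
        artifact_path=artifact_path,
        python_model=python_model,
        extra_pip_requirements=[pinned_requirement(p) for p in packages],
    )
```

Looking up the version at logging time keeps the logged requirement in step with whatever is actually installed in the training environment, rather than relying on a hand-typed version number.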

mlflow.pyfunc.log_model supports logging for:

  • Public and custom libraries packaged as eggs or wheels.

  • Public packages on PyPI and privately hosted packages on your own PyPI server.

With mlflow.pyfunc.log_model, MLflow tries to infer the dependencies automatically. MLflow infers the dependencies using mlflow.models.infer_pip_requirements, and logs them to a requirements.txt file as a model artifact.

In older versions, MLflow sometimes did not identify all Python requirements automatically, especially if the library was not a built-in model flavor. In these cases, you can specify additional dependencies with the extra_pip_requirements parameter in the log_model command.
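One way to decide what to pass as extra_pip_requirements is to compare the inferred requirements with the packages you know the model needs. The sketch below is illustrative only: requirement_name, missing_requirements, and find_missing are hypothetical helpers, the name parsing is deliberately simple, and the flavor name "python_function" assumes a model logged with mlflow.pyfunc.

```python
import re


def requirement_name(requirement: str) -> str:
    """Return the normalized package name from a pip requirement string."""
    name = re.split(r"[<>=!~\[;\s]", requirement.strip(), maxsplit=1)[0]
    # Treat '-', '_' and '.' as equivalent and ignore case, as pip does.
    return re.sub(r"[-_.]+", "-", name).lower()


def missing_requirements(inferred, needed):
    """List entries of `needed` whose package does not appear in `inferred`."""
    present = {requirement_name(r) for r in inferred}
    return [r for r in needed if requirement_name(r) not in present]


def find_missing(model_uri, needed, flavor="python_function"):
    """Compare MLflow's inferred requirements for a logged model with `needed`."""
    # Imported lazily so the pure helpers above work without MLflow installed.
    from mlflow.models import infer_pip_requirements

    return missing_requirements(infer_pip_requirements(model_uri, flavor), needed)
```

Any entries returned by find_missing are candidates for the extra_pip_requirements parameter the next time you log the model.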

Important

You can also overwrite the entire set of requirements with the conda_env and pip_requirements parameters, but doing so is generally discouraged because it overrides the dependencies that MLflow picks up automatically. See an example of how to use the pip_requirements parameter to overwrite requirements.

Customized model logging

For scenarios where more customized model logging is necessary, you can either:

  • Write a custom Python model. Doing so allows you to subclass mlflow.pyfunc.PythonModel to customize initialization and prediction. This approach works well for customization of Python-only models.

  • Write a custom flavor. In this scenario, you can customize logging more than the generic pyfunc flavor, but doing so requires more work to implement.
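To illustrate the first option, here is a minimal sketch of a custom Python model. The class name AddN and its behavior are made up for this example, and the try/except fallback is only there so the sketch can be exercised in an environment without MLflow. With MLflow installed, the class subclasses mlflow.pyfunc.PythonModel as described above.

```python
try:
    from mlflow.pyfunc import PythonModel
except ImportError:
    # Fallback so the sketch can be read and run without MLflow installed.
    PythonModel = object


class AddN(PythonModel):
    """Toy custom model: adds a constant to every input value."""

    def __init__(self, n):
        # Custom initialization: store whatever state prediction needs.
        self.n = n

    def predict(self, context, model_input, params=None):
        # Custom prediction logic; `context` is unused in this toy model.
        return [x + self.n for x in model_input]
```

An instance such as AddN(5) can then be passed to mlflow.pyfunc.log_model through the python_model argument.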

Custom Python code

You may have Python code dependencies that can’t be installed using the %pip install command, such as one or more .py files.

When logging a model, you can tell MLflow that the model can find those dependencies at a specified path by using the code_path parameter in mlflow.pyfunc.log_model. MLflow stores any files or directories passed using code_path as artifacts along with the model in a code directory. When loading the model, MLflow adds these files or directories to the Python path. This route also works with custom Python wheels, which can be included in the model using code_path, just like .py files.

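import mlflow.pyfunc

# artifact_path, data_path, and conda_env are assumed to be defined earlier.
# log_model also needs the model itself, through python_model or loader_module.
mlflow.pyfunc.log_model(
    artifact_path=artifact_path,
    code_path=["filename.py"],  # paths are passed as strings
    data_path=data_path,
    conda_env=conda_env,
)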

Log non-Python package model dependencies

MLflow does not automatically pick up non-Python dependencies, such as Java packages, R packages, and native packages (such as Linux packages). For these packages, you need to log additional data.

  • Dependency list: Databricks recommends logging an artifact with the model specifying these non-Python dependencies. This could be a simple .txt or .json file. mlflow.pyfunc.log_model allows you to specify this additional artifact using the artifacts argument.

  • Custom packages: Just as for custom Python dependencies above, you need to ensure that the packages are available in your deployment environment. For packages in a central location such as Maven Central or your own repository, make sure that the location is available at scoring or serving time. For private packages not hosted elsewhere, you can log packages along with the model as artifacts.
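The dependency list described above could be produced and attached as follows. This is a sketch under stated assumptions: the helper names, the artifact key "non_python_deps", and the layout of the JSON file are all hypothetical, and MLflow does not interpret the file's contents. It only stores the file with the model.

```python
import json
from pathlib import Path


def write_dependency_manifest(path, dependencies):
    """Write non-Python dependencies to a JSON file and return its path as str."""
    path = Path(path)
    path.write_text(json.dumps(dependencies, indent=2, sort_keys=True))
    return str(path)


def read_dependency_manifest(path):
    """Load the manifest back, for example from a deployment script."""
    return json.loads(Path(path).read_text())


def log_with_manifest(python_model, artifact_path, manifest_path):
    """Hypothetical wrapper: attach the manifest through the artifacts argument."""
    # Imported lazily so the file helpers above work without MLflow installed.
    import mlflow.pyfunc

    return mlflow.pyfunc.log_model(
        artifact_path=artifact_path,
        python_model=python_model,
        artifacts={"non_python_deps": manifest_path},
    )
```

A deployment script can later read the manifest and install what it lists. Keeping the file machine-readable (JSON rather than free text) makes that step easier to automate.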

Deploy models with dependencies

When deploying a model from the MLflow Tracking Server or Model Registry, you need to ensure that the deployment environment has the right dependencies installed. The simplest path may depend on your deployment mode: batch/streaming or online serving, and on the types of dependencies.

For all deployment modes, Databricks recommends running inference on the same runtime version that you used during training, since the Databricks Runtime in which you created your model has various libraries already installed. MLflow in Databricks automatically saves that runtime version in the MLmodel metadata file in a databricks_runtime field, such as databricks_runtime: 10.2.x-cpu-ml-scala2.12.
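As a rough sketch, the runtime version can be read from the text of the MLmodel file with a few lines of Python. The function name read_databricks_runtime is hypothetical, and the sample file content below is abbreviated for illustration. The parser only scans for a top-level databricks_runtime key. A YAML library would be more robust if one is available in your environment.

```python
def read_databricks_runtime(mlmodel_text):
    """Return the top-level databricks_runtime value from MLmodel text, or None."""
    for line in mlmodel_text.splitlines():
        # Only top-level keys (no leading indentation) are considered.
        if line.startswith("databricks_runtime:"):
            value = line.split(":", 1)[1].strip()
            return value.strip("'\"") or None
    return None


# Abbreviated, illustrative MLmodel content.
sample = """\
artifact_path: model
databricks_runtime: 10.2.x-cpu-ml-scala2.12
flavors:
  python_function:
    loader_module: mlflow.sklearn
"""

print(read_databricks_runtime(sample))  # prints 10.2.x-cpu-ml-scala2.12
```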

Online serving: Databricks model serving

Databricks offers model serving through Serverless Real-Time Inference and through Classic MLflow Model Serving. In both cases, your MLflow machine learning models are exposed as scalable REST API endpoints.

For Python dependencies in the requirements.txt file, Databricks and MLflow handle everything for public PyPI dependencies. Similarly, if you specified .py files or wheels when logging the model by using the code_path argument, MLflow loads those dependencies for you automatically.

Online serving: third-party systems or Docker containers

If your scenario requires serving to third-party serving solutions or your own Docker-based solution, you can export your model as a Docker container.

Databricks recommends the following for third-party serving that automatically handles Python dependencies. However, for non-Python dependencies, the container needs to be modified to include them.

Batch and streaming jobs

Batch and streaming scoring should be run as Databricks Jobs. A notebook job often suffices, and the simplest way to prepare code is to use the Databricks Model Registry to generate a scoring notebook.

Follow these steps to ensure that dependencies are installed and applied:

  1. Start your scoring cluster with the same Databricks Runtime version used during training. Read the databricks_runtime field from the MLmodel metadata file, and start a cluster with that runtime version.

    • This can be done manually in the cluster configuration or automated with custom logic. For automation, you can use the runtime version that you read from the metadata file in the Jobs API and Clusters API.

  2. Next, install any non-Python dependencies. To ensure your non-Python dependencies are accessible to your deployment environment, you can either:

    • Manually install the non-Python dependencies of your model on the Databricks cluster as part of the cluster configuration before running inference.

    • Alternatively, you can write custom logic in your scoring job deployment to automate the installation of the dependencies onto your cluster. Assuming you saved your non-Python dependencies as artifacts as described in Log non-Python package model dependencies, this automation can install libraries using the Libraries API. Or, you can write specific code to generate a cluster-scoped initialization script to install the dependencies.

  3. Your scoring job installs the Python dependencies in the job execution environment. In Databricks, the Model Registry allows you to generate a notebook for inference which does this for you.

    • When you use the Databricks Model Registry to generate a scoring notebook, the notebook contains code to install the Python dependencies in the model’s requirements.txt file. For your notebook job for batch or streaming scoring, this code initializes your notebook environment, so that the model dependencies are installed and ready for your model.

  4. MLflow handles any custom Python code included in the code_path parameter in log_model. This code is added to the Python path when the model’s predict() method is called. You can also do this manually by either:

    Note

    If you specified .py files or wheels when logging the model using the code_path argument, MLflow loads those dependencies for you automatically.
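Steps 1 and 3 above can be sketched in code. This is not a complete deployment script. The helper names are hypothetical, the node type in the usage note is a placeholder, and the sketch assumes that the cluster definition takes the runtime version in a spark_version field. The second helper relies on mlflow.pyfunc.get_model_dependencies, which returns the local path of the model's requirements file.

```python
def new_cluster_spec(databricks_runtime, node_type_id, num_workers=1):
    """Build a cluster definition that reuses the training runtime version."""
    return {
        # Assumption: the runtime version string goes in spark_version.
        "spark_version": databricks_runtime,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }


def requirements_file(model_uri):
    """Return the local path of the model's requirements.txt (needs MLflow)."""
    # Imported lazily so new_cluster_spec works without MLflow installed.
    import mlflow.pyfunc

    return mlflow.pyfunc.get_model_dependencies(model_uri)
```

For example, new_cluster_spec("10.2.x-cpu-ml-scala2.12", "i3.xlarge", 2) yields a dictionary that can be embedded in a job definition, and in a notebook the path returned by requirements_file can be passed to %pip install -r.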