Featurization for transfer learning
Databricks supports featurization with deep learning models. Pre-trained deep learning models may be used to compute features for use in other downstream models. Databricks supports featurization at scale, distributing the computation across a cluster. You can perform featurization with deep learning libraries included in Databricks Runtime ML, including TensorFlow and PyTorch.
Databricks also supports transfer learning, a technique closely related to featurization. Transfer learning allows you to reuse knowledge from one problem domain in a related domain. Featurization is itself a simple and powerful method for transfer learning: computing features using a pre-trained deep learning model transfers knowledge about good features from the original domain.
This article demonstrates how to compute features for transfer learning with a pre-trained TensorFlow model, following this workflow:
Start with a pre-trained deep learning model, in this case an image classification model from
Truncate the last layer(s) of the model. The modified model produces a tensor of features as output, rather than a prediction.
Apply that model to a new image dataset from a different problem domain, computing features for the images.
Use these features to train a new model. The following notebook omits this final step. For examples of training a simple model such as logistic regression, see Introduction to Databricks Machine Learning.
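The first three steps of this workflow can be sketched roughly as follows. This is a minimal illustration, not the notebook's code. ResNet50 from tf.keras.applications is an assumed model choice, and build_featurizer and featurize are hypothetical helper names. Passing include_top=False removes the final classification layer, so the model returns a feature vector for each image instead of a prediction.

```python
import numpy as np
import tensorflow as tf


def build_featurizer(weights="imagenet"):
    """Return a ResNet50 with its classification head removed.

    ResNet50 is an assumed choice for illustration. include_top=False drops
    the final fully connected layer, and pooling="avg" averages the last
    feature map into a single vector per image.
    """
    return tf.keras.applications.ResNet50(
        include_top=False, weights=weights, pooling="avg"
    )


def featurize(model, images):
    """Compute one feature vector per image.

    images: array of shape (n, 224, 224, 3), RGB values in [0, 255].
    Returns a float array of shape (n, 2048).
    """
    # np.array copies the input, so preprocessing does not modify the caller's data.
    batch = tf.keras.applications.resnet50.preprocess_input(
        np.array(images, dtype="float32")
    )
    return model.predict(batch, verbose=0)
```

Calling featurize(build_featurizer(), images) on images from the new domain yields the feature matrix for the final training step. The default weights="imagenet" downloads the pre-trained weights on first use.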
The following notebook uses pandas UDFs to perform the featurization step. Both pandas UDFs and their newer variant, Scalar Iterator pandas UDFs, offer flexible APIs, work with any deep learning library, and deliver high performance.
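To illustrate the Scalar Iterator pattern, here is a hedged sketch rather than the notebook's actual code. The names featurize_batches, make_featurize_udf, and load_model are hypothetical. The sketch assumes the input column already holds preprocessed images flattened into equal-length arrays of floats. The iterator form lets the model be loaded once and reused for every batch it receives, instead of being reloaded for each batch.

```python
from typing import Callable, Iterator

import numpy as np
import pandas as pd


def featurize_batches(
    batches: Iterator[pd.Series],
    load_model: Callable[[], Callable[[np.ndarray], np.ndarray]],
) -> Iterator[pd.Series]:
    """Load the model once, then featurize each batch of flattened images."""
    model = load_model()  # expensive step: runs once per iterator, not per batch
    for batch in batches:
        # Each element is one flattened image; stack them into an (n, d) array.
        inputs = np.stack(batch.tolist()).astype("float32")
        features = model(inputs)  # expected shape: (n, feature_dim)
        # Return one 1-D feature array per input row.
        yield pd.Series(list(features))


def make_featurize_udf(load_model, return_type="array<float>"):
    """Wrap featurize_batches as a Scalar Iterator pandas UDF (PySpark 3.0+)."""
    # Imported here so featurize_batches can be tested without Spark installed.
    from pyspark.sql.functions import pandas_udf

    @pandas_udf(return_type)
    def featurize_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        return featurize_batches(batches, load_model)

    return featurize_udf
```

On a cluster, the returned UDF would be applied to a DataFrame column, for example df.select(make_featurize_udf(load_model)("pixels")), where pixels is a hypothetical column name. Because load_model is passed in, the same sketch works with a TensorFlow or PyTorch model.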