Feature engineering with MLlib

Apache Spark MLlib contains many utility functions for performing feature engineering at scale, including methods for encoding and transforming features. These methods can also be used to process features for other machine learning libraries.

Databricks recommends the following Apache Spark MLlib guides:

Extracting, transforming and selecting features with MLlib
MLlib Programming Guide
Python API Reference
Scala API Reference

The following PySpark notebook includes preprocessing steps that convert categorical data to numeric variables using category indexing and one-hot encoding.

Binary classification example
