Feature engineering with MLlib
Apache Spark MLlib contains many utility functions for performing feature engineering at scale, including methods for encoding and transforming features. These methods can also be used to process features for other machine learning libraries.
Databricks recommends the following Apache Spark MLlib guides and API references:
- Extracting, transforming and selecting features with MLlib
- MLlib Programming Guide
- Python API Reference
- Scala API Reference
This PySpark-based notebook includes preprocessing steps that convert categorical data to numeric variables using category indexing and one-hot encoding.