Natural language processing

You can perform natural language processing (NLP) tasks on Databricks using popular open-source libraries such as Spark ML and Spark NLP, or using proprietary libraries through the Databricks partnership with John Snow Labs.

Feature creation from text using Spark ML

Spark ML contains a range of text processing tools for creating features from text columns. You can build input features for model training algorithms directly in your Spark ML pipelines. Supported text processors include tokenization, stop-word removal, word2vec, and feature hashing.

Training and inference using Spark NLP

You can scale out many deep learning methods for natural language processing on Spark using the open-source Spark NLP library. This library supports standard natural language processing operations such as tokenizing, named entity recognition, and vectorization using the included annotators. You can also summarize, translate, and generate text, or perform named entity recognition, using the many pre-trained deep learning models based on Spark NLP's transformers, such as BERT, T5, and Marian.

Perform inference in batch using Spark NLP on CPUs

Spark NLP provides many pre-trained models you can use with minimal code. This section contains an example of using the Marian Transformer for machine translation. For the full set of examples, see the Spark NLP documentation.

Requirements

To use Spark NLP, create or use a cluster running any compatible runtime. Install Spark NLP on the cluster using the latest Maven coordinates for Spark NLP, such as com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0.

Example code for machine translation

In a notebook cell, install the Spark NLP Python library:

%pip install sparknlp

Construct a pipeline for translation and run it on some sample text:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
  .setInputCols("document").setOutputCol("sentence")

marian_transformer = MarianTransformer.pretrained() \
  .setInputCols("sentence").setOutputCol("translation")

pipeline = Pipeline().setStages([document_assembler, sentence_detector, marian_transformer])

data = spark.createDataFrame([["You can use Spark NLP to translate text. " + \
                               "This example pipeline translates English to French"]]).toDF("text")

# Create a pipeline model that can be reused across multiple data frames
model = pipeline.fit(data)

# You can use the model on any data frame that has a "text" column
result = model.transform(data)

display(result.select("text", "translation.result"))

Train and use a named-entity recognition model using Spark NLP and MLflow

The example notebook illustrates how to train a named entity recognition model using Spark NLP, save the model to MLflow, and use the model for inference on text. Refer to the John Snow Labs documentation for Spark NLP to learn how to train additional natural language processing models.

Spark NLP model training and inference notebook


Healthcare NLP with John Snow Labs partnership

John Snow Labs Spark NLP for Healthcare is a proprietary library for clinical and biomedical text mining. This library provides pre-trained models for recognizing and working with clinical entities, drugs, risk factors, anatomy, demographics, and sensitive data. You can try Spark NLP for Healthcare using the Partner Connect integration with John Snow Labs, which requires a trial or paid account with John Snow Labs. For the full capabilities of Spark NLP for Healthcare and its usage documentation, see the John Snow Labs website.