You can perform natural language processing tasks on Databricks using popular open source libraries such as Spark ML and spark-nlp, or proprietary libraries through the Databricks partnership with John Snow Labs.
Spark ML contains a range of text processing tools to create features from text columns. You can create input features for model training algorithms from text directly in your Spark ML pipelines. The supported text processors include tokenization, stop-word processing, word2vec, and feature hashing.
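To make the idea behind these text processors concrete, the following is a minimal plain-Python sketch of tokenization, stop-word removal, and feature hashing. It is not Spark ML code and does not reproduce Spark ML's implementation: the function names, the tiny stop-word list, the vector size of 16, and the use of CRC32 as the hash function are all assumptions chosen for illustration. In a Spark ML pipeline, the corresponding steps would typically be handled by transformers such as `Tokenizer`, `StopWordsRemover`, and `HashingTF` from `pyspark.ml.feature`.

```python
import re
import zlib

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in stop_words]


def hash_features(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector (the hashing trick).

    CRC32 is an assumption made for determinism; it is not the hash
    function that Spark ML uses.
    """
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


tokens = remove_stop_words(tokenize("The quick brown fox jumps over the lazy dog"))
features = hash_features(tokens)
print(tokens)
print(features)
```

The resulting vector always has `num_features` entries regardless of vocabulary size, which is the main appeal of feature hashing. The trade-off is that distinct tokens can collide in the same slot, and a larger vector reduces how often that happens.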
You can scale out many deep learning methods for natural language processing on Spark using the open-source Spark NLP library. This library supports standard natural language processing operations such as tokenizing, named entity recognition, and vectorization using the included annotators. You can also summarize, perform named entity recognition, translate, and generate text using many pre-trained deep learning models based on Spark NLP’s transformers, such as BERT, T5, and Marian.
Spark NLP provides many pre-trained models you can use with minimal code. This section contains an example of using the Marian Transformer for machine translation. For the full set of examples, see the Spark NLP documentation.
To use Spark NLP, create or use a cluster running any compatible runtime.
Install Spark NLP on the cluster using the latest Maven coordinates for Spark NLP, such
In a notebook cell, install the sparknlp Python libraries:
%pip install sparknlp
Construct a pipeline for translation and run it on some sample text:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
  .setInputCols("document").setOutputCol("sentence")

marian_transformer = MarianTransformer.pretrained() \
  .setInputCols("sentence").setOutputCol("translation")

pipeline = Pipeline().setStages([document_assembler, sentence_detector, marian_transformer])

data = spark.createDataFrame([["You can use Spark NLP to translate text. " + \
  "This example pipeline translates English to French"]]).toDF("text")

# Create a pipeline model that can be reused across multiple data frames
model = pipeline.fit(data)

# You can use the model on any data frame that has a "text" column
result = model.transform(data)

display(result.select("text", "translation.result"))
The example notebook illustrates how to train a named entity recognition model using Spark NLP, save the model to MLflow, and use the model for inference on text. Refer to the John Snow Labs documentation for Spark NLP to learn how to train additional natural language processing models.
John Snow Labs Spark NLP for Healthcare is a proprietary library for clinical and biomedical text mining. This library provides pre-trained models for recognizing and working with clinical entities, drugs, risk factors, anatomy, demographics, and sensitive data. You can try Spark NLP for Healthcare using the Partner Connect integration with John Snow Labs. You need a trial or paid account with John Snow Labs. For the full capabilities of John Snow Labs Spark NLP for Healthcare and its usage documentation, see the John Snow Labs website.