%pip install sparknlp mlflow
Python interpreter will be restarted.
Collecting sparknlp
Collecting mlflow
Collecting spark-nlp
Installing collected packages: greenlet, sqlalchemy, spark-nlp, querystring-parser, prometheus-flask-exporter, docker, alembic, sparknlp, mlflow
Successfully installed alembic-1.8.1 docker-6.0.0 greenlet-1.1.3 mlflow-1.29.0 prometheus-flask-exporter-0.20.3 querystring-parser-1.2.4 spark-nlp-4.2.0 sparknlp-1.0.0 sqlalchemy-1.4.41
Python interpreter will be restarted.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
import mlflow

# Start an MLflow run to track this training session.
mlflow_run = mlflow.start_run()

# Training hyperparameters for the NER model.
max_epochs = 1
lr = 0.003
batch_size = 32
random_seed = 0
verbose = 1
validation_split = 0.2
evaluation_log_extended = True
enable_output_logs = True
include_confidence = True
output_logs_path = "dbfs:/ner_logs"

# Create the DBFS directory that will hold the training logs.
dbutils.fs.mkdirs(output_logs_path)
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(max_epochs)\
    .setLr(lr)\
    .setBatchSize(batch_size)\
    .setRandomSeed(random_seed)\
    .setVerbose(verbose)\
    .setValidationSplit(validation_split)\
    .setEvaluationLogExtended(evaluation_log_extended)\
    .setEnableOutputLogs(enable_output_logs)\
    .setIncludeConfidence(include_confidence)\
    .setOutputLogsPath(output_logs_path)
# Log the model training parameters to MLflow.
mlflow.log_params({
    "max_epochs": max_epochs,
    "lr": lr,
    "batch_size": batch_size,
    "random_seed": random_seed,
    "verbose": verbose,
    "validation_split": validation_split,
    "evaluation_log_extended": evaluation_log_extended,
    "enable_output_logs": enable_output_logs,
    "include_confidence": include_confidence,
    "output_logs_path": output_logs_path
})
# The training and evaluation data is already tokenized, so you can directly
# apply the embedding model and then fit a named-entity recognizer on the embeddings.
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_pipeline = Pipeline(stages=[
    glove_embeddings,
    nerTagger
])
ner_model = ner_pipeline.fit(training_data)
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
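Because setEnableOutputLogs is enabled, NerDLApproach writes a training log (loss and validation metrics per epoch) under output_logs_path. As a minimal sketch, you can list the directory and print the start of a log; the log file name is generated at training time, so it is not known in advance.
# List the training logs that NerDLApproach wrote to output_logs_path and
# print the beginning of the first one (dbutils.fs.head reads up to maxBytes).
log_files = dbutils.fs.ls(output_logs_path)
for f in log_files:
    print(f.path)
if log_files:
    print(dbutils.fs.head(log_files[0].path, 4096))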
from pyspark.sql import functions as F

# Generate predictions on the held-out evaluation set; this assumes a test_data
# DataFrame was prepared earlier in the same way as training_data.
predictions = ner_model.transform(test_data)

# Reformat the predictions to one token per row for evaluation.
predictions_pandas = predictions.select(F.explode(F.arrays_zip(predictions.token.result,
                                                               predictions.label.result,
                                                               predictions.ner.result)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"),
            F.expr("cols['2']").alias("prediction")).toPandas()
from sklearn.metrics import classification_report
# Generate a classification report.
report = classification_report(predictions_pandas['ground_truth'], predictions_pandas['prediction'], output_dict=True)
# Directly log accuracy to MLflow.
mlflow.log_metric("accuracy", report["accuracy"])
# Log the full classification report by token type as an artifact to MLflow.
mlflow.log_dict(report, "classification_report.yaml")
# Print the report to view it in the notebook.
print(classification_report(predictions_pandas['ground_truth'], predictions_pandas['prediction']))
              precision    recall  f1-score   support

       B-LOC       0.87      0.95      0.91      1837
      B-MISC       0.92      0.80      0.86       922
       B-ORG       0.86      0.87      0.86      1341
       B-PER       0.95      0.95      0.95      1842
       I-LOC       0.78      0.79      0.78       257
      I-MISC       0.81      0.62      0.70       346
       I-ORG       0.86      0.72      0.78       751
       I-PER       0.96      0.97      0.96      1307
           O       0.99      1.00      0.99     42759

    accuracy                           0.98     51362
   macro avg       0.89      0.85      0.87     51362
weighted avg       0.98      0.98      0.98     51362
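The report dictionary also contains per-label precision, recall, and F1 scores. If you want those tracked alongside accuracy, a sketch like the following logs each label's F1 as its own MLflow metric while the run is still active:
# Log per-label F1 scores as individual MLflow metrics. The report dict mixes
# per-label entries with aggregates ("accuracy", "macro avg", "weighted avg"),
# so skip anything that is not a per-label dict.
for label, scores in report.items():
    if isinstance(scores, dict) and "avg" not in label:
        mlflow.log_metric("f1_{}".format(label), scores["f1-score"])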
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

# Pull the trained NER model out of the fitted pipeline.
loaded_ner_model = ner_model.stages[1]

converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        glove_embeddings,
        loaded_ner_model,
        converter])
# In Databricks Runtime 11.2 and Databricks Runtime 11.2 ML, model logging is handled using Databricks MLflow
# utilities. The Databricks MLflow utilities for DBFS in Databricks Runtime 11.2 do not support all of the
# filesystem calls that Spark NLP uses for model serialization. The following command disables the MLflow
# utilities and falls back to standard DBFS support.
import os
if os.environ["DATABRICKS_RUNTIME_VERSION"] == "11.2":
    os.environ["DISABLE_MLFLOWDBFS"] = "True"
# Fit the prediction pipeline on an empty DataFrame to get a PipelineModel;
# every stage is pretrained or already trained, so no training actually runs.
empty_data = spark.createDataFrame([[""]]).toDF("text")
prediction_model = ner_prediction_pipeline.fit(empty_data)

# Log the model in MLflow and build a reference to the model URI.
model_name = "NerPipelineModel"
mlflow.spark.log_model(prediction_model, model_name)
mlflow.end_run()

mlflow_model_uri = "runs:/{}/{}".format(mlflow_run.info.run_id, model_name)
display(mlflow_model_uri)
2022/09/30 17:30:58 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpwp054nzv/model, flavor: spark), fall back to return ['pyspark==3.3.0']. Set logging level to DEBUG to see the full traceback.
'runs:/c5ebd6a6ac1f4e2b935bb3cb72bf6db2/NerPipelineModel'
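If the model should outlive this run's artifacts, the same URI can be promoted to the MLflow Model Registry. A minimal sketch, assuming the Model Registry is enabled in your workspace:
# Register the logged model under the same name in the MLflow Model Registry.
registered = mlflow.register_model(mlflow_model_uri, model_name)
print("Registered model version:", registered.version)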
# Create sample text.
text = "From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum, whose tremulous branches seemed hardly able to bear the burden of a beauty so flamelike as theirs; and now and then the fantastic shadows of birds in flight flitted across the long tussore-silk curtains that were stretched in front of the huge window, producing a kind of momentary Japanese effect, and making him think of those pallid, jade-faced painters of Tokyo who, through the medium of an art that is necessarily immobile, seek to convey the sense of swiftness and motion. The sullen murmur of the bees shouldering their way through the long unmown grass, or circling with monotonous insistence round the dusty gilt horns of the straggling woodbine, seemed to make the stillness more oppressive. The dim roar of London was like the bourdon note of a distant organ."
sample_data = spark.createDataFrame([[text]]).toDF("text")
# Load and use the model.
mlflow_model = mlflow.spark.load_model(mlflow_model_uri)
predictions = mlflow_model.transform(sample_data)
2022/09/30 17:31:02 INFO mlflow.spark: 'runs:/c5ebd6a6ac1f4e2b935bb3cb72bf6db2/NerPipelineModel' resolved as 'dbfs:/databricks/mlflow-tracking/1285644875494958/c5ebd6a6ac1f4e2b935bb3cb72bf6db2/artifacts/NerPipelineModel'
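To inspect the recognized entities, you can flatten the ner_span column produced by the NerConverter. The sketch below follows the same arrays_zip pattern used for evaluation above, and assumes each chunk's metadata carries the predicted label under the "entity" key (the Spark NLP convention for NerConverter output).
# One row per recognized entity: the chunk text and its predicted entity type.
entities = predictions.select(F.explode(F.arrays_zip(predictions.ner_span.result,
                                                     predictions.ner_span.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("chunk"),
            F.expr("cols['1']['entity']").alias("entity"))
display(entities)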
Requirements
To use Spark NLP, create or use a cluster with a compatible Databricks Runtime version. Install Spark NLP on the cluster using Maven coordinates, such as com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0.
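Once the library is attached to the cluster, you can confirm from a notebook that a compatible version is visible:
import sparknlp
# Print the Spark NLP library version available on the cluster.
print(sparknlp.version())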