Spark NLP model training and inference (Python)


Requirements

To use Spark NLP, create or use a cluster running a compatible Databricks Runtime version, and install Spark NLP on the cluster as a Maven library, for example com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0. The %pip command below installs the matching Python packages on the cluster.

%pip install spark-nlp mlflow
Python interpreter will be restarted.
Successfully installed alembic-1.8.1 docker-6.0.0 greenlet-1.1.3 mlflow-1.29.0 prometheus-flask-exporter-0.20.3 querystring-parser-1.2.4 spark-nlp-4.2.0 sqlalchemy-1.4.41
Python interpreter will be restarted.
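
After the install completes and the interpreter restarts, a quick sanity check confirms the Python package is importable and reports its version (sparknlp.version() is part of the package API; the printed value should match the Maven library attached to the cluster):

import sparknlp
# Print the installed Spark NLP version as a sanity check.
print(sparknlp.version())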

Load sample training and evaluation data

# Download the CoNLL-2003 training (eng.train) and development (eng.testa) splits.
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa
 
# Read the CoNLL files into annotated Spark DataFrames with Spark NLP's CoNLL reader.
from sparknlp.training import CoNLL
training_data = CoNLL().readDataset(spark, 'file:/databricks/driver/eng.train')
test_data = CoNLL().readDataset(spark, 'file:/databricks/driver/eng.testa')
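
CoNLL().readDataset returns a Spark DataFrame whose rows carry the raw text together with ready-made document, sentence, token, pos, and label annotation columns. A brief inspection with standard Spark calls (a sketch, not part of the original notebook) confirms the data loaded as expected:

# Inspect the schema and size of the loaded training set.
training_data.printSchema()
print(training_data.count(), "training sentences")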

Fit a pipeline on the training data

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
 
import mlflow
# Start an MLflow run to track the parameters, metrics, and artifacts below.
mlflow_run = mlflow.start_run()
 
max_epochs = 1
lr = 0.003
batch_size = 32
random_seed = 0
verbose = 1
validation_split = 0.2
evaluation_log_extended = True
enable_output_logs = True
include_confidence = True
output_logs_path = "dbfs:/ner_logs"
 
dbutils.fs.mkdirs(output_logs_path)
 
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(max_epochs)\
  .setLr(lr)\
  .setBatchSize(batch_size)\
  .setRandomSeed(random_seed)\
  .setVerbose(verbose)\
  .setValidationSplit(validation_split)\
  .setEvaluationLogExtended(evaluation_log_extended)\
  .setEnableOutputLogs(enable_output_logs)\
  .setIncludeConfidence(include_confidence)\
  .setOutputLogsPath(output_logs_path)
 
# Log model training parameters to MLflow.
mlflow.log_params({
  "max_epochs": max_epochs,
  "lr": lr,
  "batch_size": batch_size,
  "random_seed": random_seed,
  "verbose": verbose,
  "validation_split": validation_split,
  "evaluation_log_extended": evaluation_log_extended,
  "enable_output_logs": enable_output_logs,
  "include_confidence": include_confidence,
  "output_logs_path": output_logs_path
})
 
# The training and evaluation data is already tokenized, so you can directly 
# apply the embedding model and then fit a named-entity recognizer on the embeddings.
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")
 
ner_pipeline = Pipeline(stages=[
    glove_embeddings,
    nerTagger
])
 
ner_model = ner_pipeline.fit(training_data)
glove_100d download started this may take some time. Approximate size to download 145.3 MB [OK!]
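
Because setEnableOutputLogs and setOutputLogsPath were set above, NerDLApproach writes per-epoch training logs under dbfs:/ner_logs during fit. A quick way to locate them (a sketch using standard dbutils calls; the log file names are generated per run):

# List the NER training log files written during fit.
display(dbutils.fs.ls(output_logs_path))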

Evaluate on test data

predictions = ner_model.transform(test_data)
import pyspark.sql.functions as F
display(predictions.select(F.col('token.result').alias("tokens"),
                           F.col('label.result').alias("ground_truth"),
                           F.col('ner.result').alias("predictions")).limit(3))
 
Row 1:
  tokens:       ["CRICKET", "-", "LEICESTERSHIRE", "TAKE", "OVER", "AT", "TOP", "AFTER", "INNINGS", "VICTORY", "."]
  ground_truth: ["O", "O", "B-ORG", "O", "O", "O", "O", "O", "O", "O", "O"]
Row 2:
  tokens:       ["LONDON", "1996-08-30"]
  ground_truth: ["B-LOC", "O"]
Row 3:
  tokens:       ["West", "Indian", "all-rounder", "Phil", "Simmons", "took", "four", "for", "38", "on", "Friday", "as", "Leicestershire", "beat", "Somerset", "by", "an", "innings", "and", "39", "runs", "in", "two", "days", "to", "take", "over", "at", "the", "head", "of", "the", "county", "championship", "."]
  ground_truth: ["B-MISC", "I-MISC", "O", "B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "O", "B-ORG", "O", "B-ORG", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
Showing all 3 rows.
# Reformat data to one token per row for evaluation.
predictions_pandas = predictions.select(F.explode(F.arrays_zip(predictions.token.result,
                                                     predictions.label.result,
                                                     predictions.ner.result)).alias("cols")) \
                              .select(F.expr("cols['0']").alias("token"),
                                      F.expr("cols['1']").alias("ground_truth"),
                                      F.expr("cols['2']").alias("prediction")).toPandas()
display(predictions_pandas.head(20))
 
token           ground_truth   prediction
CRICKET         O              O
-               O              O
LEICESTERSHIRE  B-ORG          B-ORG
TAKE            O              O
OVER            O              O
AT              O              O
TOP             O              O
AFTER           O              O
INNINGS         O              O
VICTORY         O              O
.               O              O
LONDON          B-LOC          B-LOC
1996-08-30      O              O
West            B-MISC         B-LOC
Indian          I-MISC         I-LOC
all-rounder     O              O
Phil            B-PER          B-PER
...
Showing all 20 rows.
from sklearn.metrics import classification_report
 
# Generate a classification report.
report = classification_report(predictions_pandas['ground_truth'], predictions_pandas['prediction'], output_dict=True)
 
# Directly log accuracy to MLflow.
mlflow.log_metric("accuracy", report["accuracy"])
# Log the full classification report by token type as an artifact to MLflow.
mlflow.log_dict(report, "classification_report.yaml")
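 
# Optionally (an extension, not in the original notebook), log summary F1
# scores from the report dict as metrics so runs are easier to compare:
mlflow.log_metric("macro_f1", report["macro avg"]["f1-score"])
mlflow.log_metric("weighted_f1", report["weighted avg"]["f1-score"])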
 
# Print out the report to view it in the notebook.
print(classification_report(predictions_pandas['ground_truth'], predictions_pandas['prediction']))
 
              precision    recall  f1-score   support

       B-LOC       0.87      0.95      0.91      1837
      B-MISC       0.92      0.80      0.86       922
       B-ORG       0.86      0.87      0.86      1341
       B-PER       0.95      0.95      0.95      1842
       I-LOC       0.78      0.79      0.78       257
      I-MISC       0.81      0.62      0.70       346
       I-ORG       0.86      0.72      0.78       751
       I-PER       0.96      0.97      0.96      1307
           O       0.99      1.00      0.99     42759

    accuracy                           0.98     51362
   macro avg       0.89      0.85      0.87     51362
weighted avg       0.98      0.98      0.98     51362

Construct and log a prediction pipeline for text

document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
 
sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')
 
token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')
 
# Pull the trained NerDLModel out of the fitted pipeline (stage 1).
loaded_ner_model = ner_model.stages[1]
 
converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")
 
ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        glove_embeddings,
        loaded_ner_model,
        converter])
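
To complete what the section title describes, the assembled pipeline still needs to be fitted and logged. A minimal sketch of one common pattern follows; the empty-DataFrame fit, the artifact path "ner_prediction_pipeline", and the choice of mlflow.spark.log_model are assumptions here, not steps shown above:

# Fit on an empty text DataFrame to obtain a PipelineModel; every stage is
# pretrained or already trained, so no learning happens in this step.
empty_df = spark.createDataFrame([[""]]).toDF("text")
prediction_model = ner_prediction_pipeline.fit(empty_df)
 
# Log the fitted pipeline to the active MLflow run using the Spark flavor,
# then end the run opened earlier with mlflow.start_run().
import mlflow.spark
mlflow.spark.log_model(prediction_model, "ner_prediction_pipeline")
mlflow.end_run()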