%pip install sparknlp mlflow
Python interpreter will be restarted.
Collecting sparknlp
Collecting mlflow
Collecting spark-nlp
Installing collected packages: greenlet, sqlalchemy, spark-nlp, querystring-parser, prometheus-flask-exporter, docker, alembic, sparknlp, mlflow
Successfully installed alembic-1.8.1 docker-6.0.0 greenlet-1.1.3 mlflow-1.29.0 prometheus-flask-exporter-0.20.3 querystring-parser-1.2.4 spark-nlp-4.2.0 sparknlp-1.0.0 sqlalchemy-1.4.41
Python interpreter will be restarted.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
import mlflow

# Start an MLflow run to track this training session.
mlflow_run = mlflow.start_run()

# Training hyperparameters for the NER model.
max_epochs = 1
lr = 0.003
batch_size = 32
random_seed = 0
verbose = 1
validation_split = 0.2
evaluation_log_extended = True
enable_output_logs = True
include_confidence = True
output_logs_path = "dbfs:/ner_logs"

# Create the DBFS directory that will hold the training logs.
dbutils.fs.mkdirs(output_logs_path)
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(max_epochs)\
    .setLr(lr)\
    .setBatchSize(batch_size)\
    .setRandomSeed(random_seed)\
    .setVerbose(verbose)\
    .setValidationSplit(validation_split)\
    .setEvaluationLogExtended(evaluation_log_extended)\
    .setEnableOutputLogs(enable_output_logs)\
    .setIncludeConfidence(include_confidence)\
    .setOutputLogsPath(output_logs_path)
# Log the model training parameters to MLflow.
mlflow.log_params({
    "max_epochs": max_epochs,
    "lr": lr,
    "batch_size": batch_size,
    "random_seed": random_seed,
    "verbose": verbose,
    "validation_split": validation_split,
    "evaluation_log_extended": evaluation_log_extended,
    "enable_output_logs": enable_output_logs,
    "include_confidence": include_confidence,
    "output_logs_path": output_logs_path
})
# The training and evaluation data is already tokenized, so you can directly
# apply the embedding model and then fit a named-entity recognizer on the embeddings.
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_pipeline = Pipeline(stages=[
    glove_embeddings,
    nerTagger
])
ner_model = ner_pipeline.fit(training_data)
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
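Because setEnableOutputLogs is enabled, NerDLApproach writes a training log (loss and validation metrics per epoch) under output_logs_path. As a minimal sketch, you can list the directory and print the start of a log; the log file name is generated at training time, so it is not known in advance.
# List the training logs that NerDLApproach wrote to output_logs_path and
# print the beginning of the first one (dbutils.fs.head reads up to maxBytes).
log_files = dbutils.fs.ls(output_logs_path)
for f in log_files:
    print(f.path)
if log_files:
    print(dbutils.fs.head(log_files[0].path, 4096))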
from pyspark.sql import functions as F

# Generate predictions on the held-out evaluation set; this assumes a test_data
# DataFrame was prepared earlier in the same way as training_data.
predictions = ner_model.transform(test_data)

# Reformat the predictions to one token per row for evaluation.
predictions_pandas = predictions.select(F.explode(F.arrays_zip(predictions.token.result,
                                                               predictions.label.result,
                                                               predictions.ner.result)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"),
            F.expr("cols['2']").alias("prediction")).toPandas()
from sklearn.metrics import classification_report
# Generate a classification report.
report = classification_report(predictions_pandas['ground_truth'], predictions_pandas['prediction'], output_dict=True)
# Directly log accuracy to MLflow.
mlflow.log_metric("accuracy", report["accuracy"])
# Log the full classification report by token type as an artifact to MLflow.
mlflow.log_dict(report, "classification_report.yaml")
# Print the report to view it in the notebook.
print(classification_report(predictions_pandas['ground_truth'], predictions_pandas['prediction']))
              precision    recall  f1-score   support

       B-LOC       0.87      0.95      0.91      1837
      B-MISC       0.92      0.80      0.86       922
       B-ORG       0.86      0.87      0.86      1341
       B-PER       0.95      0.95      0.95      1842
       I-LOC       0.78      0.79      0.78       257
      I-MISC       0.81      0.62      0.70       346
       I-ORG       0.86      0.72      0.78       751
       I-PER       0.96      0.97      0.96      1307
           O       0.99      1.00      0.99     42759

    accuracy                           0.98     51362
   macro avg       0.89      0.85      0.87     51362
weighted avg       0.98      0.98      0.98     51362
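The report dictionary also contains per-label precision, recall, and F1 scores. If you want those tracked alongside accuracy, a sketch like the following logs each label's F1 as its own MLflow metric while the run is still active:
# Log per-label F1 scores as individual MLflow metrics. The report dict mixes
# per-label entries with aggregates ("accuracy", "macro avg", "weighted avg"),
# so skip anything that is not a per-label dict.
for label, scores in report.items():
    if isinstance(scores, dict) and "avg" not in label:
        mlflow.log_metric("f1_{}".format(label), scores["f1-score"])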
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

# Pull the trained NER model out of the fitted pipeline.
loaded_ner_model = ner_model.stages[1]

converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        glove_embeddings,
        loaded_ner_model,
        converter])
# In Databricks Runtime 11.2 and Databricks Runtime 11.2 ML, model logging is handled using Databricks MLflow
# utilities. The Databricks MLflow utilities for DBFS in Databricks Runtime 11.2 do not support all of the
# filesystem calls that Spark NLP uses for model serialization. The following command disables the MLflow
# utilities and falls back to standard DBFS support.
import os
if os.environ["DATABRICKS_RUNTIME_VERSION"] == "11.2":
    os.environ["DISABLE_MLFLOWDBFS"] = "True"
# Fit the prediction pipeline on an empty DataFrame to get a PipelineModel;
# every stage is pretrained or already trained, so no training actually runs.
empty_data = spark.createDataFrame([[""]]).toDF("text")
prediction_model = ner_prediction_pipeline.fit(empty_data)

# Log the model in MLflow and build a reference to the model URI.
model_name = "NerPipelineModel"
mlflow.spark.log_model(prediction_model, model_name)
mlflow.end_run()

mlflow_model_uri = "runs:/{}/{}".format(mlflow_run.info.run_id, model_name)
display(mlflow_model_uri)
2022/09/30 17:30:58 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpwp054nzv/model, flavor: spark), fall back to return ['pyspark==3.3.0']. Set logging level to DEBUG to see the full traceback.
'runs:/c5ebd6a6ac1f4e2b935bb3cb72bf6db2/NerPipelineModel'
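If the model should outlive this run's artifacts, the same URI can be promoted to the MLflow Model Registry. A minimal sketch, assuming the Model Registry is enabled in your workspace:
# Register the logged model under the same name in the MLflow Model Registry.
registered = mlflow.register_model(mlflow_model_uri, model_name)
print("Registered model version:", registered.version)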
# Create sample text.
text = "From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum, whose tremulous branches seemed hardly able to bear the burden of a beauty so flamelike as theirs; and now and then the fantastic shadows of birds in flight flitted across the long tussore-silk curtains that were stretched in front of the huge window, producing a kind of momentary Japanese effect, and making him think of those pallid, jade-faced painters of Tokyo who, through the medium of an art that is necessarily immobile, seek to convey the sense of swiftness and motion. The sullen murmur of the bees shouldering their way through the long unmown grass, or circling with monotonous insistence round the dusty gilt horns of the straggling woodbine, seemed to make the stillness more oppressive. The dim roar of London was like the bourdon note of a distant organ."
sample_data = spark.createDataFrame([[text]]).toDF("text")
# Load and use the model.
mlflow_model = mlflow.spark.load_model(mlflow_model_uri)
predictions = mlflow_model.transform(sample_data)
2022/09/30 17:31:02 INFO mlflow.spark: 'runs:/c5ebd6a6ac1f4e2b935bb3cb72bf6db2/NerPipelineModel' resolved as 'dbfs:/databricks/mlflow-tracking/1285644875494958/c5ebd6a6ac1f4e2b935bb3cb72bf6db2/artifacts/NerPipelineModel'
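To inspect the recognized entities, you can flatten the ner_span column produced by the NerConverter. The sketch below follows the same arrays_zip pattern used for evaluation above, and assumes each chunk's metadata carries the predicted label under the "entity" key (the Spark NLP convention for NerConverter output).
# One row per recognized entity: the chunk text and its predicted entity type.
entities = predictions.select(F.explode(F.arrays_zip(predictions.ner_span.result,
                                                     predictions.ner_span.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("chunk"),
            F.expr("cols['1']['entity']").alias("entity"))
display(entities)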
Requirements
To use Spark NLP, create or use a cluster with a compatible Databricks Runtime version. Install Spark NLP on the cluster using Maven coordinates, such as com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0.
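Once the library is attached to the cluster, you can confirm from a notebook that a compatible version is visible:
import sparknlp
# Print the Spark NLP library version available on the cluster.
print(sparknlp.version())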