%pip install sparknlp mlflow
Python interpreter will be restarted.
Collecting sparknlp
Downloading sparknlp-1.0.0-py3-none-any.whl (1.4 kB)
Collecting mlflow
Downloading mlflow-1.29.0-py3-none-any.whl (16.9 MB)
Requirement already satisfied: numpy in /databricks/python3/lib/python3.9/site-packages (from sparknlp) (1.20.3)
Collecting spark-nlp
Downloading spark_nlp-4.2.0-py2.py3-none-any.whl (641 kB)
Requirement already satisfied: gunicorn<21 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (20.1.0)
Collecting docker<7,>=4.0.0
Downloading docker-6.0.0-py3-none-any.whl (147 kB)
Requirement already satisfied: pyyaml<7,>=5.1 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (6.0)
Collecting prometheus-flask-exporter<1
Downloading prometheus_flask_exporter-0.20.3-py3-none-any.whl (18 kB)
Requirement already satisfied: pandas<2 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (1.3.4)
Requirement already satisfied: gitpython<4,>=2.1.0 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (3.1.27)
Requirement already satisfied: cloudpickle<3 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (2.0.0)
Requirement already satisfied: protobuf<5,>=3.12.0 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (3.19.4)
Requirement already satisfied: pytz<2023 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (2021.3)
Requirement already satisfied: requests<3,>=2.17.3 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (2.26.0)
Collecting sqlalchemy<2,>=1.4.0
Downloading SQLAlchemy-1.4.41-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting alembic<2
Downloading alembic-1.8.1-py3-none-any.whl (209 kB)
Requirement already satisfied: packaging<22 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (21.0)
Requirement already satisfied: scipy<2 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (1.7.1)
Requirement already satisfied: Flask<3 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (1.1.2)
Requirement already satisfied: importlib-metadata!=4.7.0,<5,>=3.7.0 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (4.8.1)
Requirement already satisfied: entrypoints<1 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (0.3)
Collecting querystring-parser<2
Downloading querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)
Requirement already satisfied: sqlparse<1,>=0.4.0 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (0.4.2)
Requirement already satisfied: databricks-cli<1,>=0.8.7 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (0.17.0)
Requirement already satisfied: click<9,>=7.0 in /databricks/python3/lib/python3.9/site-packages (from mlflow) (8.0.3)
Requirement already satisfied: Mako in /databricks/python3/lib/python3.9/site-packages (from alembic<2->mlflow) (1.2.0)
Requirement already satisfied: tabulate>=0.7.7 in /databricks/python3/lib/python3.9/site-packages (from databricks-cli<1,>=0.8.7->mlflow) (0.8.9)
Requirement already satisfied: six>=1.10.0 in /databricks/python3/lib/python3.9/site-packages (from databricks-cli<1,>=0.8.7->mlflow) (1.16.0)
Requirement already satisfied: oauthlib>=3.1.0 in /databricks/python3/lib/python3.9/site-packages (from databricks-cli<1,>=0.8.7->mlflow) (3.2.0)
Requirement already satisfied: pyjwt>=1.7.0 in /databricks/python3/lib/python3.9/site-packages (from databricks-cli<1,>=0.8.7->mlflow) (2.4.0)
Requirement already satisfied: urllib3>=1.26.0 in /databricks/python3/lib/python3.9/site-packages (from docker<7,>=4.0.0->mlflow) (1.26.7)
Requirement already satisfied: websocket-client>=0.32.0 in /databricks/python3/lib/python3.9/site-packages (from docker<7,>=4.0.0->mlflow) (1.3.1)
Requirement already satisfied: itsdangerous>=0.24 in /databricks/python3/lib/python3.9/site-packages (from Flask<3->mlflow) (2.0.1)
Requirement already satisfied: Jinja2>=2.10.1 in /databricks/python3/lib/python3.9/site-packages (from Flask<3->mlflow) (2.11.3)
Requirement already satisfied: Werkzeug>=0.15 in /databricks/python3/lib/python3.9/site-packages (from Flask<3->mlflow) (2.0.2)
Requirement already satisfied: gitdb<5,>=4.0.1 in /databricks/python3/lib/python3.9/site-packages (from gitpython<4,>=2.1.0->mlflow) (4.0.9)
Requirement already satisfied: smmap<6,>=3.0.1 in /databricks/python3/lib/python3.9/site-packages (from gitdb<5,>=4.0.1->gitpython<4,>=2.1.0->mlflow) (5.0.0)
Requirement already satisfied: setuptools>=3.0 in /usr/local/lib/python3.9/dist-packages (from gunicorn<21->mlflow) (58.0.4)
Requirement already satisfied: zipp>=0.5 in /databricks/python3/lib/python3.9/site-packages (from importlib-metadata!=4.7.0,<5,>=3.7.0->mlflow) (3.6.0)
Requirement already satisfied: MarkupSafe>=0.23 in /databricks/python3/lib/python3.9/site-packages (from Jinja2>=2.10.1->Flask<3->mlflow) (2.0.1)
Requirement already satisfied: pyparsing>=2.0.2 in /databricks/python3/lib/python3.9/site-packages (from packaging<22->mlflow) (3.0.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /databricks/python3/lib/python3.9/site-packages (from pandas<2->mlflow) (2.8.2)
Requirement already satisfied: prometheus-client in /databricks/python3/lib/python3.9/site-packages (from prometheus-flask-exporter<1->mlflow) (0.11.0)
Requirement already satisfied: idna<4,>=2.5 in /databricks/python3/lib/python3.9/site-packages (from requests<3,>=2.17.3->mlflow) (3.2)
Requirement already satisfied: charset-normalizer~=2.0.0 in /databricks/python3/lib/python3.9/site-packages (from requests<3,>=2.17.3->mlflow) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /databricks/python3/lib/python3.9/site-packages (from requests<3,>=2.17.3->mlflow) (2021.10.8)
Collecting greenlet!=0.4.17
Downloading greenlet-1.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (154 kB)
Installing collected packages: greenlet, sqlalchemy, spark-nlp, querystring-parser, prometheus-flask-exporter, docker, alembic, sparknlp, mlflow
Successfully installed alembic-1.8.1 docker-6.0.0 greenlet-1.1.3 mlflow-1.29.0 prometheus-flask-exporter-0.20.3 querystring-parser-1.2.4 spark-nlp-4.2.0 sparknlp-1.0.0 sqlalchemy-1.4.41
Python interpreter will be restarted.
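import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
import mlflow
# Pipeline and the functions alias F are used in later cells.
import pyspark.sql.functions as F
from pyspark.ml import Pipeline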
mlflow_run = mlflow.start_run()
max_epochs = 1
lr = 0.003
batch_size = 32
random_seed = 0
verbose = 1
validation_split = 0.2
evaluation_log_extended = True
enable_output_logs = True
include_confidence = True
output_logs_path = "dbfs:/ner_logs"
dbutils.fs.mkdirs(output_logs_path)
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(max_epochs)\
    .setLr(lr)\
    .setBatchSize(batch_size)\
    .setRandomSeed(random_seed)\
    .setVerbose(verbose)\
    .setValidationSplit(validation_split)\
    .setEvaluationLogExtended(evaluation_log_extended)\
    .setEnableOutputLogs(enable_output_logs)\
    .setIncludeConfidence(include_confidence)\
    .setOutputLogsPath(output_logs_path)
# Log model training parameters to MLflow.
mlflow.log_params({
    "max_epochs": max_epochs,
    "lr": lr,
    "batch_size": batch_size,
    "random_seed": random_seed,
    "verbose": verbose,
    "validation_split": validation_split,
    "evaluation_log_extended": evaluation_log_extended,
    "enable_output_logs": enable_output_logs,
    "include_confidence": include_confidence,
    "output_logs_path": output_logs_path
})
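The hyperparameters above are written out twice, once in the setter chain and once in the `mlflow.log_params` call, so the two lists can drift apart. One possible way to keep them in sync is sketched below: hold the values in a single dict and derive each setter name from its key. The helpers `setter_name` and `apply_params` and the `RecordingEstimator` stand-in are hypothetical names introduced for this illustration; they are not part of Spark NLP or MLflow, and the stand-in exists only so the sketch runs without a cluster.

```python
def setter_name(param_key):
    """Turn a snake_case key such as 'max_epochs' into 'setMaxEpochs'."""
    return "set" + "".join(part.capitalize() for part in param_key.split("_"))


def apply_params(estimator, params):
    """Call estimator.setXxx(value) for every key in params and return the result."""
    for key, value in params.items():
        estimator = getattr(estimator, setter_name(key))(value)
    return estimator


class RecordingEstimator:
    """Hypothetical stand-in that records setter calls (not a Spark NLP class)."""

    def __init__(self):
        self.calls = {}

    def __getattr__(self, name):
        # Any setXxx attribute becomes a setter that records its value
        # and returns self, mimicking a chainable builder.
        if not name.startswith("set"):
            raise AttributeError(name)

        def setter(value):
            self.calls[name] = value
            return self

        return setter


# A subset of the hyperparameters from the cell above.
params = {"max_epochs": 1, "lr": 0.003, "batch_size": 32, "validation_split": 0.2}
estimator = apply_params(RecordingEstimator(), params)
print(estimator.calls)
```

With a real `NerDLApproach` in place of the stand-in, the same `params` dict could be passed unchanged to `mlflow.log_params(params)`, so the logged values are by construction the ones the tagger was configured with.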
# The training and evaluation data is already tokenized, so you can directly
# apply the embedding model and then fit a named-entity recognizer on the embeddings.
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")
ner_pipeline = Pipeline(stages=[
    glove_embeddings,
    nerTagger
])
ner_model = ner_pipeline.fit(training_data)
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
# `predictions` is assumed to be the output of ner_model.transform(...) on the
# labeled evaluation data, so it carries the token, label, and ner columns.
# Reformat data to one token per row for evaluation.
predictions_pandas = predictions.select(
    F.explode(F.arrays_zip(predictions.token.result,
                           predictions.label.result,
                           predictions.ner.result)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"),
            F.expr("cols['2']").alias("prediction")).toPandas()
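The reshaping step can be hard to picture, so here is a plain-Python sketch of the same idea: each input row holds three parallel lists for one sentence, and the result has one record per token. The sample rows and the helper name `one_token_per_row` are hypothetical and only illustrate the shape of the data; this is not how Spark executes the query.

```python
# Hypothetical rows: each row holds parallel lists for one sentence.
rows = [
    {"token": ["John", "lives", "in", "Paris"],
     "label": ["B-PER", "O", "O", "B-LOC"],
     "ner": ["B-PER", "O", "O", "B-ORG"]},
    {"token": ["Hello"], "label": ["O"], "ner": ["O"]},
]


def one_token_per_row(rows):
    """Flatten per-sentence parallel lists into one record per token."""
    flat = []
    for row in rows:
        # zip plays the role of arrays_zip; the outer loop plays the role of explode.
        # Note: zip truncates to the shortest list, whereas arrays_zip pads with nulls.
        for token, truth, pred in zip(row["token"], row["label"], row["ner"]):
            flat.append({"token": token, "ground_truth": truth, "prediction": pred})
    return flat


flat_rows = one_token_per_row(rows)
for record in flat_rows:
    print(record)
```

The flattened records line up ground truth and prediction token by token, which is the layout a token-level classification report expects.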
from sklearn.metrics import classification_report
# Generate a classification report.
report = classification_report(predictions_pandas['ground_truth'], predictions_pandas['prediction'], output_dict=True)
# Directly log accuracy to MLflow.
mlflow.log_metric("accuracy", report["accuracy"])
# Log the full classification report by token type as an artifact to MLflow.
mlflow.log_dict(report, "classification_report.yaml")
# Print the report to view it in the notebook.
print(classification_report(predictions_pandas['ground_truth'], predictions_pandas['prediction']))
              precision    recall  f1-score   support

       B-LOC       0.87      0.95      0.91      1837
      B-MISC       0.92      0.80      0.86       922
       B-ORG       0.86      0.87      0.86      1341
       B-PER       0.95      0.95      0.95      1842
       I-LOC       0.78      0.79      0.78       257
      I-MISC       0.81      0.62      0.70       346
       I-ORG       0.86      0.72      0.78       751
       I-PER       0.96      0.97      0.96      1307
           O       0.99      1.00      0.99     42759

    accuracy                           0.98     51362
   macro avg       0.89      0.85      0.87     51362
weighted avg       0.98      0.98      0.98     51362
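When reading this report, it helps to see why the accuracy and weighted average sit at 0.98 while the macro average is noticeably lower. The sketch below recomputes the macro averages from the rounded per-label values printed above and measures how much of the support belongs to the `O` label. It uses only the numbers in the report; the variable names are illustrative.

```python
# (precision, recall, f1, support) per label, copied from the report above.
report_rows = {
    "B-LOC": (0.87, 0.95, 0.91, 1837),
    "B-MISC": (0.92, 0.80, 0.86, 922),
    "B-ORG": (0.86, 0.87, 0.86, 1341),
    "B-PER": (0.95, 0.95, 0.95, 1842),
    "I-LOC": (0.78, 0.79, 0.78, 257),
    "I-MISC": (0.81, 0.62, 0.70, 346),
    "I-ORG": (0.86, 0.72, 0.78, 751),
    "I-PER": (0.96, 0.97, 0.96, 1307),
    "O": (0.99, 1.00, 0.99, 42759),
}

n_labels = len(report_rows)

# Macro average: every label counts equally, regardless of how many tokens it has.
macro_precision = sum(row[0] for row in report_rows.values()) / n_labels
macro_recall = sum(row[1] for row in report_rows.values()) / n_labels
macro_f1 = sum(row[2] for row in report_rows.values()) / n_labels

# Share of all evaluated tokens that carry the non-entity label "O".
total_support = sum(row[3] for row in report_rows.values())
o_share = report_rows["O"][3] / total_support

print(f"macro precision={macro_precision:.2f} recall={macro_recall:.2f} f1={macro_f1:.2f}")
print(f"total support={total_support}, share of 'O' tokens={o_share:.2%}")
```

Because most tokens are `O`, accuracy and the weighted average are dominated by that label. The macro average and the per-label rows (for example the lower recall on `I-MISC` and `I-ORG`) are more informative about entity quality.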
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')
token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')
# Pull the trained NER model out of the fitted pipeline
# (stage 0 is the embeddings model, stage 1 is the tagger).
loaded_ner_model = ner_model.stages[1]
converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")
ner_prediction_pipeline = Pipeline(
    stages=[
        document,
        sentence,
        token,
        glove_embeddings,
        loaded_ner_model,
        converter])
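The last stage, `NerConverter`, turns the per-token IOB tags produced by the NER model into entity spans. The plain-Python sketch below illustrates that grouping idea on a made-up sentence. It is a simplified illustration under the assumption of well-formed `B-`/`I-`/`O` tags, not Spark NLP's implementation, and the function name `iob_to_spans` and the sample data are hypothetical.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        if current_tokens:
            # Any other tag closes the open entity.
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            # B- opens a new entity; a stray I- without a matching B- is
            # treated as an opening tag too (a choice made for this sketch).
            current_tokens, current_type = [token], entity_type
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical example sentence and tags.
tokens = ["Peter", "Parker", "works", "at", "New", "York", "Times"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

In the pipeline above, the corresponding spans would appear in the `ner_span` output column rather than as Python tuples.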
Requirements
To use Spark NLP, create or use a cluster with any compatible runtime version. Install Spark NLP on the cluster using Maven coordinates, such as com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0.
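A Maven coordinate has the form group:artifact:version, and for Scala libraries the artifact name usually ends with the Scala version it was built for (here `2.12`). The sketch below splits the example coordinate into those parts and compares its version with the `spark-nlp` Python package version shown in the install log earlier (4.2.0). The helper `parse_maven_coordinate` is a hypothetical name, and the assumption that the Python package and the cluster library should be on the same release is a general expectation worth checking against the Spark NLP documentation for your versions.

```python
def parse_maven_coordinate(coordinate):
    """Split 'group:artifact_scalaVersion:version' into its named parts."""
    group_id, artifact_id, version = coordinate.split(":")
    # For Scala artifacts the suffix after the last underscore is the Scala version.
    artifact_name, _, scala_version = artifact_id.rpartition("_")
    return {
        "group_id": group_id,
        "artifact": artifact_name,
        "scala_version": scala_version,
        "version": version,
    }


coord = parse_maven_coordinate("com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0")

# Version of the spark-nlp Python package reported by the pip log above.
pip_version = "4.2.0"
versions_match = coord["version"] == pip_version

print(coord)
print("Python package and Maven library on the same version:", versions_match)
```

If the check reports a mismatch, one option is to install the Maven coordinate whose version equals the Python package version, or to pin the Python package to the cluster library's version.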