
Tutorial: UIE-based Named Entity Recognition

Using the UIE model in PaddleNLP to quickly improve NER performance through minimal data annotation

This tutorial uses PaddleNLP's Universal Information Extraction (UIE) model for named entity recognition and demonstrates how to quickly improve model performance by fine-tuning on a small amount of annotated data.

:::note The complete Jupyter notebook, code, and annotated data can all be found at https://app.hyper.ai/console/open-tutorials/containers/lWyxi1DwhJU. :::

Import Dependencies

from pprint import pprint
from paddlenlp import Taskflow

Using uie-base for Named Entity Recognition

First, use the pre-trained uie-base model directly for named entity recognition, without any fine-tuning, to see the baseline results.

schema = [
    'Location',
    'Person',
    'Organization',
    'Time',
    'Product',
    'Price',
    'Weather'
]
ie = Taskflow('information_extraction', schema=schema)
pprint(ie("2K and Gearbox Software announced that Tiny Tina's Wonderlands will launch on Steam at 1 AM on June 24th, having previously been an Epic exclusive on PC. During the limited period, Steam players can purchase Tiny Tina's Wonderlands on Steam and receive the Golden Hero Armor Pack before July 8th, 2022."))
pprint(ie("Recently, quantum computing expert and ACM Prize in Computing recipient Scott Aaronson announced through his blog that he will be leaving the University of Texas at Austin (UT Austin) for a year this week and joining artificial intelligence research company OpenAI."))
Output for the second input:

[
  {
    "Person": [
      { "end": 32, "probability": 0.4801083732026494, "start": 24, "text": "Aaronson" },
      { "end": 23, "probability": 0.6648137293130958, "start": 18, "text": "Scott" }
    ],
    "Time": [{ "end": 43, "probability": 0.8425767345737043, "start": 41, "text": "this week" }],
    "Organization": [{ "end": 87, "probability": 0.5554367836811132, "start": 81, "text": "OpenAI" }]
  }
]

Using the default uie-base model for named entity recognition, the results are already quite good: most named entities are identified. However, some entities are still missed and some text is misidentified. For example, "Scott Aaronson" was recognized as two separate person names, and "University of Texas at Austin" was not recognized at all.

To improve recognition performance, this tutorial attempts to fine-tune the model by annotating a small amount of data.

Data Annotation

This tutorial uses the annotation platform Label Studio to label the data. All annotation work is completed in an active Workspace.

Starting Label Studio

Open a terminal in Jupyter and execute openbayes-label-studio to run Label Studio inside the Jupyter Workspace. The command prints an external access URL in the terminal.

Open the link in a browser, register an account and log in to start using it.

:::warning The external access link differs for each computing container. The link shown in this tutorial will not work directly; replace it with the link printed in your own terminal. :::

Annotating Data

Specific steps are as follows:

  1. Create a project.
  2. Import data. The data used in this tutorial has been uploaded to this computing container, namely corpus.txt.
  3. Configure the labeling interface. Select the Named Entity Recognition template in Natural Language Processing, and add or modify labels as needed. The entity labels that need to be defined in this tutorial are 'Location', 'Person', 'Organization', 'Time', 'Product', 'Price', 'Weather'.
  4. Start labeling data.
  5. Export data. After labeling is complete, export the result file in JSON format from Label Studio; a record in that export looks roughly like the sketch below. A pre-labeled file label-studio.json is already available in this computing container.
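For reference, a single record in Label Studio's JSON export looks roughly like the following (field names follow Label Studio's NER export format; the text, offsets, and label here are illustrative):

[
  {
    "id": 1,
    "data": { "text": "2K and Gearbox Software announced that Tiny Tina's Wonderlands will launch on Steam..." },
    "annotations": [
      {
        "result": [
          {
            "type": "labels",
            "from_name": "label",
            "to_name": "text",
            "value": { "start": 0, "end": 2, "text": "2K", "labels": ["Organization"] }
          }
        ]
      }
    ]
  }
]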

:::info If you don't want to label the data yourself, that's fine. The label-studio.json in this tutorial is already labeled and exported. :::

Model Fine-tuning

The scripts required for model fine-tuning below have been uploaded to this computing container.

Data Conversion

Execute the following script in the terminal to convert the data file exported from Label Studio into the doccano export format.

python labelstudio2doccano.py --labelstudio_file label-studio.json

Parameter descriptions:

  • labelstudio_file: Path to the Label Studio export file (only JSON format is supported).
  • doccano_file: Save path for the doccano format data file, defaults to "doccano_ext.jsonl".
  • task_type: Task type, options include extraction ("ext") and classification ("cls"), defaults to "ext".

:::info PaddleNLP does not provide a tool by default to convert Label Studio's annotation format to its supported format. We provide a labelstudio2doccano.py script here. :::
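After conversion, each line of doccano_ext.jsonl is one doccano-style record, roughly like this (illustrative text and offsets; the field names follow the doccano NER export format that doccano.py expects):

{"id": 1, "text": "2K and Gearbox Software announced that Tiny Tina's Wonderlands will launch on Steam...", "relations": [], "entities": [{"id": 1, "start_offset": 0, "end_offset": 2, "label": "Organization"}]}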

Then execute the following script in the terminal to process the doccano format data file. After execution, training/validation/test set files will be generated in the ./data directory.

python doccano.py \
    --doccano_file ./doccano_ext.jsonl \
    --task_type "ext" \
    --save_dir ./data \
    --splits 0.7 0.2 0.1

Parameter descriptions:

  • doccano_file: Path to the doccano format data annotation file.
  • task_type: Select task type, options include extraction ("ext") and classification ("cls").
  • save_dir: Save directory for training data, defaults to the data directory.
  • negative_ratio: Maximum negative example ratio; only valid for extraction tasks. Constructing negative examples appropriately can improve model performance. The number of negative examples is related to the actual number of labels: maximum negative examples = negative_ratio * number of positive examples (for example, with the default ratio of 5 and 100 positive examples, up to 500 negatives are generated). This parameter only affects the training set; to keep evaluation metrics accurate, the validation and test sets construct full negative examples by default. Defaults to 5.
  • splits: Proportion of training set, validation set, and test set when splitting the dataset. Defaults to [0.8, 0.1, 0.1].
  • options: Specify category labels for classification tasks. This parameter is only valid for classification tasks. Defaults to ["正向", "负向"] ("positive", "negative").
  • prompt_prefix: Declare the prompt prefix information for classification tasks. This parameter is only valid for classification tasks. Defaults to "情感倾向" ("sentiment tendency").
  • is_shuffle: Whether to randomly shuffle the dataset, defaults to True.
  • seed: Random seed, defaults to 1000.
  • separator: Separator between entity category/evaluation dimension and classification label. This parameter is only valid for entity/evaluation dimension-level classification tasks. Defaults to "##".

:::warning Each execution of the doccano.py script will overwrite existing data files with the same name. :::
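Each line of the generated train.txt is a prompt-style training sample in the UIE format, roughly like this (illustrative; field names follow PaddleNLP's UIE data format):

{"content": "2K and Gearbox Software announced that Tiny Tina's Wonderlands will launch on Steam...", "result_list": [{"text": "2K", "start": 0, "end": 2}], "prompt": "Organization"}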

Finetune

Execute the following script in the terminal to finetune the model.

python finetune.py \
    --train_path "./data/train.txt" \
    --dev_path "./data/dev.txt" \
    --save_dir "./checkpoint" \
    --learning_rate 1e-5 \
    --batch_size 4 \
    --max_seq_len 512 \
    --num_epochs 50 \
    --model "uie-base" \
    --seed 1000 \
    --logging_steps 10 \
    --valid_steps 100 \
    --device "gpu"

Parameter descriptions:

  • train_path: Training set file path.
  • dev_path: Validation set file path.
  • save_dir: Model storage path, default is "./checkpoint".
  • learning_rate: Learning rate, default is 1e-5.
  • batch_size: Batch size, please adjust according to your machine's specifications, default is 16.
  • max_seq_len: Maximum text segmentation length. When input exceeds the maximum length, the input text will be automatically segmented, default is 512.
  • num_epochs: Number of training epochs, default is 100.
  • model: Select model. The program will finetune based on the selected model. Options are "uie-base", "uie-medium", "uie-mini", "uie-micro" and "uie-nano", default is "uie-base".
  • seed: Random seed, default is 1000.
  • logging_steps: Interval steps for logging, default is 10.
  • valid_steps: Interval steps for evaluation, default is 100.
  • device: Device to use for training, options are "cpu" or "gpu".
  • init_from_ckpt: Path to initialize model parameters, can continue training from a checkpoint.
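For example, to continue training from a previous run, pass the checkpoint's weight file via init_from_ckpt (a sketch, assuming the checkpoint keeps the default model_state.pdparams file name):

python finetune.py \
    --train_path "./data/train.txt" \
    --dev_path "./data/dev.txt" \
    --save_dir "./checkpoint" \
    --init_from_ckpt "./checkpoint/model_best/model_state.pdparams" \
    --device "gpu"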

Model Evaluation

Execute the following script in the terminal to evaluate the model.

python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/dev.txt \
    --batch_size 16 \
    --max_seq_len 512

Output:

[2022-07-15 03:18:19,157] [    INFO] - -----------------------------
[2022-07-15 03:18:19,157] [    INFO] - Class Name: all_classes
[2022-07-15 03:18:19,157] [    INFO] - Evaluation Precision: 0.95349 | Recall: 0.89130 | F1: 0.92135

As can be seen, the F1 score has reached 0.92 after fine-tuning on only a small amount of annotated data, indicating that the model performs well.

Parameter descriptions:

  • model_path: Path to the model folder for evaluation, which should contain the model weight file model_state.pdparams and configuration file model_config.json.
  • test_path: Test set file for evaluation.
  • batch_size: Batch size, please adjust according to machine conditions, default is 16.
  • max_seq_len: Maximum text segmentation length. When input exceeds the maximum length, the input text will be automatically segmented, default is 512.
  • debug: Whether to enable debug mode to evaluate each positive category separately. This mode is only for model debugging and is disabled by default.
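To produce a per-class report like the one below, rerun the evaluation with the --debug flag:

python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/dev.txt \
    --debug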

Example output in debug mode:

[2022-07-15 03:27:57,801] [    INFO] - -----------------------------
[2022-07-15 03:27:57,801] [    INFO] - Class Name: Organization
[2022-07-15 03:27:57,802] [    INFO] - Evaluation Precision: 1.00000 | Recall: 0.75000 | F1: 0.85714
[2022-07-15 03:27:57,913] [    INFO] - -----------------------------
[2022-07-15 03:27:57,913] [    INFO] - Class Name: Location
[2022-07-15 03:27:57,913] [    INFO] - Evaluation Precision: 0.90476 | Recall: 0.82609 | F1: 0.86364
[2022-07-15 03:27:58,046] [    INFO] - -----------------------------
[2022-07-15 03:27:58,046] [    INFO] - Class Name: Time
[2022-07-15 03:27:58,047] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-15 03:27:58,098] [    INFO] - -----------------------------
[2022-07-15 03:27:58,098] [    INFO] - Class Name: Product
[2022-07-15 03:27:58,098] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-15 03:27:58,147] [    INFO] - -----------------------------
[2022-07-15 03:27:58,147] [    INFO] - Class Name: Price
[2022-07-15 03:27:58,147] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-15 03:27:58,176] [    INFO] - -----------------------------
[2022-07-15 03:27:58,176] [    INFO] - Class Name: Person Name
[2022-07-15 03:27:58,177] [    INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000

Results after Fine-tuning
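To use the fine-tuned weights, load them with Taskflow by pointing task_path at the best checkpoint (the same pattern the deployment code below uses), then rerun the earlier example:

ie = Taskflow('information_extraction', schema=schema, task_path='./checkpoint/model_best')
pprint(ie("Recently, quantum computing expert and ACM Prize in Computing recipient Scott Aaronson announced through his blog that he will be leaving the University of Texas at Austin (UT Austin) for a year this week and joining artificial intelligence research company OpenAI."))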

[
  {
    "Person Name": [{ "end": 32, "probability": 0.9999316942434575, "start": 18, "text": "Scott Aaronson" }],
    "Location": [{ "end": 54, "probability": 0.976469583224933, "start": 51, "text": "Austin" }],
    "Time": [
      { "end": 69, "probability": 0.9782005099896942, "start": 67, "text": "one year" },
      { "end": 2, "probability": 0.9995077236474508, "start": 0, "text": "recently" },
      { "end": 43, "probability": 0.9999382505043286, "start": 41, "text": "this week" }
    ],
    "Organization": [
      { "end": 66, "probability": 0.46570937436359827, "start": 57, "text": "UT Austin" },
      { "end": 56, "probability": 0.9686587700987381, "start": 45, "text": "University of Texas at Austin" },
      { "end": 13, "probability": 0.7166219551892539, "start": 10, "text": "ACM" },
      { "end": 87, "probability": 0.999835617128781, "start": 81, "text": "OpenAI" }
    ]
  }
]

Model Deployment

After obtaining the fine-tuned model, you can deploy the model to HyperAI's server to implement real-time model inference services.

For more information about model deployment, please refer to Model Deployment Introduction and Model Deployment for Chinese Named Entity Recognition Based on Transfer Learning.

Serving Service Development

Write the predictor.py file:

  • Import dependencies: In addition to the libraries used in the business logic, you also need to depend on openbayes-serving.

import openbayes_serving as serv
from paddlenlp import Taskflow
  • Post-processing (optional): Process the results returned by the model as needed for better presentation. In this tutorial, a format() function and an add_o() function reshape the named entity recognition results; a hypothetical sketch follows.
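The actual implementations ship in this tutorial's predictor.py. A minimal, hypothetical sketch of what a format()-style post-processor might do (flattening Taskflow's {label: [spans]} output into position-ordered rows; add_o(), which tags the non-entity text in between, is omitted here):

def format(text, uie_result):
    # Hypothetical sketch; the tutorial's real format()/add_o() may differ.
    # Flatten {label: [span, ...]} into a list of rows for display.
    rows = []
    for label, spans in uie_result.items():
        for span in spans:
            rows.append({
                "label": label,
                "text": span["text"],
                "start": span["start"],
                "end": span["end"],
                "probability": round(span["probability"], 4),
            })
    # Order rows by their position in the original text.
    rows.sort(key=lambda r: r["start"])
    return {"input": text, "entities": rows}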

  • Predictor Class: Does not need to inherit from other classes, but must provide at least two interfaces: __init__ and predict.

    • In __init__, define the entity extraction structure and load the model through Taskflow.
    • In predict, perform prediction and return post-processed results.
class Predictor:
    def __init__(self):
        # Entity types to extract; these match the labels used during fine-tuning.
        self.schema = ['地名', '人名', '组织', '时间', '产品', '价格', '天气']
        # Load the fine-tuned model from the best checkpoint directory.
        self.ie = Taskflow("information_extraction", schema=self.schema, task_path='./checkpoint/model_best')

    def predict(self, json):
        # `json` is the parsed request body, e.g. {"input": "..."}.
        text = json["input"]
        uie = self.ie(text)[0]
        # Post-process the raw Taskflow output for presentation.
        result = format(text, uie)
        return result
  • Run: Start the service.

if __name__ == '__main__':
    serv.run(Predictor)

:::info A ready-made predictor.py has been provided in the root directory of the tutorial and can be used directly later. :::

Testing in Jupyter

Execute OPENBAYES_JOB_URL= python predictor.py in the terminal to start a local test service. Once it has started successfully, execute the following code in this Notebook for testing.

import requests
text = {
  "input": "近日,量子计算专家、ACM 计算奖得主 Scott Aaronson 通过博客宣布,将于本周离开得克萨斯大学奥斯汀分校 (UT Austin) 一年,并加盟人工智能研究公司 OpenAI。"
}
result = requests.post('http://localhost:25252', json=text)
result.json()

The test input translates to: "Recently, quantum computing expert and ACM Prize in Computing laureate Scott Aaronson announced via his blog that he will be leaving the University of Texas at Austin (UT Austin) for one year this week and joining artificial intelligence research company OpenAI." The service returns the extracted entities as JSON.


Deployment

After successful testing, stop this computing container and wait for data synchronization to complete.

In "Compute Container" - "Model Deployment", click "Create New Deployment", select the same image as used during development, bind this compute container, and click "Deploy" to proceed with online testing.

Test Deployment