Tutorial: Assisted Annotation and Training Using Label Studio's Machine Learning Backend
Integrate annotation and training using Label Studio's Machine Learning Backend to accelerate model iteration
This tutorial builds upon Named Entity Recognition Based on UIE and further implements interactive pre-annotation and model training functionalities by integrating Label Studio's Machine Learning Backend.
Environment Setup
- Launch a "Model Training" container in HyperAI, select the `paddlepaddle-2.3` environment, and choose `vgpu` or another GPU container for resources.
- Open a Terminal window in Jupyter, then execute the command `openbayes-label-studio` to start `label-studio`. Open the link shown in the red box in your browser, then register an account and log in.
- Open another Terminal window and execute the following commands to install `label_studio_ml`:

```bash
pip install label_studio_ml
pip uninstall attr
```
Machine Learning Backend Development
The complete Machine Learning Backend can be found in the my_ml_backend.py file. For more information on writing custom machine learning backends, refer to Write your own ML backend.
Simply put, `my_ml_backend.py` mainly contains a class that inherits from `LabelStudioMLBase`, whose content can be divided into three main parts (see the skeleton sketch after this list):
- The `__init__` method, which handles model loading and basic configuration initialization
- The `predict` method, used to generate new prediction results for annotation data; its key parameter `tasks` is the raw data passed by Label Studio
- The `fit` method, used for model training, called when clicking the Train button on the page (the specific location is mentioned below); its key parameter `annotations` is the annotated data passed by Label Studio
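A minimal skeleton of such a class might look as follows. This is a sketch based on the structure described above, not the tutorial's exact file; the method signatures follow the `LabelStudioMLBase` interface used later in this tutorial:

```python
from label_studio_ml.model import LabelStudioMLBase


class MyModel(LabelStudioMLBase):
    def __init__(self, **kwargs):
        # the base class parses the label config and exposes helper variables
        super(MyModel, self).__init__(**kwargs)
        # ... read self.parsed_label_config and load your model here ...

    def predict(self, tasks, **kwargs):
        # tasks: raw Label Studio tasks (JSON dicts) to pre-annotate
        predictions = []
        # ... run the model and fill predictions in Label Studio's format ...
        return predictions

    def fit(self, annotations, workdir=None, **kwargs):
        # annotations: completed annotations used to (re)train the model
        # ... training code ...
        return {'path': workdir}
```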
__init__ Initialization Method
Define and initialize required variables in the __init__ method. The LabelStudioMLBase class provides several special variables available for use:
- `self.label_config`: the original label configuration.
- `self.parsed_label_config`: a structured version of the project's Label Studio label configuration.
- `self.train_output`: the results of previous model training runs, in the same format as the output of the `fit()` method described in the training section below.
In this tutorial's example, the label configuration is:

```xml
<View>
  <Labels name="label" toName="text">
  <Label value="地名" background="#FFA39E"/>
  <Label value="人名" background="#D4380D"/>
  <Label value="组织" background="#FFC069"/>
  <Label value="时间" background="#AD8B00"/>
  <Label value="产品" background="#D3F261"/>
  <Label value="价格" background="#389E0D"/>
  <Label value="天气" background="#5CDBD3"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>
```

The corresponding `parsed_label_config` is shown below:
```python
{
	'label': {
		'type': 'Labels',
		'to_name': ['text'],
		'inputs': [{
			'type': 'Text',
			'value': 'text'
		}],
		'labels': ['地名', '人名', '组织', '时间', '产品', '价格', '天气'],
		'labels_attrs': {
			'地名': {
				'value': '地名',
				'background': '#FFA39E'
			},
			'人名': {
				'value': '人名',
				'background': '#D4380D'
			},
			'组织': {
				'value': '组织',
				'background': '#FFC069'
			},
			'时间': {
				'value': '时间',
				'background': '#AD8B00'
			},
			'产品': {
				'value': '产品',
				'background': '#D3F261'
			},
			'价格': {
				'value': '价格',
				'background': '#389E0D'
			},
			'天气': {
				'value': '天气',
				'background': '#5CDBD3'
			}
		}
	}
}
```

Extract the required information from the `self.parsed_label_config` variable as needed, and load the model for pre-annotation through PaddleNLP's Taskflow:
```python
def __init__(self, **kwargs):
    # don't forget to initialize base class...
    super(MyModel, self).__init__(**kwargs)
    # print("parsed_label_config:", self.parsed_label_config)
    self.from_name, self.info = list(self.parsed_label_config.items())[0]
    assert self.info['type'] == 'Labels'
    assert self.info['inputs'][0]['type'] == 'Text'
    self.to_name = self.info['to_name'][0]
    self.value = self.info['inputs'][0]['value']
    self.labels = list(self.info['labels'])
    # init uie model
    self.model = Taskflow("information_extraction", schema=self.labels, task_path='./checkpoint/model_best')
```

predict Prediction Method
Write code to override the predict(tasks, **kwargs) method. The predict() method accepts Label Studio tasks in JSON format and returns predictions in a format accepted by Label Studio. Additionally, prediction scores can be included and customized for active learning loops.
The tasks parameter contains details about the tasks to be pre-annotated. The specific task format is shown below:
```python
{
	'id': 16,
	'data': {
		'text': '新华社都柏林 6 月 28 日电(记者张琪)第二届"汉语桥"世界小学生中文秀爱尔兰赛区比赛结果日前揭晓,来自都柏林市的小学五年级学生埃拉·戈尔曼获得一等奖。'
	},
	'meta': {},
	'created_at': '2022-07-12T07:05:06.793411Z',
	'updated_at': '2022-07-12T07:05:06.793424Z',
	'is_labeled': False,
	'overlap': 1,
	'inner_id': 6,
	'total_annotations': 0,
	'cancelled_annotations': 0,
	'total_predictions': 0,
	'project': 2,
	'updated_by': None,
	'file_upload': 2,
	'annotations': [],
	'predictions': []
}
```

The specific format can be viewed in Label Studio's data list by clicking "show task source".
To make predictions through Taskflow, the original text first needs to be extracted from the task's ['data']['text'] field; Taskflow then returns UIE predictions grouped by entity type.
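An illustrative Taskflow output for the sample sentence above (the overall structure follows UIE's documented output format; the offsets and probabilities here are made up for illustration, not actual model output):

```python
{
    '时间': [
        {'text': '6 月 28 日', 'start': 6, 'end': 14, 'probability': 0.97}
    ],
    '地名': [
        {'text': '都柏林', 'start': 3, 'end': 6, 'probability': 0.99}
    ],
    '人名': [
        {'text': '张琪', 'start': 18, 'end': 20, 'probability': 0.98}
    ]
}
```

The predict method converts this structure into Label Studio's prediction format: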
```python
# note: requires `import numpy as np` at the top of my_ml_backend.py
def predict(self, tasks, **kwargs):
    from_name = self.from_name
    to_name = self.to_name
    model = self.model

    predictions = []
    # loop over every task
    for task in tasks:
        # print("predict task:", task)
        text = task['data'][self.value]
        uie = model(text)[0]
        # print("uie:", uie)

        # extract entities from the UIE prediction results
        result = []
        scores = []
        for entity_type, entities in uie.items():
            for entity in entities:
                result.append({
                    'from_name': from_name,
                    'to_name': to_name,
                    'type': 'labels',
                    'value': {
                        'start': entity['start'],
                        'end': entity['end'],
                        'text': entity['text'],
                        'labels': [entity_type]
                    },
                    'score': entity['probability']
                })
                scores.append(entity['probability'])
        # sort spans by position and aggregate a per-task score
        result = sorted(result, key=lambda k: k['value']['start'])
        mean_score = np.mean(scores) if len(scores) > 0 else 0

        predictions.append({
            'result': result,
            # optionally you can include prediction scores that you can use to sort the tasks and do active learning
            'score': float(mean_score),
            'model_version': 'uie-ner'
        })
    return predictions
```

fit Training Method
Update the model based on new annotations.
Write code to override the fit() method. The fit() method accepts Label Studio annotations in JSON format and returns an arbitrary JSON dictionary that can store model-related information.
```python
def fit(self, annotations, workdir=None, **kwargs):
    """ This is where training happens: train your model given list of annotations,
        then returns dict with created links and resources
    """
    # print("annotations:", annotations)
    dataset = convert(annotations)
    with open("./doccano_ext.jsonl", "w", encoding="utf-8") as outfile:
        for item in dataset:
            outline = json.dumps(item, ensure_ascii=False)
            outfile.write(outline + "\n")
    os.system('python doccano.py \
        --doccano_file ./doccano_ext.jsonl \
        --task_type "ext" \
        --save_dir ./data \
        --splits 0.5 0.5 0')
    os.system('python finetune.py \
        --train_path "./data/train.txt" \
        --dev_path "./data/dev.txt" \
        --save_dir "./checkpoint" \
        --learning_rate 1e-6 \
        --batch_size 4 \
        --max_seq_len 512 \
        --num_epochs 20 \
        --model "uie-base" \
        --init_from_ckpt "./checkpoint/model_best/model_state.pdparams" \
        --seed 1000 \
        --logging_steps 10 \
        --valid_steps 100 \
        --device "gpu"')
    return {
        'path': workdir
    }
```
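The `convert` helper called at the top of `fit()` is not part of `label_studio_ml`; it turns Label Studio's annotation JSON into doccano-style extraction records that PaddleNLP's doccano.py script can consume. A minimal sketch, assuming the doccano "ext" entity format with start_offset/end_offset fields; the incoming field names follow the task format shown earlier, but treat this as an illustration rather than the tutorial's exact helper:

```python
def convert(annotations):
    """Map Label Studio annotations to doccano-style extraction records."""
    dataset = []
    for i, task in enumerate(annotations):
        entities = []
        # take the labeled spans from the first (and usually only) annotation
        for j, item in enumerate(task['annotations'][0]['result']):
            value = item['value']
            entities.append({
                'id': j,
                'start_offset': value['start'],
                'end_offset': value['end'],
                'label': value['labels'][0],
            })
        dataset.append({
            'id': i,
            'text': task['data']['text'],
            'relations': [],
            'entities': entities,
        })
    return dataset
```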
Machine Learning Integration
Start Machine Learning Backend
Execute the following commands in the terminal sequentially:
```bash
# Initialize custom machine learning backend
label-studio-ml init <my_ml_backend> --script <my_ml_backend.py>
# Start machine learning backend service
label-studio-ml start <my_ml_backend>
```

After successful startup, you can see the ML backend URL in the terminal.
Note: The external access link shown in the red box differs between HyperAI computing containers. The link from this tutorial will not work directly; replace it with the link prompted in your own terminal. You can also replace the IP address with `localhost`.
Add ML Backend to Label Studio
After starting the custom machine learning backend, you can add it to your Label Studio project.
Specific steps are as follows:
- Click Settings - Machine Learning - Add Model
- Fill in the title, ML backend URL, description (optional), and other content
- Select Use for interactive preannotations to enable interactive preannotation functionality (optional)
- Click Validate and Save
If errors occur, please refer to Machine Learning Troubleshooting. Besides adding an ML backend through Label Studio's UI, you can also add one through the API, for example:
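A sketch using Label Studio's REST API, where POST /api/ml is the documented endpoint for registering a backend; the URL, token, and project id below are placeholders to substitute with your own values:

```python
import requests

LS_URL = "http://localhost:8080"      # your Label Studio address
API_TOKEN = "<your-api-token>"        # from Account & Settings in Label Studio
HEADERS = {"Authorization": f"Token {API_TOKEN}"}

# register the ML backend with project 2
resp = requests.post(
    f"{LS_URL}/api/ml",
    headers=HEADERS,
    json={"url": "http://localhost:9090", "project": 2, "title": "uie-ner"},
)
print(resp.status_code, resp.json())
```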
Get Interactive Preannotations
To use the interactive preannotation functionality, you need to enable the Use for interactive preannotations option when adding the ML backend (if you did not, click Edit to change it). Then simply open any data item, and Label Studio will run the ML backend you just set up in the background to generate predictions.
Review the preannotated data and modify the annotations if necessary.
In this example, "经开区" and "局地小冰雹" were not recognized in the preannotation results. After completing modifications or when the preannotation results meet expectations, click Submit to submit the annotation results.
Train the Model
After annotating at least one task, you can start training the model. Click Settings - Machine Learning - Start Training to begin training.
Then return to the window where label-studio-ml-backend was launched, and you can see that training has started. Alternatively, you can start training through the API (see the sketch below) or trigger it with webhooks.
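A sketch of triggering training through the API, where POST /api/ml/{id}/train is the documented endpoint; the token and backend id below are placeholders:

```python
import requests

LS_URL = "http://localhost:8080"
API_TOKEN = "<your-api-token>"
HEADERS = {"Authorization": f"Token {API_TOKEN}"}

ml_backend_id = 1  # the id of the ML backend you added to the project
resp = requests.post(f"{LS_URL}/api/ml/{ml_backend_id}/train", headers=HEADERS)
print(resp.status_code)
```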
Summary
- The Machine Learning Backend provided by Label Studio offers a relatively flexible framework for assisting manual annotation, and we can indeed accelerate the annotation of NLP data through it.
- The enterprise version of Label Studio provides an Active Learning workflow, but judging from its description this workflow is not perfect, especially the fit part. Since Label Studio underestimates the time required for training, the workflow of automatically training after each annotation may not be that smooth.
- We did not use the "Auto-Annotation" feature provided by Label Studio because it has duplicate annotation issues.
- Since Label Studio provides its API, there are actually many possibilities to explore. Combined with webhooks and other features, it could potentially make the annotation and training workflow more efficient.