Tutorial: Assisted Annotation and Training Using Label Studio's Machine Learning Backend
Integrate annotation and training using Label Studio's Machine Learning Backend to accelerate model iteration
This tutorial builds upon Named Entity Recognition Based on UIE and further implements interactive pre-annotation and model training functionalities by integrating Label Studio's Machine Learning Backend.
Environment Setup
- Launch a "Model Training" container in HyperAI, select the `paddlepaddle-2.3` environment, and choose `vgpu` or another GPU container for resources.
- Open a Terminal window in Jupyter, then execute the command `openbayes-label-studio` to start `label-studio`. Open the link shown in the red box in your browser, then register an account and log in.
- Open another Terminal window and execute the following commands to install `label_studio_ml`:

```bash
pip install label_studio_ml
pip uninstall attr
```
Machine Learning Backend Development
The complete Machine Learning Backend can be found in the my_ml_backend.py file. For more information on writing custom machine learning backends, refer to Write your own ML backend.
Simply put, `my_ml_backend.py` mainly contains a class that inherits from `LabelStudioMLBase`, whose content can be divided into three main parts (see the skeleton sketch after this list):
- The `__init__` method, which handles model loading and basic configuration initialization
- The `predict` method, used to generate new prediction results for annotation data; its key parameter `tasks` is the raw data passed by Label Studio
- The `fit` method, used for model training, called when clicking the Train button on the page (the specific location is mentioned below); its key parameter `annotations` is the annotated data passed by Label Studio
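A minimal skeleton of such a class might look as follows. This is a sketch based on the structure described above, not the tutorial's exact file; the method signatures follow the `LabelStudioMLBase` interface used later in this tutorial:

```python
from label_studio_ml.model import LabelStudioMLBase


class MyModel(LabelStudioMLBase):
    def __init__(self, **kwargs):
        # the base class parses the label config and exposes helper variables
        super(MyModel, self).__init__(**kwargs)
        # ... read self.parsed_label_config and load your model here ...

    def predict(self, tasks, **kwargs):
        # tasks: raw Label Studio tasks (JSON dicts) to pre-annotate
        predictions = []
        # ... run the model and fill predictions in Label Studio's format ...
        return predictions

    def fit(self, annotations, workdir=None, **kwargs):
        # annotations: completed annotations used to (re)train the model
        # ... training code ...
        return {'path': workdir}
```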
__init__ Initialization Method
Define and initialize required variables in the __init__ method. The LabelStudioMLBase class provides several special variables available for use:
- `self.label_config`: the original label configuration.
- `self.parsed_label_config`: a structured version of the project's Label Studio label configuration.
- `self.train_output`: the results of previous model training runs, in the same format as the output of the `fit()` method described in the training section below.
In this tutorial's example, the label configuration is:

```xml
<View>
  <Labels name="label" toName="text">
  <Label value="地名" background="#FFA39E"/>
  <Label value="人名" background="#D4380D"/>
  <Label value="组织" background="#FFC069"/>
  <Label value="时间" background="#AD8B00"/>
  <Label value="产品" background="#D3F261"/>
  <Label value="价格" background="#389E0D"/>
  <Label value="天气" background="#5CDBD3"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>
```

The corresponding `parsed_label_config` is shown below:
```python
{
	'label': {
		'type': 'Labels',
		'to_name': ['text'],
		'inputs': [{
			'type': 'Text',
			'value': 'text'
		}],
		'labels': ['地名', '人名', '组织', '时间', '产品', '价格', '天气'],
		'labels_attrs': {
			'地名': {
				'value': '地名',
				'background': '#FFA39E'
			},
			'人名': {
				'value': '人名',
				'background': '#D4380D'
			},
			'组织': {
				'value': '组织',
				'background': '#FFC069'
			},
			'时间': {
				'value': '时间',
				'background': '#AD8B00'
			},
			'产品': {
				'value': '产品',
				'background': '#D3F261'
			},
			'价格': {
				'value': '价格',
				'background': '#389E0D'
			},
			'天气': {
				'value': '天气',
				'background': '#5CDBD3'
			}
		}
	}
}
```

Extract the required information from the `self.parsed_label_config` variable as needed, and load the model for pre-annotation through PaddleNLP's Taskflow:
```python
def __init__(self, **kwargs):
    # don't forget to initialize base class...
    super(MyModel, self).__init__(**kwargs)
    # print("parsed_label_config:", self.parsed_label_config)
    self.from_name, self.info = list(self.parsed_label_config.items())[0]
    assert self.info['type'] == 'Labels'
    assert self.info['inputs'][0]['type'] == 'Text'
    self.to_name = self.info['to_name'][0]
    self.value = self.info['inputs'][0]['value']
    self.labels = list(self.info['labels'])
    # init uie model
    self.model = Taskflow("information_extraction", schema=self.labels, task_path='./checkpoint/model_best')
```

predict Prediction Method
Write code to override the predict(tasks, **kwargs) method. The predict() method accepts Label Studio tasks in JSON format and returns predictions in a format accepted by Label Studio. Additionally, prediction scores can be included and customized for active learning loops.
The tasks parameter contains details about the tasks to be pre-annotated. The specific task format is shown below:
```python
{
	'id': 16,
	'data': {
		'text': '新华社都柏林 6 月 28 日电(记者张琪)第二届"汉语桥"世界小学生中文秀爱尔兰赛区比赛结果日前揭晓,来自都柏林市的小学五年级学生埃拉·戈尔曼获得一等奖。'
	},
	'meta': {},
	'created_at': '2022-07-12T07:05:06.793411Z',
	'updated_at': '2022-07-12T07:05:06.793424Z',
	'is_labeled': False,
	'overlap': 1,
	'inner_id': 6,
	'total_annotations': 0,
	'cancelled_annotations': 0,
	'total_predictions': 0,
	'project': 2,
	'updated_by': None,
	'file_upload': 2,
	'annotations': [],
	'predictions': []
}
```

The specific format can be viewed in Label Studio's data list by clicking "show task source".
To make predictions through Taskflow, the original text first needs to be extracted from the task's ['data']['text'] field; Taskflow then returns UIE predictions grouped by entity type.
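An illustrative Taskflow output for the sample sentence above (the overall structure follows UIE's documented output format; the offsets and probabilities here are made up for illustration, not actual model output):

```python
{
    '时间': [
        {'text': '6 月 28 日', 'start': 6, 'end': 14, 'probability': 0.97}
    ],
    '地名': [
        {'text': '都柏林', 'start': 3, 'end': 6, 'probability': 0.99}
    ],
    '人名': [
        {'text': '张琪', 'start': 18, 'end': 20, 'probability': 0.98}
    ]
}
```

The predict method converts this structure into Label Studio's prediction format: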
```python
# note: requires `import numpy as np` at the top of my_ml_backend.py
def predict(self, tasks, **kwargs):
    from_name = self.from_name
    to_name = self.to_name
    model = self.model

    predictions = []
    # loop over every task
    for task in tasks:
        # print("predict task:", task)
        text = task['data'][self.value]
        uie = model(text)[0]
        # print("uie:", uie)

        # extract entities from the UIE prediction results
        result = []
        scores = []
        for entity_type, entities in uie.items():
            for entity in entities:
                result.append({
                    'from_name': from_name,
                    'to_name': to_name,
                    'type': 'labels',
                    'value': {
                        'start': entity['start'],
                        'end': entity['end'],
                        'text': entity['text'],
                        'labels': [entity_type]
                    },
                    'score': entity['probability']
                })
                scores.append(entity['probability'])
        # sort spans by position and aggregate a per-task score
        result = sorted(result, key=lambda k: k['value']['start'])
        mean_score = np.mean(scores) if len(scores) > 0 else 0

        predictions.append({
            'result': result,
            # optionally you can include prediction scores that you can use to sort the tasks and do active learning
            'score': float(mean_score),
            'model_version': 'uie-ner'
        })
    return predictions
```

fit Training Method
Update the model based on new annotations.
Write code to override the fit() method. The fit() method accepts Label Studio annotations in JSON format and returns an arbitrary JSON dictionary that can store model-related information.
```python
def fit(self, annotations, workdir=None, **kwargs):
    """ This is where training happens: train your model given list of annotations,
        then returns dict with created links and resources
    """
    # print("annotations:", annotations)
    dataset = convert(annotations)
    with open("./doccano_ext.jsonl", "w", encoding="utf-8") as outfile:
        for item in dataset:
            outline = json.dumps(item, ensure_ascii=False)
            outfile.write(outline + "\n")
    os.system('python doccano.py \
        --doccano_file ./doccano_ext.jsonl \
        --task_type "ext" \
        --save_dir ./data \
        --splits 0.5 0.5 0')
    os.system('python finetune.py \
        --train_path "./data/train.txt" \
        --dev_path "./data/dev.txt" \
        --save_dir "./checkpoint" \
        --learning_rate 1e-6 \
        --batch_size 4 \
        --max_seq_len 512 \
        --num_epochs 20 \
        --model "uie-base" \
        --init_from_ckpt "./checkpoint/model_best/model_state.pdparams" \
        --seed 1000 \
        --logging_steps 10 \
        --valid_steps 100 \
        --device "gpu"')
    return {
        'path': workdir
    }
```
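The `convert` helper called at the top of `fit()` is not part of `label_studio_ml`; it turns Label Studio's annotation JSON into doccano-style extraction records that PaddleNLP's doccano.py script can consume. A minimal sketch, assuming the doccano "ext" entity format with start_offset/end_offset fields; the incoming field names follow the task format shown earlier, but treat this as an illustration rather than the tutorial's exact helper:

```python
def convert(annotations):
    """Map Label Studio annotations to doccano-style extraction records."""
    dataset = []
    for i, task in enumerate(annotations):
        entities = []
        # take the labeled spans from the first (and usually only) annotation
        for j, item in enumerate(task['annotations'][0]['result']):
            value = item['value']
            entities.append({
                'id': j,
                'start_offset': value['start'],
                'end_offset': value['end'],
                'label': value['labels'][0],
            })
        dataset.append({
            'id': i,
            'text': task['data']['text'],
            'relations': [],
            'entities': entities,
        })
    return dataset
```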
Machine Learning Integration
Start Machine Learning Backend
Execute the following commands in the terminal sequentially:
```bash
# Initialize custom machine learning backend
label-studio-ml init <my_ml_backend> --script <my_ml_backend.py>
# Start machine learning backend service
label-studio-ml start <my_ml_backend>
```

After successful startup, you can see the ML backend URL in the terminal.
Note: The external access link shown in the red box differs between HyperAI computing containers. The link from this tutorial will not work directly; replace it with the link prompted in your own terminal. You can also replace the IP address with `localhost`.
Add ML Backend to Label Studio
After starting the custom machine learning backend, you can add it to your Label Studio project.
Specific steps are as follows:
- Click Settings - Machine Learning - Add Model
- Fill in the title, ML backend URL, description (optional), and other content
- Select Use for interactive preannotations to enable interactive preannotation functionality (optional)
- Click Validate and Save
If errors occur, please refer to Machine Learning Troubleshooting. Besides adding an ML backend through Label Studio's UI, you can also add one through the API, for example:
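A sketch using Label Studio's REST API, where POST /api/ml is the documented endpoint for registering a backend; the URL, token, and project id below are placeholders to substitute with your own values:

```python
import requests

LS_URL = "http://localhost:8080"      # your Label Studio address
API_TOKEN = "<your-api-token>"        # from Account & Settings in Label Studio
HEADERS = {"Authorization": f"Token {API_TOKEN}"}

# register the ML backend with project 2
resp = requests.post(
    f"{LS_URL}/api/ml",
    headers=HEADERS,
    json={"url": "http://localhost:9090", "project": 2, "title": "uie-ner"},
)
print(resp.status_code, resp.json())
```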
Get Interactive Preannotations
To use the interactive preannotation functionality, you need to enable the Use for interactive preannotations option when adding the ML backend (if you did not, click Edit to change it). Then simply open any data item, and Label Studio will run the ML backend you just set up in the background to generate predictions.
Review the preannotated data and modify the annotations if necessary.
In this example, "经开区" and "局地小冰雹" were not recognized in the preannotation results. After completing modifications or when the preannotation results meet expectations, click Submit to submit the annotation results.
Train the Model
After annotating at least one task, you can start training the model. Click Settings - Machine Learning - Start Training to begin training.
Then return to the window where label-studio-ml-backend was launched, and you can see that training has started. Alternatively, you can start training through the API (see the sketch below) or trigger it with webhooks.
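A sketch of triggering training through the API, where POST /api/ml/{id}/train is the documented endpoint; the token and backend id below are placeholders:

```python
import requests

LS_URL = "http://localhost:8080"
API_TOKEN = "<your-api-token>"
HEADERS = {"Authorization": f"Token {API_TOKEN}"}

ml_backend_id = 1  # the id of the ML backend you added to the project
resp = requests.post(f"{LS_URL}/api/ml/{ml_backend_id}/train", headers=HEADERS)
print(resp.status_code)
```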
Summary
- The Machine Learning Backend provided by Label Studio offers a relatively flexible framework for assisting manual annotation, and we can indeed accelerate the annotation of NLP data through it.
- The enterprise version of Label Studio provides an Active Learning workflow, but judging from its description this workflow is not perfect, especially the fit part. Since Label Studio underestimates the time required for training, the workflow of automatically training after each annotation may not be that smooth.
- We did not use the "Auto-Annotation" feature provided by Label Studio because it has duplicate annotation issues.
- Since Label Studio provides its API, there are actually many possibilities to explore. Combined with webhooks and other features, it could potentially make the annotation and training workflow more efficient.