
Tutorial: Implementing Container Number Detection and Recognition Using PaddleOCR

Using PaddleOCR to train container number detection and recognition models in HyperAI

The complete tutorial is available at https://app.hyper.ai/console/open-tutorials/containers/XJsxhLTnKNu and can be directly "cloned" for use.

Project Introduction

A container number is the identification number of a container used for exporting cargo, and it must be filled in on shipping documents. Standard container numbers are composed according to the ISO 6346 (1995) standard.

Standard container numbers consist of 11 coded characters, for example: CBHU 123456 7, including three parts:

  1. The first part consists of 4 English letters. The first three code letters mainly indicate the container owner and operator, and the fourth code letter indicates the type of container. For example, standard containers starting with CBHU indicate that the owner and operator is COSCO Shipping
  2. The second part consists of 6 digits. This is the container registration code, used as a unique identifier for a container body
  3. The third part is the check digit, the 11th character, calculated from the preceding 4 letters and 6 digits according to the check rule; it is used to detect errors during verification (a worked sketch of the rule follows this list)

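The check rule is easy to express in code. Below is a minimal sketch of the ISO 6346 check-digit computation: letters are mapped to values starting at A=10, skipping multiples of 11; each of the first ten characters is weighted by a doubling power of two; and the weighted sum modulo 11 (with a remainder of 10 mapped to 0) gives the check digit. The function name is ours, not part of any library.

def iso6346_check_digit(code10: str) -> int:
    """Check digit for the first 10 characters (4 letters + 6 digits)."""
    # Letter values start at A=10 and skip multiples of 11 (11, 22, 33)
    letter_values = {}
    value = 10
    for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        if value % 11 == 0:
            value += 1
        letter_values[letter] = value
        value += 1
    total = 0
    for i, ch in enumerate(code10.upper()):
        v = letter_values[ch] if ch.isalpha() else int(ch)
        total += v * (2 ** i)  # the position weight doubles with each character
    return total % 11 % 10  # a remainder of 10 maps to check digit 0

print(iso6346_check_digit("CSQU305438"))  # -> 3, i.e. the full number is CSQU 305438 3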
This tutorial is based on PaddleOCR for container number detection and recognition tasks. It uses a small amount of data to train detection and recognition models separately, and finally connects them together to achieve the task of container number detection and recognition.

Environment Setup

  1. Launch a "Model Training" container on openbayes, select the paddlepaddle-2.3 environment, and select vGPU or another GPU resource

  2. Open a Terminal window in Jupyter, then execute the following commands:

cd PaddleOCR-release-2.5 # Enter the PaddleOCR-release-2.5 folder
pip install -r requirements.txt # Install PaddleOCR dependencies
python setup.py install # Install PaddleOCR

Dataset Introduction

This tutorial uses a container number dataset containing 3003 container images with a resolution of 1920×1080.

  1. The annotation format for PaddleOCR detection model training is as follows, with the two fields separated by "\t":

" Image filename                    Image annotation information encoded with json.dumps"
ch4_test_images/img_61.jpg    [{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]]}, {...}]

The image annotation information before json.dumps encoding is a list containing multiple dictionaries. The points in the dictionary represent the coordinates (x, y) of the four points of the text box, arranged clockwise starting from the top-left point. transcription represents the text content of the current text box. When its content is "###", it indicates that the text box is invalid and will be skipped during training.
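As a quick illustration, a label line in this format can be parsed like this (a minimal sketch; the file name and annotation are taken from the example above):

import json

line = 'ch4_test_images/img_61.jpg\t[{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]]}]'
image_path, annotation_json = line.split("\t")
for box in json.loads(annotation_json):
    if box["transcription"] == "###":  # invalid box, skipped during training
        continue
    print(image_path, box["points"], box["transcription"])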

  2. The annotation format for PaddleOCR recognition model training is as follows, with the two fields separated by "\t":

" Image filename Image annotation information "

train_data/rec/train/word_001.jpg 简单可依赖
train_data/rec/train/word_002.jpg 用科技让复杂的世界更简单

Data Organization

Data Preparation for Detection Model

Divide the 3003 images of the dataset into a training set and a validation set at roughly a 2:1 ratio (2000 for training, 1003 for validation) by running the following code:

from tqdm import tqdm

# Split the full label file: the first 2000 lines go to the training set,
# the remaining 1003 lines go to the validation set
filename = "all_label.txt"
f = open(filename)
lines = f.readlines()
t = open('det_train_label.txt', 'w')
v = open('det_eval_label.txt', 'w')
count = 0
for line in tqdm(lines):
    if count < 2000:
        t.writelines(line)
        count += 1
    else:
        v.writelines(line)
f.close()
t.close()
v.close()
100%|██████████| 3003/3003 [00:00<00:00, 37908.32it/s]

Data Preparation for Recognition Model

Using the annotations from the detection labels, we crop the text regions out of the original images, keeping each crop as tight around the text as possible, to serve as recognition data. Run the following code:

from PIL import Image, ImageDraw
import json
from tqdm import tqdm
import os
import math

class Rotate(object):
    """Mask out a quadrilateral text region, rotate it level, and crop it."""

    def __init__(self, image: Image.Image, coordinate):
        self.image = image.convert('RGB')
        self.coordinate = coordinate
        self.xy = [tuple(self.coordinate[k]) for k in ['left_top', 'right_top', 'right_bottom', 'left_bottom']]
        self._mask = None
        self.image.putalpha(self.mask)
    @property
    def mask(self):
        # Build the polygon mask for the text region once and cache it
        if not self._mask:
            mask = Image.new('L', self.image.size, 0)
            draw = ImageDraw.Draw(mask, 'L')
            draw.polygon(self.xy, fill=255)
            self._mask = mask
        return self._mask

    def run(self):
        # Rotate the masked image level, then crop to its bounding box
        image = self.rotation_angle()
        box = image.getbbox()
        return image.crop(box)

    def rotation_angle(self):
        # Angle between the top edge of the text box and the horizontal axis
        x1, y1 = self.xy[0]
        x2, y2 = self.xy[1]
        angle = self.angle([x1, y1, x2, y2], [0, 0, 10, 0]) * -1
        return self.image.rotate(angle, expand=True)

    def angle(self, v1, v2):
        # Included angle, in degrees, between two vectors given as [x1, y1, x2, y2]
        dx1 = v1[2] - v1[0]
        dy1 = v1[3] - v1[1]
        dx2 = v2[2] - v2[0]
        dy2 = v2[3] - v2[1]
        angle1 = int(math.atan2(dy1, dx1) * 180 / math.pi)
        angle2 = int(math.atan2(dy2, dx2) * 180 / math.pi)
        if angle1 * angle2 >= 0:
            included_angle = abs(angle1 - angle2)
        else:
            included_angle = abs(angle1) + abs(angle2)
            if included_angle > 180:
                included_angle = 360 - included_angle
        return included_angle

def image_cut_save(path, bbox, save_path):
    """
    :param path: path of the source image
    :param bbox: four corner points of the text box, ordered clockwise from the
                 top-left: [left_top, right_top, right_bottom, left_bottom]
    :param save_path: path to save the cropped text image
    """
    img = Image.open(path)
    coordinate = {'left_top': bbox[0], 'right_top': bbox[1], 'right_bottom': bbox[2], 'left_bottom': bbox[3]}
    rotate = Rotate(img, coordinate)

    left, upper = bbox[0]
    right, lower = bbox[2]
    # A box taller than it is wide means vertical text: rotate the crop
    # 90 degrees so the text reads horizontally
    if lower - upper > right - left:
        rotate.run().convert('RGB').transpose(Image.ROTATE_90).save(save_path)
    else:
        rotate.run().convert('RGB').save(save_path)
    return True

# Read detection annotations to create recognition dataset
files = ["det_train_label.txt","det_eval_label.txt"]
filetypes =["train","eval"]
for index,filename in enumerate(files):
    f = open(filename)
    l = open('rec_'+filetypes[index]+'_label.txt','w')
    if index == 0:
        data_dir = "RecTrainData"
    else:
        data_dir = "RecEvalData"
    if not os.path.exists(data_dir):
        os.mkdir(data_dir)
    lines = f.readlines()
    for line in tqdm(lines):
        image_name = line.split("\t")[0].split("/")[-1]
        annos = json.loads(line.split("\t")[-1])
        img_path = os.path.join("/input0/images",image_name)
        for i,anno in enumerate(annos):
            data_path = os.path.join(data_dir,str(i)+"_"+image_name)
            if image_cut_save(img_path,anno["points"],data_path):
                l.writelines(str(i)+"_"+image_name+"\t"+anno["transcription"]+"\n")
    l.close()
    f.close()
100%|██████████| 2000/2000 [02:55<00:00, 11.39it/s]
100%|██████████| 1003/1003 [01:30<00:00, 11.05it/s]

Experiment

Due to the limited size of the dataset, to achieve better and faster model convergence, the PP-OCRv3 model from PaddleOCR is selected for both detection and recognition. Compared with PP-OCRv2, PP-OCRv3 improves the end-to-end Hmean metric by 5% in Chinese scenarios and by 11% in English-and-digit scenarios. For details of the optimizations, please refer to the PP-OCRv3 technical report.

Detection Model

Detection Model Configuration

PaddleOCR provides many detection models, which can be found along with their configuration files in the path PaddleOCR-release-2.5/configs/det. For example, if we select the model ch_PP-OCRv3_det_student.yml, its configuration file path is: PaddleOCR-release-2.5/configs/det/ch_PP-OCRv3/ch_PP-OCRv3_det_student.yml. Necessary settings such as training parameters and dataset paths need to be configured before use. Some key configurations are displayed below:

#Key training parameters
use_gpu: true #Whether to use GPU
epoch_num: 1200 #Number of training epochs
save_model_dir: ./output/ch_PP-OCR_V3_det/ #Model save path
save_epoch_step: 200 #Save model every 200 epochs
eval_batch_step: [0, 100] #Perform validation every 100 training iterations
pretrained_model: ./PaddleOCR-release-2.5/pretrain_models/ch_PP-OCR_V3_det/best_accuracy.pdparams #Pretrained model path
#Training set path configuration
Train:
  dataset:
    name: SimpleDataSet
    data_dir: /input0/images #Image folder path
    label_file_list:
      - ./det_train_label.txt #Label path

Model Fine-tuning

Run the following command in the notebook to fine-tune the model, where -c passes the path of the model's configuration file:

%run PaddleOCR-release-2.5/tools/train.py \
    -c PaddleOCR-release-2.5/configs/det/ch_PP-OCRv3/ch_PP-OCRv3_det_student.yml

Using default hyperparameters, the model ch_PP-OCRv3_det_student was trained on the training set for 385 epochs, reaching an hmean of 96.96% on the validation set, with no significant improvement thereafter:

[2022/10/11 06:36:09] ppocr INFO: best metric, hmean: 0.969551282051282, precision: 0.9577836411609498,
recall: 0.981611681990265, fps: 20.347745459258228, best_epoch: 385
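Hmean here is the harmonic mean of detection precision and recall (the detection F1 score); the logged best value can be reproduced from the logged precision and recall:

# Hmean = harmonic mean of precision and recall
precision, recall = 0.9577836411609498, 0.981611681990265
hmean = 2 * precision * recall / (precision + recall)
print(round(hmean, 6))  # 0.969551, matching the logged best metric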

Recognition Model

Recognition Model Configuration

PaddleOCR also provides many recognition models, which can be found along with their configuration files in the path PaddleOCR-release-2.5/configs/rec. For example, if we choose the model ch_PP-OCRv3_rec_distillation, its configuration file path is: PaddleOCR-release-2.5/configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml. Before use, necessary settings need to be configured, such as training parameters, dataset paths, etc. Some key configurations are shown below:

#Key training parameters
use_gpu: true #Whether to use GPU
epoch_num: 1200 #Number of training epochs
save_model_dir: ./output/rec_ppocr_v3_distillation #Model save path
save_epoch_step: 200 #Save model every 200 epochs
eval_batch_step: [0, 100] #Perform validation every 100 iterations during training
pretrained_model: ./PaddleOCR-release-2.5/pretrain_models/PPOCRv3/best_accuracy.pdparams #Pretrained model path
#Training set path configuration
Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./RecTrainData/ #Image folder path
    label_file_list:
      - ./rec_train_label.txt #Label path

Model Fine-tuning

Run the following command in the notebook to fine-tune the model, where -c passes the path of the model's configuration file:

%run PaddleOCR-release-2.5/tools/train.py \
    -c PaddleOCR-release-2.5/configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml

Using default hyperparameters, the model ch_PP-OCRv3_rec_distillation was trained on the training set for 136 epochs, reaching an accuracy of 96.11% on the validation set, with no significant improvement thereafter:

[2022/10/11 20:04:28] ppocr INFO: best metric, acc: 0.9610600272522444, norm_edit_dis: 0.9927426548965615,
Teacher_acc: 0.9540291998159589, Teacher_norm_edit_dis: 0.9905629345025616, fps: 246.029195787707, best_epoch: 136
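Here acc is the fraction of exactly matched strings, and norm_edit_dis is an edit-distance-based similarity. A minimal sketch, assuming the usual definition of one minus the normalized Levenshtein distance (PaddleOCR's exact implementation may differ):

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def norm_edit_dis(pred: str, label: str) -> float:
    return 1 - levenshtein(pred, label) / max(len(pred), len(label), 1)

print(norm_edit_dis("TTEMU3108252", "TEMU3108252"))  # one extra char -> ~0.917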

Model Inference

Detection Model Inference

Run the following command in the notebook to detect text in the test images with the fine-tuned model, where Global.infer_img is the image path or image folder path, Global.pretrained_model is the fine-tuned model, and Global.save_res_path is the path to save the inference results:

%run PaddleOCR-release-2.5/tools/infer_det.py \
    -c PaddleOCR-release-2.5/configs/det/ch_PP-OCRv3/ch_PP-OCRv3_det_student.yml \
    -o Global.infer_img="/input0/images" Global.pretrained_model="./output/ch_PP-OCR_V3_det/best_accuracy" Global.save_res_path="./output/det_infer_res/predicts.txt"

Recognition Model Inference

Run the following command in the notebook to recognize text in the test images with the fine-tuned model, where Global.infer_img is the image path or image folder path, Global.pretrained_model is the fine-tuned model, and Global.save_res_path is the path to save the inference results:

%run PaddleOCR-release-2.5/tools/infer_rec.py \
    -c PaddleOCR-release-2.5/configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml \
    -o Global.infer_img="./RecEvalData/" Global.pretrained_model="./output/rec_ppocr_v3_distillation/best_accuracy" Global.save_res_path="./output/rec_infer_res/predicts.txt"

Serial Inference of Detection and Recognition Models

Model Conversion

Before serial inference, the trained models need to be converted to inference models. Execute the following commands for the detection and recognition models respectively, where -c passes the configuration file path of the model to be converted, -o Global.pretrained_model is the model file to be converted, and Global.save_inference_dir is the save path for the converted inference model:

# Detection model conversion
%run PaddleOCR-release-2.5/tools/export_model.py \
-c PaddleOCR-release-2.5/configs/det/ch_PP-OCRv3/ch_PP-OCRv3_det_student.yml \
-o Global.pretrained_model="./output/ch_PP-OCR_V3_det/best_accuracy" Global.save_inference_dir="./output/det_inference/"
# Recognition model conversion
%run PaddleOCR-release-2.5/tools/export_model.py \
-c PaddleOCR-release-2.5/configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml \
-o Global.pretrained_model="./output/rec_ppocr_v3_distillation/best_accuracy" Global.save_inference_dir="./output/rec_inference/"

[2022/10/11 07:10:33] ppocr INFO: load pretrain successful from ./output/rec_ppocr_v3_distillation/best_accuracy
[2022/10/11 07:10:35] ppocr INFO: inference model is saved to ./output/rec_inference/Teacher/inference
[2022/10/11 07:10:36] ppocr INFO: inference model is saved to ./output/rec_inference/Student/inference

Model Pipeline Inference

After the conversion is complete, PaddleOCR provides a concatenation tool for detection and recognition models, which can combine any trained detection model with any recognition model into a two-stage text recognition system. The input image goes through four main stages: text detection, detection box correction, text recognition, and score filtering, to output text positions and recognition results. The execution code is as follows, where image_dir is the path to a single image or image collection, det_model_dir is the path to the detection inference model, and rec_model_dir is the path to the recognition inference model. The visualization results are saved by default to the ./inference_results folder.

%run PaddleOCR-release-2.5/tools/infer/predict_system.py \
--image_dir="OCRTest" \
--det_model_dir="./output/det_inference/" \
--rec_model_dir="./output/rec_inference/Student/"


[2022/10/11 07:10:46] ppocr INFO: In PP-OCRv3, rec_image_shape parameter defaults to '3, 48, 320', if you are using recognition model with PP-OCRv2 or an older version, please set --rec_image_shape='3,32,320'
[2022/10/11 07:10:48] ppocr DEBUG: dt_boxes num : 2, elapse : 1.0023341178894043
[2022/10/11 07:10:48] ppocr DEBUG: rec_res num : 2, elapse : 0.02405834197998047
[2022/10/11 07:10:48] ppocr DEBUG: 0 Predict time of OCRTest/1-122700001-OCR-LF-C01.jpg: 1.041s
[2022/10/11 07:10:48] ppocr DEBUG: TTEMU3108252, 0.864
[2022/10/11 07:10:48] ppocr DEBUG: 22G1, 0.843
[2022/10/11 07:10:48] ppocr DEBUG: The visualized image saved in ./inference_results/1-122700001-OCR-LF-C01.jpg
[2022/10/11 07:10:48] ppocr DEBUG: dt_boxes num : 1, elapse : 0.047757863998413086
[2022/10/11 07:10:48] ppocr DEBUG: rec_res num : 1, elapse : 0.016452789306640625
[2022/10/11 07:10:48] ppocr DEBUG: 1 Predict time of OCRTest/1-122720001-OCR-AH-A01.jpg: 0.073s
[2022/10/11 07:10:48] ppocr DEBUG: The visualized image saved in ./inference_results/1-122720001-OCR-AH-A01.jpg
[2022/10/11 07:10:48] ppocr DEBUG: dt_boxes num : 2, elapse : 0.05301952362060547
[2022/10/11 07:10:48] ppocr DEBUG: rec_res num : 2, elapse : 0.020509719848632812
[2022/10/11 07:10:48] ppocr DEBUG: 2 Predict time of OCRTest/1-122720001-OCR-AS-B01.jpg: 0.081s
[2022/10/11 07:10:48] ppocr DEBUG: EITU1786393, 0.990
[2022/10/11 07:10:48] ppocr DEBUG: 45G1, 0.963
[2022/10/11 07:10:49] ppocr DEBUG: The visualized image saved in ./inference_results/1-122720001-OCR-AS-B01.jpg
[2022/10/11 07:10:49] ppocr DEBUG: dt_boxes num : 2, elapse : 0.049460411071777344
[2022/10/11 07:10:49] ppocr DEBUG: rec_res num : 2, elapse : 0.020053625106811523
[2022/10/11 07:10:49] ppocr DEBUG: 3 Predict time of OCRTest/1-122720001-OCR-LB-C02.jpg: 0.077s
[2022/10/11 07:10:49] ppocr DEBUG: LTU1, 0.814
[2022/10/11 07:10:49] ppocr DEBUG: 45G1, 0.997
[2022/10/11 07:10:49] ppocr DEBUG: The visualized image saved in ./inference_results/1-122720001-OCR-LB-C02.jpg
[2022/10/11 07:10:49] ppocr DEBUG: dt_boxes num : 2, elapse : 0.051781654357910156
[2022/10/11 07:10:49] ppocr DEBUG: rec_res num : 2, elapse : 0.020511150360107422
[2022/10/11 07:10:49] ppocr DEBUG: 4 Predict time of OCRTest/1-122720001-OCR-RF-D01.jpg: 0.081s
[2022/10/11 07:10:49] ppocr DEBUG: EITU1786393, 0.966
[2022/10/11 07:10:49] ppocr DEBUG: 45G1, 0.939
[2022/10/11 07:10:49] ppocr DEBUG: The visualized image saved in ./inference_results/1-122720001-OCR-RF-D01.jpg
[2022/10/11 07:10:49] ppocr DEBUG: dt_boxes num : 0, elapse : 0.04465031623840332
[2022/10/11 07:10:49] ppocr DEBUG: rec_res num : 0, elapse : 1.430511474609375e-06
[2022/10/11 07:10:49] ppocr DEBUG: 5 Predict time of OCRTest/1-122728001-OCR-AH-A01.jpg: 0.049s
[2022/10/11 07:10:49] ppocr DEBUG: The visualized image saved in ./inference_results/1-122728001-OCR-AH-A01.jpg
[2022/10/11 07:10:49] ppocr INFO: The predict total time is 2.9623537063598633
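Note that the pipeline also returns container size/type codes such as 22G1 and 45G1, plus occasional truncated reads such as LTU1. Because the check rule from the introduction is known, recognized strings can be filtered down to valid container numbers; a minimal sketch, reusing the iso6346_check_digit helper defined in the introduction:

import re

def is_valid_container_number(text: str) -> bool:
    # 4 owner/type letters + 6 serial digits + 1 check digit
    m = re.fullmatch(r"([A-Z]{4}\d{6})(\d)", text)
    return bool(m) and iso6346_check_digit(m.group(1)) == int(m.group(2))

print(is_valid_container_number("CSQU3054383"))  # True
print(is_valid_container_number("22G1"))         # False: a size/type code, not a number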

OCR Result

[Figure: visualized result from the serial pipeline, with the detected container number region and the recognized text drawn on the image, as saved to the ./inference_results folder.]