CAPTCHA Recognition with ddddocr


Why did I ever get into web scraping? Luckily ddddocr (带带弟弟OCR) came to the rescue.

ONNX

On an M1 Mac, onnxruntime has to be installed first:

brew install onnxruntime

DDDDOCR

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple ddddocr

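Usage

import ddddocr

# show_ad=False suppresses the banner printed at start-up
ocr = ddddocr.DdddOcr(show_ad=False)

# Read the CAPTCHA image as raw bytes
with open("test.png", 'rb') as f:
    image = f.read()

# classification() returns the recognized text
res = ocr.classification(image)
print(res)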
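When there is a whole folder of CAPTCHAs to check, the call above can be wrapped in a small loop. The following is only a sketch: `classify_folder` and its `classify` parameter are hypothetical names introduced here, and the recognizer is passed in as a plain callable so the helper does not depend on any particular ddddocr version.

```python
from pathlib import Path


def classify_folder(folder, classify, pattern="*.png"):
    """Run `classify` on the raw bytes of every matching file in `folder`.

    `classify` is any callable that takes image bytes and returns text,
    for example the bound method `ocr.classification`.
    Returns a dict mapping file name to recognized text.
    """
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        results[path.name] = classify(path.read_bytes())
    return results
```

With the `ocr` object from the usage snippet, this would be called as `classify_folder("captchas", ocr.classification)`, where `captchas` is a placeholder directory name.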

Errors

AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'

ANTIALIAS was removed in newer Pillow releases (10.0.0 and later). Downgrading fixes it: pip install Pillow==9.5.0
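If downgrading Pillow is not an option, a commonly used workaround is to put the missing name back before the OCR call runs. ANTIALIAS was an alias for the LANCZOS filter, so the sketch below simply restores it. `patch_antialias` is a hypothetical helper name, and the stand-in object at the bottom exists only so the snippet runs without Pillow installed. I have not checked this against every ddddocr release, and newer ddddocr versions may no longer need it.

```python
from types import SimpleNamespace


def patch_antialias(image_module):
    """Restore the ANTIALIAS name on a PIL.Image-like module if it is missing."""
    if not hasattr(image_module, "ANTIALIAS"):
        # ANTIALIAS used to be another name for the LANCZOS resampling filter.
        image_module.ANTIALIAS = image_module.LANCZOS
    return image_module


# Stand-in for PIL.Image (assumption: only the LANCZOS attribute matters here).
fake_image = SimpleNamespace(LANCZOS=1)
patch_antialias(fake_image)
```

In a real script this would be `from PIL import Image` followed by `patch_antialias(Image)`, placed before the first `ocr.classification(...)` call.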

Training

Installing PyTorch

Check the official PyTorch site to find the version your environment supports.

dddd_trainer

Install dddd_trainer and initialize it. See the official GitHub repository, which is already quite detailed.

Commands

python app.py create ztest
python app.py cache ztest /opt/dddd-ocr/images_set
python app.py train ztest

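If you hit the error below, the sample set is probably too small. In the log, the validation cache ends up with 0 entries, so the DataLoader has nothing to sample from.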

Read Cache File End! Caches Num is 4.
2023-08-04 15:23:28.176 | INFO     | utils.load_cache:__init__:25 - 
Reading Cache File... ----> /opt/dddd-ocr/dddd_trainer/projects/ztest/cache/cache.val.tmp
2023-08-04 15:23:28.176 | INFO     | utils.load_cache:__init__:30 - 
Read Cache File End! Caches Num is 0.
Traceback (most recent call last):
  File "app.py", line 33, in <module>
    fire.Fire(App)
  File "/opt/dddd-ocr/venv/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/dddd-ocr/venv/lib/python3.7/site-packages/fire/core.py", line 480, in _Fire
    target=component.__name__)
  File "/opt/dddd-ocr/venv/lib/python3.7/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "app.py", line 27, in train
    trainer = train.Train(project_name)
  File "/opt/dddd-ocr/dddd_trainer/utils/train.py", line 83, in __init__
    loaders = load_cache.GetLoader(project_name)
  File "/opt/dddd-ocr/dddd_trainer/utils/load_cache.py", line 147, in __init__
    num_workers=0, collate_fn=self.collate_to_sparse),
  File "/opt/dddd-ocr/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 344, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/opt/dddd-ocr/venv/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 108, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
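The log shows 4 training entries and 0 validation entries, which is what happens when a small ratio of a tiny dataset rounds down to zero. A rough pre-check before running `cache` could look like the sketch below. Everything here is an assumption for illustration: the function names are made up, and `val_ratio=0.03` is a guessed value, so check the project's config file for the real validation setting; the trainer's actual split logic may differ.

```python
import os

# File extensions treated as images (assumption; adjust to your dataset).
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def count_images(folder):
    """Count files in `folder` whose extension looks like an image."""
    return sum(
        1 for name in os.listdir(folder)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTS
    )


def split_sizes(total, val_ratio=0.03):
    """Return (train, val) sizes when `val_ratio` of the samples is held out."""
    val = int(total * val_ratio)  # rounds down, so tiny datasets give 0
    return total - val, val


def enough_samples(total, val_ratio=0.03):
    """True if the validation split would contain at least one sample."""
    return split_sizes(total, val_ratio)[1] > 0
```

For example, `enough_samples(count_images("/opt/dddd-ocr/images_set"))` returning False would suggest collecting more labeled images before training.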

Once training finishes, just run ddddocr with the offline model.

Example from the official GitHub:

ocr = ddddocr.DdddOcr(det=False, ocr=False, import_onnx_path="myproject_0.984375_139_13000_2022-02-26-15-34-13.onnx", charsets_path="charsets.json")