ocr

OCR image recognition: Go SDK + command-line tool

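  • Languages, accelerated download of trained data and usage steps
  • Pool, ocr.Scan, ocr.ScanClipboard, Go SDK and command-line usage
  • poolSize, ocr.ScanBytes, graceful exit of the command-line process and special handling
  • Whitelist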

Tesseract installation

Tesseract Installation: https://github.com/otiai10/gosseract#installation
https://github.com/tesseract-ocr/tessdoc/blob/main/Installation.md

Troubleshooting: see we-mid/bec-grpc#troubleshootings

Download the latest trained data for Simplified Chinese and other languages

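# Download GitHub raw files faster through the gitmirror mirror
gitrawdown() {
    if [ $# -lt 2 ]; then
        echo "missing arguments: gitrawdown <url> <output-file>" >&2
        return 1
    fi
    # Rewrite a github.com/.../blob/... URL to its raw form, then swap in the mirror host
    local url=$(echo "$1" | \
        sed 's|github.com/\(.*\)/\(.*\)/blob/\(.*\)|raw.githubusercontent.com/\1/\2/refs/heads/\3|g' | \
        sed 's|raw.githubusercontent.com|raw.gitmirror.com|')
    echo "Downloading... $url"
    echo "=> $2"
    curl -fL "$url" > "$2"
}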
# Usage: gitrawdown https://raw... ~/Downloads/tessdata_fast/chi_sim.traineddata
mkdir -p ~/Downloads/tessdata_fast
cd ~/Downloads/tessdata_fast
gitrawdown https://raw.githubusercontent.com/tesseract-ocr/tessdata_fast/refs/heads/main/chi_sim.traineddata \
    chi_sim.traineddata
gitrawdown https://raw.githubusercontent.com/tesseract-ocr/tessdata_fast/refs/heads/main/eng.traineddata \
    eng.traineddata
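
To see which URL `gitrawdown` would fetch before downloading anything, the rewrite step can be pulled out on its own. The sketch below reuses the same two `sed` expressions; the function name `gitraw_mirror_url` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repo): print the mirror URL that
# gitrawdown would download from, without downloading anything.
gitraw_mirror_url() {
    printf '%s\n' "$1" | \
        sed 's|github.com/\(.*\)/\(.*\)/blob/\(.*\)|raw.githubusercontent.com/\1/\2/refs/heads/\3|' | \
        sed 's|raw.githubusercontent.com|raw.gitmirror.com|'
}

# A github.com "blob" URL and a raw.githubusercontent.com URL both end up on the mirror host.
gitraw_mirror_url https://github.com/tesseract-ocr/tessdata_fast/blob/main/chi_sim.traineddata
gitraw_mirror_url https://raw.githubusercontent.com/tesseract-ocr/tessdata_fast/refs/heads/main/eng.traineddata
```

If the mirror is unreachable, the printed URL also makes it easy to fall back to the original raw.githubusercontent.com address by hand.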

Usage: as a command-line tool

go install gitee.com/we-mid/go/ocr/cmd/ocrscan@latest

# To use the trained data you downloaded yourself
export TESSDATA_PREFIX=~/Downloads/tessdata_fast

# Read an image from a file path (keep ~ outside the quotes so the shell expands it)
ocrscan -l chi_sim,eng ~/Desktop/'截屏2024-10-16 10.23.27.png'

# Read an image from the clipboard
ocrscan -l chi_sim,eng -c

>> 访问来源Top10                            自然月    4
用户类型   全部用户
指标筛选   访问人数    打开次数
任务栏 ee 3,199
手机端搜索 Se 953
Android系统 mm = 320
发现 小程序 = 190
单聊分享 p 385
PC端 , 41
小程序功能 ) 31
收藏 | 25
群聊分享 8
公众号菜单 4
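
When `TESSDATA_PREFIX` points at a directory you populated yourself, a missing `<lang>.traineddata` file is an easy mistake to make. The following is a minimal sketch of a pre-flight check, assuming the files are named `<lang>.traineddata` as in the download step above; `check_tessdata` is a hypothetical helper, not something `ocrscan` provides.

```shell
# Hypothetical helper (not part of ocrscan): verify that every language in a
# comma-separated list has a matching .traineddata file in the given directory.
#   $1: tessdata directory, $2: language list such as "chi_sim,eng"
check_tessdata() {
    missing=0
    for lang in $(printf '%s' "$2" | tr ',' ' '); do
        if [ ! -f "$1/$lang.traineddata" ]; then
            echo "missing: $1/$lang.traineddata" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical use would be `check_tessdata "$TESSDATA_PREFIX" chi_sim,eng && ocrscan -l chi_sim,eng -c`, so the scan only runs when both files are present.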

Usage: as a Go SDK

import "gitee.com/we-mid/go/ocr"

func main() {
	ocrTeardown := ocr.Setup(3) // poolSize
	defer ocrTeardown()

	// languages are Tesseract language codes; see
	// https://github.com/tesseract-ocr/tessdata_fast
	// - []  - ["eng"]  - ["chi_sim", "eng"]

	// Use whichever entry point fits; each returns the recognized text and an error.
	text, err := ocr.ScanClipboard(languages) // image from the clipboard
	text, err = ocr.ScanBytes(languages, bs)  // image bytes
	text, err = ocr.Scan(languages, filePath) // image file path
}

Related resources

Available Languages: https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-different-versions.md

Trained Data: Hans => chi_sim https://github.com/tesseract-ocr/tessdata_best/blob/main/chi_sim.traineddata https://github.com/tesseract-ocr/tessdata_fast/blob/main/chi_sim.traineddata

GitHub Raw Mirror: https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/refs/heads/main/chi_sim.traineddata => https://raw.gitmirror.com/tesseract-ocr/tessdata_best/refs/heads/main/chi_sim.traineddata