Contents
What is OCR
Tess4j example
Image text recognition: managing sensitive words
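What is OCR
OCR (Optical Character Recognition) is the process by which an electronic device (for example a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text.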
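| Option | Description |
|---|---|
| Baidu OCR | Paid service |
| Tesseract-OCR | Open-source OCR engine whose development was long sponsored by Google; can be called from Java, Python, and other languages |
| Tess4J | Wraps Tesseract-OCR so it can be called from Java |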
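Tesseract-OCR features:
- Tesseract supports UTF-8 and can recognize more than 100 languages "out of the box".
- Tesseract supports several output formats: plain text, hOCR (HTML), PDF, and others.
- The official documentation recommends supplying high-quality images to get better OCR results.
- Tesseract can be trained to recognize other languages. For the training procedure, see the official documentation: https://github.com/tesseract-ocr/tessdoc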
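The advice about image quality can be made concrete with a small preprocessing step. The following is a minimal sketch, not part of the original tutorial: it assumes that converting the image to pure black and white with a fixed brightness threshold helps on clean scans. The class name `OcrPreprocess` and the threshold value are hypothetical, and real images may need a different threshold or a different technique altogether.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: turns an image into black and white before OCR.
public class OcrPreprocess {

    /**
     * Returns a black-and-white copy of src. Pixels darker than the
     * threshold (0-255) become black, all others become white.
     */
    public static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness
                int luma = (r * 299 + g * 587 + b * 114) / 1000;
                out.setRGB(x, y, luma < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

An image loaded with `javax.imageio.ImageIO.read` could be passed through `binarize(image, 128)` and then handed to the OCR engine. Tess4J's `doOCR` has an overload that accepts a `BufferedImage`.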
Tess4j example
Create a project and import the tess4j dependency
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.1.1</version>
</dependency>
Import the Chinese language data: copy the tessdata folder from the course materials into your own workspace.

Write a test class to try it out
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import java.io.File;
public class Application {
    public static void main(String[] args) throws TesseractException {
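        // Create the Tesseract object
        ITesseract tesseract = new Tesseract();
        // Set the path of the language data (tessdata) directory
        tesseract.setDatapath("D:\\");
        // Set the recognition language: Simplified Chinese
        tesseract.setLanguage("chi_sim");
        // Run OCR on the image
        String result = tesseract.doOCR(new File("D:\\123.png"));
        // Print the recognized text
        System.out.println("Recognition result: " + result);
    }
}
Image text recognition: managing sensitive words
I. First, create a parent project named tess4j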
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.3.9.RELEASE</version>
    </parent>
    <groupId>org.example</groupId>
    <artifactId>tess4j</artifactId>
    <packaging>pom</packaging>
    <version>1.0-SNAPSHOT</version>
    <modules>
        <module>tess4j-test</module>
        <module>utils</module>
    </modules>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
</project>
Next, create two child modules: tess4j-test and utils
II. Create the utils module
1. Configure the pom file of the utils module
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>tess4j</artifactId>
        <groupId>org.example</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>utils</artifactId>
    <dependencies>
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>4.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-test</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <scope>provided</scope>
        </dependency>
    </dependencies>
</project>
2. Create the image text recognition utility class Tess4jClient
package com.test.utils;
import lombok.Getter;
import lombok.Setter;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;
import java.io.File;
@Getter
@Setter
@Component
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jClient {
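    private String dataPath; // path of the language data (tessdata) directory
    private String language; // language data to use: Chinese, English, Japanese, etc.
    public String doOCR(File image) throws TesseractException {
        // Create the Tesseract object
        ITesseract tesseract = new Tesseract();
        // Set the language data path
        tesseract.setDatapath(dataPath);
        // Set the recognition language
        tesseract.setLanguage(language);
        // Run OCR
        String result = tesseract.doOCR(image);
        // Replace line breaks with "-" and remove spaces so the result is a single line
        result = result.replaceAll("\\r|\\n", "-").replaceAll(" ", "");
        return result;
    }
}
3. Under resources, create a META-INF folder, create a spring.factories file in it, and register the class there. The complete file: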
org.springframework.boot.autoconfigure.EnableAutoConfiguration=\
  com.test.utils.Tess4jClient
4. Create the self-managed sensitive word checking utility class SensitiveWordUtil
package com.test.utils;
import java.util.*;
public class SensitiveWordUtil {
    public static Map<String, Object> dictionaryMap = new HashMap<>();
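    /**
     * Builds the keyword dictionary.
     * @param words the sensitive words to index
     */
    public static void initMap(Collection<String> words) {
        if (words == null) {
            System.out.println("The sensitive word list must not be null");
            return ;
        }
        // Initial capacity is words.size(); the number of entry characters in the dictionary is at most words.size(), because different words may share the same first character
        Map<String, Object> map = new HashMap<>(words.size());
        // Data of the current level during traversal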
        Map<String, Object> curMap = null;
        Iterator<String> iterator = words.iterator();
        while (iterator.hasNext()) {
            String word = iterator.next();
            curMap = map;
            int len = word.length();
            for (int i =0; i < len; i++) {
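                // Iterate over each character of the word
                String key = String.valueOf(word.charAt(i));
                // Check whether the current character exists at the current level; create it if not, then move the current level to that node and continue
                Map<String, Object> wordMap = (Map<String, Object>) curMap.get(key);
                if (wordMap == null) {
                    // Each node holds two kinds of data: child nodes and isEnd (end-of-word flag)
                    wordMap = new HashMap<>(2);
                    wordMap.put("isEnd", "0");
                    curMap.put(key, wordMap);
                }
                curMap = wordMap;
                // If the current character is the last character of the word, set isEnd to 1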
                if (i == len -1) {
                    curMap.put("isEnd", "1");
                }
            }
        }
        dictionaryMap = map;
    }
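    /**
     * Checks whether a keyword starts at the given position of the text.
     * @param text the text to scan
     * @param beginIndex the index at which matching starts
     * @return the length of the longest keyword starting at beginIndex, or 0 if there is none
     */
    private static int checkWord(String text, int beginIndex) {
        if (dictionaryMap == null) {
            throw new RuntimeException("The dictionary must not be null");
        }
        // Length of the longest complete keyword found so far
        int matchedLength = 0;
        int wordLength = 0;
        Map<String, Object> curMap = dictionaryMap;
        int len = text.length();
        // Start matching from position beginIndex of the text
        for (int i = beginIndex; i < len; i++) {
            String key = String.valueOf(text.charAt(i));
            // Get the next node for the current key
            curMap = (Map<String, Object>) curMap.get(key);
            if (curMap == null) {
                break;
            } else {
                wordLength ++;
                // Record the length only where a keyword ends, so a trailing partial match is not counted
                if ("1".equals(curMap.get("isEnd"))) {
                    matchedLength = wordLength;
                }
            }
        }
        return matchedLength;
    }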
    /**
     * Returns the matched keywords and how many times each one was hit.
     * @param text the text to scan
     * @return a map from keyword to hit count
     */
    public static Map<String, Integer> matchWords(String text) {
        Map<String, Integer> wordMap = new HashMap<>();
        int len = text.length();
        for (int i = 0; i < len; i++) {
            int wordLength = checkWord(text, i);
            if (wordLength > 0) {
                String word = text.substring(i, i + wordLength);
                // Increment the hit count of the keyword
                if (wordMap.containsKey(word)) {
                    wordMap.put(word, wordMap.get(word) + 1);
                } else {
                    wordMap.put(word, 1);
                }
                i += wordLength - 1;
            }
        }
        return wordMap;
    }
}
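The dictionary built by initMap is a trie: each character leads to a child node, and a flag marks where a word ends. The following self-contained sketch illustrates the same idea with a typed node class instead of nested maps. It is not the tutorial's implementation: the class name `TrieSketch` and the sample words are hypothetical, and it reports the longest complete word that starts at each position.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of trie-based keyword matching.
public class TrieSketch {

    /** One trie node: children keyed by character, plus an end-of-word flag. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    /** Inserts every word into the trie, one character per level. */
    public TrieSketch(List<String> words) {
        for (String word : words) {
            Node cur = root;
            for (char c : word.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.isEnd = true;
        }
    }

    /** Length of the longest word starting at beginIndex, or 0 if there is none. */
    int longestMatch(String text, int beginIndex) {
        Node cur = root;
        int matched = 0;
        for (int i = beginIndex; i < text.length(); i++) {
            cur = cur.children.get(text.charAt(i));
            if (cur == null) {
                break;
            }
            if (cur.isEnd) {
                matched = i - beginIndex + 1;
            }
        }
        return matched;
    }

    /** Counts how often each word occurs in the text. */
    public Map<String, Integer> matchWords(String text) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            int n = longestMatch(text, i);
            if (n > 0) {
                hits.merge(text.substring(i, i + n), 1, Integer::sum);
                i += n - 1; // skip past the matched word
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch(Arrays.asList("spy", "spyware"));
        System.out.println(trie.matchWords("a spy installed spyware"));
    }
}
```

Because matching works character by character, the same structure handles Chinese words without any change. Lookup cost depends on the length of the text and of the longest word, not on the number of words in the dictionary.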
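III. Create the tess4j-test module
1. Configure the pom file of the tess4j-test module (import the utils module)
2. Add two properties to the configuration of tess4j-test
tess4j:
  data-path: D:\workspace\tessdata # path of the language data directory
  language: chi_sim # language data to use (Simplified Chinese)
3. Test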
package com.example;
import com.test.utils.SensitiveWordUtil;
import com.test.utils.Tess4jClient;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.File;
import java.util.*;
@SpringBootTest
class Tess4jTestApplicationTests {
    @Autowired
    private Tess4jClient tess4jClient;
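    /**
     * Recognizes the text in an image and checks whether it contains sensitive words
     */
    @Test
    void contextLoads() throws TesseractException {
        // Recognize the text in the image
        String result = tess4jClient.doOCR(new File("D:\\123.png"));
        // Check whether it contains any self-managed sensitive words
        Map<String, Object> sensitiveScan = handleSensitiveScan(result);
        boolean flag = (boolean) sensitiveScan.get("flag");
        if (!flag) {
            System.out.println("The image contains sensitive words: " + sensitiveScan.get("map"));
        }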
    }
    /**
     * Self-managed sensitive word check
     * @param content the text to check
     * @return a map with "flag" (true if no sensitive word was found) and "map" (the hit counts)
     */
    private Map<String,Object> handleSensitiveScan(String content) {
        boolean flag = true;
        // Simulate loading all sensitive words from the database
        List<String> sensitiveList = new ArrayList<>();
        sensitiveList.add("私人侦探");
        sensitiveList.add("私人调查");
        sensitiveList.add("企业打侵");
        // Initialize the sensitive word dictionary
        SensitiveWordUtil.initMap(sensitiveList);
        // Check whether the content contains sensitive words
        Map<String, Integer> map = SensitiveWordUtil.matchWords(content);
        if(map.size() >0){
            flag = false;
        }
        Map<String,Object> resultMap = new HashMap<>();
        resultMap.put("flag",flag);
        resultMap.put("map",map);
        return resultMap;
    }
}
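One detail worth checking in this flow: Tess4jClient.doOCR replaces line breaks with "-", so a sensitive word that wraps across two lines of the image would contain a "-" and would not match. The following is a minimal sketch of an alternative, assuming it is applied to the raw text returned by Tesseract. The helper `OcrTextNormalizer` is hypothetical and not part of the original tutorial: it removes whitespace and line breaks entirely instead of inserting a separator.

```java
// Hypothetical helper: joins OCR output into one line without separators.
public class OcrTextNormalizer {

    /**
     * Removes spaces, tabs, line breaks, and full-width spaces (U+3000)
     * so that words split across lines are rejoined.
     */
    public static String normalize(String ocrText) {
        if (ocrText == null) {
            return "";
        }
        return ocrText.replaceAll("[\\s\\u3000]+", "");
    }

    public static void main(String[] args) {
        // A word broken by a line break is rejoined before matching
        System.out.println(normalize("abc\r\ndef ghi"));
    }
}
```

This suits Chinese text, where spaces do not separate words. For languages that rely on spaces, removing them would merge neighbouring words, so a different normalization would be needed there.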
The image is as follows:

The result is as follows: