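Short and mid-length video currently dominate how content spreads. Text has lost ground as a medium, but it still carries real value in business settings, so today we will look at how to use JavaScript to extract subtitles from a video on the fly.
Let's look at the result first:
[Screen recording 2024-02-29 15.40.18]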
  
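1. Introduction to tesseract.js
tesseract.js wraps the Tesseract OCR engine through a WebAssembly port, so JavaScript developers can use it conveniently in both the browser and Node.js.
In the browser it can be loaded through webpack, as an ES module, or with a script tag. Once installed, it can extract text from images and can also extract video subtitles dynamically.
A small example of extracting text from an image: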
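import { createWorker } from 'tesseract.js';
(async () => {
  // Create a worker with the English language model loaded
  const worker = await createWorker('eng');
  // Run OCR on a remote sample image
  const ret = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(ret.data.text);
  // Release the worker's resources once done
  await worker.terminate();
})();
2. Setting up the project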
Now let's walk through, step by step, how to extract text from a video.
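The central step in what follows is sampling a video frame onto a canvas and handing that canvas to the OCR engine. Subtitles usually occupy a strip at the bottom of the frame, so one possible refinement, which the code in this article does not use, is to recognize only that strip. The sketch below is a minimal illustration under that assumption; `subtitleRegion` and its default `fraction` of 0.25 are hypothetical names and values, not part of tesseract.js.

```javascript
// Hypothetical helper (not part of tesseract.js): computes the source
// rectangle covering the bottom `fraction` of a frame, in the
// (sx, sy, sw, sh) form used by the nine-argument
// drawImage(image, sx, sy, sw, sh, dx, dy, dw, dh).
const subtitleRegion = (frameWidth, frameHeight, fraction = 0.25) => {
  if (!(fraction > 0 && fraction <= 1)) {
    throw new RangeError('fraction must be in (0, 1]');
  }
  const sh = Math.round(frameHeight * fraction);
  return { sx: 0, sy: frameHeight - sh, sw: frameWidth, sh };
};

// Bottom quarter of a 640x360 frame.
const region = subtitleRegion(640, 360);
console.log(region); // { sx: 0, sy: 270, sw: 640, sh: 90 }
```

The result could be passed to the nine-argument form of `drawImage`, for example `ctx.drawImage(video, sx, sy, sw, sh, 0, 0, sw, sh)` on a canvas sized `sw` by `sh`. The source rectangle is measured in the video's intrinsic pixels (`video.videoWidth` and `video.videoHeight`), which may differ from the element's displayed size.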
The project structure is as follows:

The package.json file is as follows:

3. Writing the HTML
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Tesseract.js Video Streaming Recognition</title>
  <link rel="stylesheet" href="./css/style.css">
  <script src='https://unpkg.com/tesseract.js@v2.0.0-beta.1/dist/tesseract.min.js'></script>
</head>
<body>
  <div id="root">
    <video id="poem-video" width="640" height="360" crossorigin="anonymous">
      <source src="./do-not-go-gentle.mp4" type="video/mp4">
    </video>
    <div id="sep"></div>
    <div id="messages">
    </div>
  </div>
  <script>
    const { createWorker, createScheduler } = Tesseract;
    const scheduler = createScheduler();
    const video = document.getElementById('poem-video');
    const messages = document.getElementById('messages');
    let timerId = null;
    const addMessage = (m, bold) => {
      let msg = `<p>${m}</p>`;
      if (bold) {
        msg = `<p class="bold">${m}</p>`;
      }
      messages.innerHTML += msg;
      messages.scrollTop = messages.scrollHeight;
    }
    const doOCR = async () => {
      // Copy the current video frame onto an off-screen canvas
      const c = document.createElement('canvas');
      c.width = 640;
      c.height = 360;
      c.getContext('2d').drawImage(video, 0, 0, 640, 360);
      const start = new Date();
      // Queue the frame; the scheduler hands it to an idle worker
      const { data: { text } } = await scheduler.addJob('recognize', c);
      const end = new Date()
      // Log the wall-clock start and end times and the elapsed seconds
      addMessage(`[${start.getMinutes()}:${start.getSeconds()} - ${end.getMinutes()}:${end.getSeconds()}], ${(end - start) / 1000} s`);
      text.split('\n').forEach((line) => {
        addMessage(line);
      });
    };
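    (async () => {
      addMessage('Initializing Tesseract.js');
      // Start four workers so several frames can be recognized in parallel.
      // This page loads tesseract.js v2.0.0-beta.1, where createWorker() is
      // synchronous and load/loadLanguage/initialize are called explicitly.
      for (let i = 0; i < 4; i++) {
        const worker = createWorker();
        await worker.load();
        await worker.loadLanguage('eng');
        await worker.initialize('eng');
        scheduler.addWorker(worker);
      }
      addMessage('Initialized Tesseract.js');
      // Sample one frame per second while the video is playing
      video.addEventListener('play', () => {
        timerId = setInterval(doOCR, 1000);
      });
      // Stop sampling when playback pauses
      video.addEventListener('pause', () => {
        clearInterval(timerId);
      });
      addMessage('Now you can play the video. :)');
      // Show the controls only once the workers are ready
      video.controls = true;
    })();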
  </script>
</body>
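</html>
Note that initialization is asynchronous. Once the workers have finished initializing and the video's play event fires, a frame is drawn onto a canvas every second and the text recognized from it is printed.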
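Because a frame is sampled every second while a subtitle typically stays on screen for several seconds, the same line tends to be printed repeatedly. One way to reduce the repetition is sketched below: normalize whitespace and emit a line only when it differs from the previous one. `normalize` and `createSubtitleFilter` are hypothetical helpers, not part of tesseract.js, and the sample strings stand in for real OCR output. Exact comparison is an assumption that real OCR noise may break, in which case a fuzzy comparison would be needed.

```javascript
// Hypothetical helper (not part of tesseract.js): collapses runs of
// whitespace so trivially different OCR outputs compare equal.
const normalize = (text) => text.replace(/\s+/g, ' ').trim();

// Returns a function that yields the normalized text only when it differs
// from the previously accepted text, and null otherwise.
const createSubtitleFilter = () => {
  let last = '';
  return (text) => {
    const current = normalize(text);
    if (current === '' || current === last) {
      return null; // empty frame, or same subtitle as the previous sample
    }
    last = current;
    return current;
  };
};

// Simulated OCR results from four consecutive one-second samples.
const nextSubtitle = createSubtitleFilter();
const samples = [
  'Do not go gentle\n',
  'Do not go  gentle',
  '',
  'into that good night\n',
];
const emitted = samples
  .map((sample) => nextSubtitle(sample))
  .filter((line) => line !== null);
console.log(emitted); // [ 'Do not go gentle', 'into that good night' ]
```

In `doOCR`, the recognized `text` could be passed through such a filter before calling `addMessage`.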