How to install Tesseract OCR for Windows

Adapting the Tesseract Open Source OCR Engine for Multilingual OCR

Publication Year: 2009

We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although change was required to various modules, including physical layout analysis, and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.

ACM, 2009. This is the authors’ version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the International Workshop on Multilingual OCR 2009, Barcelona, Spain July 25, 2009.

Nanonets and Humans in the Loop

The ‘Moderate’ screen aids the correction and entry processes, reducing the manual reviewer’s workload by almost 90% and cutting costs for the organisation by 50%.

Features include

  1. Track predictions which are correct
  2. Track which ones are wrong
  3. Make corrections to the inaccurate ones
  4. Delete the ones that are wrong
  5. Fill in the missing predictions
  6. Filter images with date ranges
  7. Get counts of moderated images against the ones not moderated

All the fields are structured into an easy-to-use GUI that lets the user take advantage of the OCR technology and help improve it as they go, without having to write any code or understand how the technology works.

Have an OCR problem in mind? Want to reduce your organization’s data entry costs? Head over to Nanonets and build OCR models to extract text from images or extract data from PDFs!

Tesseract Installation

The Tesseract OCR engine package is generally called “tesseract-ocr”. On Ubuntu Focal 20.04 you can install the latest version, Tesseract 4.1.1, and its developer tools from the terminal.

Install Tesseract OCR Engine
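On Ubuntu this is typically:

sudo apt install tesseract-ocr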

We also need to install the development package “libtesseract-dev” to work with the Tesseract OCR engine.

Install Developer Tools for Tesseract
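For example:

sudo apt install libtesseract-dev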

Then we need to install the Tesseract built binaries (supported languages and scripts), which are available directly from the Linux distributions through snapd, by running the following command. If you do not have snapd installed, run the command below first.

Install Snapd
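On Ubuntu, snapd itself can be installed with:

sudo apt install snapd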

Then run

Install Tesseract Built Binaries
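A representative snap command (the channel here is an assumption; check the snap store listing for tesseract):

sudo snap install --channel=edge tesseract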

Further Reading

  • Best trained model for LSTM Tesseract 4.0
  • Dropbox approach to OCR 4.2017
  • Overview of Tesseract OCR Engine Legacy
  • Comparison of OCR Accuracy on Early Printed Books using the
    Open Source Engines Calamari and OCRopus
  • Efficient, Lexicon-Free OCR using Deep Learning
  • Suitability of OCR Engines in Information Extraction Systems — A Comparative Evaluation
  • DeepText Benchmark
  • OCR Project List
  • Tesseract Github Latest Release
  • CVPR 2019 — Character Region Awareness for Text Detection (CRAFT)
  • Credit Card OCR with OpenCV and Python
  • Image Preprocessing
  • OCR using Tesseract on Raspberry Pi
  • Tesseract OCR for Non-English Languages
  • How to Do OCR from the Linux Command Line Using Tesseract
  • An Overview of the Tesseract OCR Engine

Summary

Today we learned how to install and configure Tesseract on our machines, the first part in a two-part series on using Tesseract for OCR. We then used the tesseract binary to apply OCR to input images.

However, we found out that unless our images are cleanly segmented Tesseract will give poor results. In the case of “noisy” input images, we’ll likely obtain better accuracy by training a custom machine learning model to recognize characters in our specific use case.

Tesseract is best suited for situations with high resolution inputs where the foreground text is cleanly segmented from the background.

Next week we’ll learn how to access Tesseract via Python code, so stay tuned.

Expose the Required APIs — Writing a Wrapper

using System;
using System.Runtime.InteropServices;

public class TesseractWrapper
{
#if UNITY_EDITOR
    private const string TesseractDllName = "tesseract";
    private const string LeptonicaDllName = "tesseract";
#elif UNITY_ANDROID
    private const string TesseractDllName = "libtesseract.so";
    private const string LeptonicaDllName = "liblept.so";
#else
    private const string TesseractDllName = "tesseract";
    private const string LeptonicaDllName = "tesseract";
#endif

    [DllImport(TesseractDllName)]
    private static extern IntPtr TessVersion();
}

So we start by adding a class called TesseractWrapper, which will act as an API layer between the application and the Tesseract DLL(s). As you might notice above, the DLLs we got have different names on different platforms. To get around this, we use compiler switches to set the Tesseract and Leptonica (one of the major dependencies) plugin file names. The way we expose functions is by using DllImport(<fileName>) and the extern keyword; the function signature is something you have to look up in the documentation. Here we are exposing the function TessVersion().

According to the Tesseract documentation, its return type is a pointer (IntPtr) and it takes no parameters.

Linux

Tesseract is available directly from many Linux distributions. The package is generally called ‘tesseract’ or ‘tesseract-ocr’ — search your distribution’s repositories to find it.
You can install Tesseract 4.x and its developer tools on Ubuntu 18.04 Bionic by simply running:
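The standard commands are:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev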

Note for Ubuntu users: if apt is unable to find the package, try adding the universe repository to your package sources as shown below.
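For example, enabling the universe repository:

sudo add-apt-repository universe
sudo apt update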

Packages for over 130 languages and over 35 scripts are also available directly from the Linux distributions. The language packages are called ‘tesseract-ocr-langcode’ and ‘tesseract-ocr-script-scriptcode’, where langcode is a three-letter language code and scriptcode is a four-letter script code.

Examples: tesseract-ocr-eng (English), tesseract-ocr-ara (Arabic), tesseract-ocr-chi-sim (Simplified Chinese), tesseract-ocr-script-latn (Latin Script), tesseract-ocr-script-deva (Devanagari script), etc.
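For example, German and the Latin script data could be added with:

sudo apt install tesseract-ocr-deu tesseract-ocr-script-latn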

The traineddata is currently not shipped with the snap package and must be placed manually.

Tesseract Development Version with LSTM engine and related traineddata

5.00 Alpha

AppImage

Instruction

  1. Download AppImage from releases page
  2. Open your terminal application, if not already open
  3. Browse to the location of the AppImage
  4. Make the AppImage executable (see the commands below)
  5. Run it
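The two commands referenced above might look like this (the AppImage filename is illustrative and will match whatever you downloaded):

chmod +x tesseract-x86_64.AppImage
./tesseract-x86_64.AppImage --version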

AppImage compatibility

  • Debian: ≥ 10
  • Fedora: ≥ 29
  • Ubuntu: ≥ 18.04
  • CentOS ≥ 8
  • openSUSE Tumbleweed

Included traineddata files

  • deu — German
  • eng — English
  • fin — Finnish
  • fra — French
  • osd — Script and orientation
  • por — Portuguese
  • rus — Russian
  • spa — Spanish

Tesseract 4 packages with LSTM engine and related traineddata

Ubuntu Focal 20.04

4.0.x

Ubuntu Bionic 18.04

4.0.x

RHEL/CentOS/Scientific Linux, Fedora, openSUSE packages

rpm package with tesseract-ocr

For example, to install Tesseract with German language traineddata:

For CentOS 8 run the following as root:

For RHEL 7 run the following as root:

For CentOS 7 run the following as root:

For Scientific Linux 7 run the following as root:

For Fedora 32 run the following as root:

For Fedora 31 run the following as root:

For openSUSE Tumbleweed run the following as root:

For openSUSE Leap 15.0 run the following as root:
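As a rough illustration, the commands typically take the following form; the package names here are assumptions based on common repository naming, so check your distribution's repositories:

# Fedora
sudo dnf install tesseract tesseract-langpack-deu
# RHEL/CentOS (with EPEL enabled)
sudo yum install tesseract tesseract-langpack-deu
# openSUSE
sudo zypper install tesseract-ocr tesseract-ocr-traineddata-german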

FOR EXPERTS ONLY.

If you are experimenting with OCR Engine modes, you will need to manually install language training data beyond what is available in your Linux distribution.

Various types of training data can be found on GitHub. Unpack and copy the .traineddata file into a ‘tessdata’ directory. The exact directory depends both on the type of training data and on your Linux distribution.
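For example, assuming a Debian/Ubuntu-style layout (the path is an assumption and varies by distribution):

sudo cp deu.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
# or keep the models in a custom directory and point Tesseract at it:
export TESSDATA_PREFIX=$HOME/tessdata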

If Tesseract is not available for your distribution, or you want to use a newer version than they offer, you can compile your own.

Usage within Python

Within venv

https://github.com/sirfz/tesserocr

  1. Install Pillow, a module for image processing in Python:

    pip install Pillow

  • Code for Python:

    https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d
    offers this snippet:

    from PIL import Image  # Pillow is imported under its legacy name, PIL
    column = Image.open('code.jpg')
    gray = column.convert('L')    # convert to gray scale vs. RGB or CMYK.
    blackwhite = gray.point(lambda x: 0 if x < 200 else 255, '1')
    blackwhite.save("code_bw.jpg") # TODO: change to use program invocation parameter
     
  • Code for a command-line script (the image filename is passed as an argument):

    from PIL import Image
    import sys
    column = Image.open(sys.argv[1])  # first command-line argument: path to the input image
    gray = column.convert('L')
    blackwhite = gray.point(lambda x: 0 if x < 200 else 255, '1')
    blackwhite.save("code_bw.jpg")
     

Conclusion

Just as deep learning has impacted nearly every facet of computer vision, the same is true for character recognition and handwriting recognition. Deep learning based models have managed to obtain unprecedented text recognition accuracy, far beyond traditional information extraction and machine learning image processing approaches.

Tesseract performs well when document images follow these guidelines:

  • Clean segmentation of the foreground text from background
  • Horizontally aligned and scaled appropriately
  • High-quality image without blurriness and noise

The latest release of Tesseract 4.0 supports deep learning based OCR that is significantly more accurate. The OCR engine itself is built on a Long Short-Term Memory (LSTM) network, a kind of Recurrent Neural Network (RNN).

Tesseract is perfect for scanning clean documents and delivers fairly high accuracy across font variations, since its training was comprehensive. I would say that Tesseract is a go-to tool if your task is scanning books, documents, and printed text on a clean white background.

Open Source OCR Tools

There is a lot of optical character recognition software available. I did not find any quality comparison between them, but I will write about some that seem to be the most developer-friendly.

Tesseract — an open-source OCR engine that has gained popularity among OCR developers. Even though it can be painful to implement and modify sometimes, for a long time there weren’t many free and powerful OCR alternatives on the market. Tesseract began as a Ph.D. research project at HP Labs, Bristol. It gained popularity and was developed by HP between 1984 and 1994. In 2005 HP released Tesseract as open-source software, and since 2006 it has been developed by Google.

Google Trends comparison of different open-source OCR tools

OCRopus — an open-source OCR system allowing easy evaluation and reuse of OCR components by both researchers and companies. It is a collection of document analysis programs, not a turn-key OCR system. To apply it to your documents, you may need to do some image preprocessing, and possibly also train new models. In addition to the recognition scripts themselves, there are several easy-to-use scripts for ground truth editing and correction, measuring error rates, and determining confusion matrices.

Ocular — Ocular works best on documents printed using a hand press, including those written in multiple languages. It operates using the command line. It is a state-of-the-art historical OCR system. Its primary features are:

  • Unsupervised learning of unknown fonts: requires only document images and a corpus of text.
  • Ability to handle noisy documents: inconsistent inking, spacing, vertical alignment
  • Support for multilingual documents, including those that have considerable word-level code-switching.
  • Unsupervised learning of orthographic variation patterns including archaic spellings and printer shorthand.
  • Simultaneous, joint transcription into both diplomatic (literal) and normalized forms.

SwiftOCR — an OCR engine written in Swift, worth mentioning given the significant effort going into advancing Swift as a language for deep learning. SwiftOCR is a fast and simple OCR library that uses neural networks for image recognition, and it claims to outperform the well-known Tesseract library.

In this blog post, we will put focus on Tesseract OCR and find out more about how it works and how it is used.


Installing the Tesseract Package for Python

To install pytesseract we will use pip, the Python package manager. It is also recommended to use a virtual environment so that each project can have its own set of packages. In this case the virtualenv is called cv.

Then we install Pillow (a friendly PIL fork for Python), which pytesseract depends on.
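A minimal sketch of the install commands, assuming a virtualenvwrapper-style environment named cv as in the text:

workon cv                 # activate the 'cv' virtual environment
pip install pytesseract   # thin wrapper around the tesseract binary
pip install pillow        # imaging library that pytesseract depends on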

Note: pytesseract does not provide true Python bindings. Rather, it is a simple wrapper around the tesseract binary. If you look at the project more closely, you will see that the library saves the image to a temporary file on disk, then invokes the tesseract binary on it and writes the result to another file.

Let's look at code that separates the foreground text from the background, and then apply the installed pytesseract.

Installing the Library

The first thing to do is install Tesseract OCR itself. Installation is straightforward on Mac and Linux; on Windows there is one extra step.

If you are on a Mac, install Homebrew and then run brew install tesseract in the terminal. If you are on Linux, run the command appropriate for your distribution.

If you are on Windows, you need to download the application to your PC: download the Windows Installer file and run the installation.

You will not need to interact with the program itself; you only need to copy its location. It is usually installed on the C drive under Program Files. Find the program and copy the path to that folder.

Other Platforms

Tesseract may work on more exotic platforms too. You can either try compiling it yourself, or take a look at the list of other projects using Tesseract.

Running Tesseract

Tesseract is a command-line program, so first open a terminal or command prompt. The command is used like this:
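In general terms the invocation is roughly:

tesseract imagename outputbase [-l lang] [options] [configfile...]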

So basic usage to do OCR on an image called ‘myscan.png’ and save the result to ‘out.txt’ would be:
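Tesseract appends the .txt extension to the output base name automatically:

tesseract myscan.png out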

Or to do the same with German:
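The language is selected with the -l flag:

tesseract myscan.png out -l deu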

It can even be used with multiple languages’ traineddata at a time, e.g. English and German:
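Multiple languages are joined with a plus sign:

tesseract myscan.png out -l eng+deu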

You can also create a searchable PDF directly from tesseract (versions >= 3.03):
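Appending the pdf config name writes out.pdf instead of out.txt:

tesseract myscan.png out pdf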

More information about the various options is available in the Tesseract manpage.

Other Languages

Tesseract has been trained for many languages; check for your language in the Tessdata repository.

It can also be trained to support other languages and scripts; for more details see TrainingTesseract.

Development

Also, it is free software, so if you want to pitch in and help, please do!
If you find a bug and fix it yourself, the best thing to do is to attach the patch to your bug report in the Issues List

Use Cases

Tesseract is a general purpose OCR engine, but it works best when we have clean black text on solid white background in a common font. It also works well when the text is approximately horizontal and the text height is at least 20 pixels. If the text has a surrounding border, it may be detected as some random text.

For example, if you scanned a book with a high-quality scanner, the results would be great. But if you took a passport with a complex guilloche pattern in the background, the text recognition may not work as well. In such cases, there are several tricks that we need to employ to make reading such text possible. We will discuss those advanced tricks in our next post.

Let’s look at these relatively easy examples.

3.1 Documents (book pages, letters)

Let’s take an example of a photo of a book page.

Photograph of a book page.

When we process this image using tesseract, it produces the following output:

Output:
1.1 What is computer vision? As humans, we perceive the three-dimensional structure of the world around us with apparent
ease. Think of how vivid the three-dimensional percept is when you look at a vase of flowers
sitting on the table next to you. You can tell the shape and translucency of each petal through
the subtle patterns of light and Shading that play across its surface and effortlessly segment
each flower from the background of the scene (Figure 1.1). Looking at a framed group por-
trait, you can easily count (and name) all of the people in the picture and even guess at their
emotions from their facial appearance. Perceptual psychologists have spent decades trying to
understand how the visual system works and, even though they can devise optical illusions!
to tease apart some of its principles (Figure 1.3), a complete solution to this puzzle remains
elusive (Marr 1982; Palmer 1999; Livingstone 2008).

Even though there is a slight slant in the text, Tesseract does a reasonable job with very few mistakes.

3.2 Receipts

The text structure in book pages is very well defined, i.e. words and sentences are equally spaced and there is very little variation in font sizes, which is not the case with receipts. A slightly more difficult example is a receipt, which has a non-uniform text layout and multiple fonts. Let’s see how well Tesseract performs on scanned receipts.

OCR Receipt Example

Output:
Store #056663515DEL MAR HTS,RDSAN DIEGO, CA 92130(858) 792-7040Register #4 Transaction #571140Cashier #56661020 8/20/17 5:45PMwellnesst+ with PlentiPlenti Card#: 31XXXXXXXXXX45531 G2 RETRACT BOLD BLK 2PK 1.99 TSALE 1/1.99, Reg 1/4.69Discount 2.70-

1 Items Subtotal 1.99Tax .15

Total 2.14*xMASTER* 2.14MASTER card * #XXXXXXXXXXXX548SApo #AA APPROVAL AUTORef # 05639EEntry Method: Chip

3.3 Street Signs

If you get lucky, you can also get this simple code to read simple street signs.

Traffic sign board

Output:
SKATEBOARDING

BICYCLE RIDING

ROLLER BLADING

SCOOTER RIDING

Note, it mistakes the screw for a symbol.

Let’s look at a slightly more difficult example. You can see there is some background clutter and the text is surrounded by a rectangle.

Property Sign Board

Tesseract does not do a very good job with dark boundaries and often assumes them to be text.

Output:
| THIS PROPERTY} ISPROTECTEDBY ||| VIDEO SURVEILLANCE

However, if we help Tesseract a bit by cropping out the text region, it gives perfect output.

Cropped Notice Board

Output:
THIS PROPERTY
IS PROTECTED BY
VIDEO SURVEILLANCE

The above example illustrates why we need text detection before we do text recognition. A text detection algorithm outputs a bounding box around text areas which can then be fed into a text recognition engine like Tesseract for high-quality output. We will cover this in a future post.

Subscribe & Download Code

If you liked this article and would like to download code (C++ and Python) and example images used in this post, please click here. Alternately, sign up to receive a free Computer Vision Resource Guide. In our newsletter, we share OpenCV tutorials and examples written in C++/Python, and Computer Vision and Machine Learning algorithms and news.

Download Example Code

Using Different Languages

Tesseract OCR supports more than 100 languages. To use a language, you must first install it.

When you find the language you want to use in the list, note its abbreviation. We are going to install support for Welsh.

Its abbreviation is "cym", short for "Cymru", the Welsh word for Wales.

The installation package is called "tesseract-ocr-" with the language abbreviation appended at the end. To install the Welsh language file on Ubuntu, we will use:

sudo apt-get install tesseract-ocr-cym

The image with the text is shown below. It is the first verse of the Welsh national anthem.

Let's see whether Tesseract OCR can handle it. We will use the -l (language) option to tell it which language we want to work with:

tesseract hen-wlad-fy-nhadau.png anthem -l cym --dpi 150

Tesseract OCR does an excellent job, as shown in the extracted text below. Da Iawn, Tesseract OCR.

If your document contains two or more languages (for example, a Welsh-English dictionary), you can use a plus sign (+) to tell Tesseract to add another language, like this:

tesseract image.png textfile -l eng+cym+fra

Combined Orientation and Script Detection using the Tesseract OCR Engine

Publication Year: 2009

This paper proposes a simple but effective algorithm to estimate the script and dominant page orientation of the text contained in an image. A candidate set of shape classes for each script is generated using synthetically rendered text and used to train a fast shape classifier. At run time, the classifier is applied independently to connected components in the image for each possible orientation of the component, and the accumulated confidence scores are used to determine the best estimate of page orientation and script. Results demonstrate the effectiveness of the approach on a dataset of 1846 documents containing a diverse set of images in 14 scripts and any of four possible page orientations.

Provide ground truth

Place ground truth consisting of line images and transcriptions in the ground truth folder. This list of files will be split into training and evaluation data; the ratio is defined by a variable.

Images must be TIFF or PNG, with the corresponding file extension.

Transcriptions must be single-line plain text and have the same name as the line image, but with the image extension replaced by the ground-truth text extension.

The repository contains a ZIP archive with sample ground truth, see ocrd-testset.zip. Extract it to the ground truth folder and run the training.

NOTE: If you want to generate line images for transcription from a full page, see the tips in issue 7.

Creating a Searchable PDF with Tesseract OCR

Recently at work we faced the task of recognizing scanned documents and searching through them.

I looked at Tesseract, an open-source text recognition engine.

This article covers the main points of a possible implementation.

Suppose we have a multi-page scanned document in PDF format, but with no recognized text layer.

Our task is to recognize the text with OCR (Optical Character Recognition) and create a so-called Searchable PDF.

A Searchable PDF is a PDF in which an additional layer containing the recognized text is placed on top of the image, at the same positions as in the image.

First, we need to install the necessary programs:

  • imagemagick — a set of console utilities for working with many graphic formats;
  • tesseract-ocr — the optical character recognition application;
  • tesseract-ocr-all — all language packs (you can also install only the specific language packs you need)

Tesseract has language packs for Russian and Kazakh, which is very handy.

You can also avoid installing tesseract locally and run it through Docker instead.

I did not find an official image on hub.docker.com, so I made my own.

You can start a container with tesseract from the naik85/tesseract image like this (example for Linux/Unix):

docker run --rm -v "$(pwd)":/files -w /files -it naik85/tesseract bash

After the container starts, a bash console opens where you can run commands. Your files from the directory where you ran docker run will also be available.

Stage One

In the first stage we need to extract the images from the PDF. There are two options: either convert the PDF into a single TIFF file, or convert it into a set of images.

TIFF is a multi-page format for storing raster graphics.

The following command was used to convert to TIFF:

convert -density 300 YOUR_FILE.pdf -depth 1 -strip -background white -alpha off YOUR_FILE.tiff

The following command was used to convert to PNG:

convert -density 300 YOUR_FILE.pdf -depth 1 -strip -background white -alpha off YOUR_FILE.png

The conversion parameters are given as an example; you can tune them to your requirements.

After converting to PNG, a separate image file is created for each page.

For example:

YOUR_FILE-0.png
YOUR_FILE-1.png
...
YOUR_FILE-N.png

Stage Two

Unfortunately, I was not able to convert the TIFF document into a Searchable PDF with tesseract.

The following command was used:

tesseract YOUR_FILE.tiff searchable -l rus PDF

The following error came up:

Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error in pixReadFromTiffStream: failed to read tiffdata

For our task, splitting the document into separate per-page files (images) was actually preferable (more on that below).

Convert each page (PNG file) into a Searchable PDF:

tesseract YOUR_FILE-0.png searchable-0 -l rus+kaz+eng pdf
tesseract YOUR_FILE-1.png searchable-1 -l rus+kaz+eng pdf
...
tesseract YOUR_FILE-N.png searchable-N -l rus+kaz+eng pdf

As output we get the files:

searchable-0.pdf
searchable-1.pdf
...
searchable-N.pdf

A very cool feature is that you can recognize several languages at once by listing them separated by the ‘+’ character: rus+kaz+eng.

Command for recognizing and extracting the text

tesseract YOUR_FILE-0.png -l rus+kaz+eng YOUR_FILE-0
tesseract YOUR_FILE-1.png -l rus+kaz+eng YOUR_FILE-1
...
tesseract YOUR_FILE-N.png -l rus+kaz+eng YOUR_FILE-N

As a result, the following text files will be created:

YOUR_FILE-0.txt
YOUR_FILE-1.txt
...
YOUR_FILE-N.txt

By recognizing each page separately, we can implement per-page search and, if needed, show the user only the relevant pages.

By concatenating the page texts and putting them into a search engine, we get per-document search.

If a complete Searchable PDF is needed, it can be assembled from the individual pages.
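One way is pdfunite from the poppler-utils package, listing the page PDFs in order followed by the output file name:

pdfunite searchable-0.pdf searchable-1.pdf searchable-2.pdf searchable-full.pdf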

P.S. Splitting into separate pages turns out to be costly, but I think it is more flexible.


Introduction

  • The current official release is 4.1.1.
  • The master branch on Github can be used by those who want the latest code for LSTM (--oem 1) and legacy (--oem 0) Tesseract. The master branch is using 5.0.0 versioning because code modernization caused API compatibility issues with the 4.x release.

Tesseract can be used directly via command line, or (for programmers) by using an API to extract printed text from images. It supports a wide variety of languages. Tesseract doesn’t have a built-in GUI, but there are several available from the 3rdParty page. External tools, wrappers and training projects for Tesseract are listed under AddOns.

Tesseract is free software, so if you want to pitch in and help, please do!
If you find a bug and fix it yourself, the best thing to do is to attach the patch to your bug report in the Issues List.

Introduction to Tesseract

Tesseract is an open-source text recognition (OCR) engine written in C/C++ that works on Windows, macOS, and Linux and is released under the Apache 2.0 License. It was initially developed by Hewlett-Packard starting in 1985 and was released as open source in 2005. Google has sponsored its development and maintenance since 2006.

It can be used directly, or (for programmers) via an API, to extract printed text from files; it has Unicode (UTF-8) support and recognizes more than 100 languages. The Tesseract project does not have a built-in GUI application; if you need a GUI, use one of the several available third-party tools such as VietOCR, OCR2Text, dpScreenOCR, or NeOCR. You can find more about these on the 3rdParty page.

Limitations of Tesseract for OCR

A few weeks ago I was working on a project to recognize the 16-digit numbers on credit cards.

I was easily able to write Python code to localize each of the four groups of 4-digits.

Here is an example 4-digit region of interest:

Figure 8: Localizing a 4-digit grouping of characters on a credit card.

However, when I tried to apply Tesseract to the following image, the results were dissatisfying:

Figure 9: Trying to apply Tesseract to “noisy” images.

$ tesseract tesseract_inputs/example_04.png stdout digits
Warning in pixReadMemPng: work-around: writing to a temp file
5513

Notice how Tesseract reported 5513, but the image clearly shows a different set of digits.

Unfortunately, this is a great example of a limitation of Tesseract. While we have segmented the foreground text from background, the pixelated nature of the text “confuses” Tesseract. It’s also likely that Tesseract was not trained on a credit card-like font.

Tesseract is best suited when building document processing pipelines where images are scanned in, pre-processed, and then Optical Character Recognition needs to be applied.

We should note that Tesseract is not an off-the-shelf solution to OCR that will work in all (or even most) image processing and computer vision applications.

In order to accomplish that, you’ll need to apply feature extraction techniques, machine learning, and deep learning.


4.0 with LSTM

Tesseract 4.0 added a new OCR engine based on LSTM neural networks. It works well on x86/Linux with official Language Model data available for 100+ languages and 35+ scripts. See 4.0x-Changelog for more details.

Traineddata Files

For detailed information about the different types of models, see Data Files.

Model files for this version are available from tessdata tagged 4.00. It has models from November 2016. The individual language files are linked from that repository.

tessdata 4.00 November 2016

Model files for version 4.0.0 and later are available from tessdata tagged 4.0.0. It has legacy models from September 2017 that have been updated with integer versions of the LSTM models. This set of traineddata files supports both the legacy recognizer (--oem 0) and the LSTM models (--oem 1). These models are available from the following GitHub repo.

tessdata

Two more sets of traineddata, trained at Google, are made available in the following GitHub repos. These do not have the legacy models and only have LSTM models (usable with --oem 1).

  • tessdata_best
  • tessdata_fast
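As an illustration, a single model can be fetched from one of these repos and dropped into the tessdata directory; the URL pattern and target path below are assumptions, so adjust them to your setup:

wget https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata
sudo mv eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata/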

Training for Tesseract 4

  • TrainingTesseract 4.00 — Detailed Guide by Ray Smith

    • Fonts
    • Box Files
    • The-Hallucination-Effect
  • Links to Community Contributions for Finetune Training
  • 4.0 Accuracy and Performance

Usage

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('text.png'))
    ->run();

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('german.png'))
    ->lang('deu')
    ->run();

Multiple languages

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('mixed-languages.png'))
    ->lang('eng', 'jpn', 'spa')
    ->run();

Inducing recognition

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('8055.png'))
    ->allowlist(range('A', 'Z'))
    ->run();

Breaking CAPTCHAs

Yes, I know some of you might want to use this library for the noble purpose
of breaking CAPTCHAs, so please take a look at this comment:
