Tesseract is an open source text recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API to extract printed text from images. It supports a wide variety of languages.
Tesseract doesn’t have a built-in GUI, but there are several available from the 3rdParty page.
Installation
There are two parts to install, the engine itself, and the traineddata for the languages.
Tesseract is available directly from many Linux distributions. The package is generally called ‘tesseract’ or ‘tesseract-ocr’ — search your distribution’s repositories to find it.
Packages for over 130 languages and over 35 scripts are also available directly from the Linux distributions. The language traineddata packages are called ‘tesseract-ocr-langcode’ and ‘tesseract-ocr-script-scriptcode’, where langcode is a three-letter language code and scriptcode is a four-letter script code.
Examples: tesseract-ocr-eng (English), tesseract-ocr-ara (Arabic), tesseract-ocr-chi-sim (Simplified Chinese), tesseract-ocr-script-latn (Latin Script), tesseract-ocr-script-deva (Devanagari script), etc.
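The naming scheme above can be sketched in a few lines of Python. The helper names `language_package` and `script_package` are hypothetical, and the underscore-to-hyphen step is an assumption based on the `tesseract-ocr-chi-sim` example (the traineddata file itself is named `chi_sim`). Check your distribution's repositories for the actual package names.

```python
def language_package(langcode: str) -> str:
    """Build a distribution package name from a language code such as 'eng'."""
    # Assumption: traineddata names with '_' (chi_sim) map to '-' in package names.
    return "tesseract-ocr-" + langcode.lower().replace("_", "-")


def script_package(scriptcode: str) -> str:
    """Build a distribution package name from a four-letter script code such as 'Latn'."""
    return "tesseract-ocr-script-" + scriptcode.lower()


print(language_package("eng"))
print(language_package("chi_sim"))
print(script_package("Deva"))
```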
** FOR EXPERTS ONLY. **
If you are experimenting with OCR Engine modes, you will need to manually install language training data beyond what is available in your Linux distribution.
Various types of training data can be found on GitHub. Unpack and copy the .traineddata file into a ‘tessdata’ directory. The exact directory will depend both on the type of training data, and your Linux distribution. Possibilities are /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata or /usr/share/tesseract-ocr/4.00/tessdata.
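Since the tessdata location varies, a small script can report which of the candidate directories exists on a given machine. This is only a sketch: the helper name `find_tessdata_dir` is made up for illustration, and the candidate list contains just the three paths mentioned above, so your system may use a different one.

```python
from pathlib import Path

# The three locations mentioned in the text; extend this list for your system.
CANDIDATE_TESSDATA_DIRS = [
    "/usr/share/tesseract-ocr/tessdata",
    "/usr/share/tessdata",
    "/usr/share/tesseract-ocr/4.00/tessdata",
]


def find_tessdata_dir(candidates=CANDIDATE_TESSDATA_DIRS):
    """Return the first candidate that exists as a directory, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None


print(find_tessdata_dir())
```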
Training data for obsolete Tesseract versions <= 3.02 resides in another location.
If Tesseract is not available for your distribution, or you want to use a newer version than they offer, you can compile your own.
Ubuntu
You can install Tesseract and its developer tools on Ubuntu by simply running:
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
Note for Ubuntu users: In case apt is unable to find the package, try adding a universe entry to the sources.list file as shown below.
sudo vi /etc/apt/sources.list
Copy the first line "deb http://archive.ubuntu.com/ubuntu bionic main" and paste it as shown below on the next line.
If you are using a different release of Ubuntu, replace bionic with the respective release name.
deb http://archive.ubuntu.com/ubuntu bionic universe
Debian packages
- Tesseract 4
- Tesseract 5
- Tesseract 5 (devel)
Raspbian packages
- Tesseract 4
- Tesseract 5
- Tesseract 5 (devel)
Ubuntu packages
- Tesseract 4
- Tesseract 5
- Tesseract 5 (devel)
Ubuntu ppa
- Tesseract 4
- Tesseract 5
- Tesseract 5 (devel-daily)
RHEL/CentOS/Scientific Linux, Fedora, openSUSE packages
- Tesseract 4
- Tesseract 5
See the Installation on openSUSE page for detailed instructions.
AppImage
Instruction
- Download AppImage from releases page
- Open your terminal application, if not already open
- Browse to the location of the AppImage
- Make the AppImage executable:
chmod a+x tesseract*.AppImage
- Run it:
./tesseract*.AppImage -l eng page.tif page.txt
AppImage compatibility
- Debian: ≥ 10
- Fedora: ≥ 29
- Ubuntu: ≥ 18.04
- CentOS ≥ 8
- openSUSE Tumbleweed
Included traineddata files
- deu — German
- eng — English
- fin — Finnish
- fra — French
- osd — Script and orientation
- por — Portuguese
- rus — Russian
- spa — Spanish
snap
For distributions that are supported by snapd you may also run the following command to install the tesseract built binaries (Don’t have snapd installed?):
sudo snap install --channel=edge tesseract
The traineddata is currently not shipped with the snap package and must be placed manually in ~/snap/tesseract/current.
macOS
You can install Tesseract using either MacPorts or Homebrew.
A macOS wrapper for the Tesseract API is also available at Tesseract macOS.
MacPorts
To install Tesseract run this command:
sudo port install tesseract
To install any language data, run:
sudo port install tesseract-<langcode>
List of available langcodes can be found on MacPorts tesseract page.
Homebrew
To install Tesseract run this command:
brew install tesseract
The tesseract directory can then be found using brew info tesseract, e.g. /usr/local/Cellar/tesseract/3.05.02/share/tessdata/.
Windows
Installers for Windows for Tesseract 3.05, Tesseract 4 and Tesseract 5 are available from Tesseract at UB Mannheim. These include the training tools. Both 32-bit and 64-bit installers are available.
An installer for the OLD version 3.02 is available for Windows from our download page.
This includes the English training data.
If you want to use another language, download the appropriate training data, unpack it using 7-zip, and copy the .traineddata file into the ‘tessdata’ directory, probably C:\Program Files\Tesseract-OCR\tessdata.
To access tesseract-OCR from any location you may have to add the directory where the tesseract-OCR binaries are located to the Path variables, probably C:\Program Files\Tesseract-OCR.
Experts can also get binaries built with Visual Studio from the build artifacts of the Appveyor Continuous Integration.
Cygwin
Released versions >= 3.02 of tesseract-ocr are part of Cygwin.
The latest version available is 4.1.0. Please see the announcement.
MSYS2
Install tesseract-OCR:
pacman -S mingw-w64-{i686,x86_64}-tesseract-ocr
and the data files:
pacman -S mingw-w64-{i686,x86_64}-tesseract-data-eng
In the above command, “eng” may be replaced with the ISO 639 3-letter language code for supported languages. For a list of available language packages use:
pacman -Ss tesseract-data
Other Platforms
Tesseract may work on more exotic platforms too. You can either try compiling it yourself, or take a look at the list of other projects using Tesseract.
Running Tesseract
Tesseract is a command-line program, so first open a terminal or command prompt. The command is used like this:
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
So basic usage to do OCR on an image called ‘myscan.png’ and save the result to ‘out.txt’ would be:
tesseract myscan.png out
Or to do the same with German:
tesseract myscan.png out -l deu
It can even be used with traineddata for multiple languages at a time, e.g. English and German:
tesseract myscan.png out -l eng+deu
Tesseract also includes a hOCR mode, which produces a special HTML file with the coordinates of each word. This can be used to create a searchable pdf, using a tool such as Hocr2PDF. To use it, use the ‘hocr’ config option, like this:
tesseract myscan.png out hocr
You can also create a searchable pdf directly from tesseract (versions >= 3.03):
tesseract myscan.png out pdf
More information about the various options is available in the Tesseract manpage.
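For scripted use, the command lines shown above can be assembled programmatically. The following is a minimal sketch rather than an official wrapper: `tesseract_command` is a hypothetical helper that only builds the argument list, joining several languages with `+` as in the `eng+deu` example and appending config names such as `hocr` or `pdf` at the end.

```python
import shlex


def tesseract_command(image, outputbase, langs=(), configs=()):
    """Build an argument list following: tesseract imagename outputbase [-l lang] [configfile...]"""
    cmd = ["tesseract", image, outputbase]
    if langs:
        # Multiple languages are combined with '+', e.g. eng+deu.
        cmd += ["-l", "+".join(langs)]
    cmd += list(configs)
    return cmd


cmd = tesseract_command("myscan.png", "out", langs=["eng", "deu"], configs=["pdf"])
print(shlex.join(cmd))
# To actually run it (requires tesseract on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.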
Other Languages
Tesseract has been trained for many languages, check for your language in the Tessdata repository.
It can also be trained to support other languages and scripts; for more details see TrainingTesseract.
Development
Tesseract can also be used in your own project, under the terms of the Apache License 2.0. It has a fully featured API, and can be compiled for a variety of targets including Android and the iPhone. See the 3rdParty page for a sample of what has been done with it. Note that as yet there are very few 3rdParty Tesseract OCR projects being developed for Mac (with the only one being Tesseract macOS.md), although there are several online OCR services that can be used on Mac that may use Tesseract as their OCR engine.
Also, it is free software, so if you want to pitch in and help, please do!
If you find a bug and fix it yourself, the best thing to do is to attach the patch to your bug report in the Issues List.
Support
First read the documentation, particularly the FAQ to see if your problem is addressed there.
If not, search the Tesseract user forum or the
Tesseract developer forum, and if you still can’t find what you need, please ask us there.
I am currently working on an optical character recognition project using Python 2.7 and OpenCV on Windows. I learned that this can be done with Tesseract, but I could not install it on Windows. I searched a lot but could not find a solution. Can anyone tell me whether there is a way to install it on Windows, or whether the task can be done without it?
asked Sep 10, 2017 at 12:00
Simple steps for Tesseract installation on Windows:
- Download the Tesseract exe from https://github.com/UB-Mannheim/tesseract/wiki.
- Install this exe in C:\Program Files (x86)\Tesseract-OCR
- Open the virtual machine command prompt in Windows or the Anaconda prompt.
- Run pip install pytesseract
- To test that pytesseract is installed, type in the Python prompt:
import pytesseract
print(pytesseract)
zeit
answered Oct 25, 2020 at 8:42
To accomplish OCR with Python on Windows, you will need Python and OpenCV which you already have, as well as Tesseract and the Pytesseract Python package.
To install Tesseract OCR for Windows:
- Run the installer (find 2021) from UB Mannheim
- Configure your installation (choose installation path and language data to include)
- Add Tesseract OCR to your environment variables
To install and use Pytesseract on Windows:
- Simply run pip install pytesseract
- You will also need to install Pillow with pip install Pillow to use Pytesseract. Import it in your Python document like so: from PIL import Image
- You will need to add the following line in your code in order to be able to call pytesseract on your machine (note the raw string, so that the backslashes are not treated as escape sequences):
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
I’ve given a detailed walkthrough of how to install Tesseract OCR for Windows here if you would like further guidance.
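Putting these steps together, a minimal sketch might look like the following. The two install paths are assumptions taken from the answers on this page, `find_tesseract_exe` and `ocr_image` are hypothetical helper names, and `test.png` stands in for your own image. `pytesseract` and Pillow are imported lazily, so the path lookup works even before they are installed.

```python
import os

# Assumed default install locations; adjust to the path chosen in the installer.
CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]


def find_tesseract_exe(candidates=CANDIDATES):
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def ocr_image(image_path, lang="eng"):
    """Run OCR on one image and return the recognized text."""
    # Imported here so that find_tesseract_exe() is usable without these packages.
    import pytesseract
    from PIL import Image

    exe = find_tesseract_exe()
    if exe is not None:
        # Only needed when tesseract.exe is not already on PATH.
        pytesseract.pytesseract.tesseract_cmd = exe
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


print(find_tesseract_exe())
# Example call, assuming an image named test.png exists:
#   print(ocr_image("test.png"))
```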
Smart Manoj
answered May 23, 2021 at 9:06
brad
UB Mannheim provides pre-built binaries for the latest versions of tesseract.
From the tesseract GitHub wiki:
Windows
An unofficial installer for Windows for Tesseract 3.05-dev and Tesseract 4.00-dev is available from Tesseract at UB Mannheim. This includes the training tools.…
To access tesseract-OCR from any location you may have to add the directory where the tesseract-OCR binaries are located to the Path variables, probably C:\Program Files\Tesseract-OCR.
answered Sep 10, 2017 at 12:41
wkl
1. Installing Tesseract-OCR
First, download the Tesseract-OCR installer.
Download locations: (1) https://github.com/tesseract-ocr/tesseract/wiki/Downloads
(2) https://digi.bib.uni-mannheim.de/tesseract
I used the second address and downloaded the installer Tesseract-OCR-Setup-3.05.01.exe.
You can now run this installer.
There are two points to note:
(1) When selecting language data, only English is installed by default. If you want Tesseract to recognize other text, you need to tick the additional language data. It is not recommended to tick everything, though: most of the languages will never be used, and installing them all takes a long time.
(2) Remember your installation path, because it is needed when the environment variables are set.
For example, I installed into the folder D:\Tesseract.
2. Editing the environment variables
2.1 Once Tesseract-OCR is installed, add its installation path to the system PATH environment variable.
Open the following dialog via Control Panel > System > Advanced system settings:
Click Environment Variables:
Under System variables select Path, click Edit, then click New and add the folder D:\Tesseract\Tesseract-OCR to PATH.
2.2 Adding the TESSDATA_PREFIX variable
After setting the path, also create a TESSDATA_PREFIX variable under System variables whose value is the path D:\Tesseract\Tesseract-OCR. If it is not set, running tesseract --list-langs reports that the language packs could not be loaded.
Click New and set the variable name and value as follows:
At this point the Tesseract-OCR setup is complete.
3. Checking that Tesseract-OCR was installed successfully
Open a command prompt and enter tesseract -v; it prints the version of Tesseract that is currently installed.
Enter tesseract --list-langs to check the installed language packs.
If all goes well, Tesseract-OCR has been installed successfully and is ready to use.
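The language check can also be scripted. The sketch below parses the kind of text that `tesseract --list-langs` prints. The helper name `parse_list_langs` is hypothetical and the sample output is illustrative, not captured from a real run. The exact header line differs between versions, so the parser only assumes that it starts with "List of available languages".

```python
def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header (its exact wording varies by version).
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


# Illustrative sample, not real captured output.
sample = "List of available languages (3):\neng\nosd\nrus\n"
print(parse_list_langs(sample))
```

To feed it real output, one option is `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True)` and then parsing `stdout` and `stderr` together, since some older versions are reported to print the list to standard error.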
Contents
- Installing Tesseract for OCR
- Getting to know the program
- Installing Tesseract
- Verifying the installation
- Testing Tesseract OCR
- Limitations of Tesseract
- Summary
- Introduction
- Tesseract documentation
- Introduction
- Installation
- Linux
- Tesseract Development Version with LSTM engine and related traineddata
- Ubuntu PPA
- Debian
- AppImage
- Tesseract 4 packages with LSTM engine and related traineddata
- Ubuntu
- Ubuntu PPA
- Debian
- Raspbian
- RHEL/CentOS/Scientific Linux, Fedora, openSUSE packages
- FOR EXPERTS ONLY.
- Windows
- Cygwin
- Training with Tesseract OCR
- 0. What we need
- 1. Creating and editing a box file
- How to install Tesseract on Windows 10
- Error when installing the tesseract-ocr module: how to fix it?
Installing Tesseract for OCR
OCR is the mechanical or electronic conversion of images of handwritten, typewritten or printed text into the text data used to represent characters in a computer.
Getting to know the program
Tesseract was originally developed at Hewlett-Packard in the 1980s, and its source code was released in 2005. From 2006 its development was continued by Google, with the sources available under the Apache 2.0 license.
Tesseract works with many natural languages, from English (initially) to Punjabi. Since an update in 2015 it supports more than 100 written languages and includes code for training on additional ones. Russian is supported by installing additional modules.
The program was originally written in C and was ported to C++ in 1998. It has no graphical interface, but there are third-party projects that wrap Tesseract to provide a GUI.
Installing Tesseract
To use the Tesseract library, it must first be installed on the operating system.
On macOS, use brew:
On Ubuntu:
No official binary builds of Tesseract are provided for Windows users, so the recommendation is to search for third-party builds.
Verifying the installation
To check that Tesseract was installed successfully, run the following command:
The Tesseract version should be printed on the command line, together with a list of the image format libraries it was built with.
If an error appears:
then go back to the previous step and fix the installation errors. You may also need to update the PATH environment variable (advanced users only).
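The verification step can be automated along these lines. This is a sketch under the assumption that the executable is named `tesseract` and accepts `--version`; the helper name `tesseract_version` is made up for illustration. It returns None when the program is not on PATH, which corresponds to the error case described above.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some versions print the banner to stderr, so look at both streams.
    text = (result.stdout + result.stderr).strip()
    return text.splitlines()[0] if text else ""


print(tesseract_version() or "tesseract was not found on PATH")
```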
Testing Tesseract OCR
To get reasonable results from Tesseract OCR, the incoming images need to be preprocessed with digital filters.
When using Tesseract it is recommended to:
Deviating from these recommendations can lead to incorrect OCR results.
Now let us apply OCR to the following image:
Run the command in a terminal:
Tesseract correctly recognized the text "Testing Tesseract OCR" and printed it to the terminal.
Limitations of Tesseract
Unfortunately, this synthetic example is quite far from reality. If the text to be recognized is poorly separated from the background or heavily pixelated, Tesseract will most likely return incorrect results. Tesseract is best suited to document-processing pipelines in which images are scanned, cleaned up with digital filters, and then passed to optical character recognition.
Note that Tesseract is not a ready-made OCR solution that works in every image processing and computer vision application. Complex special cases call for feature extraction, machine learning and artificial intelligence methods.
Summary
If the images being processed do not contain clear text, Tesseract will give poor results. For noisy input images, better accuracy can be obtained by training a custom machine learning model.
Tesseract is best suited to high-resolution images in which the foreground text is clearly separated from the background.
Tesseract documentation
Installation
There are two parts to install, the engine itself, and the training data for a language.
Linux
Note for Ubuntu users: In case apt is unable to find the package try adding universe entry to the sources.list file as shown below.
Packages for over 130 languages and over 35 scripts are also available directly from the Linux distributions. The language packages are called ‘tesseract-ocr-langcode’ and ‘tesseract-ocr-script-scriptcode’, where langcode is a three-letter language code and scriptcode is a four-letter script code.
Examples: tesseract-ocr-eng (English), tesseract-ocr-ara (Arabic), tesseract-ocr-chi-sim (Simplified Chinese), tesseract-ocr-script-latn (Latin Script), tesseract-ocr-script-deva (Devanagari script), etc.
For distributions that are supported by snapd you may also run the following command to install the tesseract built binaries (Don’t have snapd installed?):
The traineddata is currently not shipped with the snap package and must be placed manually to
Tesseract Development Version with LSTM engine and related traineddata
5.00 Alpha
Ubuntu PPA
Debian
AppImage
Included traineddata files
Tesseract 4 packages with LSTM engine and related traineddata
Ubuntu
Ubuntu PPA
Debian
There are also 4.1.x packages for other versions of Debian, check it here https://notesalexp.org/tesseract-ocr/
Raspbian
RHEL/CentOS/Scientific Linux, Fedora, openSUSE packages
For example to install Tesseract with German language traineddata:
For CentOS 8 run the following as root:
For RHEL 7 run the following as root:
For CentOS 7 run the following as root:
For Scientific Linux 7 run the following as root:
For Fedora 32 run the following as root:
For Fedora 31 run the following as root:
For openSUSE Tumbleweed run the following as root:
For openSUSE Leap 15.0 run the following as root:
FOR EXPERTS ONLY.
If you are experimenting with OCR Engine modes, you will need to manually install language training data beyond what is available in your Linux distribution.
Windows
Installers for Windows for Tesseract 3.05, Tesseract 4 and development version 5.00 Alpha are available from Tesseract at UB Mannheim. These include the training tools. Both 32-bit and 64-bit installers are available.
Experts can also get binaries built with Visual Studio from the build artifacts of the Appveyor Continuous Integration.
Cygwin
Released versions >= 3.02 of tesseract-ocr are part of Cygwin.
The latest version available is 4.1.0. Please see the announcement.
Training with Tesseract OCR
Tesseract is a free platform for optical character recognition whose sources Google gave to the community in 2006. If you write software that recognizes text, you have probably had to turn to this powerful library. And if it failed on your text (which is most likely the case), you have only one option left: to train it. The process is fairly complicated and full of non-obvious, at times downright magical, steps.
The original project is on GitHub, and the installer can be downloaded here. At the time of writing, the installer version was 3.05.01. It took me a good while to get to grips with it, so I decided to write down what to do and how, in case I forget something later, and to help others get through it faster next time.
0. What we need
Builds of this library exist for Windows (the installer can be downloaded from here) and for Linux. On Debian-based Linux distributions, tesseract can be installed simply with sudo apt-get install tesseract-ocr.
1. Creating and editing a box file
Box files are used to mark the characters in an image and map them to the UTF-8 characters of the text. They are ordinary text files in which each character has a line containing the character and the coordinates of its bounding rectangle in pixels. The file is first generated with the utility from the tesseract package:
tesseract ccc.eee.exp0.tif ccc.eee.exp0 batch.nochop makebox
This produces a box file in the current directory. Let us take a look inside. Oh, and before I forget: make sure the location of the installed Tesseract-OCR is added to the Path environment variable in Windows, otherwise the tesseract command will not work in the console.
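To inspect a box file programmatically, each line can be split into its fields. This sketch assumes the common box line layout of a symbol followed by left, bottom, right, top and a page number, with coordinates measured from the bottom-left corner of the image. The `Box` type, the `parse_box_line` helper and the sample line are all made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line of the assumed form: <symbol> <left> <bottom> <right> <top> <page>."""
    # Split from the right so that the symbol field is kept intact.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


# Illustrative line, not taken from a real box file.
box = parse_box_line("T 36 92 53 110 0")
print(box.symbol, box.right - box.left, box.top - box.bottom)
```

Comparing the parsed symbols with the expected text is a quick way to see where characters were split or merged before correcting them in an editor.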
Do the characters at the start of each line exactly match the characters in the image? If so, there is nothing to train and you can rest easy. In our case the characters will most likely match neither in content nor in number. In other words, tesseract with its default dictionary not only failed to recognize the characters, it also counted some of them as two or more. Some characters may also "stick together", that is, end up in a shared box and be recognized as a single character. All of this has to be corrected before going any further.
The work is tedious and painstaking, but fortunately there are a number of third-party utilities for it. I used jTessBoxEditor, for example. Open the image with it, and it picks up the box file with the same name by itself (as long as everything is in the same folder).
Half a day later, you close jTessBoxEditor with a feeling of deep satisfaction (you did remember to save the result, right?) and you have a correct box file. Now you can move on to the next stage.
How to install Tesseract on Windows 10
The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub’s log of contributors.
Tesseract has Unicode (UTF-8) support, and can recognize more than 100 languages "out of the box".
Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The main branch also has experimental support for ALTO (XML) output.
You should note that in many cases, in order to get better OCR results, you’ll need to improve the quality of the image you are giving Tesseract.
This project does not include a GUI application. If you need one, please see the 3rdParty documentation.
Tesseract can be trained to recognize other languages. See Tesseract Training for more information.
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.
The latest (LSTM based) stable version is 4.1.1, released on December 26, 2019. Latest source code is available from main branch on GitHub. Open issues can be found in issue tracker, and planning documentation.
The latest 3.0x version is 3.05.02, released on June 19, 2018. Latest source code for 3.05 is available from 3.05 branch on GitHub. There is no development for this version, but it can be used for special cases (e.g. see Regression of features from 3.0x).
See Release Notes and Change Log for more details of the releases.
Error when installing the tesseract-ocr module: how to fix it?
Hello!
I have run into a problem: I am trying to install the tesseract-ocr module from the command line, and the following error appears:
I am attaching a screenshot from Visual Studio Installer. For some reason there is no Python entry there. You can also see all the installed components, if that helps:
Please help, I have been racking my brain for two days over what it wants from me.
Thanks in advance to everyone who replies!
Hello!
Try a different installation method, through Anaconda.
Unfortunately that did not work. Do you have any other ideas for how to fix this? 🙂
Two: pip install pytesseract pillow
I installed the file and ran pip install pytesseract pillow on the command line, but the problem has not gone away.
I did not quite understand: is this supposed to be added to the program code?
from PIL import Image
import pytesseract
I tried adding it to the code, same story. Could this be caused by the large number of unorganized C++ components?
Here is a screenshot from the Control Panel. Is this normal, or is the problem somewhere else?
Apologies in advance for such basic questions, I am still new to this 🙂
Tesseract OCR
Table of Contents
- Tesseract OCR
- About
- Brief history
- Installing Tesseract
- Running Tesseract
- For developers
- Support
- License
- Dependencies
- Latest Version of README
About
This package contains an OCR engine, libtesseract, and a command line program, tesseract.
Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0).
It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.
Stefan Weil is the current lead developer. Ray Smith was the lead developer until 2018. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub’s log of contributors.
Tesseract has Unicode (UTF-8) support, and can recognize more than 100 languages "out of the box".
Tesseract supports various image formats including PNG, JPEG and TIFF.
Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV and ALTO (the last one — since version 4.1.0).
You should note that in many cases, in order to get better OCR results, you’ll need to improve the quality of the image you are giving Tesseract.
This project does not include a GUI application. If you need one, please see the 3rdParty documentation.
Tesseract can be trained to recognize other languages.
See Tesseract Training for more information.
Brief history
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol UK and at Hewlett-Packard Co, Greeley Colorado USA between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.
Major version 5 is the current stable version and started with release 5.0.0 on November 30, 2021. Newer minor versions and bugfix versions are available from GitHub.
Latest source code is available from main branch on GitHub. Open issues can be found in issue tracker, and planning documentation.
See Release Notes and Change Log for more details of the releases.
Installing Tesseract
You can either install Tesseract via a pre-built binary package or build it from source.
A C++ compiler with good C++17 support is required for building Tesseract from source.
Running Tesseract
Basic command line usage:
tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
For more information about the various command line options use tesseract --help or man tesseract.
Examples can be found in the documentation.
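When the command line is generated by a script, it can help to validate the mode numbers before launching the program. The sketch below follows the usage line above. The helper `build_args` is hypothetical, and the accepted ranges (0 to 3 for `--oem`, 0 to 13 for `--psm`) are assumptions based on Tesseract 4 and 5, so check `tesseract --help` for your version.

```python
# Assumed valid ranges for Tesseract 4/5; verify against your installed version.
VALID_OEM = range(0, 4)
VALID_PSM = range(0, 14)


def build_args(image, outputbase, lang=None, oem=None, psm=None, configfiles=()):
    """Build: tesseract imagename outputbase [-l lang] [--oem N] [--psm N] [configfiles...]"""
    if oem is not None and oem not in VALID_OEM:
        raise ValueError(f"unsupported OCR engine mode: {oem}")
    if psm is not None and psm not in VALID_PSM:
        raise ValueError(f"unsupported page segmentation mode: {psm}")
    args = ["tesseract", image, outputbase]
    if lang is not None:
        args += ["-l", lang]
    if oem is not None:
        args += ["--oem", str(oem)]
    if psm is not None:
        args += ["--psm", str(psm)]
    args += list(configfiles)
    return args


print(build_args("page.png", "out", lang="eng", oem=1, psm=6))
```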
For developers
Developers can use the libtesseract C or C++ API to build their own application. If you need bindings to libtesseract for other programming languages, please see the wrapper section in the AddOns documentation.
Documentation of Tesseract generated from source code by doxygen can be found on tesseract-ocr.github.io.
Support
Before you submit an issue, please review the guidelines for this repository.
For support, first read the documentation, particularly the FAQ to see if your problem is addressed there.
If not, search the Tesseract user forum, the Tesseract developer forum and past issues, and if you still can’t find what you need, ask for support in the mailing-lists.
Mailing-lists:
- tesseract-ocr — For tesseract users.
- tesseract-dev — For tesseract developers.
Please report an issue only for a bug, not for asking questions.
License
The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
NOTE: This software depends on other packages that may be licensed under different open source licenses.
Tesseract uses Leptonica library which essentially uses a BSD 2-clause license.
Dependencies
Tesseract uses Leptonica library for opening input images (e.g. not documents like pdf).
It is suggested to use leptonica with built-in support for zlib, png and tiff (for multipage tiff).
Latest Version of README
For the latest online version of the README.md see:
https://github.com/tesseract-ocr/tesseract/blob/main/README.md
- How to Use Tesseract OCR in Windows
Published April 8, 2022
What is Tesseract OCR?
Tesseract is an optical character recognition engine that can be used on a variety of operating systems. It is free software, released under the Apache License. In this guide, I will take you through the steps that I followed to install Tesseract on my Windows 10 machine. Major version 5 is the current stable version and began with release 5.0.0 on November 30, 2021.
How to Use Tesseract OCR in Windows
- Install Tesseract OCR on a Windows 10 using .exe file
- Configure the Tesseract installation
- Add installation path to environment variables
- Run Tesseract OCR for Windows on a test image
Step 1: Install Tesseract OCR in Windows 10 using .exe File:
The first step to install Tesseract OCR for Windows is to download the .exe installer that corresponds to your machine’s operating system.
Step 2: Configure Installation
Next, we’ll need to configure the Tesseract installation. If you’re feeling confident and only want to run Tesseract OCR for Windows with the default language set to English, running through the installation screens with all of the default options selected should work.
Installer Language
This is just the language for the dialog boxes and help information. Tesseract OCR for Windows itself can still be run in multiple languages, whichever installer language is chosen here:
Installer language for Tesseract OCR for Windows
Tesseract OCR Setup
The setup screen recommends that all other applications are closed before continuing with the installation.
The Tesseract OCR for Windows installation screen.
Choose Install Location
Next, we’ll choose the installation location. Before proceeding to the next step, make sure to copy the install location to a .txt file. We will need to add the installation location to our machine’s environment variables once the installation is complete.
Choose the installation location.
Choose Components
By default, the ScrollView, Training Tools, Shortcuts creation, and Language data are all selected. Unless you have a specific reason not to install these, we will want to keep all of these selected.
Default Tesseract OCR for Windows installation components.
If we scroll down and expand the ‘Additional script data’, we will see that we have the option to download and install additional script data. This can be helpful in improving the accuracy of text extraction from certain scripted languages. It’s up to you if you want to install these.
Optional script installation components.
In the last step of the installation, we’ll be asked to choose the start menu folder for Tesseract OCR for Windows shortcuts. I’ve left mine set to the default name: ‘Tesseract-OCR’.
Choose the start menu folder for the Tesseract OCR for Windows shortcuts.
After we click install, Tesseract OCR for Windows will begin installing. Our next step is to add the installation path to our machine’s environment variables.
Step 3: Add Installation Path to Environment Variables
Control Panel
To add the installation location to our environment variables, go to the Start menu and search for ‘environment variables’. You should see a result to edit the system environment variables. If you don’t, you can always use the following steps: Start menu > Control Panel > Edit the system environment variables.
Searching for ‘environment variables’
System Properties
When presented with the ‘System Properties’ dialog box, we’ll want to make sure the Advanced tab is selected, then click the Environment Variables button towards the bottom right of the screen.
Environment Variables
Under System variables, select the Path variable and click the Edit button.
When presented with the ‘Edit environment variable’ screen, click the New button and paste in the Tesseract OCR installation path that we copied earlier in Step 2. Once you’ve done this, click the ‘OK’ button.
Add Tesseract OCR for Windows Installation Directory to Environment Variables
That’s it! Now that we’ve run the .exe installer and added the Tesseract OCR for Windows install location to our environment variables, we can test that our installation is working by running Tesseract on a test image.
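As a quick sanity check on Step 3, the following Python sketch tests whether a directory appears in a PATH-style string. The helper name path_contains and the install directory in the usage comment are illustrative assumptions, not something the installer provides. Note that a Command Prompt opened before the change will not see the new value.

```python
import os


def path_contains(path_value: str, directory: str, sep: str = os.pathsep) -> bool:
    """Return True if `directory` is one of the entries in a PATH-style string."""

    def normalize(entry: str) -> str:
        # Ignore surrounding whitespace/quotes, trailing slashes, and letter case
        # (Windows paths are case-insensitive).
        return entry.strip().strip('"').rstrip("\\/").lower()

    wanted = normalize(directory)
    return any(
        normalize(entry) == wanted
        for entry in path_value.split(sep)
        if entry.strip()
    )


# Usage (the directory below is a hypothetical install location; substitute
# the path you copied in Step 2):
#   path_contains(os.environ.get("PATH", ""), r"C:\Program Files\Tesseract-OCR")
```

If the function returns False in a freshly opened terminal, the path was most likely added to a different variable or mistyped.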
Step 4: Run Tesseract OCR for Windows on a Test Image
To test that Tesseract OCR for Windows was installed successfully, open command prompt on your machine, then run the Tesseract command. You should see an output with a quick explanation of Tesseract’s usage options.
Checking successful installation of Tesseract OCR for Windows
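The same check can be scripted. The sketch below looks for the tesseract executable on PATH and builds a command line for a first test image. The function names are made up for illustration and the image name in the usage comment is a placeholder. Passing stdout as the output base asks Tesseract to print the recognized text instead of writing a file.

```python
import shutil
from typing import List, Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


def build_ocr_command(image_path: str, lang: str = "eng") -> List[str]:
    """Build a command that prints the recognized text to standard output."""
    # Form: tesseract <image> <outputbase> -l <lang>; "stdout" prints the text.
    return ["tesseract", image_path, "stdout", "-l", lang]


# Usage ("test.png" is a placeholder file name, not one shipped by the installer):
#   import subprocess
#   if find_tesseract() is not None:
#       result = subprocess.run(build_ocr_command("test.png"),
#                               capture_output=True, text=True)
#       print(result.stdout)
```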
Congratulations! You’ve successfully installed Tesseract OCR for Windows on your machine.
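To see which language and script data from the ‘Choose Components’ screen actually ended up on the machine, Tesseract can be asked with tesseract --list-langs. The following sketch parses that output. It assumes a header line beginning with "List of available languages" followed by one name per line, and the exact header wording can differ between versions.

```python
from typing import List


def parse_language_list(output: str) -> List[str]:
    """Extract language/script names from `tesseract --list-langs` output."""
    names = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header ("List of available languages ...").
        if not line or line.lower().startswith("list of available languages"):
            continue
        names.append(line)
    return names


# Usage (some versions write the list to stderr rather than stdout):
#   import subprocess
#   out = subprocess.run(["tesseract", "--list-langs"],
#                        capture_output=True, text=True)
#   print(parse_language_list(out.stdout or out.stderr))
```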
Advantages of using IronOCR to do OCR Work:
IronOCR provides Tesseract OCR on Mac, Windows, Linux, Azure and Docker for:
- .NET Framework 4.0 +
- .NET Standard 2.0 +
- .NET Core 2.0 +
- .NET 5
- Mono for macOS and Linux
- Xamarin for macOS
IronOCR reads text, barcodes, and QR codes from all major image and PDF formats using the latest Tesseract 5 engine. This library adds OCR functionality to Desktop, Console and Web applications in minutes. It supports 127+ international languages. Licenses start from $749.
Step 1: Install the latest version of IronOCR
Install DLL
Download the IronOcr DLL directly to your machine.
Install NuGet
Alternatively, you can install it through NuGet.
PM > Install-Package IronOcr
Step 2: Apply Your License Key
Set your IronOCR license key using code
Add this code to the startup of your application before IronOCR is used.
C#:
IronOcr.Installation.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01";
VB:
IronOcr.Installation.LicenseKey = "IRONOCR-MYLICENSE-KEY-1EF01"
Step 3: Test your Key
Test if your key has been installed correctly.
C#:
bool result = IronOcr.License.IsValidLicense("IRONOCR-MYLICENSE-KEY-1EF0");
VB:
Dim result As Boolean = IronOcr.License.IsValidLicense("IRONOCR-MYLICENSE-KEY-1EF0")
Get started with the project
C#:
// PM > Install-Package IronOcr
// using IronOcr;
var Ocr = new IronTesseract();
// Hundreds of languages available
Ocr.Language = OcrLanguage.English;
using (var Input = new OcrInput())
{
    // Add the image to the input object (an instance call, not a static one)
    Input.Add(@"img\example.tiff");
    // Input.DeNoise(); // optional
    // Input.Deskew(); // optional
    IronOcr.OcrResult Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
    // Explore the OcrResult using IntelliSense
}
VB:
' PM > Install-Package IronOcr
' Imports IronOcr
Dim Ocr = New IronTesseract()
' Hundreds of languages available
Ocr.Language = OcrLanguage.English
Using Input = New OcrInput()
    ' Add the image to the input object
    Input.Add("img\example.tiff")
    Dim Result As IronOcr.OcrResult = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
    ' Explore the OcrResult using IntelliSense
End Using
How to Use Tesseract OCR in C# for .NET?
- Install Google Tesseract and IronOCR for .NET into Visual Studio
- Check the latest builds in C#
- Review accuracy and image compatibility
- Test performance and API function
- Consider Multi-Language Support
Use NuGet Package Manager to install the IronOCR NuGet Package into your Visual Studio solution.
C#:
// PM > Install-Package IronOcr
// using IronOcr;
var Ocr = new IronTesseract();
// Hundreds of languages available
Ocr.Language = OcrLanguage.English;
using (var Input = new OcrInput())
{
    // Add the image to the input object (an instance call, not a static one)
    Input.Add(@"img\example.tiff");
    // Input.DeNoise(); // optional
    // Input.Deskew(); // optional
    IronOcr.OcrResult Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
    // Explore the OcrResult using IntelliSense
}
VB:
' PM > Install-Package IronOcr
' Imports IronOcr
Dim Ocr = New IronTesseract()
' Hundreds of languages available
Ocr.Language = OcrLanguage.English
Using Input = New OcrInput()
    ' Add the image to the input object
    Input.Add("img\example.tiff")
    Dim Result As IronOcr.OcrResult = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
    ' Explore the OcrResult using IntelliSense
End Using
IronOCR Tesseract for C#
With IronOCR, all Tesseract installation happens entirely using the NuGet Package Manager.
PM > Install-Package IronOcr
Tesseract 5 API in IronOCR Tesseract
To date, IronTesseract is the only known implementation of Tesseract 5 for .NET Framework or Core.
C#:
// using IronOcr;
var Ocr = new IronTesseract(); // nothing to configure
using (var Input = new OcrInput(@"images\image.png"))
{
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
VB:
' Imports IronOcr
Dim Ocr = New IronTesseract() ' nothing to configure
Using Input = New OcrInput("images\image.png")
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
Tesseract 4 API in IronOCR Tesseract
C#:
// using IronOcr;
var Ocr = new IronTesseract();
// Select the Tesseract 4 engine instead of the default
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract4;
using (var Input = new OcrInput(@"images\image.png"))
{
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
VB:
' Imports IronOcr
Dim Ocr = New IronTesseract()
' Select the Tesseract 4 engine instead of the default
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract4
Using Input = New OcrInput("images\image.png")
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
Why IronOCR Is Better Than Tesseract:
ACCURACY
TESSERACT:
If Tesseract encounters an image that is rotated, skewed, is of a low DPI, scanned, or has background noise, it becomes almost impossible for Tesseract to get data from that image. In addition, Tesseract will also take a very long time to process that document before providing you with nonsensical information.
IRONOCR:
IronOCR takes this headache away. Users often achieve 99.8-100% accuracy with minimal configuration.
IMAGE COMPATIBILITY
TESSERACT:
Only accepts Leptonica PIX image format which is an IntPtr C++ object in C#. PIX objects are not managed memory — and failure to handle them with care in C# results in memory leaks.
IRONOCR:
Images are memory managed. PDF and TIFF are supported. System.Drawing, Stream, and byte array inputs are included for every file format.
Broad image support:
- PDF Documents
- PDF Pages
- MultiFrame TIFF files
- JPEG & JPEG2000
- GIF
- PNG
- System.Drawing.Image
- Binary image Data (byte[])
- And many more…
PERFORMANCE
TESSERACT:
Google Tesseract can produce fast and accurate results if it is properly tuned and the input images have been preprocessed using Photoshop or ImageMagick.
IRONOCR:
The IronOcr .NET Tesseract DLL works accurately and at speed for most images out of the box. We have implemented multithreading to make use of the multi-core processors that most machines now have. Even low-resolution images generally work with a high degree of accuracy in your program. No Photoshop required.
API
TESSERACT:
We have two free choices:
- Work with Interop layers — many that are found on GitHub are out of date, have unresolved tickets, memory leaks, and Console warnings. May not support .NET Core or Standard.
- Work with the command line EXE — difficult to deploy and constantly interrupted by virus scanners and security policies.
IRONOCR:
A managed and tested .NET Library for Tesseract called IronTesseract.
Fully documented with IntelliSense support.
LANGUAGE
TESSERACT:
Supports over 100 languages.
IRONOCR:
Supports 127+ languages.
Conclusion
Tesseract is an excellent resource for C++ developers, but it is not a complete OCR library for .NET. Scanned or photographed images need to be processed so as to be orthogonal, standardized, high-resolution, and free of digital noise before Tesseract can accurately work with them.
In contrast, IronOCR can do this and more, with just a single line of code. IronOCR does use Tesseract as its internal OCR engine, but it is a very finely tuned Tesseract, built for C#, with many performance improvements and features added as standard.
You can download the software product from this link.
In this tutorial, we will configure our development environment for OCR. Once your machine is configured, we’ll start writing Python code to perform OCR, paving the way for you to develop your own OCR applications.
To learn how to configure your development environment, just keep reading.
Learning Objectives
In this tutorial, you will:
- Learn how to install the Tesseract OCR engine on your machine
- Learn how to create a Python virtual environment (a best practice in Python development)
- Install the necessary Python packages you need to run the examples in this tutorial (and develop OCR projects of your own)
OCR Development Environment Configuration
In the first part of this tutorial, you will learn how to install the Tesseract OCR engine on your system. From there, you’ll learn how to create a Python virtual environment and then install OpenCV, PyTesseract, and all the other necessary Python libraries you’ll need for OCR, computer vision, and deep learning.
A Note on Install Instructions
The Tesseract OCR engine has existed for over 30 years. The install instructions for Tesseract OCR are fairly stable. Therefore I have included the steps.
With that said, let’s install the Tesseract OCR engine on your system!
Installing Tesseract
Inside this tutorial, you will learn how to install Tesseract on your machine.
Installing Tesseract on macOS
Installing the Tesseract OCR engine on macOS is quite simple if you use the Homebrew package manager.
Use the link above to install Homebrew on your system if it is not already installed.
From there, all you need to do is use the brew command to install Tesseract:
$ brew install tesseract
Provided that the above command does not exit with an error, you should now have Tesseract installed on your macOS machine.
Installing Tesseract on Ubuntu
Installing Tesseract on Ubuntu 18.04 is easy. All we need to do is use the apt package manager:
$ sudo apt install tesseract-ocr
The apt package manager will automatically install any prerequisite libraries or packages required for Tesseract.
Installing Tesseract on Windows
Please note that the PyImageSearch team and I do not officially support Windows, except for customers who use our pre-configured Jupyter/Colab Notebooks, which you can find at PyImageSearch University. These notebooks run on all environments, including macOS, Linux, and Windows.
We instead recommend using a Unix-based machine such as Linux/Ubuntu or macOS, both of which are better suited for developing computer vision, deep learning, and OCR projects.
That said, if you wish to install Tesseract on Windows, we recommend that you follow the official Windows install instructions put together by the Tesseract team.
Verifying Your Tesseract Install
Provided that you were able to install Tesseract on your operating system, you can verify that Tesseract is installed by using the tesseract command:
$ tesseract -v
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
Your output should look similar to mine.
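If you would rather check the version from a script than by eye, the sketch below pulls the version number out of text shaped like the output above. The function name is an illustrative choice, and the pattern assumes the first token is the word tesseract followed by a dotted version, optionally prefixed with v.

```python
import re
from typing import Optional, Tuple


def parse_tesseract_version(output: str) -> Optional[Tuple[int, ...]]:
    """Return the version as a tuple of ints, or None if no version line is found."""
    match = re.search(r"^tesseract\s+v?(\d+(?:\.\d+)*)", output, flags=re.MULTILINE)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Usage (some versions print the banner to stderr rather than stdout):
#   import subprocess
#   out = subprocess.run(["tesseract", "-v"], capture_output=True, text=True)
#   print(parse_tesseract_version(out.stdout or out.stderr))
```

Returning a tuple makes comparisons such as version >= (4, 0) straightforward.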
Creating a Python Virtual Environment for OCR
Python virtual environments are a best practice for Python development, and we recommend using them to have more reliable development environments.
Installing the necessary packages for Python virtual environments, as well as creating your first Python virtual environment, can be found in our pip Install OpenCV tutorial. We recommend you follow that tutorial to create your first Python virtual environment.
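As an alternative sketch, a virtual environment can also be created with the venv module from the Python standard library. This is a different tool from the one the referenced tutorial sets up (the workon command used later is not provided by venv). The function name and the directory in the usage comment are assumptions chosen for illustration.

```python
import venv
from pathlib import Path


def create_ocr_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a virtual environment at `env_dir` and return its path."""
    target = Path(env_dir).expanduser()
    # with_pip=True bootstraps pip inside the new environment.
    venv.create(target, with_pip=with_pip)
    return target


# Usage (hypothetical location):
#   create_ocr_env("~/envs/ocr")
# Then activate it from a shell:
#   source ~/envs/ocr/bin/activate        (macOS/Linux)
#   %USERPROFILE%\envs\ocr\Scripts\activate   (Windows)
```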
Installing OpenCV and PyTesseract
Now that you have your Python virtual environment created and ready, we can install both OpenCV and PyTesseract, the Python package that interfaces with the Tesseract OCR engine.
Both of these can be installed using the following commands:
$ workon <name_of_your_env> # required if using virtual envs
$ pip install numpy opencv-contrib-python
$ pip install pytesseract
Next, we’ll install other Python packages we’ll need for OCR, computer vision, deep learning, and machine learning.
Installing Other Computer Vision, Deep Learning, and Machine Learning Libraries
Let’s now install some other supporting computer vision and machine learning/deep learning packages that we’ll need throughout the rest of this tutorial:
$ pip install pillow scipy
$ pip install scikit-learn scikit-image
$ pip install imutils matplotlib
$ pip install requests beautifulsoup4
$ pip install h5py tensorflow textblob
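Once the installs finish, a short script can confirm that everything is importable from the active environment. The sketch below maps each pip package name from the commands above to the module name used in import statements, since several differ (for example opencv-contrib-python is imported as cv2). The helper name missing_packages is an illustrative choice.

```python
from importlib.util import find_spec
from typing import Dict, List

# pip package name -> module name used in `import` statements.
IMPORT_NAMES: Dict[str, str] = {
    "numpy": "numpy",
    "opencv-contrib-python": "cv2",
    "pytesseract": "pytesseract",
    "pillow": "PIL",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "scikit-image": "skimage",
    "imutils": "imutils",
    "matplotlib": "matplotlib",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "h5py": "h5py",
    "tensorflow": "tensorflow",
    "textblob": "textblob",
}


def missing_packages(import_names: Dict[str, str] = IMPORT_NAMES) -> List[str]:
    """Return the pip names whose modules cannot be found in this environment."""
    # find_spec locates a module without importing it, so the check stays fast.
    return [pkg for pkg, module in import_names.items() if find_spec(module) is None]


# Usage:
#   print(missing_packages() or "All packages found.")
```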
Summary
In this tutorial, you learned how to install the Tesseract OCR engine on your machine. You also learned how to install the required Python packages you will need to perform OCR, computer vision, and image processing.
Now that your development environment is configured, we will write OCR code in our next tutorial!