This page archives the FAQ page pertaining to Tesseract 2.0x, 3.0x and 4.00.00alpha as of May 1, 2018.
The main FAQ page will contain information pertaining to Tesseract 4.0.0 and later.
A collection of frequently asked questions and the answers, or pointers to them. If you have a question which is not answered below, please post it to the forums.
- Can I increase speed of OCR?
- Windows: tesseract closes automatically right after launching
- What output formats can Tesseract produce?
- How to interpret hOCR baseline output?
- libtesseract.so.3: cannot open shared object file
- Tesseract does not work
- How do I uninstall Tesseract
- How do I run multiple versions of Tesseract on same machine
- Error in pixReadStream:
- Can't open eng.unicharset?
- leptonica library missing
- Can't read compressed TIFF files
- No output with color images
- Does it support multi-page TIFF files?
- Why doesn't viewer/svutil.cpp compile?
- Where are the training tools for Ubuntu 14.04 ?
- How do I Edit Box files used in training?
- Utf8 buffer too big, size=xx (Error during training)
- How do I recognize only digits?
- How do I add just one character or one font to my favourite language, without having to retrain from scratch?
- How do I produce searchable PDF output?
- The produced searchable PDF seems to only contain spaces
- How to do streaming
- Is there a Minimum / Maximum Text Size? (It won't read screen text!)
- Why is the output empty or of poor quality?
- How do I generate the language data files?
- How do I unpack or alter existing language data files?
- How do I provide my own dictionary?
- wordlist2dawg doesn't work!
- How to increase the trust in/strength of the dictionary?
- What are configs and how can I have more?
- There are inconsistent results from tesseract when the same TessBaseAPI object is used for decoding multiple images
- Diacritics above and below the glyph are ignored/cause garbage output
- Where is the documentation?
- How can I try the next version?
- actual_tessdata_num_entries_<= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file ..\ccutil\tessdatamanager.cpp, line 55_
- last_char == '\n':Error:Assert failed:in file ..\ccutil\tessdatamanager.cpp, line 95
- Error: Illegal min or max specification
- Error: X classes in inttemp while unicharset contains Y unichars.
- Error: Size of unicharset is greater than MAX_NUM_CLASSES
- How can I make the error messages go to tesseract.log instead of stderr?
- How can I suppress tesseract info line?
- How can I get the coordinates and confidence of each word?
- How is confidence calculated?
- Can I use Tesseract for handwriting recognition?
- Can I use tesseract for barcode recognition?
- Issues with older versions of tesseract
- How to port Tesseract engine into vb6 project?
- Delphi wrapper for tessdll.dll
- Incomplete OCR result
- error LNK2001: unresolved external symbol "public: static char cdecl TessBaseAPI::TesseractRect(unsigned char const,int,int,int,int,int,int)"
- box overlaps no blobs or blobs in multiple rows error when training
- Training failed with error message "mf.cpp:78: FEATURE_SET_STRUCT* ExtractMicros..."
- error message: Font id = -1/0, class id = 1/105 on sample 0
- My question isn't in here!
If you think you found a bug in Tesseract, please create an issue. Questions should be asked in the users mailing-list.
You must read the CONTRIBUTING guide before you report an issue in GitHub or ask a question in the forum.
If you are processing several images, you can run tesseract in parallel with GNU Parallel. E.g. instead of:
find . -maxdepth 1 -name "*.tif" -print0 | while IFS= read -r -d '' n; do
tesseract "$n" "$n" -l eng hocr
hocr2pdf -i "$n" -n -o "$n.pdf" < "$n.html"
done
you can run:
parallel "tesseract {} {} -l eng hocr; hocr2pdf -i {} -n -o {}.pdf < {}.html" ::: *.tif
Note that this example is a little obsolete. You will get better results having Tesseract produce one page PDF files in parallel, then splicing them together at the end using QPDF or similar. If you are running Tesseract 4, you can use the "fast" models.
Tesseract 4 also uses up to four CPU threads while processing a page, so it will be faster than Tesseract 3 for a single page. If your computer has only two CPU cores, then running four threads will slow down things significantly and it would be better to use a single thread or maybe a maximum of two threads! Using a single thread eliminates the computation overhead of multithreading and is also the best solution for processing lots of images by running one Tesseract process per CPU core. Set the maximum number of threads using the environment variable OMP_THREAD_LIMIT. To disable multithreading, use OMP_THREAD_LIMIT=1.
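The "one single-threaded Tesseract process per CPU core" approach can also be scripted. The following is a minimal Python sketch, not an official tool. The helper names (build_command, single_thread_env, ocr_all) are made up for illustration, and it assumes a tesseract binary on the PATH that supports the pdf config file.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(image, lang="eng"):
    """Return the argument list for OCRing one image to a one-page PDF."""
    image = Path(image)
    # The output base is the image path without its extension;
    # Tesseract appends ".pdf" itself.
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang, "pdf"]


def single_thread_env():
    """Copy the current environment and limit Tesseract 4 to one thread."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def ocr_all(images, workers=None):
    """Run one Tesseract process per worker and return the exit codes."""
    workers = workers or os.cpu_count() or 1
    env = single_thread_env()

    def run(image):
        return subprocess.run(build_command(image), env=env).returncode

    # Threads are sufficient here: each worker only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, images))
```

Calling ocr_all(sorted(Path(".").glob("*.tif"))) would process every TIFF in the current directory. The resulting one-page PDFs can then be spliced together with QPDF or similar.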
Tesseract is a command line program, so you need to run it from the command line. If you need a program with a graphical interface, there are several available from the 3rdParty page.
- txt
- hocr
- tsv
- pdf with text layer only
Tesseract can produce plain text, PDF, and HTML output. Tesseract's standard output is a plain txt file (utf-8 encoded, with '\n' as end-of-line marker).
With the configfile 'hocr' tesseract will produce XHTML output compliant with the hOCR specification (the input image name must be ASCII if the operating system uses something other than utf-8 encoding for filenames - see issue 809 for some details).
With the configfile 'pdf' tesseract will produce a searchable PDF containing page images with a hidden, searchable text layer.
The hOCR output for the first line of eurotext.tif contains the following information:
<span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113; baseline 0.015 -18; …
bbox is the bounding box of the line in image coordinates (blue). The two numbers for the baseline are the slope (1st number) and constant term (2nd number) of a linear equation describing the baseline relative to the bottom left corner of the bounding box (red). The baseline crosses the y-axis at -18 and its slope angle is arctan(0.015) = 0.86°.
In general, the baseline is described by a polynomial of order n with the coefficients pn … p0, with n = 1 for a linear (i.e. straight) line (see hOCR spec).
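The baseline arithmetic can be checked with a few lines of Python. This is only a sketch: the function names are invented for illustration, and it assumes the polynomial offset is added to the bottom edge of the bounding box directly in image coordinates (y growing downward). Verify that assumption against the hOCR spec for your output.

```python
import math


def parse_title(title):
    """Extract the numeric bbox and baseline properties from an hOCR title."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields and fields[0] in ("bbox", "baseline"):
            props[fields[0]] = [float(v) for v in fields[1:]]
    return props


def baseline_y(props, x):
    """Image y coordinate of the baseline at image x coordinate x."""
    x0, _, _, y1 = props["bbox"]  # left, top, right, bottom
    dx = x - x0                   # distance from the bottom left corner
    offset = 0.0
    # Coefficients are ordered pn ... p0, so Horner's rule applies directly.
    for coefficient in props["baseline"]:
        offset = offset * dx + coefficient
    return y1 + offset


def slope_angle_degrees(props):
    """Slope angle at the left edge, from the first-order coefficient."""
    return math.degrees(math.atan(props["baseline"][-2]))


line = parse_title("bbox 105 66 823 113; baseline 0.015 -18")
print(baseline_y(line, 105))          # baseline height at the left edge of the box
print(slope_angle_degrees(line))      # roughly 0.86 degrees
```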
Run 'sudo ldconfig' after 'sudo make install'. See issue 621.
Please ensure there is only one installation of tesseract. For example, there are reported problems (in the case of a shared build) if tesseract 3.01 and 3.02 are installed on the same computer. See e.g. issue 793.
In the tesseract-ocr directory, run:
sudo make uninstall
If you installed training tools, also run:
sudo make training-uninstall
If you want to have several versions of tesseract (e.g. to compare OCR results), I would suggest compiling them from source (e.g. in /usr/src) and not installing them. To test a particular version, you can run it this way:
/usr/src/tesseract-3.03/api/tesseract eurotext.tif eurotext
/usr/src/tesseract-ocr.3.02/api/tesseract eurotext.tif eurotext
/usr/src/tesseract-3.03/api/tesseract is a shell wrapper script, and it will take care that the correct shared library is used (without installation...).
If you see this error, then you have a problem with your leptonica installation - e.g. support for your image format is missing. Usually this means the relevant image library was not installed properly during the leptonica build or there is some configure problem within leptonica.
You don't have language training data installed. Read the README.
If you get this error message when you run ./configure and your leptonica header files are located in /usr/local/include (e.g. you installed leptonica to /usr/local), then run:
LIBLEPT_HEADERSDIR=/usr/local/include ./configure
or:
CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure
I get this error message:
read_tif_image:Error:Illegal image format:Compression
Windows (Visual C++): Libtiff support can be added in either VC++6 or VC++Express with the following:
Go to the Windows download for libtiff and follow these steps:
Download and run the setup program.
Add the paths for include and library files in tools/options/directories
Add HAVE_LIBTIFF to the preprocessor definitions.
Add libtiff.lib to the list of libraries.
Rebuild.
Put libtiff3.dll in your path somewhere.
This is done by control panel/system/advanced/environment variables and adding c:/program files/gnuwin32/bin to PATH.
Keep your fingers crossed...
Non-Windows (and Cygwin): Install libtiff-dev. Procedure differs from OS to OS, but on many something like
sudo apt-get install libtiff-dev
or some variant thereof should do the trick, before running configure.
There have been several bug reports of blank or garbage output with color images, both with and without libtiff. Here is the most up-to-date information (last update 23 Sep 2008):
Without libtiff, Tesseract only reads uncompressed tiff files. Even then it won't read 32 bit tiff files correctly. Will be fixed in 2.04. (Meaning that it will correctly handle most image depths (except 16 bit) with libtiff.)
With libtiff, Tesseract reads compressed tiff files, but can't handle any color: 24 or 32 bit. It can only read 1 bit binary images or 8 bit greyscale. (No color maps!) Fixed in 2.04.
The API (TessBaseAPI) should be OK with 1, 8, 24 or 32 bit images.
Yes, with all versions 2.03 and later, as long as you have libtiff installed. See Compressed TIFF above.
This file is the single greatest cause of portability issues, because it is the interface to a viewer running in an external process. If you can get it to compile on your system, please report an issue logging what you had to change, but please only for the current version. If you can't get it to compile, you can define GRAPHICS_DISABLED in your compiler (for all the source) and it will comment out all the hard-to-compile code and disable the viewer functionality, which most people don't use anyway.
On Unix-like systems, the configure script can be instructed to disable graphics like this:
configure --disable-graphics
Due to an oversight, training tools do not ship with Ubuntu 14.04. Either wait for the next Ubuntu release, or use these commands.
sudo apt-get build-dep tesseract-ocr
sudo apt-get install devscripts
dget http://http.debian.net/debian/pool/main/t/tesseract/tesseract_3.03.03-1.dsc
tar xvzf tesseract_3.03.03.orig.tar.gz
cd tesseract-3.03
zcat ../tesseract_3.03.03-1.diff.gz | patch -p1
debuild -us -uc
cd ..
sudo dpkg -i *.deb
There are a variety of programs to help with this, see the AddOns page.
There is an upper limit of 30 bytes for the utf-8 representation of each recognizable unit (character/grapheme/glyph). It can't easily be increased. Check your box file for strangely long strings.
Use the digits config file like this:
tesseract imagename outputbase digits
Use
TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
BEFORE calling an Init function, or put this in a text file called tessdata/configs/digits:
tessedit_char_whitelist 0123456789
and then your command line becomes:
tesseract image.tif outputbase nobatch digits
Warning: In versions 2.03/2.04 (but not in version 3.00 and up), you must put nobatch before digits.
How do I add just one character or one font to my favourite language, without having to retrain from scratch?
It is not possible to add just one character or font. You need to retrain from scratch. See the Training Tesseract documentation.
Searchable PDF output is a standard feature as of Tesseract version 3.03. Use the pdf config file like this:
tesseract phototest.tif phototest pdf
There may be nothing wrong with the PDF itself, but its hidden, searchable text layer may not be understood by your PDF reader. For example, Preview.app in Mac OS X is well known for having problems like this, and might "see" only spaces and no text. Try using Adobe Acrobat Reader instead.
Let's say you have an amazing but slow multipage scanning device. It would be nice to OCR during scanning. In this example, the scanning program is sending image filenames to Tesseract as they are produced. Tesseract streams a searchable PDF to stdout. You will need Tesseract v3.04 or later for this feature.
scanimage --batch --batch-print | tesseract -c stream_filelist=true - - pdf > output.pdf
There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".
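As a worked example of the arithmetic above: a point is 1/72 inch, so 10pt text at 300dpi has an em height of 10 / 72 * 300, about 42 pixels. The sketch below assumes an x-height of 0.48 em. That ratio is not from Tesseract; it is chosen only because it reproduces the "about 20 pixels" figure, and real fonts vary considerably. The function names are hypothetical.

```python
def x_height_pixels(point_size, dpi, x_height_ratio=0.48):
    """Estimate the x-height in pixels for text of a given size and resolution.

    x_height_ratio is an assumed fraction of the em height; measure your
    own font if you need a reliable number.
    """
    em_pixels = point_size / 72.0 * dpi  # 72 points per inch
    return em_pixels * x_height_ratio


def scale_for_target(point_size, dpi, target_px=20.0, x_height_ratio=0.48):
    """Resize factor that would bring the estimated x-height to target_px."""
    return target_px / x_height_pixels(point_size, dpi, x_height_ratio)


print(x_height_pixels(10, 300))    # close to 20 pixels
print(scale_for_target(10, 150))   # a 150dpi scan would need roughly 2x upscaling
```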
Using LSTM there seems also to be a maximum x-height somewhere around 30 px. Above that, Tesseract doesn't produce accurate results. The legacy engine seems to be less prone to this (see https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ).
Read the ImproveQuality documentation.
Read the Training Tesseract documentation.
Use the combine_tessdata command. See the combine_tessdata manual page for details on its usage.
To add an extra word list, create a .user-words file as explained in tesseract(1).
If you want to replace the whole dictionary, you will need to unpack the .traineddata file, create a new word-dawg file, and then pack the files back into a .traineddata file. See TrainingTesseract for more details.
Easy: Replace tessdata/eng.user-words with your own word list, in the same format - UTF8 text, one word per line.
More difficult, but better for a large dictionary:
Replace tessdata/eng.word-dawg with one created from your own word list, using wordlist2dawg.
See the Training Tesseract documentation for details.
There is a memory problem with the 2.03 wordlist2dawg. If you have less than about 1GB of memory, your system grinds to a halt and it runs very slowly. Reduce both max_num_edges and reserved_edges by a factor of 10 at lines 39-40 of training/wordlist2dawg.cpp and rebuild.
If you successfully create a new dawg, and then it doesn't load, due to the error:
Error: trying to read a DAWG '%s' that contains %d edges while the maximum is %d.
Then try changing the value of MAX_NUM_EDGES_IN_SQUISHED_DAWG_FILE in dawg.h.
For tesseract-ocr < 3.01 try upping NON_WERD and GARBAGE_STRING in dict/permute.cpp to maybe 3 or even 5.
For tesseract-ocr >= 3.01 try increasing the variables language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word in a config file. By default they are 0.1 and 0.15 respectively.
For tesseract-ocr 2.0x: If the text fonts you are recognizing are significantly different from your training data, and you don't mind a slow-down, you could also try lowering ClassPrunerThreshold in classify/intmatcher.cpp to about 200 from 229. These measures should all improve the power of the dictionary to resolve words from non-words.
Of course any changes that up the power of the dictionary also up the ability to hallucinate dictionary words. If this is a problem, keep short words out of your dictionary, and don't add a vast list of words that are rarely used if they increase the number of ambiguities with more frequent words.
To go the opposite way and turn off the dictionary, you can either delete the appropriate dawg(s) from the traineddata file using combine_tessdata, or set the appropriate load_type_dawg parameter to false, but this must be done with a config file and the version of Init that takes a configs argument. It cannot be done by SetVariable, because the parameter is controlling initialization. type may be one of punc, system, number, bigram, freq, unambig.
Config is an overloaded word in tesseract. One meaning is a file of control parameters used for debugging or modifying its behaviour, such as tessdata/configs/segdemo.
The other meaning is used in training and in the classifier:
A config represents a (potentially) different shape of a character from a different font. The MAX_NUM_CONFIGS limit applies to the number of different files on the command line of mftraining containing samples of any one character, as each file is assumed to represent a different font. There is currently (2.03) a limit of 32 configs. You can get away with more than 32 files on the mftraining command line if not all the files contain all the characters.
Other ways to fix the problem:
If files contain very similar looking samples, then you can cat them together to make a single file to reduce the total number of files. DON'T do this if the characters in two files look very different.
Increase MAX_NUM_CONFIGS (in intproto.h). There are consequences: you will make inttemp files generated with a different value of MAX_NUM_CONFIGS unreadable. We are working towards overcoming this weakness for version 3.0. It will not be in 2.04 though. Also, classification will be slower and use more memory.
There are inconsistent results from tesseract when the same TessBaseAPI object is used for decoding multiple images
Try to clear the adaptive data with ClearAdaptiveClassifier() or turn off the adaptive classifier with config variables:
classify_enable_learning 0
classify_enable_adaptive_matcher 0
See also the discussion on the tesseract forum
Diacritics above and below the glyph are ignored/cause garbage output
Try increasing textord_min_linesize from its default value (1.25) to a higher value. Some users report good results with a value of 2.5.
You're looking at it. If things aren't clear, search on the Tesseract Google Group or ask us there. If you want to help us write more, please do, and post it to the group!
Periodically stable versions go to the downloads page. Between releases, and in particular, just before a new release, the latest code is available from git. You can find the source here: https://github.com/tesseract-ocr/tesseract.git where you can check it out either by command line, or by following the link to the howto on using various client programs and plugins.
actual_tessdata_num_entries_<= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file ..\ccutil\tessdatamanager.cpp, line 55_
If you get this error while running tesseract, please check that your traineddata matches your tesseract version. For example, you cannot use 3.01 traineddata with tesseract 3.00.
If you get the last_char == '\n' assert failure while running combine_tessdata:
This indicates that your lang.unicharambigs is missing an empty line at the end of the file.
This can happen if the number format ("locale" on Linux or "Regional and Language Options" in Windows) differs from the English format. The solution is to set the LC_NUMERIC locale to "C" (in code) before calling tesseract. In some cases you need to include locale.h:
#include <locale.h>
setlocale (LC_NUMERIC, "C");
Another option is to adjust your user locale/regional settings.
See also the information in issues 250 and 228.
(Where Y != X) There are 2 possibilities:
- X ~= Y, usually with X < Y: Usually caused by a failed training process. Look for FATALITY messages from the tr file generation process. It looks like the training process failed to pick up some samples of some characters, and they didn't make it into the inttemp file (in mftraining) because there was no entry in the tr file. There are bad problems with applybox that make this a problem for a lot of people. The character samples need to be spaced out.
- X is a wild number (very large, positive or negative) and Y is a sane number between 100 and a few thousand, depending on the language: Corrupt inttemp file, or (if you have NOT trained yourself) your hardware has a funny endian architecture that is not automatically detected. Big-endian or Little-endian 32 and 64 bit SHOULD work, but mixed endian (0x12345678 -> 0x56 0x78 0x12 0x34 or similar) will NOT work. Get a sensible hardware architecture, or retrain yourself. Then your inttemp will match the hardware.
Just increase MAX_NUM_CLASSES (in dict/matchdefs.h) for your build. See issues 670 and 743
To restore the old behaviour of writing to tesseract.log instead of writing to the console window, you need a text file that contains this:
debug_file tesseract.log
Call the file 'logfile' and put it in tessdata/configs/. Then add logfile to the end of your command line.
See issue 579. On Linux you can redirect stderr and stdout output to /dev/null. E.g.:
tesseract phototest.tif phototest 1>/dev/null 2>&1
With tesseract 3.02 you can use config "quiet". E.g.:
tesseract phototest.tif phototest quiet
Warning: Both options will cause you to not see the error message if there is one.
There are two options. If you would rather not get into programming, you can use Tesseract's hOCR output format (read the Tesseract manual page for details). If you are comfortable programming, use the Tesseract API.
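If you take the hOCR route, the word boxes and confidences can be pulled out with a short script. The sketch below uses only the Python standard library. It assumes each word is a span of class ocrx_word whose title carries "bbox x0 y0 x1 y1" and an "x_wconf" confidence; check the hOCR produced by your Tesseract version, since older releases may not emit x_wconf. The SAMPLE markup is invented for illustration and is not real Tesseract output.

```python
from html.parser import HTMLParser


def parse_word_title(title):
    """Pull the bounding box and word confidence out of an hOCR title."""
    info = {"bbox": None, "conf": None}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        if fields[0] == "bbox":
            info["bbox"] = tuple(int(v) for v in fields[1:5])
        elif fields[0] == "x_wconf":
            info["conf"] = float(fields[1])
    return info


class WordCollector(HTMLParser):
    """Collect text, bbox and confidence for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, if any
        self._depth = 0       # nesting depth of tags inside the word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
        elif tag == "span" and attrs.get("class") == "ocrx_word":
            self._current = {"text": "", **parse_word_title(attrs.get("title", ""))}
            self._depth = 0

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
        else:
            self.words.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


# Invented sample markup, for illustration only.
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 105 66 178 97; x_wconf 93'>The</span> "
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 205 67 290 98; x_wconf 88'><strong>quick</strong></span>"
)

collector = WordCollector()
collector.feed(SAMPLE)
for word in collector.words:
    print(word["text"], word["bbox"], word["conf"])
```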
- Character - Compute a distance measure between 0 and 1 of the character from a training sample:
- certainty = -20 * distance
- rating = distance * outline length
- Word:
- certainty = min over word of character certainty.
- rating = sum over word of character rating.
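The formulas above can be written out as a small Python sketch. It only restates the arithmetic from the list; it is not Tesseract's actual code, and the function names and the input values in the example are made up.

```python
def char_certainty(distance):
    """Certainty of one character from its distance (0..1) to a training sample."""
    return -20.0 * distance


def char_rating(distance, outline_length):
    """Rating of one character: distance weighted by outline length."""
    return distance * outline_length


def word_scores(chars):
    """Combine per-character values into (certainty, rating) for a word.

    chars is a non-empty list of (distance, outline_length) pairs.
    """
    certainty = min(char_certainty(d) for d, _ in chars)
    rating = sum(char_rating(d, length) for d, length in chars)
    return certainty, rating


# Two hypothetical characters: the worse match (distance 0.25) sets the certainty.
print(word_scores([(0.1, 50), (0.25, 40)]))
```

Because certainty is a minimum over negative values, a single badly matched character drags down the whole word, while rating accumulates over every character.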
You can, but it won't work very well, as Tesseract is designed for printed text. Take a look at the Lipi Toolkit project instead.
No. Tesseract is for text recognition.
(information from issue tracker with Status: Look-here-for-help.)
See the suggestions in issue 42.
Have a look at Addon documentation or see the comments in issue 88.
See the suggestions in issue 44.
error LNK2001: unresolved external symbol "public: static char cdecl TessBaseAPI::TesseractRect(unsigned char const,int,int,int,int,int,int)" (?TesseractRect@TessBaseAPI@@SAPADPBEHHHHHH@Z)
If you are using version 2.04 and VC++ 6, add a #define TESSDLL_IMPORTS before you include baseapi.h (see issue 297).
If you get this error, try using page segmentation mode 5 or 6. E.g.:
tesseract input_image output -psm 6 batch.nochop makebox
Or have a look at the other suggestions in issue 471.
See suggestions on how to improve the input image in issue 488.
This means that the font name is not in font_properties, or that font_properties does not meet the requirements.
Try searching the forum: http://groups.google.com/group/tesseract-ocr as your question may have come up before even if it is not listed here.