Installing Tesseract OCR 4.0 on CentOS 6

xkcd: compiling - https://xkcd.com/303/

The Tesseract OCR package is available for CentOS 6 via the EPEL yum repository, but at the time of writing, the latest Tesseract version in EPEL is 3.04.

Installing Tesseract 4.0 from source is possible, but it takes some extra effort: CentOS 6 ships neither Leptonica 1.77, which Tesseract 4.0 requires, nor the autoconf-archive package (which was orphaned in EPEL), nor a GCC that supports C++11.

So far, things don’t look promising, but rest assured, it’s not the end of the world 😄

First, install the development tools and a couple of prerequisites for Tesseract:

# yum -y groupinstall "development tools"
# yum -y install libpng-devel libtiff-devel libjpeg-devel

Next, install the CentOS Software Collections (SCL) yum repository and a newer version of GCC. Don’t worry, GCC from the SCL repository won’t interfere with GCC from the base repository.

# yum -y install centos-release-scl
# yum -y install devtoolset-7-gcc-c++

To use the newly installed GCC, source the devtoolset-7 enable script. It only affects the current shell session, so source it again in any new shell before you build:

# source /opt/rh/devtoolset-7/enable
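
Before going further, it can be worth confirming that the compiler on your PATH really is the new one. The snippet below is a minimal sketch, not part of the original procedure: version_ge is a hypothetical helper name, it relies on GNU sort -V, and the 4.8.1 threshold is my assumption, based on that being the first GCC release with complete C++11 support.

```shell
# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V)
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only run the check when g++ is actually on PATH
if command -v g++ >/dev/null 2>&1; then
    gxx_version=$(g++ -dumpversion)
    # 4.8.1 is an assumed threshold for complete C++11 support
    if version_ge "$gxx_version" 4.8.1; then
        echo "g++ $gxx_version: new enough for C++11"
    else
        echo "g++ $gxx_version: too old, source the devtoolset-7 enable script first"
    fi
fi
```

If the second message appears, the shell is still picking up GCC from the base repository.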

Next up for installation is autoconf-archive, a collection of re-usable Autoconf macros:

# cd /usr/src/
# wget http://ftpmirror.gnu.org/autoconf-archive/autoconf-archive-2019.01.06.tar.xz
# tar xvvfJ autoconf-archive-2019.01.06.tar.xz
# cd autoconf-archive-2019.01.06/
# ./configure --prefix=/usr
# make
# make install
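
The --prefix=/usr above places the macros in /usr/share/aclocal, which is where the system aclocal normally looks. A quick way to confirm they landed there is sketched below. has_macro is a hypothetical helper, and ax_check_compile_flag is just one macro from the archive picked as a probe, so treat both as assumptions.

```shell
# has_macro DIR NAME: succeeds when DIR contains the macro file NAME.m4
has_macro() {
    [ -f "$1/$2.m4" ]
}

# /usr/share/aclocal is where --prefix=/usr is expected to put the macros
if has_macro /usr/share/aclocal ax_check_compile_flag; then
    echo "autoconf-archive macros found in /usr/share/aclocal"
else
    echo "ax_check_compile_flag.m4 not found in /usr/share/aclocal"
fi
```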

Now we can move on to the Leptonica installation. Tesseract 4.0 requires Leptonica 1.77 or newer.

# cd /usr/src/
# wget http://leptonica.org/source/leptonica-1.77.0.tar.gz
# tar xvvfz leptonica-1.77.0.tar.gz
# cd leptonica-1.77.0/
# ./configure --prefix=/usr/local/
# make
# make install
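
Tesseract's configure script locates Leptonica through pkg-config, so it helps to check that the fresh install is visible before building anything else. This is a rough sketch: lept_version is a hypothetical helper, and it assumes Leptonica registered itself under the module name lept in /usr/local/lib/pkgconfig.

```shell
# lept_version: print the Leptonica version pkg-config can see, or "not found"
lept_version() {
    # Prepend the /usr/local pkgconfig directory, keeping any existing value
    PKG_CONFIG_PATH=/usr/local/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH} \
        pkg-config --modversion lept 2>/dev/null || echo "not found"
}

lept_version
```

If this prints "not found", the Tesseract configure step is likely to fail on the Leptonica check as well.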

At this point, all system requirements are satisfied. We can finally install Tesseract OCR:

# cd /usr/src/
# wget https://github.com/tesseract-ocr/tesseract/archive/4.0.0.tar.gz -O tesseract-4.0.0.tar.gz
# tar xvvfz tesseract-4.0.0.tar.gz
# cd tesseract-4.0.0
# export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
# ./autogen.sh
# ./configure --prefix=/usr/local/ --with-extra-libraries=/usr/local/lib/
# make install

There you go. You now have Tesseract 4.0 on CentOS 6.

# tesseract --version
tesseract 4.0.0
leptonica-1.77.0
libjpeg 6b (libjpeg-turbo 1.2.1) : libpng 1.2.49 : libtiff 3.9.4 : zlib 1.2.3
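
If you want to run the same check from a provisioning script, the first line of that output is easy to parse. The helper below is a minimal sketch (parse_tesseract_version is a made-up name) and assumes the output keeps the "tesseract X.Y.Z" first-line format shown above.

```shell
# parse_tesseract_version: read `tesseract --version` output on stdin
# and print the version number from its first line
parse_tesseract_version() {
    awk 'NR == 1 && $1 == "tesseract" { print $2 }'
}

# Run against the real binary only when it is on PATH
if command -v tesseract >/dev/null 2>&1; then
    tesseract --version 2>&1 | parse_tesseract_version
fi
```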

If you get a -bash: tesseract: command not found error, /usr/local/bin is most likely missing from your $PATH. Fix that by adding the following line to your ~/.bash_profile (or by appending /usr/local/bin to the PATH configuration already there):

export PATH="$PATH:/usr/local/bin"
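
One small refinement, in case your profile gets sourced more than once: the plain export adds a duplicate entry each time. The variant below only appends the directory when it is missing. path_append is a hypothetical helper name, not something bash provides.

```shell
# path_append DIR: append DIR to PATH unless it is already listed
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$PATH:$1" ;;     # missing, so add it at the end
    esac
}

path_append /usr/local/bin
export PATH
```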