Solr full text search for Dovecot

Index all emails!

Dovecot is a great POP3/IMAP mail server, but its internal search mechanism falls short when dealing with large mailboxes. Search is not only slow, but also extremely resource intensive in terms of both CPU and disk I/O.

Offloading mail indexing and search to Apache Solr is not only recommended, but a must in such scenarios.

In this tutorial, I’ll cover Apache Solr 7.5.0 installation and integration with Dovecot 2.2. I won’t go into Dovecot installation and configuration details.

Let’s start with the Solr installation. Make sure that you have Java 8 installed on your system. It doesn’t matter if it’s Oracle Java or OpenJDK. All commands below should be executed as the root user (or via sudo).
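If you’re not sure which Java version is on the box, a quick check can save a failed install. This is only a minimal sketch: java_major and installed_java_version are helper names made up for illustration, and the parsing assumes the usual `java -version` output, where the version appears as a quoted string such as "1.8.0_191".

```shell
# Hypothetical helper: extract the major Java version from a version string.
java_major() {
  case "$1" in
    # Legacy numbering scheme, e.g. 1.8.0_191 means Java 8
    1.*) echo "$1" | cut -d. -f2 ;;
    # Newer numbering scheme, e.g. 11.0.2 means Java 11
    *)   echo "$1" | cut -d. -f1 | cut -d- -f1 | cut -d+ -f1 ;;
  esac
}

# Hypothetical helper: read the quoted version string that the java binary
# on PATH reports (java prints it on stderr).
installed_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}'
}
```

Running `java_major "$(installed_java_version)"` prints the major version of whatever java binary is first on your PATH.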

# useradd -m -d /var/solr solr
# cd /opt/
# wget https://www-eu.apache.org/dist/lucene/solr/7.5.0/solr-7.5.0.tgz
# tar xzf solr-7.5.0.tgz solr-7.5.0/bin/install_solr_service.sh --strip-components=2
# bash ./install_solr_service.sh solr-7.5.0.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
# chkconfig solr on
# service solr start

So, to recap, the above commands create a new Unix user solr with its home directory in /var/solr/. The Solr installation script is extracted from the archive and executed with a couple of arguments that define where Solr should be installed (-i), where its data directory should be (-d), which user the Solr service should run as (-u), what the service is called (-s) and which port Solr should bind to (-p). The installation script automatically installs an init script in /etc/init.d/solr, which means that we can easily manage Solr as a service.
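To confirm that the installer left things where you expect, you can check for a handful of paths. The sketch below is an assumption-laden sanity check, not part of Solr itself: check_solr_layout is a made-up helper, and the path list reflects the arguments used above plus the /opt/solr symlink and /etc/default/solr.in.sh file that the installer normally creates. Adjust it if your layout differs.

```shell
# Hypothetical helper: report expected Solr paths that are missing.
# An optional first argument is a root prefix (handy for chroots or testing).
check_solr_layout() {
  root="${1:-}"
  missing=0
  for p in /opt/solr /var/solr/data /etc/init.d/solr /etc/default/solr.in.sh; do
    if [ ! -e "$root$p" ]; then
      echo "missing: $p"
      missing=1
    fi
  done
  # Exit status 0 means every path was found
  return "$missing"
}
```

Calling `check_solr_layout` without arguments checks the live system and prints nothing when all paths exist.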

Now, let’s create a new Solr core for Dovecot.

# su - solr -c "/opt/solr/bin/solr create_core -c dovecot"
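If you want to verify that the core really exists, Solr’s CoreAdmin STATUS action can tell you. Here is a rough sketch, assuming Solr listens on localhost:8983 as configured above; core_status_url and core_exists are illustrative names, and the check relies on my understanding that an existing core reports an instanceDir in its status entry while a missing one comes back empty.

```shell
# Hypothetical helper: build the CoreAdmin STATUS URL for a core.
# $1 = core name, $2 = optional Solr base URL
core_status_url() {
  printf '%s/admin/cores?action=STATUS&core=%s&wt=json' \
    "${2:-http://localhost:8983/solr}" "$1"
}

# Hypothetical helper: succeed if Solr reports an instanceDir for the core.
core_exists() {
  curl -fsS "$(core_status_url "$1")" | grep -q '"instanceDir"'
}
```

For example, `core_exists dovecot && echo "core found"` should print the message once the create_core command has succeeded.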

Next, open /var/solr/data/dovecot/conf/solrconfig.xml and uncomment the following section:

<queryResponseWriter name="xml"
                     default="true"
                     class="solr.XMLResponseWriter" />

With solrconfig.xml sorted, let’s remove managed-schema…

# rm -f /var/solr/data/dovecot/conf/managed-schema

…and put Dovecot’s schema in place at /var/solr/data/dovecot/conf/schema.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<!--
For fts-solr:
This is the Solr schema file, place it into solr/conf/schema.xml. You may
want to modify the tokenizers and filters.
-->
<schema name="dovecot" version="1.5">
    <!-- IMAP has 32bit unsigned ints but java ints are signed, so use longs -->
    <fieldType name="string" class="solr.StrField" />
    <fieldType name="long" class="solr.TrieLongField" />
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
      </analyzer>
    </fieldType>
    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
    <!--
      Numeric field types that index values using KD-trees.
      Point fields don't support FieldCache, so they must have docValues="true" if needed for sorting, faceting, functions, etc.
    -->
    <fieldType name="pint" class="solr.IntPointField" docValues="true"/>
    <fieldType name="pfloat" class="solr.FloatPointField" docValues="true"/>
    <fieldType name="plong" class="solr.LongPointField" docValues="true"/>
    <fieldType name="pdouble" class="solr.DoublePointField" docValues="true"/>
    <fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true"/>
    <fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
    <fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true"/>
    <fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true"/>
    <!-- The format for this date field is of the form 1995-12-31T23:59:59Z, and
         is a more restricted form of the canonical representation of dateTime
         http://www.w3.org/TR/xmlschema-2/#dateTime
         The trailing "Z" designates UTC time and is mandatory.
         Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
         All other components are mandatory.
         Expressions can also be used to denote calculations that should be
         performed relative to "NOW" to determine the value, ie...
               NOW/HOUR
                  ... Round to the start of the current hour
               NOW-1DAY
                  ... Exactly 1 day prior to now
               NOW/DAY+6MONTHS+3DAYS
                  ... 6 months and 3 days in the future from the start of
                      the current day
      -->
    <!-- KD-tree versions of date fields -->
    <fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
    <fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>
    <!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
    <fieldType name="binary" class="solr.BinaryField"/>
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
   <field name="id" type="string" indexed="true" stored="true" required="true" />
   <field name="uid" type="long" indexed="true" stored="true" required="true" />
   <field name="box" type="string" indexed="true" stored="true" required="true" />
   <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
   <field name="user" type="string" indexed="true" stored="true" required="true" />
   <field name="hdr" type="text" indexed="true" stored="false" />
   <field name="body" type="text" indexed="true" stored="false" />
   <field name="from" type="text" indexed="true" stored="false" />
   <field name="to" type="text" indexed="true" stored="false" />
   <field name="cc" type="text" indexed="true" stored="false" />
   <field name="bcc" type="text" indexed="true" stored="false" />
   <field name="subject" type="text" indexed="true" stored="false" />
   <!-- Used by Solr internally: -->
   <field name="_version_" type="long" indexed="true" stored="true"/>
 <uniqueKey>id</uniqueKey>
</schema>
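Since a copy-paste slip in the schema can be hard to spot, it may be worth checking that the file declares every field listed above before restarting Solr. The following is a minimal sketch: check_schema_fields is a made-up helper, the field list is simply taken from the schema above, and the grep only confirms that a `<field name="...">` declaration is present, not that its type is right.

```shell
# Hypothetical helper: verify that a schema file declares the mail fields
# from the schema above. $1 = path to schema.xml
check_schema_fields() {
  schema="$1"
  rc=0
  for f in id uid box user hdr body from to cc bcc subject; do
    # The trailing space keeps "cc" from matching inside other attributes
    if ! grep -q "<field name=\"$f\" " "$schema"; then
      echo "missing field: $f"
      rc=1
    fi
  done
  return "$rc"
}
```

Running `check_schema_fields /var/solr/data/dovecot/conf/schema.xml` prints nothing and returns 0 when all fields are declared.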

At this point, we have the Solr configuration in place, so let’s restart the Solr service.

# service solr restart

In most Linux distributions, Dovecot comes packaged with the fts_solr plugin, so it can be activated by editing a single line in /etc/dovecot/conf.d/10-mail.conf:

mail_plugins = fts fts_solr

Next, let’s instruct Dovecot where to find Solr. Open /etc/dovecot/conf.d/90-plugin.conf and add the following configuration block:

plugin {
  fts_autoindex = yes
  fts = solr
  fts_solr = url=http://localhost:8983/solr/dovecot/
}
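Before restarting anything, you can ask Dovecot what it actually parsed. `doveconf -n` prints the non-default settings, so the plugin line and the fts settings should show up there. The sketch below filters that output; has_fts_plugins is an illustrative helper of mine, and it only looks at the top-level mail_plugins line, not at per-protocol overrides.

```shell
# Hypothetical helper: read `doveconf -n` style output on stdin and succeed
# only if the top-level mail_plugins setting lists both fts and fts_solr.
has_fts_plugins() {
  awk -F'= *' '
    /^mail_plugins *=/ {
      n = split($2, p, " ")
      for (i = 1; i <= n; i++) seen[p[i]] = 1
    }
    END {
      if (("fts" in seen) && ("fts_solr" in seen)) exit 0
      exit 1
    }'
}
```

Typical usage would be `doveconf -n | has_fts_plugins && echo "fts plugins enabled"`, followed by `doveconf -n | grep fts` to eyeball the plugin block.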

Finally, restart the Dovecot service and manually trigger initial indexing of the desired mailbox.

# service dovecot restart
# doveadm -v index -u user@example.com '*'

The last command will index all IMAP folders in the user@example.com mailbox (a placeholder address, substitute your own). If you want to index a specific IMAP folder, simply replace the last argument (the asterisk) with the folder name.
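To see whether indexing actually put documents into Solr, you can count the documents stored for a mailbox. This is a hedged sketch rather than an official procedure: num_found and indexed_count are made-up helpers, the query assumes Dovecot writes the account name into the `user` field of the schema above, and `wt=json` is passed explicitly because solrconfig.xml now defaults to XML responses.

```shell
# Hypothetical helper: extract numFound from a Solr JSON response on stdin.
num_found() {
  sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: ask Solr how many documents carry the given user value.
# $1 = mailbox address used with doveadm
indexed_count() {
  curl -fsS -G "http://localhost:8983/solr/dovecot/select" \
    --data-urlencode "q=user:\"$1\"" \
    --data-urlencode "rows=0" \
    --data-urlencode "wt=json" | num_found
}
```

Calling indexed_count with the same address you passed to doveadm should print a number greater than zero once indexing has finished.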

Dovecot 2.2.3 and newer perform only soft commits to the Solr index to improve performance. You must run a hard commit once in a while, or Solr’s transaction logs will keep growing. It’s recommended to configure the following cron jobs (preferably under the solr user, just to keep things well organized):

# Run optimize every day at midnight
0 0 * * * /usr/bin/curl "http://127.0.0.1:8983/solr/dovecot/update?optimize=true"
# Run commit 10 minutes into every hour
10 * * * * /usr/bin/curl "http://127.0.0.1:8983/solr/dovecot/update?commit=true"
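Plain curl calls in cron fail silently if Solr is down or returns an error. If you’d rather notice, a small wrapper can check the response. The following is a sketch under assumptions: solr_status_ok and commit_index are names I invented, and the check relies on Solr’s XML update response carrying `<int name="status">0</int>` in its header on success.

```shell
# Hypothetical helper: succeed if a Solr XML response on stdin reports status 0.
solr_status_ok() {
  grep -q '<int name="status">0</int>'
}

# Hypothetical helper: issue a hard commit and fail loudly on any problem.
# -f makes curl return an error on HTTP failures, -sS hides progress output
# but keeps error messages.
commit_index() {
  if ! curl -fsS "http://127.0.0.1:8983/solr/dovecot/update?commit=true&wt=xml" | solr_status_ok; then
    echo "Solr commit failed" >&2
    return 1
  fi
}
```

Saved as a script that ends with a call to commit_index, this could replace the bare curl line in the hourly cron entry, so that cron mails you only when something goes wrong.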