Wednesday, April 18, 2012

configuring soundex in elasticsearch

elasticsearch is trivially easy to set up and start using with reasonable default settings. As with any framework, though, deviating from those defaults increases the challenge. Phonetic searches like soundex are supported by elasticsearch, but not out of the box.

What is soundex? It classifies words according to how they sound, so that similar-sounding words will match each other in a search. For instance, "frend" will find "friend" because they both translate to the soundex secret code F653.
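To make the secret code a little less secret, here is a rough sketch of the classic soundex algorithm in Python. This is a simplified illustration only, not what the plugin actually runs (the plugin delegates to an established encoder implementation that handles more edge cases):

```python
def soundex(word):
    # Similar-sounding consonants share a digit; vowels, h, w, y get none.
    groups = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
    codes = {c: str(i) for i, g in enumerate(groups, start=1) for c in g}
    word = word.lower()
    out = word[0].upper()              # the first letter is kept as-is
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:      # collapse repeats of the same sound
            out += code
        if ch not in "hw":             # h and w don't separate sounds
            prev = code
    return (out + "000")[:4]           # pad/truncate to letter + 3 digits

print(soundex("frend"), soundex("friend"))  # F653 F653
```

Because "frend" and "friend" encode to the same four characters, comparing codes instead of spellings makes the misspelling match.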

This post describes in explicit detail how to configure a soundex search on specific fields in an index on elasticsearch v0.19.2 on a Mac. It includes some explanation of what each step means.

If you haven't installed elasticsearch yet, do that.

Part 1: Install the plugin

The phonetic plugin (github) is trivial to install. Go to the directory where elasticsearch is installed, and:
bin/plugin -install elasticsearch/elasticsearch-analysis-phonetic/1.1.0

The rest of the configuration is not so easy. What the plugin gave us was a new type of filter. That filter knows how to turn words into soundex secret codes. Our next objective is to teach elasticsearch how to perform this translation on searchable data.


Part 2: Configure the analyzer

The phonetic filter will be part of an analyzer. What is an analyzer?
When we tell elasticsearch about searchable data (aka "insert a document into an index"), the searchable fields are broken down into bits (tokens), which are then stored in the index. Suppose a document is inserted whose description is "bunch of words." That string is divided into word tokens by the tokenizer. The standard filter does some light cleanup (note: despite what you might expect, it does not discard stop words like "of" -- that takes a separate stop filter, which the configuration below doesn't include). The lowercase filter downcases everything, and then the soundex filter translates the tokens into soundex secret codes. Finally, these tokens make it into the index, where they are stored.

When a search string comes into elasticsearch, it goes through the same analyzer, where it is broken into tokens and those tokens are filtered and translated in the same way. The search tokens are then compared with the tokens in the index, and matches are located. Putting the search string through the same analyzer as the indexed data is important: it makes the search tokens correspond to the indexed tokens.
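The whole round trip can be sketched in a few lines of Python. This is a conceptual model of the custom analyzer chain configured in the next section -- a crude regex standing in for the standard tokenizer, and a simplified soundex encoder standing in for the plugin's filter -- not the real implementation:

```python
import re

def soundex(word):
    # Simplified soundex: similar consonants share a digit.
    groups = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
    codes = {c: str(i) for i, g in enumerate(groups, start=1) for c in g}
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":
            prev = code
    return (out + "000")[:4]

def analyze(text):
    tokens = re.findall(r"\w+", text)      # stand-in for the standard tokenizer
    tokens = [t.lower() for t in tokens]   # lowercase filter
    return [soundex(t) for t in tokens]    # soundex filter (replace : true)

print(analyze("bunch of words"))  # ['B520', 'O100', 'W632']
```

Note that "of" still produces a token (O100): nothing in this chain removes stop words, which is why the analyzer test later in this post shows more than two codes.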

What does this mean for us? To add the soundex filter to our data processing stream, we need to configure a custom analyzer. Put this into config/elasticsearch.yml:
index :
    analysis :
        analyzer :
            search_soundex :
                type : custom
                tokenizer : standard
                filter : [standard, lowercase, soundex_filter]
        filter :
            soundex_filter :
                type : phonetic
                encoder : soundex
                replace : true

Watch out for:

  • no tabs. YAML hates tabs, as any indentation-sensitive format must. But you won't get the error you expect, because elasticsearch translates each tab into a single space. You might get an error like this: "ParserException[while parsing a block mapping; expected <block end>, but found BlockMappingStart]" because elasticsearch has inconveniently altered your indentation. 
  • No other non-space whitespace. We copied and pasted from a web page (like this one!) and got nonbreaking spaces in our file, and the result was that the analyzer was not found. No error loading the index, only "analyzer search_soundex not found" when we tried to put an item into our index.
  • If you add a second analyzer or filter, integrate it into this section; adding a new, similar block appears to overwrite the first one. (This was only my experience.)
  • There are some examples out there of adding the analyzer definition within the index creation configuration. I never got that to work.

Some interesting bits:

search_soundex is the name of the analyzer. Call it what you want, but remember it for the next section.
replace : true tells our filter to discard the original tokens, passing only the secret soundex codes onward to the index. If you want to do both exact-match and phonetic-match searches on your data, you may need to set this to false. However, replace : false bit us in the butt later (see Part 4), so I don't recommend it.
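For the record, keeping the original tokens alongside the soundex codes is just a matter of flipping that flag in the filter definition above (I haven't tested this variant end to end):

```
        filter :
            soundex_filter :
                type : phonetic
                encoder : soundex
                replace : false
```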

Test your analyzer:

First, restart elasticsearch to pick up your configuration changes. Shut it down:
  curl localhost:9200/_shutdown -XPOST
then start it back up again (bin/elasticsearch from the install directory).
Second, you'll need an index -- any index. If you don't have one yet, do this:
  curl localhost:9200/anyindexname -XPUT
Third, use the _analyze API to see what happens to an input string when it goes through your analyzer.
  curl "localhost:9200/anyindexname/_analyze?analyzer=search_soundex&pretty=true" -d 'bunch of words'
This will give you some JSON output. The interesting bits are labeled "token" -- these are the tokens that come out of your analyzer. You should see B520 and W632 among them.

Part 3: Configure field mappings

Next, we need to tell elasticsearch to use this analyzer, which means mapping fields on an index. Elasticsearch normally creates these mappings dynamically, but for soundex to work, we have to set them up ourselves.
This must be done at index creation time, as far as I can determine. This means it is necessary to delete the index and recreate it for each experiment. For static configuration, these settings can be placed in a template file, but that's outside the scope of this post. I'm going to show how to create the index through the REST API. It should be possible to get these mappings set up on index creation with the Java API, but I had trouble with that.

If your index already exists, then delete it:
  curl localhost:9200/anyindexname -XDELETE
Create your index:
curl localhost:9200/anyindexname -XPUT -d '
{
  "mappings" : {
    "yourdocumenttype" : {
      "properties" : {
        "description" : { "type" : "string", "index" : "analyzed", "analyzer" : "search_soundex" }
      }
    }
  }
}'

Explanation of the interesting bits:

  • -XPUT tells curl to use the PUT http command. Everything after -d is passed as the data for the put; this is the index configuration.
  • anyindexname is the name of your index; call it whatever you like.
  • yourdocumenttype is the type of document you're going to insert. Call it whatever.
  • description is the name of the field that is going to be searched phonetically.
  • The rest of it is structure.

Test your mappings

Run
  curl localhost:9200/anyindexname/_mapping
You should see your configuration come out, including the analyzer. If not, bang your head against the wall.

Intermission: insert some data

It's time to populate the index. Nothing about inserting documents changes when soundex is involved. If you're following along and want to throw in some test data, try this:
  curl localhost:9200/anyindexname/yourdocumenttype/1 -XPOST -d '{ "description" : "bunch of words" }'
To see all the data in your index, here's a trick. An explanation of facets is outside the scope of this post, but this will retrieve everything in the index and count all the tokens in all the descriptions.
  curl localhost:9200/anyindexname/_search?pretty=true -d '{ "query" : { "match_all" : {} }, "facets" : { "tag" : { "terms" : { "field" : "description" } } } }'
Look for the "term" items, and see the soundex codes.

Part 4: Set up the query

When searching this index for data in this field, it is important that the search string be passed through the same analyzer. This happens by default as long as you specify the field name; that way, secret soundex codes are compared against secret soundex codes and all is well.

REST API:
curl localhost:9200/anyindexname/yourdocumenttype/_search -d '
{ "query" :
   { "text" :
     { "description" :
        { "query" : "werds bunk" }
     }
   }
}'

Java API:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import static org.elasticsearch.index.query.QueryBuilders.textQuery;
import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
        .put("cluster.name", "elasticsearch_jessitron").build())
        .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

SearchRequest request = new SearchRequestBuilder(client)
        .setQuery(textQuery("description", "werds bunk"))
        .setIndices("anyindexname")
        .setTypes("yourdocumenttype")
        .request();
SearchResponse searchResponse = client.search(request).actionGet();

assertThat(searchResponse.getHits().totalHits(), is(1L));

The above query will find any document with either a token that sounds like "words" or a token that sounds like "bunch." You can change it to require that the description contain both "words" and "bunch" by telling the text query to use AND instead of OR. (If you're going to do this, then be sure you don't have replace : false set on your soundex filter. You won't find anything when your search terms aren't exact.)

REST API:
curl localhost:9200/anyindexname/yourdocumenttype/_search -d '
{ "query" :
   { "text" :
     { "description" :
        { "query" : "werds bunk",
          "operator" : "and" }
     }
   }
}'

Java API:

TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
        .put("cluster.name", "elasticsearch_jessitron").build())
        .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

SearchRequest request = new SearchRequestBuilder(client)
        .setQuery(textQuery("description", "werds bunk")
                .operator(TextQueryBuilder.Operator.AND))
        .setIndices("anyindexname")
        .setTypes("yourdocumenttype")
        .request();
SearchResponse searchResponse = client.search(request).actionGet();

assertThat(searchResponse.getHits().totalHits(), is(1L));

Watch out for:

It is tempting to use the "_all" field in the searches, to base results on all fields at once. "_all" is not a loop through all fields so much as its own special field: it has its own analyzer both for indexing and searching. It is configured separately, so it won't do soundex unless you set that up specifically.
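If you do want soundex behavior on "_all", I believe (untested sketch -- check the mapping documentation for your version) the _all field accepts its own analyzer in the mapping, alongside the per-field properties:

```
curl localhost:9200/anyindexname -XPUT -d '
{
  "mappings" : {
    "yourdocumenttype" : {
      "_all" : { "analyzer" : "search_soundex" },
      "properties" : {
        "description" : { "type" : "string", "index" : "analyzed", "analyzer" : "search_soundex" }
      }
    }
  }
}'
```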

Part 5: Deep breath

That was a lot more work than most stuff in elasticsearch. Having done this, you now know how to map fields and search on them. You know how to define analyzers and filters by field. This will stand you in good stead for further configuration. For instance, adding geographic search is easy after you've done soundex.

It is possible to use different analyzers for search and indexing. This can be specified in index mappings or overridden on the individual search. Think about this when you start doing something weird, but make sure you understand analyzers and filters more deeply than this article goes.
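As a sketch of what a split mapping might look like (hedged: I haven't verified this combination myself; the field mapping reportedly accepts index_analyzer and search_analyzer in place of a single analyzer):

```
"properties" : {
  "description" : {
    "type" : "string",
    "index_analyzer" : "search_soundex",
    "search_analyzer" : "standard"
  }
}
```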

3 comments:

  1. I didn't know about the soundex plugin. That's something we could use. Thanks for figuring out the tough bits. :) This is a good post.

  2. Thank you for the article, but I am stuck with an error

    {"error":"MapperParsingException[mapping [anyindexname]]; nested: MapperParsingException[Analyzer [search_soundex] not found for field [description]]; ","status":400}

    Can anyone help me overcome this issue?

    1. Sorry, the error actually looks like this (with yourdocumenttype):

      {"error":"MapperParsingException[mapping [yourdocumenttype]]; nested: MapperParsingException[Analyzer [search_soundex] not found for field [description]]; ","status":400}
