Site search sanity

Search is a component of most web sites. Therefore, it is a problem solved many times before. The solution of choice (at least in the Java sphere) is Lucene. Insert documents in an index, build queries to find the ones you want. Lucene is a library, so there are a bunch of other tools that wrap Lucene for easier interaction.

Use Solr, use elasticsearch, use Lucene directly – you still have to figure out two things: get your documents into the index, and get the relevant ones out. This post is about getting them out in an ordinary site-search sort of scenario.

For our purposes here, the documents have been indexed with default elasticsearch mappings. This means their fields have passed through the default analyzer, which breaks them down into terms (words), lowercases them, and throws out some extremely common words (stop words). The search text will go through the same default analyzer so that we’re comparing apples to apples, and not Apples or APPLES or “apples and oranges.”
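To make that concrete, here’s a toy sketch in Java of what an analyzer of this shape does. This is an illustration only, not elasticsearch’s real implementation — the real tokenizer and stop-word list are more sophisticated, and the stop words below are just a hypothetical sample:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ToyAnalyzer {
    // A tiny sample of common English stop words, for illustration only.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the", "to");

    // Break text into terms, lowercase them, and drop stop words --
    // roughly what the default analyzer does to each field.
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(String::toLowerCase)
                .filter(w -> !STOP_WORDS.contains(w))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Both the indexed field and the search text go through this,
        // so "Apples and Oranges" and "apples oranges" produce the same terms.
        System.out.println(analyze("Apples and Oranges"));  // [apples, oranges]
    }
}
```

Because the same function runs at index time and at search time, the comparison really is apples to apples.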

What does a reasonable Lucene-style query for site search look like? There’s documentation out there about the query API, all about what to type when you know what kind of query you want – but what kind of query do I want?

Our indexed documents each have two fields: name and description. The search should match some text against both fields, handle some misspellings or typos, and emphasize exact matches in the name field. This seems pretty straightforward, but it isn’t trivial: it involves a compound query combining an exact phrase match, exact term matches, and fuzzy term matches.


The outer query is a BoolQuery. A BoolQuery is more complex than AND/OR logic, because there’s more to a query than a “yes” or “no” on each document — there is an ordering to the results.

There are three ways to construct a BoolQuery:

  • only “must” components. This is like a big AND statement: each record returned matches every “must” query component.
  • only “should” components. This is a lot like an OR statement; each record returned matches at least one of the “should” query components. The more “should” components that match the record, the higher the record’s rank.
  • a mix of “must” and “should” components. In this case, the “must” clauses determine which records will be returned. The “should” components contribute to ranking.

For the simplest site search, all the different text queries go in as “should” components. We’re taking the giant-OR route.
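The ranking effect of “must” and “should” can be sketched with a toy model. This is purely an illustration of the idea — Lucene’s real scoring also weighs term frequency, term rarity, field length, and boosts:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ToyBoolQuery {
    // A document is returned iff it contains every "must" term;
    // its rank rises with each "should" term it also contains.
    static List<String> search(List<String> docs, List<String> must, List<String> should) {
        return docs.stream()
                .filter(d -> must.stream().allMatch(d::contains))
                .sorted((a, b) -> Long.compare(shouldMatches(b, should), shouldMatches(a, should)))
                .collect(Collectors.toList());
    }

    static long shouldMatches(String doc, List<String> should) {
        return should.stream().filter(doc::contains).count();
    }

    public static void main(String[] args) {
        // "must" filters; "should" ranks: the doc matching more
        // "should" terms comes first.
        System.out.println(search(
                List.of("red apple", "green apple pie", "red pie"),
                List.of("apple"),
                List.of("pie", "green")));  // [green apple pie, red apple]
    }
}
```

With an empty “must” list you’d get the giant-OR behavior described above: any “should” match gets a document in, and more matches push it up the list.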

Phrase Query

The first subquery is a TextQuery with type “phrase.” This is elasticsearch parlance; Lucene has a PhraseQuery. The objective here is to find the exact phrase the user typed in. A slop of 1 means there can be one extra word in between the words in the phrase. Increasing the slop to 2 will match two-word phrases with the words out of order. Adding a boost of 4 tells Lucene that this query is 4 times as important as the other queries.

Text Query

The other text queries have a type of boolean (which is the default in elasticsearch). These will find any match on any of the terms in the search text. There is one for each field that I’m searching in.


Fuzzy Query

The last subquery is a FuzzyLikeThis query across both fields. The intention here is to match terms that are close to the ones entered: it’ll match words that are a few characters off from the search text. The details are fuzzy to me, but that’s compatible with this objective. Increasing maxQueryTerms increases the flexibility of the match but can slow performance. Minimum similarity can be raised (toward 1) if users think your matches are too flexible. Prefix length could be set to 1 to require that the first letters of the matched words are the same. Finally, a boost of less than one reduces the ranking of these matches compared to the exact term matches from the text queries.
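Fuzzy matching like this is built on edit distance (how many single-character changes turn one word into another). A rough sketch of the idea — the similarity formula below is my toy version, not Lucene’s exact one:

```java
public class ToyFuzzy {
    // Classic Levenshtein edit distance between two words.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Similarity in [0,1]: 1 means identical. Raising a minimum-similarity
    // threshold toward 1 makes the match stricter.
    static double similarity(String a, String b) {
        return 1.0 - (double) editDistance(a, b) / Math.max(a.length(), b.length());
    }

    public static void main(String[] args) {
        System.out.println(editDistance("werds", "words"));  // 1
    }
}
```

So “werds” is one edit away from “words,” which is why a fuzzy query can forgive that typo while a plain term query cannot.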
The result of all this is a search that emphasizes exact matches. It includes near-matches, but puts them toward the end of the results.
If you have any suggestions to make this better, please, leave a comment.
For elasticsearch users, this is the code in the Java API:
BoolQueryBuilder boolQuery = QueryBuilders.boolQuery()
  .should(new TextQueryBuilder("name", text).type(Type.PHRASE).slop(1).boost(4))
  .should(new TextQueryBuilder("name", text).type(Type.BOOLEAN))
  .should(new TextQueryBuilder("description", text).type(Type.BOOLEAN))
  .should(new FuzzyLikeThisQueryBuilder("name", "description")
      .likeText(text).boost(0.5f));

final SearchResponse response = new SearchRequestBuilder(client)
  .setIndices("indexName")
  .setQuery(boolQuery)
  .setFrom(0)
  .execute().actionGet();

services on a Mac

Say there’s a program that needs to run on your Mac all the time forever and ever. This post describes how to set that up. The example here sets up the script that runs elasticsearch.*

On a Mac, services are controlled by launchd, which is configured using launchctl. This example uses launchctl to set up a service that starts as soon as it’s configured, starts up automatically at system startup, and gets restarted every time the job dies.

1) Create a configuration (plist) file for it. Name the file like: com.yourorganization.whateverpackage.jobtitle.plist. In the example, mine is org.elasticsearch.test.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>org.elasticsearch.test</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Volumes/Projects/elasticsearch/bin/elasticsearch</string>
        <string>-f</string>
    </array>
    <key>WorkingDirectory</key>
    <string>/Volumes/Projects/elasticsearch</string>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
Some important bits explained:
org.elasticsearch.test – the Label is the job’s name. After the job is loaded into launchctl, this is how you can refer to it for start, stop, list.
/Volumes/Projects/elasticsearch/bin/elasticsearch – the ProgramArguments contains the name of the program you want to run, and then any arguments to pass to it. (there’s an alternative to put the program or script name in a separate place, but this is easy. It keeps it right next to its arguments.)
-f – this argument is specific to elasticsearch, but its meaning is important. Your program or script should not fork and then exit: it should run everything in the foreground and never exit. launchd’s job is to keep your program running, so your program needs to keep running.
/Volumes/Projects/elasticsearch –  launchd will cd to the WorkingDirectory before running your program. Your script should not cd.

2) load your plist configuration:

  • copy your plist file to /Library/LaunchDaemons
  • Load it: launchctl load /Library/LaunchDaemons/org.elasticsearch.test.plist (your plist filename)
    • note that whatever user you are when you run this launchctl load will, by default, be the user the job runs as. If you don’t like that, you can configure the UserName in your plist (man launchd.plist for more info). If you sudo launchctl load, your job will run as root.
  • Check it: launchctl list org.elasticsearch.test (your job label)
    • if you just do launchctl list, it’ll list all the jobs set up by your user. sudo launchctl list to see everything. Grep for yours.
    • listing the specific job gives you some status information. Look for a PID – that means your script is running.
    • if you see no PID and a LastExitStatus of 256, then your job might be forking and exiting. Don’t do that.
  • Now ps -ef | grep for your program, and see that it’s running.
    • Kill it. See whether it gets started again with no action on your part. It should.
    • Check your program’s logs.
  • If you need your program to NOT run constantly, you’ll need to unload it.
    • launchctl unload /Library/LaunchDaemons/org.elasticsearch.test.plist

Great! That’s the whole thing. For way more options, check the man pages. Also, here’s a poorly organized but nonetheless useful reference page.
* there is a service wrapper for elasticsearch, but it didn’t work for us. It doesn’t set the job up to run continually.

configuring soundex in elasticsearch

elasticsearch is trivially easy to set up and start using with reasonable default settings. Like any framework, deviating from those defaults increases the challenge. Phonetic searches like soundex are supported by elasticsearch, but not out-of-the-box.

What is soundex? It classifies words according to how they sound, so that similar-sounding words will match each other in a search. For instance, “frend” will find “friend” because they both translate to the soundex secret code F653.
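For the curious, here’s a minimal Java sketch of the classic American Soundex encoding. The plugin relies on a real library implementation; this toy version just shows the idea, and assumes a non-empty alphabetic word:

```java
public class Soundex {
    // Soundex digit for each letter; '0' marks vowels and ignored letters.
    static char code(char c) {
        switch (Character.toLowerCase(c)) {
            case 'b': case 'f': case 'p': case 'v': return '1';
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return '2';
            case 'd': case 't': return '3';
            case 'l': return '4';
            case 'm': case 'n': return '5';
            case 'r': return '6';
            default: return '0';  // vowels, h, w, y
        }
    }

    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        out.append(Character.toUpperCase(word.charAt(0)));  // keep the first letter
        char last = code(word.charAt(0));
        for (int i = 1; i < word.length() && out.length() < 4; i++) {
            char c = Character.toLowerCase(word.charAt(i));
            char digit = code(c);
            // Append a digit only when it differs from the previous code,
            // so runs of similar consonants collapse.
            if (digit != '0' && digit != last) out.append(digit);
            // h and w do not separate consonant codes; vowels and y do.
            if (c != 'h' && c != 'w') last = digit;
        }
        while (out.length() < 4) out.append('0');  // pad to letter + 3 digits
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("frend") + " " + soundex("friend"));  // F653 F653
    }
}
```

Since both words encode to F653, a search for one finds the other.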

This post describes in explicit detail how to configure a soundex search on specific fields in an index on elasticsearch v0.19.2 on a Mac. It includes some explanation of what each step means.

If you haven’t installed elasticsearch yet, do that.

Part 1: Install the plugin

The phonetic plugin (github) is trivial to install. Go to the directory where elasticsearch is installed, and:

bin/plugin -install elasticsearch/elasticsearch-analysis-phonetic/1.1.0

The rest of the configuration is not so easy. What the plugin gave us was a new type of filter. That filter knows how to turn words into soundex secret codes. Our next objective is to teach elasticsearch how to perform this translation on searchable data.

Part 2: Configure the analyzer

The phonetic filter will be part of an analyzer. What is an analyzer?
When we tell elasticsearch about searchable data (aka “insert a document into an index”), the searchable fields are broken down into bits (tokens), which are then stored in the index. The remarkably crude diagram below illustrates this process. A document is inserted which contains a description like “bunch of words.” This string is divided up by the tokenizer. Boring words like “of” (aka stop words) are discarded by the standard filter. The remaining words are translated into soundex secret codes by the soundex filter. Finally, these tokens make it into the index, where they are stored.
When a search string comes in to elasticsearch, that string goes through the same analyzer, where it gets broken down into tokens and those tokens get filtered and translated in the same way. The search tokens are then compared with tokens in the index, and by this means matches are located. Putting the search string through the same analyzer as the searched data is important, because it makes the searched tokens correspond to the indexed tokens.
What does this mean for us? To add the soundex filter to our data processing stream, we need to configure a custom analyzer. Put this into config/elasticsearch.yml:

index :
    analysis :
        analyzer :
            search_soundex :
                type : custom
                tokenizer : standard
                filter : [standard, lowercase, soundex_filter]
        filter :
            soundex_filter :
                type : phonetic
                encoder : soundex
                replace : true

Watch out for:

  • no tabs. YAML hates tabs, as any indentation-sensitive format must. But you won’t get the error you expect, because elasticsearch translates each tab into a single space. You might get an error like this: “ParserException[while parsing a block mapping; expected , but found BlockMappingStart]” because elasticsearch has inconveniently altered your indentation. 
  • No other non-space whitespace. We copied and pasted from a web page (like this one!) and got nonbreaking spaces in our file, and the result was that the analyzer was not found. No error loading the index, only “analyzer search_soundex not found” when we tried to put an item into our index.
  • if you add a second analyzer or filter, integrate it into this section; adding a new, similar block appears to overwrite the first one. (this was only my experience)
  • there are some examples of adding the analyzer definition within the index creation configuration. I never got that to work.

    Some interesting bits:

    search_soundex is the name of the analyzer. Call it what you want, but remember it for the next section.
    replace : true tells our filter to discard the original tokens, passing only the secret soundex codes onward to the index. If you want to do both exact-match and phonetic-match searches on your data, you may need to set this to false. However, this bit us in the butt later, so I don’t recommend it.

    Test your analyzer:

    First, restart elasticsearch to pick up your configuration changes. Shut it down:
      curl localhost:9200/_shutdown -XPOST
    Then start it back up with bin/elasticsearch.
    Second, you’ll need an index — any index. If you don’t have one yet, do this:
      curl localhost:9200/anyindexname -XPUT
    Third, use the analyzer api to see what happens to an input string when it goes through your analyzer.
      curl "localhost:9200/anyindexname/_analyze?analyzer=search_soundex&pretty=true" -d 'bunch of words'
    This will give you some JSON output. The interesting bits are labeled “token” — these are the tokens that come out of your analyzer. You should see B520 and W632 among them.

    Part 3: Configure field mappings

    Next, tell elasticsearch to use this analyzer. For this, we need to map fields on the index. elasticsearch normally creates these mappings dynamically, but for soundex to work, we have to set them up ourselves.
    This must be done at index creation time, as far as I can determine. That means deleting the index and recreating it for each experiment. For static configuration, these settings can go in a template file, but that’s outside the scope of this post. I’m going to show how to create the index through the REST API. It should be possible to get these mappings set up on index creation with the Java API, but I had trouble with that.
    If your index already exists, then delete it:
      curl localhost:9200/anyindexname -XDELETE
    Create your index:

    curl localhost:9200/anyindexname -XPUT -d '
    {
      "mappings" : {
        "yourdocumenttype" : {
          "properties" : {
            "description" : { "type" : "string", "index" : "analyzed", "analyzer" : "search_soundex" }
          }
        }
      }
    }'

    Explanation of the interesting bits:

    -XPUT tells curl to use the PUT http command. All the stuff after -d will be passed as the data for the put, and this is the index configuration.
    anyindexname is the name of your index, whatever you like
    yourdocumenttype is the type of document you’re going to insert. Call it whatever.
    description is the name of the field that is going to be searched phonetically.
    The rest of it is structure.

    Test your mappings

      curl localhost:9200/anyindexname/_mapping
    You should see your configuration come out, including the analyzer. If not, bang your head against the wall.

    Intermission: insert some data

    It’s time to populate the index. You don’t need to do anything differently compared to without soundex. If you’re following along and want to throw in some test data, try this:
      curl localhost:9200/anyindexname/yourdocumenttype/1 -XPOST -d '{ "description" : "bunch of words" }'
    To see all the data in your index, here’s a trick. An explanation of facets is outside the scope of this post, but this will retrieve everything in the index and count all the tokens in all the descriptions.
      curl "localhost:9200/anyindexname/_search?pretty=true" -d '{ "query" : { "matchAll" : {} }, "facets" : { "tag" : { "terms" : { "field" : "description" } } } }'
    Look for the “term” items, and see the soundex codes.

    Part 4: Set up the query

    When searching this index for data in this field, it is important that the search string be passed through the same analyzer. This will happen by default as long as you specify the field name.  That way secret soundex codes are compared against secret soundex codes and all is well.
    curl localhost:9200/anyindexname/yourdocumenttype/_search -d '
    { "query" :
       { "text" :
         { "description" :
            { "query" : "werds bunk" } } } }'

    Java API:

    TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "elasticsearch_jessitron").build())
                    .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

    SearchResponse searchResponse = new SearchRequestBuilder(client)
                    .setQuery(textQuery("description", "werds bunk"))
                    .execute().actionGet();
    assertThat(searchResponse.getHits().totalHits(), is(1l));
    The above query will find any document with either a token that sounds like “words” or a token that sounds like “bunch.” You can change it to require that the description contain both “words” and “bunch” by telling the text query to use AND instead of OR. (If you’re going to do this, then be sure you don’t have replace : false set on your soundex filter. You won’t find anything when your search terms aren’t exact.)
    curl localhost:9200/anyindexname/yourdocumenttype/_search -d '
    { "query" :
       { "text" :
         { "description" :
            { "query" : "werds bunk",
              "operator" : "and" } } } }'

    Java API:

    TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "elasticsearch_jessitron").build())
                    .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

    SearchResponse searchResponse = new SearchRequestBuilder(client)
                    .setQuery(textQuery("description", "werds bunk")
                            .operator(TextQueryBuilder.Operator.AND))
                    .execute().actionGet();
    assertThat(searchResponse.getHits().totalHits(), is(1l));

    Watch out for:

    It is tempting to use the “_all” field in the searches, to base results on all fields at once. “_all” is not a loop through all fields so much as its own special field: it has its own analyzer both for indexing and searching. It is configured separately, so it won’t do soundex unless you set that up specifically.

    Part 5: Deep breath

    That was a lot more work than most stuff in elasticsearch. Having done this, you now know how to map fields and search on them. You know how to define analyzers and filters by field. This will stand you in good stead for further configuration. For instance, adding geographic search is easy after you’ve done soundex.

    It is possible to use different analyzers for search and indexing. This can be specified in index mappings or overridden on the individual search. Think about this when you start doing something weird, but make sure you understand analyzers and filters more deeply than this article gets.

      elasticsearch: first things first

      elasticsearch is a wrapper around Lucene that supports distributed, scalable search, particularly compatible with Amazon’s EC2 setup. The fun part about elasticsearch is the installation and setup (on a Mac):

      step 1) download it.
      step 2) unzip that file.
      step 3) go to that directory and run bin/elasticsearch

      Done. Ready to go. Proceed with inserting data.

      Except! (There’s always a catch.) At startup, elasticsearch goes out and looks around your network for other computers running elasticsearch. If it finds one with the same cluster name, it’ll hook up, and poof! two elasticsearch clusters become one. Any data you insert will be visible to your fellow developers who are also fooling around. Or testers, if it’s running on a test box. This can freak you out when you’re not expecting it. Therefore:

      step 2a) in config/elasticsearch.yml, the first setting is cluster.name. Change the value from “elasticsearch” to something uniquely yours.

      Once you’ve changed your config, you’ll need to shut down your elasticsearch cluster:

      curl localhost:9200/_shutdown -XPOST

      then do ‘ps -ef | grep elasticsearch’ to make sure it’s dead. Kill the process if it is still running. (Yes, you can disable that easy-peasy command for production.)

      Like any framework, elasticsearch is very easy to use as long as you stick with the default settings. To use elasticsearch for real, you need to think about indexes, documents, mappings, analyzers, filters, nodes, clusters, shards, and probably more concepts I haven’t encountered yet. However, the reasonable defaults and smart initialization mean that you don’t have to think about any of these right away. We can get the product up and running for proof of concept with very little effort and customize it in our own sweet time.

      Inserting data into elasticsearch over HTTP: a breakdown

      There are a zillion examples of what to type to insert into elasticsearch. But what does each part mean?

      Shoving information into elasticsearch is pretty easy. You don’t have to set anything up. (so, if you have a typo, good luck figuring out later where your data went. This is the price of sensible defaults.)

      Here is one way to throw some data into an index – type this at a *nix prompt:

      curl -XPOST "http://localhost:9200/indexname/typename/optionalUniqueId" -d '{ "field" : "value" }'

      Here is what that means:
      curl is a command that sends messages over HTTP.
      -X is the option that tells curl which HTTP command to send. GET is the default.
      POST is one of the HTTP commands that you can use for this insertion. PUT will work as well, but then the optionalUniqueId is not optional.
      localhost is the machine where elasticsearch is running.
      9200 is the default port number for elasticsearch
      indexname is the name of your index. This must be in all lowercase. You can use different indexes to restrict your searches later. Also, indexes are associated with particular mapping configurations. The defaults are sensible, but know that you can configure stuff by index and search by index (or multiple indexes).
      typename describes the type of document you’re sticking into the index. You can use this later to narrow searches. Also, the ID of each document in the index should be unique per type.
      optionalUniqueId if you have an intelligent ID for the document you’re sticking in, then put it here. Otherwise elasticsearch will create one. When you want to update your object, you’ll need this. It’s also handy for retrieval of exactly one object.
      -d tells curl “here comes the data!”
      { "field" : "value" } represents any valid JSON. All this stuff is stored for your object.

      The output of this is an HTTP 200 if the document was updated or HTTP 201 if a document was created. If you want curl to tell you what http status code came back, add this to your command line: -w ” status: %{http_code}”

      Here are two of the easiest ways to see what you just inserted.

      Retrieve by ID:

      curl "http://localhost:9200/indexname/typename/optionalUniqueId?pretty=true"

      This does a GET to fetch the object by ID.
      ?pretty=true tells elasticsearch to put newlines and indentation into the JSON so that it’s easier for humans to read.

      Retrieve everything in the index:
      curl "http://localhost:9200/indexname/_search?pretty=true"

      _search tells elasticsearch that this is a query. Since no parameters are provided, everything is returned.
      Notice that typename is omitted here. If you include it, then you’ll get back everything of that type in the index.