Tuesday, April 24, 2012

services on a Mac

Say there's a program that needs to run on your Mac all the time forever and ever. This post describes how to set that up. The example here sets up the script that runs elasticsearch.*

On a Mac, services are controlled by launchd, which is configured using launchctl. This example uses launchctl to set up a service that starts as soon as it's configured, starts up automatically at system startup, and gets restarted every time the job dies.

1) Create a configuration (plist) file for it. Name the file like: com.yourorganization.whateverpackage.jobtitle.plist. In the example, mine is org.elasticsearch.test.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
    <dict>
        <key>Label</key>
        <string>org.elasticsearch.test</string>
        <key>ProgramArguments</key>
        <array>
            <string>/Volumes/Projects/elasticsearch/bin/elasticsearch</string>
            <string>-f</string>
        </array>
        <key>WorkingDirectory</key>
        <string>/Volumes/Projects/elasticsearch</string>
        <key>StandardErrorPath</key>
        <string>logs/service-err.log</string>
        <key>StandardOutPath</key>
        <string>logs/service-out.log</string>
        <key>OnDemand</key>
        <false/>
        <key>KeepAlive</key>
        <true/>
        <key>RunAtLoad</key>
        <true/>
    </dict>
</plist>

Some important bits explained:
org.elasticsearch.test - the Label is the job's name. After the job is loaded into launchctl, this is how you can refer to it for start, stop, list.
/Volumes/Projects/elasticsearch/bin/elasticsearch - ProgramArguments contains the name of the program you want to run, followed by any arguments to pass to it. (There's an alternative Program key that holds the program name in a separate place, shown below, but this way is easy: it keeps the program right next to its arguments.)
-f - this argument is specific to elasticsearch, but its meaning is important: your program or script should not fork and then exit. It should run everything in the foreground and never exit on its own. launchd's job is to keep your script running, so your script needs to keep running.
/Volumes/Projects/elasticsearch -  launchd will cd to the WorkingDirectory before running your program. Your script should not cd.
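
For the curious, here's a minimal sketch of that Program alternative (see man launchd.plist): the executable path moves into its own key inside the <dict>. Per the man page, when Program is present, ProgramArguments (if you keep it) supplies the full argv, starting with argv[0].

    <key>Program</key>
    <string>/Volumes/Projects/elasticsearch/bin/elasticsearch</string>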

2) load your plist configuration (a full command-line walkthrough of the cycle follows this list):
  • copy your plist file to /Library/LaunchDaemons
  • Load it: launchctl load /Library/LaunchDaemons/org.elasticsearch.test.plist (your plist filename)
    • note that whatever user you are when you run this launchctl load will, by default, be the user the job runs as. If you don't like that, you can configure the UserName in your plist (man launchd.plist for more info). If you sudo launchctl load, your job will run as root.
  • Check it: launchctl list org.elasticsearch.test (your job label)
    • if you just do launchctl list, it'll list all the jobs set up by your user. sudo launchctl list to see everything. Grep for yours.
    • listing the specific job gives you some status information. Look for a PID - that means your script is running.
    • if you see no PID and a LastExitStatus of 256, then your job might be forking and exiting. Don't do that.
  • Now ps -ef | grep for your program, and see that it's running.
    • Kill it. See whether it gets started again with no action on your part. It should.
    • Check your program's logs.
  • If you need your program to NOT run constantly, you'll need to unload it.
    • launchctl unload /Library/LaunchDaemons/org.elasticsearch.test.plist
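
Here's the whole cycle in one place -- a minimal sketch assuming the example label and paths above; <pid> comes from the launchctl list output:

  sudo cp org.elasticsearch.test.plist /Library/LaunchDaemons/
  launchctl load /Library/LaunchDaemons/org.elasticsearch.test.plist
  launchctl list org.elasticsearch.test      # look for a PID
  ps -ef | grep elasticsearch                # confirm the process is up
  kill <pid>                                 # simulate a crash
  launchctl list org.elasticsearch.test      # a new PID should appear
  launchctl unload /Library/LaunchDaemons/org.elasticsearch.test.plist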

Great! That's the whole thing. For way more options, check the man pages. Also, here's a poorly organized but nonetheless useful reference page.

* there is a service wrapper for elasticsearch, but it didn't work for us. It doesn't set the job up to run continually.

Wednesday, April 18, 2012

configuring soundex in elasticsearch

elasticsearch is trivially easy to set up and start using with reasonable default settings. As with any framework, deviating from those defaults increases the challenge. Phonetic searches like soundex are supported by elasticsearch, but not out-of-the-box.

What is soundex? It classifies words according to how they sound, so that similar-sounding words will match each other in a search. For instance, "frend" will find "friend" because they both translate to the soundex secret code F653.

This post describes in explicit detail how to configure a soundex search on specific fields in an index on elasticsearch v0.19.2 on a Mac. It includes some explanation of what each step means.

If you haven't installed elasticsearch yet, do that.

Part 1: Install the plugin

The phonetic plugin (github) is trivial to install. Go to the directory where elasticsearch is installed, and:
  bin/plugin -install elasticsearch/elasticsearch-analysis-phonetic/1.1.0

The rest of the configuration is not so easy. What the plugin gave us was a new type of filter. That filter knows how to turn words into soundex secret codes. Our next objective is to teach elasticsearch how to perform this translation on searchable data.


Part 2: Configure the analyzer

The phonetic filter will be part of an analyzer. What is an analyzer?
When we tell elasticsearch about searchable data (aka "insert a document into an index"), the searchable fields are broken down into bits (tokens), which are then stored in the index. Picture the process: a document is inserted containing a description like "bunch of words." The tokenizer divides this string into word tokens. Filters then transform those tokens in stages: a stop filter (when one is configured) discards boring words like "of" (aka stop words), a lowercase filter normalizes case, and the soundex filter translates the remaining words into soundex secret codes. Finally, these tokens make it into the index, where they are stored.

When a search string comes in to elasticsearch, that string goes through the same analyzer, where it gets broken down into tokens and those tokens get filtered and translated in the same way. The search tokens are then compared with tokens in the index, and by this means matches are located. Putting the search string through the same analyzer as the searched data is important, because it makes the searched tokens correspond to the indexed tokens.

What does this mean for us? To add the soundex filter to our data processing stream, we need to configure a custom analyzer. Put this into config/elasticsearch.yml:
index :
    analysis :
        analyzer :
            search_soundex :
                type : custom
                tokenizer : standard
                filter : [standard, lowercase, soundex_filter]
        filter :
            soundex_filter :
                type : phonetic
                encoder : soundex
                replace : true

Watch out for:

  • no tabs. YAML hates tabs, as any indentation-sensitive format must. But you won't get the error you expect, because elasticsearch translates each tab into a single space. You might get an error like this: "ParserException[while parsing a block mapping; expected <block end>, but found BlockMappingStart]" because elasticsearch has inconveniently altered your indentation. 
  • No other non-space whitespace. We copied and pasted from a web page (like this one!) and got nonbreaking spaces in our file, and the result was that the analyzer was not found. No error loading the index, only "analyzer search_soundex not found" when we tried to put an item into our index. (A quick grep check for both this and the tab problem appears after this list.)
  • if you add a second analyzer or filter, integrate it into this section; adding a new, similar block appears to overwrite the first one. (this was only my experience)
  • there are some examples of adding the analyzer definition within the index creation configuration. I never got that to work.
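
To check for stray whitespace, here's a quick sketch (bash $'...' syntax; the second pattern is the UTF-8 encoding of a nonbreaking space):

  grep -n $'\t' config/elasticsearch.yml
  grep -n $'\xc2\xa0' config/elasticsearch.yml

No output means no tabs and no nonbreaking spaces.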

Some interesting bits:

search_soundex is the name of the analyzer. Call it what you want, but remember it for the next section.
replace : true tells our filter to discard the original tokens, passing only the secret soundex codes onward to the index. If you want to do both exact-match and phonetic-match searches on your data, you may need to set this to false. However, this bit us in the butt later, so I don't recommend it.

Test your analyzer:

First, restart elasticsearch to pick up your configuration changes: shut it down with
  curl localhost:9200/_shutdown -XPOST
and then start it again with bin/elasticsearch from the install directory.
Second, you'll need an index -- any index. If you don't have one yet, do this:
  curl localhost:9200/anyindexname -XPUT
Third, use the analyzer api to see what happens to an input string when it goes through your analyzer.
  curl "localhost:9200/anyindexname/_analyze?analyzer=search_soundex&pretty=true" -d 'bunch of words'
This will give you some JSON output. The interesting bits are labeled "token" -- these are the tokens that come out of your analyzer. You should see B520 and W632 among them.
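
For reference, the output is shaped roughly like this -- an approximation from memory, not verbatim 0.19.2 output, so treat the exact field values loosely:

  { "tokens" : [
    { "token" : "B520", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 1 },
    ...
  ] }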

Part 3: Configure field mappings

Next, we need to tell elasticsearch to use this analyzer, which means mapping fields on an index. elasticsearch normally creates these mappings dynamically, but for soundex to work, we have to set them up ourselves.
As far as I can determine, this must be done at index creation time, which means deleting and recreating the index for each experiment. For static configuration, these settings can go in a template file, but that's outside the scope of this post. I'm going to show how to create the index through the REST API. It should be possible to get these mappings set up on index creation with the Java API, but I had trouble with that.

If your index already exists, then delete it:
  curl localhost:9200/anyindexname -XDELETE
Create your index:
curl localhost:9200/anyindexname -XPUT -d '
{
  "mappings" : {
    "yourdocumenttype" : {
      "properties" : {
        "description" : { "type" : "string", "index" : "analyzed", "analyzer" : "search_soundex" }
      }
    }
  }
}'

Explanation of the interesting bits:

-XPUT tells curl to use the HTTP PUT method. Everything after -d is passed as the body of the PUT; this is the index configuration.
anyindexname is the name of your index, whatever you like
yourdocumenttype is the type of document you're going to insert. Call it whatever.
description is the name of the field that is going to be searched phonetically.
The rest of it is structure.

Test your mappings

Run
  curl localhost:9200/anyindexname/_mapping
You should see your configuration come out, including the analyzer. If not, bang your head against the wall.

Intermission: insert some data

It's time to populate the index. You don't need to do anything differently compared to without soundex. If you're following along and want to throw in some test data, try this:
  curl localhost:9200/anyindexname/yourdocumenttype/1 -XPOST -d '{ "description" : "bunch of words" }'
To see all the data in your index, here's a trick. An explanation of facets is outside the scope of this post, but this will retrieve everything in the index and count all the tokens in all the descriptions.
  curl localhost:9200/anyindexname/_search?pretty=true -d ' { "query" : { "matchAll" : {} }, "facets":{"tag":{"terms":{"field":"description"}}}}'
Look for the "term" items, and see the soundex codes.

Part 4: Set up the query

When searching this index for data in this field, it is important that the search string be passed through the same analyzer. This will happen by default as long as you specify the field name.  That way secret soundex codes are compared against secret soundex codes and all is well.

REST API:
curl localhost:9200/anyindexname/yourdocumenttype/_search -d '
{ "query" :
   { "text" :
     { "description" :
        { "query" : "werds bunk" }
     }
   }
}'

Java API:

// assumes these imports:
//   import static org.elasticsearch.index.query.QueryBuilders.textQuery;
//   import static org.hamcrest.MatcherAssert.assertThat;
//   import static org.hamcrest.Matchers.is;
TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "elasticsearch_jessitron").build())
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

SearchRequest request = new SearchRequestBuilder(client)
                .setQuery(textQuery("description", "werds bunk"))
                .setIndices("anyindexname")
                .setTypes("yourdocumenttype")
                .request();
SearchResponse searchResponse = client.search(request).actionGet();

assertThat(searchResponse.getHits().totalHits(), is(1L));

The above query will find any document with either a token that sounds like "words" or a token that sounds like "bunch." You can change it to require that the description contain both "words" and "bunch" by telling the text query to use AND instead of OR. (If you're going to do this, then be sure you don't have replace : false set on your soundex filter. You won't find anything when your search terms aren't exact.)

REST API:
curl localhost:9200/anyindexname/yourdocumenttype/_search -d '
{ "query" :
   { "text" :
     { "description" :
        { "query" : "werds bunk",
          "operator" : "and" }
     }
   }
}'

Java API:

// assumes the same imports as above, plus:
//   import org.elasticsearch.index.query.TextQueryBuilder;
TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "elasticsearch_jessitron").build())
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

SearchRequest request = new SearchRequestBuilder(client)
                .setQuery(textQuery("description", "werds bunk")
                        .operator(TextQueryBuilder.Operator.AND))
                .setIndices("anyindexname")
                .setTypes("yourdocumenttype")
                .request();
SearchResponse searchResponse = client.search(request).actionGet();

assertThat(searchResponse.getHits().totalHits(), is(1L));

Watch out for:

It is tempting to use the "_all" field in the searches, to base results on all fields at once. "_all" is not a loop through all fields so much as its own special field: it has its own analyzer both for indexing and searching. It is configured separately, so it won't do soundex unless you set that up specifically.
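
If you do want soundex on "_all", the mapping takes its own analyzer setting. Something like this sketch ought to work, though I haven't tried it myself:

curl localhost:9200/anyindexname -XPUT -d '
{
  "mappings" : {
    "yourdocumenttype" : {
      "_all" : { "analyzer" : "search_soundex" },
      "properties" : {
        "description" : { "type" : "string", "index" : "analyzed", "analyzer" : "search_soundex" }
      }
    }
  }
}'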

Part 5: Deep breath

That was a lot more work than most stuff in elasticsearch. Having done this, you now know how to map fields and search on them. You know how to define analyzers and filters by field. This will stand you in good stead for further configuration. For instance, adding geographic search is easy after you've done soundex.

It is possible to use different analyzers for search and indexing. This can be specified in index mappings or overridden on the individual search. Think about this when you start doing something weird, but make sure you understand analyzers and filters more deeply than this article gets.
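
For instance, the mapping syntax for split analyzers looks like this -- a sketch I haven't verified on 0.19.2, shown only to illustrate the index_analyzer and search_analyzer properties (you'd pick a pair that makes sense for your data):

curl localhost:9200/anyindexname -XPUT -d '
{
  "mappings" : {
    "yourdocumenttype" : {
      "properties" : {
        "description" : { "type" : "string",
                          "index_analyzer" : "search_soundex",
                          "search_analyzer" : "standard" }
      }
    }
  }
}'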

elasticsearch: first things first

elasticsearch is a wrapper around Lucene that supports distributed, scalable search, particularly compatible with Amazon's EC2 setup. The fun part about elasticsearch is the installation and setup (on a Mac):

step 1) download it.
step 2) unzip that file.
step 3) go to that directory and run bin/elasticsearch

Done. Ready to go. Proceed with inserting data.

Except! (There's always a catch.) At startup, elasticsearch goes out and looks around your network for other computers running elasticsearch. If it finds one with the same cluster name, it'll hook up, and poof! two elasticsearch clusters become one. Any data you insert will be visible to your fellow developers who are also fooling around. Or testers, if it's running on a test box. This can freak you out when you're not expecting it. Therefore:

step 2a) in config/elasticsearch.yml, the first setting is cluster.name. Change the value from "elasticsearch" to something uniquely yours.
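
For example, here's that setting with the cluster name used elsewhere in this post:
  cluster.name: elasticsearch_jessitron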

Once you've changed your config, you'll need to shut down your elasticsearch cluster:

curl localhost:9200/_shutdown -XPOST
then do 'ps -ef | grep elasticsearch' to make sure it's dead. Kill the process if it is still running. (Yes, you can disable that easy-peasy command for production.)

Like any framework, elasticsearch is very easy to use as long as you stick with the default settings. To use elasticsearch for real, you need to think about indexes, documents, mappings, analyzers, filters, nodes, clusters, shards, and probably more concepts I haven't encountered yet. However, the reasonable defaults and smart initialization mean that you don't have to think about any of these right away. We can get the product up and running for proof of concept with very little effort and customize it in our own sweet time.

Friday, April 13, 2012

Inserting data into elasticsearch over HTTP: a breakdown

There are a zillion examples of what to type to insert into elasticsearch. But what does each part mean?

Shoving information into elasticsearch is pretty easy. You don't have to set anything up. (so, if you have a typo, good luck figuring out later where your data went. This is the price of sensible defaults.)

Here is one way to throw some data into an index - type this at a *nix prompt:

curl -XPOST "http://localhost:9200/indexname/typename/optionalUniqueId" -d '{ "field" : "value" }'

Here is what that means:
curl is a command that sends messages over HTTP.
-X is the option that tells curl which HTTP command to send. GET is the default.
POST is one of the HTTP commands that you can use for this insertion. PUT will work as well, but then the optionalUniqueId is not optional.
localhost is the machine where elasticsearch is running.
9200 is the default port number for elasticsearch
indexname is the name of your index. This must be in all lowercase. You can use different indexes to restrict your searches later. Also, indexes are associated with particular mapping configurations. The defaults are sensible, but know that you can configure stuff by index and search by index (or multiple indexes).
typename describes the type of document you're sticking into the index. You can use this later to narrow searches. Also, the ID of each document in the index should be unique per type.
optionalUniqueId is the place for an intelligent ID for the document you're sticking in, if you have one. Otherwise elasticsearch will create one. When you want to update your object, you'll need this. It's also handy for retrieving exactly one object.
-d tells curl "here comes the data!"
{ "field" : "value" } represents any valid JSON. all this stuff is stored for your object.

The output of this is an HTTP 200 if the document was updated or HTTP 201 if a document was created. If you want curl to tell you what http status code came back, add this to your command line: -w " status: %{http_code}"
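
For example, the same insertion with the status report tacked on:

curl -XPOST "http://localhost:9200/indexname/typename/optionalUniqueId" -d '{ "field" : "value" }' -w " status: %{http_code}"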

Here are two of the easiest ways to see what you just inserted.

Retrieve by ID:

curl "http://localhost:9200/indexname/typename/optionalUniqueId?pretty=true"

This does a GET to fetch the object by ID.
?pretty=true tells elasticsearch to put newlines and indentation into the JSON so that it's easier for humans to read.

Retrieve everything in the index:
curl "http://localhost:9200/indexname/_search?pretty=true"

_search tells elasticsearch that this is a query. Since no parameters are provided, everything is returned.
Notice that typename is omitted here. If you include it, then you'll get back everything of that type in the index.

Thursday, April 12, 2012

Lessons from SICP: what is iterative programming?

SICP insight of the day: iterative programming doesn't require a for loop. Tail recursion is iterative programming. It looks very close to recursion, but the distinction is: a program is iterative if everything it needs is stored in a few state variables. In an imperative style the state is stored in mutable variables. In a functional style the state is passed as parameters to the function, over and over again.

The key difference between recursion and iteration is whether any necessary information is stored on the stack between calls. Is there any operation to perform after the recursive call completes? If so, we need the stack. If not, tail recursion, iterative program. Of course, this is why iterative programs can go on forever, while recursive programs will eventually run out of stack space. Recursion requires the computer to store more state for every pass. Iterative programs have a fixed amount of state.

The awesome concept here is the distinction between the shape of program execution vs the style it is written in. Look at what the code does, not how it goes about it.

There is one advantage of tail recursion over a for loop, even though both produce the same kind of program. Tail recursion lets everything stay immutable. This is safer.
When we're training ourselves to think functionally, using recursion in place of loops and mutables: look at the for loop. What state is tracked in there? Typically there is some state that determines when to stop (often a counter) and one or more running totals that represent the output of the loop. Take each of these and make it a parameter to the recursive function. Pattern-match on the state to determine when to end the loop; output the running totals at that point. In the recursive call, increment or advance the pieces of state and pass the updated values as parameters to the next call.

For instance:
String printContents(String[] arr) {
 String output = "";
 for (int i = 0; i < arr.length; i++) {
   output += " " + i + ") " + arr[i] + "\n";
 }
 return output;
}
Here, the state of the iteration is contained in the counter i and the aggregator output. A direct, simple translation to a tail-recursive function is then:
String printContents(int i, String output, String[] arr) {
   return i >= arr.length ?
      output :
      printContents(i + 1, output + " " + i + ") " + arr[i] + "\n", arr);
}
See how the loop counter became an argument and the aggregator became an argument? The input was an argument to begin with and stays there. This transfers the needed state to the next call.

People familiar with functional style will immediately recognize that there are easier ways to flip through the contents of an array or collection. That's fine -- functional languages have syntax to make this very common pattern extra-clean. The point is that when we're used to thinking of iteration in terms of loops, we can train ourselves to recognize the iterative state and move it into parameters.

Now, if the purpose of your for-loop is to produce a bunch of side effects, I got no help for ya.

Disclaimer: never do this in real life in Java, because the JVM doesn't optimize tail calls; each recursive call still consumes stack space.

Personal footnote: The other day a friend gave me some serious geek cred points when he noticed the Wizard Book on my nightstand. It is a classic, and it's filling in holes in my education. I'm only on chapter one, but every section that I read makes my day.

Wednesday, April 11, 2012

Excluding manual tests in Gradle

test {
   exclude '**/Manual*'
}

this is here because I know I will forget it.