Domain name analyzer/tokenizer with Elasticsearch

Elasticsearch is amazingly easy to use given how powerful it is, but sometimes the functionality seems to be a couple of years ahead of the documentation. Case in point: I wanted to analyze hostnames/domains to mimic the “site:” feature of Google search. It works by matching from right to left, and only on whole parts (‘labels’, as I’ve discovered they’re called in the DNS).

To see how this works, try the following Google searches: “site:rosoft.com”, “site:microsoft” and “site:microsoft.com”.

Only the last should produce any hits (until some registrar sees all the usual traffic to rosoft.com and grabs it).

The standard Elasticsearch analyzer won’t do this. It will treat a hostname like “projects.andrewwhitby.com” as a single term, which isn’t much good. A simple improvement would be to tokenize on the ’.’ delimiter, but then each label would become a term of its own, so the second example above (“site:microsoft”) would match. What you need is a custom analyzer. I couldn’t find any obvious example online, and assumed I might even have to code something, but I was wrong.

It turns out you can configure the built-in PathHierarchy tokenizer to behave in just this way, by setting its delimiter to ’.’ and switching on its reverse option. Because, of course, a domain name is just a slightly weird back-to-front pathname. An example script for Sense is below.
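To make the idea concrete, here is a rough Python sketch of the tokens I’d expect that configuration to produce. It is only an emulation for illustration, not Elasticsearch code, and the function names are made up for this example.

```python
def reverse_hierarchy_tokens(hostname, delimiter="."):
    """Emulate the analyzer: lowercase, then emit every whole-label suffix."""
    labels = hostname.lower().split(delimiter)
    return [delimiter.join(labels[i:]) for i in range(len(labels))]


def site_matches(hostname, site_query):
    """A 'site:' style query matches only if it equals one of the suffix tokens."""
    return site_query.lower() in reverse_hierarchy_tokens(hostname)


print(reverse_hierarchy_tokens("www.microsoft.com"))
# ['www.microsoft.com', 'microsoft.com', 'com']

for query in ["rosoft.com", "microsoft", "microsoft.com"]:
    print(query, site_matches("www.microsoft.com", query))
# rosoft.com False
# microsoft False
# microsoft.com True
```

Because every token is a suffix made of whole labels, “rosoft.com” and “microsoft” never appear as terms, which is exactly the behaviour of the three examples above.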

I guess the right-to-leftness (RTL) of domain names is fairly arbitrary, but there must be some internet lore for why it’s this way. You might have thought the default choice would be left-to-right (LTR), given the file system analogy (and maybe a slight advantage in matching left-substrings over right-substrings).

Tangentially, it turns out that RTL languages, like Hebrew, are supported in the domain name system. For example, this is a link to the Israel Internet Association: איגוד-האינטרנט.org.il. But mixing LTR and RTL scripts like this produces some pretty unintuitive, if not simply broken, behaviour (at least on OS X + Chrome).

######################################################
# Set things up

#DELETE /test_index

PUT /test_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "domain_name_analyzer": {
                    "filter":"lowercase",
                    "tokenizer": "domain_name_tokenizer",
                    "type": "custom"
                }
            },
            "tokenizer": {
                "domain_name_tokenizer": {
                    "type": "PathHierarchy",
                    "delimiter": ".",
                    "reverse": true
                }
            }
        }
    }
}

PUT /test_index/_mapping/site
{
    "properties": {
        "url": {
            "type":      "string",
            "analyzer":  "domain_name_analyzer"
        }
    }
}

######################################################
# Input some sample data


POST /test_index/site
{
  "url": "www.google.com.au"
}

POST /test_index/site
{
  "url": "www.com.au.uk"
}

######################################################
# Proof that the analyzer is working: this should return the tokens
# "www.google.com.au", "google.com.au", "com.au" and "au"

GET /test_index/_analyze?analyzer=domain_name_analyzer&text=www.google.com.au

######################################################
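# Test searching. A term query is not analyzed, so its text must equal
# one of the indexed tokens exactly: "com.au" should match only
# www.google.com.au, and "com.au.uk" should match only www.com.au.uk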

GET /test_index/site/_search
{
  "query": {
    "term": {
      "url": "com.au"
    }
  }
}

GET /test_index/site/_search
{
  "query": {
    "term": {
      "url": "com.au.uk"
    }
  }
}
Comments (1)

Hi, thanks for the analyzer. What if I have a slightly more complex URL like http://www.example.com/link and I want www.example.com? Thx.

podolinek

I’m sure there’s a way, and perhaps somebody else will reply, but what I do is just strip the hostname out of the URL and store both as fields.

Andrew Whitby
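
For what it’s worth, the hostname-stripping step could look something like the Python sketch below, using the standard library’s `urllib.parse`. The helper name and the `host` field name are just assumptions for this example; use whatever fits your mapping.

```python
from urllib.parse import urlsplit


def split_url_fields(url):
    """Return the original URL plus its hostname as two separate fields."""
    # urlsplit only recognises the host when a scheme or a leading "//" is
    # present, so bare inputs like "www.example.com/link" get "//" prepended.
    parts = urlsplit(url if "://" in url else "//" + url)
    return {"url": url, "host": parts.hostname or ""}


print(split_url_fields("http://www.example.com/link"))
# {'url': 'http://www.example.com/link', 'host': 'www.example.com'}
```

The `host` value is what you would then index with the domain name analyzer, while the full URL is kept as its own field.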
