Building a City Search with Elixir and Python

The other day I was wondering whether there was an easy, self-made local alternative to something like the Google Places API that I could use in a Phoenix app. I wanted to search for a city and get back the city itself, its state, and its country.

I found the free GeoLite2 city dataset, provided by MaxMind, which I could use to create a city search index.

(If you want to dive straight into the code I came up with, it is available on GitHub.)

I did a quick search and stumbled upon the searchex project by @andyl. It looked like exactly what I was searching for. However, there is very little documentation yet, so unfortunately I couldn’t figure out how to get it working.

Then, while thinking about how to approach this, Whoosh came to mind, a Python package that I have used at work. Whoosh is a library for indexing text and searching the index. It is pretty easy to set up and delivers great search results with little effort.

With this in mind, I wondered whether there was a way to call Python code from Elixir. After some further research and a few articles I found the Erlang library erlport, which allows you to call Ruby and Python code from Elixir. There is also an Elixir wrapper for it with the fitting name Export. You could use erlport directly in Elixir, but Export gives you some convenience functions on top and a more Elixir-like feel.

Setting Up a New mix Project & Python virtualenv

In order to get started with our custom city index, let’s set up a prototype mix project, called elixir_python:

mix new elixir_python

Head to the mix.exs file and add the export dependency:

# mix.exs
# ...
defp deps do
  [{:export, "~> 0.1.0"}]
end

Then install the dependencies with:

mix deps.get

To set up a Python environment you can use virtualenv to create a local virtual environment. Remember to activate it after creating it:

virtualenv -p python3 venv
source venv/bin/activate

We will use Whoosh, so we need a requirements.txt next to our mix.exs that defines the Python dependencies:

# /requirements.txt

whoosh==2.7.4

Install the requirements with:

pip install -r requirements.txt

Next, we need a directory where our Python code will live. Let’s create a lib/python directory where we will put the *.py files later on. You can put them wherever you want; you just have to point Export at that directory.

In your lib/python directory create a geolite2.py file. This is where we will put the code for our city search index. Next, download the GeoLite2 CSV files from dev.maxmind.com and put the English city locations file in the lib/python/data directory.

Our Elixir code will live in lib/elixir_python/geolite2.ex.

The overall project structure should now look like this:

└── elixir_python
    ├── config
    ├── lib
    │   ├── elixir_python
    │   │   └── geolite2.ex
    │   ├── python
    │   │   ├── data
    │   │   │   └── GeoLite2-City-Locations-en.csv
    │   │   ├── __init__.py
    │   │   └── geolite2.py
    │   └── elixir_python.ex
    ├── mix.exs
    ├── requirements.txt
    └── …

The Python Part

Our geolite2 Python module will have an API composed of two functions:

# lib/python/geolite2.py

def create_index():
    # We will add code here in some minutes...
    pass


def search(query, count=10):
    # We will add some code here soon...
    pass

The first one creates our search index using the GeoLite2 city CSV file. The second lets us search for cities, states or countries and will pass the results back to Elixir.

Indexing the City Data

For each Whoosh index you define a structure, its schema. The schema defines which data you want to store in the index and which full text (the content) you want to run the search on.

Our city schema looks like this:

from whoosh.fields import SchemaClass, TEXT
from whoosh.analysis import NgramWordAnalyzer


class CitySchema(SchemaClass):
    city = TEXT(stored=True)
    state = TEXT(stored=True)
    country = TEXT(stored=True)
    content = TEXT(analyzer=NgramWordAnalyzer(minsize=2), phrase=False)

We want to store the city, the state, and the country. The content field holds the full text to search in; in our case that is the joined city, state, and country name. This lets us search for cities, states, or countries and provide multiple query terms to narrow down the results. We use an NgramWordAnalyzer and set the phrase argument to False in order to save some space (see this whoosh recipe for more details).
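To get a feel for what an n-gram analyzer puts into the index, here is a rough, stdlib-only sketch. It is not Whoosh's implementation: word_ngrams is a made-up helper, and it assumes that with only minsize=2 given the analyzer lowercases each word and emits n-grams of exactly that size.

```python
def word_ngrams(text, size=2):
    """Return the n-grams of the given size for every word in text.

    Hypothetical helper that only approximates word-wise n-gramming.
    """
    grams = []
    for word in text.lower().split():
        # Words shorter than `size` produce no n-grams in this sketch.
        for start in range(len(word) - size + 1):
            grams.append(word[start:start + size])
    return grams


print(word_ngrams("Berlin"))
```

Every one of these fragments becomes a searchable term, which is why an n-gram index grows quickly and why saving space on phrase data is worthwhile.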

Before creating the index let’s define our directory names and files we want to use along with some handy functions for building the absolute paths to these files:

# lib/python/geolite2.py

import os


# The base directory where our data lies, relative to this file
DATA_BASE_DIR = 'data'

# The actual city data file
CITY_DATA_FILE = 'GeoLite2-City-Locations-en.csv'

# Our base directory where the index files are stored
INDEX_BASE_DIR = 'index'

# The name of our index
CITY_INDEX = 'city'


def index_path(index_name):
    """ Returns the absolute index path for the given index name """

    index_dir = '{}_index'.format(index_name)
    return os.path.join(current_path(), INDEX_BASE_DIR, index_dir)


def data_file_path(file_name):
    """ Returns the absolute path to the file with the given name """

    return os.path.join(current_path(), DATA_BASE_DIR, file_name)


def current_path():
    """ Returns the absolute directory of this file """

    return os.path.dirname(os.path.abspath(__file__))

Armed with these helpers we can now define the actual index creation function. We read the CSV file row by row and add one document to the index per row. Some rows in the CSV do not represent cities but states or countries, so we only index rows that have a value in the city column:

# lib/python/geolite2.py

import csv
import shutil

from whoosh.index import create_in

# ...

def create_index():
    """ Create search index files """

    path = index_path(CITY_INDEX)

    _recreate_path(path)

    index = create_in(path, CitySchema)
    writer = index.writer()

    with open(data_file_path(CITY_DATA_FILE), encoding='utf-8', newline='') as csv_file:
        reader = csv.DictReader(csv_file)

        for row in reader:
            _add_document(row=row, writer=writer)

    writer.commit()


def _recreate_path(path):
    """ Deletes and recreates the given path """

    if os.path.exists(path):
        shutil.rmtree(path)

    os.makedirs(path)


def _add_document(row, writer):
    """ Writes the data to the index """

    city = row.get('city_name')

    if not city:
        return

    state = row.get('subdivision_1_name')
    country = row.get('country_name')
    content = ' '.join([city, state, country])

    writer.add_document(
        city=city,
        state=state,
        country=country,
        content=content
    )

Note that the content ("<city> <state> <country>") is the actual text we analyse and put into the index. The other schema fields are just stored data, which we can read back from the results later and pass on to our Elixir app.
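The row filtering and the content string can be tried out without Whoosh. The following sketch uses a few made-up sample rows with only the three columns we read (the real file has more), and build_documents is a hypothetical stand-in for the loop in create_index():

```python
import csv
import io

# Made-up sample rows; only the columns used in this post are included.
SAMPLE_CSV = """city_name,subdivision_1_name,country_name
Berlin,Land Berlin,Germany
,Bavaria,Germany
,,France
Berlingen,Flanders,Belgium
"""


def build_documents(csv_text):
    """Return one dict per city row, skipping state and country rows."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        city = row.get('city_name')
        if not city:
            # No city name: this row describes a state or a country.
            continue
        state = row.get('subdivision_1_name')
        country = row.get('country_name')
        documents.append({
            'city': city,
            'state': state,
            'country': country,
            'content': ' '.join([city, state, country]),
        })
    return documents


for document in build_documents(SAMPLE_CSV):
    print(document['content'])
```

Only the two rows with a city name end up as documents.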

Searching for Cities

Let’s now implement the function for making a search request:

# lib/python/geolite2.py

from whoosh.index import open_dir
from whoosh.qparser import QueryParser
from whoosh.query import Prefix

# ...

def search(query, count=10):
    """ Searches for the given query and returns `count` results """

    index = open_dir(index_path(CITY_INDEX))

    with index.searcher() as searcher:
        parser = QueryParser('content', index.schema, termclass=Prefix)
        parsed_query = parser.parse(query)
        results = searcher.search(parsed_query, limit=count)

        data = [[result['city'],
                 result['state'],
                 result['country']]
                for result in results]

        return data

First, we open the city index that we created with create_index(). Then we build an instance of whoosh’s QueryParser in order to parse our query using our city schema. We pass termclass=Prefix so that a document matches when it contains a term that starts with the given query text (see the whoosh.query.Prefix docs). The parsed query is then passed to a searcher, which runs the search and compiles the results for us. To keep it simple we collect the needed data in a list of lists. This is the data we are going to receive in our Elixir function in a moment.
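As a mental model for prefix matching, consider the simplified sketch below. It deliberately ignores n-grams, scoring, and everything else Whoosh does, and it assumes that all query words have to match. matches_prefix_query and the sample strings are made up for illustration:

```python
def matches_prefix_query(query, content):
    """True if every query word is a prefix of some word in content."""
    words = content.lower().split()
    return all(
        any(word.startswith(term) for word in words)
        for term in query.lower().split()
    )


# Sample content strings in the "<city> <state> <country>" format
CONTENTS = [
    "Berlin Land Berlin Germany",
    "Berlingen Flanders Belgium",
    "Falkenberg Land Berlin Germany",
    "Paris Ile-de-France France",
]

hits = [c for c in CONTENTS if matches_prefix_query("Berlin", c)]
print(hits)
```

In this model a query such as "Berlin" also finds Berlingen (same prefix) and Falkenberg (the state name matches), and a second query word like "Germany" narrows the hits down further.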

The Elixir Part

Our Elixir API will look pretty much the same as the Python API:

# lib/elixir_python/geolite2.ex

defmodule ElixirPython.GeoLite2 do
  def create_index do
    # We will add code here in some more minutes...
  end

  def search(query, count \\ 10) do
    # We will add some code here later...
  end
end

Calling Python

To prepare for calling our Python functions from Elixir, add a python_call/3 function to the ElixirPython module. It creates a Python instance for us and runs the Python code we provide it with.

# lib/elixir_python.ex

defmodule ElixirPython do
  use Export.Python

  @python_dir "lib/python" # <-- this is the dir we created before

  def python_call(file, function, args \\ []) do
    {:ok, py} = Python.start(python_path: Path.expand(@python_dir))
    Python.call(py, file, function, args)
  end
end

We make use of Export’s Python module. Python.start/1 returns a tuple including a Python instance. In order to pick up our modules we pass the path to our Python directory as base path. Python.call/4 takes care of calling the given Python function from the respective module file.

Creating the City Index

We use the python_call function we just defined to run the create_index function in Python:

# lib/elixir_python/geolite2.ex

defmodule ElixirPython.GeoLite2 do
  import ElixirPython, only: [python_call: 2]

  @python_module "geolite2"

  def create_index do
    python_call(@python_module, "create_index")
  end

  # ...
end

Searching for Cities

To run a search query we use our python_call function again, this time to call the Python search function we defined. The returned value is a list of lists holding the stored index data. We loop over it and build a map for each entry:

# lib/elixir_python/geolite2.ex

defmodule ElixirPython.GeoLite2 do
  import ElixirPython, only: [python_call: 2, python_call: 3]

  @python_module "geolite2"

  # ...

  def search(query, count \\ 10) do
    results = python_call(@python_module, "search", [query, count])

    for [city, state, country] <- results do
      %{city: "#{city}", state: "#{state}", country: "#{country}"}
    end
  end
end

That concludes our hunt for a city search index, and we can use it now. Make sure you have activated your Python virtualenv, then open up iex and give it a try:

source venv/bin/activate
iex -S mix
iex(1)> ElixirPython.GeoLite2.create_index()

iex(2)> ElixirPython.GeoLite2.search("Berlin", 3)
[%{city: "Berlin", country: "Germany", state: "Land Berlin"},
 %{city: "Berlingen", country: "Belgium", state: "Flanders"},
 %{city: "Falkenberg", country: "Germany", state: "Land Berlin"}]

Yay. It works!

Let’s try another one:

iex(3)> ElixirPython.GeoLite2.search("San José")
[]

Hm, why is that? We couldn’t find any results, although San José is definitely in the index. It’s because our system does not normalise special characters and accents yet. Let’s do this as a final step.

Handling Special Characters

On the Elixir side this is easy to do. There is an Erlang library called iconv. Let’s add it to our mix.exs:

# mix.exs

#...
defp deps do
  [{:export, "~> 0.1.0"},
   {:iconv, "~> 1.0"}]
end

Then install the dependency with:

mix deps.get

Let’s now preprocess the query before we pass it to our Python function:

# lib/elixir_python/geolite2.ex

defmodule ElixirPython.GeoLite2 do
  # ...

  def search(query, count \\ 10) do
    query = clean_text(query)
    # ...
  end

  defp clean_text(text) do
    :iconv.convert("utf-8", "ascii//translit", text)
  end
end

When we rerun our query now we get the wanted city in the results:

iex(4)> ElixirPython.GeoLite2.search("San José")
[%{city: "San José", country: "Costa Rica", state: "Provincia de San Jose"},
# ...
]

We still have some problems with, for example, German cities like Görlitz that use umlauts, so let’s transform them to their ASCII counterparts before creating the index:

# lib/python/geolite2.py

import unicodedata

# ...

def _add_document(row, writer):
    """ Writes the data to the index """

    # ...

    # clean up the content that goes to the index by using _cleanup_text():
    content = _cleanup_text(' '.join([city, state, country]))

    writer.add_document(
        city=city,
        state=state,
        country=country,
        content=content
    )


def _cleanup_text(text):
    """ Removes accents and replaces umlauts """

    replaces = [
        ['ä', 'ae'],
        ['ö', 'oe'],
        ['ü', 'ue'],
        ['Ä', 'Ae'],
        ['Ö', 'Oe'],
        ['Ü', 'Ue'],
        ['ß', 'ss']
    ]

    for original, replacement in replaces:
        text = text.replace(original, replacement)

    text = unicodedata.normalize('NFKD', text)
    text = ''.join([char for char in text if not unicodedata.combining(char)])
    text = text.encode('ascii', 'ignore').decode('ascii')

    return text

If we recreate our index and run a Görlitz query now, we will get some fitting results.
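To check the normalisation in isolation, without rebuilding the index, a compact variant of the cleanup function can be run on its own. This sketch uses str.translate instead of the replacement loop; it is meant as an equivalent variant for experimenting, not as a change to the code above:

```python
import unicodedata

# Same umlaut replacements as in _cleanup_text, as a translation table
UMLAUTS = str.maketrans({
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ß': 'ss',
})


def cleanup_text(text):
    """Replace umlauts, strip accents, and drop remaining non-ASCII characters."""
    text = text.translate(UMLAUTS)
    # NFKD splits accented characters into base character + combining mark.
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(char for char in text if not unicodedata.combining(char))
    return text.encode('ascii', 'ignore').decode('ascii')


for name in ['Görlitz', 'San José', 'Weißenfels']:
    print(name, '->', cleanup_text(name))
```

Note that the umlaut replacement has to run before the NFKD step; otherwise "ö" would be decomposed and reduced to a plain "o" instead of "oe".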

Wrapping Up

We managed to build a small local full-text index for a city search without too much effort. Our system returns great search results, and it lets you search for cities with or without special characters.

All in all, it does not scale very well, though. I tried to build another, more sophisticated city index including about 4.4 M cities and villages worldwide, along with the location coordinates (latitude and longitude) and the time zone for each place. If you are interested, you can find the script for combining the city data and location data in this gist. Building the index took quite some time (about 40 minutes on my laptop) and resulted in an index of 1.3 GB (compared to ~29 MB for the GeoLite2 index). Although it also worked well and returned fitting search results, a single request took about 10 seconds to finish. This approach would need some additional caching and further optimisation in order to be useful.
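One cheap form of caching, on the Python side, would be to memoise repeated queries. The sketch below uses functools.lru_cache; slow_search is a hypothetical stand-in for the real index lookup, and I have not measured how much this would help with the large index:

```python
from functools import lru_cache

CALLS = []  # records which queries actually reached the "index"


def slow_search(query, count):
    """Hypothetical stand-in for the expensive Whoosh lookup."""
    CALLS.append(query)
    return [[query.title(), 'Some State', 'Some Country']][:count]


@lru_cache(maxsize=1024)
def cached_search(query, count=10):
    # Return tuples so that cached results cannot be mutated by callers.
    return tuple(tuple(row) for row in slow_search(query, count))


cached_search('berlin')
cached_search('berlin')  # served from the cache, no second lookup
print(CALLS)
```

This only helps with repeated identical queries; the first request for each query still pays the full price.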

So, eventually, I ended up using the Google Places API anyway 😉. But, hey: “Wieder was gelernt.”