Building a City Search with Elixir and Python
The other day I was wondering whether there was an easy, self-made, local alternative to something like the Google Places API that I could use in a Phoenix app. I wanted to search for a city and get back the city itself, its state, and the country.
I found the free GeoLite2 city dataset, provided by MaxMind, which I could use to create a city search index.
(In case you want to dive directly into the programmatic materialisation of what I came up with, it is available on GitHub.)
I did a quick search and stumbled upon the searchex project by @andyl. It looked like exactly what I was searching for; however, there is very little documentation yet, so unfortunately I couldn’t really figure out how to get it working.
Then, while thinking about how to approach this, Whoosh, a Python package that I have used at work, came to mind. Whoosh is a library for indexing text and searching the index. It is pretty easy to set up and delivers great search results with little effort.
With this in mind, I wondered whether there was a way to call Python code from Elixir. After some further research, I found the Erlang library erlport, which allows you to call Ruby and Python code from Elixir. There is also an Elixir wrapper for it, bearing the fitting name Export. You could also use erlport directly in Elixir, but Export gives you some convenience functions on top and a more Elixir-like feel.
Setting Up a New mix Project & Python virtualenv
In order to get started with our custom city index, let’s set up a prototype mix project called `elixir_python`:

```shell
mix new elixir_python
```

Head to the `mix.exs` file and add the `export` dependency:

```elixir
# mix.exs
# ...
defp deps do
  [{:export, "~> 0.1.0"}]
end
```

Then install the dependencies with:

```shell
mix deps.get
```
For setting up a Python environment you can use virtualenv to create a local virtual environment. Remember to activate it after creating it:

```shell
virtualenv -p python3 venv
source venv/bin/activate
```
We will use Whoosh, so we need a `requirements.txt` next to our `mix.exs` that defines the Python dependencies:

```
# /requirements.txt
whoosh==2.7.4
```

Install the requirements with:

```shell
pip install -r requirements.txt
```
Next, we need a directory where our Python code will live. Let’s create a `lib/python` directory where we will put the `*.py` files later on. You can really put them wherever you want; you just have to point Export to that directory.

In your `lib/python` directory, create a `geolite2.py` file. This is where we will put the code for our city search index. Next, download the GeoLite2 CSV files from dev.maxmind.com and put the English city locations file in the `lib/python/data` directory.

Our Elixir code will live in `lib/elixir_python/geolite2.ex`.
The overall project structure should now look like this:
```
└── elixir_python
    ├── config
    ├── lib
    │   ├── elixir_python
    │   │   └── geolite2.ex
    │   ├── python
    │   │   ├── data
    │   │   │   └── GeoLite2-City-Locations-en.csv
    │   │   ├── __init__.py
    │   │   └── geolite2.py
    │   └── elixir_python.ex
    ├── mix.exs
    ├── requirements.txt
    └── …
```
The Python Part
Our geolite2 Python module will have an API composed of two functions:
```python
# lib/python/geolite2.py

def create_index():
    # We will add code here in some minutes...
    pass


def search(query, count=10):
    # We will add some code here soon...
    pass
```
The first one creates our search index using the GeoLite2 city CSV file. The second lets us search for cities, states or countries and will pass the results back to Elixir.
Indexing the City Data
For each Whoosh index you can define a certain structure, its schema. The schema defines which data you want to store in the index and which full text (or content) you want to run the search on.
Our city schema looks like this:
```python
from whoosh.fields import SchemaClass, TEXT
from whoosh.analysis import NgramWordAnalyzer


class CitySchema(SchemaClass):
    city = TEXT(stored=True)
    state = TEXT(stored=True)
    country = TEXT(stored=True)
    content = TEXT(analyzer=NgramWordAnalyzer(minsize=2), phrase=False)
```
We want to store the city, the state, and the country. The `content` field will hold the full text to search in; in our case it will be the joined city, state, and country name. This allows us to search for cities, states, or countries and to provide multiple query terms to narrow down our results.
We use an `NgramWordAnalyzer` and set the `phrase` argument to `False` in order to save some space (see this Whoosh recipe for more details).
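To get a feeling for what such an analyzer produces: a word n-gram analyzer emits character n-grams of each lowercased word, which is what makes partial matches like `ber` → `Berlin` work. Here is a rough pure-Python sketch of the idea (an illustration only, not Whoosh’s actual implementation):

```python
def word_ngrams(text, minsize=2):
    """ Emit all character n-grams of length >= minsize for each word """
    grams = []
    for word in text.lower().split():
        for size in range(minsize, len(word) + 1):
            for start in range(len(word) - size + 1):
                grams.append(word[start:start + size])
    return grams

grams = word_ngrams('Berlin Germany')
# Partial terms like 'ber' become index terms of their own
assert 'be' in grams
assert 'ber' in grams
assert 'berlin' in grams
assert 'germany' in grams
```

Because every n-gram is indexed as its own term, a query fragment can match a document even if the full word was never typed.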
Before creating the index let’s define our directory names and files we want to use along with some handy functions for building the absolute paths to these files:
```python
# lib/python/geolite2.py
import os

# The base directory where our data lies, relative to this file
DATA_BASE_DIR = 'data'
# The actual city data file
CITY_DATA_FILE = 'GeoLite2-City-Locations-en.csv'
# Our base directory where the index files are stored
INDEX_BASE_DIR = 'index'
# The name of our index
CITY_INDEX = 'city'


def index_path(index_name):
    """ Returns the absolute index path for the given index name """
    index_dir = '{}_index'.format(index_name)
    return os.path.join(current_path(), INDEX_BASE_DIR, index_dir)


def data_file_path(file_name):
    """ Returns the absolute path to the file with the given name """
    return os.path.join(current_path(), DATA_BASE_DIR, file_name)


def current_path():
    """ Returns the absolute directory of this file """
    return os.path.dirname(os.path.abspath(__file__))
```
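A quick check of how the helpers compose paths, as a standalone copy with `current_path()` stubbed to a made-up directory (the real helper derives it from the module’s `__file__`):

```python
import os

DATA_BASE_DIR = 'data'
INDEX_BASE_DIR = 'index'

def current_path():
    # Stubbed to a hypothetical project path for illustration
    return '/app/lib/python'

def index_path(index_name):
    """ Same composition as the helper above """
    index_dir = '{}_index'.format(index_name)
    return os.path.join(current_path(), INDEX_BASE_DIR, index_dir)

def data_file_path(file_name):
    """ Same composition as the helper above """
    return os.path.join(current_path(), DATA_BASE_DIR, file_name)

assert index_path('city') == '/app/lib/python/index/city_index'
assert data_file_path('cities.csv') == '/app/lib/python/data/cities.csv'
```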
Armed with these helpers we can now go ahead and define the actual index creation function. We read the CSV file line by line and add a document to the index for each row. Some lines in the CSV do not represent cities but states or countries, so we skip every line that has no value in the city column:
```python
# lib/python/geolite2.py
import csv
import shutil

from whoosh.index import create_in

# ...


def create_index():
    """ Create search index files """
    path = index_path(CITY_INDEX)
    _recreate_path(path)

    index = create_in(path, CitySchema)
    writer = index.writer()

    with open(data_file_path(CITY_DATA_FILE)) as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            _add_document(row=row, writer=writer)

    writer.commit()


def _recreate_path(path):
    """ Deletes and recreates the given path """
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)


def _add_document(row, writer):
    """ Writes the given row to the index, skipping non-city rows """
    city = row.get('city_name')
    if not city:
        return

    state = row.get('subdivision_1_name')
    country = row.get('country_name')
    content = ' '.join([city, state, country])

    writer.add_document(
        city=city,
        state=state,
        country=country,
        content=content
    )
```
Note that the content (`"<city> <state> <country>"`) is the actual text we analyse and put into the index. The rest of the schema properties are just stored data, which we can access again later in our results and pass on to our Elixir app.
Searching for Cities
Let’s now implement the function for making a search request:
```python
# lib/python/geolite2.py
from whoosh.index import open_dir
from whoosh.qparser import QueryParser
from whoosh.query import Prefix

# ...


def search(query, count=10):
    """ Searches for the given query and returns `count` results """
    index = open_dir(index_path(CITY_INDEX))

    with index.searcher() as searcher:
        parser = QueryParser('content', index.schema, termclass=Prefix)
        parsed_query = parser.parse(query)
        results = searcher.search(parsed_query, limit=count)

        data = [[result['city'],
                 result['state'],
                 result['country']]
                for result in results]

    return data
```
First, we open the city index that we created with `create_index()`. Then we build an instance of Whoosh’s `QueryParser` in order to parse our query using our city schema. We use `termclass=Prefix` here to only match documents that contain a term starting with the given query text (see the whoosh.query.Prefix docs). The parsed query is then passed to a searcher, which finally runs the search and compiles the results for us.

In order to keep it simple, we collect the needed data in a list of lists. This is the data we will receive in our Elixir function in a moment.
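The effect of `termclass=Prefix` can be illustrated without Whoosh: a document matches when any of its indexed terms starts with the query term. A minimal sketch of that matching rule (not Whoosh’s implementation):

```python
def prefix_match(query_term, document_terms):
    """ True if any term in the document starts with the query term """
    query_term = query_term.lower()
    return any(term.lower().startswith(query_term) for term in document_terms)

# Hypothetical terms of one indexed document ("Berlin Land Berlin Germany")
berlin_terms = ['berlin', 'land', 'germany']

assert prefix_match('ber', berlin_terms)      # prefix of 'berlin'
assert prefix_match('germ', berlin_terms)     # prefix of 'germany'
assert not prefix_match('xyz', berlin_terms)  # matches nothing
```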
The Elixir part
Our Elixir API will look pretty much the same as the Python API:
```elixir
# lib/elixir_python/geolite2.ex
defmodule ElixirPython.GeoLite2 do
  def create_index do
    # We will add code here in some more minutes...
  end

  def search(query, count \\ 10) do
    # We will add some code here later...
  end
end
```
Calling Python
To prepare for calling our Python functions from Elixir, add a `python_call/3` function to the `ElixirPython` module. It creates a Python instance for us and runs the Python code we provide it with.
```elixir
# lib/elixir_python.ex
defmodule ElixirPython do
  use Export.Python

  @python_dir "lib/python" # <-- this is the dir we created before

  def python_call(file, function, args \\ []) do
    {:ok, py} = Python.start(python_path: Path.expand(@python_dir))
    Python.call(py, file, function, args)
  end
end
```
We make use of Export’s `Python` module. `Python.start/1` returns a tuple including a Python instance. In order to pick up our modules, we pass the path to our Python directory as the base path. `Python.call/4` then takes care of calling the given Python function from the respective module file.
Creating the City Index
We use the `python_call` function we just defined to run the `create_index` function in Python:
```elixir
# lib/elixir_python/geolite2.ex
defmodule ElixirPython.GeoLite2 do
  import ElixirPython, only: [python_call: 2]

  @python_module "geolite2"

  def create_index do
    python_call(@python_module, "create_index")
  end

  # ...
end
```
Searching for Cities
To run a search query, we use our `python_call` function again to call the Python `search` function we defined. The returned value is a list of lists holding the stored index data. We just loop over it and create maps from it:
```elixir
# lib/elixir_python/geolite2.ex
defmodule ElixirPython.GeoLite2 do
  import ElixirPython, only: [python_call: 2, python_call: 3]

  @python_module "geolite2"

  # ...

  def search(query, count \\ 10) do
    results = python_call(@python_module, "search", [query, count])

    for [city, state, country] <- results do
      %{city: "#{city}", state: "#{state}", country: "#{country}"}
    end
  end
end
```
Running a Search
And with that, our hunt for a city search index is done and we can put it to use. Make sure you have activated your Python virtualenv, then open up iex and give it a try:
```shell
source venv/bin/activate
iex -S mix
```

```elixir
iex(1)> ElixirPython.GeoLite2.create_index()
⌛
iex(2)> ElixirPython.GeoLite2.search("Berlin", 3)
[%{city: "Berlin", country: "Germany", state: "Land Berlin"},
 %{city: "Berlingen", country: "Belgium", state: "Flanders"},
 %{city: "Falkenberg", country: "Germany", state: "Land Berlin"}]
```
Yay. It works!
Let’s try another one:
```elixir
iex(3)> ElixirPython.GeoLite2.search("San José")
[]
```
Hm, why is that? We couldn’t find any results, although San José is definitely in the index. It’s because our system does not normalise special characters and accents yet. Let’s do this as a final step.
Handling Special Characters
On the Elixir side this is easy to do. There is an Erlang lib called iconv. Let’s just add it to our `mix.exs`:
```elixir
# mix.exs
# ...
defp deps do
  [{:export, "~> 0.1.0"},
   {:iconv, "~> 1.0"}]
end
```
Then install the dependency with:

```shell
mix deps.get
```
Let’s now preprocess the query before we pass it to our Python function:
```elixir
# lib/elixir_python/geolite2.ex
defmodule ElixirPython.GeoLite2 do
  # ...

  def search(query, count \\ 10) do
    query = clean_text(query)
    # ...
  end

  defp clean_text(text) do
    :iconv.convert("utf-8", "ascii//translit", text)
  end
end
```
When we rerun our query, we now find the city we were looking for in the results:
```elixir
iex(4)> ElixirPython.GeoLite2.search("San José")
[%{city: "San José", country: "Costa Rica", state: "Provincia de San Jose"},
 # ...
]
```
We still have problems with some cities, e.g. German cities like Görlitz that use umlauts, so let’s transform them to their ASCII counterparts before creating the index:
```python
# lib/python/geolite2.py
import unicodedata

# ...


def _add_document(row, writer):
    """ Writes the given row to the index, skipping non-city rows """
    # ...
    # Clean up the content that goes into the index with _cleanup_text():
    content = _cleanup_text(' '.join([city, state, country]))

    writer.add_document(
        city=city,
        state=state,
        country=country,
        content=content
    )


def _cleanup_text(text):
    """ Removes accents and replaces umlauts """
    replaces = [
        ['ä', 'ae'],
        ['ö', 'oe'],
        ['ü', 'ue'],
        ['Ä', 'Ae'],
        ['Ö', 'Oe'],
        ['Ü', 'Ue'],
        ['ß', 'ss']
    ]
    for original, replacement in replaces:
        text = text.replace(original, replacement)

    text = unicodedata.normalize('NFKD', text)
    text = ''.join([char for char in text if not unicodedata.combining(char)])
    text = text.encode('ascii', 'ignore').decode('ascii')

    return text
```
If we recreate our index and run a `Görlitz` query now, we will get some fitting results.
Wrapping Up
We managed to build a small local fulltext index for a city search without too much effort. Our system returns great search results and lets us search for cities with or without special characters.
All in all, it does not scale very well, though. I tried to build another, more sophisticated city index of about 4.4 million cities and villages worldwide, including the location coordinates (latitude and longitude) and time zone for each place. If you are interested, you can find the script for combining the city data and location data in this gist. Building the index took quite some time (about 40 minutes on my laptop) and resulted in an index file of 1.3 GB (compared to ~29 MB for the GeoLite2 index). Although it also worked well and returned fitting search results, a single request took about 10 seconds to finish. This approach would need some additional caching and further optimisation in order to be really useful.
So, eventually, I ended up using the Google Places API anyway 😉. But, hey: “Wieder was gelernt.” (“Learned something new, again.”)