Open Source Geocoders

Explorer GeoCoder
  Description: "a data and country independent geocoding engine" which can "assign latitude and longitude coordinates to any United States street address or intersection"
  Maintainer: SRC
  Implementation: C++
  Datastore: ?
  Target Road Network: TIGER
  Address Parsing Technique: Regex +?
  Fuzzy Name Matching Technique: Soundex?
  Provides Configuration: No
  Last Activity: ?

Frost TIGER Geocoder
  Description: A geocoder for TIGER data implemented in PostgreSQL
  Maintainer: Stephen Frost et al
  Implementation: PostgreSQL SQL
  Datastore: PostgreSQL
  Target Road Network: TIGER
  Address Parsing Technique: Regex
  Fuzzy Name Matching Technique: Soundex?
  Provides Configuration: No
  Last Activity: ?

Geo-Coder-US
  Description: For geocoding US addresses, that is, estimating the latitude and longitude of any street address or intersection in the United States, using the TIGER/Line data set
  Maintainer: Schuyler Erle
  Implementation: Perl
  Datastore: ?
  Target Road Network: TIGER
  Address Parsing Technique: Regex
  Fuzzy Name Matching Technique: Metaphone?
  Provides Configuration: No
  Last Activity: ?

Geocoder::US
  Description: A rewrite of Geo-Coder-US into Ruby (and also requiring C and SQLite). "Although it is primarily intended for use with the US Census Bureau's free TIGER/Line dataset, it uses an abstract US address data model that can be employed with other sources of US street address range data"
  Maintainer: GeoCommons
  Implementation: Ruby, C
  Datastore: SQLite
  Target Road Network: TIGER
  Address Parsing Technique: Regex
  Fuzzy Name Matching Technique: Metaphone
  Provides Configuration: No
  Last Activity: ?

JGeoCoder
  Description: A Java API loosely modelled after Geo::Coder::US
  Maintainer: ???
  Implementation: Java
  Datastore: JDBC (H2 image supplied)
  Target Road Network: TIGER
  Address Parsing Technique: Regex, Java code
  Fuzzy Name Matching Technique: Soundex?
  Provides Configuration: No
  Last Activity: 2008

PAGC
  Description: "The Postal Address Geo-Coder (or PAGC) is a library and a CGI based web service written in ANSI C that uses an address-ranged street network shapefile along with one or more postal addresses and provides the longitude/latitude coordinates of each matched address. PAGC has been designed to make it easily extensible to the postal address structure of many Western countries. Out of the box it works with publicly available road network data sets from the US (shapefiles of the US Census Bureau's TIGER/Line data) and Canada (shapefiles of Statistics Canada's Road Network Files)."
  Implementation: C
  Datastore: Berkeley DB
  Target Road Network: TIGER, Stats Can Road Network
  Address Parsing Technique: "based on Aho-Corasick string matching"
  Fuzzy Name Matching Technique: Soundex, Edit Distance
  Provides Configuration: Yes, to some extent
  Last Activity: 2011?

Assessment Criteria

Address Parsing Technique

One of the key challenges in geocoding is parsing free-form address text. Addresses are semi-structured, and typical human-generated addresses exhibit a high degree of variability in structure. In addition, typical address grammars are inherently ambiguous (in particular, street names can contain words which can also be interpreted as directionals or street types). All geocoders need to implement some form of address parsing, in such a way that it deals with complex grammars and ambiguity.

The most direct way of implementing address parsing is to code it directly in a programming language, often making use of language support for features such as regular expressions. The downside of this approach is that the parsing logic is opaque and difficult to modify (both for users and for the application maintainers themselves).
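
As an illustration of the hard-coded approach, here is a minimal Python sketch of a regular-expression address parser. It is not taken from any of the geocoders listed above; the vocabularies, the pattern, and the function name `parse_address` are assumptions made for this example, and a real parser would need far larger word lists and many more cases.

```python
import re

# Hypothetical, deliberately tiny vocabularies (assumptions for this sketch).
DIRECTIONALS = r"N|S|E|W|NE|NW|SE|SW"
STREET_TYPES = r"ST|AVE|BLVD|RD|DR|LN|CT"

# number, optional pre-directional, street name, street type, optional post-directional
ADDRESS_RE = re.compile(
    rf"""^\s*
    (?P<number>\d+)\s+
    (?:(?P<predir>{DIRECTIONALS})\s+)?
    (?P<name>.+?)\s+
    (?P<type>{STREET_TYPES})\.?
    (?:\s+(?P<postdir>{DIRECTIONALS}))?
    \s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_address(text):
    """Return a dict of upper-cased address components, or None if no match."""
    match = ADDRESS_RE.match(text)
    if match is None:
        return None
    return {key: value.upper()
            for key, value in match.groupdict().items()
            if value is not None}


print(parse_address("123 N Main St"))
print(parse_address("45 E St"))  # "E" has to be read as the street name here
```

In the second call the regex engine first tries "E" as a directional, fails to find a street name, and backtracks to treat it as the name. The ambiguity is resolved implicitly by the matching order of the pattern, which is exactly what makes this style hard to understand and to change.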

More sophisticated address parsers define the parsing rules using a grammar-driven algorithm. Ideally, the grammar is exposed using a configuration language, allowing customization to suit different address domains.
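
A grammar-driven parser can be sketched by keeping the lexicon and the rules as data rather than code. The Python example below is an assumption-laden illustration, not the method used by any geocoder listed above: `LEXICON`, `RULES`, the class names (`NUM`, `DIR`, `WORD`, `TYPE`) and the helper functions are all hypothetical. It enumerates every possible labelling of the tokens and keeps those accepted by a rule, which is exponential in the number of ambiguous tokens and therefore only suitable as a demonstration.

```python
from itertools import product

# Hypothetical lexicon: each known word lists every class it may belong to.
LEXICON = {
    "N": {"DIR", "WORD"},
    "E": {"DIR", "WORD"},
    "ST": {"TYPE", "WORD"},
    "AVE": {"TYPE", "WORD"},
}

# Hypothetical grammar: accepted sequences of classes.
# "WORD+" stands for one or more consecutive WORD tokens.
RULES = [
    ("NUM", "WORD+", "TYPE"),
    ("NUM", "DIR", "WORD+", "TYPE"),
    ("NUM", "WORD+", "TYPE", "DIR"),
]


def classes_for(token):
    """All classes a token may take; unknown words are plain WORDs."""
    if token.isdigit():
        return ["NUM"]
    return sorted(LEXICON.get(token, {"WORD"}))


def collapse(labels):
    """Merge runs of WORD into a single WORD+ symbol for rule lookup."""
    symbols = []
    for label in labels:
        symbol = "WORD+" if label == "WORD" else label
        if not (symbol == "WORD+" and symbols and symbols[-1] == "WORD+"):
            symbols.append(symbol)
    return tuple(symbols)


def parse(text):
    """Return every labelling of the tokens that some rule accepts."""
    tokens = text.upper().replace(".", "").split()
    parses = []
    for labels in product(*(classes_for(t) for t in tokens)):
        if collapse(labels) in RULES:
            parses.append(list(zip(tokens, labels)))
    return parses


for result in parse("123 N Main St"):
    print(result)
```

For "123 N Main St" this sketch returns two parses, one reading "N" as a directional and one reading it as part of the street name. This makes the ambiguity explicit, so a real implementation could rank the alternatives, and the lexicon and rules could be loaded from a configuration file to suit a different address domain.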

Fuzzy Name Matching Technique

Another challenging area in geocoding is supporting approximate or fuzzy matching of input street names to the reference street dataset. This is essential to accommodate spelling mistakes and user uncertainty in real-world input. A variety of techniques are commonly used: Soundex, Metaphone, bigrams, etc. One challenge with fuzzy name matching against a large corpus of valid names is obtaining efficient performance. For this an indexing strategy is almost certainly required. Each fuzzy matching approach may require a different indexing technique.
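
To make the link between matching technique and indexing concrete, here is a minimal Python sketch of American Soundex combined with a dictionary index keyed by Soundex code. The function names (`soundex`, `build_index`, `candidates`) and the sample street names are assumptions for this example and do not describe how any of the listed geocoders implements matching.

```python
from collections import defaultdict

# Standard American Soundex consonant groups, mapped to digits 1-6.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def build_index(street_names):
    """Group reference street names by Soundex code for constant-time lookup."""
    index = defaultdict(list)
    for name in street_names:
        index[soundex(name)].append(name)
    return index


def candidates(index, query):
    """Return the reference names that share the query's Soundex code."""
    return index.get(soundex(query), [])


index = build_index(["Maple", "Main", "Mapel", "Washington"])
print(candidates(index, "Mapple"))
```

The index reduces a lookup to a single dictionary access, but it only works because Soundex maps each name to one fixed key. A technique such as bigram similarity or edit distance has no single key per name and would need a different structure, for example an inverted index from bigrams to names.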

Configuration Language

Real-world address models and road network datasets are typically fairly complex and non-uniform. A high-quality geocoder allows customization of various operational parameters in order to support a wider variety of input reference datasets. Configuration parameters can include such things as:

When a large number of configuration parameters are provided, it is likely that the best way to expose them is in a file whose contents are specified by a configuration language.
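
As a rough illustration of file-based configuration, the Python sketch below reads an INI-style configuration with the standard `configparser` module. Every section name, parameter name, and value shown is hypothetical and chosen only for this example; none of them is taken from PAGC or any other geocoder listed above.

```python
import configparser

# Hypothetical configuration text; all names and values are assumptions.
CONFIG_TEXT = """
[reference_data]
street_name_field = name
from_address_field = from_addr
to_address_field = to_addr

[matching]
fuzzy_algorithm = soundex
max_candidates = 10

[street_types]
ST = STREET
AVE = AVENUE
"""


def load_config(text):
    """Parse the configuration text into a plain dictionary of settings."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names such as "ST" case-sensitive
    parser.read_string(text)
    return {
        "name_field": parser.get("reference_data", "street_name_field"),
        "fuzzy_algorithm": parser.get("matching", "fuzzy_algorithm"),
        "max_candidates": parser.getint("matching", "max_candidates"),
        "street_types": dict(parser["street_types"]),
    }


config = load_config(CONFIG_TEXT)
print(config["street_types"])
```

Keeping such settings in a file means a different reference dataset can be supported by editing the field names and abbreviation tables, without touching the geocoder's code.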