humanleg.org.uk

contact key code distractions snaps

OSL Musical Chairs

Update 2011-11-05: Quite a lot of this is out of date now. I've done a lot of tweaking to deal with strange corner cases, but this remains a description of the general idea behind the matching. Since this was last updated I more-or-less rewrote the fuzzystrmatch levenshtein function to add some custom hacks and optimizations. Spaces now only count as half an edit in a levenshtein distance, as spacing differences are common and not quite as significant as spelling differences. OSM fields name, alt_name, name:en, name:cy, name:gd are all taken into account. Accented characters are normalized. As a result this now works pretty damn well with welsh and gaelic names. The best documentation of the algorithm is probably the source itself or the page on the OSM wiki that contains a reasonably up-to-date FAQ.

The OS Locator dataset is a great resource for UK OpenStreetMappers. It's essentially a gazetteer of street names throughout Great Britain with a six figure national grid referenced bounding box for each. Thing is, it's not in a particularly useful format for use. It's a CSV file really. We need to be able to know where there are disagreements between OSL and OSM and perhaps more importantly find street names that are missing altogether from OSM.

So I set about creating a mapping between OSL streets and OSM streets1. Using that it is possible to create a list of OSL streets determining whether they are present in OSM, missing from OSM, or disagree with OSM in some way.

For now I am mostly interested in generating the correspondence list with as few incorrect matches as possible. Doing so required quite a lot of tuning and heuristics to match streets that may be misnamed, mis-spelt, mis-located, etc. in either one or both of the datasets. Fortunately this is still well within the capabilities of PostgreSQL, one of the tools of choice of the OSM community.

Finding local candidates

It's impractical really to test for matching between every possible pair of OSL and OSM streets (there are around 800k entries in OSL and around 700k OSM streets in GB), so we must select our matching candidates carefully. Given a bounding box for each OSL street and each OSM street, for each named OSL entry we need to look for potential matches within a certain distance.

Using a PostGIS r-tree index for the OSM dataset's bounding boxes, we look for OSM streets whose bounding boxes are within a certain radius (it's not really a radius, it's an expanded bounding box) of each OSL street. This generates a candidate list for each OSL street. ( This is actually logically the same as looking for OSL streets within an expanded bounding box of an OSM street bounding box, so, for now at least, the mapping is symmetrical. )

I typically use 300m for this radius. Pick a radius too high and:

Pick a radius too low and we risk missing a corresponding OSM street due to:

On top of these problems, we have the fact that bounding boxes, while they are great for indexing and querying, are not so great geometrically. They are by no means feature orientation invariant (coverage-wise they are highly biased against axis-aligned features). For these reasons I didn't use the absolute distance between potential matches very heavily in the rest of the algorithm.

Fuzzy name matching

Once we have a candidate list - a list of local OSM streets - for each OSL street we use levenshtein distance between the two names as a measure of their similarity2. Levenshtein distance is a measure of edit distance between two strings - the minimum number of character deletions, replacements or additions that need to be made to transform one to the other. Before comparison both names undergo normalization in the same way:

We also generate a normalization where we skip the abbreviation step. We then choose the lower of the levenshtein distances between the abbreviated normalizations and the non-abbreviated normalizations. This is a heuristic I put in because I found out it allowed us to more closely match spacing errors which affected the regular expression which substituted the abbreviations.

The only reliable way for the normalization to do the abbreviation is to look for words to abbreviate that begin and end with whitespace. Lark Hill will be normalized lark h whereas Larkhill will be normalized larkhill - the levenshtein distance between the two will be artificially high if we only use the abbreviated normalized form. So we use min ( levenshtein ( abbr_norm_a , abbr_norm_b ) , levenshtein ( unabbr_norm_a , unabbr_norm_b ) )4 as our distance metric.

Candidate selection

So we now have a (symmetrical) mapping between OSL and OSM streets with a distance associated with each mapping. It would be simple to take the top match for each OSL street and be done with it, but there are some added nuances.

It would be nice to think that town planners and our forefathers decided not to put similarly named roads within 300m of each other, but that's really not the case. In particular, we have a lot of Barnstaple Road/Barnstaple Close type combinations where, not only are the streets nearby, often they are connected. Supposing we were missing Barnstaple Close on OSM. An OSL street looking for a match would find as its nearest match barnstaple rd with a levenshtein distance of 2 and think this is just a mislabeled Barnstaple Close. But of course, it isn't. Barnstaple Road is a road in its own right.

So we need a mechanism for an OSM street to be able to boot an OSL street off it if it is claimed by an OSL street with a better match. A simpler way of looking at it is - for a match to be considered valid, an OSM street must be an OSL street's (co-)first choice and that OSL street must be the OSM street's (co-)first choice. A bit like an incredibly brutal dating service.

This is quite easy with PostgreSQL 8.4's new window functions5.

It's quite possible for two differently non-perfect OSM streets to match an OSL street equally well/poorly. For example consider an OSL street named JOHNSTON STREET having in its candidate list both Johnson Street and Johnstone Street - both are Levenshtein distance 1 away, and so, providing there are no exact matches, either could be chosen as the match. This not desirable as we want our results to be fully deterministic and repeatable, so a series of tie-breakers are used to make the final match selection more stable, such as ref= equality , spatial distance , and finally resorting to OSM street id. Better match stability makes it easier to keep track of where actual relevant changes have been made to the OSM planet between runs.

Code

I called the python script I hacked up to do this OSL musical chairs, because I thought it was funny. I was surprised at the good results I got from this simple approach, as I was fully prepared to have to write some sort of simulated annealing algorithm to get all the street matches to agree with each other. Turns out, all the vaguely closeish matches in the world don't matter if the corresponding street just plain ain't there.

Code is up on bitbucket.org. License GPLv3. Code is integrated with a web interface to the results. OSL streets are shown as boxes, coloured depending on their match status. Green for perfect or near perfect matches, fading to blue for matches with problems and red for unmatched entries. Apologies to colourblind users. Boxes can be selected using their top right handles. A match history is shown for streets that have had changes.

Remaining Problems

Footnotes

  1. By OSM street I mean an OSM way with a highway= tag and a name= or ref=.
  2. This is all helped by PostgreSQL's contrib section having a fuzzystrmatch module that includes a levenshtein () function.
  3. Except it's SQL, so it's LEAST(). Can't expect any sane syntax can we?
  4. Nothing to do with DSP window functions.
  5. My favourite is 132KV SWITCH HOUSE ROAD.