OSL Musical Chairs
Update 2011-11-05: Quite a lot of this is out of date now. I've done a lot of tweaking to deal with strange corner cases, but this remains a description of the general idea behind the matching. Since this was last updated I more-or-less rewrote the fuzzystrmatch levenshtein function to add some custom hacks and optimizations. Spaces now only count as half an edit in the levenshtein distance, as spacing differences are common and not quite as significant as spelling differences. The OSM fields name, alt_name, name:en, name:cy and name:gd are all taken into account. Accented characters are normalized. As a result this now works pretty damn well with Welsh and Gaelic names. The best documentation of the algorithm is probably the source itself or the page on the OSM wiki that contains a reasonably up-to-date FAQ.
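As an illustration of the space-discounting tweak, here is a toy Python equivalent - not the patched fuzzystrmatch C function itself, and the exact costing details are my own guesses, but the half-edit idea is the same:

def weighted_levenshtein(a, b, space_cost=0.5):
    # Plain dynamic-programming edit distance, except that any edit involving a
    # space (insert, delete or substitute) costs space_cost instead of 1.
    def step(ch):                      # cost of inserting or deleting ch
        return space_cost if ch == " " else 1.0

    def sub(ch_a, ch_b):               # cost of substituting ch_a with ch_b
        if ch_a == ch_b:
            return 0.0
        return space_cost if " " in (ch_a, ch_b) else 1.0

    prev = [0.0]
    for ch in b:                       # first row: build b purely by insertions
        prev.append(prev[-1] + step(ch))
    for ch_a in a:
        cur = [prev[0] + step(ch_a)]   # first column: delete everything seen so far
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + step(ch_a),            # delete ch_a
                           cur[j - 1] + step(ch_b),         # insert ch_b
                           prev[j - 1] + sub(ch_a, ch_b)))  # substitute
        prev = cur
    return prev[-1]

With this, weighted_levenshtein("lark hill", "larkhill") comes out as 0.5 rather than 1, so spacing-only differences barely count against a match.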
The OS Locator dataset is a great resource for UK OpenStreetMappers. It's essentially a gazetteer of street names throughout Great Britain, with a six-figure national grid referenced bounding box for each. Thing is, it's not in a particularly useful form as it stands - it's really just a CSV file. We need to be able to know where there are disagreements between OSL and OSM and, perhaps more importantly, find street names that are missing altogether from OSM.
So I set about creating a mapping between OSL streets and OSM streets[1]. Using that, it is possible to go through the list of OSL streets and determine whether each is present in OSM, missing from OSM, or disagrees with OSM in some way.
For now I am mostly interested in generating the correspondence list with as few incorrect matches as possible. Doing so required quite a lot of tuning and heuristics to match streets that may be misnamed, mis-spelt, mis-located, etc. in either one or both of the datasets. Fortunately this is still well within the capabilities of PostgreSQL, one of the tools of choice of the OSM community.
Finding local candidates
It's impractical really to test for matching between every possible pair of OSL and OSM streets (there are around 800k entries in OSL and around 700k OSM streets in GB), so we must select our matching candidates carefully. Given a bounding box for each OSL street and each OSM street, for each named OSL entry we need to look for potential matches within a certain distance.
Using a PostGIS r-tree index for the OSM dataset's bounding boxes, we look for OSM streets whose bounding boxes are within a certain radius (it's not really a radius, it's an expanded bounding box) of each OSL street. This generates a candidate list for each OSL street. (This is actually logically the same as looking for OSL streets within an expanded bounding box of an OSM street bounding box, so, for now at least, the mapping is symmetrical.)
I typically use 300m for this radius. Pick a radius too high and:
- We generate a large number of matches for the next stage.
- We risk matching against a genuine, very similarly named street nearby.
Pick a radius too low and we risk missing a corresponding OSM street due to:
- It being in slightly the wrong place
- Often OSL's bounding box will only be for a subset of the road. Sometimes, on a long road, an OSMer will only put the name tag on a section of the road - often on the end of the road leading into a populated place. So basically OSM and OSL have each chosen to apply a name to just a subset of the road, and of course these two subsets may not agree.
On top of these problems, we have the fact that bounding boxes, while they are great for indexing and querying, are not so great geometrically. They are by no means orientation invariant: a box hugs an axis-aligned street tightly but grossly over-covers a diagonal one. For these reasons I didn't use the absolute distance between potential matches very heavily in the rest of the algorithm.
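As a rough sketch of what that candidate query looks like - the table and column names (osl_street, osm_street, bbox) and the psycopg2 plumbing are hypothetical, but the ST_Expand + && pattern is what lets PostGIS answer the overlap test from the r-tree (GiST) index:

import psycopg2

# Hypothetical schema: osl_street(id, name, bbox) and osm_street(id, name, bbox),
# both stored in a metric projection such as OSGB (EPSG:27700).
CANDIDATE_SQL = """
SELECT osl.id AS osl_id, osm.id AS osm_id
FROM   osl_street AS osl
JOIN   osm_street AS osm
  -- '&&' is a bounding-box overlap test; ST_Expand grows the OSL box by the
  -- search radius (in projection units, i.e. metres here).
  ON   osm.bbox && ST_Expand(osl.bbox, %(radius)s)
"""

def candidate_pairs(conn, radius=300.0):
    # Return (osl_id, osm_id) pairs whose bounding boxes lie within `radius` of each other.
    with conn.cursor() as cur:
        cur.execute(CANDIDATE_SQL, {"radius": radius})
        return cur.fetchall()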
Fuzzy name matching
Once we have a candidate list - a list of local OSM streets - for each OSL street, we use the levenshtein distance between the two names as a measure of their similarity[2]. Levenshtein distance is a measure of edit distance between two strings - the minimum number of character deletions, replacements or additions needed to transform one into the other. Before comparison, both names undergo the same normalization (there's a rough code sketch after the list):
- Conversion to lowercase. Capitalization is erratic between mappers and is of no interest to me. Besides, all OSL streetnames ARE IN ALL CAPS ANYWAY. I want to consider names of any capitalization a match.
- Removal of punctuation. Punctuation is erratic in both OSM mappers' entries and OSL entries. The significant culprits include:
  - The possessive apostrophe: Robert's Way vs Roberts Way
  - The abbreviation full stop: St. vs St
- Removal of extraneous spaces such as leading spaces, trailing spaces or multiple consecutive spaces sometimes caused by the punctuation removal.
- Abbreviation of common words, such as:
  - saint to st
  - road to rd
  - street to st
  - close to cl
  - way to w
  - hill to h
  - etc...
  - Some, such as saint/st, are, again, erratic. It has long been a point of discussion in the OSM community whether St. Andrews Road should really be entered as Saint Andrews Road. From the looks of the OSL dataset, the same confusion exists within Ordnance Survey and local councils. Normalizing every saint to st allows us to match any combination of the variations. For abbreviations such as close/cl, while it is still largely preferred to enter the word in full (I would consider a close entered as cl an error), I still want to match either variation.
  - Similarity or difference between these common words is less significant than other words. Let's see what I mean by this.

    >>> levenshtein("brick road", "brick street")
    5
    >>> levenshtein("brick road", "inkly road")
    5

    According to our simple levenshtein metric, Brick Road is just as likely to be Inkly Road as Brick Street. We of course know that it's far more likely that someone's mislabeled Brick Road as Brick Street (or vice-versa) than entered Inkly Road as Brick Road. So the Road/Street part is a less significant error to us than getting the primary name wrong. Abbreviating these words allows us to weight the error differently.

    >>> levenshtein("brick rd", "brick st")
    2
    >>> levenshtein("brick rd", "inkly rd")
    5

    Here, the levenshtein distance on our normalized names makes it clear that Brick Road and Inkly Road are further apart. This has really allowed us to warp space in the levenshtein dimensions (if you care to imagine such a thing) so that names differing only by road/street seem closer than they otherwise would.
  - Spelling errors in these common words are less common. This isn't really an argument in its favour, just another reason it doesn't break many situations.
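To make the normalization concrete, here is a minimal Python sketch of the steps above. The function name, the regular expressions and the (incomplete) abbreviation table are illustrative assumptions rather than the exact ones used in the script.

import re

# A few of the substitutions described above; the real list is longer.
ABBREVIATIONS = [
    ("saint", "st"), ("road", "rd"), ("street", "st"),
    ("close", "cl"), ("way", "w"), ("hill", "h"),
]

def normalize(name, abbreviate=True):
    name = name.lower()                       # capitalization is uninteresting
    name = re.sub(r"[^a-z0-9 ]", "", name)    # drop apostrophes, full stops, etc.
    name = re.sub(r" +", " ", name).strip()   # collapse extraneous whitespace
    if abbreviate:
        for word, abbr in ABBREVIATIONS:
            # only abbreviate whole, space-delimited words
            name = re.sub(r"\b%s\b" % word, abbr, name)
    return name

With this, normalize("St. Andrew's Road") and normalize("Saint Andrews Road") both come out as st andrews rd, so any combination of the variations will match exactly.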
We also generate a normalization where we skip the abbreviation step. We then choose the lower of the two levenshtein distances: the distance between the abbreviated normalizations and the distance between the non-abbreviated normalizations. This is a heuristic I put in because I found it let us match names with spacing errors that would otherwise trip up the regular expressions doing the abbreviation substitutions.
The only reliable way for the normalization to do the abbreviation is to look for words to abbreviate that begin and end with whitespace. Lark Hill will be normalized to lark h, whereas Larkhill will be normalized to larkhill - the levenshtein distance between the two will be artificially high if we only use the abbreviated normalized form. So we use min(levenshtein(abbr_norm_a, abbr_norm_b), levenshtein(unabbr_norm_a, unabbr_norm_b))[3] as our distance metric.
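In code, assuming the normalize() helper sketched above and some levenshtein() implementation (the script leans on PostgreSQL's fuzzystrmatch; the python-Levenshtein package is an adequate stand-in for experimenting), the metric might look like:

from Levenshtein import distance as levenshtein  # stand-in for fuzzystrmatch's levenshtein()

def name_distance(a, b):
    # Lower of the edit distances over the abbreviated and unabbreviated normalizations.
    return min(
        levenshtein(normalize(a, abbreviate=True), normalize(b, abbreviate=True)),
        levenshtein(normalize(a, abbreviate=False), normalize(b, abbreviate=False)),
    )

For the Lark Hill / Larkhill example, the abbreviated forms (lark h vs larkhill) are distance 4 apart, but the unabbreviated forms bring the metric back down to 1.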
Candidate selection
So we now have a (symmetrical) mapping between OSL and OSM streets with a distance associated with each mapping. It would be simple to take the top match for each OSL street and be done with it, but there are some added nuances.
It would be nice to think that town planners and our forefathers decided not to put similarly named roads within 300m of each other, but that's really not the case. In particular, we have a lot of Barnstaple Road / Barnstaple Close type combinations where not only are the streets nearby, they are often connected. Suppose we were missing Barnstaple Close in OSM. The OSL street looking for a match would find barnstaple rd as its nearest match, with a levenshtein distance of 2, and think this is just a mislabeled Barnstaple Close. But of course it isn't - Barnstaple Road is a road in its own right.
So we need a mechanism for an OSM street to be able to boot an OSL street off it if it is claimed by an OSL street with a better match. A simpler way of looking at it is - for a match to be considered valid, an OSM street must be an OSL street's (co-)first choice and that OSL street must be the OSM street's (co-)first choice. A bit like an incredibly brutal dating service.
This is quite easy with PostgreSQL 8.4's new window functions[4].
It's quite possible for two different, non-perfect OSM streets to match an OSL street equally well (or equally poorly). For example, consider an OSL street named JOHNSTON STREET having in its candidate list both Johnson Street and Johnstone Street - both are Levenshtein distance 1 away, and so, providing there are no exact matches, either could be chosen as the match. This is not desirable as we want our results to be fully deterministic and repeatable, so a series of tie-breakers is used to make the final match selection more stable: ref= equality, spatial distance, and finally a resort to the OSM street id. Better match stability makes it easier to keep track of where actual relevant changes have been made to the OSM planet between runs.
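A sketch of that selection, again with hypothetical table and column names: given a candidate table holding one row per (osl_id, osm_id) pair with its name distance and tie-breaker columns, rank each pair from both sides with window functions and keep only the pairs that come first in both partitions.

MUTUAL_FIRST_CHOICE_SQL = """
WITH ranked AS (
    SELECT osl_id, osm_id, dist,
           -- each OSL street ranks its local OSM candidates...
           rank() OVER (PARTITION BY osl_id
                        ORDER BY dist, ref_matches DESC, spatial_dist, osm_id) AS osl_rank,
           -- ...and each OSM street ranks the OSL entries courting it
           rank() OVER (PARTITION BY osm_id
                        ORDER BY dist, ref_matches DESC, spatial_dist, osl_id) AS osm_rank
    FROM candidate
)
SELECT osl_id, osm_id, dist
FROM ranked
WHERE osl_rank = 1   -- the OSM street is this OSL street's first choice...
  AND osm_rank = 1;  -- ...and the OSL street is that OSM street's first choice
"""

The extra ORDER BY terms (a hypothetical ref_matches flag, the spatial distance, and finally the street ids) stand in for the tie-breakers above, so the winner in each partition is deterministic.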
Code
I called the python script I hacked up to do this OSL musical chairs, because I thought it was funny. I was surprised at the good results I got from this simple approach, as I was fully prepared to have to write some sort of simulated annealing algorithm to get all the street matches to agree with each other. Turns out, all the vaguely closeish matches in the world don't matter if the corresponding street just plain ain't there.
The code is up on bitbucket.org, licensed GPLv3, and is integrated with a web interface to the results. OSL streets are shown as boxes, coloured depending on their match status: green for perfect or near-perfect matches, fading to blue for matches with problems, and red for unmatched entries. Apologies to colourblind users. Boxes can be selected using their top-right handles. A match history is shown for streets that have had changes.
- Python (2.5 definitely works)
- (Geo)Django 1.2
- PostgreSQL 8.4 with contrib module fuzzystrmatch registered into your candidate database. This is probably available in your distribution's postgresql-contrib package.
- PostGIS (1.5 definitely works)
Remaining Problems
- The web interface does its job OK, but:
  - While it's good enough in a browser, it would be much nicer to have the same sort of functionality directly in an editor, so you don't have to pan two views around to the same area. But I can't really see any way of doing it with current editors.
  - I would love to add some sort of functionality that allowed users to add comments and manually change the status of entries - mark them as incorrect, inappropriate for use[5], and so on. This is a bit of a fantasy though - there are problems with how to do user authentication for a start.
- ref=s are handled a bit strangely at the moment. There is no flexibility when matching these refs as there is with name=. name= still takes priority in cases where both are provided by an OSL entry. In these situations, ref= is mostly used as a tie-breaker. There are still some sticky situations that users may notice where both OSL and OSM entries have both properties but it's a bit difficult to decide what the policy should actually be here.
- An interesting thing I've learnt testing this technique on my hometown is that dictaphone mapping (at least for me) sucks quite badly. It would appear I've made plenty of phonetically-similar spelling mistakes when reviewing my dictaphone recordings. I think I'll use photo mapping or pen & paper mapping from now on.
Footnotes
1. By OSM street I mean an OSM way with a highway= tag and a name= or ref=.
2. This is all helped by PostgreSQL's contrib section having a fuzzystrmatch module that includes a levenshtein() function.
3. Except it's SQL, so it's LEAST(). Can't expect any sane syntax can we?
4. Nothing to do with DSP window functions.
5. My favourite is 132KV SWITCH HOUSE ROAD.