A consumer who visits a map-based real estate website such as Trulia expects the application to meet some basic standards of accuracy and consistency. For example, the markers on the map should correspond to the actual locations of the properties. Simple, right? That’s what we thought too…
To plot a point on a map, you need coordinates: a longitude and a latitude. Converting a property’s address (street number, street name, city, state, ZIP) to a long/lat pair, a process known as geocoding, is more complicated than you’d think. There’s no single, authoritative source for this type of information. Tell us your home address and try 10 different geocoding solutions; you’ll probably get 10 slightly different sets of coordinates. If you’re lucky, they’ll all be within a radius of a few hundred feet.
This seems like the kind of problem that the federal government should handle. And it does… kind of. The U.S. Census Bureau maintains a dataset called TIGER that features an accurate representation of all of the streets, roads, avenues, etc. in the country and their intersections. But there are a couple of problems: first, they provide the data but no interface for actually making use of it; and second, the data represents the roads but not the individual properties. For the first problem, open source comes to the rescue: Schuyler Erle and Rich Gibson of Locative Technologies have developed a Perl-based application for querying the TIGER dataset that is available at http://geocoder.us. For the second problem, the best we can do is address interpolation: given the coordinates of the two intersections nearest a property, address interpolation estimates the property’s coordinates based on the street number. For example, if Nardinelli Avenue intersects Calderon Road at 400 Nardinelli and it intersects Inkinen Road at 500 Nardinelli, interpolation would estimate that 468 Nardinelli is 68% of the way along the line connecting the coordinates at 400 Nardinelli to the coordinates at 500 Nardinelli. In urban areas with straight roads organized in grids, these approximations are quite accurate; in rural areas with fewer intersections and curvier roads, however, address interpolation can yield very inaccurate geocodes.
Occasionally you can even get embarrassing snafus like a single-family home plotted right smack in the middle of a major river.
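To make the arithmetic behind address interpolation concrete, here is a minimal sketch in Python. The function name and the coordinates for the two intersections are made up for illustration; only the street numbers (400, 468, 500) come from the example above. It assumes a straight road segment, which is exactly the assumption that breaks down on curvy rural roads.

```python
def interpolate_address(number, start, end):
    """Estimate (lat, lon) for a street number on a straight segment.

    start and end are (street_number, lat, lon) tuples for the two
    intersections that bracket the address. Hypothetical helper, not
    part of any real geocoder.
    """
    n0, lat0, lon0 = start
    n1, lat1, lon1 = end
    if n0 == n1:
        raise ValueError("intersections must have different street numbers")

    # Fraction of the way from the first intersection to the second.
    t = (number - n0) / (n1 - n0)
    if not 0.0 <= t <= 1.0:
        raise ValueError("street number lies outside the segment")

    # Linear interpolation of each coordinate.
    return (lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0))


# Invented coordinates for the Calderon Road and Inkinen Road intersections.
calderon = (400, 37.0000, -122.0000)
inkinen = (500, 37.0010, -122.0020)

lat, lon = interpolate_address(468, calderon, inkinen)
```

With these invented endpoints, 468 Nardinelli lands 68% of the way along the segment. If the real road bends between the two intersections, the estimate can end up well off the pavement.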
For those who are not content to ride the emotional rollercoaster of address interpolation, there is yet hope: various private companies such as Navteq have assembled datasets that boast point-level accuracy (down to the individual properties). Some of the major Internet companies have licensed these datasets and created interfaces to the data through proprietary geocoding applications. Google, Yahoo! and Microsoft all offer free geocoders that software developers can incorporate into their applications through HTTP-based APIs. The accuracy and coverage of these geocoders are generally high, but each option has limitations on usage and/or responsiveness: Google imposes a strict limit of no more than one query per 1.725 seconds; Yahoo! only allows 5,000 queries per day; and Microsoft’s servers are often very slow to respond (though they also offer a batch mode that is supposedly much more efficient).
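If you batch-geocode against a service with a minimum interval between queries, you need some kind of client-side throttle. The sketch below is one simple way to do it; the class name is our own invention, and the 1.725-second figure is just the limit quoted above, not something read from any API. The clock and sleep functions are injectable so the logic can be exercised without actually waiting.

```python
import time


class MinIntervalThrottle:
    """Block until at least min_interval seconds separate two calls.

    Hypothetical helper for illustration; real services may enforce
    their limits differently.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep if needed; return how many seconds were slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


# Usage: call throttle.wait() immediately before each outgoing query.
throttle = MinIntervalThrottle(1.725)
```

A daily quota such as 5,000 queries would need a separate counter that resets each day; this sketch only handles spacing between requests.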
Keep in mind that all these APIs use the web as a pipeline to exchange information with your application and each request (except those that are locally cached) will involve considerable latency. If your application geocodes addresses on the fly and your users expect quick response times, you may be better off using a locally-installed geocoder such as the TIGER-based system mentioned above.
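Since only cached requests avoid the network round trip, a small local cache in front of whatever geocoder you call can help a lot. Here is a minimal in-memory sketch; `CachingGeocoder` and the `lookup` callable are hypothetical names, and a real application would likely want a persistent store and smarter address normalization than lowercasing and collapsing whitespace.

```python
class CachingGeocoder:
    """Wrap any address -> coordinates callable with an in-memory cache."""

    def __init__(self, lookup):
        self._lookup = lookup  # the slow, remote geocoding call (assumed)
        self._cache = {}

    @staticmethod
    def _key(address):
        # Crude normalization so trivially different spellings share an entry.
        return " ".join(address.lower().split())

    def geocode(self, address):
        key = self._key(address)
        if key not in self._cache:
            self._cache[key] = self._lookup(address)
        return self._cache[key]
```

Repeated lookups for the same address then cost a dictionary access instead of an HTTP request.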
Of course, if you’re willing to shell out a little (or a lot) of that VC funding, the options get more attractive. For example, Group1 Software markets a software suite that features point-level geocoding and can run on your local Linux server. Another option is the Eagle Geocoder from Tele Atlas, which runs remotely but allows you to purchase geocodes on a per-unit basis.
With so many choices, it’s tempting to mix and match your geocoder depending on the location or to escalate failed geocodes from a free service to a paid one. This type of tiered strategy can improve your success rate while keeping costs low, but it comes with a trade-off: many geocoders have a consistent bias in one direction relative to the map on which you’ll be plotting the coordinates. If all of your geocodes come from one source, you may be able to correct the bias systematically; if you use multiple geocoders, this correction becomes difficult or impossible.
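For what it’s worth, the escalation logic itself is easy to sketch. The function and the geocoder names below are hypothetical, and each geocoder is assumed to be a callable that returns a `(lat, lon)` tuple or `None` on failure. The sketch returns the name of the source alongside the coordinates, so you at least know which results came from where when bias becomes an issue.

```python
def geocode_with_fallback(address, geocoders):
    """Try each (name, lookup) pair in order; return (name, coords) or None.

    geocoders is an ordered list, cheapest first. Illustrative only.
    """
    for name, lookup in geocoders:
        try:
            coords = lookup(address)
        except Exception:
            # Broad catch for the sketch; real code should catch the
            # specific errors its geocoder clients raise.
            coords = None
        if coords is not None:
            return name, coords
    return None


# Stub geocoders standing in for a free service and a paid one.
def free_geocoder(address):
    return None  # pretend the free service could not resolve it


def paid_geocoder(address):
    return (37.0, -122.0)  # invented coordinates


result = geocode_with_fallback(
    "468 Nardinelli Ave",
    [("free", free_geocoder), ("paid", paid_geocoder)],
)
```

Knowing the source does not make the bias problem go away, but it keeps you from mixing results blindly.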
A couple of final caveats: unless the sources of your address data are absolutely perfect, you’ll need to do some address parsing and cleaning before you even send your query to the geocoder. And it’s a good idea to do a little sanity checking afterward, too — the TIGER data has an unfortunate habit of occasionally freaking out and, say, plotting a San Francisco home in the middle of the Sahara desert.
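One cheap sanity check is a bounding box: reject any geocode that falls outside the region where the property could plausibly be. The sketch below uses a rough box around the contiguous United States; the bounds are approximate values chosen for illustration, and the function name is our own. A tighter box per state or ZIP code would catch more errors.

```python
# Approximate bounds for the contiguous U.S. (illustrative, not authoritative).
CONTIGUOUS_US = {
    "min_lat": 24.0,
    "max_lat": 50.0,
    "min_lon": -125.0,
    "max_lon": -66.0,
}


def within_bounds(lat, lon, bounds=CONTIGUOUS_US):
    """Return True if the point lies inside the bounding box."""
    return (
        bounds["min_lat"] <= lat <= bounds["max_lat"]
        and bounds["min_lon"] <= lon <= bounds["max_lon"]
    )


san_francisco = (37.77, -122.42)  # roughly downtown San Francisco
sahara = (23.0, 13.0)             # somewhere in the Sahara

ok = within_bounds(*san_francisco)
bad = within_bounds(*sahara)
```

A geocode that fails the check can be discarded or escalated to another geocoder rather than plotted.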
Regarding the title of this post: yes, a new “Transformers” movie is coming out next summer. Our guts tell us that this movie will change the course of history forever (for both humans and machines). We’re not sure how we’ll get through the 12 intervening months, but thank God (and Samuel L.) for “Snakes On A Plane”.