Consider not rewriting street names #32
Comments
I'd better defer to @ingalls. You're right that this is a lossy operation, but there's something to be said for having this data as close to ready to go as possible, and to avoid making users re-deduce these cultural variations over and over. Possible solutions:
I don't think 2 buys us much. I think 3 will wind up being pretty helpful now that we've got coverage outside the anglophone world -- alas, there's logic in the expansion function, so this will be more complicated than just specifying a different …
Another solution would be for us to package the expansion code as a utility library for use by OpenAddresses users. Then they could download the original unaltered street names and use our nice code to expand them locally in their application.
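To make the "expand locally" workflow concrete, here's a minimal sketch of what a consumer might do with a spun-off expansion library. The package name `openaddresses_expand`, the `expand_street()` signature, and the file/column names are all assumptions for the illustration, not an existing API.

```python
# Hypothetical sketch only: "openaddresses_expand" and expand_street() do not
# exist as a published package; this is what local expansion could look like
# for a consumer of the raw, unaltered OpenAddresses CSVs.
import csv
from openaddresses_expand import expand_street  # hypothetical library

with open("us-ca-san_francisco.csv") as f:       # raw OpenAddresses download
    for row in csv.DictReader(f):
        display = expand_street(row["STREET"], locale="en_US")
        # e.g. "W 9TH ST" -> "West 9th Street", applied only where the user wants it
```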
I like the idea @NelsonMinar proposed. Address expansion is a common problem, and having a separate module which we could all collaborate on would be generally useful; @mapzen will contribute to this.
I like the idea of separate expansion code as well. Expertise from someone who knows addressing (initially in the U.S.) like @geomantic could be of use here. I assume there are similar nitpicky details elsewhere, and the …
(related fun fact: a few weeks ago, Uber released a version of its app for drivers that had the speaking voice incorrectly set to British English, and pronounced every “St.” as “Saint”, as in “turn left on 9th Saint”)
@migurski I'll stick up for … I think we're going to need some additional flexibility, but I wouldn't suggest bucketing by ISO code. There's going to be a ton of overlap in a given region, for one thing. For another, if we start writing a lot of per-country code we could find ourselves facing all the problems of a big scraper project.
I guess all I mean is that it’s not so much advanced as different, and as we encounter other non-USian approaches toward addressing structure, it will make more sense to handle them as such. The code seems great.
Yeah. If the consensus holds to stop doing the address expansion in the main OpenAddresses output product, I volunteer to spin off a new project with my Python code to make it easy for people to do the expansion locally: a library function and a command-line tool; see also Mike's StreetNames package. Would be nice to bring the JavaScript code in too; it would be great for both Node and in-browser use. We could even share test cases! I love the idea that other folks might contribute; having a small separate library would make it easier to bring in expertise from people who know more about address localization.
👍 on a library for name expansion. It seems like that should be part of a general "address normalization" library, but that's a much bigger bite to take. Let's start with name expansion!
I could extract …
It sounds like we're zeroing in on an approach, but I suggest waiting a few days for @ingalls (who's still on vacation) to get back, dig out from email & weigh in.
@thatdatabaseguy has done some work for Mapzen over here: https://github.com/openvenues/address_normalizer. Seems similar; maybe he can take a look here & weigh in? (This is a project that's pulling address & POI info from the web via the common crawl, which we will publish very soon. Probably relevant to openaddresses too).
Huh, that sounds very much like the thing. |
@migurski @NelsonMinar I'm certainly open to new ways of looking at the abbreviated/expanded question, but my preference would be expanded.
The conform process is aimed at taking the weight off developers' shoulders, and like format I think having standardized expansions will make people's lives even easier. I do like the idea of splitting it out into a library. I think we also need another option in conform that is something like … If the … What do you all think?
Welcome back @ingalls! I'm out of my expertise when it comes to address localization, so I'm glad to have all the comments from people who actually put this data to use. My main complaint is that the code we currently have is not good enough to apply to all our data. I cringe at the idea that we might be formatting "Nord Mauerstraße" in Berlin as "North Mauer Street". (Made-up example.) I feel strongly we shouldn't degrade data, so my first thought was just to stop trying to do any expansions. But there's so much interest in doing good expansions that I'm newly encouraged! We can steer OpenAddresses towards doing expansions where we know we can do a good job and then improve our expansion capability. I imagine there has to be prior art from national post offices, etc. that we can use. Specific proposal:
I think the near-term outcome is to make the conversion code expand the output only for the US locale, where our current code works well. Then we can start expanding the code to other locales incrementally.
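A minimal sketch of that near-term gate, reusing the hypothetical `expand_street()` from the earlier sketch; the `source_country` argument and the `"us"` check are assumptions about how locale information would be threaded through the conversion code, not the project's actual API.

```python
# Hypothetical sketch: gate expansion on the source's locale so only US
# sources get the current US-English rules; everything else passes through.
def maybe_expand(street, source_country):
    if source_country == "us":
        return expand_street(street, locale="en_US")  # current rules apply
    return street                                     # leave other locales untouched
```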
@NelsonMinar This sounds great. One thing: should we bother maintaining two expansion libraries? I'm a huge nodejs fan, but I think in the interest of being able to iterate quickly we should probably stick with one language. I think writing this in Python to match machine is probably going to be best. Plan:
Note: The reason I think going by … and …
👍 I like the idea of the …
See: openaddresses/expand for some starter expansion files that we can begin to work off. I propose that the name of the …
Hey guys - as @randymeech mentioned, I have created a library https://github.com/openvenues/address_normalizer which handles most or all of these cases without losing information. At the moment we produce all the combinations (Dr expands to either Drive or Doctor, potentially to Dominican Republic as well depending on which gazetteers are being used) and give you back a few variants on the string without trying to resolve the ambiguity. This is suitable for search indexing and deduping, but not necessarily for displaying addresses and formatting. One option is to keep the original text as the display address but index the normalized form(s) in search/dbs. That's currently what we do in deduping (first surface form into the index wins). I've also been thinking about a statistical NLP approach which would label one candidate expansion as more likely than the others using a sequence model; I just need to construct a small training set of ambiguous strings and their correct normalized forms. Our library also does a lot more than just synonym expansion. It includes: …
There's a new version that hasn't been pushed to GitHub yet which is written in (modern) C, with default bindings in Python and easy to bind to other languages like Node, etc. The included gazetteers are just text files and can be easily edited collaboratively where it makes sense, or you can BYO expansions or mix and match. I've also written some code to download the latest GeoNames and add all the alternate=>canonical names to the trie, which helps with expanding city aliases e.g. NYC => New York City, ATL => Atlanta. They also have a lot of cross-lingual toponyms e.g. Milan/Milano which is very useful from an l10n/i18n perspective. The currently deployed (mostly Python) code is running in production and is deduping millions of addresses across several data sources including OpenAddresses and OSM. Feel free to review, use, fork or contribute to our repo (I'm renaming it libpostal for the new release). I'm hoping it can be shared by multiple organizations and combine best practices in address normalization, extraction and parsing. Let me know if that sounds useful/interesting to you guys.
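To sketch what those plain-text gazetteers plus GeoNames alternate-name entries might look like in practice: the file name, tab-separated layout, and helper function below are assumptions for the illustration, not the library's actual data files or API.

```python
# Illustrative only: a tiny alias gazetteer loaded from a tab-separated text
# file mapping alternate names to canonical ones, as described above.
ALIASES = {}
with open("geonames_aliases.txt") as f:          # e.g. "NYC\tNew York City"
    for line in f:
        alias, canonical = line.rstrip("\n").split("\t")
        ALIASES[alias.lower()] = canonical

def canonical_city(name):
    """Return the canonical form of a city alias, e.g. "ATL" -> "Atlanta"."""
    return ALIASES.get(name.lower(), name)
```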
@thatdatabaseguy how would you characterize the stability and ease of installation of address_normalizer?
@migurski great! Installation should be relatively painless (pip install https://github.com/openvenues/address_normalizer ought to do it, and it will get on PyPI). You'll need a C compiler when you install it, clang/gcc. C compilation is currently being handled by setup.py. There's a Makefile in the new version, but for Python installations pip install will still work. There are no external C packages that have to be separately installed through a package manager, which is usually where complications arise. I'm building on Mac 10.9 and Ubuntu 14.04 currently. Python 2.7 definitely works; I haven't tested Python 3 yet but will definitely support both. As far as the API goes, I would wait to check out the upcoming release. There are some changes on the Python side since some of the code has moved to C, but the API is fairly small as it stands and will remain as such. We'll send a heads up when that's ready and the API should be pretty stable after that.
Sounds worth looking at. I'm not entirely sure how to use it; can you help? I’ve opened a ticket on the original project. Also, I don’t see any tests or anything; curious if we can somehow help with that?
Also, @ingalls I like your plan. Seems like …
@migurski Sounds great! I'm completely for using …
I'm going on vacation in a few days and would like to reach some resolution. I think the consensus is to keep doing expansions of street names in the output product, and treat the expansion code as a first-class product so it can be independently improved.
(Ugh, hit ctrl-Enter in GitHub early.) I still volunteer to spin my Python code out into a separate library. I'll do this in a couple of days if I don't hear otherwise. I like the idea of using …
@NelsonMinar, most of the unit tests are in the new release, but I pushed some basic API tests to the Python version in the meantime, including most of what you guys had in expand.py plus some deduping test cases: https://github.com/openvenues/address_normalizer/blob/master/test.py, happy to add more. Displaying a single best expansion is a goal on our side as well, so there's a separate function for each use case. I've renamed what was previously "normalize_street_address" to "expand_street_address" and have the normalize_street_address call return a single result rather than a set. The normalize calls currently just return one element from the set; the real implementation will be in the upcoming release. I've derived a training set from OSM way names and addr tags, many of which already use expanded forms instead of abbreviations, to estimate probabilities of a particular expansion given the surface form's position, its neighboring tokens and which dictionaries they match. Some expansions can be ambiguous across languages, e.g. you would want a phrase-initial "Dr." to normalize to "Doktor" in German while "Doctor" might be more statistically likely because there happen to be more addresses in English in your data set, so a country/language/locale hint may be advisable in a few cases for added precision. The vast majority of the time though, simple maximum likelihood will do the trick.
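A rough sketch of how those two calls might be used side by side; the function names come from this comment, but the import path, exact signatures, and the language hint shown here are assumptions rather than the library's confirmed API.

```python
# Sketch only -- names taken from the comment above, signatures assumed.
from address_normalizer import expand_street_address, normalize_street_address

# All plausible expansions: good for search indexing and deduping.
variants = expand_street_address("Dr Carlton B Goodlett Pl")
# e.g. {"doctor carlton b goodlett place", "drive carlton b goodlett place", ...}

# A single best expansion: good for display. A locale hint could resolve
# cross-language ambiguity like "Dr." -> "Doktor" vs. "Doctor".
best = normalize_street_address("Dr. Schmidt Str. 7", language="de")
```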
@skorasaurus noticed that "CR" is not correctly interpreted as County Road in #1084; we may want to revisit this thread soonish.
With the new …
I wouldn't want to implement the street name rewriting algorithms as regex! But the new function syntax does give us a way to specify processing on a per-source basis. We could define new functions for common rewriting and add them to the source specs. That would give us localization at least.
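As a hypothetical sketch of what per-source rewriting functions could look like: the registry, the function names, and the idea of a source spec referencing them by name are all assumptions for the illustration, not the existing conform "function" syntax.

```python
# Hypothetical sketch: named rewriting functions that individual source specs
# could opt into, rather than one global expansion applied to everything.
def expand_us_street(value):
    return expand_street(value, locale="en_US")   # the US-English rules only

def titlecase_only(value):
    return value.title()                          # lighter-touch cleanup

REWRITE_FUNCTIONS = {
    "expand_us_street": expand_us_street,
    "titlecase_only": titlecase_only,
}

def apply_rewrite(value, function_name):
    """Apply the rewriting function a source spec names, or pass through."""
    return REWRITE_FUNCTIONS.get(function_name, lambda v: v)(value)
```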
Yeah. Start with an America setting and go from there. |
Seeing another example of a bad expansion, with “Dr Carlton B Goodlett Pl” expanded to “Drive Carlton B Goodlett Place” in San Francisco.
Might be worth taking a look at @thatdatabaseguy's successor to openvenues/address_normalizer, openvenues/libpostal. It has expansion models trained for a large number of languages.
@riordan I think for this particular problem, as @NelsonMinar has pointed out before, what's needed is a "single best" expansion, which is beyond libpostal's immediate capabilities (disambiguation is next on the roadmap though). We return a set of possible expansions, some of which may be grammatically nonsensical, but at least one will be correct. For search indexing or deduping that's fine, but for display it's problematic. What if OA did the inverse of what it's doing now and abbreviated everything instead? That can be done deterministically and with no errors, while still handling duplicates. Thoughts?
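A minimal sketch of that "abbreviate everything" direction; the mapping and function below are illustrative only. Because several full words can map to the same abbreviation, the transform never has to guess the way expansion does, which is what makes it attractive for deduping.

```python
# Illustrative sketch: many-to-one abbreviation is deterministic, unlike
# one-to-many expansion ("St" -> Street? Saint?), so it never has to guess.
ABBREVIATIONS = {"street": "St", "saint": "St", "avenue": "Ave",
                 "drive": "Dr", "doctor": "Dr", "west": "W", "place": "Pl"}

def abbreviate(name):
    return " ".join(ABBREVIATIONS.get(t.lower().rstrip("."), t)
                    for t in name.split())

print(abbreviate("West Saint Sebastian Street"))   # -> "W St Sebastian St"
print(abbreviate("Dr. Carlton B Goodlett Place"))  # -> "Dr Carlton B Goodlett Pl"
```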
I still think we ought to do nothing and just pass on the authoritative source data as-is. In some other discussion we were close to a consensus on providing two files: one unedited, and one cleaned up. Maybe in the cleaned-up one we make a best effort to rewrite things.
Yep, I'm partial to @NelsonMinar's idea here. That implies that someone has to build out a cleanup pipeline though 😄.
Also agree with @NelsonMinar and @iandees. I think this ticket stands: we’ll not rewrite street names.
Time to begin the great cleanup.
Wait, it's over a year later and my opinion prevails? I win! :-) Seriously, I'm impressed with how our thinking has evolved on source data, and on the value of cleaning data in a more comprehensive step. Thanks in particular to having users. As my first post says, I think removing our current expansion is as simple as removing all calls to expand.py.
First we ignored you …
thanks @migurski - assessing ...
I think this is the right direction to move in. I'm unsure of the effect it will have on our systems, but our pipeline doesn't auto-update anyway -- we'll keep this in mind for the next time we pull a fresh OA CSV.
👍 One of the ideas we talked about was doing post-processing in the batch download files: de-duping, scrubbing, and perhaps the formatting and name expansion that’s been removed here.
We at the CFPB like this transition towards not altering street names, as it brings the data closer to the authoritative source while not reducing its effectiveness for our use case.
Echoing @kgudel's point (but on behalf of the Pelias team). It's great (and critical) that OA be a place to get "Authoritative Data". But it's also a burden that we pass on to any other consumers of the data, particularly those who don't have the knowledge or resources to do their own analysis or processing to make more sense of it. Plus, the OpenAddresses project, by already being a collective effort, is in a great place to pull together knowledge of how to interpret different address formats from the downstream users, and it would be a shame for that to go to waste. I'm a fan of @NelsonMinar's proposal (#283) for there to be the "Authoritative" files (…
We’ve talked about applying the expansions (and other post-processes) to only the collected files, available in big zip files at http://results.openaddresses.io. Would the availability of untouched data at the individual source level still work for CFPB, @kgudel? E.g. this would be post-processed: http://data.openaddresses.io/openaddr-collected-us_northeast.zip
This pull is basically ready: #281.
The current Node and Python code rewrites street names, turning things like "W ST SEBASTIAN ST" in the source data into "West Saint Sebastian Street" in the output. The problem is that the transform can only degrade information from the source. We're not using any source-specific information to add value; it's all just a guess. The current code is definitely US English only.
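To make the "it's all just a guess" point concrete, here's a small, self-contained illustration of how a context-free lookup table mishandles the example above; the table and function are invented for the illustration and are not the project's actual expansion code.

```python
# Illustration only: a blind token-by-token expansion cannot tell whether
# "ST" means "Saint" or "Street", so it has to guess and degrades the data.
EXPANSIONS = {"W": "West", "ST": "Street", "AVE": "Avenue"}

def naive_expand(name):
    return " ".join(EXPANSIONS.get(token.upper(), token) for token in name.split())

print(naive_expand("W ST SEBASTIAN ST"))
# -> "West Street SEBASTIAN Street"  (should be "West Saint Sebastian Street")
```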
I think this transformation is a bad feature for the product. The Python code I'm implementing in ditch-node does it for now to be as compatible as possible with the Node code. If we decide to remove this feature from the processing, in the Python code it's as easy as removing expand.py and any calls to it.