This repository has been archived by the owner on May 5, 2022. It is now read-only.

Consider not rewriting street names #32

Closed
NelsonMinar opened this issue Jan 7, 2015 · 54 comments

@NelsonMinar
Contributor

The current Node and Python code rewrites street names, turning things like "W ST SEBASTIAN ST" in the source data to "West Saint Sebastian Street" in the output. The problem is the transform can only degrade information from the source. We're not using any source-specific information to add value, it's all just a guess. The current code is definitely US English only.

I think this transformation is a bad feature for the product. The Python code I'm implementing in ditch-node does it for now to stay as compatible as possible with the Node code. If we decide to remove this feature from the processing, in the Python code it's as easy as removing expand.py and any calls to it.
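To make the concern concrete, here is a minimal Python sketch of the kind of dictionary-driven token expansion being discussed; the table and helper are illustrative only, not the actual contents of expand.py or expand.json.

```python
# Illustrative only: a toy token map in the spirit of expand.json/expand.py.
# The real tables are much larger; the point is that the mapping is a guess.
EXPANSIONS = {
    "W": "West",
    "ST": "Street",   # ...but "ST" can also mean "Saint", as in "ST SEBASTIAN"
    "AVE": "Avenue",
    "DR": "Drive",    # ...or "Doctor"
}

def expand_street_name(name):
    """Naively expand each abbreviated token; context is ignored."""
    return " ".join(EXPANSIONS.get(token.upper(), token) for token in name.split())

print(expand_street_name("W ST SEBASTIAN ST"))
# -> "West Street SEBASTIAN Street": the first "ST" should have been "Saint"
```

Without context the transform can only guess, which is why it degrades information rather than adding it.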

@migurski
Member

migurski commented Jan 7, 2015

Curious for @sbma44's and @ingalls's take on this. I personally agree with you, Nelson, especially as we expand to worldwide addresses and the abbreviations may not mean what we think they mean.

@sbma44
Contributor

sbma44 commented Jan 7, 2015

I'd better defer to @ingalls. You're right that this is a lossy operation, but there's something to be said for having this data as close to ready to go as possible, and to avoid making users re-deduce these cultural variations over and over.

Possible solutions:

  1. emit additional columns containing original, unexpanded data
  2. accept a flag that turns off expansion
  3. allow expansion rules to be overridden per-source

I don't think 2 buys us much. I think 3 will wind up being pretty helpful now that we've got coverage outside the anglophone world -- alas, there's logic in the expansion function, so this will be more complicated than just specifying a different expand.json. Besides, a given set of expansions is likely to be useful across a geographic or language area broader than a single source, making the conform object the wrong place for this. 1 would be fine but could break some configurations and might be a pain to implement, if the Python rewrite is sticking to the pattern of the node version of conform.
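As a rough illustration of options 2 and 3, a source's conform object could carry an opt-out flag or a per-source override table. The expand and expand_overrides keys below are hypothetical, not part of the current conform schema.

```json
{
    "conform": {
        "type": "csv",
        "number": "HOUSENUM",
        "street": "FULLSTREET",
        "expand": false,
        "expand_overrides": {"STR": "Straße", "PL": "Platz"}
    }
}
```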

@NelsonMinar
Contributor Author

Another solution would be for us to package the expansion code as a utility library for use by OpenAddress users. Then they could download the original unaltered street names and use our nice code to expand it locally in their application.

@missinglink

I like the idea @NelsonMinar proposed. Address expansion is a common problem, and having a separate module we could all collaborate on would be generally useful; @mapzen will contribute to this.

@migurski
Member

migurski commented Jan 7, 2015

I like the idea of separate expansion code as well. Expertise from someone who knows addressing (initially in the U.S.) like @geomantic could be of use here. I assume there are similar nitpicky details elsewhere, and the advanced_merge approach felt like a bit of a kludge on top of what should really be a per-country approach keyed on ISO code.

@migurski
Member

migurski commented Jan 7, 2015

(related fun fact: a few weeks ago, Uber released a version of its app for drivers that had the speaking voice incorrectly set to British English, and pronounced every “St.” as “Saint”, as in “turn left on 9th Saint”)

@sbma44
Contributor

sbma44 commented Jan 7, 2015

@migurski I'll stick up for advanced_merge -- I think it's solving a different problem. Specifically, some Asian addressing systems delimit address components with hyphens, and some source files require that more than two fields be joined.

I think we're going to need some additional flexibility, but I wouldn't suggest bucketing by ISO code. There's going to be a ton of overlap in a given region, for one thing. For another, if we start writing a lot of per-country code we could find ourselves facing all the problems of a big scraper project.

@migurski
Member

migurski commented Jan 7, 2015

I guess all I mean is that it’s not so much advanced as different, and as we encounter other non-USian approaches toward addressing structure, it will make more sense to handle them as such. The code seems great.

@NelsonMinar
Contributor Author

Yeah advanced_merge is different from this issue and necessary. It's the only way to merge the number portion of an address, and I bet there's at least one US source where it'll be the only way to get an address like "122½" out of "NUMBER" and "FRACTION" columns. I have some broader questions about how to make the conform process better for international sources, there's definitely room for some per-region configuration. But one thing at a time!
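For reference, merging a NUMBER and a FRACTION column into a single house number might look roughly like the hypothetical conform sketch below; the exact advanced_merge syntax shown here is an assumption and may differ from what openaddresses-conform actually accepts.

```json
{
    "conform": {
        "type": "csv",
        "street": "STREETNAME",
        "advanced_merge": {
            "auto_number": {
                "separator": "",
                "fields": ["NUMBER", "FRACTION"]
            }
        },
        "number": "auto_number"
    }
}
```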

If the consensus holds to stop doing the address expansion in the main OpenAddresses output product, I volunteer to spin off a new project with my Python code to make it easy for people to do the expansion locally. A library function and a command-line tool; see also Mike's StreetNames package. It would be nice to bring the JavaScript code in too, which would be great for both Node and in-browser use. We could even share test cases! I love the idea that other folks might contribute; having a small separate library would make it easier to bring in expertise from people who know more about address localization.

@iandees
Member

iandees commented Jan 7, 2015

👍 on a library for name expansion.

It seems like that should be part of a general "address normalization" library, but that's a much bigger bite to take. Let's start with name expansion!

@migurski
Member

migurski commented Jan 7, 2015

I could extract StreetNames out of nvkelso/map-label-style-manual/tools/street_names into a separate package here under OA. It’s code and tests for abbreviating U.S. names; it could grow code and tests for the opposite operation we’re talking about here.

@sbma44
Contributor

sbma44 commented Jan 7, 2015

It sounds like we're zeroing in on an approach, but I suggest waiting a few days for @ingalls (who's still on vacation) to get back, dig out from email & weigh in.

@migurski
Member

migurski commented Jan 7, 2015

:concur:

@randymeech

@thatdatabaseguy has done some work for Mapzen over here: https://github.com/openvenues/address_normalizer

Seems similar, maybe he can take a look here & weigh in?

(This is a project that's pulling address & POI info from the web via the common crawl, which we will publish very soon. Probably relevant to openaddresses too).

@migurski
Member

migurski commented Jan 8, 2015

Huh, that sounds very much like the thing.

@ingalls
Member

ingalls commented Jan 8, 2015

@migurski @NelsonMinar I'm certainly open to new ways of looking at the abbreviated/expanded question, but my preference would be expanded.

  • Municipal governments don't always follow the standard USPS abbreviations. Court is usually Ct, but I've also seen it as Co.
  • The classic ambiguity between Drive vs. Doctor and Street vs. Saint
  • It's easier to abbreviate back into a preferred abbreviation scheme (see point 1)

The conform process is aimed at taking the weight off developers' shoulders, and as with format, I think having standardized expansions will make people's lives even easier.

I do like the idea of splitting it out into a library. I think we also need another option in conform that is something like expansion: usps or expansion: 'fake_county' to allow us to have multiple overlapping expansion types instead of just one giant list with possible conflicts.

If the expansion flag were not set, then expand.py would be skipped.
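To make that concrete, a conform block naming its expansion scheme might look like the sketch below; the expansion key and its values are hypothetical here, not an existing part of the schema.

```json
{
    "conform": {
        "type": "shapefile",
        "number": "ADDR_NUM",
        "street": "ST_NAME",
        "expansion": "usps"
    }
}
```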

What do you all think?

@NelsonMinar
Contributor Author

Welcome back @ingalls! I'm out of my depth when it comes to address localization, so I'm glad to have all the comments from people who actually put this data to use.

My main complaint is the code we currently have is not good enough to apply to all our data. I cringe at the idea that we might be formatting "Nord Mauerstraße" in Berlin as "North Mauer Street". (Made up example.) I feel strongly we shouldn't degrade data. So my first thought was just to stop trying to do any expansions.

But there's so much interest in doing good expansions that I'm newly encouraged! We can steer OpenAddresses towards doing expansions where we know we can do a good job and then improve our expansion capability. I imagine there has to be prior art from national post offices, etc., that we can use.

Specific proposal:

  1. Create a new GitHub project for street name expansion. Seed it with the JavaScript code in csv.js and the Python code in expand.py. Make them separately installable libraries with NPM/PyPI. Share a suite of test data so we ensure the two libraries behave consistently (a possible shared fixture format is sketched below).
  2. Remove the expansion code itself from openaddresses-conform and openaddresses-machine, include our libraries instead.
  3. Add localization to the expansion library. My first thought was to do it by region, but @ingalls's proposal of a conform option is also appealing.

I think the near-term outcome is to make the conversion code expand the output only for the US locale, where our current code works well. Then we can start expanding the code to other locales incrementally.
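One way to share a test suite between the JavaScript and Python libraries would be a plain JSON fixture that both test runners read; the format below is a hypothetical sketch, not an agreed-upon schema.

```json
[
    {"locale": "us", "input": "W ST SEBASTIAN ST", "expected": "West Saint Sebastian Street"},
    {"locale": "us", "input": "123 N MAIN AVE",    "expected": "123 North Main Avenue"}
]
```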

@ingalls
Member

ingalls commented Jan 8, 2015

@NelsonMinar This sounds great. One thing: should we bother maintaining two expansion libraries? I'm a huge nodejs fan, but in the interest of being able to iterate quickly we should probably stick with one language. I think writing this in Python to match machine is probably going to be best.

Plan:

  • Create country specific expansion files in new repo
  • Alter openaddresses/test.js to allow expand flag
  • Add expand flag to applicable sources
  • Get expand.py extracted => library
  • Switch machine to use new expand library

Note: the reason I favor an expand flag over a region is that we have overlapping data sources (i.e. one source for all of MA and then municipalities within MA); these could potentially use different expansion tokens.

@sbma44
Contributor

sbma44 commented Jan 8, 2015

👍 I like the idea of the expand flag.

@ingalls
Member

ingalls commented Jan 8, 2015

See openaddresses/expand for some starter expansion files that we can begin to work off. I propose that the name of the expand tag match 1:1 with the files in the /maps directory.

@albarrentine
Contributor

Hey guys - as @randymeech mentioned, I have created a library, https://github.com/openvenues/address_normalizer, which handles most or all of these cases without losing information.

At the moment we produce all the combinations (Dr expands to either Drive or Doctor, potentially to Dominican Republic as well depending on which gazetteers are being used) and give you back a few variants on the string without trying to resolve the ambiguity. This is suitable for search indexing and deduping, but not necessarily for displaying addresses and formatting. One option is to keep the original text as the display address but index the normalized form(s) in search/dbs. That's currently what we do in deduping (first surface form into the index wins). I've also been thinking about a statistical NLP approach which would label one candidate expansion as more likely than the others using a sequence model, just need to construct a small training set of ambiguous strings and their correct normalized forms.

Our library also does a lot more than just synonym expansion. It includes:

  • a lexer-based tokenizer which looks for multiple patterns simultaneously and classifies each token type as it goes through the text (that runs at C speed). Plays nicely with Unicode
  • normalization of diacritical marks (é => e, etc.) which can be turned off as you prefer (a minimal sketch of the general technique follows this list)
  • Multi-word numeric expression and ordinal normalization ("One twenty-first" => 121st)
  • An efficient trie data structure which allows us to look up millions of distinct multi-word phrases, e.g. from GeoNames, in a single pass through the text while using on the order of 100 MB of memory. This allows us to reason about address components found in text/webpages and helps with picking apart combined fields in certain data sets.
  • A (soon-to-be-implemented) slight variant on this trie allows us to expand German street suffixes which can either be written as a separate word or tacked onto the street name (Nord Mauerstraße => Nord Mauer straße or Nord Mauer strasse).
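The diacritic folding mentioned above can be sketched with the Python standard library alone; this shows the general technique, not address_normalizer's implementation.

```python
# A minimal sketch of accent folding (é -> e) using only the standard library.
import unicodedata

def strip_diacritics(text):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Hôtel-de-Ville Plaça"))  # -> "Hotel-de-Ville Placa"
```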

There's a new version that hasn't been pushed to Github yet which is written in (modern) C, default bindings in Python and easy to bind to other languages like Node, etc. The included gazetteers are just text files and can be easily edited collaboratively where it makes sense or you can BYO expansions or mix and match. I've also written some code to download the latest GeoNames and add all the alternate=>canonical names to the trie which helps with expanding city aliases e.g. NYC => New York City, ATL => Atlanta. They also have a lot of cross-lingual toponyms e.g. Milan/Milano which is very useful from an l10n/i18n perspective.

The currently deployed (mostly Python) code is running in production and is deduping millions of addresses across several data sources including OpenAddresses and OSM.

Feel free to review, use, fork or contribute to our repo (I'm renaming it libpostal for the new release). I'm hoping it can be shared by multiple organizations and combine best practices in address normalization, extraction and parsing. Let me know if that sounds useful/interesting to you guys.

@migurski
Member

migurski commented Jan 9, 2015

@thatdatabaseguy how would you characterize the stability and ease of installation of address_normalizer, on Mac 10.9 and 10.10, Python 2.7 and 3.3+, and Linux? What about the API? It looks like a great candidate for what we’re describing here.

@albarrentine
Contributor

@migurski great! Installation should be relatively painless (pip install https://github.com/openvenues/address_normalizer ought to do it; we'll get it into PyPI). You'll need a C compiler when you install it, clang/gcc. C compilation is currently being handled by setup.py. There's a Makefile in the new version, but for Python installations pip install will still work. There are no external C packages that have to be separately installed through a package manager, which is usually where complications arise. I'm building on Mac 10.9 and Ubuntu 14.04 currently. Python 2.7 definitely works; Python 3 I haven't tested yet, but we'll definitely support both.

As far as API, I would wait to check out the upcoming release. There are some changes on the Python side since some of the code has moved to C, but the API is fairly small as it stands and will remain as such. We'll send a heads up when that's ready and the API should be pretty stable after that.

@migurski
Member

migurski commented Jan 9, 2015

Sounds worth looking at. I'm not entirely sure how to use it; can you help? I’ve opened a ticket on the original project. Also, I don’t see any tests or anything, curious if we can somehow help with that?

@migurski
Member

migurski commented Jan 9, 2015

Also, @ingalls I like your plan. Seems like address_normalizer is in theory compatible, as an implementation of expansion.

@ingalls
Member

ingalls commented Jan 9, 2015

@migurski Sounds great! I'm completely for using address_normalizer if the install is painless and it fulfils our needs, no reason to reinvent the wheel.

@NelsonMinar
Contributor Author

I'm going on vacation in a few days and would like to reach some resolution. I think the consensus is to keep doing expansions of street names in the output product, and treat the expansion code as a first class product so it can be independently improved.

@NelsonMinar
Contributor Author

(Ugh, hit ctrl-Enter in GitHub early.)

I still volunteer to spin my Python code out into a separate library. I'll do this in a couple of days if I don't hear otherwise.

I like the idea of using address_normalizer instead as our expander, it's way more sophisticated than anything we're likely to come up with soon. It feels like the code isn't quite ready though; there's an upcoming release promised. Also I'm uncomfortable about the lack of any tests in the GitHub project. So I suggest we hold off a bit on switching to it, watch and wait for the project to mature. I'm glad to be wrong about that though. FWIW I got it working in a quick test and it seems pretty solid, at least for simple US English examples. The big thing we'd have to consider is it seems aimed at returning a large list of possible expansions (for matching alternatives) whereas we'd need to pick one single best expansion to publish.

@NelsonMinar NelsonMinar self-assigned this Jan 17, 2015
@albarrentine
Contributor

@NelsonMinar, most of the unit tests are in the new release, but I pushed some basic API tests to the Python version in the meantime, including most of what you guys had in expand.py plus some deduping test cases: https://github.com/openvenues/address_normalizer/blob/master/test.py. Happy to add more.

Displaying a single best expansion is a goal on our side as well. So there's a separate function for both use cases. I've renamed what was previously "normalize_street_address" to "expand_street_address" and have the normalize_street_address call return a single result rather than a set. The normalize calls currently just return one element from the set - real implementation will be in the upcoming release. I've derived a training set from OSM way names and addr tags, many of which already use expanded forms instead of abbreviations, to estimate probabilities of a particular expansion given the surface form's position, its neighboring tokens and which dictionaries they match. Some expansions can be ambiguous across languages e.g. you would want a phrase-initial "Dr." to normalize to "Doktor" in German while "Doctor" might be more statistically likely because there happen to be more addresses in English in your data set, so a country/language/locale hint may be advisable in a few cases for added precision. The vast majority of the time though simple maximum likelihood will do the trick.
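Based on the function names mentioned in this thread, usage on the OA side might look roughly like the sketch below; the module path and exact signatures are assumptions, so check the library's README for the real API.

```python
# Hypothetical usage based on the function names mentioned above;
# the real module path and signatures may differ.
from address_normalizer import expand_street_address, normalize_street_address

raw = "30 w 26th st"

candidates = expand_street_address(raw)   # set of possible expansions, for indexing/deduping
display = normalize_street_address(raw)   # a single "best" expansion, for display output
```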

@migurski
Member

@skorasaurus noticed that "CR" is not correctly interpreted as County Road in #1084; we may want to revisit this thread soonish.

@migurski
Member

migurski commented Sep 1, 2015

With the new regexp behavior, we might be able to move street name rewriting into the sources themselves.

@NelsonMinar
Contributor Author

I wouldn't want to implement the street name rewriting algorithms as regex! But the new function syntax does give us a way to specify processing on a per-source basis. We could define new functions for common rewriting and add them to the source specs. That would give us localization at least.
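For illustration, a per-source rewriting function in a source spec might look like the sketch below; expand_usa is an invented function name used only to show the shape of the idea, not an existing machine function.

```json
{
    "conform": {
        "type": "csv",
        "number": "HOUSENUM",
        "street": {
            "function": "expand_usa",
            "field": "FULLSTREET"
        }
    }
}
```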

@migurski
Member

migurski commented Sep 1, 2015

Yeah. Start with an America setting and go from there.

@migurski
Member

Seeing another example of a bad expansion, with “Dr Carlton B Goodlett Pl” expanded to “Drive Carlton B Goodlett Place” in San Francisco.

@riordan

riordan commented Jan 20, 2016

Might be worth taking a look at @thatdatabaseguy's successor to openvenues/address-normalizer, openvenues/libpostal. It has expansion models trained for a large number of languages.

@albarrentine
Contributor

@riordan I think for this particular problem, as @NelsonMinar has pointed out before, what's needed is a "single best" expansion, which is beyond libpostal's immediate capabilities (disambiguation is next on the roadmap though). We return a set of possible expansions, some of which may be grammatically nonsensical, but at least one will be correct. For search indexing or deduping that's fine, but for display it's problematic.

What if OA did the inverse of what it's doing now and abbreviated everything instead? That can be done deterministically and with no errors, while still handling duplicates. Thoughts?

@NelsonMinar
Contributor Author

I still think we ought to do nothing and just pass on the authoritative source data as-is. In some other discussion we were close to a consensus on providing two files: one unedited, and one cleaned up. Maybe in the cleaned-up one we make a best effort to rewrite things.

@iandees
Member

iandees commented Jan 20, 2016

Yep, I'm partial to @NelsonMinar's idea here. That implies that someone has to build out a cleanup pipeline though 😄.

@migurski
Member

Also agree with @NelsonMinar and @iandees. I think this ticket stands: we’ll not rewrite street names.

@riordan

riordan commented Jan 21, 2016

Time to begin the great cleanup.

@NelsonMinar
Contributor Author

Wait, it's over a year later and my opinion prevails? I win! :-) Seriously, I'm impressed with how our thinking has evolved on source data. And the value of cleaning data in a more comprehensive step. Thanks in particular to having users.

As my first post says, I think removing our current expansion is as simple as removing all calls to expand.py. I'm sure tests will need updating too. The other project is to make a cleanup script and change the packaging to produce two data files. Not sure when that happens.

@migurski
Member

First we ignored you
then we laughed at you
then we fought you
then you won.

@migurski
Member

So, this is the big change in expectations: 0eaf1b1

I would love input from @trescube, @feomike, @riordan, @kgudel, @dianashk, @sbma44, and anyone else who might be affected by this major modification to the output files.

@feomike

feomike commented Jan 21, 2016

thanks @migurski - assessing ....

@sbma44
Contributor

sbma44 commented Jan 21, 2016

I think this is the right direction to move in. I'm unsure of the effect it will have on our systems, but our pipeline doesn't auto-update anyway -- we'll keep this in mind for the next time we pull a fresh OA CSV.

@migurski
Member

👍

One of the ideas we talked about was doing post-processing on the batch download files: de-duping, scrubbing, and perhaps the formatting and name expansion that’s been removed here.

@kgudel

kgudel commented Jan 22, 2016

We at the CFPB like this transition towards not altering street names, as it brings the data closer to the authoritative source while not reducing its effectiveness for our use case.

@riordan

riordan commented Jan 22, 2016

Echoing @kgudel's point (but on behalf of the Pelias team). It's great (and critical) that OA be a place to get "Authoritative Data".

But it's also then a burden that we pass onto any other consumers of the data, particularly those who don't have the knowledge or resources to do their own analysis or processing to make more sense of it. Plus the OpenAddresses project, by being a collective effort already, is in a great place to pull together knowledge of how to interpret different address formats from the downstream users, and it would be a shame for that to go to waste.

I'm a fan of @NelsonMinar's proposal (#283) for there to be the "Authoritative" files (out.csv) and then for there to be an additional enhance step, with enhanced versions that pull together the OA interpretation from its downstream consumers (where it makes sense to do so). That way we can all use the authoritative data, but pool what we know about making sense of it.

@migurski
Member

We’ve talked about applying the expansions (and other post-processes) to only the collected files, available in big zip files at http://results.openaddresses.io. Would the availability of untouched data at the individual source level still work for CFPB, @kgudel?

E.g. this would be post-processed: http://data.openaddresses.io/openaddr-collected-us_northeast.zip
But this would not: http://data.openaddresses.io/runs/20063/us/ny/statewide.zip

@feomike

feomike commented Jan 22, 2016

@migurski yup, that should be OK w/ us. cc @kgudel. Thanks for this discussion, and more importantly thanks for reaching out to ask; it matters a bunch.

@migurski
Member

This pull is basically ready: #281.

It stops rewriting street names during batch set and CI jobs, and moves name expansion to periodic collections instead. The end result for users of large downloads will be identical, but it provides a first step toward the CSV enhancer idea.
