This repository has been archived by the owner on May 5, 2022. It is now read-only.

Consider not rewriting street names #32

Closed
NelsonMinar opened this issue Jan 7, 2015 · 54 comments

@NelsonMinar
Contributor

The current Node and Python code rewrites street names, turning things like "W ST SEBASTIAN ST" in the source data to "West Saint Sebastian Street" in the output. The problem is the transform can only degrade information from the source. We're not using any source-specific information to add value, it's all just a guess. The current code is definitely US English only.

I think this transformation is a bad feature for the product. The Python code I'm implementing in ditch-node does it for now to stay as compatible as possible with the Node code. If we decide to remove this feature from the processing, in the Python code it's as easy as removing expand.py and any calls to it.
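To make the concern concrete, here is a minimal Python sketch of the kind of dictionary-driven token expansion being discussed; the table and helper are illustrative only, not the actual contents of expand.py or expand.json.

```python
# Illustrative only: a toy token map in the spirit of expand.json/expand.py.
# The real tables are much larger; the point is that the mapping is a guess.
EXPANSIONS = {
    "W": "West",
    "ST": "Street",   # ...but "ST" can also mean "Saint", as in "ST SEBASTIAN"
    "AVE": "Avenue",
    "DR": "Drive",    # ...or "Doctor"
}

def expand_street_name(name):
    """Naively expand each abbreviated token; context is ignored."""
    return " ".join(EXPANSIONS.get(token.upper(), token) for token in name.split())

print(expand_street_name("W ST SEBASTIAN ST"))
# -> "West Street SEBASTIAN Street": the first "ST" should have been "Saint"
```

Without context the transform can only guess, which is why it degrades information rather than adding it.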

@migurski
Member

migurski commented Jan 7, 2015

Curious for @sbma44's and @ingalls's take on this. I personally agree with you, Nelson, especially as we expand to worldwide addresses and the abbreviations may not mean what we think they mean.

@sbma44
Contributor

sbma44 commented Jan 7, 2015

I'd better defer to @ingalls. You're right that this is a lossy operation, but there's something to be said for having this data as close to ready to go as possible, and to avoid making users re-deduce these cultural variations over and over.

Possible solutions:

  1. emit additional columns containing original, unexpanded data
  2. accept a flag that turns off expansion
  3. allow expansion rules to be overridden per-source

I don't think 2 buys us much. I think 3 will wind up being pretty helpful now that we've got coverage outside the anglophone world -- alas, there's logic in the expansion function, so this will be more complicated than just specifying a different expand.json. Besides, a given set of expansions is likely to be useful across a geographic or language area broader than a single source, making the conform object the wrong place for this. 1 would be fine but could break some configurations and might be a pain to implement, if the Python rewrite is sticking to the pattern of the node version of conform.
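As a rough illustration of options 2 and 3, a source's conform object could carry an opt-out flag or a per-source override table. The expand and expand_overrides keys below are hypothetical, not part of the current conform schema.

```json
{
    "conform": {
        "type": "csv",
        "number": "HOUSENUM",
        "street": "FULLSTREET",
        "expand": false,
        "expand_overrides": {"STR": "Straße", "PL": "Platz"}
    }
}
```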

@NelsonMinar
Contributor Author

Another solution would be for us to package the expansion code as a utility library for use by OpenAddress users. Then they could download the original unaltered street names and use our nice code to expand it locally in their application.

@missinglink

I like the idea @NelsonMinar proposed. Address expansion is a common problem, and having a separate module we could all collaborate on would be generally useful; @mapzen will contribute to this.

@migurski
Member

migurski commented Jan 7, 2015

I like the idea of separate expansion code as well. Expertise from someone who knows addressing (initially in the U.S.) like @geomantic could be of use here. I assume there are similar nitpicky details elsewhere, and the advanced_merge approach felt like a bit of a kludge on top of what should really be a per-country approach keyed on ISO code.

@migurski
Member

migurski commented Jan 7, 2015

(related fun fact: a few weeks ago, Uber released a version of its app for drivers that had the speaking voice incorrectly set to British English, and pronounced every “St.” as “Saint”, as in “turn left on 9th Saint”)

@sbma44
Contributor

sbma44 commented Jan 7, 2015

@migurski I'll stick up for advanced_merge -- I think it's solving a different problem. Specifically, some Asian addressing systems delimit address components with hyphens, and some source files require that more than two fields be joined.

I think we're going to need some additional flexibility, but I wouldn't suggest bucketing by ISO code. There's going to be a ton of overlap in a given region, for one thing. For another, if we start writing a lot of per-country code we could find ourselves facing all the problems of a big scraper project.

@migurski
Member

migurski commented Jan 7, 2015

I guess all I mean is that it’s not so much advanced as different, and as we encounter other non-USian approaches toward addressing structure, it will make more sense to handle them as such. The code seems great.

@NelsonMinar
Contributor Author

Yeah advanced_merge is different from this issue and necessary. It's the only way to merge the number portion of an address, and I bet there's at least one US source where it'll be the only way to get an address like "122½" out of "NUMBER" and "FRACTION" columns. I have some broader questions about how to make the conform process better for international sources, there's definitely room for some per-region configuration. But one thing at a time!
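For reference, merging a NUMBER and a FRACTION column into a single house number might look roughly like the hypothetical conform sketch below; the exact advanced_merge syntax shown here is an assumption and may differ from what openaddresses-conform actually accepts.

```json
{
    "conform": {
        "type": "csv",
        "street": "STREETNAME",
        "advanced_merge": {
            "auto_number": {
                "separator": "",
                "fields": ["NUMBER", "FRACTION"]
            }
        },
        "number": "auto_number"
    }
}
```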

If the consensus holds to stop doing the address expansion in the main OpenAddresses output product, I volunteer to spin off a new project with my Python code to make it easy for people to do the expansion locally. A library function and a command-line tool; see also Mike's StreetNames package. It would be nice to bring the JavaScript code in too, which would be great for both Node and in-browser use. We could even share test cases! I love the idea that other folks might contribute; having a small separate library would make it easier to bring in expertise from people who know more about address localization.

@iandees
Member

iandees commented Jan 7, 2015

👍 on a library for name expansion.

It seems like that should be part of a general "address normalization" library, but that's a much bigger bite to take. Let's start with name expansion!

@migurski
Member

migurski commented Jan 7, 2015

I could extract StreetNames out of nvkelso/map-label-style-manual/tools/street_names into a separate package here under OA. It’s code and tests for abbreviating U.S. names; it could grow code and tests for the opposite operation we’re talking about here.

@sbma44
Contributor

sbma44 commented Jan 7, 2015

It sounds like we're zeroing in on an approach, but I suggest waiting a few days for @ingalls (who's still on vacation) to get back, dig out from email & weigh in.

@migurski
Member

migurski commented Jan 7, 2015

:concur:

@randymeech

@thatdatabaseguy has done some work for Mapzen over here: https://github.com/openvenues/address_normalizer

Seems similar, maybe he can take a look here & weigh in?

(This is a project that's pulling address & POI info from the web via the common crawl, which we will publish very soon. Probably relevant to openaddresses too).

@migurski
Member

migurski commented Jan 8, 2015

Huh, that sounds very much like the thing.

@ingalls
Member

ingalls commented Jan 8, 2015

@migurski @NelsonMinar I'm certainly open to new ways of looking at the abbreviated/expanded question, but my preference would be expanded.

  • Municipal governments don't always follow the standard USPS abbreviations. Court is usually Ct, but I've also seen it as Co.
  • The classic ambiguity between Drive vs. Doctor and Street vs. Saint
  • It's easier to abbreviate back into a preferred abbreviation scheme (see point 1)

The conform process is aimed at taking the weight off developers' shoulders, and as with format, I think having standardized expansions will make people's lives even easier.

I do like the idea of splitting it out into a library. I think we also need another option in conform that is something like expansion: usps or expansion: 'fake_county' to allow us to have multiple overlapping expansion types instead of just one giant list with possible conflicts.

If the expansion flag were not set, then expand.py would be skipped.
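To make that concrete, a conform block naming its expansion scheme might look like the sketch below; the expansion key and its values are hypothetical here, not an existing part of the schema.

```json
{
    "conform": {
        "type": "shapefile",
        "number": "ADDR_NUM",
        "street": "ST_NAME",
        "expansion": "usps"
    }
}
```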

What do you all think?

@NelsonMinar
Contributor Author

Welcome back @ingalls! I'm out of my depth when it comes to address localization, so I'm glad to have all the comments from people who actually put this data to use.

My main complaint is the code we currently have is not good enough to apply to all our data. I cringe at the idea that we might be formatting "Nord Mauerstraße" in Berlin as "North Mauer Street". (Made up example.) I feel strongly we shouldn't degrade data. So my first thought was just to stop trying to do any expansions.

But there's so much interest in doing good expansions that I'm newly encouraged! We can steer OpenAddresses towards doing expansions where we know we can do a good job and then improve our expansion capability. I imagine there has to be prior art from national post offices, etc., that we can use.

Specific proposal:

  1. Create a new GitHub project for street name expansion. Seed it with the JavaScript code in csv.js and the Python code in expand.py. Make them separately installable libraries with NPM/PyPI. Share a suite of test data so we ensure the two libraries behave consistently (a possible shared fixture format is sketched below).
  2. Remove the expansion code itself from openaddresses-conform and openaddresses-machine, include our libraries instead.
  3. Add localization to the expansion library. My first thought was to do it by region, but @ingalls's proposal of a conform option is also appealing.

I think the near-term outcome is to make the conversion code expand the output only for the US locale, where our current code works well. Then we can start expanding the code to other locales incrementally.
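One way to share a test suite between the JavaScript and Python libraries would be a plain JSON fixture that both test runners read; the format below is a hypothetical sketch, not an agreed-upon schema.

```json
[
    {"locale": "us", "input": "W ST SEBASTIAN ST", "expected": "West Saint Sebastian Street"},
    {"locale": "us", "input": "123 N MAIN AVE",    "expected": "123 North Main Avenue"}
]
```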

@ingalls
Member

ingalls commented Jan 8, 2015

@NelsonMinar This sounds great. One thing: should we bother maintaining two expansion libraries? I'm a huge nodejs fan, but in the interest of being able to iterate quickly we should probably stick with one language. I think writing this in Python to match machine is probably going to be best.

Plan:

  • Create country specific expansion files in new repo
  • Alter openaddresses/test.js to allow expand flag
  • Add expand flag to applicable sources
  • Get expand.py extracted => library
  • Switch machine to use new expand library

Note: the reason I favor an expand flag over a region is that we have overlapping data sources (i.e. one source for all of MA and then municipalities within MA); these could potentially use different expansion tokens.

@sbma44
Contributor

sbma44 commented Jan 8, 2015

👍 I like the idea of the expand flag.

@ingalls
Member

ingalls commented Jan 8, 2015

See openaddresses/expand for some starter expansion files that we can begin to work off. I propose that the name of the expand tag match 1:1 with the files in the /maps directory.

@albarrentine
Contributor

Hey guys - as @randymeech mentioned, I have created a library, https://github.com/openvenues/address_normalizer, which handles most or all of these cases without losing information.

At the moment we produce all the combinations (Dr expands to either Drive or Doctor, potentially to Dominican Republic as well depending on which gazetteers are being used) and give you back a few variants on the string without trying to resolve the ambiguity. This is suitable for search indexing and deduping, but not necessarily for displaying addresses and formatting. One option is to keep the original text as the display address but index the normalized form(s) in search/dbs. That's currently what we do in deduping (first surface form into the index wins). I've also been thinking about a statistical NLP approach which would label one candidate expansion as more likely than the others using a sequence model, just need to construct a small training set of ambiguous strings and their correct normalized forms.

Our library also does a lot more than just synonym expansion. It includes:

  • a lexer-based tokenizer which looks for multiple patterns simultaneously and classifies each token type as it goes through the text (that runs at C speed). Plays nicely with Unicode
  • normalization of diacritical marks (é => e, etc.) which can be turned off as you prefer (a minimal sketch of the general technique follows this list)
  • Multi-word numeric expression and ordinal normalization ("One twenty-first" => 121st)
  • An efficient trie data structure which allows us to look up millions of distinct multi-word phrases, e.g. from GeoNames, in a single pass through the text while using on the order of 100 MB of memory. This allows us to reason about address components found in text/webpages and helps with picking apart combined fields in certain data sets.
  • A (soon-to-be-implemented) slight variant on this trie allows us to expand German street suffixes which can either be written as a separate word or tacked onto the street name (Nord Mauerstraße => Nord Mauer straße or Nord Mauer strasse).
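The diacritic folding mentioned above can be sketched with the Python standard library alone; this shows the general technique, not address_normalizer's implementation.

```python
# A minimal sketch of accent folding (é -> e) using only the standard library.
import unicodedata

def strip_diacritics(text):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Hôtel-de-Ville Plaça"))  # -> "Hotel-de-Ville Placa"
```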

There's a new version that hasn't been pushed to Github yet which is written in (modern) C, default bindings in Python and easy to bind to other languages like Node, etc. The included gazetteers are just text files and can be easily edited collaboratively where it makes sense or you can BYO expansions or mix and match. I've also written some code to download the latest GeoNames and add all the alternate=>canonical names to the trie which helps with expanding city aliases e.g. NYC => New York City, ATL => Atlanta. They also have a lot of cross-lingual toponyms e.g. Milan/Milano which is very useful from an l10n/i18n perspective.

The currently deployed (mostly Python) code is running in production and is deduping millions of addresses across several data sources including OpenAddresses and OSM.

Feel free to review, use, fork or contribute to our repo (I'm renaming it libpostal for the new release). I'm hoping it can be shared by multiple organizations and combine best practices in address normalization, extraction and parsing. Let me know if that sounds useful/interesting to you guys.

@migurski
Member

migurski commented Jan 9, 2015

@thatdatabaseguy how would you characterize the stability and ease of installation of address_normalizer, on Mac 10.9 and 10.10, Python 2.7 and 3.3+, and Linux? What about the API? It looks like a great candidate for what we’re describing here.

@albarrentine
Contributor

@migurski great! Installation should be relatively painless (pip install https://github.com/openvenues/address_normalizer ought to do it; we'll get it into PyPI). You'll need a C compiler when you install it, clang/gcc. C compilation is currently being handled by setup.py. There's a Makefile in the new version, but for Python installations pip install will still work. There are no external C packages that have to be separately installed through a package manager, which is usually where complications arise. I'm building on Mac 10.9 and Ubuntu 14.04 currently. Python 2.7 definitely works; Python 3 I haven't tested yet, but we'll definitely support both.

As far as API, I would wait to check out the upcoming release. There are some changes on the Python side since some of the code has moved to C, but the API is fairly small as it stands and will remain as such. We'll send a heads up when that's ready and the API should be pretty stable after that.

@migurski
Member

migurski commented Jan 9, 2015

Sounds worth looking at. I'm not entirely sure how to use it; can you help? I’ve opened a ticket on the original project. Also, I don’t see any tests or anything, curious if we can somehow help with that?

@migurski
Member

migurski commented Jan 9, 2015

Also, @ingalls I like your plan. Seems like address_normalizer is in theory compatible, as an implementation of expansion.

@ingalls
Member

ingalls commented Jan 9, 2015

@migurski Sounds great! I'm completely for using address_normalizer if the install is painless and it fulfils our needs, no reason to reinvent the wheel.

@NelsonMinar
Contributor Author

I'm going on vacation in a few days and would like to reach some resolution. I think the consensus is to keep doing expansions of street names in the output product, and treat the expansion code as a first class product so it can be independently improved.

@NelsonMinar
Contributor Author

(Ugh, hit ctrl-Enter in GitHub early.)

I still volunteer to spin my Python code out into a separate library. I'll do this in a couple of days if I don't hear otherwise.

I like the idea of using address_normalizer instead as our expander, it's way more sophisticated than anything we're likely to come up with soon. It feels like the code isn't quite ready though; there's an upcoming release promised. Also I'm uncomfortable about the lack of any tests in the GitHub project. So I suggest we hold off a bit on switching to it, watch and wait for the project to mature. I'm glad to be wrong about that though. FWIW I got it working in a quick test and it seems pretty solid, at least for simple US English examples. The big thing we'd have to consider is it seems aimed at returning a large list of possible expansions (for matching alternatives) whereas we'd need to pick one single best expansion to publish.

@NelsonMinar NelsonMinar self-assigned this Jan 17, 2015
@albarrentine
Contributor

@NelsonMinar, most of the unit tests are in the new release, but I pushed some basic API tests to the Python version in the meantime, including most of what you guys had in expand.py plus some deduping test cases: https://github.com/openvenues/address_normalizer/blob/master/test.py. Happy to add more.

Displaying a single best expansion is a goal on our side as well. So there's a separate function for both use cases. I've renamed what was previously "normalize_street_address" to "expand_street_address" and have the normalize_street_address call return a single result rather than a set. The normalize calls currently just return one element from the set - real implementation will be in the upcoming release. I've derived a training set from OSM way names and addr tags, many of which already use expanded forms instead of abbreviations, to estimate probabilities of a particular expansion given the surface form's position, its neighboring tokens and which dictionaries they match. Some expansions can be ambiguous across languages e.g. you would want a phrase-initial "Dr." to normalize to "Doktor" in German while "Doctor" might be more statistically likely because there happen to be more addresses in English in your data set, so a country/language/locale hint may be advisable in a few cases for added precision. The vast majority of the time though simple maximum likelihood will do the trick.
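Based on the function names mentioned in this thread, usage on the OA side might look roughly like the sketch below; the module path and exact signatures are assumptions, so check the library's README for the real API.

```python
# Hypothetical usage based on the function names mentioned above;
# the real module path and signatures may differ.
from address_normalizer import expand_street_address, normalize_street_address

raw = "30 w 26th st"

candidates = expand_street_address(raw)   # set of possible expansions, for indexing/deduping
display = normalize_street_address(raw)   # a single "best" expansion, for display output
```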

@migurski
Member

@skorasaurus noticed that "CR" is not correctly interpreted as County Road in #1084; we may want to revisit this thread soonish.

@migurski
Member

migurski commented Sep 1, 2015

With the new regexp behavior, we might be able to move street name rewriting into the sources themselves.

@NelsonMinar
Contributor Author

I wouldn't want to implement the street name rewriting algorithms as regex! But the new function syntax does give us a way to specify processing on a per-source basis. We could define new functions for common rewriting and add them to the source specs. That would give us localization at least.
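For illustration, a per-source rewriting function in a source spec might look like the sketch below; expand_usa is an invented function name used only to show the shape of the idea, not an existing machine function.

```json
{
    "conform": {
        "type": "csv",
        "number": "HOUSENUM",
        "street": {
            "function": "expand_usa",
            "field": "FULLSTREET"
        }
    }
}
```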

@migurski
Member

migurski commented Sep 1, 2015

Yeah. Start with an America setting and go from there.

@migurski
Member

Seeing another example of a bad expansion, with “Dr Carlton B Goodlett Pl” expanded to “Drive Carlton B Goodlett Place” in San Francisco.

@riordan

riordan commented Jan 20, 2016

Might be worth taking a look at @thatdatabaseguy's successor to openvenues/address-normalizer, openvenues/libpostal. It has expansion models trained for a large number of languages.

@albarrentine
Contributor

@riordan I think for this particular problem, as @NelsonMinar has pointed out before, what's needed is a "single best" expansion, which is beyond libpostal's immediate capabilities (disambiguation is next on the roadmap though). We return a set of possible expansions, some of which may be grammatically nonsensical, but at least one will be correct. For search indexing or deduping that's fine, but for display it's problematic.

What if OA did the inverse of what it's doing now and abbreviated everything instead? That can be done deterministically and with no errors, while still handling duplicates. Thoughts?

@NelsonMinar
Contributor Author

I still think we ought to do nothing and just pass on the authoritative source data as-is. In some other discussion we were close to a consensus on providing two files: one unedited, and one cleaned up. Maybe in the cleaned-up one we make a best effort to rewrite things.

@iandees
Member

iandees commented Jan 20, 2016

Yep, I'm partial to @NelsonMinar's idea here. That implies that someone has to build out a cleanup pipeline though 😄.

@migurski
Member

Also agree with @NelsonMinar and @iandees. I think this ticket stands: we’ll not rewrite street names.

@riordan

riordan commented Jan 21, 2016

Time to begin the great cleanup.

@NelsonMinar
Contributor Author

Wait, it's over a year later and my opinion prevails? I win! :-) Seriously, I'm impressed with how our thinking has evolved on source data. And the value of cleaning data in a more comprehensive step. Thanks in particular to having users.

As my first post says, I think removing our current expansion is as simple as removing all calls to expand.py. I'm sure tests will need updating too. The other project is to make a cleanup script and change the packaging to produce two data files. Not sure when that happens.

@migurski
Member

First we ignored you
then we laughed at you
then we fought you
then you won.

@migurski
Member

So, this is the big change in expectations: 0eaf1b1

I would love input from @trescube, @feomike, @riordan, @kgudel, @dianashk, @sbma44, and anyone else who might be affected by this major modification to the output files.

@feomike

feomike commented Jan 21, 2016

thanks @migurski - assessing ....

@sbma44
Contributor

sbma44 commented Jan 21, 2016

I think this is the right direction to move in. I'm unsure of the effect it will have on our systems, but our pipeline doesn't auto-update anyway -- we'll keep this in mind for the next time we pull a fresh OA CSV.

@migurski
Member

👍

One of the ideas we talked about was doing post-processing on the batch download files: de-duping, scrubbing, and perhaps the formatting and name expansion that’s been removed here.

@kgudel

kgudel commented Jan 22, 2016

We at the CFPB like this transition towards not altering street names, as it brings the data closer to the authoritative source while not reducing its effectiveness for our use case.

@riordan

riordan commented Jan 22, 2016

Echoing @kgudel's point (but on behalf of the Pelias team). It's great (and critical) that OA be a place to get "Authoritative Data".

But it's also then a burden that we pass onto any other consumers of the data, particularly those who don't have the knowledge or resources to do their own analysis or processing to make more sense of it. Plus the OpenAddresses project, by being a collective effort already, is in a great place to pull together knowledge of how to interpret different address formats from the downstream users, and it would be a shame for that to go to waste.

I'm a fan of @NelsonMinar's proposal (#283) for there to be the "Authoritative" files (out.csv) and then for there to be an additional enhance step, with enhanced versions that pull together the OA interpretation from its downstream consumers (where it makes sense to do so). That way we can all use the authoritative data, but pool what we know about making sense of it.

@migurski
Member

We’ve talked about applying the expansions (and other post-processes) to only the collected files, available in big zip files at http://results.openaddresses.io. Would the availability of untouched data at the individual source level still work for CFPB, @kgudel?

E.g. this would be post-processed: http://data.openaddresses.io/openaddr-collected-us_northeast.zip
But this would not: http://data.openaddresses.io/runs/20063/us/ny/statewide.zip

@feomike

feomike commented Jan 22, 2016

@migurski yup, that should be OK w/ us. cc @kgudel. Thanks for this discussion, and more importantly thanks for reaching out to ask; it matters a bunch.

@migurski
Member

This pull is basically ready: #281.

It stops rewriting street names during batch set and CI jobs, and moves name expansion to periodic collections instead. The end result for users of large downloads will be identical, but it provides a first step toward the CSV enhancer idea.
