This repository has been archived by the owner on Sep 2, 2024. It is now read-only.

Tags: ONSdigital/address-index-data

beta-v0.3

Adds synonyms to the mapping (#63)

beta-v0.1

Feature/extracting sao information (#54)

* Deriving the parser post-processing steps to capture SAO number and suffix information.

* Small improvements - allowed additional chars in the identification of flat range and building number, and changed the criteria.

* Added a new clause and tightened the single-component clauses so that they no longer match ranges.

* Fixed the case where searching for “65A” also accidentally matched both parts of a range such as “55A-65A”. Negative lookahead and lookbehind are now used to rule out the range matches.

* Simplified the parsing and pushed the modifications to the post-processing that operates on data frames rather than inside the loop.

* Created a new class for the address parser that holds all the pre- and post-processing steps.

* A new class that can be used to do normalisation, probabilistic parsing, and post-processing.

* AddressParser class is now complete and linking tests pass.

* Simple script to use the sao suffix data as a test set.

* Modified the local hybrid index creation so that it uses out-of-core computation (dask).

* A few modifications to the address parser to reflect what was done inside the address linker.

* Updated the address linker to use the address parser class.

* Moved the creation of the final hybrid index from address linking to the data file. Added a function to create a test index for testing the linking code.

* A quick fix, as test_index_uprns is a NumPy array rather than a pandas data frame.

* Documentation changes to the address linking.

* Small bug fix to the parsing post-processing logic. Changed the address linking UPRN type to float64 to support comparisons and missing values.

* Some improvements to the matching logic.

* Added an extraction step to the parser post-processing which identifies numbers from street names. This happens for messy inputs where the street hasn’t been entered correctly.

* Added a new blocking rule to the matching engine.
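
The lookaround fix described above (finding “65A” without also matching inside “55A-65A”) can be illustrated with a minimal sketch. This is not the repository's code: the helper name `find_standalone` and the exact character class are assumptions chosen for illustration.

```python
import re
from typing import List


def find_standalone(token: str, text: str) -> List[str]:
    """Return matches of `token` that are not part of a hyphenated range.

    Hypothetical helper for illustration only; the real parser's patterns
    are not shown in the commit message.
    """
    # (?<![\w-]) : the token must not be preceded by a word character or hyphen
    # (?![\w-])  : the token must not be followed by a word character or hyphen
    pattern = re.compile(rf"(?<![\w-]){re.escape(token)}(?![\w-])")
    return pattern.findall(text)


print(find_standalone("65A", "FLAT 65A HIGH STREET"))
print(find_standalone("65A", "55A-65A HIGH STREET"))
```

In this sketch the first call finds the standalone “65A”, while the second returns nothing because the hyphen before “65A” trips the negative lookbehind.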
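
The switch of the UPRN type to float64 mentioned above can be sketched as follows. The function name `count_correct` and the sample values are hypothetical. The sketch assumes UPRNs are integers small enough to be stored exactly in float64 (any integer below 2**53 is), and relies on the standard behaviour that NaN never compares equal to anything, so unmatched addresses are never counted as correct.

```python
import numpy as np
import pandas as pd


def count_correct(matched, expected) -> int:
    """Count positions where the matched UPRN equals the expected UPRN.

    Hypothetical helper for illustration; accepts lists, NumPy arrays or
    pandas Series. Missing UPRNs are represented as NaN.
    """
    matched = pd.Series(matched, dtype="float64")
    expected = pd.Series(expected, dtype="float64")
    # NaN == NaN evaluates to False, so missing matches do not count.
    return int((matched == expected).sum())


matched = np.array([100012345678, np.nan, 100087654321])
expected = [100012345678, np.nan, 100087654320]
print(count_correct(matched, expected))
```

Passing a NumPy array for one argument and a plain list for the other works here because both are converted to a pandas Series before comparison.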