This repository has been archived by the owner on Sep 2, 2024. It is now read-only.
Tags: ONSdigital/address-index-data
Feature/extracting sao information (#54)

* Derived the parser post-processing steps to capture SAO number and suffix information.
* Small improvements: allowed additional characters in the identification of flat range and building number, and changed the criteria.
* Added a new clause and improved the clauses with a single component so that they do not include ranges.
* Fixed the case where a search for “65A” also accidentally matched both components of “55A-65A”. Negative lookahead and lookbehind now rule out the latter.
* Simplified the parsing and pushed the modifications to the post-processing, which operates on data frames rather than inside the loop.
* Created a new class for the address parser that holds all the pre- and post-processing steps.
* A new class that can be used to do normalisation, probabilistic parsing, and post-processing.
* AddressParser class is now complete and linking tests pass.
* Simple script to use the SAO suffix data as a test set.
* Modified the local hybrid index creation so that it uses out-of-core computation (dask).
* A few modifications to the address parser to reflect what was done inside the address linker.
* Updated the address linker to use the address parser class.
* Moved the creation of the final hybrid index from address linking to the data file. Added a function to create a test index for testing the linking code.
* A quick fix, as test_index_uprns is a numpy array rather than a pandas data frame.
* Documentation changes to the address linking.
* Small bug fix to the parsing post-processing logic. Changed the address linking UPRN type to float64 to support comparisons and missing values.
* Some improvements to the matching logic.
* Added an extraction step to the parser post-processing which identifies numbers from street names. This happens for messy inputs where the street hasn’t been entered correctly.
* Added a new blocking rule to the matching engine.
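
The “65A” versus “55A-65A” fix mentioned in the commit message can be illustrated with a minimal sketch. The repository's actual regular expression is not shown here, so the pattern and the names `NUMBER_SUFFIX` and `extract_number_and_suffix` are hypothetical, chosen only to show how a negative lookbehind and a negative lookahead can reject the components of a range:

```python
import re

# Hypothetical pattern, not the one used in the repository.
# (?<![\w-])  the number must not be preceded by a word character or hyphen
# (\d+)       the number itself, e.g. 65
# ([A-Z])     a single-letter suffix, e.g. A
# (?![\w-])   the suffix must not be followed by a word character or hyphen
NUMBER_SUFFIX = re.compile(r"(?<![\w-])(\d+)([A-Z])(?![\w-])")


def extract_number_and_suffix(address):
    """Return (number, suffix) for a standalone token such as '65A'.

    Returns None when no standalone token is found, including when the
    only candidates are parts of a range such as '55A-65A'.
    """
    match = NUMBER_SUFFIX.search(address.upper())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(extract_number_and_suffix("Flat 65A High Street"))
print(extract_number_and_suffix("55A-65A High Street"))
```

In this sketch the lookahead rejects “55A” because it is followed by a hyphen, and the lookbehind rejects “65A” because it is preceded by one, so a range yields no match and can be handled by a separate clause.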