The list of family medicine offices in Bucharest with approximate coordinates
The output.json file contains information about the family medicine doctors in Bucharest, together with geolocation information. This file will contain the most recent list of family medicine doctors.
It will have the following structure:
data = [
{
"title": str, # str
"description": [str], # list of str
"latitude": float, # float
"longitude": float # float
},
...
]
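As an illustration of the structure above, here is a minimal sketch of how a consumer might load the file and keep only well-formed records. It assumes the file is a top-level JSON array; the helper names validate_record and load_offices are hypothetical and are not part of the parser.

```python
import json
from pathlib import Path


def _is_number(value) -> bool:
    # JSON may store a whole-number coordinate as an int; bool is excluded on purpose.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_record(record: dict) -> bool:
    """Return True if the record has the four documented fields with the documented types."""
    description = record.get("description")
    return (
        isinstance(record.get("title"), str)
        and isinstance(description, list)
        and all(isinstance(line, str) for line in description)
        and _is_number(record.get("latitude"))
        and _is_number(record.get("longitude"))
    )


def load_offices(path: str = "output.json") -> list:
    """Load the list of offices (assumed to be a JSON array) and drop malformed records."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return [record for record in data if validate_record(record)]
```

Calling load_offices() from the project root would read ./output.json; records with missing or mistyped fields are silently skipped in this sketch.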
We will use simple versioning for the code and also the output files. The releases will be tagged first with v1, v2, v3 etc. Before we start working on a new version for the parser, we will save the output in the ./.archive/ folder, in a newly created corresponding version subfolder.
This is a YOLO structure whose purpose is to keep older versions in the git repository. The files are pretty small, so the storage cost is low, and it seems worth paying to be sure we will always have the data available.
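The archiving step described above can be sketched as follows. This is only an illustration: the function archive_release is hypothetical, the repository does not necessarily automate this step, and the list of files to copy is an assumption.

```python
import shutil
from pathlib import Path


def archive_release(version: str, files: list, root: str = ".") -> Path:
    """Copy the given files into <root>/.archive/<version>/ and return that folder."""
    target = Path(root) / ".archive" / version
    # Fail if the version folder already exists, so an archived release is never overwritten.
    target.mkdir(parents=True, exist_ok=False)
    for name in files:
        shutil.copy2(Path(root) / name, target / name)
    return target
```

For example, archive_release("v2", ["input.xlsx", "output.json"]) would create ./.archive/v2/ and copy both files into it, raising FileExistsError if that version was already archived.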
- .cache contains cache from previous runs. If you specify the --cache param when you run the script, you will use the data in the cache if available, but also update it at the end of the run;
- .archive contains a history of results after running the parser. In a folder called v1, v2, etc. we will store the source file and the outputs generated by running the parser. We will not keep the files of the parser, but each version folder will correspond to a tagged release of the script;
- we keep the current source and outputs at the root of the project.
.cache/
|-- addresses_cache.json
|-- coordinates_cache.json
.archive/
|-- v1/
| |-- 20230721_Lista cabinete medicina de familie_20.07.2023
| |-- input.xlsx
| |-- output.json
|-- v2/
| |-- ...
20240401_Lista cabinete medicina de familie_01.04.2024
index.html
input.xlsx
output.json
geocode_medical_addresses.py
...
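The --cache behaviour described above (use cached data if available, then update the cache at the end of the run) could look roughly like the sketch below. The internal format of addresses_cache.json and coordinates_cache.json is not documented here, so a flat JSON object is assumed, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_cache(path: Path) -> dict:
    """Return the cached mapping, or an empty dict when the cache file does not exist."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_cache(path: Path, cache: dict) -> None:
    """Write the cache back to disk, creating the .cache folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8")


def cached_lookup(key: str, cache: dict, compute):
    """Return cache[key], calling compute(key) only when the key is missing."""
    if key not in cache:
        cache[key] = compute(key)
    return cache[key]
```

A run would call load_cache once at startup, route every lookup through cached_lookup, and call save_cache once at the end.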
The source list is neither consistent nor in a proper format. This is why we will start with separate parsers, which can later be merged if needed. It's also the reason why we store the source in this repo.
These are some examples of how to run the script:
python geocode_medical_addresses.py
python ./geocode_medical_addresses.py --addresses --geocodes --excel --json --cache
python ./geocode_medical_addresses.py --addresses --geocodes --excel --json --cache --dev
Main source:
Here are some ideas about how to handle the newly downloaded files:
- we keep the filename as close to the source as possible;
- before starting, remember to create a release for the parser and also save the current output in the ./.archive folder;
- make a minimal cleanup in the file (remove the formatting, remove the headers from the file), using a previous source file as a model.
We use OSM and Nominatim to get the coordinates for each address. If an address is not found automatically, go to the Nominatim website, search for the address that caused the error, manually find something close, then update the manual_address column in the Excel file (./input.xlsx).
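The lookup with a manual fallback described above can be sketched as follows. This is not the code of geocode_medical_addresses.py: the function names are hypothetical, the User-Agent value is a placeholder, and it is an assumption that a filled-in manual_address takes precedence over the original address. The request uses the public Nominatim search endpoint with format=json, which returns a list of results carrying lat and lon as strings.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def fetch_nominatim(query: str) -> list:
    """Query the public Nominatim search endpoint and return the parsed JSON list."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "limit": 1})
    request = urllib.request.Request(
        f"{NOMINATIM_URL}?{params}",
        # Nominatim's usage policy asks for an identifying User-Agent; this one is a placeholder.
        headers={"User-Agent": "family-medicine-geocoder-example"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def geocode(address: str, manual_address: Optional[str] = None, fetch=fetch_nominatim):
    """Return (latitude, longitude) or None; manual_address, when set, replaces address (assumption)."""
    query = manual_address or address
    results = fetch(query)
    if not results:
        return None  # not found: a candidate for a manual_address entry in input.xlsx
    return float(results[0]["lat"]), float(results[0]["lon"])
```

The fetch parameter exists only so the lookup can be exercised without network access. When calling the public Nominatim server, keep to at most one request per second, as its usage policy requires.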