Commit
🔨 Clustering: bug fixes and improved Airflow logs (#1223)
* clustering: bug fixes and improved Airflow logs
* clustering: TODO for a future PR
* clustering: improved function + tests for df display
* replaced print with logger.info
1 parent ad883db, commit 9a6f805
Showing 9 changed files with 209 additions and 28 deletions.
dags/cluster/tasks/business_logic/cluster_acteurs_df_sort.py
57 changes: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
import pandas as pd


def cluster_acteurs_df_sort(
    df: pd.DataFrame,
    cluster_fields_exact: list[str] = [],
    cluster_fields_fuzzy: list[str] = [],
) -> pd.DataFrame:
    """Sort function for an acteurs DataFrame
    to make clusters easier to visualize.

    The function can work on any state
    of an acteurs DataFrame (selection, normalization,
    clustering).

    It is therefore used throughout the Airflow
    clustering DAG to better visualize how
    clusters are built up from task to task.

    Args:
        df (pd.DataFrame): acteurs DataFrame
        cluster_fields_exact (list[str], optional): List of exact-match fields
            for clustering. Defaults to [].
        cluster_fields_fuzzy (list[str], optional): List of fuzzy-match fields
            for clustering. Defaults to [].

    Returns:
        pd.DataFrame: sorted acteurs DataFrame
    """

    # Build a list of sort fields
    # from default fields (e.g. cluster_id)
    # and fields specified in the DAG config
    sort_ideal = ["cluster_id"]  # the basis of clustering

    # to spot clustering errors quickly (e.g. intra-source),
    # but not added for the selection and normalization steps
    # because it breaks our ordering (there is no cluster_id, so
    # we prefer sorting by business semantics rather than codes)
    if cluster_fields_exact or cluster_fields_fuzzy:
        sort_ideal += ["source_code", "acteur_type_code"]
    sort_ideal += [x for x in cluster_fields_exact if x not in sort_ideal]
    sort_ideal += [x for x in cluster_fields_fuzzy if x not in sort_ideal]
    # default when there are no clustering fields (selection step)
    sort_ideal += [
        x for x in ["code_postal", "ville", "adresse", "nom"] if x not in sort_ideal
    ]

    # Then build the actual list of sort fields
    # against the columns really present in the DataFrame,
    # taking the ideal order "as best we can" and appending
    # whatever remains in the df
    sort_actual = [x for x in sort_ideal if x in df.columns]
    sort_actual += [
        x for x in df.columns if x not in cluster_fields_exact and x not in sort_actual
    ]
    return df.sort_values(by=sort_actual, ascending=True)[sort_actual]
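To illustrate how the sort order changes between DAG stages, here is a minimal sketch that calls the function on a toy DataFrame. The sample rows and the variable names `df_selection`, `sorted_selection` and `sorted_clusters` are made up for illustration and do not come from the commit. The function body is a condensed copy of the diff above (with `None` defaults instead of `[]`, which should not change behavior) so that the example runs standalone.

```python
import pandas as pd


def cluster_acteurs_df_sort(df, cluster_fields_exact=None, cluster_fields_fuzzy=None):
    """Condensed copy of the function from the diff, for a standalone example."""
    exact = cluster_fields_exact or []
    fuzzy = cluster_fields_fuzzy or []
    sort_ideal = ["cluster_id"]
    if exact or fuzzy:
        sort_ideal += ["source_code", "acteur_type_code"]
    sort_ideal += [x for x in exact if x not in sort_ideal]
    sort_ideal += [x for x in fuzzy if x not in sort_ideal]
    sort_ideal += [
        x for x in ["code_postal", "ville", "adresse", "nom"] if x not in sort_ideal
    ]
    # Keep only the ideal fields that exist, then append the remaining columns
    sort_actual = [x for x in sort_ideal if x in df.columns]
    sort_actual += [x for x in df.columns if x not in exact and x not in sort_actual]
    return df.sort_values(by=sort_actual, ascending=True)[sort_actual]


# Hypothetical sample data (not taken from the repository)
df = pd.DataFrame(
    {
        "nom": ["b", "a", "c"],
        "ville": ["Paris", "Laval", "Paris"],
        "code_postal": ["75001", "53000", "75001"],
        "source_code": ["s2", "s1", "s1"],
        "cluster_id": ["c2", "c1", "c2"],
    }
)

# Selection stage: no cluster_id yet and no clustering fields,
# so the address-like defaults drive the order.
df_selection = df.drop(columns=["cluster_id"])
sorted_selection = cluster_acteurs_df_sort(df_selection)
print(list(sorted_selection.columns))
# ['code_postal', 'ville', 'nom', 'source_code']

# Clustering stage: cluster_id comes first, then source_code,
# then the exact and fuzzy fields.
sorted_clusters = cluster_acteurs_df_sort(
    df, cluster_fields_exact=["code_postal"], cluster_fields_fuzzy=["nom"]
)
print(list(sorted_clusters.columns))
# ['cluster_id', 'source_code', 'code_postal', 'nom', 'ville']
```

Note that the returned DataFrame also has its columns reordered to match the sort keys, so the most relevant fields appear leftmost when the frame is logged.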