A Data-Driven Approach to Waterpoint Functionality and Accessibility

Business understanding

Objective: The primary business goal is to predict the operational status of waterpoints in Tanzania, classifying them into one of two categories:

Functional - Fully operational and providing water as intended.
Non-functional - Not operational and failing to provide water.

Why It Matters:

Efficient Resource Allocation: Predicting waterpoint conditions allows stakeholders (The Tanzanian Ministry of Water, NGOs) to allocate resources efficiently by prioritizing repairs and maintenance.
Improved Accessibility to Water: Ensuring functional water points in rural and urban areas is crucial for providing clean water to communities, improving public health, and reducing waterborne diseases.
Cost Optimization: Proactively predicting failures can prevent unnecessary repair costs, and funds can be channeled to more critical interventions.
Sustainability: Understanding patterns in failures helps in better planning for future installations, improving waterpoint durability.

Key Stakeholders:

Government Agencies (Tanzanian Ministry of Water): Responsible for policy-making and infrastructure maintenance.
NGOs and Charities: Focused on improving access to clean water.
Local Communities: Beneficiaries of operational waterpoints.
Donors and Funders: Interested in the impact of their investments in water infrastructure.

Problem Description: The dataset provides various features that capture the physical attributes, installation details, geographic location, and usage patterns of the water points. These variables can help answer questions such as:

What factors contribute most to waterpoint failures?
Are certain geographic areas more prone to non-functional water points?
How do management and funding affect waterpoint longevity?
Can the age of a waterpoint (construction year) predict its condition?

Expected Outcome:

A machine learning model that:

Accurately predicts the operational status of a waterpoint.
Identifies the key factors affecting functionality.
Provides actionable insights for maintaining and repairing water points.

Potential Challenges:

a) Data Quality: Missing or inconsistent entries in key variables (e.g., funder, construction year).

b) Class Imbalance: If most water points are functional, predicting minority classes (e.g., non-functional) can be challenging.

c) Geographic and Temporal Variability: Different regions may have unique issues (e.g., drought, poor maintenance) that complicate predictions.

d) Interpretability: Translating complex model outputs into actionable insights for stakeholders.

Key Deliverables:

Classification Model: A robust algorithm to predict waterpoint status.
Feature Analysis Report: Insights into which factors most influence waterpoint conditions.
Dashboard or Visualization Tools: To enable stakeholders to view and act on predictions and insights.

Data Understanding

The data used in this project comes from the Taarifa competition hosted by DrivenData: Driven Data - Tanzanian Water Wells

The data contained no duplicates, but it did have missing values, most of them in categorical columns. We imputed the missing values rather than dropping the affected rows, so that the remaining information in those rows could still be used in the exploratory data analysis and modeling sections.
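The source does not say which imputation strategy was used. The following is a minimal sketch, assuming each categorical gap is filled with the most frequent value of its column. The toy frame and its values are illustrative only and are not taken from the project data.

```python
import pandas as pd

# Toy frame standing in for the real data; all values are illustrative.
df = pd.DataFrame({
    "funder": ["Danida", None, "Hesawa", "Danida"],
    "installer": ["DWE", "DWE", None, "Commu"],
    "gps_height": [1390, 0, 686, 263],
})

# Check for duplicate rows before imputing.
n_duplicates = int(df.duplicated().sum())

# Assumption: mode imputation for the categorical columns, so rows with
# partial information are kept rather than dropped.
categorical_cols = ["funder", "installer"]
for col in categorical_cols:
    most_frequent = df[col].mode().iloc[0]
    df[col] = df[col].fillna(most_frequent)

n_missing = int(df[categorical_cols].isna().sum().sum())
print(n_duplicates, n_missing)
```

Filling with an explicit placeholder such as "unknown" is a common alternative when missingness itself may carry signal.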

Exploratory Data Analysis🔍🧐

Visualized the distribution of water wells and their status: Tableu_Viz

Visualized water quality across the different water wells: Tableu_Viz

Visualized the key stakeholders who funded the establishment of the water pumps

The bar chart displays the functionality status of waterpoints across different funders. Among the funders, the Government of Tanzania is the largest contributor, supporting over 8,000 waterpoints, with more than half (4,663) being non-functional. Danida and Hesawa follow as significant contributors, with Danida showing a higher proportion of non-functional water points (1,242 functional vs. 1,713 non-functional). World Vision and World Bank have a relatively balanced split between functional and non-functional waterpoints, while smaller contributors like Private Individuals and Unicef demonstrate a higher number of functional waterpoints compared to non-functional ones. This suggests that while the Government of Tanzania contributes heavily to waterpoints, the functionality rate under its funding is relatively lower compared to other funders.

To determine whether an association exists between the funder and the functionality of a water pump, a chi-square test of independence was performed.

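# Chi-square test of independence between funder and status group
import pandas as pd
from scipy import stats

# plot_data is the DataFrame prepared earlier in the notebook
crosstab = pd.crosstab(plot_data['funder'], plot_data['status_group'])
chi2, p_value, dof, expected = stats.chi2_contingency(crosstab)
print(chi2, p_value, dof)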

Outputs:

A chi-square statistic of 902.89 with 9 degrees of freedom and an extremely small p-value (1.47e-188) indicate a statistically significant association between the two categorical variables, funder and status group.
The largest contributors to the Chi-square statistic are likely the Government of Tanzania, DANIDA, and Private Individuals, given their high absolute counts and discrepancies between functional and non-functional water points.
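Which funders drive the statistic can be checked directly rather than inferred: each cell contributes (observed - expected)^2 / expected to the total. The sketch below uses made-up counts for three hypothetical funders and plain Python so the arithmetic stays visible. It does not reproduce the project's figures.

```python
# Hypothetical counts: funder -> (functional, non_functional).
observed = {
    "Funder A": (30, 70),
    "Funder B": (50, 50),
    "Funder C": (70, 30),
}

row_totals = {funder: sum(counts) for funder, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

# Per-funder contribution: sum over status columns of (O - E)^2 / E,
# where E is the count expected if funder and status were independent.
contributions = {}
for funder, counts in observed.items():
    total = 0.0
    for j, obs in enumerate(counts):
        expected = row_totals[funder] * col_totals[j] / grand_total
        total += (obs - expected) ** 2 / expected
    contributions[funder] = total

chi2_total = sum(contributions.values())
for funder, value in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{funder}: {value:.2f} ({value / chi2_total:.0%} of chi-square)")
```

With a real crosstab, the same contributions can be computed from the `expected` array that `stats.chi2_contingency` returns.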

Insights:

Funders like the Government of Tanzania and HESAWA may require further analysis to understand why their non-functional water points are so high despite large investments.
Private Individuals might benefit from technical or financial support to improve their success rate in maintaining functional water points.
Focusing on funders with a higher proportion of functional water points (e.g., DANIDA) could offer insights into best practices.

Modeling

For this section, since we were working with a classification problem the following models were used:

Logistic Regression
Decision Trees
Convolutional Neural Network (CNN)

Why This Combination of Models?

Using a combination of these models ensures a robust analysis:

Baseline to Advanced Progression: Logistic regression serves as a baseline, decision trees handle non-linear relationships, and CNNs explore complex spatial dependencies.
Interpretability vs. Complexity Trade-off: Logistic regression and decision trees offer interpretability, which is crucial for stakeholder understanding, while CNNs provide advanced, potentially higher-performing solutions.
Flexibility: This variety ensures the best fit for the dataset, as the characteristics of the data will determine which model performs best.

This approach balances simplicity, interpretability, and advanced techniques, providing a comprehensive analysis of water pump functionality.

Model Comparison

Since our response variable contains imbalanced classes, we balanced them using SMOTE-NC (SMOTE for Nominal and Continuous features), which is designed for data that mixes categorical and continuous variables; most of the variables in our data are categorical. We use the F1 score as the main metric for how well each model predicts the classes, alongside other evaluation techniques. Fine-tuning was necessary to optimize each model while addressing issues such as overfitting.
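For reference, the F1 score is the harmonic mean of precision and recall, so it stays low when a model neglects the minority class even if accuracy looks good. The sketch below computes it from raw counts for a single positive class. The counts are hypothetical, and the source does not state whether the reported scores are binary, macro, or weighted averages.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for one positive class from true/false positive and false negative counts."""
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
balanced = f1_from_counts(tp=80, fp=20, fn=20)    # precision 0.8, recall 0.8
low_recall = f1_from_counts(tp=20, fp=5, fn=80)   # precision 0.8, recall 0.2
print(round(balanced, 3), round(low_recall, 3))
```

The second case shows why F1 suits imbalanced data: precision is unchanged, but the poor recall pulls the score down sharply.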

The bar chart compares the F1 scores of four models used in the classification of water pump functionality: Logistic Regression, Decision Trees (Gini and Entropy), and a Convolutional Neural Network (CNN).

The Decision Tree (Gini) model achieved the highest F1 score of 73.59%, slightly outperforming the Decision Tree (Entropy) model, which had an F1 score of 72.90%.
The CNN model followed closely with an F1 score of 72.26%, demonstrating strong performance despite its complexity.
The Logistic Regression model had the lowest F1 score at 70.10%, indicating it may not handle non-linear relationships as effectively as the other models.

Overall, the Decision Tree models performed the best, with Gini slightly outperforming Entropy, while the CNN showed competitive results, making it a viable option for more complex patterns. Logistic Regression serves as a good baseline but falls short compared to the others.
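The two decision tree variants differ only in the impurity measure used to score candidate splits. As a small illustration (not the project's code), both measures can be computed from the class proportions in a node. If scikit-learn was used, which the source does not confirm, this choice corresponds to the `criterion` parameter of `DecisionTreeClassifier`.

```python
import math


def gini(proportions):
    """Gini impurity: 1 - sum(p_i^2). Zero for a pure node."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits: sum(-p_i * log2(p_i)). Zero for a pure node."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


# Class proportions (functional, non-functional) in three example nodes.
for node in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(node, round(gini(node), 3), round(entropy(node), 3))
```

Both measures peak for an evenly mixed node and fall to zero for a pure one, which is consistent with the two variants scoring so closely here.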

Overall Comparison: Models with higher AUC values are better at distinguishing between classes.

Tree_Gini and Tree_Entropy, with the highest AUC values (0.78), are the best-performing models.
Logistic regression is the weakest among the models but still acceptable for classification tasks.
The CNN model is competitive but slightly underperforms compared to the decision tree models.
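An AUC of 0.78 can be read as the probability that the model scores a randomly chosen positive example above a randomly chosen negative one. The sketch below computes AUC through that pairwise definition on a handful of made-up scores. It is meant to illustrate the metric, not to reproduce the reported values.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = positive class) and predicted scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of examples, so it is only suitable for small illustrations; library implementations use a sort-based computation instead.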

Based on the F1 score comparison and the ROC AUC curves, we chose the decision tree with the Gini impurity criterion as the best model.🥳🥳

To evaluate how the best model performed, we use a confusion matrix.

Interpretation:

Strengths: The model performs well in predicting functional waterpoints (high true negative rate, 86.6%).
Weaknesses: It struggles more with predicting non-functional waterpoints, as indicated by a lower recall (69.7%) and a correspondingly high false negative rate (30.3%).
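The rates above follow directly from the four cells of the confusion matrix, treating non-functional as the positive class. The counts in this sketch are hypothetical and were chosen only so that the rates match the reported percentages. The actual cell counts are not given in the source.

```python
# Hypothetical confusion-matrix counts (positive class = non-functional).
tn, fp = 866, 134   # functional waterpoints: correctly / wrongly classified
fn, tp = 303, 697   # non-functional waterpoints: missed / caught

true_negative_rate = tn / (tn + fp)    # share of functional points identified
recall = tp / (tp + fn)                # share of non-functional points caught
false_negative_rate = fn / (fn + tp)   # share of non-functional points missed

print(f"TNR={true_negative_rate:.1%} recall={recall:.1%} FNR={false_negative_rate:.1%}")
```

Recall and the false negative rate always sum to 1, so the 30.3% figure restates the 69.7% recall rather than describing a separate weakness.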

We can also investigate the top ten most important features in our model, to help our stakeholders capitalize on them to improve pump functionality.

Summary of Insights

Water-related metrics (quantity_enough, quantity_insufficient, quantity_seasonal, and amount_tsh) dominate as predictors, highlighting the importance of consistent water supply and flow.
Geographic and demographic factors (gps_height, population, basin_Lake Victoria) suggest that environmental and community characteristics play a secondary role.
The age of the infrastructure (decades) reflects the need for maintenance and modernization.
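How the decade feature was derived is not shown in this README. One plausible construction, offered as an assumption, bins the construction year into decades and treats a year of 0 or a missing value as unknown. The function name `to_decade` is hypothetical.

```python
def to_decade(construction_year):
    """Map a construction year to a decade label such as '1980s'."""
    # Assumption: 0 or None marks a missing construction year.
    if not construction_year:
        return "unknown"
    return f"{(construction_year // 10) * 10}s"


for year in (1987, 2010, 0, None):
    print(year, "->", to_decade(year))
```

Keeping an explicit "unknown" bucket avoids silently dropping waterpoints whose construction year was never recorded.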

Recommendations

Water Management: Prioritize consistent water availability, as the most critical features relate to quantity and seasonality.
Maintenance Plans: Develop targeted maintenance programs for older wells to improve their functionality.
Regional Focus: Investigate wells in the Lake Victoria basin or at extreme elevations to address location-specific challenges.
Design Optimization: Evaluate the least functional water point types under "other" and address design or maintenance shortcomings.
Population Alignment: Ensure wells are appropriately scaled for the population they serve.

Links

Slides Presentation

For queries or support, reach out via:

Email📩: [email protected]

LinkedIn: Savins Nanyaemuny

Thanks for your time🙂‍↕️
