GitHub - ZJUEarthData/Geochemistrypi: an open-sourced highly automated machine learning Python framework for data-driven geochemistry discovery

Online Documentation: https://geochemistrypi.readthedocs.io

Source Code: https://github.com/ZJUEarthData/geochemistrypi

Geochemistry π is an open-source, highly automated machine learning Python framework dedicated to building an MLOps level 1 software product for data-driven geochemistry discovery on tabular data.

Our goal: one data-mining run in 5 minutes, ten data-mining runs in 10 minutes.

Core capabilities are:

Continuous Training
Machine Learning Lifecycle Management
Model Inference

Key features are:

Easy to use: The data-mining process is automated, and users only choose from simple numbered options.
Extensible: New algorithms can be added through Scikit-learn, with automatic hyperparameter search by FLAML and Ray.
Traceable: It integrates MLflow to build a dedicated storage mechanism that streamlines the end-to-end machine learning lifecycle.

Latest Update: Star and Watch our GitHub repository to receive email notifications about the newest features automatically.

Note: A chatbot driven by a multi-agent system is available via the blue button in the bottom-right corner of the Online Documentation.

The following figure is the simplified overview of Geochemistry π:

The following figure is the frontend-backend separation architecture of Geochemistry π:

If the software contributes to your research, please cite the work as:

ZhangZhou J*, He Can*, Sun Jianhao, Zhao Jianming, Lyu Yang, Wang Shengxin, Zhao Wenyu, Li Anzhou, Ji Xiaohui. Geochemistry π: Automated machine learning python framework for tabular data (2024). Geochemistry, Geophysics, Geosystems, 25, e2023GC011324

Download link: https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2023GC011324

Related report:

Geochemistry π was selected as an Editor’s Highlight in Eos magazine by the American Geophysical Union (fewer than 2 percent of papers are selected) and was quoted in Geochemical NEWS by the Geochemical Society.

Eos Website: https://eos.org/editor-highlights/machine-learning-for-geochemists-who-dont-want-to-code

Video Demo

Get an overview of how our software can accelerate your data-mining experiments.

Geochemistry π v0.7.0 Introduction Video [Bilibili] | [YouTube]
Geochemistry π v0.7.0 for Regression Demo [Bilibili] | [YouTube]
Geochemistry π v0.7.0 for Classification Demo [Bilibili] | [YouTube]
MLflow UI user guide - Geochemistry π v0.5.0 [Bilibili] | [YouTube]
Geochemistry π - Download and Run the Beta Version [Bilibili] | [YouTube]

Quick Installation

Our software is well tested on macOS and Windows with Python 3.9. Other systems and Python versions are not guaranteed.

One instruction to download on the command line, such as Terminal on macOS or PowerShell on Windows.

pip install geochemistrypi

Install the latest version explicitly to avoid known issues in older versions, such as dependency download problems.

pip install "geochemistrypi==0.7.0"

One instruction to download on Jupyter Notebook or Google Colab.

!pip install geochemistrypi

Install the latest version explicitly to avoid known issues in older versions, such as dependency download problems.

!pip install "geochemistrypi==0.7.0"

Check the downloaded version of our software:

geochemistrypi --version
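
The same check can also be made from inside Python, for example in a notebook cell. The sketch below is illustrative only. It uses the standard-library importlib.metadata module (Python 3.8+) to read the installed version of the geochemistrypi distribution, and compares the interpreter against Python 3.9, the version the software is tested with. The helper names installed_version and is_tested_python are hypothetical and not part of the package.

```python
import sys
from importlib import metadata

TESTED_PYTHON = (3, 9)  # this README states the software is tested with Python 3.9


def installed_version(distribution="geochemistrypi"):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def is_tested_python(version_info=sys.version_info):
    """True when the interpreter's major.minor matches the tested version."""
    return tuple(version_info[:2]) == TESTED_PYTHON


if __name__ == "__main__":
    print("geochemistrypi:", installed_version() or "not installed")
    print("running on the tested Python version:", is_tested_python())
```

A result of None from installed_version() usually means that pip installed the package into a different environment from the one running the notebook.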

Note: For more detail on installation, please refer to the Installation Manual in our online documentation, under the section FOR USER. There, we highly recommend using a virtual environment (Conda) to avoid dependency version problems.

The following screenshot shows the download and launch of our software on macOS:

Quick Update

One instruction to update the software to the latest version on the command line, such as Terminal on macOS or PowerShell on Windows.

pip install --upgrade geochemistrypi

One instruction to update on Jupyter Notebook or Google Colab.

!pip install --upgrade geochemistrypi

Check the updated version of our software:

geochemistrypi --version

Data Preparation

To use the functions provided by our software, your data set should satisfy the following:

It has the suffix .xlsx or .csv, both of which can be opened in Microsoft Excel.
It may optionally include location information in two separate columns, LATITUDE and LONGITUDE.

If you want to run a classification algorithm, your data set should also contain:

A label column. You can name it as you wish, such as Label.

Column name specification:

There is no restriction on column names, except for the two special, optional columns LATITUDE and LONGITUDE.
Every column can have only one column name. Multi-level column names are not allowed.
A completely empty column may exist between two columns that contain values.
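
The rules above can be checked before launching the software. The following is a minimal sketch, not part of Geochemistry π: check_dataset is a hypothetical helper that takes the file path and the list of column names you have already read from the file. Treating LATITUDE and LONGITUDE as a pair (both or neither) and accepting upper-case suffixes are assumptions made here for illustration.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".xlsx", ".csv"}
LOCATION_COLUMNS = ("LATITUDE", "LONGITUDE")


def check_dataset(path, columns, label_column=None):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Rule: the file must be .xlsx or .csv (case-insensitive here, an assumption).
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file type: {suffix or '(none)'}")

    # Rule: single-level column names only. A multi-level header typically
    # shows up as tuples rather than plain strings once loaded.
    if any(not isinstance(name, str) for name in columns):
        problems.append("multi-level column names are not allowed")

    # Assumption: the optional location columns should come as a pair.
    present = [name for name in LOCATION_COLUMNS if name in columns]
    if len(present) == 1:
        problems.append(f"{present[0]} is present without its partner column")

    # Rule (classification only): the label column must exist.
    if label_column is not None and label_column not in columns:
        problems.append(f"label column {label_column!r} not found")

    return problems
```

For a classification run, pass the name you chose for the label column, for example check_dataset("my_data.xlsx", columns, label_column="Label").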

The following seven built-in data sets are stored on Google Drive and Tencent Docs; have a look at them. For the algorithm you intend to run, refer to the data format of the corresponding data set.

Data_Regression.xlsx [Google Drive] | [Tencent Docs]
ApplicationData_Regression.xlsx [Google Drive] | [Tencent Docs]
Data_Classification.xlsx [Google Drive] | [Tencent Docs]
ApplicationData_Classification.xlsx [Google Drive] | [Tencent Docs]
Data_Clustering.xlsx [Google Drive] | [Tencent Docs]
Data_Decomposition.xlsx [Google Drive] | [Tencent Docs]
Data_AnomalyDetection.xlsx [Google Drive] | [Tencent Docs]

Note: For more detail on data preparation, please refer to our online documentation in Model Example under the section of FOR USER.

Running Example

How to run: After a successful download, run the instructions shown in the following examples on the command line, Jupyter Notebook, or Google Colab.

Once the software starts, there are two folders geopi_output and geopi_tracking generated automatically for result storage.

geopi_tracking: It is used by MLflow as the storage for visualized operations in the web interface, which users cannot modify directly.

geopi_output: It is a regular folder aligning with MLflow's storage structure, which users can operate.

From v0.7.0 onwards, there is a new option --desktop, which reads the training data and application data from the folder geopi_input on the desktop.

geopi_input: It holds the datasets you want our software to process.

Case 1: Run with built-in data set for model training and model inference

On command line:

geochemistrypi data-mining

On Jupyter Notebook / Google Colab:

!geochemistrypi data-mining

Note:

There are five built-in data sets, one for each of the five kinds of model.
The generated output directories geopi_output and geopi_tracking are placed on the desktop by default.

Case 2: Run with your own data set on desktop for model training and model inference

On command line:

geochemistrypi data-mining --desktop

On Jupyter Notebook / Google Colab:

!geochemistrypi data-mining --desktop

Note:

You need to create a directory geopi_input on the desktop and put the datasets in it. If there is no geopi_input on the desktop, our software will create one for you, with all built-in datasets provided.
The generated output directories geopi_output and geopi_tracking are placed on the desktop by default.

Case 3: Run with your own data set without model inference

On command line:

geochemistrypi data-mining --data your_own_data_set.xlsx

On Jupyter Notebook / Google Colab:

!geochemistrypi data-mining --data your_own_data_set.xlsx

Note:

Currently, .xlsx and .csv files are supported. Please specify the path to your data file. For Google Colab, don't forget to upload your dataset first.
The generated output directories geopi_output and geopi_tracking are placed in the directory where you run this command.
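
If you prefer to launch Cases 1 to 3 from a Python script rather than typing them, one option is to assemble the argument list and hand it to subprocess.run. The sketch below uses only the command and options shown in this README (data-mining, --desktop, --data). The helpers build_command and run are hypothetical, and rejecting the combination of --data with --desktop is an assumption made here, not documented behavior.

```python
import subprocess


def build_command(data=None, desktop=False):
    """Assemble the argument list for Cases 1-3 of this README."""
    if data is not None and desktop:
        # Assumption: the two input sources are treated as mutually exclusive.
        raise ValueError("choose either data or desktop, not both")
    command = ["geochemistrypi", "data-mining"]
    if desktop:
        command.append("--desktop")       # Case 2: read from geopi_input on the desktop
    elif data is not None:
        command += ["--data", str(data)]  # Case 3: train on your own file
    return command                        # Case 1 when neither option is given


def run(command):
    """Run the CLI in the foreground so its numbered prompts stay interactive."""
    return subprocess.run(command, check=False).returncode
```

For example, run(build_command(data="your_own_data_set.xlsx")) starts the same session as the Case 3 command line.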

Case 4: Implement model inference on application data

On command line:

geochemistrypi data-mining --training your_own_training_data.xlsx --application your_own_application_data.xlsx

On Jupyter Notebook / Google Colab:

!geochemistrypi data-mining --training your_own_training_data.xlsx --application your_own_application_data.xlsx

Note:

Please make sure the column names (data schema) are the same in the training data file and the application data file, because the operations you perform on the training data via our software are recorded automatically and subsequently applied to the application data in the same order.
The training data in our pipeline is divided into a train set, used for training the ML model, and a test set, used for evaluating the model's performance. Two types of score are reported: scores from prediction on the test set, and CV scores from cross-validation on the train set.
The generated output directories geopi_output and geopi_tracking are placed in the directory where you run this command.
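
Why the two schemas must match can be seen from a toy record-and-replay sketch. It is not the software's implementation: the Recorder class and the add_total_alkalis step are hypothetical, and rows are plain dictionaries rather than a data frame. Each operation applied to the training rows is stored, then replayed in the same order on the application rows after checking that the column names agree.

```python
class Recorder:
    """Apply operations to training rows and remember them for later replay."""

    def __init__(self):
        self.operations = []  # (name, function) pairs in the order applied

    def apply(self, name, function, rows):
        """Record the operation and return transformed copies of the rows."""
        self.operations.append((name, function))
        return [function(dict(row)) for row in rows]

    def replay(self, rows, expected_columns):
        """Re-apply every recorded operation, in order, to application rows."""
        # Refuse to replay when the application schema differs from training.
        for row in rows:
            if set(row) != set(expected_columns):
                raise ValueError(f"schema mismatch: {sorted(row)}")
        for _name, function in self.operations:
            rows = [function(dict(row)) for row in rows]
        return rows


def add_total_alkalis(row):
    """Example feature-engineering step: a new column built from two others."""
    row["ALKALI"] = row["Na2O"] + row["K2O"]
    return row
```

If the application file lacked the K2O column, the replayed step would have nothing to compute from, which is the situation the schema check guards against.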

Case 5: Activate MLflow web interface

On command line:

geochemistrypi data-mining --mlflow

On Jupyter Notebook / Google Colab:

!geochemistrypi data-mining --mlflow

Note:

Once the command is executed, our software searches for the geopi_tracking directory in the current working directory. If it doesn't exist there, our software looks for it on the desktop.
Copy the URL shown on the console into any browser to open the MLflow web interface. The URL usually looks like http://127.0.0.1:5000. Search for MLflow online to see more operations and usages.
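
The lookup order described in the note above can be sketched in a few lines. This is an illustration, not the software's own code: find_tracking_dir is a hypothetical helper, and taking the desktop to be ~/Desktop is an assumption that does not hold on every system.

```python
from pathlib import Path


def find_tracking_dir(cwd=None, desktop=None):
    """Return the first existing geopi_tracking directory, or None."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    # Assumption: the desktop is ~/Desktop; this differs on some systems.
    desktop = Path(desktop) if desktop is not None else Path.home() / "Desktop"
    for base in (cwd, desktop):  # current working directory first, then desktop
        candidate = base / "geopi_tracking"
        if candidate.is_dir():
            return candidate
    return None
```

Running find_tracking_dir() before launching the web interface shows which set of tracked runs would be picked up from your current location.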

Roadmap

First Phase

It works as a software application with a command-line interface (CLI) that automates the data-mining process with frequently used machine learning algorithms and statistical analysis methods, which further lowers the barrier for geochemists.

The highlight is that through choosing simple number options, the users are able to implement a full cycle of data mining without knowledge of SciPy, NumPy, Pandas, Scikit-learn, FLAML, Ray packages.

The following figure is the activity diagram of automated ML pipeline in Geochemistry π:

Its data section provides feature engineering based on arithmetic operations. It allows users to run a statistical analysis on the data set as well as on the imputation result, supported by a combination of Monte Carlo simulation and hypothesis testing.

Its models section provides both supervised and unsupervised learning methods from the Scikit-learn framework, covering four types of algorithms: regression, classification, clustering, and dimensionality reduction. Integrated with the FLAML and Ray frameworks, it allows users to run AutoML easily, quickly and cost-effectively on the built-in supervised learning algorithms in our framework.

The following figure is the hierarchical architecture of Geochemistry π:

Second Phase

Currently, we are building three ways of access to provide a more user-friendly service: a web portal, a CLI package and an API. Together they allow the user to perform continuous training and model inference by automating the ML pipeline, and machine learning lifecycle management through a unique storage mechanism across the different access layers.

The following figure is the system architecture diagram:

The following figure is the customized automated ML pipeline:

The following figure is the design pattern hierarchical architecture:

The following figure is the storage mechanism:

The whole package is under construction and the documentation is progressively evolving.

Geochemistry π Mind Map

→ Click here for more details

Team Info

Leaders:

Can He (Sany, National University of Singapore, Singapore) Duty: Be responsible for the overall development of the project. Email: [email protected]
Jianming Zhao (Jamie, Zhejiang University, China) Duty: Head of the technical group. Email: [email protected]
Yongkang Chan (Kill-virus, Lanzhou University, China) Duty: Head of the product group. Email: [email protected]
Yang Lyu (Daisy, Zhejiang University, China) Duty: Be responsible for the cloud product. Email: [email protected]

Technical Group:

Jianhao Sun (Jin, Nanjing University, China)
Mengying Ye (Mary, Jilin University, China)
Chengtu Li (Trenki, Henan Polytechnic University, Beijing, China)
Panyan Weng (The University of Sydney, Australia)
Haibin Lai (Michael, Southern University of Science and Technology, China)
Siqi Yao (Clara, Dongguan University of Technology, China)

Product Group:

Zhelan Lin (Lan, Fuzhou University, China)
ShuYi Li (Communication University Of China, Beijing, China)
Junbo Wang (China University of Geosciences, Beijing, China)
Haibin Wang (Watson, University of Sydney, Australia)
Guoqiang Qiu (Elsen, Fuzhou University, China)
Yating Dong (Yetta, Dongguan University of Technology, China)
Bailun Jiang (EPSI / Lille University, France)
Chufan Zhou (Yoko, Institute of Geochemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China)

Join Us :)

The recruitment of research interns is ongoing !!!

Key Point: All things are done online, remote work (*^▽^*)

What can you learn?

Learning the full cycle of data mining (Scikit-learn, Ray, MLflow) on tabular data, including the algorithms in regression, classification, clustering, and decomposition.
Learning to be a qualified Python developer, including any Python programming content related to data mining, basic software engineering techniques such as frontend (React, TypeScript, Ant Design scaffold) and backend (SQL & NoSQL database, RESTful API, FastAPI) development, and cooperation tools like Git.

What can you get?

Research internship proof and reference letter after working for >> 100 hours.
Chance to pay a visit to Hangzhou, China, sponsored by ZJU Earth Data.
Chance to be guided by the experts from IT companies in Silicon Valley and Hangzhou.
Bonus depending on your performance.

Current Working Pattern:

Online working and cooperation
Three weeks per working cycle -> One online meeting per working cycle
One cycle report (see below) per cycle - 5 mins to finish

Even if you are not familiar with the topics above, being interested and having plenty of time to work on them is enough. We have a fully developed training system to help you, as a newcomer to data mining or Python development, learn step by step with senior members until you can make a significant contribution to our project.

More details about the project? Please refer to:
English Page: https://person.zju.edu.cn/en/zhangzhou
Chinese Page: https://person.zju.edu.cn/zhangzhou#0

Do you want to contribute to this open-source program? Contact us with your CV: [email protected]

In-house Materials

Materials are in both Chinese and English. Materials not shown below are internal.

In-house Videos

Technical record videos are published on both Bilibili and YouTube, while other meeting videos are internal materials. More videos will be recorded soon.

Contributors

Mengqi Gao (China University of Geosciences, Beijing, China)
Shengxin Wang (Samson, Lanzhou University, China)
Wenyu Zhao (Molly, Zhejiang University, China)
Qiuhao Zhao (Brad, Zhejiang University, China)
Kaixin Zheng (Hayne, Sun Yat-sen University, China)
Ruitao Chang (China University of Geosciences Beijing, China)
Yucheng Yan (Andy, University of Sydney, Australia)
Anzhou Li (Andrian, Zhejiang University, China)
Keran Li (Kirk, Chengdu University of Technology, China)
Dan Hu (Notre Dame University, United States)
Xunxin Liu (Tante, China University of Geosciences, Wuhan, China)
Fang Li (liv, Shenzhen University, China)
Xin Li (The University of Manchester, United Kingdom)
Ting Liu (Kira, Sun Yat-sen University, China)
Xirui Zhu (Rae, University of York, United Kingdom)
Aixiwake·Janganuer (Ayshuak, Sun Yat-sen University, China)
Zhenglin Xu (Garry, Jilin University, China)
Jianing Wang (National University of Singapore, Singapore)
Junchi Liao (Roceda, University of Electronic Science and Technology of China, China)
