Upload JOSS manuscript
bobleesj committed Aug 30, 2024
1 parent c05017e commit 438c9ee
Showing 3 changed files with 265 additions and 8 deletions.
20 changes: 12 additions & 8 deletions README.md
@@ -43,7 +43,8 @@ identified the following needs:

## Quotes

Here is a quote illustrating how `cifkit` addresses one of the challenges mentioned above.
Here is a quote illustrating how `cifkit` addresses one of the challenges
mentioned above.

> "I am building an X-Ray diffraction analysis (XRD) pattern visualization
> script for my lab using `pymatgen`. I feel like `cifkit` integrated really
@@ -186,8 +187,8 @@ By structures:
[GitHub](https://github.com/bobleesj/cif-bond-analyzer)
- CIF Cleaner - move, copy .cif files based on attributes -
[GitHub](https://github.com/bobleesj/cif-cleaner)
- Structure Analyzer/Featurizer (SAF) - extract physics-based features from .cif files -
[GitHub](https://github.com/bobleesj/structure-analyzer-featurizer)
- Structure Analyzer/Featurizer (SAF) - extract physics-based features from .cif
files - [GitHub](https://github.com/bobleesj/structure-analyzer-featurizer)

## How to ask for help

@@ -219,11 +220,13 @@ group of researchers:
- Danila Shiryaev: testing as beta user
- Fabian Zills ([@PythonFZ](https://github.com/PythonFZ)): suggested tooling
improvements
- Emil Jaffal ([@EmilJaffal](https://github.com/EmilJaffal)): initial testing and bug report
- Emil Jaffal ([@EmilJaffal](https://github.com/EmilJaffal)): initial testing
and bug report
- Nikhil Kumar Barua: initial testing and bug report
- Nishant Yadav ([@sethisiddha1998](https://github.com/sethisiddha1998)): initial testing and bug report
- Siddha Sankalpa Sethi ([@runzsh](https://github.com/runzsh)): initial testing and bug report
- Nishant Yadav ([@sethisiddha1998](https://github.com/sethisiddha1998)):
initial testing and bug report
- Siddha Sankalpa Sethi ([@runzsh](https://github.com/runzsh)): initial testing
and bug report

We welcome all forms of contributions from the community. Your ideas and
improvements are valued and appreciated.
@@ -232,4 +235,5 @@ improvements are valued and appreciated.

Please consider citing `cifkit` if it has been useful for your research:

- cifkit – Python package for high-throughput .cif analysis, https://doi.org/10.5281/zenodo.12784259
- cifkit – Python package for high-throughput .cif analysis,
https://doi.org/10.5281/zenodo.12784259
109 changes: 109 additions & 0 deletions paper.bib
@@ -0,0 +1,109 @@

@article{ong_python_2013,
title = {Python {Materials} {Genomics} (pymatgen): {A} robust, open-source python library for materials analysis},
volume = {68},
issn = {0927-0256},
shorttitle = {Python {Materials} {Genomics} (pymatgen)},
url = {https://www.sciencedirect.com/science/article/pii/S0927025612006295},
doi = {10.1016/j.commatsci.2012.10.028},
abstract = {We present the Python Materials Genomics (pymatgen) library, a robust, open-source Python library for materials analysis. A key enabler in high-throughput computational materials science efforts is a robust set of software tools to perform initial setup for the calculations (e.g., generation of structures and necessary input files) and post-calculation analysis to derive useful material properties from raw calculated data. The pymatgen library aims to meet these needs by (1) defining core Python objects for materials data representation, (2) providing a well-tested set of structure and thermodynamic analyses relevant to many applications, and (3) establishing an open platform for researchers to collaboratively develop sophisticated analyses of materials data obtained both from first principles calculations and experiments. The pymatgen library also provides convenient tools to obtain useful materials data via the Materials Project’s REpresentational State Transfer (REST) Application Programming Interface (API). As an example, using pymatgen’s interface to the Materials Project’s RESTful API and phasediagram package, we demonstrate how the phase and electrochemical stability of a recently synthesized material, Li4SnS4, can be analyzed using a minimum of computing resources. We find that Li4SnS4 is a stable phase in the Li–Sn–S phase diagram (consistent with the fact that it can be synthesized), but the narrow range of lithium chemical potentials for which it is predicted to be stable would suggest that it is not intrinsically stable against typical electrodes used in lithium-ion batteries.},
urldate = {2024-08-29},
journal = {Computational Materials Science},
author = {Ong, Shyue Ping and Richards, William Davidson and Jain, Anubhav and Hautier, Geoffroy and Kocher, Michael and Cholia, Shreyas and Gunter, Dan and Chevrier, Vincent L. and Persson, Kristin A. and Ceder, Gerbrand},
month = feb,
year = {2013},
keywords = {Design, High-throughput, Materials, Project, Thermodynamics},
pages = {314--319},
file = {Full Text:/Users/macbook/Zotero/storage/B8ALJEE7/Ong et al. - 2013 - Python Materials Genomics (pymatgen) A robust, op.pdf:application/pdf;ScienceDirect Snapshot:/Users/macbook/Zotero/storage/QMNM7QY4/S0927025612006295.html:text/html},
}

@article{hall_crystallographic_1991,
title = {The crystallographic information file ({CIF}): a new standard archive file for crystallography},
volume = {47},
issn = {1600-5724},
shorttitle = {The crystallographic information file ({CIF})},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1107/S010876739101067X},
doi = {10.1107/S010876739101067X},
language = {en},
number = {6},
urldate = {2024-08-29},
journal = {Acta Crystallographica Section A},
author = {Hall, S. R. and Allen, F. H. and Brown, I. D.},
year = {1991},
note = {\_eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1107/S010876739101067X},
pages = {655--685},
file = {Full Text PDF:/Users/macbook/Zotero/storage/QU4JZMZE/Hall et al. - 1991 - The crystallographic information file (CIF) a new.pdf:application/pdf;Snapshot:/Users/macbook/Zotero/storage/Z8KVTFL6/S010876739101067X.html:text/html},
}

@article{larsen_atomic_2017,
title = {The atomic simulation environment—a {Python} library for working with atoms},
volume = {29},
issn = {0953-8984},
url = {https://dx.doi.org/10.1088/1361-648X/aa680e},
doi = {10.1088/1361-648X/aa680e},
abstract = {The atomic simulation environment (ASE) is a software package written in the Python programming language with the aim of setting up, steering, and analyzing atomistic simulations. In ASE, tasks are fully scripted in Python. The powerful syntax of Python combined with the NumPy array library make it possible to perform very complex simulation tasks. For example, a sequence of calculations may be performed with the use of a simple ‘for-loop’ construction. Calculations of energy, forces, stresses and other quantities are performed through interfaces to many external electronic structure codes or force fields using a uniform interface. On top of this calculator interface, ASE provides modules for performing many standard simulation tasks such as structure optimization, molecular dynamics, handling of constraints and performing nudged elastic band calculations.},
language = {en},
number = {27},
urldate = {2024-08-29},
journal = {Journal of Physics: Condensed Matter},
author = {Larsen, Ask Hjorth and Mortensen, Jens Jørgen and Blomqvist, Jakob and Castelli, Ivano E. and Christensen, Rune and Dułak, Marcin and Friis, Jesper and Groves, Michael N. and Hammer, Bjørk and Hargus, Cory and Hermes, Eric D. and Jennings, Paul C. and Jensen, Peter Bjerre and Kermode, James and Kitchin, John R. and Kolsbjerg, Esben Leonhard and Kubal, Joseph and Kaasbjerg, Kristen and Lysgaard, Steen and Maronsson, Jón Bergmann and Maxson, Tristan and Olsen, Thomas and Pastewka, Lars and Peterson, Andrew and Rostgaard, Carsten and Schiøtz, Jakob and Schütt, Ole and Strange, Mikkel and Thygesen, Kristian S. and Vegge, Tejs and Vilhelmsen, Lasse and Walter, Michael and Zeng, Zhenhua and Jacobsen, Karsten W.},
month = jun,
year = {2017},
note = {Publisher: IOP Publishing},
pages = {273002},
file = {IOP Full Text PDF:/Users/macbook/Zotero/storage/R2HBZEV6/Larsen et al. - 2017 - The atomic simulation environment—a Python library.pdf:application/pdf},
}

@article{barua_interpretable_2024,
title = {Interpretable {Machine} {Learning} {Model} on {Thermal} {Conductivity} {Using} {Publicly} {Available} {Datasets} and {Our} {Internal} {Lab} {Dataset}},
volume = {36},
issn = {0897-4756},
url = {https://doi.org/10.1021/acs.chemmater.4c01696},
doi = {10.1021/acs.chemmater.4c01696},
abstract = {Machine learning (ML), a subdiscipline of artificial intelligence studies, has gained importance in predicting or suggesting efficient thermoelectric materials. Previous ML studies have used different literature sources or density functional theory calculations as input. In this work, we develop a ML pipeline trained with multivariable inputs on a massive public dataset of ∼200,000 data utilizing a high-performance computing cluster to predict the thermal conductivity (κ) using four test sets: three publicly available datasets and a dataset built using previously published data from our own group. By taking advantage of this massive dataset, our model presents an opportunity to further expand the understanding of the selection of features with various thermoelectric materials. Among the several supervised ML models implemented, the eXtreme Gradient Boosting algorithm (XGBoost) turned out to be the best method during the 5-fold cross-validation method, with their averaged evaluation coefficients of R2 = 0.96, root mean squared error (RMSE) = 0.38 W m−1K−1, and mean absolute error (MAE) = 0.23 W m−1K−1. Additionally, with the aid of feature selection and importance analysis, useful chemical features were chosen that ultimately led to reasonably good accuracy in the series of test sets measured as per the evaluation coefficients of R2, RMSE, and MAE, with values ranging from 0.72 to 0.89, 0.52 to 1.08, and 0.40 to 0.66 W m−1K−1, respectively. Checking the worst outliers led to the discovery of some errors in the literature. Postmodel prediction, the SHapley Additive exPlanations (SHAP) algorithm was implemented on the XGBoost model to analyze the features that were the key drivers for the model’s decisions. Overall, the developed interpretable methodology produces the prediction of κ of a large variety of materials through the influence of chemical and physical property features. 
The conclusions drawn apply to the research and applications of thermoelectric and heat insulation materials.},
number = {14},
urldate = {2024-08-29},
journal = {Chemistry of Materials},
author = {Barua, Nikhil K. and Hall, Evan and Cheng, Yifei and Oliynyk, Anton O. and Kleinke, Holger},
month = jul,
year = {2024},
note = {Publisher: American Chemical Society},
pages = {7089--7100},
file = {Full Text PDF:/Users/macbook/Zotero/storage/UQ6UBCJS/Barua et al. - 2024 - Interpretable Machine Learning Model on Thermal Co.pdf:application/pdf},
}

@article{lee_machine_2024,
title = {Machine learning descriptors in materials chemistry used in multiple experimentally validated studies: {Oliynyk} elemental property dataset},
volume = {53},
issn = {2352-3409},
shorttitle = {Machine learning descriptors in materials chemistry used in multiple experimentally validated studies},
url = {https://www.data-in-brief.com/article/S2352-3409(24)00149-5/fulltext},
doi = {10.1016/j.dib.2024.110178},
language = {English},
urldate = {2024-08-29},
journal = {Data in Brief},
author = {Lee, Sangjoon and Chen, Clio and Garcia, Griheydi and Oliynyk, Anton},
month = apr,
year = {2024},
note = {Publisher: Elsevier},
keywords = {Feature engineering, Machine learning, Materials chemistry, Materials informatics},
file = {Full Text PDF:/Users/macbook/Zotero/storage/LT3CRPZS/Lee et al. - 2024 - Machine learning descriptors in materials chemistr.pdf:application/pdf},
}

@article{tyvanchuk_crystal_2024,
title = {The crystal and electronic structure of \textit{{RE}}{23Co6}.{7In20}.3 (\textit{{RE}} = {Gd}–{Tm}, {Lu}): {A} new structure type based on intergrowth of {AlB2}- and {CsCl}-type related slabs},
volume = {976},
issn = {0925-8388},
shorttitle = {The crystal and electronic structure of \textit{{RE}}{23Co6}.{7In20}.3 (\textit{{RE}} = {Gd}–{Tm}, {Lu})},
url = {https://www.sciencedirect.com/science/article/pii/S0925838823045449},
doi = {10.1016/j.jallcom.2023.173241},
abstract = {New ternary rare-earth indides RE23Co6.7In20.3 (RE = Gd–Tm, Lu) have been synthesized by arc-melting the elements under argon and subsequent annealing at 870 K for 1200 h. Single-crystal X-ray diffraction revealed Er23Co6.7In20.3 to crystallize in a new structure type in oP100, space group Pbam and Wyckoff sequence h11g13da with a = 23.203(5), b = 28.399(5), c = 3.5306(6) Å. The crystal structures of RE23Co6.7In20.3 (RE = Tb, Ho, Er and Tm) were determined from single crystal and powder X-ray diffraction data and further investigated by DFT methods. The compounds belong to a large family of ternary rare-earth indides with intergrowth of the AlB2- and CsCl-type related slabs. In the Er23Co6.7In20.3 structure, four types of fragments REIn and RET of CsCl-type, as well as RET2 and REIn2 of AlB2-type, are present simultaneously. A simple Python tool was developed to determine the coordination number for each crystallographic site with various methods and tested on the complex structure of RE23Co6.7In20.3.},
urldate = {2024-08-29},
journal = {Journal of Alloys and Compounds},
author = {Tyvanchuk, Yuriy and Babizhetskyy, Volodymyr and Baran, Stanisław and Szytuła, Andrzej and Smetana, Volodymyr and Lee, Sangjoon and Oliynyk, Anton O. and Mudring, Anja-Verena},
month = mar,
year = {2024},
keywords = {Bonding, Electronic structure, Indide, Intermetallics, Rare earth},
pages = {173241},
file = {ScienceDirect Snapshot:/Users/macbook/Zotero/storage/I9T7BI7X/S0925838823045449.html:text/html},
}
144 changes: 144 additions & 0 deletions paper.md
@@ -0,0 +1,144 @@
---
title: "cifkit: A User-Friendly Python Package for High-throughput CIF Analysis"
tags:
- Python
- CIF
- crystallography
- materials science
- solid state chemistry
- crystal structure
- machine learning

authors:
  - name: Sangjoon Lee
    orcid: 0000-0002-2367-3932
    corresponding: true
    affiliation: 1
  - name: Anton O. Oliynyk
    orcid: 0000-0003-0732-7340
    affiliation: "2, 3"
affiliations:
  - name:
      Department of Applied Physics and Applied Mathematics, Columbia
      University, New York, NY 10027, USA
    index: 1
  - name:
      Department of Chemistry, Hunter College, City University of New York, New
      York, NY 10065, USA
    index: 2
  - name:
      Ph.D. Program in Chemistry, The Graduate Center of the City University of
      New York, New York, NY 10016, USA
    index: 3
date: 29 August 2024
bibliography: paper.bib
---

# Summary

`cifkit` is designed to provide a set of intuitive utility functions and
variables for processing large datasets, on the order of tens of thousands of
.cif files. `cifkit` serves as an engine for building Python applications that
automate crystal structure analysis, enabling the extraction of physics-based
information crucial for understanding geometric configurations and identifying
irregularities. `cifkit` also offers various tools for determining coordination
numbers, plotting the coordination geometry-based polyhedron for each site,
calculating bond fractions, moving and copying .cif files based on a set of
attributes, and determining atomic mixing information.
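
As a rough illustration of what a bond fraction can mean, the sketch below
counts element pairs in a hand-written list of bonds and normalizes the counts.
The `bond_fractions` helper, the sample pairs, and the definition itself are
assumptions made for this example; it uses only the standard library and is not
`cifkit`'s implementation.

```python
from collections import Counter


def bond_fractions(bonds):
    """Return the fraction of each unordered element pair in `bonds`."""
    # Sort each pair so ("In", "Er") and ("Er", "In") count as the same bond type.
    counts = Counter(tuple(sorted(pair)) for pair in bonds)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical bond pairs, for example one shortest bond per atomic site.
sample_bonds = [("Er", "In"), ("In", "Er"), ("Co", "In"), ("In", "In")]
fractions = bond_fractions(sample_bonds)
print(fractions)
```

Sorting each pair before counting keeps the result independent of the order in
which the two elements of a bond happen to be listed.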

# Statement of need

In solid state chemistry and materials science, the Crystallographic Information
File (CIF) [@hall_crystallographic_1991] is the predominant file format used to
store and distribute crystal structure information. There are open-source Python
packages that read, edit, and create CIF files. Python Materials Genomics
(pymatgen) [@ong_python_2013] offers functionalities beyond the aforementioned
features, such as generating electronic structure properties and phase diagrams.
Similarly, the Atomic Simulation Environment (ASE) [@larsen_atomic_2017]
provides a suite of powerful tools for generating and running atomistic
simulations. However, both projects are tailored to users with prior
programming experience, who are often expected to work directly from the API
documentation.

Given that .cif files are primarily generated by experimentalists, often with
the help of software, `cifkit` is designed for users with limited programming
backgrounds, lowering the entry barrier for the experimentalist community. The
tool allows experimentalists to analyze their synthesized materials using
numerical descriptions, including complex tasks such as extracting descriptors
for geometry-based polyhedra at each atomic site. `cifkit` helps systematically
describe materials and automatically extract intuitive, physics-based, and
measurable information from .cif files, tasks that may otherwise require
time-consuming manual work with GUI-based tools like VESTA, Diamond, and
CrystalMaker. Furthermore, `cifkit` offers unique functionalities to visualize
the distribution of file counts based on attributes, extract atomic mixing
information at the bond pair level, preprocess .cif files to parse the atomic
site element from the atomic site label, and identify and separate ill-formatted
files in a few lines of code. By simplifying user interactions while maintaining
robust functionality, `cifkit` enables a broader range of scientists to leverage
computational tools in their research.
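
The site-label preprocessing mentioned above can be pictured with a minimal
sketch. The `element_from_label` helper and the sample labels are hypothetical,
and the approach shown (matching the label against the elements expected in the
compound) is an assumption for illustration rather than a description of how
`cifkit` does it.

```python
def element_from_label(label, elements):
    """Return the element symbol that an atomic site label starts with."""
    # Try longer symbols first so that "Co1" matches "Co" rather than "C".
    for symbol in sorted(elements, key=len, reverse=True):
        if label.startswith(symbol):
            return symbol
    raise ValueError(f"No known element found in site label {label!r}")


# Hypothetical elements of the compound and site labels from a .cif file.
elements = {"Er", "Co", "In", "C"}
labels = ["Er10", "Co1", "In2A"]
parsed = [element_from_label(label, elements) for label in labels]
print(parsed)
```

Raising an error for an unmatched label, instead of guessing, makes it easier
to set aside files whose labels do not follow the expected pattern.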

# Examples

The following example code demonstrates the user-friendly nature of the tool for
users with limited programming experience. The full installation process can be
executed via a Jupyter notebook, which is distributed through the Google Colab
URL provided in the official documentation.

```python
from cifkit import Cif, Example

# Initialize with the .cif file path
cif = Cif(Example.Er10Co9In20_file_path)
cif.formula
cif.structure
cif.unique_elements
cif.unitcell_lengths
cif.unitcell_angles
```

To extract information from a set of .cif files:

```python
from cifkit import CifEnsemble, Example

# Initialize with the folder path containing .cif files
ensemble = CifEnsemble(Example.ErCoIn_big_folder_path)
ensemble.unique_formulas
ensemble.unique_structures
ensemble.unique_elements

# Determine shortest pair distance per .cif file
ensemble.minimum_distances

# Filter .cif by formula
ensemble.filter_by_formulas(["LaRu2Ge2"])

# Filter .cif by space group name
ensemble.filter_by_space_group_names("Im-3m")
```
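
To make the idea of a shortest pair distance concrete, here is a minimal
standalone sketch using only the standard library. The `shortest_pair_distance`
helper and the coordinates are hypothetical, and the sketch ignores periodic
boundary conditions, so it should not be read as how `cifkit` computes
`minimum_distances`.

```python
from itertools import combinations
from math import dist


def shortest_pair_distance(sites):
    """Return (distance, label_a, label_b) for the closest pair of sites."""
    return min(
        (dist(xyz_a, xyz_b), label_a, label_b)
        for (label_a, xyz_a), (label_b, xyz_b) in combinations(sites, 2)
    )


# Hypothetical Cartesian coordinates in angstroms.
sites = [
    ("Er1", (0.0, 0.0, 0.0)),
    ("Co1", (3.0, 0.0, 0.0)),
    ("In1", (0.0, 4.0, 0.0)),
]
closest = shortest_pair_distance(sites)
print(closest)
```

An unusually small value from a check like this is one way to flag a file whose
atomic positions may be ill-formatted.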

# Applications 

`cifkit` serves as a core package for building applications used by academic and
national laboratories for crystal structural analysis and machine learning
studies. CIF Bond Analyzer (CBA) utilizes `cifkit` to extract coordination
geometry information for a newly discovered phase [@tyvanchuk_crystal_2024]. The
Structure Analyzer/Featurizer (SAF) employs `cifkit` to construct and extract
physics-based geometric features for binary and ternary compounds. Furthermore,
geometric features generated with `cifkit` are being incorporated into a
follow-up study on thermoelectric materials [@barua_interpretable_2024],
building upon the compositional properties explored in [@lee_machine_2024].

# Testing and documentation

According to Codecov, tests cover 97 percent of the code. The documentation is
provided at https://bobleesj.github.io/cifkit.

# Acknowledgement

We acknowledge the initial testing done by Nishant Yadav, Siddha Sankalpa Sethi,
and Arnab Dutta from the Indian Institute of Technology, Kharagpur. We also
thank Emil Jaffal, Danila Shiryaev, and Alex Vtorov from CUNY Hunter College for
testing. We acknowledge Fabian Zills for his recommendations on Python tooling.

# References
