Showing 3 changed files with 265 additions and 8 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
|
||
@article{ong_python_2013, | ||
title = {Python {Materials} {Genomics} (pymatgen): {A} robust, open-source python library for materials analysis}, | ||
volume = {68}, | ||
issn = {0927-0256}, | ||
shorttitle = {Python {Materials} {Genomics} (pymatgen)}, | ||
url = {https://www.sciencedirect.com/science/article/pii/S0927025612006295}, | ||
doi = {10.1016/j.commatsci.2012.10.028}, | ||
abstract = {We present the Python Materials Genomics (pymatgen) library, a robust, open-source Python library for materials analysis. A key enabler in high-throughput computational materials science efforts is a robust set of software tools to perform initial setup for the calculations (e.g., generation of structures and necessary input files) and post-calculation analysis to derive useful material properties from raw calculated data. The pymatgen library aims to meet these needs by (1) defining core Python objects for materials data representation, (2) providing a well-tested set of structure and thermodynamic analyses relevant to many applications, and (3) establishing an open platform for researchers to collaboratively develop sophisticated analyses of materials data obtained both from first principles calculations and experiments. The pymatgen library also provides convenient tools to obtain useful materials data via the Materials Project’s REpresentational State Transfer (REST) Application Programming Interface (API). As an example, using pymatgen’s interface to the Materials Project’s RESTful API and phasediagram package, we demonstrate how the phase and electrochemical stability of a recently synthesized material, Li4SnS4, can be analyzed using a minimum of computing resources. We find that Li4SnS4 is a stable phase in the Li–Sn–S phase diagram (consistent with the fact that it can be synthesized), but the narrow range of lithium chemical potentials for which it is predicted to be stable would suggest that it is not intrinsically stable against typical electrodes used in lithium-ion batteries.}, | ||
urldate = {2024-08-29}, | ||
journal = {Computational Materials Science}, | ||
author = {Ong, Shyue Ping and Richards, William Davidson and Jain, Anubhav and Hautier, Geoffroy and Kocher, Michael and Cholia, Shreyas and Gunter, Dan and Chevrier, Vincent L. and Persson, Kristin A. and Ceder, Gerbrand}, | ||
month = feb, | ||
year = {2013}, | ||
keywords = {Design, High-throughput, Materials, Project, Thermodynamics}, | ||
pages = {314--319}, | ||
file = {Full Text:/Users/macbook/Zotero/storage/B8ALJEE7/Ong et al. - 2013 - Python Materials Genomics (pymatgen) A robust, op.pdf:application/pdf;ScienceDirect Snapshot:/Users/macbook/Zotero/storage/QMNM7QY4/S0927025612006295.html:text/html}, | ||
} | ||
|
||
@article{hall_crystallographic_1991, | ||
title = {The crystallographic information file ({CIF}): a new standard archive file for crystallography}, | ||
volume = {47}, | ||
issn = {1600-5724}, | ||
shorttitle = {The crystallographic information file ({CIF})}, | ||
url = {https://onlinelibrary.wiley.com/doi/abs/10.1107/S010876739101067X}, | ||
doi = {10.1107/S010876739101067X}, | ||
language = {en}, | ||
number = {6}, | ||
urldate = {2024-08-29}, | ||
journal = {Acta Crystallographica Section A}, | ||
author = {Hall, S. R. and Allen, F. H. and Brown, I. D.}, | ||
year = {1991}, | ||
note = {\_eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1107/S010876739101067X}, | ||
pages = {655--685}, | ||
file = {Full Text PDF:/Users/macbook/Zotero/storage/QU4JZMZE/Hall et al. - 1991 - The crystallographic information file (CIF) a new.pdf:application/pdf;Snapshot:/Users/macbook/Zotero/storage/Z8KVTFL6/S010876739101067X.html:text/html}, | ||
} | ||
|
||
@article{larsen_atomic_2017, | ||
title = {The atomic simulation environment—a {Python} library for working with atoms}, | ||
volume = {29}, | ||
issn = {0953-8984}, | ||
url = {https://dx.doi.org/10.1088/1361-648X/aa680e}, | ||
doi = {10.1088/1361-648X/aa680e}, | ||
abstract = {The atomic simulation environment (ASE) is a software package written in the Python programming language with the aim of setting up, steering, and analyzing atomistic simulations. In ASE, tasks are fully scripted in Python. The powerful syntax of Python combined with the NumPy array library make it possible to perform very complex simulation tasks. For example, a sequence of calculations may be performed with the use of a simple ‘for-loop’ construction. Calculations of energy, forces, stresses and other quantities are performed through interfaces to many external electronic structure codes or force fields using a uniform interface. On top of this calculator interface, ASE provides modules for performing many standard simulation tasks such as structure optimization, molecular dynamics, handling of constraints and performing nudged elastic band calculations.}, | ||
language = {en}, | ||
number = {27}, | ||
urldate = {2024-08-29}, | ||
journal = {Journal of Physics: Condensed Matter}, | ||
author = {Larsen, Ask Hjorth and Mortensen, Jens Jørgen and Blomqvist, Jakob and Castelli, Ivano E. and Christensen, Rune and Dułak, Marcin and Friis, Jesper and Groves, Michael N. and Hammer, Bjørk and Hargus, Cory and Hermes, Eric D. and Jennings, Paul C. and Jensen, Peter Bjerre and Kermode, James and Kitchin, John R. and Kolsbjerg, Esben Leonhard and Kubal, Joseph and Kaasbjerg, Kristen and Lysgaard, Steen and Maronsson, Jón Bergmann and Maxson, Tristan and Olsen, Thomas and Pastewka, Lars and Peterson, Andrew and Rostgaard, Carsten and Schiøtz, Jakob and Schütt, Ole and Strange, Mikkel and Thygesen, Kristian S. and Vegge, Tejs and Vilhelmsen, Lasse and Walter, Michael and Zeng, Zhenhua and Jacobsen, Karsten W.}, | ||
month = jun, | ||
year = {2017}, | ||
note = {Publisher: IOP Publishing}, | ||
pages = {273002}, | ||
file = {IOP Full Text PDF:/Users/macbook/Zotero/storage/R2HBZEV6/Larsen et al. - 2017 - The atomic simulation environment—a Python library.pdf:application/pdf}, | ||
} | ||
|
||
@article{barua_interpretable_2024, | ||
title = {Interpretable {Machine} {Learning} {Model} on {Thermal} {Conductivity} {Using} {Publicly} {Available} {Datasets} and {Our} {Internal} {Lab} {Dataset}}, | ||
volume = {36}, | ||
issn = {0897-4756}, | ||
url = {https://doi.org/10.1021/acs.chemmater.4c01696}, | ||
doi = {10.1021/acs.chemmater.4c01696}, | ||
abstract = {Machine learning (ML), a subdiscipline of artificial intelligence studies, has gained importance in predicting or suggesting efficient thermoelectric materials. Previous ML studies have used different literature sources or density functional theory calculations as input. In this work, we develop a ML pipeline trained with multivariable inputs on a massive public dataset of ∼200,000 data utilizing a high-performance computing cluster to predict the thermal conductivity (κ) using four test sets: three publicly available datasets and a dataset built using previously published data from our own group. By taking advantage of this massive dataset, our model presents an opportunity to further expand the understanding of the selection of features with various thermoelectric materials. Among the several supervised ML models implemented, the eXtreme Gradient Boosting algorithm (XGBoost) turned out to be the best method during the 5-fold cross-validation method, with their averaged evaluation coefficients of R2 = 0.96, root mean squared error (RMSE) = 0.38 W m−1K−1, and mean absolute error (MAE) = 0.23 W m−1K−1. Additionally, with the aid of feature selection and importance analysis, useful chemical features were chosen that ultimately led to reasonably good accuracy in the series of test sets measured as per the evaluation coefficients of R2, RMSE, and MAE, with values ranging from 0.72 to 0.89, 0.52 to 1.08, and 0.40 to 0.66 W m−1K−1, respectively. Checking the worst outliers led to the discovery of some errors in the literature. Postmodel prediction, the SHapley Additive exPlanations (SHAP) algorithm was implemented on the XGBoost model to analyze the features that were the key drivers for the model’s decisions. Overall, the developed interpretable methodology produces the prediction of κ of a large variety of materials through the influence of chemical and physical property features. 
The conclusions drawn apply to the research and applications of thermoelectric and heat insulation materials.}, | ||
number = {14}, | ||
urldate = {2024-08-29}, | ||
journal = {Chemistry of Materials}, | ||
author = {Barua, Nikhil K. and Hall, Evan and Cheng, Yifei and Oliynyk, Anton O. and Kleinke, Holger}, | ||
month = jul, | ||
year = {2024}, | ||
note = {Publisher: American Chemical Society}, | ||
pages = {7089--7100}, | ||
file = {Full Text PDF:/Users/macbook/Zotero/storage/UQ6UBCJS/Barua et al. - 2024 - Interpretable Machine Learning Model on Thermal Co.pdf:application/pdf}, | ||
} | ||
|
||
@article{lee_machine_2024, | ||
title = {Machine learning descriptors in materials chemistry used in multiple experimentally validated studies: {Oliynyk} elemental property dataset}, | ||
volume = {53}, | ||
issn = {2352-3409}, | ||
shorttitle = {Machine learning descriptors in materials chemistry used in multiple experimentally validated studies}, | ||
url = {https://www.data-in-brief.com/article/S2352-3409(24)00149-5/fulltext}, | ||
doi = {10.1016/j.dib.2024.110178}, | ||
language = {English}, | ||
urldate = {2024-08-29}, | ||
journal = {Data in Brief}, | ||
author = {Lee, Sangjoon and Chen, Clio and Garcia, Griheydi and Oliynyk, Anton}, | ||
month = apr, | ||
year = {2024}, | ||
note = {Publisher: Elsevier}, | ||
keywords = {Feature engineering, Machine learning, Materials chemistry, Materials informatics}, | ||
file = {Full Text PDF:/Users/macbook/Zotero/storage/LT3CRPZS/Lee et al. - 2024 - Machine learning descriptors in materials chemistr.pdf:application/pdf}, | ||
} | ||
|
||
@article{tyvanchuk_crystal_2024, | ||
title = {The crystal and electronic structure of \textit{{RE}}{23Co6}.{7In20}.3 (\textit{{RE}} = {Gd}–{Tm}, {Lu}): {A} new structure type based on intergrowth of {AlB2}- and {CsCl}-type related slabs}, | ||
volume = {976}, | ||
issn = {0925-8388}, | ||
shorttitle = {The crystal and electronic structure of \textit{{RE}}{23Co6}.{7In20}.3 (\textit{{RE}} = {Gd}–{Tm}, {Lu})}, | ||
url = {https://www.sciencedirect.com/science/article/pii/S0925838823045449}, | ||
doi = {10.1016/j.jallcom.2023.173241}, | ||
abstract = {New ternary rare-earth indides RE23Co6.7In20.3 (RE = Gd–Tm, Lu) have been synthesized by arc-melting the elements under argon and subsequent annealing at 870 K for 1200 h. Single-crystal X-ray diffraction revealed Er23Co6.7In20.3 to crystallize in a new structure type in oP100, space group Pbam and Wyckoff sequence h11g13da with a = 23.203(5), b = 28.399(5), c = 3.5306(6) Å. The crystal structures of RE23Co6.7In20.3 (RE = Tb, Ho, Er and Tm) were determined from single crystal and powder X-ray diffraction data and further investigated by DFT methods. The compounds belong to a large family of ternary rare-earth indides with intergrowth of the AlB2- and CsCl-type related slabs. In the Er23Co6.7In20.3 structure, four types of fragments REIn and RET of CsCl-type, as well as RET2 and REIn2 of AlB2-type, are present simultaneously. A simple Python tool was developed to determine the coordination number for each crystallographic site with various methods and tested on the complex structure of RE23Co6.7In20.3.}, | ||
urldate = {2024-08-29}, | ||
journal = {Journal of Alloys and Compounds}, | ||
author = {Tyvanchuk, Yuriy and Babizhetskyy, Volodymyr and Baran, Stanisław and Szytuła, Andrzej and Smetana, Volodymyr and Lee, Sangjoon and Oliynyk, Anton O. and Mudring, Anja-Verena}, | ||
month = mar, | ||
year = {2024}, | ||
keywords = {Bonding, Electronic structure, Indide, Intermetallics, Rare earth}, | ||
pages = {173241}, | ||
file = {ScienceDirect Snapshot:/Users/macbook/Zotero/storage/I9T7BI7X/S0925838823045449.html:text/html}, | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,144 @@ | ||
--- | ||
title: "cifkit: A User-Friendly Python Package for High-throughput CIF Analysis" | ||
tags: | ||
- Python | ||
- CIF | ||
- crystallography | ||
- materials science | ||
- solid state chemistry | ||
- crystal structure | ||
- machine learning | ||
|
||
authors: | ||
- name: Sangjoon Lee | ||
orcid: 0000-0002-2367-3932 | ||
corresponding: true | ||
affiliation: 1 | ||
- name: Anton O. Oliynyk | ||
orcid: 0000-0003-0732-7340 | ||
affiliation: "2, 3" | ||
affiliations: | ||
- name: | ||
Department of Applied Physics and Applied Mathematics, Columbia | ||
University, New York, NY 10027, USA | ||
index: 1 | ||
- name: | ||
Department of Chemistry, Hunter College, City University of New York, New | ||
York, NY 10065, USA | ||
index: 2 | ||
- name: | ||
Ph.D. Program in Chemistry, The Graduate Center of the City University of | ||
New York, New York, NY 10016, USA | ||
index: 3 | ||
date: 29 August 2024 | ||
bibliography: paper.bib | ||
--- | ||
|
||
# Summary | ||
|
||
`cifkit` is designed to provide a set of intuitive utility functions and | ||
variables for processing large datasets, on the order of tens of thousands of | ||
.cif files. `cifkit` serves as an engine for building Python applications that | ||
automate crystal structure analysis, enabling the extraction of physics-based | ||
information crucial for understanding geometric configurations and identifying | ||
irregularities. `cifkit` also offers various tools for determining coordination | ||
numbers, plotting the coordination polyhedron around each site, | ||
calculating bond fractions, moving and copying .cif files based on a set of | ||
attributes, and determining atomic mixing information. | ||
|
||
# Statement of need | ||
|
||
In solid state chemistry and materials science, the Crystallographic Information | ||
File (CIF) [@hall_crystallographic_1991] is the predominant file format used to | ||
store and distribute crystal structure information. There are open-source Python | ||
packages that read, edit, and create CIF files. Python Materials Genomics | ||
(pymatgen) [@ong_python_2013] offers functionalities beyond the aforementioned | ||
features, such as generating electronic structure properties and phase diagrams. | ||
Similarly, the Atomic Simulation Environment (ASE) [@larsen_atomic_2017] | ||
provides a suite of powerful tools for generating and running atomistic | ||
simulations. However, both projects are tailored to users with prior | ||
programming experience, who are often expected to work directly from the API | ||
documentation. | ||
|
||
Given that .cif files are primarily generated by experimentalists, often with | ||
the help of software, `cifkit` is designed for users with limited programming | ||
backgrounds to lower the entry barrier for the experimentalist community. It | ||
allows experimentalists to analyze their synthesized materials using numerical | ||
descriptions, including complex tasks such as extracting descriptors for | ||
geometry-based polyhedra from each atomic site. `cifkit` helps systematically | ||
describe materials and automatically extract intuitive, physics-based, and | ||
measurable information from .cif files, tasks that may require time-consuming | ||
manual work with GUI-based tools like VESTA, Diamond, and CrystalMaker. | ||
Furthermore, `cifkit` offers unique functionalities to visualize the | ||
distribution of file counts based on attributes, extract atomic mixing | ||
information at the bond pair level, preprocess .cif files to parse the atomic | ||
site element from the atomic site label, and identify and separate ill-formatted | ||
files in a few lines of code. By simplifying user interactions while maintaining | ||
robust functionality, `cifkit` enables a broader range of scientists to leverage | ||
computational tools in their research. | ||
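To illustrate what parsing the element from an atomic site label involves, here is a minimal, standard-library sketch. It is not `cifkit`'s implementation: the function name `element_from_label` and the small `KNOWN_ELEMENTS` set are hypothetical and chosen only for this example.

```python
import re

# Illustrative subset of element symbols (an assumption for this sketch);
# a real tool would use the full periodic table.
KNOWN_ELEMENTS = {"Er", "Co", "In", "La", "Ru", "Ge", "C", "O", "I"}


def element_from_label(label: str) -> str:
    """Return the element symbol at the start of a site label such as 'Er1'."""
    match = re.match(r"[A-Za-z]+", label)
    if match is None:
        raise ValueError(f"Label does not start with letters: {label!r}")
    letters = match.group(0)
    # Prefer a two-letter symbol ('Co') over a one-letter symbol ('C').
    for length in (2, 1):
        if len(letters) < length:
            continue
        candidate = letters[:length].capitalize()
        if candidate in KNOWN_ELEMENTS:
            return candidate
    raise ValueError(f"No known element found in label: {label!r}")


for label in ["Er1", "Co2A", "In3", "C1"]:
    print(label, "->", element_from_label(label))
```

Trying the two-letter symbol first avoids reading `Co2A` as carbon; labels that remain ambiguous (for example, all-uppercase ones) would need extra rules that this sketch does not cover.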
|
||
# Examples | ||
|
||
The following example code demonstrates the user-friendly nature of the tool for | ||
users with limited programming experience. The full installation process can be | ||
executed via a Jupyter notebook, which is distributed through the Google Colab | ||
URL provided in the official documentation. | ||
|
||
```python | ||
from cifkit import Cif, Example | ||
|
||
# Initialize with the .cif file path | ||
cif = Cif(Example.Er10Co9In20_file_path) | ||
cif.formula | ||
cif.structure | ||
cif.unique_elements | ||
cif.unitcell_lengths | ||
cif.unitcell_angles | ||
``` | ||
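As an example of what can be derived from unit cell lengths and angles, the sketch below computes the cell volume with the general triclinic formula. It uses only the standard library and does not call `cifkit`. The function name `unit_cell_volume` is hypothetical, and the angles are assumed to be in degrees, so values taken from `cif.unitcell_angles` may need converting first. The lattice parameters are those reported for Er23Co6.7In20.3 in the cited abstract [@tyvanchuk_crystal_2024]; the 90 degree angles are inferred from the orthorhombic space group Pbam.

```python
import math


def unit_cell_volume(lengths, angles_deg):
    """Volume of a unit cell from (a, b, c) and (alpha, beta, gamma) in degrees."""
    a, b, c = lengths
    alpha, beta, gamma = (math.radians(x) for x in angles_deg)
    ca, cb, cg = math.cos(alpha), math.cos(beta), math.cos(gamma)
    # General triclinic formula; reduces to a * b * c when all angles are 90.
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


# Orthorhombic cell (Pbam), lengths in angstroms.
volume = unit_cell_volume((23.203, 28.399, 3.5306), (90.0, 90.0, 90.0))
print(f"{volume:.1f} cubic angstroms")
```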
|
||
To extract information from a set of .cif files: | ||
|
||
```python | ||
from cifkit import CifEnsemble, Example | ||
|
||
# Initialize with the folder path containing .cif files | ||
ensemble = CifEnsemble(Example.ErCoIn_big_folder_path) | ||
ensemble.unique_formulas | ||
ensemble.unique_structures | ||
ensemble.unique_elements | ||
|
||
# Determine shortest pair distance per .cif file | ||
ensemble.minimum_distances | ||
|
||
# Filter .cif by formula | ||
ensemble.filter_by_formulas(["LaRu2Ge2"]) | ||
|
||
# Filter .cif by space group name | ||
ensemble.filter_by_space_group_names("Im-3m") | ||
``` | ||
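To make the idea of a shortest pair distance concrete, the sketch below computes it for a small set of fractional coordinates while accounting for periodic images. This is an illustrative, standard-library version and not necessarily how `cifkit` computes `minimum_distances`. The helper names are hypothetical, angles are assumed to be in degrees, and only the 27 neighboring cells are searched, which should suffice for reasonably shaped cells with coordinates in [0, 1).

```python
import itertools
import math


def cell_matrix(lengths, angles_deg):
    """Rows are the lattice vectors a, b, c in Cartesian coordinates."""
    a, b, c = lengths
    alpha, beta, gamma = (math.radians(x) for x in angles_deg)
    ca, cb, cg = math.cos(alpha), math.cos(beta), math.cos(gamma)
    sg = math.sin(gamma)
    # Common convention: a along x, b in the xy-plane.
    cx = c * cb
    cy = c * (ca - cb * cg) / sg
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return ((a, 0.0, 0.0), (b * cg, b * sg, 0.0), (cx, cy, cz))


def to_cartesian(frac, matrix):
    """Convert fractional coordinates to Cartesian coordinates."""
    return tuple(sum(frac[i] * matrix[i][k] for i in range(3)) for k in range(3))


def minimum_distance(frac_coords, lengths, angles_deg):
    """Shortest distance between any two sites, including periodic images."""
    matrix = cell_matrix(lengths, angles_deg)
    shifts = list(itertools.product((-1, 0, 1), repeat=3))
    best = math.inf
    for i, p in enumerate(frac_coords):
        p_cart = to_cartesian(p, matrix)
        for j, q in enumerate(frac_coords):
            for shift in shifts:
                if i == j and shift == (0, 0, 0):
                    continue  # skip the distance from a site to itself
                image = tuple(q[k] + shift[k] for k in range(3))
                best = min(best, math.dist(p_cart, to_cartesian(image, matrix)))
    return best


# Two sites near opposite faces of a 10 angstrom cubic cell are close
# neighbors once the periodic image is taken into account.
sites = [(0.05, 0.0, 0.0), (0.95, 0.0, 0.0)]
print(minimum_distance(sites, (10.0, 10.0, 10.0), (90.0, 90.0, 90.0)))
```

The brute-force loop scales quadratically with the number of sites, which is acceptable for a sketch but would likely need a neighbor search for large cells.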
|
||
# Applications | ||
|
||
`cifkit` serves as a core package for building applications used by academic and | ||
national laboratories for crystal structural analysis and machine learning | ||
studies. CIF Bond Analyzer (CBA) utilizes `cifkit` to extract coordination | ||
geometry information for a newly discovered phase [@tyvanchuk_crystal_2024]. The | ||
Structure Analysis/Featurizer (SAF) employs `cifkit` to construct and extract | ||
physics-based geometric features for binary and ternary compounds. Furthermore, | ||
geometric features generated with `cifkit` are being incorporated into a | ||
follow-up study on thermoelectric materials [@barua_interpretable_2024], | ||
building upon the compositional properties explored in [@lee_machine_2024]. | ||
|
||
# Testing and documentation | ||
|
||
Tests cover 97 percent of the code according to Codecov. The documentation is | ||
provided at https://bobleesj.github.io/cifkit. | ||
|
||
# Acknowledgement | ||
|
||
We acknowledge the initial testing done by Nishant Yadav, Siddha Sankalpa Sethi, | ||
and Arnab Dutta from the Indian Institute of Technology, Kharagpur. We also | ||
thank Emil Jaffal, Danila Shiryaev, and Alex Vtorov from CUNY Hunter College for | ||
testing. We acknowledge Fabian Zills for his recommendations on Python tooling. | ||
|
||
# References |