-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
2 changed files
with
243 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
## Review by @espottesmith | ||
|
||
**Paper comments:** | ||
|
||
> The summary is generally appropriate. From context, a non-expert might realize what *.cif files are (since crystal structures are mentioned in the second sentence), but this could be more explicitly stated in the first sentence. | ||
Thank you for your suggestion. The summary begins with | ||
|
||
``` | ||
`cifkit` provides higher-level functions and properties for coordination | ||
geometry and atomic site analysis from .cif files, which are standard file | ||
formats for storing crystallographic data such as atomic fractional coordinates, | ||
symmetry operations, and unit cell dimensions. | ||
``` | ||
|
||
> I come from a computational background, so I'm not the target audience, but I am unconvinced that the basic functionality (i.e., parsing CIF files and extracting information like structures and formulas) is easier in cifkit than in pymatgen/ASE. Similar numbers of lines of code would be required, at least comparing pymatgen and the example given in the cifkit paper (see https://pymatgen.org/usage.html#reading-and-writing-structuresmolecules). Considering higher-level features of cifkit, including filtering groups of CIFs and visualization, I think that cifkit distinguishes itself from other codes. In this area, the authors' argument is more convincing. Perhaps a slight reframing of the Statement of Need is appropriate? | ||
This is a valid point. I have modified the "Statement of Need" and emphaszie the utility of higher-level functions by modifying as follow: | ||
|
||
``` | ||
`cifkit` distinguishes itself from existing libraries by offering higher-level | ||
functions and variables that allow solid-state synthesists to obtain intuitive, | ||
yet physically impactful properties. It facilitates the visualization of | ||
coordination geometry from each site using four coordination determination | ||
methods and extracts physics-based features like volume and packing | ||
efficiency—crucial for structural analysis in machine learning tasks. Moreover, | ||
`cifkit` extracts atomic mixing information at the bond pair level, tasks that | ||
would otherwise require extensive manual effort using GUI-based tools like | ||
VESTA, Diamond, and CrystalMaker. These functions can be further developed | ||
on-demand, as demonstrated by `cifkit`'s ability to extract coordination | ||
geometry information based on four coordination number determination methods for | ||
a newly discovered phase [@tyvanchuk_crystal_2024]. | ||
``` | ||
|
||
> Otherwise, the paper looks good! | ||
Thank you! | ||
|
||
**Code comments:** | ||
|
||
> The folder "[occupacny"](https://github.com/bobleesj/cifkit/tree/main/src/cifkit/occupacny) should probably be "occupancy" | ||
Modified. Thank you. | ||
|
||
> I initially tried installing with Python 3.9 (as far as I can tell, the "pyproject.toml" and the docs don't specify a version). This caused problems, as cifkit is using type annotation syntax that is not supported in Python 3.9. I now see on PYPI that cifkit only supports Python 3.10, 3.11, and 3.12. Please make this more clear for users. | ||
In `pyproject.toml`, I have added the following `requires-python = '>=3.10, <3.13` to support 3.10, 3.11, 3.12. Per your suggestion, I have included the supported Python badged in the documentation under the Getting Started page. | ||
|
||
> In terms of code style, more extensive docstrings (for instance, explaining what the inputs and outputs should be) would be appreciated. It seems there are already some "TODO"s to this effect (see e.g., models/cif.py), so perhaps the authors are planning on making these changes soon. Function bodies are typically well commented (note that I have not read every file) | ||
I have added numpy-style docstrings for the main classes `Cif` and `CifEnsemble` here: https://github.com/bobleesj/cifkit/pull/49. I've also identified some code methods that could be refactored, as noted here: https://github.com/bobleesj/cifkit/issues/53. My next step is to further refine these docstrings by consolidating them into reusable methods. | ||
|
||
**Documentation comments:** | ||
|
||
> Examples in the docs aren't the most consistent. For instance, some example lines of code use "ensemble" as the variable, and some use "ensemble_test". It may also be better if the commented outputs reflected the example folder ("Example.ErCoIn_big_folder_path") so users can more easily verify correctness. Finally, incorporating the visualization tools in the docs (currently they're demonstrated only in the README) would be appreciated. | ||
Thank you for your suggestion. I have now maintained the isntance name `ensemble` consistently. The actual ouput using the Jupyter notebook has been used for all examples. | ||
|
||
> The statement of need could be made more clear in the docs | ||
I have added the statement of need in the beginning of the docs: https://bobleesj.github.io/cifkit/. | ||
|
||
> Community guidelines basically amount to "submit a PR or talk to the lead author", but for a project at this scale, perhaps that's appropriate. | ||
Thank you. I have created a PR template https://github.com/bobleesj/cifkit/blob/main/.github/pull_request_template.md and instructions on how to contribute here https://github.com/bobleesj/cifkit/blob/main/CONTRIBUTING.md. | ||
|
||
**Overall comments:** | ||
|
||
> In terms of "substantial scholarly effort", this is a reasonably small, simple, and young project. That said, it is clearly already being used for scholarly research, and it is somewhat unique in terms of the features that it offers. As such, I think it should be considered substantial. | ||
Thank you. For one of the projects mentioned, SAF (Structure/Composition Analyzer), which uses approximately 100 features generated using cifkit, we have a pre-print version available: https://chemrxiv.org/engage/chemrxiv/article-details/670aa269cec5d6c142f3b11a. I will continue to maintain the software for ongoing research efforts. | ||
|
||
> The paper does not contain original data or results | ||
Yes, .cif files must be provided for us to conduct analysis. | ||
|
||
> The authors do not make significant performance claims | ||
I have added information to the documentation stating that processing approximately 10,000 .cif files on a standard laptop (iMac with M1 chip) takes about 30 to 60 minutes. At this rate, we can process nearly all .cif files within 1–2 days. However, I've decided not to include performance metrics in the manuscript. Our code may evolve further—for instance, by using matrix multiplications to compute distances, as in ASE—and performance hasn't been a major bottleneck for our research. | ||
|
||
> With some relatively small tweaks, I think this will make a good addition to JOSS. | ||
Thank you for your review! |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,161 @@ | ||
## Review by @lancekavalsky | ||
|
||
**Code comments:** | ||
|
||
> For io related methods, e.g. the ones that generate histograms, more clarity is needed regarding where they write things by default. Running the histogram tutorial from the documentation, it wrote the histograms to a folder deep in my conda environment site-packages, which would likely not be intuitive to many users (particularly as this package is presented as being catered to users with less coding experience) and may cause issues on shared resources. | ||
Thank you for pointing that out. Yes, the histogram images were saved to the Anaconda environment, as that was where the .cif files were provided by default in the package. In the docs, I have included information about options to set the output path for users: | ||
|
||
```python | ||
# Optional: Specify the output directory where the .png file will be saved. | ||
ensemble.generate_site_mixing_type_histogram(output_dir="path/to/directory") | ||
|
||
# Optional: Call plt.show() to display the histogram on screen. | ||
ensemble.generate_site_mixing_type_histogram(display=False) | ||
``` | ||
|
||
For API users, I have include numpy-stlye docstrings: | ||
|
||
```python | ||
def generate_supercell_size_histogram( | ||
self, display=False, output_dir=None | ||
): | ||
"""Generate a histogram of the 'supercell_count' property from CIF files. | ||
This method creates a histogram based on the 'supercell_count' statistics of | ||
the CIF files. If 'output_dir' is specified, the histogram image (.png) will be | ||
saved to that directory. If 'output_dir' is not specified, the image will be saved | ||
to the directory specified by 'self.dir_path'. | ||
Parameters | ||
---------- | ||
display : bool, optional | ||
If True, the plot is displayed using plt.show(). Default is False. | ||
output_dir : str, optional | ||
The directory path where the histogram should be saved. If None, | ||
the histogram is saved in the directory defined by 'self.dir_path'. | ||
""" | ||
``` | ||
|
||
> I will echo the previous comment that more docstrings would be invaluable in helping code clarity. In particular, I would urge adopting a standardized approach to this, such as Numpy, to be more in line with community standards | ||
Per your suggestion, I have used Numpy docstrings to the core `Cif` and `CifEnsemble` classes. I will continue to udpate the docs after I make someo f the functions more dmoularlized like combinignign generating histogram functions into a singlo to avoid verbosity. https://github.com/bobleesj/cifkit/pull/54 | ||
|
||
> I have opened a technical issue I encountered here | ||
Thank you. I have addressed the issue via this PR: https://github.com/bobleesj/cifkit/pull/47. To ensure compatibility, I created a test function to ensure that raw .cif files sourced from ICSD, COD can be all parsed and the supercell can be generated: | ||
|
||
```python | ||
@pytest.mark.parametrize( | ||
"cif_folder_path, expected_file_count, expected_supercell_stats", | ||
[ | ||
("tests/data/cif/sources/ICSD", 4, {216: 2, 307: 1, 320: 1}), | ||
("tests/data/cif/sources/COD", 2, {519: 1, 1383: 1}), | ||
("tests/data/cif/sources/MP", 2, {108: 1, 594: 1}), | ||
("tests/data/cif/sources/PCD", 1, {364: 1}), | ||
("tests/data/cif/sources/MS", 1, {2988: 1}), | ||
("tests/data/cif/sources/CCDC", 1, {3844: 1}), | ||
], | ||
) | ||
``` | ||
|
||
**Documentation comments:** | ||
|
||
> As the core classes for this package are Cif and CifEnsemble, more explicit explanations as to the inputs and parameters would be helpful -- especially for CifEnsemble. For example, unless I'm missing it, a comprehensive list of everything that the preprocess parameter triggers is not mentioned in the documentation. | ||
Thank you. I have included docstrings for Cif and CifEnsemble classes to provide explicit parameters as shown below and the preprocessing triggers in the docstrings. The docs can be further improved. I hope it serves the purpose for now. I plan on maintaining this package during my graduate school. | ||
|
||
```python | ||
class CifEnsemble: | ||
def __init__( | ||
self, | ||
cif_dir_path: str, | ||
add_nested_files=False, | ||
preprocess=True, | ||
logging_enabled=False, | ||
) -> None: | ||
"""Initialize a CifEnsemble object, containing a collection of Cif | ||
objects. | ||
Parameters | ||
---------- | ||
cif_dir_path : str | ||
Path to the folder path containing .cif file(s). | ||
add_nested_files : bool, optional | ||
Option to include .cif files contained in sub-directories within cif_dir_path | ||
, by default False | ||
preprocess : bool, optional | ||
Option to edit .cif files before initializing each into a Cif object, | ||
by default True. Preprocess modifies atomic site labels in | ||
atom_site_label. Some site labels may contain a comma or a symbol like M | ||
due to atomic mixing. It reformats each atom_site_label so it can be | ||
parsed into an element type matching atom_site_type_symbol. For PCD | ||
databases, addresses in publ_author_address often have an incorrect | ||
format requiring manual modifications. It also relocates any ill-formatted | ||
files, such as those with duplicate labels in atom_site_label, missing | ||
fractional coordinates, or files requiring supercell generation. | ||
logging_enabled : bool, optional | ||
Option to log while pre-processing Cif objects, by default False | ||
Attributes | ||
---------- | ||
dir_path: str | ||
Path to the folder containing .cif files | ||
file_paths: list[str] | ||
List of file paths to .cif files | ||
cifs: list[Cif] | ||
List of Cif objects | ||
file_count: int | ||
Number of .cif files in the folder | ||
logging_enabled: bool | ||
Option to log while pre-processing Cif objects | ||
""" | ||
``` | ||
|
||
> In the documentation there are a couple instances of general clean-up required. One example is the first box on Getting Started uses a CIF method ensemble.cif_folder_path which gives an error when run. Another example is under the CIF specific documentation which refers to a README.md for complete documentation, but it is unclear where this file is located (since that info doesn't appear to be the main README in the repo?). | ||
I have rewritten the documentation to improve clarity and personally tested each example to ensure there are no bugs. I have also added the following to indiate where the default files are located as a comment: | ||
|
||
```python | ||
# In `cifkit` we provide .cif files that can be accessed through `from cifkit import Example` as shown below. For advancuser, these example .cif files are located under `src/cifkit/data` in the package. | ||
|
||
from cifkit import Example | ||
from cifkit import Cif | ||
|
||
# Initialize with the example file provided | ||
cif = Cif(Example.Er10Co9In20_file_path) | ||
|
||
# Print attributes | ||
print("File name:", cif.file_name) | ||
print("Formula:", cif.formula) | ||
print("Unique element:", cif.unique_elements) | ||
``` | ||
|
||
> There are options in the Cif class to use either the by_d_min_method or by_best_methods. Please refer to the README.md for complete documentation. | ||
I have included detailed documentations in the API. https://github.com/bobleesj/cifkit/blob/dbaf32400b70f323ba5965526193704b2613ea7b/src/cifkit/models/cif.py#L619 and https://github.com/bobleesj/cifkit/blob/dbaf32400b70f323ba5965526193704b2613ea7b/src/cifkit/models/cif.py#L682. | ||
|
||
> This is relatively minor, but on the documentation website it wasn't immediately clear to me what would be contained under the Notebooks tab at the top. Given this outlines several of the core functionalities of cifkit, I would consider renaming it to something more descriptive. | ||
Thank you for your feedback. I have replaced the name "Notebooks" but into 4 sections: `Getting started` `Cif` `CifEnsemble` and `API References`. | ||
|
||
> The contributing guidelines at this stage are somewhat vague. For example, is a minimum code coverage required for added features? | ||
I have included a PR request template (https://github.com/bobleesj/cifkit/blob/main/.github/pull_request_template.md) as well as the `CONTRIBUTING.md` for how to fork, clone, commit, etc. | ||
|
||
**Paper comments:** | ||
|
||
> In the summary it mentions that this package is designed to process datasets on "the order of tens of thousands". It is not clear to me where this is coming from and what exactly causes the bottleneck for going beyond this estimate. Details regarding what determined this limitation would be helpful to judge high-throughput performance | ||
I have added the following to the front page of the documentation: "Based on the Apple M1 iMac chip, processing 150 .cif files takes around 1 to 2 minutes, though this depends on the size of the unit cells in the .cif files. Processing 10,000 or more .cif files may take about an hour or two, but this also depends on the computing device. We expect that we can process all .cif files within 1-2 days." | ||
|
||
I am reluctant to add this information to the paper at the moment since the processing speed can vary based on the supercell size. We might come up with an improved method for computing connections using matrices, similar to how ASE does, instead of loops to iteratively compute distances between all atoms. | ||
|
||
> The examples in the paper are presented with limited explanation as to what they are showing. While the comments help in the second example, and some of the methods are self-explanatory by their naming, more comments would help cement the clarity here. | ||
Thank you. I have added higher-level functions that are considered novel in this paper and added comments. | ||
|
||
> Overall, this work would make a great addition to JOSS pending the minor revisions described above. | ||
Thank you for your review and recommendation! |