Commit
Merge pull request #10 from CDDLeiden/dev
Create v3.0.1 Release
HellevdM authored Feb 28, 2024
2 parents b28f843 + 2e4cf64 commit 468ee11
Showing 6 changed files with 116 additions and 248 deletions.
142 changes: 7 additions & 135 deletions CHANGELOG.md
@@ -1,146 +1,18 @@
# Change Log

From v2.1.1 to v3.0.0
From v3.0.0 to v3.0.1

## Fixes

- Fixed random seeds to give reproducible results. Each dataset is initialized with a
  single random state (either from the constructor or a random number generator), which
  is used in all subsequent random operations. Each model is also initialized with a
  single random state: it uses the random state from the dataset unless it is overridden
  in the constructor. When a dataset is saved to a file, so is its random state, which
  is used by the dataset when it is reloaded.
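The random-state propagation described above can be sketched as follows (illustrative class and attribute names, not the actual qsprpred API):

```python
import numpy as np

class Dataset:
    """Sketch: the dataset owns a single random state used for all random operations."""
    def __init__(self, random_state=None):
        # draw a seed from a fresh generator if none was given
        self.randomState = (
            random_state
            if random_state is not None
            else int(np.random.default_rng().integers(0, 2**31))
        )

class Model:
    """Sketch: the model reuses the dataset's random state unless overridden."""
    def __init__(self, dataset, random_state=None):
        self.randomState = (
            random_state if random_state is not None else dataset.randomState
        )

ds = Dataset(random_state=42)
assert Model(ds).randomState == 42                  # inherited from the dataset
assert Model(ds, random_state=7).randomState == 7   # constructor overrides
```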
- Fixed an error with serialization of the `DNNModel.params` attribute when no
  parameters are set.
- Fixed a bug with saving predictions from classification models
  when `ModelAssessor.useProba` is set to `False`.
- Added the missing implementation of `QSPRDataset.removeProperty`.
- Improved behavior of the Papyrus data source (does not attempt to connect to the
internet if the data set already exists).
- It is now possible to define new descriptor sets outside the package without errors.
- Basic consistency of models is also checked in the unit test suite, including in
the `qsprpred.extra` package.
- Fixed a problem with the feature standardizer being retrained on prediction data when
  a prediction from SMILES was invoked. This affected all versions of the package from
  `v2.1.0` onward.
- Fixes to the `fromMolTable` method in various data set implementations, in particular
in copying of the feature standardizer and other settings.
- Fixed the non-functional `cluster` split and `--imputation` option in `data_CLI.py`.
- Fixed a problem with `ProteinDescriptorSet.getDescriptors` returning descriptors in
  the wrong order with `Pandas <v2.2.0`.
- Fixed a bug in `QSPRDataset` where property transformations were not applied.

## Changes

- The model is now independent of data sets. This means that the model no longer
contains a reference to the data set it was trained on.
- The `fitAttached` method was replaced with `fitDataset`, which takes the data set as
  an argument.
- Assessors now also accept a data set as a second argument, so the same assessor can
  be used to assess different data sets with the same model settings.
- The monitoring API was also slightly modified to reflect this change.
- If a model requires initialization of some settings from data, this can be done in
its `initFromDataset` method, which takes the data set as an argument. This method
is called automatically before fitting, model assessment, and hyperparameter
optimization.
- The whole package was refactored to simplify certain commonly used imports. The
tutorial code was adjusted to reflect that.
- The jupyter notebooks in the tutorial now pass a random state to ensure consistent
results.
- The default parameter values for `STFullyConnected` have changed: `n_epochs` from
  1000 to 100, `neurons_h1` from 4000 to 256, and `neurons_hx` from 1000 to 128.
- Renamed `HyperParameterOptimization` to `HyperparameterOptimization`.
- `TargetProperty.fromList` and `TargetProperty.fromDict` now accept both a string and
  a `TargetTask` as the `task` argument, without having to set the `task_from_str`
  argument, which is now deprecated.
- Made `EarlyStopping.mode` flexible for `QSPRModel.fitDataset`.
- Added a `save_params` argument to `OptunaOptimization` to save the best
  hyperparameters to the model (default: `True`).
- We now use `jsonpickle` for object serialization, which is more flexible than the
  previous non-standard approach, but it also means models saved with earlier versions
  are not compatible with this one.
- `SklearnMetric` was renamed to `SklearnMetrics`; it now also accepts a scikit-learn
  scorer name as input.
- `QSPRModel.fitDataset` now accepts `save_model` (default: `True`)
  and `save_dataset` (default: `False`) arguments to save the model and data set to a
  file after fitting.
- Tutorials were completely rewritten and expanded. They can now be found in
the `tutorials` folder instead of the `tutorial` folder.
- `MetricsPlot` now supports multi-class and multi-task classification models.
- `CorrelationPlot` now supports multi-task regression models.
- The behaviour of `QSPRDataset` was changed with regard to target properties. It now
  remembers the original state of any target property, and all changes (e.g. conversion
  to multi-class classification) are performed in place on the original property column.
  This maintains the same property name and keeps the option to reset the property to
  its raw original state (e.g. to switch back to regression or repeat a transformation).
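The in-place transformation with a resettable original can be illustrated with plain pandas (the column name and threshold are illustrative, not taken from qsprpred):

```python
import pandas as pd

df = pd.DataFrame({"pchembl_value": [5.2, 6.8, 7.4]})

# keep the raw values so the property can always be reset
original = df["pchembl_value"].copy()

# convert to binary classification in place, keeping the same column name
df["pchembl_value"] = (df["pchembl_value"] > 6.5).astype(int)
assert df["pchembl_value"].tolist() == [0, 1, 1]

# reset to the raw regression values, e.g. to repeat with another threshold
df["pchembl_value"] = original
assert df["pchembl_value"].tolist() == [5.2, 6.8, 7.4]
```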
- The default log level for the package was changed from `INFO` to `WARNING`. A new
  tutorial was added to explain how to change the log level.
- `RepeatsFilter` argument `year_name` was renamed to `time_col` and an
  `additional_cols` argument was added.
- The `perc` argument of `BorutaPy` can now be set from the CLI.
- Descriptor calculators (previously used to aggregate and manage descriptor sets) were
completely removed from the API and descriptor sets can now be added directly to the
molecule tables.
- The rdkit-like descriptor and fingerprint retrieval functions were removed from the
API because they complicated implementation of customized descriptors.
- The `apply` method was simplified and a new API was clearly defined for parallel
processing of properties over data sets. To improve molecule processing,
a `processMols` method was added to `MoleculeTable`.
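The idea of applying a property calculation over molecules in parallel can be sketched as follows (a generic sketch with illustrative names; qsprpred's actual `processMols` may use processes and chunking):

```python
from concurrent.futures import ThreadPoolExecutor

def atom_letter_count(smiles):
    # toy stand-in for a per-molecule property calculation
    return sum(ch.isalpha() for ch in smiles)

def process_mols(smiles_list, func, n_jobs=2):
    """Sketch: map a function over all molecules using a worker pool."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(func, smiles_list))

assert process_mols(["CCO", "c1ccccc1"], atom_letter_count) == [3, 6]
```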
- Renamed `PandasDataTable.transform` to `PandasDataTable.transformProperties`.
- Moved `imputeProperties`, `dropEmptyProperties` and `hasProperty` from
  `MoleculeTable` to `PandasDataTable`.
- Removed `getProperties`, `addProperty` and `removeProperty` from `MoleculeTable`;
  use the `PandasDataTable` methods directly.

## New Features

- The `qsprpred.benchmarks` module was added, which contains functions to easily
  benchmark models on datasets.
- Most unit tests now have a variant that checks whether using a fixed random seed gives
reproducible results.
- The build pipeline now contains a check that the jupyter notebooks produce the same
  results as previously observed.
- Added `FitMonitor`, `AssessorMonitor`, and `HyperparameterOptimizationMonitor` base
classes to monitor the progress of fitting, assessing, and hyperparameter
optimization, respectively.
- Added `BaseMonitor` class to internally keep track of the progress of a fitting,
assessing, or hyperparameter optimization process.
- Added `FileMonitor` class to save the progress of a fitting, assessing, or
hyperparameter optimization process to files.
- Added `WandBMonitor` class to save the progress of a fitting, assessing, or
hyperparameter optimization process to [Weights & Biases](https://wandb.ai/).
- Added `NullMonitor` class to ignore the progress of a fitting, assessing, or
hyperparameter optimization process.
- Added `ListMonitor` class to combine multiple monitors.
- Cross-validation, testing, hyperparameter optimization and early-stopping were made
more flexible by allowing custom splitting and fold generation strategies. A tutorial
showcasing these features was created.
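A custom splitting strategy of the kind this enables can be sketched as a group-aware splitter (illustrative names and API, not the qsprpred splitter interface):

```python
import numpy as np

class GroupShuffleSplitSketch:
    """Sketch: hold out whole groups (e.g. scaffolds or clusters) so that
    no group is shared between the training and test sets."""
    def __init__(self, test_fraction=0.2, seed=42):
        self.test_fraction = test_fraction
        self.seed = seed

    def split(self, groups):
        groups = np.asarray(groups)
        unique = np.unique(groups)
        rng = np.random.default_rng(self.seed)
        rng.shuffle(unique)
        n_test = max(1, int(round(len(unique) * self.test_fraction)))
        test_groups = set(unique[:n_test])
        mask = np.isin(groups, list(test_groups))
        # yield train indices, test indices
        yield np.where(~mask)[0], np.where(mask)[0]

groups = ["a", "a", "b", "b", "c"]
train, test = next(GroupShuffleSplitSketch(test_fraction=0.34).split(groups))
# no group appears in both sets
assert set(np.asarray(groups)[train]).isdisjoint(np.asarray(groups)[test])
```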
- Added a `reset` method to `QSPRDataset`, which resets splits and loads all descriptors
into the training set matrix again.
- Added `ConfusionMatrixPlot` to plot confusion matrices.
- Added the `searchWithIndex`, `searchOnProperty`, `searchWithSMARTS` and `sample`
  methods to `MoleculeTable` to facilitate more advanced sampling from data.
- Assessors now have a `split_multitask_scores` flag that can be used to evaluate each
  task separately with single-task metrics.
- `MoleculeDataSet`s now have a `smiles` property to easily get SMILES strings.
- A Docker-based runner in `testing/runner` can now be used to test GPU-enabled features
and run the full CI pipeline.
- It is now possible to save `PandasDataTable`s (and thus data sets) to a CSV file
  instead of the default pickle format. This makes them easier to debug and share, but
  slower to load and save.
- Added a new `RegressionPlot` class, `WilliamsPlot`, to plot Williams plots.
- Added `ApplicabilityDomain` class to calculate applicability domain and filter
outliers from test sets.
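One common way such an applicability domain check works can be sketched with a nearest-neighbour distance criterion (an assumption for illustration; the qsprpred `ApplicabilityDomain` class may use a different definition):

```python
import numpy as np

class NearestNeighbourAD:
    """Sketch: a sample is inside the applicability domain if its distance
    to the nearest training sample is below a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.X_train = None

    def fit(self, X_train):
        self.X_train = np.asarray(X_train, dtype=float)
        return self

    def contains(self, X):
        X = np.asarray(X, dtype=float)
        # pairwise Euclidean distances to every training point
        dists = np.linalg.norm(X[:, None, :] - self.X_train[None, :, :], axis=-1)
        return dists.min(axis=1) <= self.threshold

ad = NearestNeighbourAD(threshold=1.0).fit([[0.0, 0.0], [1.0, 1.0]])
assert ad.contains([[0.5, 0.5], [5.0, 5.0]]).tolist() == [True, False]
```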

## Removed Features

- The `Metric` interface has been simplified to make it easier to implement custom
  metrics. It now only requires the implementation of the `__call__` method, which
  takes predictions and returns a `float`, and no longer requires
  `needsDiscreteToScore`, `needsProbaToScore` or `supportsTask`. As a consequence, the
  base functionality of `checkMetricCompatibility`, `isClassificationMetric`
  and `isRegressionMetric` is no longer available.
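A custom metric under the simplified interface reduces to a single callable; a minimal sketch (the exact `__call__` signature in qsprpred may differ):

```python
class Metric:
    """Sketch of the simplified interface: subclasses only implement
    __call__, which takes true values and predictions and returns a float."""
    def __call__(self, y_true, y_pred) -> float:
        raise NotImplementedError

class MeanAbsoluteError(Metric):
    def __call__(self, y_true, y_pred) -> float:
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

mae = MeanAbsoluteError()
assert abs(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]) - 2 / 3) < 1e-12
```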
- The default hyperparameter search space file is no longer available.
14 changes: 10 additions & 4 deletions qsprpred/data/tables/base.py
@@ -64,6 +64,16 @@ def removeProperty(self, name: str):
name (str): The name of the property.
"""

@abstractmethod
def transformProperties(self, names, transformers):
"""Transform property values using a transformer function.
Args:
    names (list[str]): list of column names to transform.
    transformers (Callable): functions that transform the data in the target
        columns to a new representation.
"""

@abstractmethod
def getSubset(self, prefix: str):
"""Get a subset of the dataset.
@@ -90,10 +100,6 @@ def apply(
func_kwargs (dict, optional): The keyword arguments of the function.
"""

@abstractmethod
def transform(self, targets, transformers):
pass

@abstractmethod
def filter(self, table_filters: list[Callable]):
"""Filter the dataset.
70 changes: 0 additions & 70 deletions qsprpred/data/tables/mol.py
@@ -679,37 +679,10 @@ def dropDescriptors(
self.descriptors[idx].clearFiles()
self.descriptors.pop(idx)

def imputeProperties(self, names: list[str], imputer: Callable):
"""Impute missing property values.
Args:
names (list):
List of property names to impute.
imputer (Callable):
imputer object implementing the `fit_transform`
method from scikit-learn API.
"""
assert hasattr(imputer, "fit_transform"), (
"Imputer object must implement the `fit_transform` "
"method from scikit-learn API."
)
assert all(
name in self.df.columns for name in names
), "Not all target properties in dataframe columns."
names_old = [f"{name}_before_impute" for name in names]
self.df[names_old] = self.df[names]
self.df[names] = imputer.fit_transform(self.df[names])
logger.debug(f"Imputed missing values for properties: {names}")
logger.debug(f"Old values saved in: {names_old}")
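A usage sketch of the `fit_transform` contract this method asserts on, with a minimal hand-rolled imputer (scikit-learn's `SimpleImputer` satisfies the same contract; the column name is illustrative):

```python
import pandas as pd

class MeanImputer:
    """Minimal imputer implementing the scikit-learn fit_transform contract."""
    def fit_transform(self, X):
        # fill missing values with the per-column mean
        return X.fillna(X.mean())

df = pd.DataFrame({"pIC50": [5.0, None, 7.0]})
df[["pIC50"]] = MeanImputer().fit_transform(df[["pIC50"]])
assert df["pIC50"].tolist() == [5.0, 6.0, 7.0]
```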

def dropEmptySmiles(self):
"""Drop rows with empty SMILES from the data set."""
self.df.dropna(subset=[self.smilesCol], inplace=True)

def dropEmptyProperties(self, names: list[str]):
"""Drop rows with empty target property value from the data set."""
self.df.dropna(subset=names, how="all", inplace=True)

def attachDescriptors(
self,
calculator: DescriptorSet,
@@ -874,49 +847,6 @@ def smiles(self) -> Generator[str, None, None]:
"""
return iter(self.df[self.smilesCol].values)

def getProperties(self):
"""Get names of all properties/variables saved in the data frame (all columns).
Returns:
list: list of property names.
"""
return self.df.columns.tolist()

def hasProperty(self, name):
"""Check whether a property is present in the data frame.
Args:
name (str): Name of the property.
Returns:
bool: Whether the property is present.
"""
return name in self.df.columns

def addProperty(self, name: str, data: list):
"""Add a property to the data frame.
Args:
name (str): Name of the property.
data (list): list of property values.
"""
if isinstance(data, pd.Series):
if not np.array_equal(data.index, self.df.index):
logger.info(
f"Adding property '{name}' to data set might be introducing 'nan' "
"values due to index with pandas series. Make sure the index of "
"the data frame and the series match or convert series to list."
)
self.df[name] = data
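The index-alignment pitfall this method warns about can be demonstrated directly with pandas:

```python
import pandas as pd

df = pd.DataFrame({"SMILES": ["CCO", "CCC"]}, index=[10, 11])
values = pd.Series([1.2, 3.4], index=[0, 1])  # index does not match df

df["prop"] = values            # pandas aligns on index -> all NaN here
assert df["prop"].isna().all()

df["prop"] = list(values)      # a plain list bypasses index alignment
assert df["prop"].tolist() == [1.2, 3.4]
```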

def removeProperty(self, name):
"""Remove a property from the data frame.
Args:
name (str): Name of the property to delete.
"""
del self.df[name]

def addScaffolds(
self,
scaffolds: list[Scaffold],
