Commit
Merge pull request #10 from CDDLeiden/dev
Create v3.0.1 Release
HellevdM authored Feb 28, 2024
2 parents b28f843 + 2e4cf64 commit 468ee11
Showing 6 changed files with 116 additions and 248 deletions.
142 changes: 7 additions & 135 deletions CHANGELOG.md
@@ -1,146 +1,18 @@
# Change Log

From v2.1.1 to v3.0.0
From v3.0.0 to v3.0.1

## Fixes

- Fixed random seeds to give reproducible results. Each dataset is initialized with a
  single random state (either from the constructor or a random number generator), which
  is used in all subsequent random operations. Each model is also initialized with a
  single random state: it uses the random state from the dataset unless it is overridden
  in the constructor. When a dataset is saved to a file, so is its random state, which
  is used by the dataset when it is reloaded.
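The random-state propagation described above can be sketched as follows (illustrative class and attribute names, not the actual qsprpred API):

```python
import numpy as np

class Dataset:
    """Sketch: the dataset owns a single random state used for all random operations."""
    def __init__(self, random_state=None):
        # draw a seed from a fresh generator if none was given
        self.randomState = (
            random_state
            if random_state is not None
            else int(np.random.default_rng().integers(0, 2**31))
        )

class Model:
    """Sketch: the model reuses the dataset's random state unless overridden."""
    def __init__(self, dataset, random_state=None):
        self.randomState = (
            random_state if random_state is not None else dataset.randomState
        )

ds = Dataset(random_state=42)
assert Model(ds).randomState == 42                  # inherited from the dataset
assert Model(ds, random_state=7).randomState == 7   # constructor overrides
```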
- Fixed an error with serialization of the `DNNModel.params` attribute when no
  parameters are set.
- Fixed a bug with saving predictions from classification models
  when `ModelAssessor.useProba` is set to `False`.
- Added the missing implementation of `QSPRDataset.removeProperty`.
- Improved behavior of the Papyrus data source (does not attempt to connect to the
internet if the data set already exists).
- It is now possible to define new descriptor sets outside the package without errors.
- Basic consistency of models is also checked in the unit test suite, including in
the `qsprpred.extra` package.
- Fixed a problem with the feature standardizer being retrained on prediction data when
  a prediction from SMILES was invoked. This affected all versions of the package from
  `v2.1.0` onward.
- Fixes to the `fromMolTable` method in various data set implementations, in particular
in copying of the feature standardizer and other settings.
- Fixed the non-functional `cluster` split and `--imputation` option in `data_CLI.py`.
- Fixed a problem with `ProteinDescriptorSet.getDescriptors` returning descriptors in
  the wrong order with `Pandas <v2.2.0`.
- Fixed a bug in `QSPRDataset` where property transformations were not applied.

## Changes

- The model is now independent of data sets. This means that the model no longer
contains a reference to the data set it was trained on.
- The `fitAttached` method was replaced with `fitDataset`, which takes the data set as
  an argument.
- Assessors now also accept a data set as a second argument, so the same assessor can
  be used to assess different data sets with the same model settings.
- The monitoring API was also slightly modified to reflect this change.
- If a model requires initialization of some settings from data, this can be done in
its `initFromDataset` method, which takes the data set as an argument. This method
is called automatically before fitting, model assessment, and hyperparameter
optimization.
- The whole package was refactored to simplify certain commonly used imports. The
tutorial code was adjusted to reflect that.
- The jupyter notebooks in the tutorial now pass a random state to ensure consistent
results.
- The default parameter values for `STFullyConnected` have changed: `n_epochs` from
  1000 to 100, `neurons_h1` from 4000 to 256, and `neurons_hx` from 1000 to 128.
- Renamed `HyperParameterOptimization` to `HyperparameterOptimization`.
- `TargetProperty.fromList` and `TargetProperty.fromDict` now accept both a string and
  a `TargetTask` as the `task` argument, without having to set the `task_from_str`
  argument, which is now deprecated.
- Made `EarlyStopping.mode` flexible for `QSPRModel.fitDataset`.
- Added a `save_params` argument to `OptunaOptimization` to save the best
  hyperparameters to the model (default: `True`).
- We now use `jsonpickle` for object serialization, which is more flexible than the
  previous non-standard approach, but it also means models saved with earlier versions
  are not compatible with this one.
- `SklearnMetric` was renamed to `SklearnMetrics`; it now also accepts a scikit-learn
  scorer name as input.
- `QSPRModel.fitDataset` now accepts `save_model` (default: `True`)
  and `save_dataset` (default: `False`) arguments to save the model and data set to a
  file after fitting.
- Tutorials were completely rewritten and expanded. They can now be found in
the `tutorials` folder instead of the `tutorial` folder.
- `MetricsPlot` now supports multi-class and multi-task classification models.
- `CorrelationPlot` now supports multi-task regression models.
- The behaviour of `QSPRDataset` was changed with regard to target properties. It now
  remembers the original state of any target property, and all changes (e.g. conversion
  to multi-class classification) are performed in place on the original property column.
  This maintains the same property name and keeps the option to reset the property to
  its raw original state (e.g. to switch back to regression or repeat a transformation).
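The in-place transformation with a resettable original can be illustrated with plain pandas (the column name and threshold are illustrative, not taken from qsprpred):

```python
import pandas as pd

df = pd.DataFrame({"pchembl_value": [5.2, 6.8, 7.4]})

# keep the raw values so the property can always be reset
original = df["pchembl_value"].copy()

# convert to binary classification in place, keeping the same column name
df["pchembl_value"] = (df["pchembl_value"] > 6.5).astype(int)
assert df["pchembl_value"].tolist() == [0, 1, 1]

# reset to the raw regression values, e.g. to repeat with another threshold
df["pchembl_value"] = original
assert df["pchembl_value"].tolist() == [5.2, 6.8, 7.4]
```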
- The default log level for the package was changed from `INFO` to `WARNING`. A new
  tutorial was added to explain how to change the log level.
- `RepeatsFilter` argument `year_name` was renamed to `time_col` and an
  `additional_cols` argument was added.
- The `perc` argument of `BorutaPy` can now be set from the CLI.
- Descriptor calculators (previously used to aggregate and manage descriptor sets) were
completely removed from the API and descriptor sets can now be added directly to the
molecule tables.
- The rdkit-like descriptor and fingerprint retrieval functions were removed from the
API because they complicated implementation of customized descriptors.
- The `apply` method was simplified and a new API was clearly defined for parallel
processing of properties over data sets. To improve molecule processing,
a `processMols` method was added to `MoleculeTable`.
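The idea of applying a property calculation over molecules in parallel can be sketched as follows (a generic sketch with illustrative names; qsprpred's actual `processMols` may use processes and chunking):

```python
from concurrent.futures import ThreadPoolExecutor

def atom_letter_count(smiles):
    # toy stand-in for a per-molecule property calculation
    return sum(ch.isalpha() for ch in smiles)

def process_mols(smiles_list, func, n_jobs=2):
    """Sketch: map a function over all molecules using a worker pool."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(func, smiles_list))

assert process_mols(["CCO", "c1ccccc1"], atom_letter_count) == [3, 6]
```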
- Renamed `PandasDataTable.transform` to `PandasDataTable.transformProperties`.
- Moved `imputeProperties`, `dropEmptyProperties` and `hasProperty` from
  `MoleculeTable` to `PandasDataTable`.
- Removed `getProperties`, `addProperty` and `removeProperty` from `MoleculeTable`;
  use the `PandasDataTable` methods directly.

## New Features

- The `qsprpred.benchmarks` module was added, which contains functions to easily
  benchmark models on datasets.
- Most unit tests now have a variant that checks whether using a fixed random seed gives
reproducible results.
- The build pipeline now contains a check that the jupyter notebooks produce the same
  results as previously observed.
- Added `FitMonitor`, `AssessorMonitor`, and `HyperparameterOptimizationMonitor` base
classes to monitor the progress of fitting, assessing, and hyperparameter
optimization, respectively.
- Added `BaseMonitor` class to internally keep track of the progress of a fitting,
assessing, or hyperparameter optimization process.
- Added `FileMonitor` class to save the progress of a fitting, assessing, or
hyperparameter optimization process to files.
- Added `WandBMonitor` class to save the progress of a fitting, assessing, or
hyperparameter optimization process to [Weights & Biases](https://wandb.ai/).
- Added `NullMonitor` class to ignore the progress of a fitting, assessing, or
hyperparameter optimization process.
- Added `ListMonitor` class to combine multiple monitors.
- Cross-validation, testing, hyperparameter optimization and early-stopping were made
more flexible by allowing custom splitting and fold generation strategies. A tutorial
showcasing these features was created.
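A custom splitting strategy of the kind this enables can be sketched as a group-aware splitter (illustrative names and API, not the qsprpred splitter interface):

```python
import numpy as np

class GroupShuffleSplitSketch:
    """Sketch: hold out whole groups (e.g. scaffolds or clusters) so that
    no group is shared between the training and test sets."""
    def __init__(self, test_fraction=0.2, seed=42):
        self.test_fraction = test_fraction
        self.seed = seed

    def split(self, groups):
        groups = np.asarray(groups)
        unique = np.unique(groups)
        rng = np.random.default_rng(self.seed)
        rng.shuffle(unique)
        n_test = max(1, int(round(len(unique) * self.test_fraction)))
        test_groups = set(unique[:n_test])
        mask = np.isin(groups, list(test_groups))
        # yield train indices, test indices
        yield np.where(~mask)[0], np.where(mask)[0]

groups = ["a", "a", "b", "b", "c"]
train, test = next(GroupShuffleSplitSketch(test_fraction=0.34).split(groups))
# no group appears in both sets
assert set(np.asarray(groups)[train]).isdisjoint(np.asarray(groups)[test])
```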
- Added a `reset` method to `QSPRDataset`, which resets splits and loads all descriptors
into the training set matrix again.
- Added `ConfusionMatrixPlot` to plot confusion matrices.
- Added the `searchWithIndex`, `searchOnProperty`, `searchWithSMARTS` and `sample`
  methods to `MoleculeTable` to facilitate more advanced sampling from data.
- Assessors now have a `split_multitask_scores` flag that can be used to evaluate each
  task separately with single-task metrics.
- `MoleculeDataSet`s now have a `smiles` property to easily get SMILES strings.
- A Docker-based runner in `testing/runner` can now be used to test GPU-enabled features
and run the full CI pipeline.
- It is now possible to save `PandasDataTable`s (and thus data sets) to a CSV file
  instead of the default pickle format. This makes them easier to debug and share, but
  slower to load and save.
- Added a new `RegressionPlot` class, `WilliamsPlot`, to plot Williams plots.
- Added `ApplicabilityDomain` class to calculate applicability domain and filter
outliers from test sets.
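One common way such an applicability domain check works can be sketched with a nearest-neighbour distance criterion (an assumption for illustration; the qsprpred `ApplicabilityDomain` class may use a different definition):

```python
import numpy as np

class NearestNeighbourAD:
    """Sketch: a sample is inside the applicability domain if its distance
    to the nearest training sample is below a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.X_train = None

    def fit(self, X_train):
        self.X_train = np.asarray(X_train, dtype=float)
        return self

    def contains(self, X):
        X = np.asarray(X, dtype=float)
        # pairwise Euclidean distances to every training point
        dists = np.linalg.norm(X[:, None, :] - self.X_train[None, :, :], axis=-1)
        return dists.min(axis=1) <= self.threshold

ad = NearestNeighbourAD(threshold=1.0).fit([[0.0, 0.0], [1.0, 1.0]])
assert ad.contains([[0.5, 0.5], [5.0, 5.0]]).tolist() == [True, False]
```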

## Removed Features

- The `Metric` interface has been simplified to make it easier to implement custom
  metrics. It now only requires the implementation of the `__call__` method, which
  takes predictions and returns a `float`, and no longer requires
  `needsDiscreteToScore`, `needsProbaToScore` or `supportsTask`. As a consequence, the
  base functionality of `checkMetricCompatibility`, `isClassificationMetric`
  and `isRegressionMetric` is no longer available.
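A custom metric under the simplified interface reduces to a single callable; a minimal sketch (the exact `__call__` signature in qsprpred may differ):

```python
class Metric:
    """Sketch of the simplified interface: subclasses only implement
    __call__, which takes true values and predictions and returns a float."""
    def __call__(self, y_true, y_pred) -> float:
        raise NotImplementedError

class MeanAbsoluteError(Metric):
    def __call__(self, y_true, y_pred) -> float:
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

mae = MeanAbsoluteError()
assert abs(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]) - 2 / 3) < 1e-12
```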
- The default hyperparameter search space file is no longer available.
14 changes: 10 additions & 4 deletions qsprpred/data/tables/base.py
@@ -64,6 +64,16 @@ def removeProperty(self, name: str):
name (str): The name of the property.
"""

@abstractmethod
def transformProperties(self, names, transformers):
"""Transform property values using a transformer function.
Args:
    names (list[str]): list of column names to transform.
    transformers (Callable): functions that transform the data in the target
        columns to a new representation.
"""

@abstractmethod
def getSubset(self, prefix: str):
"""Get a subset of the dataset.
@@ -90,10 +100,6 @@ def apply(
func_kwargs (dict, optional): The keyword arguments of the function.
"""

@abstractmethod
def transform(self, targets, transformers):
pass

@abstractmethod
def filter(self, table_filters: list[Callable]):
"""Filter the dataset.
70 changes: 0 additions & 70 deletions qsprpred/data/tables/mol.py
@@ -679,37 +679,10 @@ def dropDescriptors(
self.descriptors[idx].clearFiles()
self.descriptors.pop(idx)

def imputeProperties(self, names: list[str], imputer: Callable):
"""Impute missing property values.
Args:
names (list):
List of property names to impute.
imputer (Callable):
imputer object implementing the `fit_transform`
method from scikit-learn API.
"""
assert hasattr(imputer, "fit_transform"), (
"Imputer object must implement the `fit_transform` "
"method from scikit-learn API."
)
assert all(
name in self.df.columns for name in names
), "Not all target properties in dataframe columns."
names_old = [f"{name}_before_impute" for name in names]
self.df[names_old] = self.df[names]
self.df[names] = imputer.fit_transform(self.df[names])
logger.debug(f"Imputed missing values for properties: {names}")
logger.debug(f"Old values saved in: {names_old}")
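A usage sketch of the `fit_transform` contract this method asserts on, with a minimal hand-rolled imputer (scikit-learn's `SimpleImputer` satisfies the same contract; the column name is illustrative):

```python
import pandas as pd

class MeanImputer:
    """Minimal imputer implementing the scikit-learn fit_transform contract."""
    def fit_transform(self, X):
        # fill missing values with the per-column mean
        return X.fillna(X.mean())

df = pd.DataFrame({"pIC50": [5.0, None, 7.0]})
df[["pIC50"]] = MeanImputer().fit_transform(df[["pIC50"]])
assert df["pIC50"].tolist() == [5.0, 6.0, 7.0]
```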

def dropEmptySmiles(self):
"""Drop rows with empty SMILES from the data set."""
self.df.dropna(subset=[self.smilesCol], inplace=True)

def dropEmptyProperties(self, names: list[str]):
"""Drop rows with empty target property value from the data set."""
self.df.dropna(subset=names, how="all", inplace=True)

def attachDescriptors(
self,
calculator: DescriptorSet,
@@ -874,49 +847,6 @@ def smiles(self) -> Generator[str, None, None]:
"""
return iter(self.df[self.smilesCol].values)

def getProperties(self):
"""Get names of all properties/variables saved in the data frame (all columns).
Returns:
list: list of property names.
"""
return self.df.columns.tolist()

def hasProperty(self, name):
"""Check whether a property is present in the data frame.
Args:
name (str): Name of the property.
Returns:
bool: Whether the property is present.
"""
return name in self.df.columns

def addProperty(self, name: str, data: list):
"""Add a property to the data frame.
Args:
name (str): Name of the property.
data (list): list of property values.
"""
if isinstance(data, pd.Series):
if not np.array_equal(data.index, self.df.index):
logger.info(
f"Adding property '{name}' to data set might be introducing 'nan' "
"values due to index with pandas series. Make sure the index of "
"the data frame and the series match or convert series to list."
)
self.df[name] = data
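The index-alignment pitfall this method warns about can be demonstrated directly with pandas:

```python
import pandas as pd

df = pd.DataFrame({"SMILES": ["CCO", "CCC"]}, index=[10, 11])
values = pd.Series([1.2, 3.4], index=[0, 1])  # index does not match df

df["prop"] = values            # pandas aligns on index -> all NaN here
assert df["prop"].isna().all()

df["prop"] = list(values)      # a plain list bypasses index alignment
assert df["prop"].tolist() == [1.2, 3.4]
```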

def removeProperty(self, name):
"""Remove a property from the data frame.
Args:
name (str): Name of the property to delete.
"""
del self.df[name]

def addScaffolds(
self,
scaffolds: list[Scaffold],
