Effective samples (#285)

* changed channel_capacity() to effective_samples() to allow for kish estimate * bumped beta version number * updating doc string * updated tests for effective samples code * flak8 on tests * updating effective samples function * docstyle testing and flake8 * updating entropy calcualtion for effective samples * overloading gamma and updating tests * flake8 * pydocstyle * Update version in README.rst * Update _version.py * updating treatment of gamma="kish", changing gamma=1 treatment and bug fixing sample_compression_1d test * bug fixing in tests and propagating gamma in utils.py * Update README.rst * Update _version.py * gamma --> beta * propagating up beta parameter for functions that call smaple_sompression_1d and compress_weights * running flake8 * allow `str(float)` input for `beta`, useful as optional input to `ncompress` in other functions to be piped through to `effective_samples` * change `utils.compress_weights` to only use `ncompress` kwarg (not `beta`), but if `ncompress` is a string pass on as `ncompress=effective_samples(w, beta=ncompress)` * update `effective_samples` docstring that `beta` also takes string input * use channel capacity (aka `'entropy'`) for compression in `compress_weights` * use `ncompress='entropy'` as default * map `ncompress=True` to `ncompress='entropy'` * change `utils.sample_compression_1d` to only use `ncompress` kwarg (not `beta`) * if `ncompress` is a string pass it on as `ncompress=effective_samples(w, beta=ncompress)` * map `ncompress=True` to `ncompress='entropy'` * update docstring accordingly * fix typos in `effective_samples` docstring * adapt changes to compression in `utils.py` also in `weighted_pandas.py` * use only `ncompress` (not `beta`) in `WeightedDataFrame.compress` and `WeightedSeries.compress`, but allow string input for `ncompress` which gets piped into `beta` * update docstrings accordingly * remove no longer used `beta` kwarg in `sample_compression_1d` in `kde_plot_1d` * update docstring for `ncompress` parameter in `plot.py` for new optional string input * allow string input for `n` in `triangular_sample_compression_2d` to determine `n` through `effective_samples` * add check to `test_triangular_sample_compression_2d` for string input to `n` * change default back to `ncompress=True` * expand `test_effective_samples` to also check for `str(int)` and `str(float)` input * rewrite string option checks for `beta==1` in `effective_samples` to also work for `'1.00'` * add tests to `test_effective_samples` for `beta='1.'` and `beta='1.00'` * changed effective_samples -> neff * Completed coverage * reverting change to test_sample_compression_1d * properly reverting change to test_sample_compression_1d * clarify error message for `ncompress<0` and make more in line with similar error messages about anesthetic 1.0 behaviour * unify and clean up `neff` docstring * further simplify neff docstring by reducing repetition * change defaults for `ncompress` to `'equal'` `'equal'` corresponds to a less arbitrary value than 1000 and a smaller value than for `'entropy'` ensuring speedy explorative plotting. * adapt `test_weighted_pandas` to new `ncompress` defaults for scatter plots * version bump * add test for `ncompress>len(w)` in `test_triangular_sample_compression_2d` to complete coverage --------- Co-authored-by: Lukas Hergt <[email protected]> Co-authored-by: Lukas Hergt <[email protected]> Co-authored-by: Will Handley <[email protected]> Co-authored-by: AdamOrmondroyd <[email protected]>
handley-lab · Jun 14, 2023 · 8a93e0b · 8a93e0b
1 parent 7a64948
commit 8a93e0b
Show file tree

Hide file tree

Showing 9 changed files with 187 additions and 56 deletions.
diff --git a/README.rst b/README.rst
@@ -2,7 +2,7 @@
 anesthetic: nested sampling post-processing
 ===========================================
 :Authors: Will Handley and Lukas Hergt
-:Version: 2.0.0-beta.33
+:Version: 2.0.0-beta.34
 :Homepage: https://github.com/handley-lab/anesthetic
 :Documentation: http://anesthetic.readthedocs.io/
 

diff --git a/anesthetic/_version.py b/anesthetic/_version.py
@@ -1 +1 @@
-__version__ = '2.0.0b33'
+__version__ = '2.0.0b34'
diff --git a/anesthetic/plot.py b/anesthetic/plot.py
@@ -808,11 +808,16 @@ def kde_plot_1d(ax, data, *args, **kwargs):
     weights : np.array, optional
         Sample weights.
 
-    ncompress : int, default=False
+    ncompress : int, str, default=False
         Degree of compression.
-        If int: number of samples returned.
-        If True: compresses to the channel capacity.
-        If False: no compression.
+
+        * If ``False``: no compression.
+        * If ``True``: compresses to the channel capacity, equivalent to
+          ``ncompress='entropy'``.
+        * If ``int``: desired number of samples after compression.
+        * If ``str``: determine number from the Huggins-Roy family of
+          effective samples in :func:`anesthetic.utils.neff`
+          with ``beta=ncompress``.
 
     nplot_1d : int, default=100
         Number of plotting points to use.
@@ -837,6 +842,9 @@ def kde_plot_1d(ax, data, *args, **kwargs):
     bw_method : str, scalar or callable, optional
         Forwarded to :class:`scipy.stats.gaussian_kde`.
 
+    beta : int, float, default = 1
+        The value of beta used to calculate the number of effective samples
+
     Returns
     -------
     lines : :class:`matplotlib.lines.Line2D`
@@ -1103,11 +1111,16 @@ def kde_contour_plot_2d(ax, data_x, data_y, *args, **kwargs):
         Has to be ordered from outermost to innermost contour.
         Default: [0.95, 0.68]
 
-    ncompress : int, default=1000
+    ncompress : int, str, default='equal'
         Degree of compression.
-        If int: number of samples returned.
-        If True: compresses to the channel capacity.
-        If False: no compression.
+
+        * If ``int``: desired number of samples after compression.
+        * If ``False``: no compression.
+        * If ``True``: compresses to the channel capacity, equivalent to
+          ``ncompress='entropy'``.
+        * If ``str``: determine number from the Huggins-Roy family of
+          effective samples in :func:`anesthetic.utils.neff`
+          with ``beta=ncompress``.
 
     nplot_2d : int, default=1000
         Number of plotting points to use.
@@ -1133,7 +1146,7 @@ def kde_contour_plot_2d(ax, data_x, data_y, *args, **kwargs):
         data_y = data_y[weights != 0]
         weights = weights[weights != 0]
 
-    ncompress = kwargs.pop('ncompress', 1000)
+    ncompress = kwargs.pop('ncompress', 'equal')
     nplot = kwargs.pop('nplot_2d', 1000)
     bw_method = kwargs.pop('bw_method', None)
     label = kwargs.pop('label', None)
@@ -1294,6 +1307,17 @@ def scatter_plot_2d(ax, data_x, data_y, *args, **kwargs):
     data_x, data_y : np.array
         x and y coordinates of uniformly weighted samples to plot.
 
+    ncompress : int, str, default='equal'
+        Degree of compression.
+
+        * If ``int``: desired number of samples after compression.
+        * If ``False``: no compression.
+        * If ``True``: compresses to the channel capacity, equivalent to
+          ``ncompress='entropy'``.
+        * If ``str``: determine number from the Huggins-Roy family of
+          effective samples in :func:`anesthetic.utils.neff`
+          with ``beta=ncompress``.
+
     Returns
     -------
     lines : :class:`matplotlib.lines.Line2D`

diff --git a/anesthetic/plotting/_matplotlib/core.py b/anesthetic/plotting/_matplotlib/core.py
@@ -61,7 +61,7 @@ def _get_xticks(self, convert_period: bool = False):
 
 def _compress_weights(kwargs, data):
     if isinstance(data, _WeightedObject):
-        return data.compress(kwargs.pop('ncompress', True))
+        return data.compress(kwargs.pop('ncompress', 'equal'))
     else:
         return data
 

diff --git a/anesthetic/samples.py b/anesthetic/samples.py
@@ -1100,7 +1100,7 @@ def live_points(self, logL=None):
 
     def posterior_points(self, beta=1):
         """Get equally weighted posterior points at temperature beta."""
-        return self.set_beta(beta).compress(-1)
+        return self.set_beta(beta).compress('equal')
 
     def prior_points(self, params=None):
         """Get equally weighted prior points."""

diff --git a/anesthetic/utils.py b/anesthetic/utils.py
@@ -3,7 +3,7 @@
 import pandas
 from scipy import special
 from scipy.interpolate import interp1d
-from scipy.stats import kstwobign
+from scipy.stats import kstwobign, entropy
 from matplotlib.tri import Triangulation
 import contextlib
 import inspect
@@ -32,21 +32,77 @@ def logsumexp(a, axis=None, b=None, keepdims=False, return_sign=False):
                              return_sign=return_sign)
 
 
-def channel_capacity(w):
-    r"""Channel capacity (effective sample size).
+def neff(w, beta=1):
+    r"""Calculate effective number of samples.
 
-    .. math::
+    Using the Huggins-Roy family of effective samples
+    (https://aakinshin.net/posts/huggins-roy-ess/).
+
+    Parameters
+    ----------
+    beta : int, float, str, default = 1
+        The value of beta used to calculate the number of effective samples
+        according to
+
+        .. math::
+
+            N_{eff} &= \bigg(\sum_{i=0}^n w_i^\beta \bigg)^{\frac{1}{1-\beta}}
+
+            w_i &= \frac{w_i}{\sum_j w_j}
+
+        Beta can take any positive value. Larger beta corresponds to a greater
+        compression such that:
+
+        .. math::
+
+            \beta_1 < \beta_2 \Rightarrow N_{eff}(\beta_1) > N_{eff}(\beta_2)
+
+        Alternatively, beta can take one of the following strings as input:
+
+        * If 'inf' or 'equal' is supplied (equivalent to beta=inf), then the
+          resulting number of samples is the number of samples when compressed
+          to equal weights, and given by:
+
+        .. math::
+
+            w_i &= \frac{w_i}{\sum_j w_j}
 
-        H = \sum_i p_i \ln p_i
+            N_{eff} &= \frac{1}{\max_i[w_i]}
 
-        p_i = \frac{w_i}{\sum_j w_j}
+        * If 'entropy' is supplied (equivalent to beta=1), then the estimate
+          is determined via the entropy based calculation, also referred to as
+          the channel capacity:
+
+        .. math::
+
+            H &= -\sum_i p_i \ln p_i
+
+            p_i &= \frac{w_i}{\sum_j w_j}
+
+            N_{eff} &= e^{H}
+
+        * If 'kish' is supplied (equivalent to beta=2), then a Kish estimate
+          is computed (Kish, Leslie (1965). Survey Sampling.
+          New York: John Wiley & Sons, Inc. ISBN 0-471-10949-5):
+
+        .. math::
+
+            N_{eff} = \frac{(\sum_i w_i)^2}{\sum_i w_i^2}
+
+        * str(float) input gets converted to the corresponding float value.
 
-        N = e^{-H}
     """
-    with np.errstate(divide='ignore', invalid='ignore'):
-        W = np.array(w)/sum(w)
-        H = np.nansum(np.log(W)*W)
-        return np.exp(-H)
+    w = w / np.sum(w)
+    if beta == np.inf or beta == 'inf' or beta == 'equal':
+        return 1 / np.max(w)
+    elif beta == 'entropy' or beta != 'kish' and str(float(beta)) == '1.0':
+        return np.exp(entropy(w))
+    else:
+        if beta == 'kish':
+            beta = 2
+        elif isinstance(beta, str):
+            beta = float(beta)
+        return np.sum(w**beta)**(1/(1-beta))
 
 
 def compress_weights(w, u=None, ncompress=True):
@@ -57,15 +113,21 @@ def compress_weights(w, u=None, ncompress=True):
     if w is None:
         w = np.ones_like(u)
 
-    if ncompress is True:
-        ncompress = channel_capacity(w)
+    if ncompress is True or isinstance(ncompress, str):
+        if ncompress is True:
+            ncompress = 'entropy'
+        ncompress = neff(w, beta=ncompress)
     elif ncompress is False:
         return w
 
-    if ncompress <= 0:
-        W = w/w.max()
-    else:
-        W = w * ncompress / w.sum()
+    # TODO: remove this in version >= 2.1
+    if ncompress < 0:
+        raise ValueError(
+            "ncompress<0 is anesthetic 1.0 behaviour. For equally weighted "
+            "samples you should now use ncompress='inf' or ncompress='equal'."
+            )
+
+    W = w * ncompress / w.sum()
 
     fraction, integer = np.modf(W)
     extra = (u < fraction).astype(int)
@@ -328,6 +390,9 @@ def triangular_sample_compression_2d(x, y, cov, w=None, n=1000):
     if w is None:
         w = pandas.Series(index=x.index, data=np.ones_like(x))
 
+    if isinstance(n, str):
+        n = int(neff(w, beta=n))
+
     # Select samples for triangulation
     if (w != 0).sum() < n:
         i = x.index
@@ -366,9 +431,12 @@ def sample_compression_1d(x, w=None, ncompress=True):
     ncompress : int, default=True
         Degree of compression.
 
-        * If int: number of samples returned.
-        * If True: compresses to the channel capacity.
-        * If False: no compression.
+        * If ``int``: number of samples returned.
+        * If ``True``: compresses to the channel capacity
+          (same as ``ncompress='entropy'``).
+        * If ``False``: no compression.
+        * If ``str``: determine number from the Huggins-Roy family of effective
+          samples in :func:`neff` with ``beta=ncompress``.
 
     Returns
     -------
@@ -377,16 +445,18 @@ def sample_compression_1d(x, w=None, ncompress=True):
     """
     if ncompress is False:
         return x, w
-    elif ncompress is True:
-        ncompress = channel_capacity(w)
+    elif ncompress is True or isinstance(ncompress, str):
+        if ncompress is True:
+            ncompress = 'entropy'
+        ncompress = neff(w, beta=ncompress)
     x = np.array(x)
     if w is None:
         w = np.ones_like(x)
     w = np.array(w)
 
     # Select inner samples for triangulation
     if len(x) > ncompress:
-        x_ = np.random.choice(x, size=ncompress, replace=False)
+        x_ = np.random.choice(x, size=int(ncompress), replace=False)
     else:
         x_ = x.copy()
     x_.sort()

diff --git a/anesthetic/weighted_pandas.py b/anesthetic/weighted_pandas.py
@@ -10,7 +10,7 @@
 from pandas.util._exceptions import find_stack_level
 from pandas.util import hash_pandas_object
 from numpy.ma import masked_array
-from anesthetic.utils import (compress_weights, channel_capacity, quantile,
+from anesthetic.utils import (compress_weights, neff as neff_, quantile,
                               temporary_seed, adjust_docstrings)
 from pandas.core.dtypes.missing import notna
 
@@ -225,7 +225,7 @@ def reset_index(self, level=None, drop=False, inplace=False,
     def neff(self, axis=0):
         """Effective number of samples."""
         if self.isweighted(axis):
-            return channel_capacity(self.get_weights(axis))
+            return neff_(self.get_weights(axis))
         else:
             return self.shape[axis]
 
@@ -326,10 +326,17 @@ def compress(self, ncompress=True):
 
         Parameters
         ----------
-        ncompress : int, optional
-            effective number of samples after compression. If not supplied
-            (or True), then reduce to the channel capacity (theoretical optimum
-            compression). If <=0, then compress so that all weights are unity.
+        ncompress : int, str, default=True
+            Degree of compression.
+
+            * If ``True`` (default): reduce to the channel capacity
+              (theoretical optimum compression), equivalent to
+              ``ncompress='entropy'``.
+            * If ``> 0``: desired number of samples after compression.
+            * If ``<= 0``: compress so that all remaining weights are unity.
+            * If ``str``: determine number from the Huggins-Roy family of
+              effective samples in :func:`anesthetic.utils.neff`
+              with ``beta=ncompress``.
 
         """
         i = compress_weights(self.get_weights(), self._rand(), ncompress)
@@ -559,10 +566,17 @@ def compress(self, ncompress=True, axis=0):
 
         Parameters
         ----------
-        ncompress : int, optional
-            effective number of samples after compression. If not supplied
-            (or True), then reduce to the channel capacity (theoretical optimum
-            compression). If <=0, then compress so that all weights are unity.
+        ncompress : int, str, default=True
+            Degree of compression.
+
+            * If ``True`` (default): reduce to the channel capacity
+              (theoretical optimum compression), equivalent to
+              ``ncompress='entropy'``.
+            * If ``> 0``: desired number of samples after compression.
+            * If ``<= 0``: compress so that all remaining weights are unity.
+            * If ``str``: determine number from the Huggins-Roy family of
+              effective samples in :func:`anesthetic.utils.neff`
+              with ``beta=ncompress``.
 
         """
         if self.isweighted(axis):