Core Classes

These are the standard classes for computing statistics on datasets. NaN values are handled by discarding every sample (row) that contains at least one NaN.
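
The row-wise NaN handling described above can be illustrated with a small, dependency-free sketch. The helper name drop_nan_rows is hypothetical and is not part of batchstats; the library's internal implementation may differ.

```python
import math

def drop_nan_rows(batch):
    """Return only the rows (samples) of `batch` that contain no NaN.

    Hypothetical helper for illustration; not part of batchstats.
    """
    return [row for row in batch if not any(math.isnan(v) for v in row)]

batch = [
    [1.0, 2.0],
    [float("nan"), 4.0],  # one NaN is enough to discard the whole sample
    [5.0, 6.0],
]
valid = drop_nan_rows(batch)
print(valid)  # [[1.0, 2.0], [5.0, 6.0]]
```

Note that the valid value 4.0 is discarded together with the NaN in the same row, so the statistics are computed from two samples rather than three.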

Non-Weighted Statistics

BatchSum

class batchstats.stats.BatchSum(axis=0)

Class for calculating the sum of batches of data.

The algorithm used is a simple cumulative sum. Each time a new batch is added, the sum of the new batch is added to the existing sum.
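
As a rough illustration of the accumulation step, the following pure-Python sketch sums each batch column-wise and adds the result to a running total. The function update_sum is a hypothetical name used only here; it is not the library's code.

```python
def update_sum(total, batch):
    """Add the column-wise sum of `batch` (a list of rows) to `total`."""
    batch_sum = [sum(col) for col in zip(*batch)]
    if total is None:  # first batch: nothing accumulated yet
        return batch_sum
    return [t + b for t, b in zip(total, batch_sum)]

total = None
for batch in ([[1, 2], [3, 4]], [[5, 6], [7, 8]]):
    total = update_sum(total, batch)
print(total)  # [16, 20]
```

Only the running total is kept between batches, so memory use does not depend on how many batches have been processed.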

import numpy as np
from batchstats import BatchSum

# create some data
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[5, 6], [7, 8]])

# create a BatchSum object
bs = BatchSum()

# update with the first batch
bs.update_batch(data1)

# update with the second batch
bs.update_batch(data2)

# get the sum
total_sum = bs()

# verify the result
expected_sum = np.array([16, 20])
np.testing.assert_allclose(total_sum, expected_sum)

Example with multiple axes and data with more than two dimensions:

# create some 3d data
data1 = np.arange(24).reshape(2, 3, 4)
data2 = np.arange(24, 48).reshape(2, 3, 4)

# create a BatchSum object to sum over the last two axes
bs = BatchSum(axis=(1, 2))

# update with the first batch
bs.update_batch(data1)

# update with the second batch
bs.update_batch(data2)

# get the sum
total_sum = bs()

# verify the result
expected_sum = data1.sum(axis=(1,2)) + data2.sum(axis=(1,2))
np.testing.assert_allclose(total_sum, expected_sum)

__call__() → ndarray

Calculate the sum.

Returns: Sum of the batches.
Return type: numpy.ndarray
Raises: NoValidSamplesError – If no valid samples are available.

__init__(axis=0)

update_batch(batch, assume_valid=False)

Update the sum with a new batch of data.

Parameters:

batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.

Returns:

Updated BatchSum object.

Return type:

BatchSum

BatchMean

class batchstats.stats.BatchMean(axis=0)

Class for calculating the mean of batches of data.

The algorithm uses an incremental mean calculation. The new mean is computed from the previous mean, the new data, and the number of samples.
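
One common way to write such an incremental update, shown here for a single column, is to shift the previous mean toward the batch mean in proportion to the batch's share of all samples seen so far. This is a sketch under that assumption; update_mean is a hypothetical helper, and batchstats may organise the computation differently.

```python
def update_mean(mean, n, batch):
    """Merge a 1-D batch into a running (mean, n) pair and return the new pair."""
    m = len(batch)
    if m == 0:
        return mean, n
    batch_mean = sum(batch) / m
    total = n + m
    # Shift the old mean toward the batch mean, weighted by the batch's share.
    new_mean = mean + (batch_mean - mean) * m / total
    return new_mean, total

mean, n = 0.0, 0
for batch in ([1, 3], [5, 7]):
    mean, n = update_mean(mean, n, batch)
print(mean, n)  # 4.0 4
```

Compared with storing a running sum and dividing at the end, this form keeps intermediate values on the scale of the data, which can help when the total sum would become very large.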

import numpy as np
from batchstats import BatchMean

# create some data
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[5, 6], [7, 8]])

# create a BatchMean object
bm = BatchMean()

# update with the first batch
bm.update_batch(data1)

# update with the second batch
bm.update_batch(data2)

# get the mean
total_mean = bm()

# verify the result
expected_mean = np.array([4., 5.])
np.testing.assert_allclose(total_mean, expected_mean)

Example with multiple axes and data with more than two dimensions:

# create some 3d data
data1 = np.arange(24).reshape(2, 3, 4)
data2 = np.arange(24, 48).reshape(2, 3, 4)

# create a BatchMean object to get the mean over the last two axes
bm = BatchMean(axis=(1, 2))

# update with the first batch
bm.update_batch(data1)

# update with the second batch
bm.update_batch(data2)

# get the mean
total_mean = bm()

# verify the result
d = np.concatenate((data1, data2))
expected_mean = d.mean(axis=(1,2))
np.testing.assert_allclose(total_mean, expected_mean)

__call__() → ndarray

Calculate the mean.

Returns: Mean of the batches.
Return type: numpy.ndarray
Raises: NoValidSamplesError – If no valid samples are available.

__init__(axis=0)

update_batch(batch, assume_valid=False)

Update the mean with a new batch of data.

Parameters:

batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.

Returns:

Updated BatchMean object.

Return type:

BatchMean

BatchVar

class batchstats.stats.BatchVar(axis=0, ddof=0)

Class for calculating the variance of batches of data.

The algorithm is an implementation of Welford’s online algorithm for computing variance. It is numerically stable and avoids a two-pass approach.

Parameters: ddof (int, optional) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N is the number of elements. Default is 0.
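
For a single column, a single-pass variance update with a configurable ddof can be sketched as follows. This is a minimal, dependency-free illustration of the batch (pairwise) form of Welford's update, which merges each batch's count, mean, and sum of squared deviations into a running state. It is an assumption that batchstats uses the same recurrence internally; update_var_state and variance are hypothetical names.

```python
def update_var_state(state, batch):
    """Merge a 1-D batch into a running (n, mean, m2) state.

    m2 is the sum of squared deviations from the running mean.
    """
    n, mean, m2 = state
    m = len(batch)
    if m == 0:
        return state
    batch_mean = sum(batch) / m
    batch_m2 = sum((x - batch_mean) ** 2 for x in batch)
    delta = batch_mean - mean
    total = n + m
    new_mean = mean + delta * m / total
    # The cross term accounts for the gap between the two means.
    new_m2 = m2 + batch_m2 + delta ** 2 * n * m / total
    return total, new_mean, new_m2

def variance(state, ddof=0):
    """Divide the accumulated squared deviations by N - ddof."""
    n, _, m2 = state
    return m2 / (n - ddof)

state = (0, 0.0, 0.0)
for batch in ([1, 3], [5, 7]):
    state = update_var_state(state, batch)
print(variance(state))          # 5.0 (ddof=0, divisor N)
print(variance(state, ddof=1))  # sample variance, divisor N - 1
```

With ddof=0 the result matches the population variance of the concatenated data; ddof=1 gives the usual unbiased sample estimate.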

import numpy as np
from batchstats import BatchVar

# create some data
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[5, 6], [7, 8]])

# create a BatchVar object
bv = BatchVar()

# update with the first batch
bv.update_batch(data1)

# update with the second batch
bv.update_batch(data2)

# get the variance
total_var = bv()

# verify the result
expected_var = np.array([5., 5.])
np.testing.assert_allclose(total_var, expected_var)

Example with multiple axes and data with more than two dimensions:

# create some 3d data
data1 = np.arange(24).reshape(2, 3, 4)
data2 = np.arange(24, 48).reshape(2, 3, 4)

# create a BatchVar object to get the var over the last two axes
bv = BatchVar(axis=(1, 2))

# update with the first batch
bv.update_batch(data1)

# update with the second batch
bv.update_batch(data2)

# get the var
total_var = bv()

# verify the result
d = np.concatenate((data1, data2))
expected_var = d.var(axis=(1,2))
np.testing.assert_allclose(total_var, expected_var)

__call__() → ndarray

Calculate the variance.

Returns: Variance of the batches.
Return type: numpy.ndarray
Raises: NoValidSamplesError – If no valid samples are available.

__init__(axis=0, ddof=0)

update_batch(batch, assume_valid=False)

Update the variance with a new batch of data.

Parameters:

batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.

Returns:

Updated BatchVar object.

Return type:

BatchVar

BatchStd

class batchstats.stats.BatchStd(axis=0, ddof=0)

Class for calculating the standard deviation of batches of data.

This class uses BatchVar internally and takes the square root of the result.

Parameters: ddof (int, optional) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N is the number of elements. Default is 0.

import numpy as np
from batchstats import BatchStd

# create some data
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[5, 6], [7, 8]])

# create a BatchStd object
bs = BatchStd()

# update with the first batch
bs.update_batch(data1)

# update with the second batch
bs.update_batch(data2)

# get the standard deviation
total_std = bs()

# verify the result
expected_std = np.array([2.23606798, 2.23606798])
np.testing.assert_allclose(total_std, expected_std)

Example with multiple axes and data with more than two dimensions:

# create some 3d data
data1 = np.arange(24).reshape(2, 3, 4)
data2 = np.arange(24, 48).reshape(2, 3, 4)

# create a BatchStd object to get the std over the last two axes
bs = BatchStd(axis=(1, 2))

# update with the first batch
bs.update_batch(data1)

# update with the second batch
bs.update_batch(data2)

# get the std
total_std = bs()

# verify the result
d = np.concatenate((data1, data2))
expected_std = d.std(axis=(1,2))
np.testing.assert_allclose(total_std, expected_std)

__call__() → ndarray

Calculate the standard deviation.

Returns: Standard deviation of the batches.
Return type: numpy.ndarray
Raises: NoValidSamplesError – If no valid samples are available.

__init__(axis=0, ddof=0)

update_batch(batch, assume_valid=False)

Update the standard deviation with a new batch of data.

Parameters:

batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.

Returns:

Updated BatchStd object.

Return type:

BatchStd

BatchCov

BatchCov expects 2D inputs (samples x features).

class batchstats.stats.BatchCov(ddof=0)

Class for calculating the covariance of batches of data.

The algorithm is an implementation of an online covariance calculation. It is numerically stable and avoids a two-pass approach.

Parameters: ddof (int, optional) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N is the number of elements. Default is 0.
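
A single-pass covariance update for one pair of variables can be sketched in the same spirit as a running variance: keep the sample count, both means, and the co-moment (the sum of products of deviations), and merge each batch into that state. The sketch below assumes this pairwise-merge formulation; update_cov_state is a hypothetical name, and the library's own recurrence may differ.

```python
def update_cov_state(state, xs, ys):
    """Merge paired 1-D batches xs, ys into a running (n, mean_x, mean_y, c) state.

    c is the co-moment: the sum of (x - mean_x) * (y - mean_y) over all samples.
    """
    n, mean_x, mean_y, c = state
    m = len(xs)
    if m == 0:
        return state
    bx = sum(xs) / m
    by = sum(ys) / m
    batch_c = sum((x - bx) * (y - by) for x, y in zip(xs, ys))
    dx, dy = bx - mean_x, by - mean_y
    total = n + m
    # The cross term corrects for the difference between old and batch means.
    new_c = c + batch_c + dx * dy * n * m / total
    return total, mean_x + dx * m / total, mean_y + dy * m / total, new_c

state = (0, 0.0, 0.0, 0.0)
for xs, ys in (([1, 3], [2, 4]), ([5, 7], [6, 8])):
    state = update_cov_state(state, xs, ys)
n, _, _, c = state
print(c / n)  # 5.0 (ddof=0)
```

Dividing the co-moment by n - ddof instead of n gives the estimate for other values of ddof.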

import numpy as np
from batchstats import BatchCov

# create some data
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[5, 6], [7, 8]])

# create a BatchCov object
bc = BatchCov()

# update with the first batch
bc.update_batch(data1)

# update with the second batch
bc.update_batch(data2)

# get the covariance
total_cov = bc()

# verify the result
expected_cov = np.array([[5., 5.], [5., 5.]])
np.testing.assert_allclose(total_cov, expected_cov)

__call__() → ndarray

Calculate the covariance.

Returns: Covariance of the batches.
Return type: numpy.ndarray
Raises: NoValidSamplesError – If no valid samples are available.

__init__(ddof=0)

update_batch(batch1, batch2=None, assume_valid=False)

Update the covariance with new batches of data.

Parameters:

batch1 (numpy.ndarray) – Input batch 1.
batch2 (numpy.ndarray, optional) – Input batch 2. Default is None.
assume_valid (bool, optional) – If True, assumes all elements in the batches are valid. Default is False.

Returns:

Updated BatchCov object.

Return type:

BatchCov

BatchCorr

BatchCorr expects 2D inputs (samples x features).

class batchstats.stats.BatchCorr(ddof=0)

Class for calculating the correlation of batches of data.

The algorithm is an implementation of an online correlation calculation. It is numerically stable and avoids a two-pass approach.

Parameters: ddof (int, optional) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N is the number of elements. Default is 0.
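
A correlation matrix is a covariance matrix normalised by the standard deviations of the two variables involved. The sketch below shows that normalisation step on a plain list-of-lists matrix; corr_from_cov is a hypothetical helper, and how batchstats performs this step internally is an assumption.

```python
import math

def corr_from_cov(cov):
    """Convert a square covariance matrix (list of lists) to a correlation matrix."""
    size = len(cov)
    return [
        # Divide each covariance by the product of the two standard deviations.
        [cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(size)]
        for i in range(size)
    ]

cov = [[5.0, 5.0], [5.0, 5.0]]  # covariance of columns [1, 3, 5, 7] and [2, 4, 6, 8]
print(corr_from_cov(cov))  # [[1.0, 1.0], [1.0, 1.0]]
```

The two columns differ by a constant offset, so they are perfectly correlated and every entry equals 1.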

import numpy as np
from batchstats import BatchCorr

# create some data
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[5, 6], [7, 8]])

# create a BatchCorr object
bc = BatchCorr()

# update with the first batch
bc.update_batch(data1)

# update with the second batch
bc.update_batch(data2)

# get the correlation
total_corr = bc()

# verify the result: the two columns are perfectly linearly related
expected_corr = np.array([[1., 1.], [1., 1.]])
np.testing.assert_allclose(total_corr, expected_corr)

__call__() → ndarray

Calculate the correlation.

Returns: Correlation of the batches.
Return type: numpy.ndarray
Raises: NoValidSamplesError – If no valid samples are available.

__init__(ddof=0)

update_batch(batch1, batch2=None, assume_valid=False)

Update the correlation with new batches of data.

Parameters:

batch1 (numpy.ndarray) – Input batch 1.
batch2 (numpy.ndarray, optional) – Input batch 2. Default is None.
assume_valid (bool, optional) – If True, assumes all elements in the batches are valid. Default is False.

Returns:

Updated BatchCorr object.

Return type:

BatchCorr

BatchMin

class batchstats.stats.BatchMin(axis=0)

Class for calculating the minimum of batches of data.

The algorithm keeps track of the element-wise minimum. When a new batch is added, the element-wise minimum between the current minimum and the new batch’s minimum is computed.
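
The two-step reduction described above (reduce the batch first, then combine with the running result) can be sketched in pure Python as follows. The name update_min is hypothetical and used only for illustration.

```python
def update_min(current, batch):
    """Fold the column-wise minimum of `batch` into the running minimum."""
    batch_min = [min(col) for col in zip(*batch)]
    if current is None:  # first batch: its minimum is the running minimum
        return batch_min
    return [min(c, b) for c, b in zip(current, batch_min)]

current = None
for batch in ([[1, 8], [3, 4]], [[5, 6], [7, 2]]):
    current = update_min(current, batch)
print(current)  # [1, 2]
```

The same pattern with max in place of min yields a running maximum.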

import numpy as np
from batchstats import BatchMin

# create some data
data1 = np.array([[1, 8], [3, 4]])
data2 = np.array([[5, 6], [7, 2]])

# create a BatchMin object
bm = BatchMin()

# update with the first batch
bm.update_batch(data1)

# update with the second batch
bm.update_batch(data2)

# get the minimum
total_min = bm()

# verify the result
expected_min = np.array([1, 2])
np.testing.assert_allclose(total_min, expected_min)

Example with multiple axes and data with more than two dimensions:

# create some 3d data
data1 = np.arange(24).reshape(2, 3, 4)
data2 = np.arange(24, 48).reshape(2, 3, 4)

# create a BatchMin object to get the min over the last two axes
bm = BatchMin(axis=(1, 2))

# update with the first batch
bm.update_batch(data1)

# update with the second batch
bm.update_batch(data2)

# get the min
total_min = bm()

# verify the result
expected_min = np.minimum(data1.min(axis=(1,2)), data2.min(axis=(1,2)))
np.testing.assert_allclose(total_min, expected_min)

__call__() → ndarray

Calculate the minimum.

Returns: Minimum of the batches.
Return type: numpy.ndarray
Raises: NoValidSamplesError – If no valid samples are available.

__init__(axis=0)

update_batch(batch, assume_valid=False)

Update the minimum with a new batch of data.

Parameters:

batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.

Returns:

Updated BatchMin object.

Return type:

BatchMin

BatchMax

class batchstats.stats.BatchMax(axis=0)

Class for calculating the maximum of batches of data.

The algorithm keeps track of the element-wise maximum. When a new batch is added, the element-wise maximum between the current maximum and the new batch’s maximum is computed.

import numpy as np
from batchstats import BatchMax

# create some data
data1 = np.array([[1, 8], [3, 4]])
data2 = np.array([[5, 6], [7, 2]])

# create a BatchMax object
bm = BatchMax()

# update with the first batch
bm.update_batch(data1)

# update with the second batch
bm.update_batch(data2)

# get the maximum
total_max = bm()

# verify the result
expected_max = np.array([7, 8])
np.testing.assert_allclose(total_max, expected_max)

Example with multiple axes and data with more than two dimensions:

# create some 3d data
data1 = np.arange(24).reshape(2, 3, 4)
data2 = np.arange(24, 48).reshape(2, 3, 4)

# create a BatchMax object to get the max over the last two axes
bm = BatchMax(axis=(1, 2))

# update with the first batch
bm.update_batch(data1)

# update with the second batch
bm.update_batch(data2)

# get the max
total_max = bm()

# verify the result
expected_max = np.maximum(data1.max(axis=(1,2)), data2.max(axis=(1,2)))
np.testing.assert_allclose(total_max, expected_max)

__call__() → ndarray

Calculate the maximum.

Returns: Maximum of the batches.
Return type: numpy.ndarray
Raises: NoValidSamplesError – If no valid samples are available.

__init__(axis=0)

update_batch(batch, assume_valid=False)

Update the maximum with a new batch of data.

Parameters:

batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.

Returns:

Updated BatchMax object.

Return type:

BatchMax

BatchPeakToPeak

class batchstats.stats.BatchPeakToPeak(axis=0)

Class for calculating the peak-to-peak (max - min) of batches of data.

This class uses BatchMax and BatchMin internally to keep track of the element-wise maximum and minimum values. The peak-to-peak value is the difference between the maximum and minimum.
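
The composition described above can be sketched with a small class that tracks a running minimum and maximum per column and subtracts them on request. RunningPeakToPeak is a hypothetical, dependency-free stand-in and does not reproduce the library's classes.

```python
class RunningPeakToPeak:
    """Track per-column min and max across batches (illustrative only)."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def update(self, batch):
        cols = list(zip(*batch))
        lo = [min(c) for c in cols]
        hi = [max(c) for c in cols]
        if self.lo is None:  # first batch initialises both trackers
            self.lo, self.hi = lo, hi
        else:
            self.lo = [min(a, b) for a, b in zip(self.lo, lo)]
            self.hi = [max(a, b) for a, b in zip(self.hi, hi)]
        return self  # allow chained updates

    def result(self):
        return [h - l for h, l in zip(self.hi, self.lo)]

ptp = RunningPeakToPeak()
ptp.update([[1, 8], [3, 4]]).update([[5, 6], [7, 2]])
print(ptp.result())  # [6, 6]
```

The subtraction is deferred until the result is requested, because the peak-to-peak of the whole dataset cannot be obtained by combining per-batch peak-to-peak values.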

import numpy as np
from batchstats import BatchPeakToPeak

# create some data
data1 = np.array([[1, 8], [3, 4]])
data2 = np.array([[5, 6], [7, 2]])

# create a BatchPeakToPeak object
bpp = BatchPeakToPeak()

# update with the first batch
bpp.update_batch(data1)

# update with the second batch
bpp.update_batch(data2)

# get the peak-to-peak
total_ptp = bpp()

# verify the result
expected_ptp = np.array([6, 6])
np.testing.assert_allclose(total_ptp, expected_ptp)

Example with multiple axes and data with more than two dimensions:

# create some 3d data
data1 = np.arange(24).reshape(2, 3, 4)
data2 = np.arange(24, 48).reshape(2, 3, 4)

# create a BatchPeakToPeak object to get the ptp over the last two axes
bpp = BatchPeakToPeak(axis=(1, 2))

# update with the first batch
bpp.update_batch(data1)

# update with the second batch
bpp.update_batch(data2)

# get the ptp
total_ptp = bpp()

# verify the result
d = np.concatenate((data1, data2))
expected_ptp = d.max(axis=(1,2)) - d.min(axis=(1,2))
np.testing.assert_allclose(total_ptp, expected_ptp)

__call__() → ndarray

Calculate the peak-to-peak.

Returns: Peak-to-peak of the batches.
Return type: numpy.ndarray
Raises: NoValidSamplesError – If no valid samples are available.

__init__(axis=0)

update_batch(batch, assume_valid=False)

Update the peak-to-peak with a new batch of data.

Parameters:

batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.

Returns:

Updated BatchPeakToPeak object.

Return type:

BatchPeakToPeak

Weighted Statistics

These classes are used for computing weighted statistics on datasets.

BatchWeightedSum

class batchstats.stats.BatchWeightedSum(axis=0)

Class for calculating the weighted sum of batches of data.

The algorithm used is a simple cumulative sum. Each time a new batch is added, the weighted sum of the new batch is added to the existing sum.
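
A minimal pure-Python sketch of this accumulation, for the default case of reducing over the sample axis, multiplies each value by its weight, sums column-wise, and adds the result to a running total. update_weighted_sum is a hypothetical name, and the weights are chosen so that the arithmetic is exact in floating point.

```python
def update_weighted_sum(total, batch, weights):
    """Add the column-wise sum of weights * batch to the running total."""
    batch_sum = [
        sum(x * w for x, w in zip(x_col, w_col))
        for x_col, w_col in zip(zip(*batch), zip(*weights))
    ]
    if total is None:  # first batch: nothing accumulated yet
        return batch_sum
    return [t + b for t, b in zip(total, batch_sum)]

total = None
batches = [
    ([[1, 2], [3, 4]], [[0.5, 1.0], [2.0, 0.25]]),
    ([[5, 6]], [[1.0, 0.5]]),
]
for batch, weights in batches:
    total = update_weighted_sum(total, batch, weights)
print(total)  # [11.5, 6.0]
```

Batches may contain different numbers of samples, as in the second batch here, because only the per-column totals are carried forward.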

Warning

When axis 0 (the batch axis) is not part of the reduced axes, the per-batch results are accumulated in a list, so memory grows linearly with the number of batches instead of staying constant.

import numpy as np
from batchstats.stats.weighted_sum import BatchWeightedSum

# create some data
data1 = np.array([[1, 2], [3, 4]])
weights1 = np.array([[0.1, 0.2], [0.3, 0.4]])
data2 = np.array([[5, 6], [7, 8]])
weights2 = np.array([[0.5, 0.6], [0.7, 0.8]])

# create a BatchWeightedSum object
bws = BatchWeightedSum()

# update with the first batch
bws.update_batch(data1, weights=weights1)

# update with the second batch
bws.update_batch(data2, weights=weights2)

# get the weighted sum
total_weighted_sum = bws()

# verify the result
expected_sum = np.array([8.4, 12.0])
np.testing.assert_allclose(total_weighted_sum, expected_sum)

__call__() → ndarray: Calculate the weighted sum.

__init__(axis=0)

update_batch(batch, weights)

Update the weighted sum with a new batch of data.

Passing weights=None sums the batch as-is (weights of 1) without materializing a batch-sized product.

BatchWeightedMean

class batchstats.stats.BatchWeightedMean(axis=0)

Class for calculating the weighted mean of batches of data. It computes sum(w*x) / sum(w) by using two BatchWeightedSum instances.
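
The formula sum(w*x) / sum(w) can be evaluated across batches by keeping two running totals, one for the weighted values and one for the weights. The sketch below shows this for a single column; RunningWeightedMean is a hypothetical class, and raising ValueError for a zero total weight is an illustrative choice rather than the library's documented behaviour.

```python
class RunningWeightedMean:
    """Accumulate sum(w * x) and sum(w) separately (illustrative only)."""

    def __init__(self):
        self.wx = 0.0  # running sum of w * x
        self.w = 0.0   # running sum of w

    def update(self, xs, ws):
        self.wx += sum(w * x for x, w in zip(xs, ws))
        self.w += sum(ws)
        return self  # allow chained updates

    def result(self):
        if self.w == 0:
            raise ValueError("total weight is zero")
        return self.wx / self.w

wm = RunningWeightedMean()
wm.update([1.0, 3.0], [1.0, 3.0]).update([5.0], [4.0])
print(wm.result())  # 3.75
```

The division happens only once, when the result is requested, so the two totals can be updated independently for each batch.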

__call__() → ndarray: Calculate the weighted mean.

__init__(axis=0)

update_batch(batch, weights): Update the weighted mean with a new batch of data.