Core Classes
These are the standard classes for computing statistics on datasets. NaN values are handled by discarding every sample (row) that contains at least one NaN.
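The row-wise NaN handling described above can be pictured with plain NumPy. This is only a sketch for a 2D batch (samples x features); `drop_nan_rows` is a hypothetical helper written for illustration and is not part of batchstats.

```python
import numpy as np

def drop_nan_rows(batch):
    """Return only the rows (samples) of a 2D batch that contain no NaN.

    Hypothetical helper for illustration; not the batchstats implementation.
    """
    batch = np.asarray(batch, dtype=float)
    valid = ~np.isnan(batch).any(axis=1)  # True for rows without any NaN
    return batch[valid]

batch = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [5.0, 6.0]])
clean = drop_nan_rows(batch)
# The middle row is removed as a whole, although 4.0 itself is a valid value.
```

A consequence worth noting is that valid values sharing a row with a NaN do not contribute to the statistic.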
Non-Weighted Statistics
BatchSum
- class batchstats.stats.BatchSum(axis=0)
Class for calculating the sum of batches of data.
The algorithm used is a simple cumulative sum. Each time a new batch is added, the sum of the new batch is added to the existing sum.
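The cumulative-sum idea can be sketched in a few lines of NumPy. `RunningSum` is a hypothetical stand-in written for this illustration; it omits NaN handling and error checks and should not be read as the batchstats source.

```python
import numpy as np

class RunningSum:
    """Hypothetical stand-in that accumulates per-batch sums."""

    def __init__(self, axis=0):
        self.axis = axis
        self.total = None  # no batch seen yet

    def update_batch(self, batch):
        batch_sum = np.asarray(batch).sum(axis=self.axis)
        # the first batch initialises the total, later batches are added to it
        self.total = batch_sum if self.total is None else self.total + batch_sum
        return self  # returning self allows chained updates

    def __call__(self):
        return self.total

rs = RunningSum()
rs.update_batch(np.array([[1, 2], [3, 4]])).update_batch(np.array([[5, 6], [7, 8]]))
running_total = rs()
```

Only the running total is stored, so memory use does not grow with the number of batches.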
    import numpy as np
    from batchstats import BatchSum

    # create some data
    data1 = np.array([[1, 2], [3, 4]])
    data2 = np.array([[5, 6], [7, 8]])

    # create a BatchSum object
    bs = BatchSum()

    # update with the first batch
    bs.update_batch(data1)

    # update with the second batch
    bs.update_batch(data2)

    # get the sum
    total_sum = bs()

    # verify the result
    expected_sum = np.array([16, 20])
    np.testing.assert_allclose(total_sum, expected_sum)

Example with multiple axes and data > 2 dimensions

    import numpy as np
    from batchstats import BatchSum

    # create some 3d data
    data1 = np.arange(24).reshape(2, 3, 4)
    data2 = np.arange(24, 48).reshape(2, 3, 4)

    # create a BatchSum object to sum over the last two axes
    bs = BatchSum(axis=(1, 2))

    # update with the first batch
    bs.update_batch(data1)

    # update with the second batch
    bs.update_batch(data2)

    # get the sum
    total_sum = bs()

    # verify the result
    expected_sum = data1.sum(axis=(1, 2)) + data2.sum(axis=(1, 2))
    np.testing.assert_allclose(total_sum, expected_sum)

- __call__() -> ndarray
Calculate the sum.
- Returns:
Sum of the batches.
- Return type:
numpy.ndarray
- Raises:
NoValidSamplesError – If no valid samples are available.
- __init__(axis=0)
- update_batch(batch, assume_valid=False)
Update the sum with a new batch of data.
- Parameters:
batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.
- Returns:
Updated BatchSum object.
- Return type:
BatchSum
BatchMean
- class batchstats.stats.BatchMean(axis=0)
Class for calculating the mean of batches of data.
The algorithm uses an incremental mean calculation. The new mean is computed from the previous mean, the new data, and the number of samples.
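One common way to write such an incremental update is shown below. It is a sketch along axis 0 only; `update_mean` is a hypothetical function and the exact formula used inside batchstats is an assumption here.

```python
import numpy as np

def update_mean(mean, n, batch):
    """Combine a running mean over n samples with a new batch (hypothetical helper)."""
    batch = np.asarray(batch, dtype=float)
    m = batch.shape[0]                # number of samples in the new batch
    batch_mean = batch.mean(axis=0)
    if n == 0:                        # first batch: nothing to combine with
        return batch_mean, m
    total = n + m
    # shift the old mean towards the batch mean, weighted by the batch's share
    return mean + (batch_mean - mean) * (m / total), total

mean, n = None, 0
for batch in (np.array([[1, 2], [3, 4]]), np.array([[5, 6], [7, 8]])):
    mean, n = update_mean(mean, n, batch)
```

Updating the mean by a weighted difference avoids keeping a potentially very large running sum.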
    import numpy as np
    from batchstats import BatchMean

    # create some data
    data1 = np.array([[1, 2], [3, 4]])
    data2 = np.array([[5, 6], [7, 8]])

    # create a BatchMean object
    bm = BatchMean()

    # update with the first batch
    bm.update_batch(data1)

    # update with the second batch
    bm.update_batch(data2)

    # get the mean
    total_mean = bm()

    # verify the result
    expected_mean = np.array([4., 5.])
    np.testing.assert_allclose(total_mean, expected_mean)

Example with multiple axes and data > 2 dimensions

    import numpy as np
    from batchstats import BatchMean

    # create some 3d data
    data1 = np.arange(24).reshape(2, 3, 4)
    data2 = np.arange(24, 48).reshape(2, 3, 4)

    # create a BatchMean object to get the mean over the last two axes
    bm = BatchMean(axis=(1, 2))

    # update with the first batch
    bm.update_batch(data1)

    # update with the second batch
    bm.update_batch(data2)

    # get the mean
    total_mean = bm()

    # verify the result
    d = np.concatenate((data1, data2))
    expected_mean = d.mean(axis=(1, 2))
    np.testing.assert_allclose(total_mean, expected_mean)

- __call__() -> ndarray
Calculate the mean.
- Returns:
Mean of the batches.
- Return type:
numpy.ndarray
- Raises:
NoValidSamplesError – If no valid samples are available.
- __init__(axis=0)
- update_batch(batch, assume_valid=False)
Update the mean with a new batch of data.
- Parameters:
batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.
- Returns:
Updated BatchMean object.
- Return type:
BatchMean
BatchVar
- class batchstats.stats.BatchVar(axis=0, ddof=0)
Class for calculating the variance of batches of data.
The algorithm is an implementation of Welford’s online algorithm for computing variance. It is numerically stable and avoids a two-pass approach.
- Parameters:
ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of elements. Default is 0.
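A batched form of a Welford-style update can be sketched as follows. It keeps the sample count, the mean and the sum of squared deviations (`m2`), and merges each batch with the pairwise formula usually attributed to Chan et al. `update_moments` is a hypothetical name, and whether batchstats uses exactly this merge is an assumption.

```python
import numpy as np

def update_moments(n, mean, m2, batch):
    """Merge a batch into running (count, mean, sum of squared deviations)."""
    batch = np.asarray(batch, dtype=float)
    m = batch.shape[0]
    batch_mean = batch.mean(axis=0)
    batch_m2 = ((batch - batch_mean) ** 2).sum(axis=0)
    if n == 0:                        # first batch initialises the state
        return m, batch_mean, batch_m2
    total = n + m
    delta = batch_mean - mean
    new_mean = mean + delta * (m / total)
    # the cross term corrects for the two groups having different means
    new_m2 = m2 + batch_m2 + delta ** 2 * (n * m / total)
    return total, new_mean, new_m2

n, mean, m2 = 0, None, None
for batch in (np.array([[1, 2], [3, 4]]), np.array([[5, 6], [7, 8]])):
    n, mean, m2 = update_moments(n, mean, m2, batch)

ddof = 0
variance = m2 / (n - ddof)            # divisor is N - ddof, as described above
```

Because `ddof` only enters in the final division, the same accumulated state can yield either the population variance (`ddof=0`) or the sample variance (`ddof=1`).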
    import numpy as np
    from batchstats import BatchVar

    # create some data
    data1 = np.array([[1, 2], [3, 4]])
    data2 = np.array([[5, 6], [7, 8]])

    # create a BatchVar object
    bv = BatchVar()

    # update with the first batch
    bv.update_batch(data1)

    # update with the second batch
    bv.update_batch(data2)

    # get the variance
    total_var = bv()

    # verify the result
    expected_var = np.array([5., 5.])
    np.testing.assert_allclose(total_var, expected_var)

Example with multiple axes and data > 2 dimensions

    import numpy as np
    from batchstats import BatchVar

    # create some 3d data
    data1 = np.arange(24).reshape(2, 3, 4)
    data2 = np.arange(24, 48).reshape(2, 3, 4)

    # create a BatchVar object to get the var over the last two axes
    bv = BatchVar(axis=(1, 2))

    # update with the first batch
    bv.update_batch(data1)

    # update with the second batch
    bv.update_batch(data2)

    # get the var
    total_var = bv()

    # verify the result
    d = np.concatenate((data1, data2))
    expected_var = d.var(axis=(1, 2))
    np.testing.assert_allclose(total_var, expected_var)

- __call__() -> ndarray
Calculate the variance.
- Returns:
Variance of the batches.
- Return type:
numpy.ndarray
- Raises:
NoValidSamplesError – If no valid samples are available.
- __init__(axis=0, ddof=0)
- update_batch(batch, assume_valid=False)
Update the variance with a new batch of data.
- Parameters:
batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.
- Returns:
Updated BatchVar object.
- Return type:
BatchVar
BatchStd
- class batchstats.stats.BatchStd(axis=0, ddof=0)
Class for calculating the standard deviation of batches of data.
This class uses BatchVar internally and takes the square root of the result.
- Parameters:
ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of elements. Default is 0.
    import numpy as np
    from batchstats import BatchStd

    # create some data
    data1 = np.array([[1, 2], [3, 4]])
    data2 = np.array([[5, 6], [7, 8]])

    # create a BatchStd object
    bs = BatchStd()

    # update with the first batch
    bs.update_batch(data1)

    # update with the second batch
    bs.update_batch(data2)

    # get the standard deviation
    total_std = bs()

    # verify the result
    expected_std = np.array([2.23606798, 2.23606798])
    np.testing.assert_allclose(total_std, expected_std)

Example with multiple axes and data > 2 dimensions

    import numpy as np
    from batchstats import BatchStd

    # create some 3d data
    data1 = np.arange(24).reshape(2, 3, 4)
    data2 = np.arange(24, 48).reshape(2, 3, 4)

    # create a BatchStd object to get the std over the last two axes
    bs = BatchStd(axis=(1, 2))

    # update with the first batch
    bs.update_batch(data1)

    # update with the second batch
    bs.update_batch(data2)

    # get the std
    total_std = bs()

    # verify the result
    d = np.concatenate((data1, data2))
    expected_std = d.std(axis=(1, 2))
    np.testing.assert_allclose(total_std, expected_std)

- __call__() -> ndarray
Calculate the standard deviation.
- Returns:
Standard deviation of the batches.
- Return type:
numpy.ndarray
- Raises:
NoValidSamplesError – If no valid samples are available.
- __init__(axis=0, ddof=0)
- update_batch(batch, assume_valid=False)
Update the standard deviation with a new batch of data.
- Parameters:
batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.
- Returns:
Updated BatchStd object.
- Return type:
BatchStd
BatchCov
BatchCov expects 2D inputs (samples x features).
- class batchstats.stats.BatchCov(ddof=0)
Class for calculating the covariance of batches of data.
The algorithm is an implementation of an online covariance calculation. It is numerically stable and avoids a two-pass approach.
- Parameters:
ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of elements. Default is 0.
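An online covariance update can be sketched by carrying a running co-moment matrix (the sum of outer products of centered samples) and merging batches the same way as for the variance. `update_cov` is a hypothetical helper for a single 2D input; the actual batchstats internals, including the two-input `batch2` case, are not reproduced here.

```python
import numpy as np

def update_cov(n, mean, comoment, batch):
    """Merge a 2D batch (samples x features) into running covariance state."""
    batch = np.asarray(batch, dtype=float)
    m = batch.shape[0]
    batch_mean = batch.mean(axis=0)
    centered = batch - batch_mean
    batch_comoment = centered.T @ centered   # features x features
    if n == 0:                               # first batch initialises the state
        return m, batch_mean, batch_comoment
    total = n + m
    delta = batch_mean - mean
    new_mean = mean + delta * (m / total)
    # the outer-product term accounts for the shift between the two means
    new_comoment = comoment + batch_comoment + np.outer(delta, delta) * (n * m / total)
    return total, new_mean, new_comoment

n, mean, comoment = 0, None, None
for batch in (np.array([[1, 2], [3, 4]]), np.array([[5, 6], [7, 8]])):
    n, mean, comoment = update_cov(n, mean, comoment, batch)

ddof = 0
cov = comoment / (n - ddof)
```

The state is a features x features matrix plus a mean vector, independent of how many samples have been seen.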
    import numpy as np
    from batchstats import BatchCov

    # create some data
    data1 = np.array([[1, 2], [3, 4]])
    data2 = np.array([[5, 6], [7, 8]])

    # create a BatchCov object
    bc = BatchCov()

    # update with the first batch
    bc.update_batch(data1)

    # update with the second batch
    bc.update_batch(data2)

    # get the covariance
    total_cov = bc()

    # verify the result
    expected_cov = np.array([[5., 5.], [5., 5.]])
    np.testing.assert_allclose(total_cov, expected_cov)

- __call__() -> ndarray
Calculate the covariance.
- Returns:
Covariance of the batches.
- Return type:
numpy.ndarray
- Raises:
NoValidSamplesError – If no valid samples are available.
- __init__(ddof=0)
- update_batch(batch1, batch2=None, assume_valid=False)
Update the covariance with new batches of data.
- Parameters:
batch1 (numpy.ndarray) – Input batch 1.
batch2 (numpy.ndarray, optional) – Input batch 2. Default is None.
assume_valid (bool, optional) – If True, assumes all elements in the batches are valid. Default is False.
- Returns:
Updated BatchCov object.
- Return type:
BatchCov
BatchCorr
BatchCorr expects 2D inputs (samples x features).
- class batchstats.stats.BatchCorr(ddof=0)
Class for calculating the correlation of batches of data.
The algorithm is an implementation of an online correlation calculation. It is numerically stable and avoids a two-pass approach.
- Parameters:
ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of elements. Default is 0.
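A correlation matrix is conventionally obtained by normalising a covariance matrix with the standard deviations on its diagonal. The sketch below shows only that normalisation step; `cov_to_corr` is a hypothetical helper and the input matrix is made-up illustration data, not batchstats output.

```python
import numpy as np

def cov_to_corr(cov):
    """Normalise a covariance matrix into a correlation matrix."""
    cov = np.asarray(cov, dtype=float)
    std = np.sqrt(np.diag(cov))        # per-feature standard deviations
    return cov / np.outer(std, std)    # divide entry (i, j) by std_i * std_j

cov = np.array([[4.0, 2.0],
                [2.0, 9.0]])           # illustrative values only
corr = cov_to_corr(cov)
```

In this formulation the divisor `N - ddof` appears in both the numerator and the denominator, so it cancels and the result does not depend on `ddof`.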
    import numpy as np
    from batchstats import BatchCorr

    # create some data
    data1 = np.array([[1, 2], [3, 4]])
    data2 = np.array([[5, 6], [7, 8]])

    # create a BatchCorr object
    bc = BatchCorr()

    # update with the first batch
    bc.update_batch(data1)

    # update with the second batch
    bc.update_batch(data2)

    # get the correlation
    total_corr = bc()

- __call__() -> ndarray
Calculate the correlation.
- Returns:
Correlation of the batches.
- Return type:
numpy.ndarray
- Raises:
NoValidSamplesError – If no valid samples are available.
- __init__(ddof=0)
- update_batch(batch1, batch2=None, assume_valid=False)
Update the correlation with new batches of data.
- Parameters:
batch1 (numpy.ndarray) – Input batch 1.
batch2 (numpy.ndarray, optional) – Input batch 2. Default is None.
assume_valid (bool, optional) – If True, assumes all elements in the batches are valid. Default is False.
- Returns:
Updated BatchCorr object.
- Return type:
BatchCorr
BatchMin
- class batchstats.stats.BatchMin(axis=0)
Class for calculating the minimum of batches of data.
The algorithm keeps track of the element-wise minimum. When a new batch is added, the element-wise minimum between the current minimum and the new batch’s minimum is computed.
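The element-wise minimum tracking described above amounts to one `np.minimum` call per batch. The loop below is a plain NumPy sketch along axis 0, without the NaN handling that batchstats adds.

```python
import numpy as np

current_min = None
for batch in (np.array([[1, 8], [3, 4]]), np.array([[5, 6], [7, 2]])):
    batch_min = batch.min(axis=0)          # minimum within this batch
    if current_min is None:                # first batch initialises the state
        current_min = batch_min
    else:                                  # keep the smaller value per element
        current_min = np.minimum(current_min, batch_min)
```

The maximum works the same way with `max` and `np.maximum`.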
    import numpy as np
    from batchstats import BatchMin

    # create some data
    data1 = np.array([[1, 8], [3, 4]])
    data2 = np.array([[5, 6], [7, 2]])

    # create a BatchMin object
    bm = BatchMin()

    # update with the first batch
    bm.update_batch(data1)

    # update with the second batch
    bm.update_batch(data2)

    # get the minimum
    total_min = bm()

    # verify the result
    expected_min = np.array([1, 2])
    np.testing.assert_allclose(total_min, expected_min)

Example with multiple axes and data > 2 dimensions

    import numpy as np
    from batchstats import BatchMin

    # create some 3d data
    data1 = np.arange(24).reshape(2, 3, 4)
    data2 = np.arange(24, 48).reshape(2, 3, 4)

    # create a BatchMin object to get the min over the last two axes
    bm = BatchMin(axis=(1, 2))

    # update with the first batch
    bm.update_batch(data1)

    # update with the second batch
    bm.update_batch(data2)

    # get the min
    total_min = bm()

    # verify the result
    expected_min = np.minimum(data1.min(axis=(1, 2)), data2.min(axis=(1, 2)))
    np.testing.assert_allclose(total_min, expected_min)

- __call__() -> ndarray
Calculate the minimum.
- Returns:
Minimum of the batches.
- Return type:
numpy.ndarray
- Raises:
NoValidSamplesError – If no valid samples are available.
- __init__(axis=0)
- update_batch(batch, assume_valid=False)
Update the minimum with a new batch of data.
- Parameters:
batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.
- Returns:
Updated BatchMin object.
- Return type:
BatchMin
BatchMax
- class batchstats.stats.BatchMax(axis=0)
Class for calculating the maximum of batches of data.
The algorithm keeps track of the element-wise maximum. When a new batch is added, the element-wise maximum between the current maximum and the new batch’s maximum is computed.
    import numpy as np
    from batchstats import BatchMax

    # create some data
    data1 = np.array([[1, 8], [3, 4]])
    data2 = np.array([[5, 6], [7, 2]])

    # create a BatchMax object
    bm = BatchMax()

    # update with the first batch
    bm.update_batch(data1)

    # update with the second batch
    bm.update_batch(data2)

    # get the maximum
    total_max = bm()

    # verify the result
    expected_max = np.array([7, 8])
    np.testing.assert_allclose(total_max, expected_max)

Example with multiple axes and data > 2 dimensions

    import numpy as np
    from batchstats import BatchMax

    # create some 3d data
    data1 = np.arange(24).reshape(2, 3, 4)
    data2 = np.arange(24, 48).reshape(2, 3, 4)

    # create a BatchMax object to get the max over the last two axes
    bm = BatchMax(axis=(1, 2))

    # update with the first batch
    bm.update_batch(data1)

    # update with the second batch
    bm.update_batch(data2)

    # get the max
    total_max = bm()

    # verify the result
    expected_max = np.maximum(data1.max(axis=(1, 2)), data2.max(axis=(1, 2)))
    np.testing.assert_allclose(total_max, expected_max)

- __call__() -> ndarray
Calculate the maximum.
- Returns:
Maximum of the batches.
- Return type:
numpy.ndarray
- Raises:
NoValidSamplesError – If no valid samples are available.
- __init__(axis=0)
- update_batch(batch, assume_valid=False)
Update the maximum with a new batch of data.
- Parameters:
batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.
- Returns:
Updated BatchMax object.
- Return type:
BatchMax
BatchPeakToPeak
- class batchstats.stats.BatchPeakToPeak(axis=0)
Class for calculating the peak-to-peak (max - min) of batches of data.
This class uses BatchMax and BatchMin internally to keep track of the element-wise maximum and minimum values. The peak-to-peak value is the difference between the maximum and minimum.
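The composition described above, a running maximum and a running minimum whose difference is taken at the end, can be sketched with plain NumPy. `update_extrema` is a hypothetical helper used only for this illustration.

```python
import numpy as np

def update_extrema(state, batch, axis=0):
    """Fold one batch into a (minimum, maximum) pair; state is None before the first batch."""
    batch = np.asarray(batch)
    batch_min, batch_max = batch.min(axis=axis), batch.max(axis=axis)
    if state is None:
        return batch_min, batch_max
    return np.minimum(state[0], batch_min), np.maximum(state[1], batch_max)

state = None
for batch in (np.array([[1, 8], [3, 4]]), np.array([[5, 6], [7, 2]])):
    state = update_extrema(state, batch)

# the peak-to-peak value is only computed when the result is requested
peak_to_peak = state[1] - state[0]
```

Per-batch peak-to-peak values cannot be combined directly, which is why the two extrema are tracked separately.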
    import numpy as np
    from batchstats import BatchPeakToPeak

    # create some data
    data1 = np.array([[1, 8], [3, 4]])
    data2 = np.array([[5, 6], [7, 2]])

    # create a BatchPeakToPeak object
    bpp = BatchPeakToPeak()

    # update with the first batch
    bpp.update_batch(data1)

    # update with the second batch
    bpp.update_batch(data2)

    # get the peak-to-peak
    total_ptp = bpp()

    # verify the result
    expected_ptp = np.array([6, 6])
    np.testing.assert_allclose(total_ptp, expected_ptp)

Example with multiple axes and data > 2 dimensions

    import numpy as np
    from batchstats import BatchPeakToPeak

    # create some 3d data
    data1 = np.arange(24).reshape(2, 3, 4)
    data2 = np.arange(24, 48).reshape(2, 3, 4)

    # create a BatchPeakToPeak object to get the ptp over the last two axes
    bpp = BatchPeakToPeak(axis=(1, 2))

    # update with the first batch
    bpp.update_batch(data1)

    # update with the second batch
    bpp.update_batch(data2)

    # get the ptp
    total_ptp = bpp()

    # verify the result
    d = np.concatenate((data1, data2))
    expected_ptp = d.max(axis=(1, 2)) - d.min(axis=(1, 2))
    np.testing.assert_allclose(total_ptp, expected_ptp)

- __call__() -> ndarray
Calculate the peak-to-peak.
- Returns:
Peak-to-peak of the batches.
- Return type:
numpy.ndarray
- Raises:
NoValidSamplesError – If no valid samples are available.
- __init__(axis=0)
- update_batch(batch, assume_valid=False)
Update the peak-to-peak with a new batch of data.
- Parameters:
batch (numpy.ndarray) – Input batch.
assume_valid (bool, optional) – If True, assumes all elements in the batch are valid. Default is False.
- Returns:
Updated BatchPeakToPeak object.
- Return type:
BatchPeakToPeak
Weighted Statistics
These classes are used for computing weighted statistics on datasets.
BatchWeightedSum
- class batchstats.stats.BatchWeightedSum(axis=0)
Class for calculating the weighted sum of batches of data.
The algorithm used is a simple cumulative sum. Each time a new batch is added, the weighted sum of the new batch is added to the existing sum.
    import numpy as np
    from batchstats.stats.weighted_sum import BatchWeightedSum

    # create some data
    data1 = np.array([[1, 2], [3, 4]])
    weights1 = np.array([[0.1, 0.2], [0.3, 0.4]])
    data2 = np.array([[5, 6], [7, 8]])
    weights2 = np.array([[0.5, 0.6], [0.7, 0.8]])

    # create a BatchWeightedSum object
    bws = BatchWeightedSum()

    # update with the first batch
    bws.update_batch(data1, weights=weights1)

    # update with the second batch
    bws.update_batch(data2, weights=weights2)

    # get the weighted sum
    total_weighted_sum = bws()

    # verify the result
    expected_sum = np.array([8.4, 12.0])
    np.testing.assert_allclose(total_weighted_sum, expected_sum)

- __call__() -> ndarray
Calculate the weighted sum.
- __init__(axis=0)
- update_batch(batch, weights)
Update the weighted sum with a new batch of data.
BatchWeightedMean
- class batchstats.stats.BatchWeightedMean(axis=0)
Class for calculating the weighted mean of batches of data. It computes sum(w*x) / sum(w) by using two BatchWeightedSum instances.
- __call__() -> ndarray
Calculate the weighted mean.
- __init__(axis=0)
- update_batch(batch, weights)
Update the weighted mean with a new batch of data.
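The formula sum(w*x) / sum(w) given for BatchWeightedMean can be illustrated with two running accumulators in plain NumPy. This is a sketch along axis 0 that reuses the data from the BatchWeightedSum example; the variable names are chosen for illustration and batchstats itself is not called.

```python
import numpy as np

data1 = np.array([[1, 2], [3, 4]])
weights1 = np.array([[0.1, 0.2], [0.3, 0.4]])
data2 = np.array([[5, 6], [7, 8]])
weights2 = np.array([[0.5, 0.6], [0.7, 0.8]])

sum_wx = 0.0  # running sum of weights * values
sum_w = 0.0   # running sum of weights
for x, w in ((data1, weights1), (data2, weights2)):
    sum_wx = sum_wx + (w * x).sum(axis=0)
    sum_w = sum_w + w.sum(axis=0)

# the division happens once, when the result is requested
weighted_mean = sum_wx / sum_w
```

Keeping the numerator and denominator as separate sums means batches can arrive in any order without changing the result beyond floating-point rounding.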