swectral.denoiser.ArrayOutlier

class swectral.denoiser.ArrayOutlier(test_method='iqr', to='neighbor', axis=0, *, dixon_alpha=0.05, iqr_multiplier=1.5, modified_z_threshold=3.5, numtype='float32', generate_report=False)

Identify and replace outliers in each 1D data series of a 2D array or dataframe.

Attributes:
test_methodstr

The method of outlier test. Available options:

  • “dixon” - Dixon’s Q test,

  • “iqr” - interquartile range,

  • “modified_z” - Modified Z-score.

The default is “iqr”.

tostr

The outlier replacement strategy. The outlier can be replaced by:

  • “nan” - the outlier is replaced with NaN and excluded from calculation.

  • “mean” - mean value of nonoutliers.

  • “median” - median of nonoutliers.

  • “neighbor” - the closest nonoutlier value next to the outlier. If two neighbors are available, the average of the two is used.

The default is “neighbor”.
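The four strategies above can be sketched in plain NumPy. This is only an illustration, not the library's implementation: `replace_outliers` is a hypothetical helper, and it assumes that "closest" means the nearest nonoutlier by position on each side of the outlier.

```python
import numpy as np

def replace_outliers(series, outlier_idx, to="neighbor"):
    """Hypothetical helper: replace the values at outlier_idx in a 1D series."""
    x = np.asarray(series, dtype=float).copy()
    outlier_idx = np.asarray(outlier_idx, dtype=int)
    # Positions of all values that were not flagged as outliers.
    keep = np.setdiff1d(np.arange(x.size), outlier_idx)
    if to == "nan":
        x[outlier_idx] = np.nan
    elif to == "mean":
        x[outlier_idx] = x[keep].mean()
    elif to == "median":
        x[outlier_idx] = np.median(x[keep])
    elif to == "neighbor":
        for i in outlier_idx:
            left, right = keep[keep < i], keep[keep > i]
            neighbors = []
            if left.size:
                neighbors.append(x[left[-1]])   # nearest nonoutlier before i
            if right.size:
                neighbors.append(x[right[0]])   # nearest nonoutlier after i
            if not neighbors:
                raise ValueError("no nonoutlier value available")
            x[i] = np.mean(neighbors)           # one neighbor, or average of two
    else:
        raise ValueError(f"unknown strategy: {to!r}")
    return x

data = [1, 2, 3, 99, 5, 6]
by_neighbor = replace_outliers(data, [3], to="neighbor")
by_median = replace_outliers(data, [3], to="median")
by_mean = replace_outliers(data, [3], to="mean")
by_nan = replace_outliers(data, [3], to="nan")
edge_case = replace_outliers([99, 2, 3], [0], to="neighbor")
```

With the outlier at index 3, "neighbor" averages the adjacent values 3 and 5, while an outlier at the start of a series has only one neighbor to copy.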

axisint

Axis along which the outlier test and replacement are applied. The default is 0.

dixon_alphafloat

Two-tailed significance level for Dixon’s Q test. The default is 0.05.
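As a rough illustration of what the significance level controls, the sketch below computes the simplest form of the Q statistic (gap divided by range) for the two extreme values. It is not the library's code: `dixon_q_statistics` is a hypothetical helper, the critical value must come from a published table for the chosen sample size and significance level, and larger samples are usually tested with modified ratios that are not covered here.

```python
import numpy as np

def dixon_q_statistics(series):
    """Hypothetical helper: return (Q_low, Q_high) for the two extreme values."""
    x = np.sort(np.asarray(series, dtype=float))
    if x.size < 3:
        raise ValueError("Dixon's Q test needs at least 3 values")
    data_range = x[-1] - x[0]
    if data_range == 0:
        return 0.0, 0.0                      # all values identical
    q_low = (x[1] - x[0]) / data_range       # gap below the second-smallest value
    q_high = (x[-1] - x[-2]) / data_range    # gap above the second-largest value
    return q_low, q_high

q_low, q_high = dixon_q_statistics([1, 2, 3, 99, 5, 6])

# Assumption: 0.625 is the commonly tabulated two-tailed critical value for
# n = 6 at alpha = 0.05; verify it against the table you rely on.
Q_CRIT = 0.625
high_is_outlier = q_high > Q_CRIT
low_is_outlier = q_low > Q_CRIT
```

A smaller significance level corresponds to a larger critical value, so fewer extreme values are flagged.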

iqr_multiplierfloat, optional

Multiplier applied to the interquartile range (IQR) to define the lower and upper bounds for outlier detection.

The default is 1.5.
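The role of the multiplier can be sketched as follows. This is a minimal illustration, not the library's implementation: `iqr_outlier_indices` is a hypothetical helper, and it assumes NumPy's default (linear interpolation) percentile method, which may differ from what the class uses.

```python
import numpy as np

def iqr_outlier_indices(series, multiplier=1.5):
    """Hypothetical helper: flag values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    x = np.asarray(series, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    is_outlier = (x < lower) | (x > upper)
    return np.flatnonzero(is_outlier), np.flatnonzero(~is_outlier)

out_idx, in_idx = iqr_outlier_indices([1, 2, 3, 99, 5, 6])
```

A larger multiplier widens the bounds and flags fewer values.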

modified_z_thresholdfloat, optional

Threshold value used in modified z-score–based outlier detection. Observations with an absolute modified z-score exceeding this value are classified as outliers.

The default is 3.5.
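For orientation, the commonly used definition of the modified z-score scales the deviation from the median by the median absolute deviation (MAD), using the constant 0.6745. The sketch below follows that definition; `modified_z_outlier_indices` is a hypothetical helper, and how the library handles a zero MAD is not stated in this page, so raising an error there is an assumption.

```python
import numpy as np

def modified_z_outlier_indices(series, threshold=3.5):
    """Hypothetical helper: flag values whose |modified z-score| > threshold."""
    x = np.asarray(series, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))      # median absolute deviation
    if mad == 0:
        # Assumption: treat a zero MAD as an error rather than dividing by zero.
        raise ValueError("MAD is zero; modified z-scores are undefined")
    scores = 0.6745 * (x - median) / mad
    is_outlier = np.abs(scores) > threshold
    return np.flatnonzero(is_outlier), np.flatnonzero(~is_outlier)

out_idx, in_idx = modified_z_outlier_indices([1, 2, 3, 99, 5, 6])
```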

numtypestr

NumPy-supported numeric data type used for test computation and output. The default is “float32”.

generate_reportbool

Whether to generate reports of outlier tests.

Generating reports can be time-consuming for large datasets. Repeated calls to ArrayOutlier.replace() accumulate reports in ArrayOutlier.report, which can lead to significant memory growth.

The default is False.

reportlist

List of reports, one for each “replace” execution, if generate_report is True.

Methods

iqr(data_series)

Identify outliers using the Interquartile Range (IQR) criterion and return their indices.

modified_z(data_series)

Identify outliers using the modified z score approach and return their indices.

replace(data)

Replace outliers in a 2D array or dataframe of 1D data series.

dixon(data_series)

Apply Dixon’s Q test to get outlier and nonoutlier indices of a 1D data series.

Examples

Use default settings:

>>> outlier = ArrayOutlier()

Specify outlier detection method:

>>> outlier = ArrayOutlier(test_method='dixon')

Customize outlier detection method:

>>> outlier = ArrayOutlier(test_method='dixon', dixon_alpha=0.1)

Specify replacement strategy:

>>> outlier = ArrayOutlier(to='median')

Retrieve report in addition to result of replacement:

>>> outlier = ArrayOutlier(generate_report=True)
>>> result = outlier.replace(data)  # data: 2D numpy array or dataframe
>>> report = outlier.report

__init__(test_method='iqr', to='neighbor', axis=0, *, dixon_alpha=0.05, iqr_multiplier=1.5, modified_z_threshold=3.5, numtype='float32', generate_report=False)

dixon(data_series)

Perform Dixon’s Q test to identify outliers in a dataset.

Parameters:
data_serieslist or 1D array_like

Data series for outlier detection. Dixon’s Q test requires a sample size between 3 and 30.

Returns:
A tuple of:
outlier_indicesnumpy.ndarray

NumPy array of outlier indices.

nonoutlier_indicesnumpy.ndarray

NumPy array of non-outlier indices.

test_reportlist or None

Test report as a list if report generation is enabled, otherwise None.

Raises:
ValueError

If the sample size is outside the range 3 to 30.

Return type:

tuple[ndarray, ndarray, Optional[list]]

Examples

>>> outlier = ArrayOutlier()
>>> outlier_ind, non_outlier_ind, report = outlier.dixon([1, 2, 3, 99, 5, 6])

iqr(data_series)

Identify outliers using the Interquartile Range (IQR) criterion and return their indices.

Parameters:
data_serieslist or 1D array_like

List or 1D array of a data series for outlier detection. The length must be at least 5.

Returns:
A tuple of:
outlier_indicesnumpy.ndarray

NumPy array of outlier indices.

nonoutlier_indicesnumpy.ndarray

NumPy array of non-outlier indices.

test_reportlist or None

Test report as a list if report generation is enabled, otherwise None.

Raises:
ValueError

If the sample size is less than 5.

Return type:

tuple[ndarray, ndarray, Optional[list]]

Examples

>>> outlier = ArrayOutlier()
>>> outlier_ind, non_outlier_ind, report = outlier.iqr([1, 2, 3, 99, 5, 6])

modified_z(data_series)

Identify outliers using the modified z score approach and return their indices.

Parameters:
data_series1D list or numpy array

The data series to test for outliers. The length must be at least 5, and at least 12 is recommended.

Note that the function does not check the data for normality, which this approach assumes.

Returns:
A tuple of:
outlier_indicesnumpy.ndarray

NumPy array of outlier indices.

nonoutlier_indicesnumpy.ndarray

NumPy array of non-outlier indices.

test_reportlist or None

Test report as a list if report generation is enabled, otherwise None.

Raises:
ValueError

If the sample size is less than 5.

Return type:

tuple[ndarray, ndarray, Optional[list]]

Warning

UserWarning

If the sample size is at least 5 but less than 12. The test is still applied, but the result may be unreliable because normality is hard to establish for such small samples.

Examples

>>> outlier = ArrayOutlier()
>>> outlier_ind, non_outlier_ind, report = outlier.modified_z([1, 2, 3, 99, 5, 6])

replace(data)

Replace outliers in a 2D array or dataframe of 1D data series.

Parameters:
datanumpy.ndarray or pandas.DataFrame

2D array or dataframe of 1D data.

Returns:
numpy.ndarray

Data with outliers replaced.

Raises:
ValueError

If the input data is not a 2D numpy array or pandas dataframe.

ValueError

If no replacement value can be determined for an outlier.

Return type:

ndarray

Examples

>>> import numpy as np
>>> outlier = ArrayOutlier()
>>> data = np.array([[1, 2, 3, 99, 5, 6], [2, 2, 4, 4, 6, 6]])
>>> result = outlier.replace(data)