swectral.denoiser.ArrayOutlier
- class swectral.denoiser.ArrayOutlier(test_method='iqr', to='neighbor', axis=0, *, dixon_alpha=0.05, iqr_multiplier=1.5, modified_z_threshold=3.5, numtype='float32', generate_report=False)
Identify and replace outliers in each 1D data series of a 2D array or dataframe.
- Attributes:
- test_method
str The method of outlier test. Available options:
“dixon” - Dixon’s Q test,
“iqr” - interquartile range,
“modified_z” - Modified Z-score.
The default is “iqr”.
- to
str The outlier replacement strategy. The outlier can be replaced by:
“nan” - the outlier is replaced with NaN and excluded from calculation.
“mean” - mean value of nonoutliers.
“median” - median of nonoutliers.
“neighbor” - the closest nonoutlier value to the outlier. If two neighbors are available, the average of the two is used.
The default is “neighbor”.
- axis
int Calculate along the axis. The default is 0.
- dixon_alpha
float Two-tailed significance level for Dixon’s Q test. The default is 0.05.
- iqr_multiplier
float, optional Multiplier applied to the interquartile range (IQR) to define the lower and upper bounds for outlier detection.
The default is 1.5.
- modified_z_threshold
float, optional Threshold value used in modified z-score-based outlier detection. Observations with an absolute modified z-score exceeding this value are classified as outliers.
The default is 3.5.
- numtype
str NumPy-supported numeric data type for test computation and output. The default is “float32”.
- generate_report
bool Whether to generate reports of outlier tests.
Report generation can be time-consuming for large datasets. Repeated calls to ArrayOutlier.replace() accumulate reports in ArrayOutlier.report, which can lead to significant memory growth.
The default is False.
- report
list List of reports from each “replace” execution if generate_report is True.
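The “nan”, “mean”, and “median” options of `to` can be illustrated on a single series. The following is a minimal standard-library sketch, not the library's implementation; the helper name `replace_outliers` and its signature are hypothetical.

```python
import math
import statistics


def replace_outliers(values, outlier_indices, to="median"):
    """Hypothetical helper: fill flagged positions with NaN, or with the
    mean or median of the values that were not flagged."""
    flagged = set(outlier_indices)
    kept = [v for i, v in enumerate(values) if i not in flagged]
    if to == "nan":
        fill = math.nan
    elif to == "mean":
        fill = statistics.fmean(kept)
    elif to == "median":
        fill = float(statistics.median(kept))
    else:
        raise ValueError(f"unsupported strategy: {to!r}")
    return [fill if i in flagged else float(v) for i, v in enumerate(values)]


series = [1, 2, 3, 99, 5, 6]
print(replace_outliers(series, [3], to="mean"))    # [1.0, 2.0, 3.0, 3.4, 5.0, 6.0]
print(replace_outliers(series, [3], to="median"))  # [1.0, 2.0, 3.0, 3.0, 5.0, 6.0]
```

With “mean” and “median”, the fill value is computed from the nonoutliers only, so the flagged value does not affect it.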
Methods
iqr(data_series) - Identify outliers using the Interquartile Range (IQR) criterion and return their indices.
modified_z(data_series) - Identify outliers using the modified z-score approach and return their indices.
replace(data) - Replace outliers in a 2D array or dataframe of 1D data series.
dixon_q
Apply Dixon’s Q test to get outlier and nonoutlier indices of 1D data series.
Examples
Use default settings:
>>> from swectral.denoiser import ArrayOutlier
>>> outlier = ArrayOutlier()
Specify outlier detection method:
>>> outlier = ArrayOutlier(test_method='dixon')
Customize outlier detection method:
>>> outlier = ArrayOutlier(test_method='dixon', dixon_alpha=0.1)
Specify replacement strategy:
>>> outlier = ArrayOutlier(to='median')
Retrieve report in addition to result of replacement:
>>> outlier = ArrayOutlier(generate_report=True)
>>> report = outlier.report
- __init__(test_method='iqr', to='neighbor', axis=0, *, dixon_alpha=0.05, iqr_multiplier=1.5, modified_z_threshold=3.5, numtype='float32', generate_report=False)
Methods
__init__([test_method, to, axis, ...])
dixon(data_series) - Perform Dixon's Q test to identify outliers in a dataset.
iqr(data_series) - Identify outliers using the Interquartile Range (IQR) criterion and return their indices.
modified_z(data_series) - Identify outliers using the modified z-score approach and return their indices.
replace(data) - Replace outliers in a 2D array or dataframe of 1D data series.
- dixon(data_series)
Perform Dixon’s Q test to identify outliers in a dataset.
- Parameters:
- data_series
list or 1D array_like Series of data for outlier detection. Dixon’s Q test requires a sample size between 3 and 30.
- Returns:
A tuple of:
- outlier_indices
numpy.ndarray NumPy array of outlier indices.
- nonoutlier_indices
numpy.ndarray NumPy array of non-outlier indices.
- test_report
list or None List of test reports, if generated.
- Raises:
ValueError If the sample size is outside the range 3 to 30.
Examples
>>> outlier = ArrayOutlier()
>>> outlier_ind, non_outlier_ind, report = outlier.dixon([1, 2, 3, 99, 5, 6])
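For orientation, the statistic behind Dixon's Q test can be sketched with the standard library alone. This is a minimal illustration of the simplest ratio (gap to nearest neighbor divided by the full range), not the library's code; the function name `dixon_q_statistics` is hypothetical, and the library may use other ratio variants or critical-value tables.

```python
def dixon_q_statistics(values):
    """Hypothetical helper: return the Q ratios for the smallest and the
    largest observation (gap to nearest neighbor divided by the range)."""
    ordered = sorted(values)
    n = len(ordered)
    if not 3 <= n <= 30:
        raise ValueError("Dixon's Q test expects 3 to 30 observations")
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        # All values are identical, so nothing can stand out.
        return 0.0, 0.0
    q_low = (ordered[1] - ordered[0]) / spread
    q_high = (ordered[-1] - ordered[-2]) / spread
    return q_low, q_high


q_low, q_high = dixon_q_statistics([1, 2, 3, 99, 5, 6])
print(round(q_low, 4), round(q_high, 4))  # 0.0102 0.949
```

A value is flagged when its Q ratio exceeds the critical value for the sample size and significance level. For n = 6 at a two-tailed level of 0.05, a commonly tabulated critical value is 0.625 (verify against a published table before relying on it), so 99 would be flagged here and 1 would not.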
- iqr(data_series)
Identify outliers using the Interquartile Range (IQR) criterion and return their indices.
- Parameters:
- data_series
list or 1D array_like List or 1D array of a data series for outlier detection. The length must be at least 5.
- Returns:
A tuple of:
- outlier_indices
numpy.ndarray NumPy array of outlier indices.
- nonoutlier_indices
numpy.ndarray NumPy array of non-outlier indices.
- test_report
list or None List of test reports, if generated.
- Raises:
ValueError If the sample size is less than 5.
Examples
>>> outlier = ArrayOutlier()
>>> outlier_ind, non_outlier_ind, report = outlier.iqr([1, 2, 3, 99, 5, 6])
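The IQR criterion itself can be sketched as follows. This is an illustrative standard-library version, not the library's implementation: the function name `iqr_outlier_indices` is hypothetical, and the quartile interpolation used here (the "inclusive" method of `statistics.quantiles`) is an assumption that may differ from what the library computes.

```python
import statistics


def iqr_outlier_indices(values, multiplier=1.5):
    """Hypothetical helper: flag values outside
    [Q1 - multiplier * IQR, Q3 + multiplier * IQR]."""
    if len(values) < 5:
        raise ValueError("at least 5 observations are required")
    # Quartiles by linear interpolation between order statistics.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower = q1 - multiplier * spread
    upper = q3 + multiplier * spread
    outliers = [i for i, v in enumerate(values) if v < lower or v > upper]
    nonoutliers = [i for i, v in enumerate(values) if lower <= v <= upper]
    return outliers, nonoutliers


print(iqr_outlier_indices([1, 2, 3, 99, 5, 6]))  # ([3], [0, 1, 2, 4, 5])
```

For this series, Q1 = 2.25 and Q3 = 5.75, so with the default multiplier of 1.5 the bounds are -3.0 and 11.0, and only the value 99 at index 3 falls outside them.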
- modified_z(data_series)
Identify outliers using the modified z score approach and return their indices.
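- Parameters:
- data_series
list or 1D array_like List or 1D array of a data series for outlier detection. The length must be at least 5.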
- Returns:
A tuple of:
- outlier_indices
numpy.ndarray NumPy array of outlier indices.
- nonoutlier_indices
numpy.ndarray NumPy array of non-outlier indices.
- test_report
list or None List of test reports, if generated.
- Raises:
ValueError If the sample size is less than 5.
Warning
- UserWarning
If the sample size is at least 5 but less than 12. The test is still applied, but the result may be unreliable because normality is difficult to assess for such small samples.
Examples
>>> outlier = ArrayOutlier()
>>> outlier_ind, non_outlier_ind, report = outlier.modified_z([1, 2, 3, 99, 5, 6])
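The modified z-score is commonly defined as 0.6745 * (x - median) / MAD, where MAD is the median absolute deviation from the median. The sketch below assumes that definition and is not taken from the library's source; the function name `modified_z_outlier_indices` and the handling of a zero MAD are hypothetical choices.

```python
import statistics


def modified_z_outlier_indices(values, threshold=3.5):
    """Hypothetical helper: flag values whose absolute modified z-score
    exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # The score is undefined when more than half of the values coincide.
        raise ValueError("median absolute deviation is zero")
    scores = [0.6745 * (v - median) / mad for v in values]
    outliers = [i for i, s in enumerate(scores) if abs(s) > threshold]
    nonoutliers = [i for i, s in enumerate(scores) if abs(s) <= threshold]
    return outliers, nonoutliers


print(modified_z_outlier_indices([1, 2, 3, 99, 5, 6]))  # ([3], [0, 1, 2, 4, 5])
```

Here the median is 4 and the MAD is 2, so the value 99 scores about 32.04 and is flagged, while every other value scores at most about 1.01 in absolute terms.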
- replace(data)
Replace outliers in a 2D array or dataframe of 1D data series.
- Parameters:
- data
numpy.ndarray or pandas.DataFrame 2D array or dataframe of 1D data.
- Returns:
numpy.ndarray Data with outliers replaced.
- Raises:
ValueError If the input data is not a 2D NumPy array or pandas dataframe.
ValueError If no replacement value can be determined for an outlier.
Examples
>>> import numpy as np
>>> outlier = ArrayOutlier()
>>> outlier.replace(np.array([[1, 2, 3, 99, 5, 6], [2, 2, 4, 4, 6, 6]]))
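The default “neighbor” strategy can be pictured on one series as follows. This sketch reads "closest nonoutlier" as the nearest nonoutlier on each side of the outlier, which is an assumption about the library's behavior; the helper name `replace_with_neighbors` is hypothetical.

```python
def replace_with_neighbors(values, outlier_indices):
    """Hypothetical helper: replace each flagged value with the nearest
    non-flagged value on either side, averaging when both sides exist."""
    flagged = set(outlier_indices)
    result = [float(v) for v in values]
    for i in sorted(flagged):
        # Scan outward in each direction for the first non-flagged value.
        left = next((values[j] for j in range(i - 1, -1, -1) if j not in flagged), None)
        right = next((values[j] for j in range(i + 1, len(values)) if j not in flagged), None)
        neighbors = [v for v in (left, right) if v is not None]
        if not neighbors:
            raise ValueError("no non-outlier value available for replacement")
        result[i] = sum(neighbors) / len(neighbors)
    return result


print(replace_with_neighbors([1, 2, 3, 99, 5, 6], [3]))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(replace_with_neighbors([99, 2, 3], [0]))           # [2.0, 2.0, 3.0]
```

An outlier at the start or end of a series has only one neighbor, so that neighbor's value is used directly.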