Usage#

1. Data preparation#

Set up a demo directory in the current working directory

import os
demo_dir = os.getcwd() + "/SpecPipeDemo/"

Create a data directory and download real-world demo data

data_dir = demo_dir + "demo_data/"
os.makedirs(data_dir)

from swectral import download_demo_data
download_demo_data(data_dir)

Create a directory for pipeline results

report_dir = demo_dir + "demo_results_classification/"
os.makedirs(report_dir)
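
The setup steps above raise `FileExistsError` when run a second time, because `os.makedirs` fails on an existing directory. A minimal sketch of a re-runnable variant, assuming only the standard library; the temporary base directory and the helper name `ensure_dirs` are illustrative choices, and the tutorial itself uses `os.getcwd()`. Joining with an empty last component keeps the trailing separator, which later snippets rely on when they append file names to `report_dir`.

```python
import os
import tempfile

# Throwaway base directory for this illustration (the tutorial uses os.getcwd()).
base_dir = tempfile.mkdtemp()

# os.path.join avoids doubled separators; the trailing "" adds a final separator.
demo_dir = os.path.join(base_dir, "SpecPipeDemo", "")
data_dir = os.path.join(demo_dir, "demo_data", "")
report_dir = os.path.join(demo_dir, "demo_results_classification", "")


def ensure_dirs(*paths):
    """Create each directory if it does not exist yet (hypothetical helper)."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


ensure_dirs(data_dir, report_dir)
ensure_dirs(data_dir, report_dir)  # second call is a no-op instead of an error
```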

2. Data configuration#

Create a SpecExp instance:

from swectral import SpecExp
exp = SpecExp(report_dir)

The instance stores and organizes the data loading configurations of an experiment, which facilitates lazy loading.

Check report directory:

exp.report_directory

Output:

'~/SpecPipeDemo/demo_results_classification/'

Add experiment groups:
```
exp.add_groups(['group_1', 'group_2'])
```

Add raster images:

exp.add_images_by_name(image_name="demo.", image_directory=data_dir, group="group_1")
exp.add_images_by_name("demo.", data_dir, "group_2")

Output:

Following image items are added:
    Group    Image    Mask
0   group_1  demo.tiff

Load image ROIs from files whose names are the image name followed by a suffix:

# By parameter name
exp.add_rois_by_suffix(roi_filename_suffix="_[12].xml", search_directory=data_dir, group="group_1")
# Or by parameter position
exp.add_rois_by_suffix("_[345].xml", data_dir, "group_2")

Output:

Following ROI items loaded:
   Group    Image    ROI_name    ROI_type    ROI_source_file
0  group_1  demo.tiff      1-1   sample      demo_1.xml
1  group_1  demo.tiff      1-2   sample      demo_1.xml
...
9  group_1  demo.tiff      2-5   sample      demo_2.xml
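
The suffix arguments above look like shell-style patterns, in which `[12]` matches a single character `1` or `2`. A minimal sketch of that matching rule with the standard `fnmatch` module; whether `add_rois_by_suffix` applies exactly this rule is an assumption, and every file name other than `demo_1.xml` and `demo_2.xml` is hypothetical.

```python
from fnmatch import fnmatchcase

image_stem = "demo"  # image name without its extension
candidate_files = [
    "demo_1.xml", "demo_2.xml", "demo_3.xml",
    "demo_4.xml", "demo_5.xml", "other_1.xml",
]


def match_rois(stem, suffix_pattern, filenames):
    """Return the files named <stem><suffix>, with a shell-style suffix pattern."""
    return [name for name in filenames if fnmatchcase(name, stem + suffix_pattern)]


group_1_files = match_rois(image_stem, "_[12].xml", candidate_files)
group_2_files = match_rois(image_stem, "_[345].xml", candidate_files)
print(group_1_files)  # ['demo_1.xml', 'demo_2.xml']
print(group_2_files)  # ['demo_3.xml', 'demo_4.xml', 'demo_5.xml']
```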

Show raster RGB preview with associated ROIs:

exp.show_image("demo.tiff", "group_1", rgb_band_index=(19, 12, 6), output_path=report_dir + "demo_rast_rgb1.png")

Output: raster RGB preview of demo.tiff with the ROIs of group_1.

exp.show_image("demo.tiff", "group_2", rgb_band_index=(19, 12, 6), output_path=report_dir + "demo_rast_rgb2.png")

Output: raster RGB preview of demo.tiff with the ROIs of group_2.

2.5. Sample labels and target values#

2.5.1 Set sample labels#

Get current sample label dataframe:
```
labels = exp.ls_labels()
```

Set new sample labels in the dataframe:

Here we use sample ROI names as sample labels:

labels.iloc[:, 1] = exp.ls_rois_sample(return_dataframe=True, print_result=False)["ROI_name"]

Update sample labels:
```
exp.sample_labels = labels
```

Check sample labels:

exp.ls_labels()["Label"]

Output:

   1-1
   1-2
...
  5-5

2.5.2 Set target values#

List target value dataframe:
```
targets = exp.ls_sample_targets()
```
Create mock target values for classification and update the target dataframe:

Here we use leaf number:
```
targets["Target_value"] = [f"leaf_{labl[0]}" for labl in targets['Label']]
```
Load target values from updated target dataframe:
```
exp.sample_targets_from_df(targets)
```

Check target values:

exp.ls_targets()[["Label", "Target_value"]]

Output:

    Label Target_value
0    1-1       leaf_1
1    1-2       leaf_1
...
24   5-5       leaf_5
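
The target derivation above takes the first character of each label. A minimal sketch in plain Python, assuming labels of the form `<leaf>-<roi>` from `1-1` to `5-5` as in the output; the helper name `leaf_target` is hypothetical. Splitting on the hyphen gives the same result here and would also handle leaf numbers with more than one digit.

```python
# 5 leaves x 5 ROIs, matching the 25 rows (index 0 to 24) shown above.
labels = [f"{leaf}-{roi}" for leaf in range(1, 6) for roi in range(1, 6)]


def leaf_target(label):
    """Map '<leaf>-<roi>' to 'leaf_<leaf>' (hypothetical helper)."""
    leaf, _roi = label.split("-", 1)
    return f"leaf_{leaf}"


class_targets = [leaf_target(label) for label in labels]
print(class_targets[0], class_targets[-1], len(class_targets))  # leaf_1 leaf_5 25
```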

3. Design testing pipelines#

SpecPipe follows a structured data processing workflow with these sequential data levels:
```
Raster image data -> ROI spectra -> ROI statistics -> Traits to model
```

The data levels in SpecPipe include:

Raster images:
    0 - "image", input image path and output processed image path.

    1 - "pixel_spec", if the process callable is applied to 1D spectrum of image pixel

    2 - "pixel_specs_array", if the process callable is applied to 2D spectra array of image pixels

    3 - "pixel_specs_tensor", if the process callable is applied to 3D spectra tensor of image pixels

    4 - "pixel_hyperspecs_tensor", same as "pixel_specs_tensor" but optimized for hyperspectral images

ROI spectra:
    5 - "image_roi", raster with sample ROIs, for spectrum extraction

    6 - "roispecs", 2D array of ROI spectra

ROI statistics:
    7 - "spec1d", arbitrary 1D data of samples, e.g. 1D spectra, flattened spectra statistical metrics

Sample data:
    8 - "assembly", sample data list for cross-sample interaction

Models:
    9 - "model", model evaluation with standard report output as files

The corresponding data processing workflow is:

Raster image processing:           0 ~ 4
    ↓
Extract ROI spectra:               5 - "image_roi"
    ↓
ROI spectra manipulation:          6 - "roispecs"
    ↓
Summarized ROI spectra:            7 - "spec1d"
    ↓
Sample assembly:                   8 - "assembly"
    ↓
Modeling and model evaluation:     9 - "model"

The processing functions are incorporated into the pipeline according to the specified “data levels”. Parallel processes can be added with identical “data level” and “application sequence”; they are combined in the pipeline using a full-factorial design.
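
The full-factorial arrangement can be illustrated with `itertools.product`: every option of one step is combined with every option of the other steps. This is only a sketch of the combinatorics, not of how SpecPipe builds its chains internally; the process names are placeholders taken from the steps used later in this tutorial.

```python
from itertools import product

# One list of alternative processes per pipeline step.
steps = [
    ["snv", "raw"],                                      # image processing
    ["roi_mean", "roi_median"],                          # ROI statistics
    ["KNeighborsClassifier", "RandomForestClassifier"],  # models
]

# Full-factorial design: 2 x 2 x 2 = 8 processing chains.
chains = list(product(*steps))
for chain in chains:
    print(" -> ".join(chain))
print(len(chains))  # 8
```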

3.1 Create processing pipeline#

Create processing pipeline from SpecExp instance configured above:
```
from swectral import SpecPipe
pipe = SpecPipe(exp)
```

3.2 Image processing#

Create some image processing functions, such as:
Standard normal variate:
```
from swectral.functions import snv
```
Pass-through method for comparison:
```
def raw(v): return v
```
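
For orientation, standard normal variate rescales each spectrum to zero mean and unit standard deviation. A minimal plain-Python sketch, assuming the population standard deviation; implementations differ on this point, and the name `snv_sketch` is hypothetical so that it does not shadow the imported `snv`, whose exact behavior is not shown here.

```python
from statistics import fmean, pstdev


def snv_sketch(spectrum):
    """Center a 1D spectrum and scale it by its (population) standard deviation."""
    mean = fmean(spectrum)
    std = pstdev(spectrum)
    if std == 0:
        raise ValueError("SNV is undefined for a constant spectrum")
    return [(value - mean) / std for value in spectrum]


spectrum = [0.10, 0.20, 0.40, 0.30]  # made-up reflectance values
scaled = snv_sketch(spectrum)
print([round(value, 3) for value in scaled])
```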

3.3 ROI statistics#

Import spectral statistic metrics for ROI summary:
```
from swectral import roi_mean, roi_median
```
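
As a rough picture of what such summaries do, the sketch below reduces a 2D array of ROI spectra (rows are pixels, columns are bands) to one value per band. It assumes per-band reduction and uses made-up numbers; the function names `band_mean` and `band_median` are hypothetical stand-ins, not the library functions.

```python
from statistics import fmean, median

# Rows are pixels inside one ROI, columns are spectral bands (made-up values).
roi_spectra = [
    [0.10, 0.20, 0.30],
    [0.20, 0.30, 0.50],
    [0.60, 0.10, 0.40],
]


def band_mean(spectra):
    """Per-band mean across all pixels of the ROI."""
    return [fmean(band) for band in zip(*spectra)]


def band_median(spectra):
    """Per-band median across all pixels of the ROI."""
    return [median(band) for band in zip(*spectra)]


print([round(value, 2) for value in band_mean(roi_spectra)])  # [0.3, 0.2, 0.4]
print(band_median(roi_spectra))                                # [0.2, 0.2, 0.4]
```

In the first band the median (0.2) is less affected by the single bright pixel (0.60) than the mean (0.3), which is one reason to test both summaries side by side.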

3.4 Add models to the pipeline#

Create some models:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rf_classifier = RandomForestClassifier(n_estimators=10)
knn_classifier = KNeighborsClassifier(n_neighbors=3)

3.5 Compose and check pipelines#

Compose pipelines:

pipe.build_pipeline(
    [
        # 1 Image-wide baseline correction
        ((2, 2), [raw, snv]),
        # 2 ROI statistics
        ((5, 7), [roi_mean, roi_median]),
        # 3 Models (Feature selector included)
        ((7, 9), [rf_classifier, knn_classifier], {'validation_method': '2-fold'})
    ]
)

Check all processes including models:

pipe.ls_process()

Output:

     Step_0    Step_1         Step_2
  snv       roi_mean       KNeighborsClassifier
  snv       roi_mean       RandomForestClassifier
  snv       roi_median     KNeighborsClassifier
  snv       roi_median     RandomForestClassifier
  raw       roi_mean       KNeighborsClassifier
  raw       roi_mean       RandomForestClassifier
  raw       roi_median     KNeighborsClassifier
  raw       roi_median     RandomForestClassifier

4 Execute pipelines#

Run:
```
pipe.run()
```

5 Generated reports#

Pipeline execution data is saved to local storage. Use the following methods to retrieve the reports in the console:
```
result_summary = pipe.report_summary()
chain_results = pipe.report_chains()
```

Check summary reports

The summary reports include:

result_summary.keys()

Output:

dict_keys([
    'Macro_avg_performance_summary',
    'Marginal_macro_avg_AUC_stats_step_0',
    'Marginal_macro_avg_AUC_stats_step_1',
    'Marginal_macro_avg_AUC_stats_step_2',
    'Marginal_micro_avg_AUC_stats_step_0',
    'Marginal_micro_avg_AUC_stats_step_1',
    'Marginal_micro_avg_AUC_stats_step_2',
    'Micro_avg_performance_summary',
    'sample_targets_stats'])

Demonstration of macro-average performance metrics of classification:

result_summary['Macro_avg_performance_summary']

Output:

    Step_0   Step_1   Step_2  Precision  Recall  F1_Score  Accuracy    AUC
0  2_0_%#1  5_0_%#1  7_0_%#1       0.86    0.84      0.84      0.94   0.95
...
7  2_0_%#2  5_0_%#2  7_0_%#2       0.77    0.72      0.68      0.89   0.83
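
The summary keys distinguish macro from micro averages. A minimal sketch of the difference for precision, using hypothetical per-class counts: the macro average is the unweighted mean of the per-class scores, while the micro average pools the counts of all classes first. How the library computes its own figures is not shown here.

```python
# Hypothetical true-positive / false-positive counts per class.
counts = {
    "leaf_1": {"tp": 8, "fp": 2},
    "leaf_2": {"tp": 1, "fp": 1},
}


def macro_precision(class_counts):
    """Average the per-class precision values, each class weighted equally."""
    per_class = [c["tp"] / (c["tp"] + c["fp"]) for c in class_counts.values()]
    return sum(per_class) / len(per_class)


def micro_precision(class_counts):
    """Pool counts over all classes, then compute a single precision."""
    tp = sum(c["tp"] for c in class_counts.values())
    fp = sum(c["fp"] for c in class_counts.values())
    return tp / (tp + fp)


print(f"{macro_precision(counts):.2f}")  # 0.65
print(f"{micro_precision(counts):.2f}")  # 0.75
```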

Demonstration of marginal macro-average performance metrics of classification:

result_summary['Marginal_macro_avg_AUC_stats_step_0']

Output:

         Process_ID       All   2_0_%#1   2_0_%#2
   Process_label       All       snv       raw
       n_records         8         4         4
  Mean_AUC_macro      0.85      0.95      0.76
   Min_AUC_macro      0.63      0.94      0.63
Median_AUC_macro      0.91      0.95      0.76
   Max_AUC_macro      0.97      0.97      0.87
        p_vs_All      1.00      0.20      0.20
        p_vs_raw      0.20      1.00      0.03
        p_vs_snv      0.20      0.03      1.00
   effect_vs_All      0.00      0.46      0.46
  effect_vs_raw      0.46      0.00      0.94
  effect_vs_snv      0.46      0.94      0.00

The processes of the step (here the raw image and the standard normal variate) are compared using the non-parametric Wilcoxon signed-rank test.

Demonstration of Receiver-Operating-Characteristic curve:

chain_results[0]['ROC_curve']

Output:

Demo receiver operating characteristic curve

6 Regression demonstration#

6.1 Create a directory for regression results#

Create a directory for regression results

report_dir_reg = demo_dir + "demo_results_regression/"
os.makedirs(report_dir_reg)

6.2 Copy and update the previous pipelines to regression#

Copy and update SpecExp and SpecPipe instances

import copy

exp_reg = copy.deepcopy(exp)
pipe_reg = copy.deepcopy(pipe)
targets_reg = copy.deepcopy(targets)

Update report directory of SpecExp

exp_reg.report_directory = report_dir_reg

Modify the targets to numeric values; here the numbers approximate the age of the leaves

targets_reg["Target_value"] = [(5 - int(labl[0])) for labl in targets['Label']]

Assign the ROIs of the same leaf to one validation group to prevent data leakage

targets_reg["Validation_group"] = [f"leaf_{labl[0]}" for labl in targets['Label']]
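
To see why validation groups prevent leakage, the sketch below splits samples into folds so that all ROIs of one leaf land in the same test fold. It is a plain-Python illustration with a hypothetical round-robin assignment of groups to folds; how swectral assigns groups is not shown here. scikit-learn's `GroupKFold` is the analogous ready-made utility.

```python
labels = [f"{leaf}-{roi}" for leaf in range(1, 6) for roi in range(1, 6)]
groups = [f"leaf_{label.split('-')[0]}" for label in labels]


def group_folds(sample_groups, n_folds=2):
    """Yield (train_idx, test_idx); each group is tested in exactly one fold."""
    unique_groups = sorted(set(sample_groups))
    # Hypothetical rule: assign groups to folds in round-robin order.
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    for fold in range(n_folds):
        test_idx = [i for i, g in enumerate(sample_groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(sample_groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx


for train_idx, test_idx in group_folds(groups):
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    # No leaf contributes ROIs to both the training and the test side.
    assert train_groups.isdisjoint(test_groups)
    print(sorted(test_groups))
```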

Update target information using the modified target dataframe
```
exp_reg.sample_targets_from_df(targets_reg)
```

Check target values and validation groups

exp_reg.ls_targets()[["Label", "Target_value", "Validation_group"]]

6.3 Update the pipeline models to regressors#

Check and remove classification models

pipe_reg.ls_model()
pipe_reg.rm_model()

Update the data manager
```
pipe_reg.spec_exp = exp_reg
```

Create some regressors and add them to the pipeline:

from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rf_regressor = RandomForestRegressor(n_estimators=10)
knn_regressor = KNeighborsRegressor(n_neighbors=3)

pipe_reg.add_model([knn_regressor, rf_regressor], validation_method="2-fold")

6.4 Execute regression pipelines#

Run:
```
pipe_reg.run()
```

6.5 Check results of regression pipelines#

Retrieve reports in console

result_summary_reg = pipe_reg.report_summary()
chain_results_reg = pipe_reg.report_chains()

Check summary reports

The summary reports include:

result_summary_reg.keys()

Output:

dict_keys([
    'Marginal_R2_stats_step_0',
    'Marginal_R2_stats_step_1',
    'Marginal_R2_stats_step_2',
    'Performance_summary',
    'sample_targets_stats'])

Demonstration of performance summary content:

result_summary_reg['Performance_summary'].columns

Output:

Index([
    'Step_0', 'Step_1', 'Step_2',
    'Mean_Error', 'Standard_Deviation_of_Error', 'Mean_Absolute_Error',
    'Normalized_MAE', 'CV_MAE',
    'Mean_Squared_Error', 'Root_Mean_Squared_Error',
    'Normalized_RMSE', 'CV_RMSE',
    'Residual_Prediction_Deviation', 'R2'
], dtype='object')
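
Several of these columns have standard textbook definitions. A minimal sketch, assuming error = prediction minus observation and RPD = sample standard deviation of the observations divided by RMSE; the library's exact conventions may differ, and the normalized and CV variants are omitted because their definitions are not given here. The data are made up.

```python
from math import sqrt
from statistics import fmean, stdev


def regression_metrics(y_true, y_pred):
    """Compute a few common regression metrics from paired observations."""
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mse = fmean([e * e for e in errors])
    rmse = sqrt(mse)
    mean_true = fmean(y_true)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)
    return {
        "Mean_Error": fmean(errors),
        "Mean_Absolute_Error": fmean([abs(e) for e in errors]),
        "Mean_Squared_Error": mse,
        "Root_Mean_Squared_Error": rmse,
        "Residual_Prediction_Deviation": stdev(y_true) / rmse,
        "R2": 1 - ss_res / ss_tot,
    }


y_true = [4, 3, 2, 1, 0]            # made-up observations
y_pred = [3.5, 3.0, 2.5, 1.0, 0.5]  # made-up predictions
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```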

Check processing chain reports

For each chain, the reports include:

chain_results_reg[0].keys()

Output:

dict_keys([
    'Chain_processes',
    'Regression_performance',
    'Residual_analysis',
    'Residual_plot',
    'Scatter_plot',
    'Validation_results'])

Demonstration of the scatter plot of the processing chain:

chain_results_reg[0]['Scatter_plot']

Output: scatter plot of the first processing chain.

7 Feature engineering fittable tests#

Feature engineering and resampling fittables (data transformers and resamplers) are fitted during the model validation process and function as integrated parts of the model. To incorporate these transformers, use the model connector functions combine_classifier or combine_regressor. They are similar to sklearn.pipeline.Pipeline but more flexible, and they enable swectral pipeline analysis.
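
The idea of a fittable acting as part of the model can be sketched in plain Python: the combined object fits its transformer on the training data only and reuses those fitted statistics at prediction time. All class names below are hypothetical toys for illustration and are not the swectral connector API.

```python
from statistics import fmean


class MeanCenter:
    """Toy transformer: subtracts the mean learned during fit."""

    def fit(self, X):
        self.mean_ = fmean(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Toy estimator: predicts 1 when the transformed feature is positive."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [int(x > 0) for x in X]


class CombinedModel:
    """Transformer and estimator fitted together, as one model."""

    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator

    def fit(self, X, y):
        # The transformer sees the training data only.
        X_t = self.transformer.fit(X).transform(X)
        self.estimator.fit(X_t, y)
        return self

    def predict(self, X):
        # Test data is transformed with the statistics learned in fit.
        return self.estimator.predict(self.transformer.transform(X))


X_train, y_train = [1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]
X_test = [10.0, 0.0]
model = CombinedModel(MeanCenter(), SignClassifier()).fit(X_train, y_train)
print(model.transformer.mean_)  # 2.5, the training mean only
print(model.predict(X_test))    # [1, 0]
```

Because the transformer is refitted inside every validation fold, no statistic computed from the test samples reaches the model.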

This module includes a composer that generates batchwise combined models using a full factorial design:

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from swectral import IdentityTransformer  # Passthrough transformer for comparison

selector1 = SelectKBest(f_classif, k=5)  # Select 5 features
selector2 = IdentityTransformer()  # For passthrough (no selection)

from swectral import factorial_model_chains

models = factorial_model_chains(
    [StandardScaler(), IdentityTransformer()],  # Model step 1: test data scalers
    {'Feat5': selector1, 'FeatAll': selector2},  # Model step 2: test feature selection fittables
    # ...
    estimators={'KNN': knn_classifier, 'RF': rf_classifier},  # Estimators (specify custom labels using dictionary input)
    is_regression=False
)
print(models)

Output:

[CombinedClassifier_StandardScaler_Feat5_KNN,
 CombinedClassifier_StandardScaler_Feat5_RF,
 CombinedClassifier_StandardScaler_FeatAll_KNN,
 CombinedClassifier_StandardScaler_FeatAll_RF,
 CombinedClassifier_IdentityTransformer_Feat5_KNN,
 CombinedClassifier_IdentityTransformer_Feat5_RF,
 CombinedClassifier_IdentityTransformer_FeatAll_KNN,
 CombinedClassifier_IdentityTransformer_FeatAll_RF]

Add the generated models to your pipeline:

pipe.add_model(models, validation_method="2-fold")

Tutorials / Demos#

You can try out SpecPipe using the following example scripts:

Basic Usage — demo_script_of_readme.py
Typical workflow — demo_script_typical_workflow.py
Parallel for Windows — demo_script_windows_parallel.py