
Evaluation and Result

This page explains the detailed usage of the Evaluator and Result classes.

Evaluator

Evaluator is a class for running evaluations on collected data and managing past evaluation results.

Initialization

from ml_debugger.evaluator import Evaluator

evaluator = Evaluator(
    model_name="resnet18",
    version_name="v1",
)

Running Evaluation

result = evaluator.request_evaluation()

Optional Parameters

result = evaluator.request_evaluation(
    result_name="my_evaluation",    # Evaluation result identifier (optional)
    n_epoch="latest",               # Target epoch (optional)
)
| Parameter | Description | Default |
| --- | --- | --- |
| result_name | Evaluation result identifier; auto-generated if omitted | Auto-generated |
| n_epoch | Epoch to evaluate; "latest" for the latest epoch | All epochs |

Explicitly specifying n_epoch is recommended

By default, all epochs are evaluated. To avoid unintended data mixing, however, explicitly specifying n_epoch is recommended: "latest" resolves the latest epoch by timestamp, so restarting training from an earlier epoch under the same version_name may cause unintended data to be used. See model_name / version_name for details.
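The pitfall can be illustrated with a small stand-alone sketch (plain Python, not the library's internals). If epoch 3 was re-run after epoch 5 under the same version_name, resolving "latest" by timestamp picks epoch 3 rather than the highest epoch number:

```python
from datetime import datetime

# Hypothetical collected-epoch records: (n_epoch, collected_at).
# Epoch 3 was re-run *after* epoch 5 under the same version_name.
records = [
    (4, datetime(2025, 12, 19, 7, 10)),
    (5, datetime(2025, 12, 19, 7, 20)),
    (3, datetime(2025, 12, 19, 8, 0)),  # restarted run, newest timestamp
]

latest_by_time = max(records, key=lambda r: r[1])[0]  # what "latest" resolves to
highest_epoch = max(records, key=lambda r: r[0])[0]   # what you might expect

print(latest_by_time)  # 3 -> not epoch 5!
print(highest_epoch)   # 5
```

Passing an explicit integer (e.g. `n_epoch=5`) sidesteps the ambiguity entirely.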

Get Evaluation Results List

results_df = evaluator.list_results()

Optional Parameters

results_df = evaluator.list_results(
    result_name="my_evaluation",    # Filter by specific result_name (optional)
    method_name="default",          # Filter by evaluation method name (optional)
    n_epoch="latest",               # Filter by target epoch (optional)
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| result_name | Optional[str] | None | Filter by specific result_name. None to get all results |
| method_name | Optional[str] | None | Filter by evaluation method name. "default" alias available |
| n_epoch | Union[str, int, None] | "latest" | Filter by target epoch. "latest" for latest, "all" for all epochs, integer for specific epoch |

Example Output

                                    result_name        method_name n_epoch  \
0  resnet18_v1_classification_v1_20251219  classification_v1    None

                    created_at                 completed_at options     status
0  2025-12-19T07:49:28.706773Z  2025-12-19T07:49:30.994131Z      {}  Completed
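Because list_results returns a pandas DataFrame, standard DataFrame operations apply to it. A minimal sketch on a toy frame mirroring the columns above (the rows are made up, not real output):

```python
import pandas as pd

# Toy frame mirroring the columns shown above (rows are illustrative only)
results_df = pd.DataFrame({
    "result_name": ["resnet18_v1_classification_v1_20251219", "my_evaluation"],
    "method_name": ["classification_v1", "classification_v1"],
    "status": ["Completed", "Running"],
    "created_at": ["2025-12-19T07:49:28.706773Z", "2025-12-19T09:00:00.000000Z"],
})
results_df["created_at"] = pd.to_datetime(results_df["created_at"])

# Keep only finished evaluations, newest first
completed = results_df[results_df["status"] == "Completed"].sort_values(
    "created_at", ascending=False
)
print(completed["result_name"].tolist())  # ['resnet18_v1_classification_v1_20251219']
```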

Get Specific Evaluation Result

result = evaluator.get_result(result_name="resnet18_v1_classification_v1_20251219")

Result

Result is a class representing a single evaluation result. It provides detailed access to metrics and error codes.

Basic Information

print(result.result_name)   # Evaluation result identifier
print(result.model_name)    # Model name
print(result.version_name)  # Version name

Metrics Summary

print(result.metrics_summary())

# Filter by specific dataset_type
print(result.metrics_summary(dataset_type="train"))
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_type | Optional[str] | None | Filter output by dataset_type. None to show all dataset_types |

Example output (Classification task):

dataset  counts  accuracy  auroc  auprc  net_entropy_loss  net_entropy  error_proba_auroc  error_proba_auprc
train    5000    0.098     0.457  0.094  12875.500         10480.190    0.857              0.453

The output metrics vary depending on the task.

| Metric | Description |
| --- | --- |
| counts | Number of data points |
| accuracy | Accuracy |
| auroc | Area Under ROC Curve |
| auprc | Area Under PR Curve |
| net_entropy_loss | Total entropy loss |
| net_entropy | Total entropy |
| error_proba_auroc | AUROC of error probability estimation |
| error_proba_auprc | AUPRC of error probability estimation |
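To make the entropy columns concrete, here is a stand-alone sketch of the standard definitions (my own illustration, not the library's implementation; reading "net" as a sum over data points is an assumption):

```python
import math

# Toy predicted class probabilities and ground-truth labels
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
]
labels = [0, 1, 2]

# accuracy: fraction of argmax predictions matching the label
preds = [p.index(max(p)) for p in probs]
accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

# assuming "net" = summed over all data points:
net_entropy_loss = sum(-math.log(p[y]) for p, y in zip(probs, labels))       # cross-entropy of the true class
net_entropy = sum(-sum(q * math.log(q) for q in p if q > 0) for p in probs)  # predictive entropy

print(round(accuracy, 3))  # 0.667
```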

For Object Detection (2D) tasks, AP / AR metrics are computed based on COCOeval with max detections = [1, 10, 100].

| Metric | Description |
| --- | --- |
| counts | Number of detections (predicted + GT) |
| counts_pred_boxes / counts_gt_boxes | Total predicted / ground truth bounding boxes |
| counts_img | Number of images |
| matched_ratio | Ratio of matched detections |
| AP / AP_50 / AP_75 | Average Precision (overall / IoU=0.50 / IoU=0.75) |
| AP_small / AP_medium / AP_large | AP by object size |
| AR_1 / AR_10 / AR_100 | Average Recall by max detections |
| AR_small / AR_medium / AR_large | AR by object size |
| average_iou_matched / average_iou | Average IoU for matched / all detections |
| average_score | Average prediction score |
| accuracy | Classification accuracy |
| net_entropy / net_entropy_loss | Total entropy / entropy loss |
| net_focal_loss | Total focal loss |
| average_logit_margin | Average logit margin |
| error_proba_auroc / error_proba_auprc | AUROC / AUPRC of error probability estimation |
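Most of the IoU-based columns reduce to plain box overlap. A minimal axis-aligned 2D IoU, as an illustrative sketch rather than the library's code:

```python
def iou(box_a, box_b):
    """Axis-aligned IoU for boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.14285714285714285 (= 1/7)
```

AP_50 / AP_75 then count a prediction as a true positive only when its IoU with a ground-truth box exceeds 0.50 / 0.75.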

For 3D Object Detection, the common OD metrics (counts, matched_ratio, accuracy, entropy, focal loss, logit margin, error_proba, etc.) still apply, but the AP / AR metrics are replaced with the following 3D-specific metrics.

| Metric | Description |
| --- | --- |
| mAP_3D / mAP_BEV | 3D / BEV mean Average Precision |
| AP_3D_025 / AP_3D_05 / AP_3D_07 | 3D AP (IoU threshold 0.25 / 0.50 / 0.70) |
| AP_BEV_025 / AP_BEV_05 / AP_BEV_07 | BEV AP (IoU threshold 0.25 / 0.50 / 0.70) |
| average_iou_3d / average_iou_bev | Average 3D IoU / Average BEV IoU |

Issue Category Summary

print(result.issue_category_summary())

# Filter by specific dataset_type
print(result.issue_category_summary(dataset_type="train"))
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_type | Optional[str] | None | Filter output by dataset_type. None to show all dataset_types |

Example output:

dataset  stable_coverage_ratio  operational_coverage_ratio  hotspot_ratio  recessive_hotspot_ratio  critical_hotspot_ratio  aleatoric_hotspot_ratio
train    0.000                  0.001                       0.753          0.226                    0.019                   0.006

Issue Category Description

MLdebugger classifies data points into 6 categories based on the stability of internal features and error probability.

| Category | Description | Recommended Action |
| --- | --- | --- |
| stable_coverage (Highly Stable) | Region where internal features are stable and prediction reliability is high | No action needed |
| operational_coverage (Stable) | Region where model confidence is not high but internal features are stable and acceptable for operational use | Add data (low priority), continuous monitoring |
| hotspot (Unstable) | Region where internal features are unstable and prediction reliability cannot be guaranteed | Add data, retrain, Data Augmentation |
| recessive_hotspot (Under-Confidence) | Region where model confidence is low and errors are easier to anticipate | Confidence filtering, Human-in-the-loop, model branching |
| critical_hotspot (Over-Confidence) | Region where errors are likely to occur frequently based on internal features | Highest priority: Add data, Hard Negative Mining, consider ensemble |
| aleatoric_hotspot (Outlier) | Region where features are insufficient and model learning is difficult | Fix annotations, reconsider task definition, exclude via OOD detection |
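The `*_ratio` columns in the summary above are per-category data-point counts divided by the total. A stand-alone sketch with hypothetical counts (not the library's implementation):

```python
# Hypothetical per-category data-point counts
counts = {
    "stable_coverage": 700,
    "operational_coverage": 150,
    "hotspot": 90,
    "recessive_hotspot": 40,
    "critical_hotspot": 15,
    "aleatoric_hotspot": 5,
}

total = sum(counts.values())
ratios = {f"{name}_ratio": round(n / total, 3) for name, n in counts.items()}

print(ratios["hotspot_ratio"])  # 0.09
```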

Detailed Summary

result.get_summary()

# Filter by specific dataset_type
result.get_summary(dataset_type="train")
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_type | Optional[str] | None | Filter output by dataset_type. None to show all dataset_types |

This method displays a detailed summary including:

  • Evaluation information (model, version, result_name)
  • Metrics summary
  • Issue Category summary
  • Detailed error code distribution for each category

Get Issue List

issues_df = result.get_issues()

Get a list of all Issues (error codes) as a DataFrame.

Columns

| Column | Description |
| --- | --- |
| issue_id | Issue ID |
| dataset_type | Dataset type |
| category | Issue Category |
| error_code | Error code |
| debug_code | Debug code (epistemic/aleatoric) |
| counts_ratio | Ratio to total |
| counts | Number of data points |
| Others | Task-specific metrics |

For Object Detection (2D / 3D), additional columns are included: diagnosis (diagnosis result), counts_pred_boxes / counts_gt_boxes (predicted/GT bbox counts), average_iou, and average_score. For 3D OD, average_iou_3d and average_iou_bev are also included.

Task-specific Error Codes

The format of error_code and debug_code varies by task. See Error Codes - Classification and Error Codes - Object Detection for details.

Get Custom View

get_view visualizes Issue status under arbitrary conditions. If you recorded custom metadata (additional columns) during data collection with Tracer, you can compare Issue distributions across each metadata value.

view_df = result.get_view(
    groupby=["category", "error_code"],
    adjustby="category",
)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| groupby | List[str] | List of columns to group by |
| adjustby | Optional[str] | Reference column for ratio adjustment |
| query | Optional[str] | Filtering condition (pandas query format) |
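As a rough reading of the groupby / adjustby semantics (a plain-pandas sketch of the idea, not the library's actual implementation): rows are aggregated over the groupby columns, and when adjustby is set, ratios are normalized within each value of that column:

```python
import pandas as pd

# Toy issue table with a hypothetical custom metadata column "object_size"
df = pd.DataFrame({
    "object_size": ["small", "small", "large", "large"],
    "category":    ["hotspot", "stable_coverage", "hotspot", "stable_coverage"],
    "counts":      [30, 70, 10, 90],
})

# groupby=["object_size", "category"]: aggregate counts per group
grouped = df.groupby(["object_size", "category"], as_index=False)["counts"].sum()

# adjustby="object_size": ratios sum to 1 within each object_size value
grouped["ratio"] = grouped["counts"] / grouped.groupby("object_size")["counts"].transform("sum")
print(grouped)
```

With this normalization, a high hotspot ratio for "small" objects is directly comparable to the ratio for "large" objects, even when the raw counts differ.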

Basic Usage Examples

# Get only Hotspot category
hotspot_view = result.get_view(
    query="category == 'hotspot'",
    groupby=["error_code"],
)

# Aggregate by dataset_type
dataset_view = result.get_view(
    groupby=["dataset_type", "category"],
)

Analysis with Custom Metadata

If you defined additional columns during data collection with Tracer, you can group by those columns to analyze Issue distribution.

# Object Detection: Check error distribution by object size
size_view = result.get_view(
    groupby=["object_size", "category"],
    adjustby="object_size",
)

# Classification: Check error distribution by shooting conditions
condition_view = result.get_view(
    groupby=["lighting_condition", "category"],
    adjustby="lighting_condition",
)

Utilizing Custom Metadata

By defining additional_columns during data collection with Tracer, you can record arbitrary metadata. This enables the following types of analysis:

  • Object Detection: Error occurrence by object size, occlusion presence, etc.
  • Classification: Error distribution by shooting environment, data source, etc.
  • Common: Condition-based analysis by time of day, device, region, etc.

About Error Codes

The format and interpretation of error codes vary by task.

Complete Sample Code

from ml_debugger.evaluator import Evaluator

# Initialize Evaluator
evaluator = Evaluator(
    model_name="resnet18",
    version_name="v1",
)

# Run evaluation
result = evaluator.request_evaluation(
    result_name="experiment_20251219",
    n_epoch="latest",
)

# Review results
print("=== Metrics Summary ===")
print(result.metrics_summary())

print("\n=== Issue Category Summary ===")
print(result.issue_category_summary())

print("\n=== Detailed Summary ===")
result.get_summary()

# Get Issue list as DataFrame
issues_df = result.get_issues()
print(f"\nTotal issues: {len(issues_df)}")

# Check Hotspot details
hotspot_issues = issues_df[issues_df["category"] == "hotspot"]
print(f"Hotspot issues: {len(hotspot_issues)}")

# Get custom view
view = result.get_view(
    groupby=["category"],
    adjustby=None,
)
print("\n=== Category Distribution ===")
print(view)

Next Steps