# Evaluation and Result
This page explains the detailed usage of the Evaluator and Result classes.
## Evaluator

`Evaluator` is a class for running evaluations on collected data and managing past evaluation results.
### Initialization

```python
from ml_debugger.evaluator import Evaluator

evaluator = Evaluator(
    model_name="resnet18",
    version_name="v1",
)
```
### Running Evaluation

```python
result = evaluator.request_evaluation()
```
**Optional Parameters**

```python
result = evaluator.request_evaluation(
    result_name="my_evaluation",  # Evaluation result identifier (optional)
    n_epoch="latest",             # Target epoch (optional)
)
```
| Parameter | Description | Default |
|---|---|---|
| `result_name` | Evaluation result identifier. Auto-generated if omitted | Auto-generated |
| `n_epoch` | Epoch to evaluate. `"latest"` for the latest epoch | All epochs |
**Explicitly specifying `n_epoch` is recommended**

By default, all epochs are evaluated, but explicitly specifying `n_epoch` is recommended to avoid unintended data mixing. `"latest"` determines the latest epoch based on timestamp, so restarting epochs with the same `version_name` may cause unintended data to be used.
See model_name / version_name for details.
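To make the caveat above concrete, here is a toy, self-contained sketch (plain Python, no MLdebugger required; the epoch records and their field names are hypothetical) of how a timestamp-based "latest" can pick a restarted epoch instead of the highest one:

```python
from datetime import datetime

# Hypothetical epoch records: a first run reached epoch 10, then training
# was restarted under the same version_name and logged epochs 1-3 later.
records = [
    {"n_epoch": 10, "logged_at": datetime(2025, 12, 18, 9, 0)},   # first run
    {"n_epoch": 3,  "logged_at": datetime(2025, 12, 19, 14, 0)},  # restarted run
]

# "Latest" by timestamp selects the restarted epoch 3, not epoch 10.
latest_by_time = max(records, key=lambda r: r["logged_at"])
print(latest_by_time["n_epoch"])  # -> 3
```

Passing an explicit integer (e.g. `n_epoch=10`) removes this ambiguity entirely.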
### Get Evaluation Results List

```python
results_df = evaluator.list_results()
```
**Optional Parameters**

```python
results_df = evaluator.list_results(
    result_name="my_evaluation",  # Filter by specific result_name (optional)
    method_name="default",        # Filter by evaluation method name (optional)
    n_epoch="latest",             # Filter by target epoch (optional)
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `result_name` | `Optional[str]` | `None` | Filter by specific `result_name`. `None` to get all results |
| `method_name` | `Optional[str]` | `None` | Filter by evaluation method name. `"default"` alias available |
| `n_epoch` | `Union[str, int, None]` | `"latest"` | Filter by target epoch. `"latest"` for the latest epoch, `"all"` for all epochs, an integer for a specific epoch |
**Example Output**

```text
                              result_name        method_name n_epoch  \
0  resnet18_v1_classification_v1_20251219  classification_v1    None

                    created_at                 completed_at options     status
0  2025-12-19T07:49:28.706773Z  2025-12-19T07:49:30.994131Z      {}  Completed
```
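Since `list_results()` returns a pandas DataFrame, ordinary pandas operations apply. The snippet below filters completed evaluations using a mock DataFrame that mirrors the example output above (the second row and its `result_name` are made up for illustration); in practice you would start from `results_df = evaluator.list_results()`:

```python
import pandas as pd

# Mock DataFrame mirroring the documented list_results() columns.
results_df = pd.DataFrame([
    {"result_name": "resnet18_v1_classification_v1_20251219",
     "method_name": "classification_v1", "n_epoch": None,
     "created_at": "2025-12-19T07:49:28.706773Z",
     "completed_at": "2025-12-19T07:49:30.994131Z",
     "options": {}, "status": "Completed"},
    {"result_name": "resnet18_v1_classification_v1_20251220",  # hypothetical
     "method_name": "classification_v1", "n_epoch": None,
     "created_at": "2025-12-20T08:00:00.000000Z",
     "completed_at": None, "options": {}, "status": "Running"},
])

# Keep only completed evaluations, newest first.
completed = (
    results_df[results_df["status"] == "Completed"]
    .sort_values("created_at", ascending=False)
)
print(completed["result_name"].tolist())
# -> ['resnet18_v1_classification_v1_20251219']
```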
### Get Specific Evaluation Result

```python
result = evaluator.get_result(result_name="resnet18_v1_classification_v1_20251219")
```
## Result

`Result` is a class representing a single evaluation result. It provides access to detailed metrics and error codes.
### Basic Information

```python
print(result.result_name)   # Evaluation result identifier
print(result.model_name)    # Model name
print(result.version_name)  # Version name
```
### Metrics Summary

```python
print(result.metrics_summary())

# Filter by specific dataset_type
print(result.metrics_summary(dataset_type="train"))
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_type` | `Optional[str]` | `None` | Filter output by `dataset_type`. `None` to show all dataset_types |
Example output (Classification task):

```text
dataset  counts  accuracy  auroc  auprc  net_entropy_loss  net_entropy  error_proba_auroc  error_proba_auprc
train      5000     0.098  0.457  0.094         12875.500    10480.190              0.857              0.453
```
The output metrics vary depending on the task.

**Classification**

| Metric | Description |
|---|---|
| `counts` | Number of data points |
| `accuracy` | Accuracy |
| `auroc` | Area Under the ROC Curve |
| `auprc` | Area Under the PR Curve |
| `net_entropy_loss` | Total entropy loss |
| `net_entropy` | Total entropy |
| `error_proba_auroc` | AUROC of error probability estimation |
| `error_proba_auprc` | AUPRC of error probability estimation |
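For intuition only, the snippet below sketches how accuracy and a summed ("net") entropy could be computed from per-sample predicted probabilities. This is an illustrative interpretation of the metric names, not MLdebugger's exact implementation:

```python
import math

# Toy predicted class-probability vectors and ground-truth labels.
probs = [
    [0.9, 0.1],  # confident, correct
    [0.6, 0.4],  # uncertain, correct
    [0.2, 0.8],  # confident, wrong (true label is 0)
]
labels = [0, 0, 0]

# Accuracy: fraction of argmax predictions matching the label.
preds = [p.index(max(p)) for p in probs]
accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

# "Net" entropy interpreted as the sum of per-sample predictive
# entropies (in nats), as an illustrative assumption.
net_entropy = sum(-sum(q * math.log(q) for q in p if q > 0) for p in probs)

print(round(accuracy, 3))     # -> 0.667
print(round(net_entropy, 3))  # -> 1.498
```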
**Object Detection (2D)**

AP / AR metrics are computed based on COCOeval with max detections = [1, 10, 100].

| Metric | Description |
|---|---|
| `counts` | Number of detections (predicted + GT) |
| `counts_pred_boxes` / `counts_gt_boxes` | Total predicted / ground-truth bounding boxes |
| `counts_img` | Number of images |
| `matched_ratio` | Ratio of matched detections |
| `AP` / `AP_50` / `AP_75` | Average Precision (overall / IoU=0.50 / IoU=0.75) |
| `AP_small` / `AP_medium` / `AP_large` | AP by object size |
| `AR_1` / `AR_10` / `AR_100` | Average Recall by max detections |
| `AR_small` / `AR_medium` / `AR_large` | AR by object size |
| `average_iou_matched` / `average_iou` | Average IoU for matched / all detections |
| `average_score` | Average prediction score |
| `accuracy` | Classification accuracy |
| `net_entropy` / `net_entropy_loss` | Total entropy / entropy loss |
| `net_focal_loss` | Total focal loss |
| `average_logit_margin` | Average logit margin |
| `error_proba_auroc` / `error_proba_auprc` | AUROC / AUPRC of error probability estimation |
**3D Object Detection**

In addition to the common OD metrics (counts, matched_ratio, accuracy, entropy, focal loss, logit margin, error_proba, etc.), the AP / AR metrics are replaced with the following 3D-specific metrics.

| Metric | Description |
|---|---|
| `mAP_3D` / `mAP_BEV` | 3D / BEV mean Average Precision |
| `AP_3D_025` / `AP_3D_05` / `AP_3D_07` | 3D AP (IoU threshold 0.25 / 0.50 / 0.70) |
| `AP_BEV_025` / `AP_BEV_05` / `AP_BEV_07` | BEV AP (IoU threshold 0.25 / 0.50 / 0.70) |
| `average_iou_3d` / `average_iou_bev` | Average 3D IoU / average BEV IoU |
### Issue Category Summary

```python
print(result.issue_category_summary())

# Filter by specific dataset_type
print(result.issue_category_summary(dataset_type="train"))
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_type` | `Optional[str]` | `None` | Filter output by `dataset_type`. `None` to show all dataset_types |
Example output:

```text
dataset  stable_coverage_ratio  operational_coverage_ratio  hotspot_ratio  recessive_hotspot_ratio  critical_hotspot_ratio  aleatoric_hotspot_ratio
train                    0.000                       0.001          0.753                    0.226                   0.019                    0.006
```
#### Issue Category Description

MLdebugger classifies data points into six categories based on the stability of internal features and error probability.
| Category | Description | Recommended Action |
|---|---|---|
| `stable_coverage` (Highly Stable) | Region where internal features are stable and prediction reliability is high | No action needed |
| `operational_coverage` (Stable) | Region where model confidence is not high, but internal features are stable and acceptable for operational use | Add data (low priority), continuous monitoring |
| `hotspot` (Unstable) | Region where internal features are unstable and prediction reliability cannot be guaranteed | Add data, retrain, data augmentation |
| `recessive_hotspot` (Under-Confidence) | Region where model confidence is low and errors are easier to anticipate | Confidence filtering, human-in-the-loop, model branching |
| `critical_hotspot` (Over-Confidence) | Region where errors are likely to occur frequently based on internal features | Highest priority: add data, hard negative mining, consider ensembles |
| `aleatoric_hotspot` (Outlier) | Region where features are insufficient and model learning is difficult | Fix annotations, reconsider the task definition, exclude via OOD detection |
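A quick triage based on these categories is to total the data points per category from the Issue list. The snippet below uses a mock DataFrame mirroring the documented `get_issues()` columns (the `error_code` values and counts are hypothetical); in practice you would start from `issues_df = result.get_issues()`:

```python
import pandas as pd

# Mock of the get_issues() DataFrame with the documented columns.
issues_df = pd.DataFrame([
    {"issue_id": 1, "dataset_type": "train", "category": "hotspot",
     "error_code": "E01", "counts": 120, "counts_ratio": 0.60},
    {"issue_id": 2, "dataset_type": "train", "category": "hotspot",
     "error_code": "E02", "counts": 30, "counts_ratio": 0.15},
    {"issue_id": 3, "dataset_type": "train", "category": "critical_hotspot",
     "error_code": "E03", "counts": 50, "counts_ratio": 0.25},
])

# Total data points per category, largest first.
per_category = (
    issues_df.groupby("category")["counts"].sum().sort_values(ascending=False)
)
print(per_category.to_dict())  # -> {'hotspot': 150, 'critical_hotspot': 50}
```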
### Detailed Summary

```python
result.get_summary()

# Filter by specific dataset_type
result.get_summary(dataset_type="train")
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_type` | `Optional[str]` | `None` | Filter output by `dataset_type`. `None` to show all dataset_types |
This method displays a detailed summary including:
- Evaluation information (model, version, result_name)
- Metrics summary
- Issue Category summary
- Detailed error code distribution for each category
### Get Issue List

```python
issues_df = result.get_issues()
```

Get a list of all Issues (error codes) as a DataFrame.
**Columns**

| Column | Description |
|---|---|
| `issue_id` | Issue ID |
| `dataset_type` | Dataset type |
| `category` | Issue Category |
| `error_code` | Error code |
| `debug_code` | Debug code (epistemic/aleatoric) |
| `counts_ratio` | Ratio to the total |
| `counts` | Number of data points |
| Others | Task-specific metrics |
For Object Detection (2D / 3D), additional columns are included: `diagnosis` (diagnosis result), `counts_pred_boxes` / `counts_gt_boxes` (predicted / GT bbox counts), `average_iou`, and `average_score`.
For 3D OD, `average_iou_3d` and `average_iou_bev` are also included.
### Task-specific Error Codes

The format of `error_code` and `debug_code` varies by task.
See Error Codes - Classification and Error Codes - Object Detection for details.
### Get Custom View

A feature for visualizing Issue status under arbitrary conditions. If you recorded custom metadata (additional columns) during data collection with Tracer, you can visualize how the Issue distribution differs across each metadata value.

```python
view_df = result.get_view(
    groupby=["category", "error_code"],
    adjustby="category",
)
```
**Parameters**

| Parameter | Type | Description |
|---|---|---|
| `groupby` | `List[str]` | List of columns to group by |
| `adjustby` | `Optional[str]` | Reference column for ratio adjustment |
| `query` | `Optional[str]` | Filtering condition (pandas query format) |
**Basic Usage Examples**

```python
# Get only the Hotspot category
hotspot_view = result.get_view(
    query="category == 'hotspot'",
    groupby=["error_code"],
)

# Aggregate by dataset_type
dataset_view = result.get_view(
    groupby=["dataset_type", "category"],
)
```
#### Analysis with Custom Metadata

If you defined additional columns during data collection with Tracer, you can group by those columns to analyze the Issue distribution.

```python
# Object Detection: check the error distribution by object size
size_view = result.get_view(
    groupby=["object_size", "category"],
    adjustby="object_size",
)

# Classification: check the error distribution by shooting conditions
condition_view = result.get_view(
    groupby=["lighting_condition", "category"],
    adjustby="lighting_condition",
)
```
**Utilizing Custom Metadata**

By defining `additional_columns` during data collection with Tracer, you can record arbitrary metadata.
This enables the following types of analysis:
- Object Detection: Error occurrence by object size, occlusion presence, etc.
- Classification: Error distribution by shooting environment, data source, etc.
- Common: Condition-based analysis by time of day, device, region, etc.
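As a sketch of post-processing such a view with plain pandas, the snippet below pivots a mock `view_df` to compare hotspot shares across shooting conditions. The exact columns returned by `get_view` are an assumption here (the `groupby` keys plus a `counts` column), and the condition values are made up for illustration:

```python
import pandas as pd

# Mock of a get_view(groupby=["lighting_condition", "category"]) result.
view_df = pd.DataFrame([
    {"lighting_condition": "day",   "category": "stable_coverage", "counts": 900},
    {"lighting_condition": "day",   "category": "hotspot",         "counts": 100},
    {"lighting_condition": "night", "category": "stable_coverage", "counts": 300},
    {"lighting_condition": "night", "category": "hotspot",         "counts": 700},
])

# Pivot to compare the hotspot share across shooting conditions.
pivot = view_df.pivot_table(
    index="lighting_condition", columns="category", values="counts", fill_value=0
)
hotspot_share = pivot["hotspot"] / pivot.sum(axis=1)
print(hotspot_share.round(2).to_dict())  # -> {'day': 0.1, 'night': 0.7}
```

A lopsided share like the one above would suggest prioritizing night-time data collection.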
### About Error Codes

The format and interpretation of error codes vary by task.
## Complete Sample Code

```python
from ml_debugger.evaluator import Evaluator

# Initialize Evaluator
evaluator = Evaluator(
    model_name="resnet18",
    version_name="v1",
)

# Run evaluation
result = evaluator.request_evaluation(
    result_name="experiment_20251219",
    n_epoch="latest",
)

# Review results
print("=== Metrics Summary ===")
print(result.metrics_summary())

print("\n=== Issue Category Summary ===")
print(result.issue_category_summary())

print("\n=== Detailed Summary ===")
result.get_summary()

# Get the Issue list as a DataFrame
issues_df = result.get_issues()
print(f"\nTotal issues: {len(issues_df)}")

# Check Hotspot details
hotspot_issues = issues_df[issues_df["category"] == "hotspot"]
print(f"Hotspot issues: {len(hotspot_issues)}")

# Get a custom view
view = result.get_view(
    groupby=["category"],
    adjustby=None,
)
print("\n=== Category Distribution ===")
print(view)
```
## Next Steps
- Error Codes - Classification - Classification error code definitions
- Error Codes - Object Detection - Object Detection error code definitions