# Evaluation and Result
This page explains the detailed usage of the Evaluator and Result classes.
## Evaluator

`Evaluator` is a class for running evaluations on collected data and managing past evaluation results.
### Initialization

```python
from ml_debugger.evaluator import Evaluator

evaluator = Evaluator(
    model_name="resnet18",
    version_name="v1",
)
```
### Running Evaluation

```python
result = evaluator.request_evaluation()
```
**Optional Parameters**

```python
result = evaluator.request_evaluation(
    result_name="my_evaluation",  # Evaluation result identifier (optional)
    n_epoch="latest",             # Target epoch (optional)
)
```
| Parameter | Description | Default |
|---|---|---|
| `result_name` | Evaluation result identifier. Auto-generated if omitted | Auto-generated |
| `n_epoch` | Epoch to evaluate. `"latest"` for the latest epoch | All epochs |
**Explicitly specifying `n_epoch` is recommended**

By default, all epochs are evaluated, but explicitly specifying `n_epoch` is recommended to avoid unintended data mixing. `"latest"` determines the latest epoch based on timestamp, so restarting epochs with the same `version_name` may cause unintended data to be used.
See model_name / version_name for details.
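To make the caveat above concrete, here is a toy, self-contained sketch (plain Python, no MLdebugger required; the epoch records and their field names are hypothetical) of how a timestamp-based "latest" can pick a restarted epoch instead of the highest one:

```python
from datetime import datetime

# Hypothetical epoch records: a first run reached epoch 10, then training
# was restarted under the same version_name and logged epochs 1-3 later.
records = [
    {"n_epoch": 10, "logged_at": datetime(2025, 12, 18, 9, 0)},   # first run
    {"n_epoch": 3,  "logged_at": datetime(2025, 12, 19, 14, 0)},  # restarted run
]

# "Latest" by timestamp selects the restarted epoch 3, not epoch 10.
latest_by_time = max(records, key=lambda r: r["logged_at"])
print(latest_by_time["n_epoch"])  # -> 3
```

Passing an explicit integer (e.g. `n_epoch=10`) removes this ambiguity entirely.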
### Get Evaluation Results List

```python
results_df = evaluator.list_results()
```
**Optional Parameters**

```python
results_df = evaluator.list_results(
    result_name="my_evaluation",  # Filter by specific result_name (optional)
    method_name="default",        # Filter by evaluation method name (optional)
    n_epoch="latest",             # Filter by target epoch (optional)
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `result_name` | `Optional[str]` | `None` | Filter by specific `result_name`. `None` to get all results |
| `method_name` | `Optional[str]` | `None` | Filter by evaluation method name. `"default"` alias available |
| `n_epoch` | `Union[str, int, None]` | `"latest"` | Filter by target epoch. `"latest"` for the latest epoch, `"all"` for all epochs, an integer for a specific epoch |
**Example Output**

```text
                              result_name        method_name n_epoch  \
0  resnet18_v1_classification_v1_20251219  classification_v1    None

                    created_at                 completed_at options     status
0  2025-12-19T07:49:28.706773Z  2025-12-19T07:49:30.994131Z      {}  Completed
```
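Since `list_results()` returns a pandas DataFrame, ordinary pandas operations apply. The snippet below filters completed evaluations using a mock DataFrame that mirrors the example output above (the second row and its `result_name` are made up for illustration); in practice you would start from `results_df = evaluator.list_results()`:

```python
import pandas as pd

# Mock DataFrame mirroring the documented list_results() columns.
results_df = pd.DataFrame([
    {"result_name": "resnet18_v1_classification_v1_20251219",
     "method_name": "classification_v1", "n_epoch": None,
     "created_at": "2025-12-19T07:49:28.706773Z",
     "completed_at": "2025-12-19T07:49:30.994131Z",
     "options": {}, "status": "Completed"},
    {"result_name": "resnet18_v1_classification_v1_20251220",  # hypothetical
     "method_name": "classification_v1", "n_epoch": None,
     "created_at": "2025-12-20T08:00:00.000000Z",
     "completed_at": None, "options": {}, "status": "Running"},
])

# Keep only completed evaluations, newest first.
completed = (
    results_df[results_df["status"] == "Completed"]
    .sort_values("created_at", ascending=False)
)
print(completed["result_name"].tolist())
# -> ['resnet18_v1_classification_v1_20251219']
```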
### Get Specific Evaluation Result

```python
result = evaluator.get_result(result_name="resnet18_v1_classification_v1_20251219")
```
## Result

`Result` is a class representing a single evaluation result. It provides access to detailed metrics and error codes.
### Basic Information

```python
print(result.result_name)   # Evaluation result identifier
print(result.model_name)    # Model name
print(result.version_name)  # Version name
```
### Metrics Summary

```python
print(result.metrics_summary())

# Filter by specific dataset_type
print(result.metrics_summary(dataset_type="train"))
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_type` | `Optional[str]` | `None` | Filter output by `dataset_type`. `None` to show all dataset_types |
Example output (Classification task):

```text
dataset  counts  accuracy  auroc  auprc  net_entropy_loss  net_entropy  error_proba_auroc  error_proba_auprc
train      5000     0.098  0.457  0.094         12875.500    10480.190              0.857              0.453
```
The output metrics vary depending on the task.

**Classification**

| Metric | Description |
|---|---|
| `counts` | Number of data points |
| `accuracy` | Accuracy |
| `auroc` | Area Under the ROC Curve |
| `auprc` | Area Under the PR Curve |
| `net_entropy_loss` | Total entropy loss |
| `net_entropy` | Total entropy |
| `error_proba_auroc` | AUROC of error probability estimation |
| `error_proba_auprc` | AUPRC of error probability estimation |
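For intuition only, the snippet below sketches how accuracy and a summed ("net") entropy could be computed from per-sample predicted probabilities. This is an illustrative interpretation of the metric names, not MLdebugger's exact implementation:

```python
import math

# Toy predicted class-probability vectors and ground-truth labels.
probs = [
    [0.9, 0.1],  # confident, correct
    [0.6, 0.4],  # uncertain, correct
    [0.2, 0.8],  # confident, wrong (true label is 0)
]
labels = [0, 0, 0]

# Accuracy: fraction of argmax predictions matching the label.
preds = [p.index(max(p)) for p in probs]
accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

# "Net" entropy interpreted as the sum of per-sample predictive
# entropies (in nats), as an illustrative assumption.
net_entropy = sum(-sum(q * math.log(q) for q in p if q > 0) for p in probs)

print(round(accuracy, 3))     # -> 0.667
print(round(net_entropy, 3))  # -> 1.498
```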
**Object Detection (2D)**

AP / AR metrics are computed based on COCOeval with max detections = [1, 10, 100].

| Metric | Description |
|---|---|
| `counts` | Number of detections (predicted + GT) |
| `counts_pred_boxes` / `counts_gt_boxes` | Total predicted / ground-truth bounding boxes |
| `counts_img` | Number of images |
| `matched_ratio` | Ratio of matched detections |
| `AP` / `AP_50` / `AP_75` | Average Precision (overall / IoU=0.50 / IoU=0.75) |
| `AP_small` / `AP_medium` / `AP_large` | AP by object size |
| `AR_1` / `AR_10` / `AR_100` | Average Recall by max detections |
| `AR_small` / `AR_medium` / `AR_large` | AR by object size |
| `average_iou_matched` / `average_iou` | Average IoU for matched / all detections |
| `average_score` | Average prediction score |
| `accuracy` | Classification accuracy |
| `net_entropy` / `net_entropy_loss` | Total entropy / entropy loss |
| `net_focal_loss` | Total focal loss |
| `average_logit_margin` | Average logit margin |
| `error_proba_auroc` / `error_proba_auprc` | AUROC / AUPRC of error probability estimation |
**3D Object Detection**

In addition to the common OD metrics (counts, matched_ratio, accuracy, entropy, focal loss, logit margin, error_proba, etc.), the AP / AR metrics are replaced with the following 3D-specific metrics.

| Metric | Description |
|---|---|
| `mAP_3D` / `mAP_BEV` | 3D / BEV mean Average Precision |
| `AP_3D_025` / `AP_3D_05` / `AP_3D_07` | 3D AP (IoU threshold 0.25 / 0.50 / 0.70) |
| `AP_BEV_025` / `AP_BEV_05` / `AP_BEV_07` | BEV AP (IoU threshold 0.25 / 0.50 / 0.70) |
| `average_iou_3d` / `average_iou_bev` | Average 3D IoU / average BEV IoU |
### Issue Category Summary

```python
print(result.issue_category_summary())

# Filter by specific dataset_type
print(result.issue_category_summary(dataset_type="train"))
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_type` | `Optional[str]` | `None` | Filter output by `dataset_type`. `None` to show all dataset_types |
Example output:

```text
dataset  stable_coverage_ratio  operational_coverage_ratio  hotspot_ratio  recessive_hotspot_ratio  critical_hotspot_ratio  aleatoric_hotspot_ratio
train                    0.000                       0.001          0.753                    0.226                   0.019                    0.006
```
#### Issue Category Description

MLdebugger classifies data points into six categories based on the stability of internal features and error probability.
| Category | Description | Recommended Action |
|---|---|---|
| `stable_coverage` (Highly Stable) | Region where internal features are stable and prediction reliability is high | No action needed |
| `operational_coverage` (Stable) | Region where model confidence is not high, but internal features are stable and acceptable for operational use | Add data (low priority), continuous monitoring |
| `hotspot` (Unstable) | Region where internal features are unstable and prediction reliability cannot be guaranteed | Add data, retrain, data augmentation |
| `recessive_hotspot` (Under-Confidence) | Region where model confidence is low and errors are easier to anticipate | Confidence filtering, human-in-the-loop, model branching |
| `critical_hotspot` (Over-Confidence) | Region where errors are likely to occur frequently based on internal features | Highest priority: add data, hard negative mining, consider ensembles |
| `aleatoric_hotspot` (Outlier) | Region where features are insufficient and model learning is difficult | Fix annotations, reconsider the task definition, exclude via OOD detection |
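A quick triage based on these categories is to total the data points per category from the Issue list. The snippet below uses a mock DataFrame mirroring the documented `get_issues()` columns (the `error_code` values and counts are hypothetical); in practice you would start from `issues_df = result.get_issues()`:

```python
import pandas as pd

# Mock of the get_issues() DataFrame with the documented columns.
issues_df = pd.DataFrame([
    {"issue_id": 1, "dataset_type": "train", "category": "hotspot",
     "error_code": "E01", "counts": 120, "counts_ratio": 0.60},
    {"issue_id": 2, "dataset_type": "train", "category": "hotspot",
     "error_code": "E02", "counts": 30, "counts_ratio": 0.15},
    {"issue_id": 3, "dataset_type": "train", "category": "critical_hotspot",
     "error_code": "E03", "counts": 50, "counts_ratio": 0.25},
])

# Total data points per category, largest first.
per_category = (
    issues_df.groupby("category")["counts"].sum().sort_values(ascending=False)
)
print(per_category.to_dict())  # -> {'hotspot': 150, 'critical_hotspot': 50}
```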
### Detailed Summary

```python
result.get_summary()

# Filter by specific dataset_type
result.get_summary(dataset_type="train")
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_type` | `Optional[str]` | `None` | Filter output by `dataset_type`. `None` to show all dataset_types |
This method displays a detailed summary including:
- Evaluation information (model, version, result_name)
- Metrics summary
- Issue Category summary
- Detailed error code distribution for each category
### Get Issue List

```python
issues_df = result.get_issues()
```

Get a list of all Issues (error codes) as a DataFrame.
**Columns**

| Column | Description |
|---|---|
| `issue_id` | Issue ID |
| `dataset_type` | Dataset type |
| `category` | Issue Category |
| `error_code` | Error code |
| `debug_code` | Debug code (epistemic/aleatoric) |
| `counts_ratio` | Ratio to the total |
| `counts` | Number of data points |
| Others | Task-specific metrics |
For Object Detection (2D / 3D), additional columns are included: `diagnosis` (diagnosis result), `counts_pred_boxes` / `counts_gt_boxes` (predicted / GT bbox counts), `average_iou`, and `average_score`.
For 3D OD, `average_iou_3d` and `average_iou_bev` are also included.
### Task-specific Error Codes

The format of `error_code` and `debug_code` varies by task.
See Error Codes - Classification and Error Codes - Object Detection for details.
### Get Custom View

A feature for visualizing Issue status under arbitrary conditions. If you recorded custom metadata (additional columns) during data collection with Tracer, you can visualize how the Issue distribution differs across each metadata value.

```python
view_df = result.get_view(
    groupby=["category", "error_code"],
    adjustby="category",
)
```
**Parameters**

| Parameter | Type | Description |
|---|---|---|
| `groupby` | `List[str]` | List of columns to group by |
| `adjustby` | `Optional[str]` | Reference column for ratio adjustment |
| `query` | `Optional[str]` | Filtering condition (pandas query format) |
**Basic Usage Examples**

```python
# Get only the Hotspot category
hotspot_view = result.get_view(
    query="category == 'hotspot'",
    groupby=["error_code"],
)

# Aggregate by dataset_type
dataset_view = result.get_view(
    groupby=["dataset_type", "category"],
)
```
#### Analysis with Custom Metadata

If you defined additional columns during data collection with Tracer, you can group by those columns to analyze the Issue distribution.

```python
# Object Detection: check the error distribution by object size
size_view = result.get_view(
    groupby=["object_size", "category"],
    adjustby="object_size",
)

# Classification: check the error distribution by shooting conditions
condition_view = result.get_view(
    groupby=["lighting_condition", "category"],
    adjustby="lighting_condition",
)
```
**Utilizing Custom Metadata**

By defining `additional_columns` during data collection with Tracer, you can record arbitrary metadata.
This enables the following types of analysis:
- Object Detection: Error occurrence by object size, occlusion presence, etc.
- Classification: Error distribution by shooting environment, data source, etc.
- Common: Condition-based analysis by time of day, device, region, etc.
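As a sketch of post-processing such a view with plain pandas, the snippet below pivots a mock `view_df` to compare hotspot shares across shooting conditions. The exact columns returned by `get_view` are an assumption here (the `groupby` keys plus a `counts` column), and the condition values are made up for illustration:

```python
import pandas as pd

# Mock of a get_view(groupby=["lighting_condition", "category"]) result.
view_df = pd.DataFrame([
    {"lighting_condition": "day",   "category": "stable_coverage", "counts": 900},
    {"lighting_condition": "day",   "category": "hotspot",         "counts": 100},
    {"lighting_condition": "night", "category": "stable_coverage", "counts": 300},
    {"lighting_condition": "night", "category": "hotspot",         "counts": 700},
])

# Pivot to compare the hotspot share across shooting conditions.
pivot = view_df.pivot_table(
    index="lighting_condition", columns="category", values="counts", fill_value=0
)
hotspot_share = pivot["hotspot"] / pivot.sum(axis=1)
print(hotspot_share.round(2).to_dict())  # -> {'day': 0.1, 'night': 0.7}
```

A lopsided share like the one above would suggest prioritizing night-time data collection.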
### About Error Codes

The format and interpretation of error codes vary by task.
## Complete Sample Code

```python
from ml_debugger.evaluator import Evaluator

# Initialize Evaluator
evaluator = Evaluator(
    model_name="resnet18",
    version_name="v1",
)

# Run evaluation
result = evaluator.request_evaluation(
    result_name="experiment_20251219",
    n_epoch="latest",
)

# Review results
print("=== Metrics Summary ===")
print(result.metrics_summary())

print("\n=== Issue Category Summary ===")
print(result.issue_category_summary())

print("\n=== Detailed Summary ===")
result.get_summary()

# Get the Issue list as a DataFrame
issues_df = result.get_issues()
print(f"\nTotal issues: {len(issues_df)}")

# Check Hotspot details
hotspot_issues = issues_df[issues_df["category"] == "hotspot"]
print(f"Hotspot issues: {len(hotspot_issues)}")

# Get a custom view
view = result.get_view(
    groupby=["category"],
    adjustby=None,
)
print("\n=== Category Distribution ===")
print(view)
```
## Next Steps
- Error Codes - Classification - Classification error code definitions
- Error Codes - Object Detection - Object Detection error code definitions