# Data Curation

This guide explains the concepts and strategies of data curation using MLdebugger.
## Overview
When collecting data continuously across the AI product lifecycle, the following goals are important:

- Do not collect data that does not contribute to AI model growth (maximize specificity)
- Do not miss data with critical errors on which the AI model should improve (maximize sensitivity)
- Collect diverse data within the data range targeted by the application
- Do not miss unknown data, so that shifts in the data distribution can be handled (Out-of-Distribution detection)
MLdebugger's ClassificationDataFilter, ObjectDetectionDataFilter, and ObjectDetection3DDataFilter achieve effective data selection based on analysis of internal features and error probability.
## Data Classification Based on Issue Category
MLdebugger classifies data by analyzing the model's internal features during inference and their relationship to prediction errors.
### About Threshold Settings

The current version determines categories using fixed thresholds. As a result, when model maturity is low, errors may still occur within the Stable (Coverage) region. In addition, for multi-class models the distributions on the Heatmap can vary per class, so it is important to review evaluation results in the GUI.

Adjusting thresholds based on model maturity and the Heatmap enables more fine-grained data collection settings. Dynamic threshold settings based on model maturity evaluated by MLdebugger are planned for a future release.
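To make the idea of fixed thresholds concrete, here is a minimal, library-free sketch. The two-axis view (error probability vs. model confidence), the 0.5 thresholds, and the function name are illustrative assumptions, not MLdebugger's actual decision rule; in particular, the sketch omits the feature-stability signal that distinguishes the `hotspot` and `aleatoric_hotspot` categories.

```python
# Illustrative sketch only -- NOT MLdebugger's actual decision rule.
# Assumes two per-sample signals: error probability and model confidence.
def assign_category(error_proba, confidence,
                    error_threshold=0.5, confidence_threshold=0.5):
    if error_proba < error_threshold:
        # Low expected error: Coverage region
        if confidence >= confidence_threshold:
            return "stable_coverage"
        return "operational_coverage"
    # High expected error: Hotspot/Critical region
    if confidence >= confidence_threshold:
        return "critical_hotspot"   # confident but likely wrong (Over-Confidence)
    return "recessive_hotspot"      # unconfident and likely wrong (Under-Confidence)

print(assign_category(0.9, 0.95))  # → critical_hotspot
```

Raising `error_threshold` shrinks the Hotspot region (fewer samples collected); lowering it widens the net, which is the cost/volume trade-off discussed later in this guide.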
### Coverage Region (No Collection Needed)

| Category | Characteristic | Data Collection |
|---|---|---|
| `stable_coverage` (Highly Stable) | High-quality output possible with stable internal features | Not needed |
| `operational_coverage` (Stable) | Acceptable for operation with stable internal features | Low priority |

**Effect:** Improves the ratio of effective data in collected data (specificity)
### Hotspot Region (Collection Recommended)

| Category | Characteristic | Data Collection |
|---|---|---|
| `hotspot` (Unstable) | Unstable internal features with fluctuating predictions | Recommended |
| `recessive_hotspot` (Under-Confidence) | Low model confidence, errors easier to anticipate | Recommended |

**Effect:** Collects data whose feature representations the model has not yet learned sufficiently, enabling construction of more robust models
### Critical/Aleatoric Region (Priority Collection)

| Category | Characteristic | Data Collection |
|---|---|---|
| `critical_hotspot` (Over-Confidence) | High error probability, but the model predicts with high confidence | Highest priority |
| `aleatoric_hotspot` (Outlier) | Insufficient features; model learning is difficult | Review needed |

**Effect:** Prevents missing high-importance errors (maximize sensitivity)
### Importance of Critical Hotspot

Critical Hotspot data is difficult to detect during operation because the model makes mistakes with high confidence. Collect it with the highest priority and use it for model improvement.
## Using ClassificationDataFilter

### Initialization

Initialize the DataFilter by specifying the evaluation result (`result_name`).
```python
from ml_debugger.data_filter import ClassificationDataFilter

data_filter = ClassificationDataFilter(
    model,
    model_name="my_model",
    version_name="v1",
    result_name="my_model_v1_classification_v1_20251219",
)
```
### Query Strategies

The `strategy` parameter of the `query()` method accepts either a string or a dictionary.

#### Sort Strategy

Sorts data by the specified metric and retrieves the top N items.
| Strategy | Description | Use Case |
|---|---|---|
| `"high_error_proba"` | Prioritize data with high error probability | Hard Example Mining |
| `"low_error_proba"` | Prioritize data with low error probability | Collecting reliable data |
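The sort strategies above boil down to ranking records by a metric and keeping the top N. Here is a library-free sketch of that behavior; the record layout and function name are illustrative assumptions, not MLdebugger internals.

```python
# Minimal sketch of a sort strategy: rank by a metric, keep the top N.
def sort_query(records, metric, n_data, descending=True):
    ranked = sorted(records, key=lambda r: r[metric], reverse=descending)
    return [r["id"] for r in ranked[:n_data]]

samples = [
    {"id": "a", "error_proba": 0.9},
    {"id": "b", "error_proba": 0.2},
    {"id": "c", "error_proba": 0.7},
]
# "high_error_proba" corresponds to descending=True (Hard Example Mining)
print(sort_query(samples, "error_proba", n_data=2))  # → ['a', 'c']
```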
#### Filter Strategy

Retrieves up to N items that match the specified conditions. Conditions are given as a dictionary of Issue Categories (`target_zones`) or thresholds.

```python
# Select data from Hotspot zones
queried_ids = data_filter.query(
    n_data=100,
    strategy={"target_zones": ["hotspot", "critical_hotspot"]},
)

# Custom selection with threshold conditions
queried_ids = data_filter.query(
    n_data=100,
    strategy={"conditions": [{"error_proba": ">=0.8"}]},
)
```
#### Benefits of Filter Strategy

Issue Category-based selection considers not only error probability but also the model's internal state (confidence, feature stability), enabling more effective data selection.
## Real-time Filtering

In addition to batch post-processing, real-time filtering via `filter_config` is available. You can determine instantly, during inference, which Issue Category a data point belongs to.
### Mapping between target_zones and Issue Category

| target_zone | Issue Category | Description |
|---|---|---|
| `stable_coverage` | Highly Stable | High-quality output possible with stable internal features |
| `operational_coverage` | Stable | Acceptable for operation with stable internal features |
| `hotspot` | Unstable | Unstable internal features with fluctuating predictions |
| `recessive_hotspot` | Under-Confidence | Low model confidence, errors easier to anticipate |
| `critical_hotspot` | Over-Confidence | High error probability but predicts with high confidence |
| `aleatoric_hotspot` | Outlier | Insufficient features, model learning is difficult |
### Example: Real-time Detection of Critical Hotspot

This example detects Critical Hotspot data (data where the model makes confident mistakes) in real time in a production environment and immediately sends it to the data collection pipeline.
```python
from ml_debugger.data_filter import ClassificationDataFilter

# Filter configuration targeting Critical Hotspot
data_filter = ClassificationDataFilter(
    model,
    model_name="my_model",
    version_name="v1",
    result_name="my_model_v1_classification_v1_20251219",
    filter_config={"target_zones": ["critical_hotspot"]},
)

# Real-time filtering during inference
for image, _, idx in inference_dataloader:
    model_output, filter_flags = data_filter(
        image.to(device),
        input_ids=[idx],
    )
    if filter_flags[0]:
        # Critical Hotspot detected → request priority labeling
        send_to_labeling_queue(idx, priority="high")
```
### Example: Filtering Out Low-contribution Data

This example excludes Coverage region data (data the model has already learned sufficiently) to reduce storage costs.
```python
# Target Coverage zones (= identify data to exclude)
data_filter = ClassificationDataFilter(
    model,
    model_name="my_model",
    version_name="v1",
    result_name=result.result_name,
    filter_config={"target_zones": ["stable_coverage", "operational_coverage"]},
)

for image, _, idx in inference_dataloader:
    model_output, filter_flags = data_filter(
        image.to(device),
        input_ids=[idx],
    )
    if filter_flags[0]:
        # Coverage zone → don't save (doesn't contribute to learning)
        pass
    else:
        # Hotspot/Critical zone → save (useful for learning)
        save_to_storage(image, idx)
```
## Object Detection Real-time Filtering

For Object Detection, you can pass a BBoxStrategy dict to `filter_config` to filter on an image-level error probability aggregated from the per-bbox values.
```python
from ml_debugger.data_filter import ObjectDetectionDataFilter

# Flag images whose aggregated error probability is >= 0.5
data_filter = ObjectDetectionDataFilter(
    model,
    model_name="yolov5",
    version_name="v1",
    result_name=result.result_name,
    filter_config={
        "img_error_threshold": 0.5,
        "aggregation": "mean",
        "target_column": "error_proba",
    },
)

for image, _, idx in inference_dataloader:
    model_output, filter_flags = data_filter(
        image.to(device),
        input_ids=[idx],
    )
    if filter_flags[0]:
        # Image-level error probability above threshold → collect
        save_to_storage(image, idx)
```
### BBoxStrategy and filter_config Details

For details on BBoxStrategy (`target_column`, `top_n`, `aggregation`, etc.) and `filter_config` usage, see Getting Started - DataFiltering.
## Use Cases

### Active Learning

Improve model performance with lower labeling costs by selectively labeling the most informative data from an unlabeled data pool.

See Active Learning for a detailed implementation.
### Dataset Quality Improvement

Identify problematic data in existing datasets for annotation review.

Use cases:

- Discovery and correction of annotation mistakes
- Identification and exclusion of ambiguous data
- Reduction of label noise
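One simple heuristic for the first bullet: flag samples where the model disagrees with the label at high confidence, since these frequently turn out to be annotation mistakes. The record layout, field names, and threshold below are illustrative assumptions, not an MLdebugger API.

```python
# Sketch: surface likely annotation mistakes as high-confidence disagreements
# between model prediction and ground-truth label (illustrative record format).
def annotation_review_candidates(records, min_confidence=0.9):
    return [r["id"] for r in records
            if r["prediction"] != r["label"] and r["confidence"] >= min_confidence]

records = [
    {"id": "a", "prediction": "cat", "label": "dog", "confidence": 0.97},
    {"id": "b", "prediction": "cat", "label": "cat", "confidence": 0.99},
    {"id": "c", "prediction": "dog", "label": "cat", "confidence": 0.55},
]
print(annotation_review_candidates(records))  # → ['a']
```

Sample "c" also disagrees with its label, but at low confidence it is more likely ambiguous data than a labeling error, which is why the confidence gate matters.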
### Optimizing Data Collection During Operation

Collect only the data that is useful for model improvement from data inferred in the production environment.

Effects:

- Storage cost reduction
- Labeling cost optimization
- Continuous model improvement
## Cost-Effectiveness Trade-offs

By adjusting the weighting of collection strategies, users can control the volume and quality of collected data, which directly impacts cost.
| Strategy | Data Volume | Labeling Cost | Model Improvement Effect |
|---|---|---|---|
| Collect all data | Large | High | Medium |
| High Error Proba | Small | Low | High |
| Random | Medium | Medium | Medium |
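A back-of-envelope calculation makes the table concrete. The pool size, selection ratio, and per-label price below are hypothetical figures for illustration only, not measurements.

```python
# Hypothetical back-of-envelope labeling-cost estimate (integer cents are
# used to avoid float rounding; all figures are made up for illustration).
def labeling_cost_cents(n_inferred, selection_pct, cents_per_label):
    return n_inferred * selection_pct * cents_per_label // 100

# 100k inferred samples at 10 cents per label (hypothetical)
print(labeling_cost_cents(100_000, 100, 10))  # collect all → 1000000 (= $10,000)
print(labeling_cost_cents(100_000, 5, 10))    # top 5%      → 50000 (= $500)
```

Selecting only the top 5% by error probability cuts labeling cost by 20x; whether that small slice delivers more model improvement than random sampling is exactly what the High Error Proba row of the table claims.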
## Next Steps
- Getting Started - DataFiltering - Basic DataFilter operations
- Evaluation and Result - Issue Category details