
Data Curation

This guide explains the concepts and strategies of data curation using MLdebugger.

Overview

In the AI product lifecycle, the following are important when continuously collecting data:

  • Do not collect data that does not contribute to AI model growth (maximize specificity)
  • Do not miss data with critical errors that the AI model should improve (maximize sensitivity)
  • Collect diverse data within the data range targeted by the application
  • Do not miss unknown data to respond to changes in data distribution (Out-of-Distribution Detection)

MLdebugger's ClassificationDataFilter, ObjectDetectionDataFilter, and ObjectDetection3DDataFilter achieve effective data selection based on analysis of internal features and error probability.

Data Classification Based on Issue Category

MLdebugger classifies data by analyzing the relationship between prediction errors and the model's internal features during inference.

About Threshold Settings

The current version determines categories using fixed thresholds. As a result, when model maturity is low, errors may still exist within the Stable (Coverage) region. Additionally, for multi-class models, distributions on the Heatmap can vary per class, making it important to review evaluation results in the GUI.

By adjusting thresholds based on model maturity and the Heatmap, more fine-grained data collection settings are possible. In the future, dynamic threshold settings based on model maturity evaluated by MLdebugger are planned.

Coverage Region (No Collection Needed)

| Category | Characteristic | Data Collection |
| --- | --- | --- |
| stable_coverage (Highly Stable) | High-quality output possible with stable internal features | Not needed |
| operational_coverage (Stable) | Acceptable for operation with stable internal features | Low priority |

Effect: Improves the ratio of effective data in collected data (specificity)

Hotspot Region (Collection Recommended)

| Category | Characteristic | Data Collection |
| --- | --- | --- |
| hotspot (Unstable) | Unstable internal features with fluctuating predictions | Recommended |
| recessive_hotspot (Under-Confidence) | Low model confidence, errors easier to anticipate | Recommended |

Effect: Can collect data where the model has not sufficiently learned feature representations, enabling construction of more robust models

Critical/Aleatoric Region (Priority Collection)

| Category | Characteristic | Data Collection |
| --- | --- | --- |
| critical_hotspot (Over-Confidence) | High error probability but predicts with high confidence | Highest priority |
| aleatoric_hotspot (Outlier) | Insufficient features, model learning is difficult | Review needed |

Effect: Prevents missing high-importance errors (maximize sensitivity)
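The category boundaries above can be sketched as a small decision function. This is a conceptual illustration only: the threshold values and the decision order are placeholder assumptions, not MLdebugger's internal fixed thresholds.

```python
# Conceptual sketch of the Issue Categories above. The thresholds (0.8, 0.2,
# 0.5, 0.1) are illustrative placeholders, not MLdebugger's actual values.

def classify_zone(error_proba: float, confidence: float, feature_stability: float) -> str:
    if feature_stability >= 0.8:
        # Coverage region: stable internal features
        return "stable_coverage" if error_proba < 0.1 else "operational_coverage"
    if feature_stability < 0.2:
        return "aleatoric_hotspot"   # Outlier: insufficient features to learn from
    if error_proba >= 0.5:
        # Over-Confidence (confident mistakes) vs. Under-Confidence (anticipated errors)
        return "critical_hotspot" if confidence >= 0.5 else "recessive_hotspot"
    return "hotspot"                 # Unstable features, fluctuating predictions
```

In practice the boundaries come from MLdebugger's fixed thresholds, so evaluation results should still be reviewed per class on the Heatmap as noted above.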

Importance of Critical Hotspot

Critical Hotspot data is difficult to detect during operation because the model makes mistakes with confidence. Prioritize collecting this data and use it for model improvement.

Using ClassificationDataFilter

Initialization

Initialize DataFilter by specifying the evaluation result (result_name).

```python
from ml_debugger.data_filter import ClassificationDataFilter

data_filter = ClassificationDataFilter(
    model,
    model_name="my_model",
    version_name="v1",
    result_name="my_model_v1_classification_v1_20251219",
)
```

Query Strategies

The strategy parameter of the query() method accepts either a string or a dictionary.

Sort Strategy

Sorts data by the specified metric and retrieves the top N items.

| Strategy | Description | Use Case |
| --- | --- | --- |
| "high_error_proba" | Prioritize data with high error probability | Hard Example Mining |
| "low_error_proba" | Prioritize data with low error probability | Collecting reliable data |
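Conceptually, a sort strategy ranks evaluated records by the chosen metric and keeps the top N. The sketch below is illustrative only (the record format is assumed); in MLdebugger you would simply pass the strategy string, e.g. data_filter.query(n_data=100, strategy="high_error_proba").

```python
# Illustrative sketch of the "high_error_proba" sort strategy: rank candidate
# records by error probability and keep the top N. The record format is an
# assumption for this example.

records = [
    {"id": "img_001", "error_proba": 0.92},
    {"id": "img_002", "error_proba": 0.15},
    {"id": "img_003", "error_proba": 0.77},
]

def sort_query(records, n_data, descending=True):
    # descending=True mirrors "high_error_proba"; False mirrors "low_error_proba"
    ranked = sorted(records, key=lambda r: r["error_proba"], reverse=descending)
    return [r["id"] for r in ranked[:n_data]]

top_ids = sort_query(records, n_data=2)  # the two hardest examples
```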

Filter Strategy

Retrieves up to N items that match the specified conditions. Conditions are specified in dictionary format for Issue Category or thresholds.

```python
# Select data from Hotspot zones
queried_ids = data_filter.query(
    n_data=100,
    strategy={"target_zones": ["hotspot", "critical_hotspot"]},
)

# Custom selection with threshold conditions
queried_ids = data_filter.query(
    n_data=100,
    strategy={"conditions": [{"error_proba": ">=0.8"}]},
)
```
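Conceptually, a threshold condition such as {"error_proba": ">=0.8"} pairs a metric column with a comparison expression. The sketch below shows one way such conditions could be evaluated against per-sample metrics; the parsing logic is an assumption for illustration, not MLdebugger's implementation.

```python
import operator

# Illustrative evaluator for conditions like {"error_proba": ">=0.8"}.
# The condition grammar is inferred from the example above.

_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def matches(metrics, conditions):
    """Return True if the sample's metrics satisfy every condition."""
    for cond in conditions:
        for column, expr in cond.items():
            # Try two-character operators first so ">=" is not parsed as ">"
            for sym in (">=", "<=", ">", "<"):
                if expr.startswith(sym):
                    if not _OPS[sym](metrics[column], float(expr[len(sym):])):
                        return False
                    break
    return True
```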

Benefits of Filter Strategy

Issue Category-based selection considers not only error probability but also the model's internal state (confidence, feature stability), enabling more effective data selection.

Real-time Filtering

In addition to batch post-processing, real-time filtering using filter_config is available. You can instantly determine which Issue Category a data point belongs to during inference.

Mapping between target_zones and Issue Category

| target_zone | Issue Category | Description |
| --- | --- | --- |
| stable_coverage | Highly Stable | High-quality output possible with stable internal features |
| operational_coverage | Stable | Acceptable for operation with stable internal features |
| hotspot | Unstable | Unstable internal features with fluctuating predictions |
| recessive_hotspot | Under-Confidence | Low model confidence, errors easier to anticipate |
| critical_hotspot | Over-Confidence | High error probability but predicts with high confidence |
| aleatoric_hotspot | Outlier | Insufficient features, model learning is difficult |

Example: Real-time Detection of Critical Hotspot

This example detects Critical Hotspot data (data where the model makes confident mistakes) in real time in a production environment and immediately sends it to the data collection pipeline.

```python
from ml_debugger.data_filter import ClassificationDataFilter

# Filter configuration targeting Critical Hotspot
data_filter = ClassificationDataFilter(
    model,
    model_name="my_model",
    version_name="v1",
    result_name="my_model_v1_classification_v1_20251219",
    filter_config={"target_zones": ["critical_hotspot"]},
)

# Real-time filtering during inference
for image, _, idx in inference_dataloader:
    model_output, filter_flags = data_filter(
        image.to(device),
        input_ids=[idx],
    )

    if filter_flags[0] is True:
        # Critical Hotspot detected → Request priority labeling
        send_to_labeling_queue(idx, priority="high")
```
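send_to_labeling_queue is not part of MLdebugger; it stands for whatever hand-off mechanism your pipeline uses. A minimal in-process stand-in might look like the following (a real deployment would typically post to a labeling service or message broker instead):

```python
import json
import queue

# Hypothetical stand-in for the labeling hand-off used in the example above.
labeling_queue: "queue.Queue[str]" = queue.Queue()

def send_to_labeling_queue(idx, priority="normal"):
    # Serialize the request so a downstream consumer can pick it up
    labeling_queue.put(json.dumps({"id": idx, "priority": priority}))
```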

Example: Filtering Out Low-contribution Data

This example excludes Coverage region data (data the model has already learned sufficiently) to reduce storage costs.

```python
# Target Coverage zones (= identify data to exclude)
data_filter = ClassificationDataFilter(
    model,
    model_name="my_model",
    version_name="v1",
    result_name=result.result_name,
    filter_config={"target_zones": ["stable_coverage", "operational_coverage"]},
)

for image, _, idx in inference_dataloader:
    model_output, filter_flags = data_filter(
        image.to(device),
        input_ids=[idx],
    )

    if filter_flags[0] is True:
        # Coverage zone → Don't save (doesn't contribute to learning)
        pass
    else:
        # Hotspot/Critical zone → Save (useful for learning)
        save_to_storage(image, idx)
```

Object Detection Real-time Filtering

For Object Detection, you can pass a BBoxStrategy dict to filter_config to filter images based on an image-level error probability aggregated from per-bbox values.

```python
from ml_debugger.data_filter import ObjectDetectionDataFilter

# Flag images where aggregated error probability >= 0.5
data_filter = ObjectDetectionDataFilter(
    model,
    model_name="yolov5",
    version_name="v1",
    result_name=result.result_name,
    filter_config={
        "img_error_threshold": 0.5,
        "aggregation": "mean",
        "target_column": "error_proba",
    },
)

for image, _, idx in inference_dataloader:
    model_output, filter_flags = data_filter(
        image.to(device),
        input_ids=[idx],
    )

    if filter_flags[0] is True:
        # Image-level error probability above threshold → collect
        save_to_storage(image, idx)
```
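Conceptually, the image-level flag is a reduction of per-bbox error probabilities compared against img_error_threshold. The sketch below assumes a "mean" reduction (with "max" added purely for illustration); the reduction MLdebugger actually applies is configured via BBoxStrategy.

```python
# Illustrative sketch of image-level aggregation from per-bbox error
# probabilities, mirroring img_error_threshold and aggregation above.

def image_flag(bbox_error_probas, threshold=0.5, aggregation="mean"):
    if not bbox_error_probas:
        return False  # no detections, nothing to aggregate
    if aggregation == "mean":
        score = sum(bbox_error_probas) / len(bbox_error_probas)
    elif aggregation == "max":
        score = max(bbox_error_probas)
    else:
        raise ValueError(f"unknown aggregation: {aggregation}")
    return score >= threshold
```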

BBoxStrategy and filter_config Details

For details on BBoxStrategy (target_column, top_n, aggregation, etc.) and filter_config usage, see Getting Started - DataFiltering.

Use Cases

Active Learning

Improve model performance with lower labeling costs by selectively labeling the most informative data from an unlabeled data pool.

See Active Learning for detailed implementation.

Dataset Quality Improvement

Identify problematic data from existing datasets for annotation review.

Use Cases:

  • Discovery and correction of annotation mistakes
  • Identification and exclusion of ambiguous data
  • Reduction of label noise

Optimizing Data Collection During Operation

Collect only the data useful for model improvement from data inferred in the production environment.

Effect:

  • Storage cost reduction
  • Labeling cost optimization
  • Continuous model improvement

Cost-Effectiveness Trade-offs

By adjusting the weighting of collection strategies, users can control the amount and quality of data collection, which directly impacts costs.

| Strategy | Data Volume | Labeling Cost | Model Improvement Effect |
| --- | --- | --- | --- |
| Collect all data | Large | High | Medium |
| High Error Proba | Small | Low | High |
| Random | Medium | Medium | Medium |

Next Steps