
Data Curation

This guide explains the concepts and strategies of data curation using MLdebugger.

Overview

In the AI product lifecycle, the following are important when continuously collecting data:

  • Do not collect data that does not contribute to AI model growth (maximize specificity)
  • Do not miss data with critical errors that the AI model should improve (maximize sensitivity)
  • Collect diverse data within the data range targeted by the application
  • Do not miss unknown data to respond to changes in data distribution (Out-of-Distribution Detection)

MLdebugger's ClassificationDataFilter, ObjectDetectionDataFilter, and ObjectDetection3DDataFilter achieve effective data selection based on analysis of internal features and error probability.

Data Classification Based on Issue Category

MLdebugger classifies data by analyzing the relationship between prediction errors and the model's internal features during inference.

About Threshold Settings

The current version determines categories using fixed thresholds. As a result, when model maturity is low, errors may still exist within the Stable (Coverage) region. Additionally, for multi-class models, distributions on the Heatmap can vary per class, making it important to review evaluation results in the GUI.

By adjusting thresholds based on model maturity and the Heatmap, more fine-grained data collection settings are possible. In the future, dynamic threshold settings based on model maturity evaluated by MLdebugger are planned.

Coverage Region (No Collection Needed)

| Category | Characteristic | Data Collection |
| --- | --- | --- |
| stable_coverage (Highly Stable) | High-quality output possible with stable internal features | Not needed |
| operational_coverage (Stable) | Acceptable for operation with stable internal features | Low priority |

Effect: Improves the ratio of effective data in collected data (specificity)

Hotspot Region (Collection Recommended)

| Category | Characteristic | Data Collection |
| --- | --- | --- |
| hotspot (Unstable) | Unstable internal features with fluctuating predictions | Recommended |
| recessive_hotspot (Under-Confidence) | Low model confidence, errors easier to anticipate | Recommended |

Effect: Can collect data where the model has not sufficiently learned feature representations, enabling construction of more robust models

Critical/Aleatoric Region (Priority Collection)

| Category | Characteristic | Data Collection |
| --- | --- | --- |
| critical_hotspot (Over-Confidence) | High error probability but predicts with high confidence | Highest priority |
| aleatoric_hotspot (Outlier) | Insufficient features, model learning is difficult | Review needed |

Effect: Prevents missing high-importance errors (maximize sensitivity)
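The category boundaries above can be sketched as a small decision function. This is a conceptual illustration only: the threshold values and the decision order are placeholder assumptions, not MLdebugger's internal fixed thresholds.

```python
# Conceptual sketch of the Issue Categories above. The thresholds (0.8, 0.2,
# 0.5, 0.1) are illustrative placeholders, not MLdebugger's actual values.

def classify_zone(error_proba: float, confidence: float, feature_stability: float) -> str:
    if feature_stability >= 0.8:
        # Coverage region: stable internal features
        return "stable_coverage" if error_proba < 0.1 else "operational_coverage"
    if feature_stability < 0.2:
        return "aleatoric_hotspot"   # Outlier: insufficient features to learn from
    if error_proba >= 0.5:
        # Over-Confidence (confident mistakes) vs. Under-Confidence (anticipated errors)
        return "critical_hotspot" if confidence >= 0.5 else "recessive_hotspot"
    return "hotspot"                 # Unstable features, fluctuating predictions
```

In practice the boundaries come from MLdebugger's fixed thresholds, so evaluation results should still be reviewed per class on the Heatmap as noted above.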

Importance of Critical Hotspot

Critical Hotspot data is difficult to detect during operation because the model makes mistakes with confidence. Prioritize collecting this data and use it for model improvement.

Using ClassificationDataFilter

Initialization

Initialize DataFilter by specifying the evaluation result (result_name).

```python
from ml_debugger.data_filter import ClassificationDataFilter

data_filter = ClassificationDataFilter(
    model,
    model_name="my_model",
    version_name="v1",
    result_name="my_model_v1_classification_v1_20251219",
)
```

Query Strategies

The strategy parameter of the query() method accepts either a string or a dictionary.

Sort Strategy

Sorts data by the specified metric and retrieves the top N items.

| Strategy | Description | Use Case |
| --- | --- | --- |
| "high_error_proba" | Prioritize data with high error probability | Hard Example Mining |
| "low_error_proba" | Prioritize data with low error probability | Collecting reliable data |
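Conceptually, a sort strategy ranks evaluated records by the chosen metric and keeps the top N. The sketch below is illustrative only (the record format is assumed); in MLdebugger you would simply pass the strategy string, e.g. data_filter.query(n_data=100, strategy="high_error_proba").

```python
# Illustrative sketch of the "high_error_proba" sort strategy: rank candidate
# records by error probability and keep the top N. The record format is an
# assumption for this example.

records = [
    {"id": "img_001", "error_proba": 0.92},
    {"id": "img_002", "error_proba": 0.15},
    {"id": "img_003", "error_proba": 0.77},
]

def sort_query(records, n_data, descending=True):
    # descending=True mirrors "high_error_proba"; False mirrors "low_error_proba"
    ranked = sorted(records, key=lambda r: r["error_proba"], reverse=descending)
    return [r["id"] for r in ranked[:n_data]]

top_ids = sort_query(records, n_data=2)  # the two hardest examples
```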

Filter Strategy

Retrieves up to N items that match the specified conditions. Conditions are specified in dictionary format for Issue Category or thresholds.

```python
# Select data from Hotspot zones
queried_ids = data_filter.query(
    n_data=100,
    strategy={"target_zones": ["hotspot", "critical_hotspot"]},
)

# Custom selection with threshold conditions
queried_ids = data_filter.query(
    n_data=100,
    strategy={"conditions": [{"error_proba": ">=0.8"}]},
)
```
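Conceptually, a threshold condition such as {"error_proba": ">=0.8"} pairs a metric column with a comparison expression. The sketch below shows one way such conditions could be evaluated against per-sample metrics; the parsing logic is an assumption for illustration, not MLdebugger's implementation.

```python
import operator

# Illustrative evaluator for conditions like {"error_proba": ">=0.8"}.
# The condition grammar is inferred from the example above.

_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def matches(metrics, conditions):
    """Return True if the sample's metrics satisfy every condition."""
    for cond in conditions:
        for column, expr in cond.items():
            # Try two-character operators first so ">=" is not parsed as ">"
            for sym in (">=", "<=", ">", "<"):
                if expr.startswith(sym):
                    if not _OPS[sym](metrics[column], float(expr[len(sym):])):
                        return False
                    break
    return True
```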

Benefits of Filter Strategy

Issue Category-based selection considers not only error probability but also the model's internal state (confidence, feature stability), enabling more effective data selection.

Real-time Filtering

In addition to batch post-processing, real-time filtering using filter_config is available. You can instantly determine which Issue Category a data point belongs to during inference.

Mapping between target_zones and Issue Category

| target_zone | Issue Category | Description |
| --- | --- | --- |
| stable_coverage | Highly Stable | High-quality output possible with stable internal features |
| operational_coverage | Stable | Acceptable for operation with stable internal features |
| hotspot | Unstable | Unstable internal features with fluctuating predictions |
| recessive_hotspot | Under-Confidence | Low model confidence, errors easier to anticipate |
| critical_hotspot | Over-Confidence | High error probability but predicts with high confidence |
| aleatoric_hotspot | Outlier | Insufficient features, model learning is difficult |

Example: Real-time Detection of Critical Hotspot

This example detects Critical Hotspot data (data where the model makes confident mistakes) in real time in a production environment and immediately sends it to the data collection pipeline.

```python
from ml_debugger.data_filter import ClassificationDataFilter

# Filter configuration targeting Critical Hotspot
data_filter = ClassificationDataFilter(
    model,
    model_name="my_model",
    version_name="v1",
    result_name="my_model_v1_classification_v1_20251219",
    filter_config={"target_zones": ["critical_hotspot"]},
)

# Real-time filtering during inference
for image, _, idx in inference_dataloader:
    model_output, filter_flags = data_filter(
        image.to(device),
        input_ids=[idx],
    )

    if filter_flags[0] is True:
        # Critical Hotspot detected → Request priority labeling
        send_to_labeling_queue(idx, priority="high")
```
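send_to_labeling_queue is not part of MLdebugger; it stands for whatever hand-off mechanism your pipeline uses. A minimal in-process stand-in might look like the following (a real deployment would typically post to a labeling service or message broker instead):

```python
import json
import queue

# Hypothetical stand-in for the labeling hand-off used in the example above.
labeling_queue: "queue.Queue[str]" = queue.Queue()

def send_to_labeling_queue(idx, priority="normal"):
    # Serialize the request so a downstream consumer can pick it up
    labeling_queue.put(json.dumps({"id": idx, "priority": priority}))
```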

Example: Filtering Out Low-contribution Data

This example excludes Coverage region data (data the model has already learned sufficiently) to reduce storage costs.

```python
# Target Coverage zones (= identify data to exclude)
data_filter = ClassificationDataFilter(
    model,
    model_name="my_model",
    version_name="v1",
    result_name=result.result_name,
    filter_config={"target_zones": ["stable_coverage", "operational_coverage"]},
)

for image, _, idx in inference_dataloader:
    model_output, filter_flags = data_filter(
        image.to(device),
        input_ids=[idx],
    )

    if filter_flags[0] is True:
        # Coverage zone → Don't save (doesn't contribute to learning)
        pass
    else:
        # Hotspot/Critical zone → Save (useful for learning)
        save_to_storage(image, idx)
```

Object Detection Real-time Filtering

For Object Detection, you can pass a BBoxStrategy dict to filter_config to filter images based on an image-level error probability aggregated from per-bbox values.

```python
from ml_debugger.data_filter import ObjectDetectionDataFilter

# Flag images where aggregated error probability >= 0.5
data_filter = ObjectDetectionDataFilter(
    model,
    model_name="yolov5",
    version_name="v1",
    result_name=result.result_name,
    filter_config={
        "img_error_threshold": 0.5,
        "aggregation": "mean",
        "target_column": "error_proba",
    },
)

for image, _, idx in inference_dataloader:
    model_output, filter_flags = data_filter(
        image.to(device),
        input_ids=[idx],
    )

    if filter_flags[0] is True:
        # Image-level error probability above threshold → collect
        save_to_storage(image, idx)
```
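Conceptually, the image-level flag is a reduction of per-bbox error probabilities compared against img_error_threshold. The sketch below assumes a "mean" reduction (with "max" added purely for illustration); the reduction MLdebugger actually applies is configured via BBoxStrategy.

```python
# Illustrative sketch of image-level aggregation from per-bbox error
# probabilities, mirroring img_error_threshold and aggregation above.

def image_flag(bbox_error_probas, threshold=0.5, aggregation="mean"):
    if not bbox_error_probas:
        return False  # no detections, nothing to aggregate
    if aggregation == "mean":
        score = sum(bbox_error_probas) / len(bbox_error_probas)
    elif aggregation == "max":
        score = max(bbox_error_probas)
    else:
        raise ValueError(f"unknown aggregation: {aggregation}")
    return score >= threshold
```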

BBoxStrategy and filter_config Details

For details on BBoxStrategy (target_column, top_n, aggregation, etc.) and filter_config usage, see Getting Started - DataFiltering.

Use Cases

Active Learning

Improve model performance with lower labeling costs by selectively labeling the most informative data from an unlabeled data pool.

See Active Learning for detailed implementation.

Dataset Quality Improvement

Identify problematic data from existing datasets for annotation review.

Use Cases:

  • Discovery and correction of annotation mistakes
  • Identification and exclusion of ambiguous data
  • Reduction of label noise

Optimizing Data Collection During Operation

Collect only the data useful for model improvement from data inferred in the production environment.

Effect:

  • Storage cost reduction
  • Labeling cost optimization
  • Continuous model improvement

Cost-Effectiveness Trade-offs

By adjusting the weighting of collection strategies, users can control the amount and quality of data collection, which directly impacts costs.

| Strategy | Data Volume | Labeling Cost | Model Improvement Effect |
| --- | --- | --- | --- |
| Collect all data | Large | High | Medium |
| High Error Proba | Small | Low | High |
| Random | Medium | Medium | Medium |

Next Steps