ML Interview Q Series: How does the algorithm “The 10% You Don’t Need” remove redundant data?
Comprehensive Explanation
The algorithm often referred to as “The 10% You Don’t Need” is a data-reduction technique that tries to identify and remove a portion of the dataset (commonly around 10%, though it can be any fraction) that provides minimal additional benefit to a model’s predictive power or learning capacity. The fundamental idea is to measure how critical each subset of data samples is to the model’s performance and then selectively remove those samples deemed to have little or no impact. This approach helps reduce the size of the dataset without significantly harming the overall predictive accuracy, thus cutting down on computational overhead and storage needs.
Core Idea Behind the Algorithm
The underlying concept is to test how model performance changes if a certain data subset is removed. By evaluating these changes, one can gauge the relative importance of those data points. When the performance drops minimally upon removing a particular subset, we label that subset as “redundant” or non-critical. Typically, the procedure tries removing different potential subsets (often 10% each time, hence the nickname) and retrains or reevaluates the model to check for performance degradation.
Potential Mathematical Formulation of Importance
One typical way of defining the importance or redundancy of a data subset S is to look at the difference in model loss or accuracy when that subset is removed. For a data subset S, define I(S) as the drop in performance after removing S:

I(S) = Performance_original - Performance_without_S

The bigger the drop, the more important that subset is. Conversely, if removal yields little or no performance drop, S is more redundant.
Here:
Performance_original is the model's performance (for instance, accuracy or F1-score) when trained on the full dataset.
Performance_without_S is the model's performance when subset S is excluded during training.
I(S) measures how critical S is: if the difference is small, S is relatively unimportant or redundant.
In practice, we often apply this measure to smaller chunks (e.g., 10% of the data each time) to systematically identify which chunk is least important. The algorithm can be iterative, removing the least important chunks and then re-evaluating until a certain threshold or limit is reached.
Practical Steps
One typical workflow is:
Split the data into multiple subsets (not necessarily 10% each, though the name suggests that figure).
Train or partially train a model on all but one subset, measure the performance drop, and record it.
Rank subsets by the drop in performance: the subset(s) causing the smallest drop are considered “least important” or “most redundant.”
Remove the subset(s) that have the smallest impact on performance.
Retrain the model (or reevaluate) on the remaining data.
Repeat until you have reached the desired data reduction ratio or until removal starts significantly deteriorating the model’s performance.
Why It Works
By explicitly measuring the impact each data portion has on the final performance, the algorithm can isolate chunks that do not add unique information. Often, these chunks might contain samples that are near duplicates or that align closely with other points in the dataset. Such samples provide minimal new knowledge to the model and can be safely removed.
Example Python Pseudocode
import numpy as np

def measure_performance(train_data, valid_data, model_fn):
    # model_fn: a function that trains a model on train_data and returns performance on valid_data
    return model_fn(train_data, valid_data)

def remove_10_percent_redundant(data, labels, model_fn, performance_threshold_drop,
                                valid_data=None, valid_labels=None):
    # data and labels are assumed to be numpy arrays of equal length.
    # performance_threshold_drop: maximum acceptable drop in performance after removal.
    # valid_data / valid_labels: a fixed held-out validation set; if none is given,
    # a simple 20% slice is carved off so that every comparison uses the same evaluation data.
    if valid_data is None:
        n_valid = int(0.2 * len(data))
        valid_data, valid_labels = data[:n_valid], labels[:n_valid]
        data, labels = data[n_valid:], labels[n_valid:]

    # Baseline: train on all remaining data, evaluate on the held-out validation set.
    baseline_perf = measure_performance((data, labels), (valid_data, valid_labels), model_fn)

    # Partition the training data into ~10 sequential chunks of roughly 10% each.
    chunk_size = max(1, int(0.1 * len(data)))
    chunk_starts = list(range(0, len(data), chunk_size))

    importance_scores = []
    for i, start in enumerate(chunk_starts):
        stop = min(start + chunk_size, len(data))
        # Build a reduced training set that excludes this chunk.
        reduced_data = np.delete(data, np.s_[start:stop], axis=0)
        reduced_labels = np.delete(labels, np.s_[start:stop], axis=0)
        # Measure performance without this chunk, on the same validation set.
        perf_without_subset = measure_performance((reduced_data, reduced_labels),
                                                  (valid_data, valid_labels),
                                                  model_fn)
        drop = baseline_perf - perf_without_subset
        importance_scores.append((i, drop))

    # Sort chunks by their performance drop (lowest drop => least important).
    importance_scores.sort(key=lambda x: x[1])
    min_drop_index, min_drop_value = importance_scores[0]

    # Remove the least important chunk only if the drop stays within the threshold.
    if min_drop_value <= performance_threshold_drop:
        start = chunk_starts[min_drop_index]
        stop = min(start + chunk_size, len(data))
        data = np.delete(data, np.s_[start:stop], axis=0)
        labels = np.delete(labels, np.s_[start:stop], axis=0)

    return data, labels
Common Pitfalls and Considerations
Distribution Shift: If the data you remove happens to come from a rare class or special corner case, you might inadvertently eliminate critical data. Careful monitoring of class balance is needed (a stratified-chunking sketch follows this list).
Computational Cost: Continuously training models on different subsets can be time-consuming, especially for very large datasets. Practitioners often use smaller, approximate steps or simpler proxy measures of performance to speed up the process.
Overfitting to a Validation Set: If you repeatedly remove subsets based on validation performance, you risk overfitting to that validation set. It is often prudent to have a separate final test set or to rotate among multiple validation sets (cross-validation).
Chunk Sizing: The choice of chunk size matters. If it is too large, the measurement might be coarse. If it is too small, you increase computational overhead.
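One way to guard against both the distribution-shift and chunk-sizing pitfalls is to build the candidate chunks with stratified sampling, so every chunk mirrors the overall class ratios. The sketch below is illustrative rather than part of the algorithm itself: it assumes scikit-learn is available and simply reuses StratifiedKFold's folds as label-balanced chunks, with an arbitrary chunk count and seed.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_chunks(labels, n_chunks=10, seed=0):
    # Each fold's held-out indices form one chunk that preserves the class ratios.
    skf = StratifiedKFold(n_splits=n_chunks, shuffle=True, random_state=seed)
    return [test_idx for _, test_idx in skf.split(np.zeros((len(labels), 1)), labels)]

# Example: a 9:1 imbalanced label vector; every chunk keeps roughly a 0.1 minority share.
labels = np.array([0] * 900 + [1] * 100)
chunks = stratified_chunks(labels)
print([round(labels[idx].mean(), 2) for idx in chunks])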
Follow-up Questions
How do we ensure that the 10% removed does not contain rare classes?
You can monitor the class distribution in each subset before removal. A typical approach is to ensure balanced partitioning so that no chunk disproportionately contains rare classes. Alternatively, incorporate constraints that forbid removing a subset if it heavily reduces minority class instances. Regularly re-check the distribution of the remaining data to ensure coverage and avoid skew.
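As a concrete illustration of such a constraint, the minimal guard below refuses to discard a chunk if doing so would leave any class with too few examples. The function name, the index-array representation of a chunk, and the count threshold are illustrative choices, not part of the original algorithm.

import numpy as np

def removal_is_safe(labels, chunk_idx, min_count_per_class=50):
    # Simulate the removal and check that every class keeps enough examples.
    remaining = np.delete(labels, chunk_idx)
    return all(np.sum(remaining == c) >= min_count_per_class for c in np.unique(labels))

# Example: the chunk holding every minority sample is vetoed; a pure-majority chunk is not.
labels = np.array([0] * 900 + [1] * 100)
print(removal_is_safe(labels, np.arange(900, 1000)))  # False: class 1 would vanish
print(removal_is_safe(labels, np.arange(0, 100)))     # True: both classes keep enough samples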
What if the dataset is extremely large?
When faced with a massive dataset, training from scratch on multiple subsets can be computationally expensive. You can employ approximate methods such as:
Using a simpler model as a proxy for “importance” measurement (a lightweight-proxy sketch follows this list).
Using incremental or online learning techniques that update performance metrics quickly.
Employing random sampling strategies to gather approximate but cheaper signals on which parts of the dataset are most redundant.
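A hedged sketch of the first idea: a cheap linear model trained on subsamples stands in for the full model when scoring a chunk. SGDClassifier, the subsample size, the epoch count, and the helper names are illustrative stand-ins for whatever lightweight proxy fits your setup.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

def fit_proxy(data, labels, idx, seed=0):
    # A few epochs of a linear model are enough for a rough importance signal.
    proxy = SGDClassifier(max_iter=20, tol=None, random_state=seed)
    return proxy.fit(data[idx], labels[idx])

def proxy_drop(data, labels, chunk_idx, valid_data, valid_labels, sample_size=5000, seed=0):
    # Compare a proxy trained on a subsample of everything against a proxy trained
    # on an equal-sized subsample drawn only from outside the chunk.
    rng = np.random.default_rng(seed)
    all_idx = np.arange(len(data))
    outside = np.setdiff1d(all_idx, chunk_idx)
    base_idx = rng.choice(all_idx, size=min(sample_size, len(all_idx)), replace=False)
    red_idx = rng.choice(outside, size=min(sample_size, len(outside)), replace=False)
    baseline = accuracy_score(valid_labels, fit_proxy(data, labels, base_idx, seed).predict(valid_data))
    reduced = accuracy_score(valid_labels, fit_proxy(data, labels, red_idx, seed).predict(valid_data))
    return baseline - reduced   # a small (or negative) drop suggests the chunk is redundant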
Could we use metrics other than accuracy or F1-score to decide on removal?
Yes. Depending on the task, you could measure the effect of data removal on metrics like:
AUC for binary classification tasks.
Mean Average Precision for ranking tasks.
Mean Squared Error for regression tasks.
Any performance metric that appropriately reflects the real-world success criteria of the model can be used. The key is to measure the difference in performance with and without subsets of data, then remove the subsets whose absence costs the least under that metric, as in the sketch below.
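A minimal sketch of that idea, assuming scikit-learn-style estimators: the removal criterion takes an arbitrary scorer, with a flag to flip the sign for error metrics such as MSE, where lower is better. The function name and the synthetic regression example are purely illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def chunk_drop(data, labels, chunk_idx, valid_data, valid_labels,
               model, scorer, greater_is_better=True):
    # Importance of a chunk = how much the chosen validation metric worsens without it.
    baseline = scorer(valid_labels, model.fit(data, labels).predict(valid_data))
    reduced_data = np.delete(data, chunk_idx, axis=0)
    reduced_labels = np.delete(labels, chunk_idx, axis=0)
    reduced = scorer(valid_labels, model.fit(reduced_data, reduced_labels).predict(valid_data))
    drop = baseline - reduced
    return drop if greater_is_better else -drop   # flip the sign for error metrics

# Regression example on synthetic data, with MSE as the removal criterion.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=500)
Xv, yv, Xt, yt = X[:100], y[:100], X[100:], y[100:]
print(chunk_drop(Xt, yt, np.arange(40), Xv, yv,
                 LinearRegression(), mean_squared_error, greater_is_better=False))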
How might we handle redundancy removal in an imbalanced dataset?
In an imbalanced scenario, caution is critical. Removing a subset that contains positive examples for a minority class might lead to a severe drop in performance for that class, even if overall accuracy remains stable. A potential approach is to measure importance not just with overall performance changes but also with class-specific metrics. If removing a subset drastically harms minority-class performance, keep it, even if the overall performance drop is small.
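One way to express that class-aware check in code, assuming a scikit-learn-style classifier and a fixed validation set; the helper name and the veto threshold in the usage comment are illustrative.

import numpy as np
from sklearn.metrics import recall_score

def per_class_recall_drop(model, data, labels, chunk_idx, valid_data, valid_labels):
    # Per-class recall with and without the chunk; returns one drop value per class.
    base = recall_score(valid_labels, model.fit(data, labels).predict(valid_data), average=None)
    reduced_data = np.delete(data, chunk_idx, axis=0)
    reduced_labels = np.delete(labels, chunk_idx, axis=0)
    red = recall_score(valid_labels,
                       model.fit(reduced_data, reduced_labels).predict(valid_data),
                       average=None)
    return base - red

# Usage pattern: veto the removal if any class (typically the minority) loses too
# much recall, even when the aggregate metric barely moves, e.g.
#   drops = per_class_recall_drop(clf, X, y, chunk_idx, Xv, yv)
#   keep_chunk = drops.max() > 0.05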
Is there a theoretical guarantee about removing redundant samples?
The algorithm is largely heuristic. It relies on empirical checks of performance drops rather than a formal guarantee that those data points are truly redundant in every possible model scenario. If the model used to evaluate redundancy is flexible enough and well-tuned, you can gain more confidence, but it still may not guarantee you are removing “truly” redundant data from every possible perspective.
Could we instead find data redundancies through clustering before any model training?
Yes, clustering-based methods (e.g., K-means or density-based clustering) are often used to identify near-duplicate or highly similar samples. By grouping them, you can pick representative points from each cluster and discard the rest. Such unsupervised approaches don’t require a full training loop for measuring performance, but they also don’t directly account for the label distribution or the model’s actual performance criteria. “The 10% You Don’t Need” approach, on the other hand, is more performance-driven.
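A small sketch of the clustering route, using K-means purely as an example: one representative per cluster is kept and the rest are discarded. The keep fraction, helper name, and synthetic data are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def cluster_representatives(data, keep_fraction=0.9, seed=0):
    # Cluster into keep_fraction * N groups; the sample closest to each centroid survives.
    n_clusters = max(1, int(keep_fraction * len(data)))
    km = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit(data)
    rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, data)
    return np.unique(rep_idx)

# Example: roughly 10% of a synthetic dataset is dropped as "crowded" points.
X = np.random.default_rng(0).normal(size=(200, 3))
kept = cluster_representatives(X)
print(len(kept), "of", len(X), "samples kept")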
How do we decide how many rounds of removal are sufficient?
This depends on your performance requirements and computational constraints. Some practitioners remove 10% of the data in each iteration, re-check performance, and continue until:
The performance drop exceeds a predefined threshold.
The dataset reaches a target size or memory footprint.
The time budget is reached and further iteration becomes infeasible.
You can also adaptively adjust the removal percentage (e.g., remove smaller slices once you approach more critical data).
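Tying these stopping rules back to the earlier pseudocode, a minimal outer loop might look like the following. It reuses the remove_10_percent_redundant routine sketched above and passes a fixed validation set so that drops stay comparable across rounds; the per-round threshold and round limit are illustrative.

def iterative_reduction(data, labels, valid_data, valid_labels, model_fn,
                        max_drop_per_round=0.005, target_size=None, max_rounds=10):
    # Strip the least important ~10% chunk per round until a stopping rule fires.
    for _ in range(max_rounds):
        new_data, new_labels = remove_10_percent_redundant(
            data, labels, model_fn, max_drop_per_round,
            valid_data=valid_data, valid_labels=valid_labels)
        if len(new_data) == len(data):          # no chunk fell under the drop threshold
            break
        data, labels = new_data, new_labels
        if target_size is not None and len(data) <= target_size:   # size budget reached
            break
    return data, labels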
How does this approach compare to feature selection or dimensionality reduction?
“The 10% You Don’t Need” focuses on instance-level redundancy, removing entire samples. Feature selection or dimensionality reduction focuses on reducing the number of features. Both aim to reduce complexity, but one reduces the dataset size (rows), and the other reduces the dimensionality (columns). Sometimes both are employed for optimal efficiency: you might first remove uninformative samples, then reduce features on the pruned dataset.
When might removing redundant data harm the model if not done carefully?
If the data you remove contains subtle, useful patterns that are not captured by the remaining samples.
If your chosen performance metric doesn’t fully reflect the real-world importance of certain data subsets (e.g., ignoring an important minority class).
If your training approach overfits to the portion of the data left behind, missing generalization signals from the removed data.
Each of these pitfalls can be mitigated by using multiple validation checks, balanced chunk selection, or advanced performance metrics that capture all classes and relevant conditions.
Below are additional follow-up questions
Could this algorithm accidentally remove rare but important edge cases if the performance metric is too coarse?
Even though the algorithm measures changes in overall performance, some subtle but crucial data points might get removed if the performance metric does not fully capture their significance. For instance, an overall accuracy measure might stay high even after removing rare but vital edge-case examples. In such scenarios, the algorithm might misclassify these instances as “redundant.” A potential pitfall arises if these rare examples are pivotal for robust real-world performance. When data contains anomalies or edge cases that appear infrequently, it is crucial to use more refined performance metrics or incorporate domain-specific knowledge. If you only rely on broad metrics (e.g., overall accuracy), you may unintentionally remove the handful of data points that are critical in certain operational contexts—such as outlier detection or safety-critical scenarios.
One way to mitigate this issue is to ensure you are using granular metrics that specifically measure performance on rare classes or unusual conditions. Another approach is to implement constraints that explicitly preserve certain data subsets. You might also define a critical threshold below which examples cannot be removed, ensuring that the algorithm never discards truly exceptional edge cases.
Can this approach be adapted for streaming or incremental data scenarios?
In streaming or incremental data setups, the dataset is not static, and new data points continuously arrive over time. Traditional “The 10% You Don’t Need” methods that rely on re-partitioning the entire dataset and retraining a model might be too time- or resource-intensive for each new batch of data. A challenge here is that repeatedly evaluating how each incoming chunk affects performance can become prohibitively expensive in an online context.
A key pitfall is that you might allow old, less relevant data to remain in the system, while new data that reflects more recent trends might be overshadowed. To handle this effectively, one strategy is to maintain a sliding window of the data that covers a limited time range. Periodically, you could remove the “least important” 10% from the current window, based on a measure of performance drop in a continuously updated model. If the dataset is particularly large, approximate methods—such as training a lightweight model or using partial-fit algorithms—can be used to estimate the importance of each chunk without incurring a high computational cost every time.
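A rough sketch of the sliding-window idea, assuming scikit-learn's SGDClassifier as the online model. To stay cheap, the chunk score here is a proxy rather than a full retrain-and-compare: chunks the current model already predicts well are treated as carrying little new signal. The class name, window size, chunk count, and class set are all illustrative.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

class SlidingWindowReducer:
    # Keep a bounded window of recent data; periodically drop the "easiest" ~10%.

    def __init__(self, window_size=10_000, classes=(0, 1), seed=0):
        self.window_X, self.window_y = None, None
        self.window_size = window_size
        self.classes = np.array(classes)
        self.model = SGDClassifier(random_state=seed)

    def add_batch(self, X, y):
        # Update the online model, append the batch, and trim the window if needed.
        self.model.partial_fit(X, y, classes=self.classes)
        if self.window_X is None:
            self.window_X, self.window_y = X, y
        else:
            self.window_X = np.vstack([self.window_X, X])
            self.window_y = np.concatenate([self.window_y, y])
        if len(self.window_y) > self.window_size:
            self._drop_least_useful()

    def _drop_least_useful(self):
        # Score ~10 chunks by how well the current model already predicts them;
        # the chunk with the highest accuracy is treated as the most redundant.
        n = len(self.window_y)
        chunk = max(1, n // 10)
        starts = list(range(0, n, chunk))
        scores = [accuracy_score(self.window_y[s:s + chunk],
                                 self.model.predict(self.window_X[s:s + chunk]))
                  for s in starts]
        s = starts[int(np.argmax(scores))]
        keep = np.ones(n, dtype=bool)
        keep[s:s + chunk] = False
        self.window_X, self.window_y = self.window_X[keep], self.window_y[keep]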
How do we ensure that we are not removing the same information repeatedly when the dataset contains many near-duplicate points?
A situation can arise where certain near-duplicates appear in different 10% chunks. Since each chunk is analyzed individually for performance contribution, the algorithm might fail to recognize overlaps between chunks. In a worst-case scenario, you might keep chunks that are mostly duplicates of other chunks you have already retained, while removing chunks that introduce slightly different but still relevant data.
One way to address this pitfall is to use a pre-processing step—such as detecting near-duplicates via hashing or clustering—before applying “The 10% You Don’t Need.” By merging near-duplicates or labeling them as a single group, the algorithm can look at that group holistically. Another option is to run iterative checks not just on the performance drop for the chunk in question, but also on the chunk’s overlap with previously retained data. This approach can reduce the chance of perpetually keeping near-duplicates while discarding truly unique samples.
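As an example of such a pre-processing step, the sketch below hashes coarsely rounded feature vectors so that near-identical rows collide, and keeps only the first row per bucket. The rounding precision and helper name are illustrative; locality-sensitive hashing or nearest-neighbour search are more robust alternatives for high-dimensional data.

import numpy as np

def dedup_indices(data, decimals=2):
    # Rows whose features match after coarse rounding fall into the same bucket;
    # only the first occurrence in each bucket is kept.
    keys = [row.tobytes() for row in np.round(data, decimals)]
    seen, keep = set(), []
    for i, k in enumerate(keys):
        if k not in seen:
            seen.add(k)
            keep.append(i)
    return np.array(keep)

# Example: appending 100 slightly perturbed copies barely changes the kept count.
X = np.random.default_rng(0).normal(size=(100, 4))
X_dup = np.vstack([X, X + 1e-4])
print(len(dedup_indices(X_dup)))   # close to 100 rather than 200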
How might domain knowledge influence which data is considered redundant?
In many real-world applications, domain expertise plays a vital role in distinguishing truly redundant points from those that superficially appear redundant but offer unique domain-specific insights. For instance, in medical imaging data, two images might look statistically similar but differ in crucial pathology markers that only a trained specialist can identify. A purely algorithmic approach could incorrectly flag such pairs as duplicates.
To handle this, experts could label certain segments of the dataset as “protected” or “must keep.” These might include corner cases, specialized examples, or historically underrepresented instances. The algorithm could then run only on the remaining data, ensuring that domain-critical information is never discarded. This interplay between domain knowledge and redundancy detection helps maintain the quality of the remaining dataset while still reaping the benefits of size reduction.
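In code, the “protected” idea can be as simple as filtering expert-flagged indices out of the candidate chunks before any importance scoring runs; the representation of chunks as index arrays and the helper name are illustrative.

import numpy as np

def removable_chunks(chunks, protected_idx):
    # Discard any candidate chunk that touches an expert-flagged "must keep" sample.
    protected = set(int(i) for i in protected_idx)
    return [c for c in chunks if not protected.intersection(int(i) for i in c)]

# Example: indices 5 and 42 were flagged by a domain expert, so only the middle
# chunk remains a removal candidate.
chunks = [np.arange(0, 10), np.arange(10, 20), np.arange(40, 50)]
print([c[0] for c in removable_chunks(chunks, [5, 42])])  # [10]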
Is there a risk that this algorithm removes important samples if the model used to evaluate performance is undertrained?
When evaluating how removal of a subset affects performance, the model might still be in an early or suboptimal training stage, especially if you’re using iterative methods or a small portion of data for quick checks. If the model hasn’t learned the intricacies of the data, it may incorrectly judge some subsets as unimportant simply because it lacks the capacity or training time to detect the subtle patterns within them.
A typical edge case arises if your model’s architecture or hyperparameters are poorly matched to the data. In that case, the performance differences might be misleadingly small or large. This can lead to removing data that later, with a better-tuned model, might prove beneficial. One possible mitigation is to run the “10% removal” check at multiple training checkpoints or use a well-validated, sufficiently powerful model for evaluation. Some practitioners adopt cross-validation approaches to ensure the model is not prematurely discarding data due to initial underfitting.
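One cross-validation-style variant of the importance check, assuming scikit-learn: the drop is averaged over several folds so that a single unlucky or undertrained fit does not decide a chunk's fate. Note that the folds are recomputed on the reduced data, so this trades a fixed validation set for noise reduction; the default model and fold count are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_drop(data, labels, chunk_idx, model=None, folds=5):
    # Average cross-validated score with and without the chunk.
    model = model if model is not None else LogisticRegression(max_iter=1000)
    baseline = cross_val_score(model, data, labels, cv=folds).mean()
    reduced_data = np.delete(data, chunk_idx, axis=0)
    reduced_labels = np.delete(labels, chunk_idx, axis=0)
    reduced = cross_val_score(model, reduced_data, reduced_labels, cv=folds).mean()
    return baseline - reduced   # small drop => the chunk is a removal candidate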
Could “The 10% You Don’t Need” algorithm inadvertently remove data that helps with model fairness?
Fairness concerns may arise if specific demographic groups are over-represented in the “redundant” subsets. Even if removing them has minimal effect on raw accuracy, it might exacerbate biases or reduce the model’s fairness metrics. This is particularly problematic when working with sensitive data. For instance, if the algorithm is purely performance-driven, it might remove subsets that include minority group representatives, thereby reducing the model’s exposure to important demographic variations.
An edge case might occur when the largest portion of redundant data is drawn from small populations that the main model sees as marginal. To avoid undermining fairness, you could apply constraints on the proportion of data from each demographic that can be removed. Or, you could measure the impact of subset removal on fairness metrics (e.g., equal opportunity difference) alongside the primary performance metric. This helps ensure that “The 10% You Don’t Need” approach does not inadvertently intensify discriminatory outcomes.
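A sketch of that kind of constraint, assuming a scikit-learn-style classifier, a held-out validation set with per-sample group labels, and an illustrative per-group accuracy budget (equal-opportunity-style metrics could be substituted for accuracy).

import numpy as np
from sklearn.metrics import accuracy_score

def fair_to_remove(model, data, labels, chunk_idx,
                   valid_data, valid_labels, valid_groups, max_group_drop=0.01):
    # Allow the removal only if no demographic group's validation accuracy
    # drops by more than max_group_drop.
    base_pred = model.fit(data, labels).predict(valid_data)
    keep = np.delete(np.arange(len(data)), chunk_idx)
    red_pred = model.fit(data[keep], labels[keep]).predict(valid_data)
    for g in np.unique(valid_groups):
        mask = valid_groups == g
        drop = (accuracy_score(valid_labels[mask], base_pred[mask])
                - accuracy_score(valid_labels[mask], red_pred[mask]))
        if drop > max_group_drop:      # this group would pay too high a price
            return False
    return True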
How do we handle real-time labeling and partial data situations?
In some scenarios, not all data points are fully labeled at the time of processing, or the labeling might be partial or noisy. The algorithm heavily relies on model performance measurements, which typically presuppose complete and correct labels. If a subset of data is unlabeled, the model might see removing that data as inconsequential to immediate performance, even though it might be labeled in the future and contribute to a more comprehensive training set.
This can lead to prematurely discarding data that, once labeled, could be highly informative. A possible remedy is to assign “placeholder” or semi-supervised labels where possible, or to adopt an active learning strategy that prioritizes labeling potentially valuable data. Another workaround is to keep unlabeled data separate from the fully labeled pool, only applying “The 10% You Don’t Need” to the fully labeled set so that you don’t remove potentially valuable unlabeled data prematurely.
Can we combine “The 10% You Don’t Need” with advanced data augmentation techniques?
Combining removal of redundant data with data augmentation is a nuanced practice. On one hand, if the dataset is large with many duplicated or near-duplicated samples, removing them before augmentation saves computation and ensures you are not repeatedly augmenting very similar data. On the other hand, certain augmentations can reveal new patterns from samples that seemed redundant. For example, a seemingly redundant image could become unique if augmented (e.g., rotated, cropped, color-jittered), thus broadening the diversity of your dataset.
A subtle pitfall is inadvertently removing a “base” sample that could have yielded useful augmented variants. This can be mitigated by deciding on an augmentation pipeline first, then determining the uniqueness or redundancy of the augmented data set. Alternatively, you can do a smaller scale data augmentation early, measure redundancy on the augmented data, and remove genuinely redundant augmented samples. The best practice typically depends on the domain and the complexity of the augmentation transformations.
Does this algorithm scale to distributed or federated learning environments?
In federated or distributed learning, data might reside on multiple nodes or devices, and privacy or bandwidth constraints often prevent centralized aggregation. Applying “The 10% You Don’t Need” in such scenarios becomes complicated because performance measurement across all data requires aggregated metrics, but the data itself might never fully reside on a single server.
An edge case appears if one node’s data is predominantly flagged as redundant without considering global diversity. This might reduce the global model’s exposure to certain local nuances. A partial solution involves each node locally identifying and discarding redundant data, followed by a joint performance metric evaluation that aggregates partial results. More complex protocols can ensure that distributed nodes do not remove data critical from a global perspective. However, designing such distributed redundancy removal is still an active area of research, and care must be taken not to infringe on data privacy constraints or degrade model performance for minority nodes.