When optimizing database queries, one crucial factor is estimating the Number of Distinct Values (NDV) after operations like filtering or sampling. Let’s explore how this works in a practical way.
What is NDV?
NDV (Number of Distinct Values) counts how many unique values exist in a dataset. For instance, in a table with 1000 customer records, a “country” column might have only 50 distinct values since many customers are from the same countries.
Why is NDV Estimation Important?
Accurate NDV estimation helps databases to:
- Choose optimal query execution plans
- Allocate memory and resources efficiently
- Predict result set sizes accurately
- Optimize join operations
How to Estimate NDV?
The estimation method adapts based on data characteristics. Let’s explore two main approaches:
1. High Cardinality Case
When data is highly unique (over 90% distinct values), we use a simple linear scaling method:
estimatedNDV = (distinctValues/totalRows) * newRowCount
Example:
- Original data: 1000 rows with 950 distinct values (95% unique)
- Target: Sample down to 100 rows
- Calculation: (950/1000) * 100 = 95 distinct values expected
This method works well for columns like:
- Primary keys
- Timestamps
- Unique identifiers
2. Normal Case: Poisson-based Estimation
For typical cases (less than 90% unique values), we use a Poisson-based formula:
estimatedNDV = currentNDV * (1 - Math.pow(1 - samplingRatio, totalRows/currentNDV))
Let’s break down this formula step by step:
Components Explained
Sampling Ratio (
samplingRatio = targetRows/totalRows
)- Represents the fraction of data we’re keeping
- Example: Sampling 200 rows from 1000 → ratio = 0.2 (20%)
Average Frequency (
totalRows/currentNDV
)- Shows how often each distinct value appears on average
- Example: 1000 rows with 200 distinct values → average frequency = 5
Miss Probability (
1 - samplingRatio
)- Chance of not selecting a specific row
- Example: 20% sampling → 80% (0.8) chance of missing each row
Complete Miss Probability (
Math.pow(1 - samplingRatio, frequency)
)- Probability of missing all occurrences of a value
- Accounts for multiple occurrences of each value
Final Estimation (
currentNDV * (1 - [miss probability])
)- Converts probabilities into estimated distinct values
- Adjusts for the sampling size
Practical Example
Consider these parameters:
- Total rows: 1000
- Current distinct values: 200
- Target sample size: 100
Calculation steps:
- Sampling ratio = 100/1000 = 0.1
- Average frequency = 1000/200 = 5
- Miss probability = 1 - 0.1 = 0.9
- Complete miss probability = 0.9^5 ≈ 0.59
- Survival probability = 1 - 0.59 = 0.41
- Estimated distinct values = 200 * 0.41 ≈ 82
This means we expect to see about 82 distinct values in our 100-row sample.
Implementation
Here’s the practical implementation combining both methods:
public static double calculateNDV(double distinctValues, double totalRows, double targetRows) {
if (totalRows == 0) {
return 0;
}
double ndvRatio = distinctValues / totalRows;
double reductionRatio = targetRows / totalRows;
return ndvRatio > 0.9
? ndvRatio * targetRows // High cardinality case
: distinctValues * (1 - Math.pow(1 - reductionRatio, totalRows/distinctValues));
}
Real-World Applications
NDV estimation is crucial for:
- Query plan optimization
- Memory allocation
- Cost-based optimization
- Data sampling strategies
- Resource management
Best Practices
Choose the Right Method
- Use linear scaling for highly unique data
- Use Poisson-based estimation for normal cases
Consider Data Distribution
- Account for skewed data
- Monitor estimation accuracy
Regular Updates
- Keep statistics up to date
- Validate estimates periodically
Conclusion
NDV estimation is fundamental to query optimization. While the mathematics might seem complex, understanding these estimation methods helps in:
- Building better database systems
- Optimizing query performance
- Making informed decisions about data operations
By choosing the appropriate estimation method and implementing it correctly, we can significantly improve database performance and resource utilization.