When optimizing database queries, one crucial factor is estimating the Number of Distinct Values (NDV) after operations like filtering or sampling. Let’s explore how this works in a practical way.

What is NDV?

NDV (Number of Distinct Values) counts how many unique values exist in a dataset. For instance, in a table with 1000 customer records, a “country” column might have only 50 distinct values since many customers are from the same countries.

Why is NDV Estimation Important?

Accurate NDV estimation helps databases to:

  • Choose optimal query execution plans
  • Allocate memory and resources efficiently
  • Predict result set sizes accurately
  • Optimize join operations

How to Estimate NDV?

The estimation method adapts based on data characteristics. Let’s explore two main approaches:

1. High Cardinality Case

When data is highly unique (over 90% distinct values), we use a simple linear scaling method:

estimatedNDV = (distinctValues/totalRows) * newRowCount

Example:

  • Original data: 1000 rows with 950 distinct values (95% unique)
  • Target: Sample down to 100 rows
  • Calculation: (950/1000) * 100 = 95 distinct values expected

This method works well for columns like:

  • Primary keys
  • Timestamps
  • Unique identifiers
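The linear scaling above can be sketched in a few lines of Java (the class and method names here are ours, not from any particular database):

```java
public class LinearNdv {
    // High cardinality case: assume the distinct-value ratio survives sampling.
    static double estimateHighCardinalityNdv(double distinctValues, double totalRows,
                                             double newRowCount) {
        return (distinctValues / totalRows) * newRowCount;
    }

    public static void main(String[] args) {
        // 1000 rows with 950 distinct values, sampled down to 100 rows
        System.out.println(estimateHighCardinalityNdv(950, 1000, 100)); // ≈ 95
    }
}
```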

2. Normal Case: Poisson-based Estimation

For typical cases (less than 90% unique values), we use a Poisson-based formula:

estimatedNDV = currentNDV * (1 - Math.pow(1 - samplingRatio, totalRows/currentNDV))

Let’s break down this formula step by step:

Components Explained

  1. Sampling Ratio (samplingRatio = targetRows/totalRows)

    • Represents the fraction of data we’re keeping
    • Example: Sampling 200 rows from 1000 → ratio = 0.2 (20%)
  2. Average Frequency (totalRows/currentNDV)

    • Shows how often each distinct value appears on average
    • Example: 1000 rows with 200 distinct values → average frequency = 5
  3. Miss Probability (1 - samplingRatio)

    • Chance of not selecting a specific row
    • Example: 20% sampling → 80% (0.8) chance of missing each row
  4. Complete Miss Probability (Math.pow(1 - samplingRatio, averageFrequency))

    • Probability of missing all occurrences of a value, where averageFrequency = totalRows/currentNDV from step 2
    • Accounts for multiple occurrences of each value
  5. Final Estimation (currentNDV * (1 - [miss probability]))

    • Converts probabilities into estimated distinct values
    • Adjusts for the sampling size

Practical Example

Consider these parameters:

  • Total rows: 1000
  • Current distinct values: 200
  • Target sample size: 100

Calculation steps:

  1. Sampling ratio = 100/1000 = 0.1
  2. Average frequency = 1000/200 = 5
  3. Miss probability = 1 - 0.1 = 0.9
  4. Complete miss probability = 0.9^5 ≈ 0.59
  5. Survival probability = 1 - 0.59 = 0.41
  6. Estimated distinct values = 200 * 0.41 ≈ 82

This means we expect to see about 82 distinct values in our 100-row sample.
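The six calculation steps above can be reproduced directly in code (variable names follow the component breakdown; the class name is ours):

```java
public class PoissonNdv {
    // Step-by-step Poisson-based NDV estimate.
    static double estimate(double currentNDV, double totalRows, double targetRows) {
        double samplingRatio = targetRows / totalRows;           // 100/1000 = 0.1
        double avgFrequency  = totalRows / currentNDV;           // 1000/200  = 5
        double missProb      = 1 - samplingRatio;                // 0.9
        double completeMiss  = Math.pow(missProb, avgFrequency); // 0.9^5 ≈ 0.59
        return currentNDV * (1 - completeMiss);                  // 200 * 0.41 ≈ 82
    }

    public static void main(String[] args) {
        System.out.println(estimate(200, 1000, 100)); // ≈ 81.9
    }
}
```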

Implementation

Here’s the practical implementation combining both methods:

public static double calculateNDV(double distinctValues, double totalRows, double targetRows) {
    if (totalRows == 0) {
        return 0;
    }

    double ndvRatio = distinctValues / totalRows;   // fraction of rows that are distinct
    double samplingRatio = targetRows / totalRows;  // fraction of rows we keep

    return ndvRatio > 0.9
           // High cardinality case: scale the distinct count linearly
           ? ndvRatio * targetRows
           // Normal case: Poisson-based survival estimate
           : distinctValues * (1 - Math.pow(1 - samplingRatio, totalRows / distinctValues));
}
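To see both branches in action, the method can be exercised with the two scenarios discussed earlier (the wrapper class name NdvEstimator is ours; the method body is repeated so the snippet is self-contained):

```java
public class NdvEstimator {
    public static double calculateNDV(double distinctValues, double totalRows, double targetRows) {
        if (totalRows == 0) {
            return 0;
        }
        double ndvRatio = distinctValues / totalRows;
        double samplingRatio = targetRows / totalRows;
        return ndvRatio > 0.9
               ? ndvRatio * targetRows
               : distinctValues * (1 - Math.pow(1 - samplingRatio, totalRows / distinctValues));
    }

    public static void main(String[] args) {
        System.out.println(calculateNDV(950, 1000, 100)); // high cardinality branch: ≈ 95
        System.out.println(calculateNDV(200, 1000, 100)); // Poisson branch: ≈ 81.9
    }
}
```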

Real-World Applications

NDV estimation is crucial for:

  • Query plan optimization
  • Memory allocation
  • Cost-based optimization
  • Data sampling strategies
  • Resource management

Best Practices

  1. Choose the Right Method

    • Use linear scaling for highly unique data
    • Use Poisson-based estimation for normal cases
  2. Consider Data Distribution

    • Account for skewed data
    • Monitor estimation accuracy
  3. Regular Updates

    • Keep statistics up to date
    • Validate estimates periodically

Conclusion

NDV estimation is fundamental to query optimization. While the mathematics might seem complex, understanding these estimation methods helps in:

  • Building better database systems
  • Optimizing query performance
  • Making informed decisions about data operations

By choosing the appropriate estimation method and implementing it correctly, we can significantly improve database performance and resource utilization.