When optimizing database queries, one crucial factor is estimating the Number of Distinct Values (NDV) after operations like filtering or sampling. Let’s explore how this works in a practical way.

What is NDV?

NDV (Number of Distinct Values) counts how many unique values exist in a dataset. For instance, in a table with 1000 customer records, a “country” column might have only 50 distinct values since many customers are from the same countries.

Why is NDV Estimation Important?

Accurate NDV estimation helps databases to:

  • Choose optimal query execution plans
  • Allocate memory and resources efficiently
  • Predict result set sizes accurately
  • Optimize join operations

How to Estimate NDV?

The estimation method adapts based on data characteristics. Let’s explore two main approaches:

1. High Cardinality Case

When data is highly unique (over 90% distinct values), we use a simple linear scaling method:

estimatedNDV = (distinctValues/totalRows) * newRowCount

Example:

  • Original data: 1000 rows with 950 distinct values (95% unique)
  • Target: Sample down to 100 rows
  • Calculation: (950/1000) * 100 = 95 distinct values expected

This method works well for columns like:

  • Primary keys
  • Timestamps
  • Unique identifiers
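The linear scaling above can be sketched in a few lines of Java (the class and method names here are ours, not from any particular database):

```java
public class LinearNdv {
    // High cardinality case: assume the distinct-value ratio survives sampling.
    static double estimateHighCardinalityNdv(double distinctValues, double totalRows,
                                             double newRowCount) {
        return (distinctValues / totalRows) * newRowCount;
    }

    public static void main(String[] args) {
        // 1000 rows with 950 distinct values, sampled down to 100 rows
        System.out.println(estimateHighCardinalityNdv(950, 1000, 100)); // ≈ 95
    }
}
```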

2. Normal Case: Poisson-based Estimation

For typical cases (less than 90% unique values), we use a Poisson-based formula:

estimatedNDV = currentNDV * (1 - Math.pow(1 - samplingRatio, totalRows/currentNDV))

Let’s break down this formula step by step:

Components Explained

  1. Sampling Ratio (samplingRatio = targetRows/totalRows)

    • Represents the fraction of data we’re keeping
    • Example: Sampling 200 rows from 1000 → ratio = 0.2 (20%)
  2. Average Frequency (totalRows/currentNDV)

    • Shows how often each distinct value appears on average
    • Example: 1000 rows with 200 distinct values → average frequency = 5
  3. Miss Probability (1 - samplingRatio)

    • Chance of not selecting a specific row
    • Example: 20% sampling → 80% (0.8) chance of missing each row
  4. Complete Miss Probability (Math.pow(1 - samplingRatio, averageFrequency))

    • Probability of missing all occurrences of a value, where averageFrequency = totalRows/currentNDV from step 2
    • Accounts for multiple occurrences of each value
  5. Final Estimation (currentNDV * (1 - [miss probability]))

    • Converts probabilities into estimated distinct values
    • Adjusts for the sampling size

Practical Example

Consider these parameters:

  • Total rows: 1000
  • Current distinct values: 200
  • Target sample size: 100

Calculation steps:

  1. Sampling ratio = 100/1000 = 0.1
  2. Average frequency = 1000/200 = 5
  3. Miss probability = 1 - 0.1 = 0.9
  4. Complete miss probability = 0.9^5 ≈ 0.59
  5. Survival probability = 1 - 0.59 = 0.41
  6. Estimated distinct values = 200 * 0.41 ≈ 82

This means we expect to see about 82 distinct values in our 100-row sample.
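The six calculation steps above can be reproduced directly in code (variable names follow the component breakdown; the class name is ours):

```java
public class PoissonNdv {
    // Step-by-step Poisson-based NDV estimate.
    static double estimate(double currentNDV, double totalRows, double targetRows) {
        double samplingRatio = targetRows / totalRows;           // 100/1000 = 0.1
        double avgFrequency  = totalRows / currentNDV;           // 1000/200  = 5
        double missProb      = 1 - samplingRatio;                // 0.9
        double completeMiss  = Math.pow(missProb, avgFrequency); // 0.9^5 ≈ 0.59
        return currentNDV * (1 - completeMiss);                  // 200 * 0.41 ≈ 82
    }

    public static void main(String[] args) {
        System.out.println(estimate(200, 1000, 100)); // ≈ 81.9
    }
}
```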

Implementation

Here’s the practical implementation combining both methods:

public static double calculateNDV(double distinctValues, double totalRows, double targetRows) {
    if (totalRows == 0) {
        return 0;
    }

    double ndvRatio = distinctValues / totalRows;   // fraction of rows that are distinct
    double samplingRatio = targetRows / totalRows;  // fraction of rows we keep

    return ndvRatio > 0.9
           // High cardinality case: scale the distinct count linearly
           ? ndvRatio * targetRows
           // Normal case: Poisson-based survival estimate
           : distinctValues * (1 - Math.pow(1 - samplingRatio, totalRows / distinctValues));
}
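To see both branches in action, the method can be exercised with the two scenarios discussed earlier (the wrapper class name NdvEstimator is ours; the method body is repeated so the snippet is self-contained):

```java
public class NdvEstimator {
    public static double calculateNDV(double distinctValues, double totalRows, double targetRows) {
        if (totalRows == 0) {
            return 0;
        }
        double ndvRatio = distinctValues / totalRows;
        double samplingRatio = targetRows / totalRows;
        return ndvRatio > 0.9
               ? ndvRatio * targetRows
               : distinctValues * (1 - Math.pow(1 - samplingRatio, totalRows / distinctValues));
    }

    public static void main(String[] args) {
        System.out.println(calculateNDV(950, 1000, 100)); // high cardinality branch: ≈ 95
        System.out.println(calculateNDV(200, 1000, 100)); // Poisson branch: ≈ 81.9
    }
}
```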

Real-World Applications

NDV estimation is crucial for:

  • Query plan optimization
  • Memory allocation
  • Cost-based optimization
  • Data sampling strategies
  • Resource management

Best Practices

  1. Choose the Right Method

    • Use linear scaling for highly unique data
    • Use Poisson-based estimation for normal cases
  2. Consider Data Distribution

    • Account for skewed data
    • Monitor estimation accuracy
  3. Regular Updates

    • Keep statistics up to date
    • Validate estimates periodically

Conclusion

NDV estimation is fundamental to query optimization. While the mathematics might seem complex, understanding these estimation methods helps in:

  • Building better database systems
  • Optimizing query performance
  • Making informed decisions about data operations

By choosing the appropriate estimation method and implementing it correctly, we can significantly improve database performance and resource utilization.