Gini Impurity: Two Equivalent Formulas
Gini impurity is a measure used in decision tree algorithms (like CART) to quantify how “impure” a node is, that is, how mixed its classes are.
✅ Formula 1: Basic Form
G = 1 - Σ(pᵢ²)
Where:
pᵢ is the probability (or proportion) of class i in the node.
n is the number of classes (the sum runs over i = 1 to n).
This formula calculates the probability that two items randomly chosen (with replacement) from the set belong to different classes.
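As a minimal sketch of Formula 1, the function below estimates each pᵢ as the share of labels in a node and returns 1 - Σ(pᵢ²). The function name `gini_impurity` and the choice to treat an empty node as pure are assumptions made for illustration, not part of any particular library.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2), where p_i is the proportion of class i in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty node is treated as pure
    counts = Counter(labels)  # class -> number of items in the node
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 (two classes, evenly mixed)
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0 (pure node)
```

A pure node gives 0, and an even two-class split gives 0.5, the maximum for two classes.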
✅ Formula 2: Pairwise Form
G = Σ(pᵢ × pⱼ), for all i ≠ j
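This version sums the probability of every ordered pair of distinct classes (i, j), so it directly gives the total probability that two items drawn at random (with replacement) come from different classes. Because the pairs are ordered, each combination of two classes is counted twice, once as (i, j) and once as (j, i).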
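The pairwise form can be sketched as a double loop over class proportions. The helper name `gini_pairwise` is hypothetical, and the input is assumed to be a list of proportions that already sum to 1.

```python
def gini_pairwise(proportions):
    """Sum p_i * p_j over all ordered pairs (i, j) with i != j."""
    total = 0.0
    for i, p_i in enumerate(proportions):
        for j, p_j in enumerate(proportions):
            if i != j:  # skip same-class pairs
                total += p_i * p_j
    return total


print(gini_pairwise([0.5, 0.5]))  # 0.5
```

The double loop makes the cost quadratic in the number of classes, which is one reason this form is rarely computed directly.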
✅ Why Are They Equivalent?
Because:
Σ(pᵢ × pⱼ) for all i, j = (Σ pᵢ)² = 1
Subtracting the same-class terms (i = j), which sum to Σ(pᵢ²), leaves:
Σ(pᵢ × pⱼ) for i ≠ j = 1 - Σ(pᵢ²)
Hence:
G = 1 - Σ(pᵢ²) = Σ(pᵢ × pⱼ) for i ≠ j
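The steps of this argument can be checked numerically. The sketch below uses exact fractions so that no floating-point rounding is involved; the proportions 1/2, 3/10, and 1/5 are made-up values chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical class proportions; they sum to 1.
p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
k = len(p)

all_pairs = sum(p[i] * p[j] for i in range(k) for j in range(k))  # equals (sum of p_i)^2
diagonal = sum(p_i * p_i for p_i in p)                            # sum of p_i^2
off_diagonal = sum(p[i] * p[j] for i in range(k) for j in range(k) if i != j)

print(all_pairs)                     # 1
print(diagonal)                      # 19/50
print(off_diagonal)                  # 31/50
print(off_diagonal == 1 - diagonal)  # True
```

Both formulas give 31/50 = 0.62 for this node.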
✅ Summary of the Differences
Aspect | Formula 1 (1 - Σ(pᵢ²)) | Formula 2 (Σ(pᵢ × pⱼ), i ≠ j) |
---|---|---|
Simpler to compute | ✅ Yes | ❌ More complex (double sum) |
Intuitive meaning | Easy: “1 minus the sum of squared probabilities” | Direct: “sum of all cross-class probabilities” |
Used in practice | Very commonly used | Rarely used explicitly |
Mathematically cleaner | Yes | Equivalent but more verbose |