Understanding TF-IDF with Example
📌 What is TF-IDF?
TF-IDF stands for Term Frequency–Inverse Document Frequency. It’s a technique used in information retrieval and text mining to evaluate how important a word is to a document in a collection or corpus.
🔢 TF-IDF Formula
TF-IDF(term, document) = TF(term, document) × IDF(term)
Term Frequency (TF)
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
Inverse Document Frequency (IDF)
IDF(t) = log( N / (1 + DF(t)) )

Where:
- N = total number of documents
- DF(t) = number of documents containing the term t
- log is base 10 in the examples below; the 1 + DF(t) in the denominator is a common smoothing that avoids division by zero for unseen terms
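The two formulas above translate directly into Python. This is a minimal sketch (function names and the whitespace tokenization are mine), using the same base-10 log and 1 + DF(t) smoothing as the formula:

```python
import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of term / total tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # Inverse document frequency with 1 + DF(t) smoothing, log base 10
    n = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(n / (1 + df))

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)
```

Each document is represented as a list of tokens, e.g. `"the cat sat on the mat".split()`.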
✅ Example: 3 Sample Documents
D1: "the cat sat on the mat"
D2: "the dog sat on the log"
D3: "the cat chased the dog"
Step-by-Step: TF-IDF for the word “cat” in D1
Step 1: Calculate TF
“cat” appears once in D1, which has 6 words:
TF(cat, D1) = 1 / 6 ≈ 0.1667
Step 2: Calculate IDF
“cat” appears in D1 and D3 → DF(cat) = 2
IDF(cat) = log(3 / (1 + 2)) = log(1) = 0
TF-IDF
TF-IDF(cat, D1) = 0.1667 × 0 = 0
→ Low score because “cat” is common across documents
Now Try: “mat” in D1
- “mat” appears once in D1 (6 words total):
TF = 1 / 6 ≈ 0.1667
- “mat” appears only in D1 → DF = 1
IDF(mat) = log(3 / (1 + 1)) = log(1.5) ≈ 0.176
TF-IDF(mat, D1) ≈ 0.1667 × 0.176 ≈ 0.0293
→ Higher score than “cat” because it’s more unique
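Both hand calculations can be checked with a short script. This is a sketch (the corpus layout and helper name are mine) that follows the same base-10 log and 1 + DF(t) smoothing used above:

```python
import math

docs = {
    "D1": "the cat sat on the mat".split(),
    "D2": "the dog sat on the log".split(),
    "D3": "the cat chased the dog".split(),
}

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term) / len(docs[doc_id])
    df = sum(1 for tokens in docs.values() if term in tokens)
    idf = math.log10(len(docs) / (1 + df))
    return tf * idf

print(round(tf_idf("cat", "D1"), 4))  # 0.0   (IDF is zero: "cat" is in 2 of 3 docs)
print(round(tf_idf("mat", "D1"), 4))  # 0.0293 ("mat" appears only in D1)
```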
🧠 Interpretation
- Common words like "the", "on", and "sat" have low or zero TF-IDF.
- Unique words like "mat" get higher TF-IDF and help identify what the document is about.
💡 Want More?
The same idea can be implemented in Python with scikit-learn.