How Is TF-IDF Calculated?







Understanding TF-IDF with an Example

📌 What is TF-IDF?

TF-IDF stands for Term Frequency–Inverse Document Frequency. It is a weighting scheme used in information retrieval and text mining to score how important a word is to one document relative to the rest of a collection (corpus).

🔢 TF-IDF Formula

TF-IDF(term, document) = TF(term, document) × IDF(term)

Term Frequency (TF)

TF(t, d) = (Number of times term t appears in document d)
           ------------------------------------------------
           (Total number of terms in document d)
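As a minimal sketch of the TF formula above, the function below counts a term's occurrences and divides by the document length. The name term_frequency, the lowercase-and-split tokenization, and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def term_frequency(term, document):
    """Share of the document's tokens that are equal to `term`."""
    # Assumption: tokens are lowercase words separated by whitespace.
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # an empty document has no term frequencies
    return Counter(tokens)[term] / len(tokens)


# "be" is 2 of the 6 tokens in this sample sentence.
print(round(term_frequency("be", "to be or not to be"), 4))  # 0.3333
```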
    

Inverse Document Frequency (IDF)

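IDF(t) = log10 ( N / (1 + DF(t)) )

Where:
- N = total number of documents
- DF(t) = number of documents containing the term t
- log10 = base-10 logarithm (the base used throughout this example)
- The 1 added to DF(t) avoids division by zero for a term that appears in no document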
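The IDF formula can be sketched the same way. This version assumes a base-10 logarithm, which is the base the worked "mat" example in this post uses. The function name and the four-document sample corpus are hypothetical.

```python
import math


def inverse_document_frequency(term, documents):
    """log10(N / (1 + DF)), with DF = number of documents containing `term`."""
    n_docs = len(documents)
    # Assumption: a document "contains" a term if it matches a whitespace token.
    df = sum(1 for doc in documents if term in doc.lower().split())
    return math.log10(n_docs / (1 + df))


# Hypothetical corpus, used only to exercise the function.
sample_corpus = ["red apple", "green apple", "apple pie", "blue sky"]

print(inverse_document_frequency("apple", sample_corpus))          # 0.0
print(round(inverse_document_frequency("sky", sample_corpus), 4))  # 0.301
```

Because of the 1 in the denominator, a term that appears in every document has N / (1 + DF) below 1, so its IDF comes out negative rather than zero.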
    

✅ Example: 3 Sample Documents

D1: "the cat sat on the mat"
D2: "the dog sat on the log"
D3: "the cat chased the dog"
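Before working through the steps, it can help to see the raw counts for these three documents. The sketch below assumes simple whitespace tokenization and prints each document's length and the document frequency (DF) of every word. The variable names are illustrative.

```python
from collections import Counter

documents = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the log",
    "D3": "the cat chased the dog",
}

# Split each document into word tokens.
tokens = {name: text.split() for name, text in documents.items()}

# Count each word once per document, however often it repeats there.
document_frequency = Counter()
for words in tokens.values():
    document_frequency.update(set(words))

for name, words in tokens.items():
    print(name, len(words), "words")

for word, df in sorted(document_frequency.items()):
    print(f"DF({word}) = {df}")
```

D1 and D2 have 6 words each and D3 has 5. "the" appears in all three documents, "cat", "dog", "sat" and "on" in two, and "mat", "log" and "chased" in one.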
    

Step-by-Step: TF-IDF for the word “cat” in D1

Step 1: Calculate TF

“cat” appears once in D1, which has 6 words:

TF(cat, D1) = 1 / 6 ≈ 0.1667

Step 2: Calculate IDF

“cat” appears in D1 and D3 → DF(cat) = 2

IDF(cat) = log10(3 / (1 + 2)) = log10(1) = 0

Step 3: Calculate TF-IDF

TF-IDF(cat, D1) = 0.1667 × 0 = 0

→ The score is zero because “cat” appears in 2 of the 3 documents, so N / (1 + DF) = 1 and log10(1) = 0

Now Try: “mat” in D1

  • “mat” appears once in D1 (6 words total): TF = 1 / 6 ≈ 0.1667
  • “mat” appears only in D1 → DF = 1
  • IDF = log10(3 / (1 + 1)) = log10(1.5) ≈ 0.176
  • TF-IDF(mat, D1) ≈ 0.1667 × 0.176 ≈ 0.0293

→ Higher score than “cat” because “mat” appears in only one document
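The two hand calculations can be checked with a few lines of Python. This is a sketch under the same assumptions as the worked example (whitespace tokens, base-10 logarithm, 1 added to DF). The helper name tf_idf is illustrative.

```python
import math

docs = [
    "the cat sat on the mat",   # D1
    "the dog sat on the log",   # D2
    "the cat chased the dog",   # D3
]
corpus = [doc.split() for doc in docs]


def tf_idf(term, doc_tokens, corpus):
    """TF(term, document) x log10(N / (1 + DF(term)))."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)
    idf = math.log10(len(corpus) / (1 + df))
    return tf * idf


d1 = corpus[0]
cat_score = tf_idf("cat", d1, corpus)
mat_score = tf_idf("mat", d1, corpus)

print(f"TF-IDF(cat, D1) = {cat_score:.4f}")  # 0.0000
print(f"TF-IDF(mat, D1) = {mat_score:.4f}")  # 0.0293
```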

🧠 Interpretation

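  • Common words get low scores. "on" and "sat" appear in 2 of the 3 documents, so their TF-IDF is zero. "the" appears in all 3, so with this formula its IDF is log10(3 / 4) ≈ -0.125 and its TF-IDF is negative.
  • Rare words like "mat" get a higher TF-IDF and help identify what the document is about.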
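To see the whole picture for D1, the sketch below scores every distinct word in that document and sorts the results from highest to lowest. It uses the same formula and assumptions as the worked example (whitespace tokens, base-10 logarithm, 1 added to DF). The names used here are illustrative.

```python
import math

docs = [
    "the cat sat on the mat",   # D1
    "the dog sat on the log",   # D2
    "the cat chased the dog",   # D3
]
corpus = [doc.split() for doc in docs]
d1 = corpus[0]

scores = {}
for term in set(d1):
    tf = d1.count(term) / len(d1)
    df = sum(1 for tokens in corpus if term in tokens)
    idf = math.log10(len(corpus) / (1 + df))
    scores[term] = tf * idf

# Highest score first; ties are broken alphabetically for stable output.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
for term, score in ranked:
    print(f"{term:>4}  {score:+.4f}")
```

"mat" comes out on top, "cat", "on" and "sat" score exactly zero, and "the" lands at the bottom with a negative score of about -0.0416.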


