Term Frequency-Inverse Document Frequency (TF-IDF)
○ Introduction
In the context of information retrieval, TF-IDF (short for term frequency-inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. The term is composed of two factors: TF and IDF.
○ Term Frequency
Term frequency (TF) is the relative frequency of a term within a given document. It is obtained as the number of times a word appears in a text, divided by the total number of words appearing in the text. Mathematically, it is given by
\[
\text{TF}\left(t, d\right) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}},
\]
where \(f_{t, d}\) is the number of times that term \(t\) appears in the document \(d\). Note that the denominator is simply the total number of terms in the document \(d\), counting every instance of a given term.
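As a minimal sketch of this formula in plain Python (the function name and the tokenized input format are assumptions for illustration, not part of any library):

from collections import Counter

def term_frequency(t, d):
    # d is a document represented as a list of tokens; t is a single term
    counts = Counter(d)
    return counts[t] / len(d)

term_frequency("cat", ["the", "cat", "sat", "on", "the", "mat"])  # 1/6 ≈ 0.167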
○ Inverse Document Frequency
Inverse document frequency (IDF) measures how common or rare a word is across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that ratio. Mathematically, it is given by
\[
\text{IDF}\left(t, D\right) = \log \frac{N}{\left\vert \left\{ d \in D : t \in d \right\} \right\vert},
\]
where \(\left\vert S \right\vert\) denotes the cardinality of the set \(S\), \(N\) is the total number of documents in the corpus, i.e. \(N = \left\vert D \right\vert\), and the denominator \(\left\vert \left\{ d \in D : t \in d \right\} \right\vert\) is the number of documents where the term \(t\) appears.
If \(t\) does not appear in any document of the corpus, the denominator is zero and \(\text{IDF}\left(t, D\right)\) is undefined. To avoid this division by zero, the denominator is often adjusted to \(1 + \left\vert \left\{ d \in D : t \in d \right\} \right\vert\).
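A corresponding sketch for IDF, using the adjusted denominator above (again, the function name and input format are illustrative assumptions):

import math

def inverse_document_frequency(t, D):
    # D is a corpus represented as a list of tokenized documents
    n_containing = sum(1 for d in D if t in d)
    return math.log(len(D) / (1 + n_containing))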
○ TF-IDF
TF-IDF is the product of the two terms TF and IDF, i.e.
\[
\text{TF-IDF}\left(t, d, D\right) = \text{TF}\left(t, d\right) \cdot \text{IDF}\left(t, D\right).
\]
It evaluates how relevant a word is to a document in a collection of documents, taking into consideration that some words appear more frequently in general: the weight is highest for terms that occur often in a document but in few documents of the corpus, and it is low for terms that appear in nearly every document.
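Combining the two sketches above gives a hand-rolled version of this product:

def tf_idf(t, d, D):
    # weight the relative frequency of t in d by how rare t is across D
    return term_frequency(t, d) * inverse_document_frequency(t, D)

Note that library implementations such as scikit-learn's apply additional smoothing and normalization conventions, so their exact values differ from this bare formula.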
○ Text Vectorization
In order to perform machine learning on text data, we must transform the documents into vector representations. In natural language processing, text vectorization is the process of converting words, sentences, or even larger units of text data to numerical vectors.
The TfidfVectorizer class converts a collection of raw text documents to a matrix of TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
We create an instance of this class by setting the ngram_range parameter, with the default choice being \((1, 1)\).
TfidfVec = TfidfVectorizer(ngram_range=(1, 1))
Note: The parameter gives the lower and upper boundary of the range of \(n\)-values corresponding to different word \(n\)-grams to be extracted, i.e. all values of \(n\) such that \(\text{min}_n \leq n \leq \text{max}_n\) will be used. For example, an ngram_range of \((1, 1)\) means only words, \((1, 2)\) means words and bigrams, and \((2, 2)\) means only bigrams.
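To illustrate on a hypothetical one-sentence corpus (the exact features depend on the tokenization settings; by default, the tokenizer keeps only words of at least two characters):

TfidfVectorizer(ngram_range=(1, 2)).fit(["the cat sat"]).get_feature_names_out()
# array(['cat', 'cat sat', 'sat', 'the', 'the cat'], dtype=object)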
Now, we vectorize a list \(X\) of texts.
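For concreteness, X could be a small list of raw strings, such as the following hypothetical toy corpus:

X = [
    "the cat sat on the mat",
    "the dog sat on the log",
]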
X_tfidf = TfidfVec.fit_transform(X)
The feature names, i.e. the terms of the fitted vocabulary, are accessed through the get_feature_names_out method.
feature_names = TfidfVec.get_feature_names_out()
The output X_tfidf is a compressed sparse row (CSR) matrix. It can be converted to a dense array via toarray.
X_tfidf_arr = X_tfidf.toarray()
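Each row of the resulting array corresponds to a document and each column to a feature, so its shape serves as a quick sanity check:

X_tfidf_arr.shape  # (number of documents, number of features)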
X_tfidf can be converted to a pandas DataFrame in the following way.
import pandas as pd

X_tfidf_df = pd.DataFrame.sparse.from_spmatrix(
    data=X_tfidf,
    columns=feature_names,
)
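The DataFrame then has one row per document and one column per term, which makes the weights easy to inspect, for example with:

X_tfidf_df.head()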