Term Frequency-Inverse Document Frequency (TF-IDF)
○ Introduction
In the context of information retrieval, TF-IDF (short for term frequency-inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. The term is composed of two factors: TF and IDF.
○ Term Frequency
Term frequency (TF) is the relative frequency of a term within a given document. It is obtained as the number of times a word appears in a text, divided by the total number of words appearing in the text. Mathematically, it is given by
\[
\text{TF}\left(t, d\right) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}},
\]
where \(f_{t, d}\) is the number of times that term \(t\) appears in the document \(d\). Note that the denominator is simply the total number of terms in the document \(d\), counting every instance of a given term.
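As a minimal sketch of this formula in plain Python (the function name and the tokenized input format are assumptions for illustration, not part of any library):

from collections import Counter

def term_frequency(t, d):
    # d is a document represented as a list of tokens; t is a single term
    counts = Counter(d)
    return counts[t] / len(d)

term_frequency("cat", ["the", "cat", "sat", "on", "the", "mat"])  # 1/6 ≈ 0.167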
○ Inverse Document Frequency
Inverse document frequency (IDF) measures how common or rare a word is across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that ratio. Mathematically, it is given by
\[
\text{IDF}\left(t, D\right) = \log \frac{N}{\left\vert \left\{ d \in D : t \in d \right\} \right\vert},
\]
where \(\left\vert S \right\vert\) denotes the cardinality of the set \(S\), \(N\) is the total number of documents in the corpus, i.e. \(N = \left\vert D \right\vert\), and the denominator \(\left\vert \left\{ d \in D : t \in d \right\} \right\vert\) is the number of documents where the term \(t\) appears.
If \(t\) does not appear in any document of the corpus, the denominator is zero and \(\text{IDF}\left(t, D\right)\) is undefined. To avoid this division by zero, the denominator is often adjusted to \(1 + \left\vert \left\{ d \in D : t \in d \right\} \right\vert\).
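A corresponding sketch for IDF, using the adjusted denominator above (again, the function name and input format are illustrative assumptions):

import math

def inverse_document_frequency(t, D):
    # D is a corpus represented as a list of tokenized documents
    n_containing = sum(1 for d in D if t in d)
    return math.log(len(D) / (1 + n_containing))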
○ TF-IDF
TF-IDF is the product of the two terms TF and IDF, i.e.
\[
\text{TF-IDF}\left(t, d, D\right) = \text{TF}\left(t, d\right) \cdot \text{IDF}\left(t, D\right).
\]
It evaluates how relevant a word is to a document in a collection of documents, taking into consideration that some words appear more frequently in general: the weight is highest for terms that occur often in a document but in few documents of the corpus, and it is low for terms that appear in nearly every document.
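Combining the two sketches above gives a hand-rolled version of this product:

def tf_idf(t, d, D):
    # weight the relative frequency of t in d by how rare t is across D
    return term_frequency(t, d) * inverse_document_frequency(t, D)

Note that library implementations such as scikit-learn's apply additional smoothing and normalization conventions, so their exact values differ from this bare formula.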
○ Text Vectorization
In order to perform machine learning on text data, we must transform the documents into vector representations. In natural language processing, text vectorization is the process of converting words, sentences, or even larger units of text data to numerical vectors.
The TfidfVectorizer class converts a collection of raw text documents to a matrix of TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
We create an instance of this class by setting the ngram_range parameter, with the default choice being \((1, 1)\).
TfidfVec = TfidfVectorizer(ngram_range=(1, 1))
Note: The parameter gives the lower and upper boundary of the range of \(n\)-values corresponding to different word \(n\)-grams to be extracted, i.e. all values of \(n\) such that \(\text{min}_n \leq n \leq \text{max}_n\) will be used. For example, an ngram_range of \((1, 1)\) means only words, \((1, 2)\) means words and bigrams, and \((2, 2)\) means only bigrams.
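To illustrate on a hypothetical one-sentence corpus (the exact features depend on the tokenization settings; by default, the tokenizer keeps only words of at least two characters):

TfidfVectorizer(ngram_range=(1, 2)).fit(["the cat sat"]).get_feature_names_out()
# array(['cat', 'cat sat', 'sat', 'the', 'the cat'], dtype=object)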
Now, we vectorize a list \(X\) of texts.
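For concreteness, X could be a small list of raw strings, such as the following hypothetical toy corpus:

X = [
    "the cat sat on the mat",
    "the dog sat on the log",
]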
X_tfidf = TfidfVec.fit_transform(X)
The feature names, i.e. the terms of the fitted vocabulary, are accessed through the get_feature_names_out method.
feature_names = TfidfVec.get_feature_names_out()
The output X_tfidf is a compressed sparse row (CSR) matrix. It can be converted to a dense array via toarray.
X_tfidf_arr = X_tfidf.toarray()
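Each row of the resulting array corresponds to a document and each column to a feature, so its shape serves as a quick sanity check:

X_tfidf_arr.shape  # (number of documents, number of features)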
X_tfidf can be converted to a pandas DataFrame in the following way.
import pandas as pd

X_tfidf_df = pd.DataFrame.sparse.from_spmatrix(
    data=X_tfidf,
    columns=feature_names,
)
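The DataFrame then has one row per document and one column per term, which makes the weights easy to inspect, for example with:

X_tfidf_df.head()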