Statistical Information in NLP

TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used text feature extraction method that measures how important a word is to a document. Beyond the standard formulation, a number of variants and extensions refine the feature extraction process. The most common ones are listed below:

  1. BM25

    • BM25 (Best Matching 25) is an improved version of TF-IDF that is widely used in information retrieval. It accounts for term-frequency saturation and normalizes for document length, so long and short documents are scored on a more even footing.

    • Formula:

    \[ \text{BM25}(q, D) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})} \]

    where \(f(q_i, D)\) is the frequency of term \(q_i\) in document \(D\), \(|D|\) is the document length, \(\text{avgdl}\) is the average document length, and \(k_1\) and \(b\) are tuning parameters.
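
    • Example: a minimal pure-Python sketch of this scoring function (the `+1` inside the IDF logarithm is the common non-negative variant; the defaults \(k_1 = 1.5\), \(b = 0.75\) are just typical choices):

```python
import math
from collections import Counter

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Score tokenized document `doc` against tokenized `query`.

    `corpus` is the list of all tokenized documents, used for the
    IDF statistics and the average document length (avgdl)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)         # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # kept non-negative
        f = tf[term]
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score
```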

  2. TF-CHI

    • This method combines term frequency (TF) with the chi-square test (CHI): alongside how often the term occurs, it uses the chi-square statistic to measure how strongly the term is associated with a class.

    • Formula:

    \[ \text{TF-CHI}(t, c) = \text{TF}(t, c) \cdot \chi^2(t, c) \]

    where \(\chi^2(t, c)\) is the chi-square statistic between term \(t\) and class \(c\).
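
    • Example: a sketch of \(\chi^2\) from the standard 2×2 contingency table, assuming documents are represented as token sets and `labels` holds the parallel class labels:

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic between `term` and `category` over a corpus
    of token sets, via the 2x2 contingency table (present/absent x in/out)."""
    N = len(docs)
    A = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    B = sum(1 for d, y in zip(docs, labels) if term in d and y != category)
    C = sum(1 for d, y in zip(docs, labels) if term not in d and y == category)
    D = N - A - B - C
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def tf_chi(tf_in_category, chi2):
    """TF-CHI(t, c) = TF(t, c) * chi^2(t, c)."""
    return tf_in_category * chi2
```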

  3. TF-IG

    • Combines term frequency with information gain (IG), using the gain to estimate how much the term contributes to classification.

    • Formula:

    \[ \text{TF-IG}(t, c) = \text{TF}(t, c) \cdot \text{IG}(t, c) \]

    where \(\text{IG}(t, c)\) is the information gain between term \(t\) and class \(c\).
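
    • Example: a sketch of the binary (one-vs-rest) form of information gain; the multi-class version replaces the binary entropies with sums over all categories:

```python
import math

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(docs, labels, term, category):
    """IG(t, c) = H(c) - P(t) H(c|t) - P(not t) H(c|not t)."""
    N = len(docs)
    in_c = [y == category for y in labels]
    has_t = [term in d for d in docs]
    p_c = sum(in_c) / N
    n_t = sum(has_t)
    p_c_t = sum(1 for c, t in zip(in_c, has_t) if c and t) / n_t if n_t else 0.0
    p_c_nt = sum(1 for c, t in zip(in_c, has_t) if c and not t) / (N - n_t) if N - n_t else 0.0
    return (binary_entropy(p_c)
            - (n_t / N) * binary_entropy(p_c_t)
            - ((N - n_t) / N) * binary_entropy(p_c_nt))
```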

  4. TF-RF

    • Combines term frequency with relevance frequency (RF), a supervised factor that captures how strongly the term is associated with a class.

    • Formula:

    \[ \text{TF-RF}(t, c) = \text{TF}(t, c) \cdot \text{RF}(t, c) \]

    where \(\text{RF}(t, c)\) is the relevance frequency of term \(t\) with respect to class \(c\).
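
    • Example: a sketch of the rf factor as defined by Lan et al. (2009), \(\mathrm{rf} = \log_2(2 + a / \max(1, b))\), where \(a\) and \(b\) count the documents containing the term inside and outside the category:

```python
import math

def relevance_frequency(docs, labels, term, category):
    """rf(t, c) = log2(2 + a / max(1, b)) per Lan et al. (2009), where
    a = in-category documents containing t, b = out-of-category ones."""
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    b = sum(1 for d, y in zip(docs, labels) if term in d and y != category)
    return math.log2(2 + a / max(1, b))
```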

  5. LDA-TF-IDF

    • Combines a topic model (LDA, Latent Dirichlet Allocation) with TF-IDF: LDA first infers topic distributions, and TF-IDF scores are then weighted by them.

    • Formula:

    \[ \text{LDA-TF-IDF}(t, D) = \text{TF-IDF}(t, D) \cdot P(z | D) \]

    where \(P(z | D)\) is the probability of topic \(z\) in document \(D\).
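
    • Example: one literal reading of the formula with scikit-learn, weighting each document's TF-IDF vector by that document's probability of a chosen topic \(z\) (the corpus and topic count are illustrative):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "stocks fell sharply today"]

tfidf = TfidfVectorizer().fit_transform(docs)            # TF-IDF(t, D)
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)                    # row d holds P(z | D_d)

z = 0                                                    # topic of interest
weighted = tfidf.multiply(doc_topic[:, [z]])             # TF-IDF(t, D) * P(z | D)
```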

  6. Okapi BM25+

    • A further refinement of BM25 (Lv & Zhai, 2011) that adds a lower-bound parameter \(\delta\) to the term-frequency component, so that merely containing a query term always contributes a minimum amount; this corrects BM25's over-penalization of very long documents.

    • Formula:

    \[ \text{BM25+}(q, D) = \sum_{i=1}^{n} IDF(q_i) \cdot \left( \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})} + \delta \right) \]
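
    • Example: the per-term change relative to plain BM25, following the Lv & Zhai formulation (\(\delta = 1\) is their suggested default):

```python
def bm25_plus_term(f, idf, doc_len, avgdl, k1=1.5, b=0.75, delta=1.0):
    """One query term's BM25+ contribution: the usual BM25 term score
    plus the lower bound `delta`, applied to terms that match the document."""
    tf_part = f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return idf * (tf_part + delta)
```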

  7. Weighted TF-IDF

    • Applies extra weights to the term frequency or inverse document frequency when computing TF-IDF, for example according to a term's part of speech, importance, or status as domain-specific vocabulary.

    • Formula:

    \[ \text{Weighted-TF-IDF}(t, d) = w_t \cdot \text{TF}(t, d) \cdot \text{IDF}(t) \]

    where \(w_t\) is the weight of term \(t\), set according to whatever criterion is appropriate.
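
    • Example: a sketch where the per-term weights come from a hand-built lookup table (the table itself is purely illustrative, boosting domain vocabulary and damping function words):

```python
def weighted_tfidf(tf, idf, term, weights, default=1.0):
    """Weighted-TF-IDF(t, d) = w_t * TF(t, d) * IDF(t)."""
    return weights.get(term, default) * tf * idf

# Hypothetical weight table: boost domain terms, damp function words.
weights = {"enzyme": 2.0, "protein": 1.5, "the": 0.1}
```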

  8. Logarithmic TF-IDF

    • Smooths term frequencies with a logarithmic transform, reducing the influence of very frequent terms.

    • Formula:

    \[ \text{Log-TF}(t, d) = \log(1 + \text{TF}(t, d)) \]

    \[ \text{Log-TF-IDF}(t, d) = \text{Log-TF}(t, d) \cdot \text{IDF}(t) \]
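
    • Example: a one-line sketch; note that \(\log(1 + \text{TF})\) is also well defined at TF = 0:

```python
import math

def log_tfidf(tf, idf):
    """Log-TF-IDF(t, d) = log(1 + TF(t, d)) * IDF(t)."""
    return math.log(1 + tf) * idf
```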

  9. Sublinear TF-IDF

    • Applies sublinear scaling to the term frequency (defined for TF > 0), which is common for large document collections.

    • Formula:

    \[ \text{Sublinear-TF}(t, d) = 1 + \log(\text{TF}(t, d)) \]

    \[ \text{Sublinear-TF-IDF}(t, d) = \text{Sublinear-TF}(t, d) \cdot \text{IDF}(t) \]
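
    • Example: scikit-learn exposes exactly this scaling through the `sublinear_tf` flag of `TfidfVectorizer`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# sublinear_tf=True replaces tf with 1 + log(tf) before the IDF weighting.
vectorizer = TfidfVectorizer(sublinear_tf=True)
X = vectorizer.fit_transform(["the cat sat", "the dog barked", "cats and dogs"])
```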

  10. Double Normalization TF-IDF

    • Standardizes the term frequency with double normalization, which is often used to compensate for differences in document length.

    • Formula:

    \[ \text{DoubleNorm-TF}(t, d) = 0.5 + 0.5 \cdot \frac{\text{TF}(t, d)}{\max_{t'} \text{TF}(t', d)} \]

    \[ \text{DoubleNorm-TF-IDF}(t, d) = \text{DoubleNorm-TF}(t, d) \cdot \text{IDF}(t) \]
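
    • Example: a sketch computing the double-normalized weight for every term of one tokenized document, with `idf` an assumed term → IDF lookup:

```python
from collections import Counter

def double_norm_tfidf(doc, idf):
    """DoubleNorm-TF-IDF for one tokenized document; `idf` maps term -> IDF(t)."""
    if not doc:
        return {}
    tf = Counter(doc)
    max_tf = max(tf.values())
    return {t: (0.5 + 0.5 * f / max_tf) * idf.get(t, 0.0) for t, f in tf.items()}
```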

  11. TF-IDF with Class-Based Weighting

    • Weights terms by how they are distributed across classes, to strengthen performance on classification tasks.

    • Formula:

    \[ \text{Class-TF-IDF}(t, d, c) = \text{TF-IDF}(t, d) \cdot \text{ClassWeight}(t, c) \]

    where \(\text{ClassWeight}(t, c)\) is the importance weight of term \(t\) within class \(c\).
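
    • Example: the section leaves \(\text{ClassWeight}\) unspecified; one simple, illustrative choice is the share of the term's document frequency that falls inside the target category:

```python
def class_weight(docs, labels, term, category):
    """Illustrative ClassWeight(t, c): fraction of the documents containing
    `term` that belong to `category` (1.0 = perfectly concentrated)."""
    df_in_c = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    df_all = sum(1 for d in docs if term in d)
    return df_in_c / df_all if df_all else 0.0

def class_tfidf(tfidf_value, docs, labels, term, category):
    """Class-TF-IDF(t, d, c) = TF-IDF(t, d) * ClassWeight(t, c)."""
    return tfidf_value * class_weight(docs, labels, term, category)
```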

  12. TF-IDF with Query Expansion

    • Uses TF-IDF during query expansion, enlarging the initial query so that it covers more related terms.

    • Formula:

    \[ \text{Expanded-TF-IDF}(q, d) = \sum_{t \in q \cup E(q)} \text{TF-IDF}(t, d) \]

    where \(E(q)\) is the set of expansion terms for query \(q\).
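
    • Example: a sketch where the expansion set \(E(q)\) comes from a hypothetical synonym table and the document is represented as a term → TF-IDF mapping:

```python
def expanded_tfidf(query_terms, expansion, doc_tfidf):
    """Sum TF-IDF over the original query terms plus their expansions E(q)."""
    terms = set(query_terms)
    for t in query_terms:
        terms.update(expansion.get(t, ()))
    return sum(doc_tfidf.get(t, 0.0) for t in terms)

# Hypothetical synonym table standing in for E(q):
score = expanded_tfidf(["car"],
                       {"car": ["automobile", "vehicle"]},
                       {"car": 0.4, "automobile": 0.2, "bike": 0.3})  # -> 0.6
```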

  13. Smooth Inverse Frequency (SIF)

    • A word-vector weighting scheme that uses smoothed inverse frequency to reduce the influence of very common words.

    • Formula:

    \[ \text{SIF}(t) = \frac{a}{a + \text{TF}(t)} \]

    where \(\text{TF}(t)\) is the term's relative corpus frequency and \(a\) is a small smoothing parameter.
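
    • Example: a sketch of a SIF-weighted sentence embedding, with `word_prob` the estimated corpus frequency \(p(w)\) and \(a = 10^{-3}\) the value suggested by Arora et al. (2017); their full method additionally removes the first principal component afterwards:

```python
import numpy as np

def sif_embedding(tokens, word_vecs, word_prob, a=1e-3):
    """Average of word vectors weighted by a / (a + p(w))."""
    vecs = [a / (a + word_prob[w]) * word_vecs[w]
            for w in tokens if w in word_vecs and w in word_prob]
    return np.mean(vecs, axis=0) if vecs else None
```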

References

  1. BM25

    • Robertson, S. E., Walker, S., Beaulieu, M. M., Gatford, M., & Payne, A. (1996). Okapi at TREC-4. In D. K. Harman (Ed.), NIST Special Publication 500-236: Proceedings of The Fourth Text REtrieval Conference (TREC-4) (pp. 73-96). Gaithersburg, MD: National Institute of Standards and Technology.
  2. TF-CHI

    • Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML) (pp. 412-420). San Francisco, CA: Morgan Kaufmann.
  3. TF-IG

    • Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML) (pp. 412-420). San Francisco, CA: Morgan Kaufmann.
  4. TF-RF

    • Lan, M., Tan, C. L., Su, J., & Lu, Y. (2009). Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 721-735. https://doi.org/10.1109/TPAMI.2008.110
  5. LDA-TF-IDF

    • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
  6. Okapi BM25+

    • Lv, Y., & Zhai, C. (2011). Lower-bounding term frequency normalization. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 7-16). ACM. https://doi.org/10.1145/2063576.2063581
    • Robertson, S., Zaragoza, H., & Taylor, M. (2004). Simple BM25 extension to multiple weighted fields. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management (pp. 42-49). ACM. https://doi.org/10.1145/1031171.1031181

  7. Weighted TF-IDF

    • Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc.
  8. Logarithmic TF-IDF

  9. Sublinear TF-IDF

    • Lv, Y., & Zhai, C. (2011). Lower-bounding term frequency normalization. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 7-16). ACM. https://doi.org/10.1145/2063576.2063581
  10. Double Normalization TF-IDF

    • Spärck Jones, K., Walker, S., & Robertson, S. E. (2000). A probabilistic model of information retrieval: development and comparative experiments: Part 2. Information Processing & Management, 36(6), 809-840. https://doi.org/10.1016/S0306-4573(00)00016-9
  11. TF-IDF with Class-Based Weighting

  12. TF-IDF with Query Expansion

    • Xu, J., & Croft, W. B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems (TOIS), 18(1), 79-112. https://doi.org/10.1145/333135.333138
  13. Smooth Inverse Frequency (SIF)

    • Arora, S., Liang, Y., & Ma, T. (2017). A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR).