Statistical Information in NLP

TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used text feature extraction method that measures how important a word is to a document. Beyond the standard formulation, a number of variants and extensions refine the feature extraction process. The most common ones are listed below:

  1. BM25

    • BM25 (Best Matching 25) is an improved version of TF-IDF that is widely used in information retrieval. It accounts for term-frequency saturation and normalizes for document length, so long and short documents are scored on a more even footing.

    • Formula:

    \[ \text{BM25}(q, D) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})} \]

    where \(f(q_i, D)\) is the frequency of term \(q_i\) in document \(D\), \(|D|\) is the document length, \(\text{avgdl}\) is the average document length, and \(k_1\) and \(b\) are tuning parameters.
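
    • Example: a minimal pure-Python sketch of this scoring function (the `+1` inside the IDF logarithm is the common non-negative variant; the defaults \(k_1 = 1.5\), \(b = 0.75\) are just typical choices):

```python
import math
from collections import Counter

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Score tokenized document `doc` against tokenized `query`.

    `corpus` is the list of all tokenized documents, used for the
    IDF statistics and the average document length (avgdl)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)         # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # kept non-negative
        f = tf[term]
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score
```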

  2. TF-CHI

    • This method combines term frequency (TF) with the chi-square test (CHI): alongside how often the term occurs, it uses the chi-square statistic to measure how strongly the term is associated with a class.

    • Formula:

    \[ \text{TF-CHI}(t, c) = \text{TF}(t, c) \cdot \chi^2(t, c) \]

    where \(\chi^2(t, c)\) is the chi-square statistic between term \(t\) and class \(c\).
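
    • Example: a sketch of \(\chi^2\) from the standard 2×2 contingency table, assuming documents are represented as token sets and `labels` holds the parallel class labels:

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic between `term` and `category` over a corpus
    of token sets, via the 2x2 contingency table (present/absent x in/out)."""
    N = len(docs)
    A = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    B = sum(1 for d, y in zip(docs, labels) if term in d and y != category)
    C = sum(1 for d, y in zip(docs, labels) if term not in d and y == category)
    D = N - A - B - C
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def tf_chi(tf_in_category, chi2):
    """TF-CHI(t, c) = TF(t, c) * chi^2(t, c)."""
    return tf_in_category * chi2
```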

  3. TF-IG

    • Combines term frequency with information gain (IG), using the gain to estimate how much the term contributes to classification.

    • Formula:

    \[ \text{TF-IG}(t, c) = \text{TF}(t, c) \cdot \text{IG}(t, c) \]

    where \(\text{IG}(t, c)\) is the information gain between term \(t\) and class \(c\).
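
    • Example: a sketch of the binary (one-vs-rest) form of information gain; the multi-class version replaces the binary entropies with sums over all categories:

```python
import math

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(docs, labels, term, category):
    """IG(t, c) = H(c) - P(t) H(c|t) - P(not t) H(c|not t)."""
    N = len(docs)
    in_c = [y == category for y in labels]
    has_t = [term in d for d in docs]
    p_c = sum(in_c) / N
    n_t = sum(has_t)
    p_c_t = sum(1 for c, t in zip(in_c, has_t) if c and t) / n_t if n_t else 0.0
    p_c_nt = sum(1 for c, t in zip(in_c, has_t) if c and not t) / (N - n_t) if N - n_t else 0.0
    return (binary_entropy(p_c)
            - (n_t / N) * binary_entropy(p_c_t)
            - ((N - n_t) / N) * binary_entropy(p_c_nt))
```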

  4. TF-RF

    • Combines term frequency with relevance frequency (RF), a supervised factor that captures how strongly the term is associated with a class.

    • Formula:

    \[ \text{TF-RF}(t, c) = \text{TF}(t, c) \cdot \text{RF}(t, c) \]

    where \(\text{RF}(t, c)\) is the relevance frequency of term \(t\) with respect to class \(c\).
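
    • Example: a sketch of the rf factor as defined by Lan et al. (2009), \(\mathrm{rf} = \log_2(2 + a / \max(1, b))\), where \(a\) and \(b\) count the documents containing the term inside and outside the category:

```python
import math

def relevance_frequency(docs, labels, term, category):
    """rf(t, c) = log2(2 + a / max(1, b)) per Lan et al. (2009), where
    a = in-category documents containing t, b = out-of-category ones."""
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    b = sum(1 for d, y in zip(docs, labels) if term in d and y != category)
    return math.log2(2 + a / max(1, b))
```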

  5. LDA-TF-IDF

    • Combines a topic model (LDA, Latent Dirichlet Allocation) with TF-IDF: LDA first infers topic distributions, and TF-IDF scores are then weighted by them.

    • Formula:

    \[ \text{LDA-TF-IDF}(t, D) = \text{TF-IDF}(t, D) \cdot P(z | D) \]

    where \(P(z | D)\) is the probability of topic \(z\) in document \(D\).
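
    • Example: one literal reading of the formula with scikit-learn, weighting each document's TF-IDF vector by that document's probability of a chosen topic \(z\) (the corpus and topic count are illustrative):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "stocks fell sharply today"]

tfidf = TfidfVectorizer().fit_transform(docs)            # TF-IDF(t, D)
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)                    # row d holds P(z | D_d)

z = 0                                                    # topic of interest
weighted = tfidf.multiply(doc_topic[:, [z]])             # TF-IDF(t, D) * P(z | D)
```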

  6. Okapi BM25+

    • A further refinement of BM25 (Lv & Zhai, 2011) that adds a lower-bound parameter \(\delta\) to the term-frequency component, so that merely containing a query term always contributes a minimum amount; this corrects BM25's over-penalization of very long documents.

    • Formula:

    \[ \text{BM25+}(q, D) = \sum_{i=1}^{n} IDF(q_i) \cdot \left( \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})} + \delta \right) \]
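
    • Example: the per-term change relative to plain BM25, following the Lv & Zhai formulation (\(\delta = 1\) is their suggested default):

```python
def bm25_plus_term(f, idf, doc_len, avgdl, k1=1.5, b=0.75, delta=1.0):
    """One query term's BM25+ contribution: the usual BM25 term score
    plus the lower bound `delta`, applied to terms that match the document."""
    tf_part = f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return idf * (tf_part + delta)
```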

  7. Weighted TF-IDF

    • Applies extra weights to the term frequency or inverse document frequency when computing TF-IDF, for example according to a term's part of speech, importance, or status as domain-specific vocabulary.

    • Formula:

    \[ \text{Weighted-TF-IDF}(t, d) = w_t \cdot \text{TF}(t, d) \cdot \text{IDF}(t) \]

    where \(w_t\) is the weight of term \(t\), set according to whatever criterion is appropriate.
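
    • Example: a sketch where the per-term weights come from a hand-built lookup table (the table itself is purely illustrative, boosting domain vocabulary and damping function words):

```python
def weighted_tfidf(tf, idf, term, weights, default=1.0):
    """Weighted-TF-IDF(t, d) = w_t * TF(t, d) * IDF(t)."""
    return weights.get(term, default) * tf * idf

# Hypothetical weight table: boost domain terms, damp function words.
weights = {"enzyme": 2.0, "protein": 1.5, "the": 0.1}
```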

  8. Logarithmic TF-IDF

    • Smooths term frequencies with a logarithmic transform, reducing the influence of very frequent terms.

    • Formula:

    \[ \text{Log-TF}(t, d) = \log(1 + \text{TF}(t, d)) \]

    \[ \text{Log-TF-IDF}(t, d) = \text{Log-TF}(t, d) \cdot \text{IDF}(t) \]
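
    • Example: a one-line sketch; note that \(\log(1 + \text{TF})\) is also well defined at TF = 0:

```python
import math

def log_tfidf(tf, idf):
    """Log-TF-IDF(t, d) = log(1 + TF(t, d)) * IDF(t)."""
    return math.log(1 + tf) * idf
```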

  9. Sublinear TF-IDF

    • Applies sublinear scaling to the term frequency (defined for TF > 0), which is common for large document collections.

    • Formula:

    \[ \text{Sublinear-TF}(t, d) = 1 + \log(\text{TF}(t, d)) \]

    \[ \text{Sublinear-TF-IDF}(t, d) = \text{Sublinear-TF}(t, d) \cdot \text{IDF}(t) \]
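
    • Example: scikit-learn exposes exactly this scaling through the `sublinear_tf` flag of `TfidfVectorizer`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# sublinear_tf=True replaces tf with 1 + log(tf) before the IDF weighting.
vectorizer = TfidfVectorizer(sublinear_tf=True)
X = vectorizer.fit_transform(["the cat sat", "the dog barked", "cats and dogs"])
```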

  10. Double Normalization TF-IDF

    • Standardizes the term frequency with double normalization, which is often used to compensate for differences in document length.

    • Formula:

    \[ \text{DoubleNorm-TF}(t, d) = 0.5 + 0.5 \cdot \frac{\text{TF}(t, d)}{\max_{t'} \text{TF}(t', d)} \]

    \[ \text{DoubleNorm-TF-IDF}(t, d) = \text{DoubleNorm-TF}(t, d) \cdot \text{IDF}(t) \]
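
    • Example: a sketch computing the double-normalized weight for every term of one tokenized document, with `idf` an assumed term → IDF lookup:

```python
from collections import Counter

def double_norm_tfidf(doc, idf):
    """DoubleNorm-TF-IDF for one tokenized document; `idf` maps term -> IDF(t)."""
    if not doc:
        return {}
    tf = Counter(doc)
    max_tf = max(tf.values())
    return {t: (0.5 + 0.5 * f / max_tf) * idf.get(t, 0.0) for t, f in tf.items()}
```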

  11. TF-IDF with Class-Based Weighting

    • Weights terms by how they are distributed across classes, to strengthen performance on classification tasks.

    • Formula:

    \[ \text{Class-TF-IDF}(t, d, c) = \text{TF-IDF}(t, d) \cdot \text{ClassWeight}(t, c) \]

    where \(\text{ClassWeight}(t, c)\) is the importance weight of term \(t\) within class \(c\).
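
    • Example: the section leaves \(\text{ClassWeight}\) unspecified; one simple, illustrative choice is the share of the term's document frequency that falls inside the target category:

```python
def class_weight(docs, labels, term, category):
    """Illustrative ClassWeight(t, c): fraction of the documents containing
    `term` that belong to `category` (1.0 = perfectly concentrated)."""
    df_in_c = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    df_all = sum(1 for d in docs if term in d)
    return df_in_c / df_all if df_all else 0.0

def class_tfidf(tfidf_value, docs, labels, term, category):
    """Class-TF-IDF(t, d, c) = TF-IDF(t, d) * ClassWeight(t, c)."""
    return tfidf_value * class_weight(docs, labels, term, category)
```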

  12. TF-IDF with Query Expansion

    • Uses TF-IDF during query expansion, enlarging the initial query so that it covers more related terms.

    • Formula:

    \[ \text{Expanded-TF-IDF}(q, d) = \sum_{t \in q \cup E(q)} \text{TF-IDF}(t, d) \]

    where \(E(q)\) is the set of expansion terms for query \(q\).
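
    • Example: a sketch where the expansion set \(E(q)\) comes from a hypothetical synonym table and the document is represented as a term → TF-IDF mapping:

```python
def expanded_tfidf(query_terms, expansion, doc_tfidf):
    """Sum TF-IDF over the original query terms plus their expansions E(q)."""
    terms = set(query_terms)
    for t in query_terms:
        terms.update(expansion.get(t, ()))
    return sum(doc_tfidf.get(t, 0.0) for t in terms)

# Hypothetical synonym table standing in for E(q):
score = expanded_tfidf(["car"],
                       {"car": ["automobile", "vehicle"]},
                       {"car": 0.4, "automobile": 0.2, "bike": 0.3})  # -> 0.6
```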

  13. Smooth Inverse Frequency (SIF)

    • A word-vector weighting scheme that uses smoothed inverse frequency to reduce the influence of very common words.

    • Formula:

    \[ \text{SIF}(t) = \frac{a}{a + \text{TF}(t)} \]

    where \(\text{TF}(t)\) is the term's relative corpus frequency and \(a\) is a small smoothing parameter.
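
    • Example: a sketch of a SIF-weighted sentence embedding, with `word_prob` the estimated corpus frequency \(p(w)\) and \(a = 10^{-3}\) the value suggested by Arora et al. (2017); their full method additionally removes the first principal component afterwards:

```python
import numpy as np

def sif_embedding(tokens, word_vecs, word_prob, a=1e-3):
    """Average of word vectors weighted by a / (a + p(w))."""
    vecs = [a / (a + word_prob[w]) * word_vecs[w]
            for w in tokens if w in word_vecs and w in word_prob]
    return np.mean(vecs, axis=0) if vecs else None
```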

References

  1. BM25

    • Robertson, S. E., Walker, S., Beaulieu, M. M., Gatford, M., & Payne, A. (1996). Okapi at TREC-4. In D. K. Harman (Ed.), NIST Special Publication 500-236: Proceedings of The Fourth Text REtrieval Conference (TREC-4) (pp. 73-96). Gaithersburg, MD: National Institute of Standards and Technology.
  2. TF-CHI

    • Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML) (pp. 412-420). San Francisco, CA: Morgan Kaufmann.
  3. TF-IG

    • Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML) (pp. 412-420). San Francisco, CA: Morgan Kaufmann.
  4. TF-RF

    • Lan, M., Tan, C. L., Su, J., & Lu, Y. (2009). Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 721-735. https://doi.org/10.1109/TPAMI.2008.110
  5. LDA-TF-IDF

    • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
  6. Okapi BM25+

    • Lv, Y., & Zhai, C. (2011). Lower-bounding term frequency normalization. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 7-16). ACM. https://doi.org/10.1145/2063576.2063581
    • Robertson, S., Zaragoza, H., & Taylor, M. (2004). Simple BM25 extension to multiple weighted fields. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management (pp. 42-49). ACM. https://doi.org/10.1145/1031171.1031181

  7. Weighted TF-IDF

    • Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc.
  8. Logarithmic TF-IDF

  9. Sublinear TF-IDF

    • Lv, Y., & Zhai, C. (2011). Lower-bounding term frequency normalization. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 7-16). ACM. https://doi.org/10.1145/2063576.2063581
  10. Double Normalization TF-IDF

    • Spärck Jones, K., Walker, S., & Robertson, S. E. (2000). A probabilistic model of information retrieval: development and comparative experiments: Part 2. Information Processing & Management, 36(6), 809-840. https://doi.org/10.1016/S0306-4573(00)00016-9
  11. TF-IDF with Class-Based Weighting

  12. TF-IDF with Query Expansion

    • Xu, J., & Croft, W. B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems (TOIS), 18(1), 79-112. https://doi.org/10.1145/333135.333138
  13. Smooth Inverse Frequency (SIF)

    • Arora, S., Liang, Y., & Ma, T. (2017). A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR).