1. 介绍
Gensim
是一个python
的包可以自动提取文档语义主题。主要用来对于原始未标记文档进行处理。包括Latent Semantic Analysis, Latent Dirichlet Allocation, Random Projections。这些算法都是unsupervised
意味着不需要人工输入。
一旦概率模式被发现,原始文档可以有效地进行语义表示,并且可以进行主题查询。
本文主要参考Gensim
官网教程。
1.1 特征
- Memory Independence
- 有很多以实现算法
1.2 核心概念
- Copus 文档集合
- Vector 在
Vector Space Model
中每个文档表示成一个特征数组 - Sparse Vector 使用
Sparse Vector
表示文档 - Model 相当于文档->特征的映射函数
2. 安装
1 | pip install --upgrade gensim |
3. 语料以及向量空间
3.1 String -> Vectors
1 | In [1]: from gensim import corpora, models, similarities |
这是一个九句话的语料库。
首先进行tokenize
这个文档集,去掉共同词作为stopwords
以及只在一个文档里出现过一次的词。1
2
3
4
5
6
7
8
9
10
11
12
13
14
15In [3]: stoplist = set('for a of the and to in'.split())
In [6]: texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
In [7]: texts
Out[7]:
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'management', 'system'],
['system', 'human', 'system', 'engineering', 'testing', 'eps'],
['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
['generation', 'random', 'binary', 'unordered', 'trees'],
['intersection', 'graph', 'paths', 'trees'],
['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
['graph', 'minors', 'survey']]
1 | # remove words that appear only once |
把这个word
以及id
变成dictionary
。1
2
3dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference
print(dictionary)1
2In [16]: print(dictionary.token2id)
{'user': 3, 'trees': 9, 'eps': 8, 'minors': 11, 'interface': 2, 'survey': 5, 'system': 4, 'computer': 1, 'response': 7, 'human': 0, 'time': 6, 'graph': 10}
然后把tokenized
的文档变成向量。1
2
3
4In [17]: new_doc = "Human computer interaction"
In [18]: new_vec = dictionary.doc2bow(new_doc.lower().split())
In [19]: print(new_vec)
[(0, 1), (1, 1)]
函数doc2bow()
只是简单地计算一下每个不同的单词的出现次数。然后返回结果作为一个sparse vector
。1
2
3
4
5
6
7
8
9
10
11In [23]: corpus = [dictionary.doc2bow(text) for text in texts]
In [24]: pprint(corpus)
[[(0, 1), (1, 1), (2, 1)],
[(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
[(2, 1), (3, 1), (4, 1), (8, 1)],
[(0, 1), (4, 2), (8, 1)],
[(3, 1), (6, 1), (7, 1)],
[(9, 1)],
[(9, 1), (10, 1)],
[(9, 1), (10, 1), (11, 1)],
[(5, 1), (10, 1), (11, 1)]]
导出corpora
1
In [32]: corpora.MmCorpus.serialize('d.mm',corpus)
4. 主题以及tranformations
4.1 导入已经有的corpus
1 | In [33]: if (os.path.exists("d.dict")): |
下面把文档从一个向量表示转化为另一个。
4.2 Tfidf 表示
1 | In [34]: tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model |
4.3 LSI 表示
使用 Latent semantic analysis 来把Tf-idf
的corpus
表示到一个2-D space
(假设我们把num_tops=2
)。
1 | In [42]: lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation |
保存模型1
2lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
lsi = models.LsiModel.load('/tmp/model.lsi')
4.4 可以使用的模型
Random Projections1
model = models.RpModel(tfidf_corpus, num_topics=500)
Latent Dirichlet Allocation, LDA1
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
Hierarchical Dirichlet Process, HDP。这是一个non-parameteric bayesian method
1
model = models.HdpModel(corpus, id2word=dictionary)
5. 相似度查询
5.1 初始化查询结构
在这里使用上一节的lsi model
形成的corpus
。1
2In [49]: from gensim import similarities
In [50]: index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
5.2 对一个新的文档进行查询。
1 | In [51]: doc = "Human computer interaction" |
把这个文档查询结果进行排序。1
2
3
4
5
6
7
8
9
10
11sorted(enumerate(sims), key=lambda item: -item[1]) sims =
print(sims) # print sorted (document number, similarity score) 2-tuples
[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees
5.3 存储Model
1 | index.save('/tmp/deerwester.index') |
6. Word2Vec
1 | In [60]: texts |
高级模型训练1
2
3model = gensim.models.Word2Vec(iter=1) # an empty model, no training yet
model.build_vocab(some_sentences) # can be a non-repeatable, 1-pass generator
model.train(other_sentences) # can be a non-repeatable, 1-pass generator
同样Doc2Vec
的使用是类似的。但是需要新建一个类来进行才行。
参考
[1] Gensim 官网教程:https://radimrehurek.com/gensim/index.html
[2] 分析Wikipedia主题分布实例:https://radimrehurek.com/gensim/wiki.html
[3] Word2Vec:https://rare-technologies.com/word2vec-tutorial/
[4] Doc2Vec教程:https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
[5] Doc2Vec Wiki教程:https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
因为我们是朋友,所以你可以使用我的文字,但请注明出处:http://alwa.info