Keyword Extraction for English Text: The RAKE Algorithm

RAKE was proposed in the 2010 paper Automatic Keyword Extraction from Individual Documents and is reported to perform better than TextRank. The original repository is https://github.com/aneesha/RAKE, which has not been maintained for a long time. This post reorganizes that code and makes the following three changes:

  1. Port it to Python 3
  2. Make it easier to invoke from the command line
  3. Refactor the code for readability

The reworked code is hosted at https://github.com/laserwave/RAKE.

The Idea Behind RAKE

RAKE is used for keyword extraction, but what it actually extracts are key phrases, and it tends to favor longer phrases. In English, a keyword usually consists of several words, but it rarely contains punctuation or stopwords such as and, the, and of, or other words that carry little semantic information.
RAKE first splits a document into sentences at punctuation marks (periods, question marks, exclamation marks, commas, and so on). Each sentence is then split into phrases, using stopwords as delimiters; these phrases are the candidates from which the final keywords are chosen.
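As a quick illustration of this candidate generation step, the sketch below splits a single sentence into candidate phrases at stopwords. The sentence and the tiny stopword list are chosen only for illustration; the full implementation later in this post reads the stopwords from a file and also splits on sentence punctuation first.

```python
import re

# toy stopword list; the full implementation reads one from a file
stopwords = ["is", "a", "of", "and", "the", "over"]
stopwordsPattern = re.compile(
    "|".join(r"\b" + w + r"(?![\w-])" for w in stopwords), re.IGNORECASE
)

sentence = "Compatibility of systems of linear constraints over the set of natural numbers"
# replace every stopword with a separator, then split on the separator
candidates = [p.strip().lower()
              for p in stopwordsPattern.sub("|", sentence).split("|")
              if p.strip()]
print(candidates)
# ['compatibility', 'systems', 'linear constraints', 'set', 'natural numbers']
```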
So how do we measure the importance of each phrase?
Note that each phrase can be further split into words by whitespace. We can assign a score to each word and then obtain a phrase's score by summing the scores of its words. A key point is that the co-occurrence of the words within a phrase is taken into account.
The final formula for a word's score is:

$$
\mathrm{wordScore}(w) = \frac{\mathrm{wordDegree}(w)}{\mathrm{wordFrequency}(w)}
$$

That is, the score of a word w is its degree (a concept from graph theory: the degree increases by 1 for every word that co-occurs with w in a phrase, with w itself counted as well) divided by its frequency (the total number of times w appears in the document).
Then, for each candidate phrase, the scores of its words are summed. The candidates are sorted by score, and RAKE takes the top one third of them as the extracted keywords.
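To make the scoring concrete, the following sketch computes word and phrase scores for three made-up candidate phrases. Adding the phrase length to a word's degree once per occurrence is equivalent to the degree-plus-frequency bookkeeping used in the source code further down.

```python
from collections import defaultdict

phrases = ["deep learning", "deep neural networks", "keyword extraction"]

freq = defaultdict(int)    # wordFrequency(w)
degree = defaultdict(int)  # wordDegree(w), counting the word itself
for phrase in phrases:
    words = phrase.split()
    for w in words:
        freq[w] += 1
        degree[w] += len(words)  # co-occurrences inside this phrase, including w

wordScore = {w: degree[w] / freq[w] for w in freq}
phraseScore = {p: sum(wordScore[w] for w in p.split()) for p in phrases}
print(phraseScore)
# {'deep learning': 4.5, 'deep neural networks': 8.5, 'keyword extraction': 4.0}
```

Here the word deep occurs in two phrases, so its frequency is 2 and its degree is 5, giving a score of 2.5; the longest phrase collects the highest total, which reflects RAKE's preference for longer phrases.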

Implementing RAKE

Source Code

The source code uses a maxPhraseLength parameter to limit the length of candidate phrases and filter out phrases that are too long.

```python
import re
import operator
import argparse
import codecs

def isNumber(s):
    try:
        float(s) if '.' in s else int(s)
        return True
    except ValueError:
        return False

class Rake:

    def __init__(self, inputFilePath, stopwordsFilePath, outputFilePath, minPhraseChar, maxPhraseLength):
        self.outputFilePath = outputFilePath
        self.minPhraseChar = minPhraseChar
        self.maxPhraseLength = maxPhraseLength
        # read documents, one document per line
        self.docs = []
        for document in codecs.open(inputFilePath, 'r', 'utf-8'):
            self.docs.append(document)
        # read stopwords and build a regex that matches any of them
        stopwords = []
        for word in codecs.open(stopwordsFilePath, 'r', 'utf-8'):
            stopwords.append(word.strip())
        stopwordsRegex = []
        for word in stopwords:
            regex = r'\b' + word + r'(?![\w-])'
            stopwordsRegex.append(regex)
        self.stopwordsPattern = re.compile('|'.join(stopwordsRegex), re.IGNORECASE)

    def separateWords(self, text):
        splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
        words = []
        for word in splitter.split(text):
            word = word.strip().lower()
            # leave numbers in the phrase, but don't count them as words,
            # since they tend to invalidate the scores of their phrases
            if word != '' and not isNumber(word):
                words.append(word)
        return words

    def calculatePhraseScore(self, phrases):
        # calculate wordFrequency and wordDegree
        wordFrequency = {}
        wordDegree = {}
        for phrase in phrases:
            wordList = self.separateWords(phrase)
            wordListLength = len(wordList)
            wordListDegree = wordListLength - 1
            for word in wordList:
                wordFrequency.setdefault(word, 0)
                wordFrequency[word] += 1
                wordDegree.setdefault(word, 0)
                wordDegree[word] += wordListDegree
        for item in wordFrequency:
            wordDegree[item] = wordDegree[item] + wordFrequency[item]

        # calculate wordScore = wordDegree(w) / wordFrequency(w)
        wordScore = {}
        for item in wordFrequency:
            wordScore.setdefault(item, 0)
            wordScore[item] = wordDegree[item] * 1.0 / wordFrequency[item]

        # calculate phraseScore by summing the scores of the words in each phrase
        phraseScore = {}
        for phrase in phrases:
            phraseScore.setdefault(phrase, 0)
            wordList = self.separateWords(phrase)
            candidateScore = 0
            for word in wordList:
                candidateScore += wordScore[word]
            phraseScore[phrase] = candidateScore
        return phraseScore

    def execute(self):
        outputFile = codecs.open(self.outputFilePath, 'w', 'utf-8')
        for document in self.docs:
            # split a document into sentences
            sentenceDelimiters = re.compile(u'[.!?,;:\t\\\\"\\(\\)\\\'\u2019\u2013]|\\s\\-\\s')
            sentences = sentenceDelimiters.split(document)
            # generate all valid phrases
            phrases = []
            for s in sentences:
                tmp = re.sub(self.stopwordsPattern, '|', s.strip())
                phrasesOfSentence = tmp.split("|")
                for phrase in phrasesOfSentence:
                    phrase = phrase.strip().lower()
                    if phrase != "" and len(phrase) >= self.minPhraseChar and len(phrase.split()) <= self.maxPhraseLength:
                        phrases.append(phrase)

            # calculate phrase scores and keep the top third as keywords
            phraseScore = self.calculatePhraseScore(phrases)
            keywords = sorted(phraseScore.items(), key=operator.itemgetter(1), reverse=True)
            outputFile.write(str(keywords[0:int(len(keywords) / 3)]) + "\n")
        outputFile.close()

def readParamsFromCmd():
    parser = argparse.ArgumentParser(description="This is a python implementation of RAKE (rapid automatic keyword extraction).")
    parser.add_argument('inputFilePath', help='The file path of input document(s). One line represents a document.')
    parser.add_argument('stopwordsFilePath', help='The file path of stopwords, each line represents a word.')
    parser.add_argument('-o', '--outputFilePath', help='The file path of output (default output.txt in current dir).', default='output.txt')
    parser.add_argument('-m', '--minPhraseChar', type=int, help='The minimum number of characters of a phrase (default 1).', default=1)
    parser.add_argument('-a', '--maxPhraseLength', type=int, help='The maximum length of a phrase (default 3).', default=3)
    return parser.parse_args()

if __name__ == '__main__':
    params = readParamsFromCmd().__dict__
    rake = Rake(params['inputFilePath'], params['stopwordsFilePath'], params['outputFilePath'], params['minPhraseChar'], params['maxPhraseLength'])
    rake.execute()
```

Usage

Run it from the command line:

```
python rake.py [-h] [-o OUTPUTFILEPATH] [-m MINPHRASECHAR] [-a MAXPHRASELENGTH] inputFilePath stopwordsFilePath
```

```
positional arguments:
  inputFilePath         The file path of input document(s). One line represents a document.
  stopwordsFilePath     The file path of stopwords, each line represents a word.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUTFILEPATH, --outputFilePath OUTPUTFILEPATH
                        The file path of output (default output.txt in current dir).
  -m MINPHRASECHAR, --minPhraseChar MINPHRASECHAR
                        The minimum number of characters of a phrase (default 1).
  -a MAXPHRASELENGTH, --maxPhraseLength MAXPHRASELENGTH
                        The maximum length of a phrase (default 3).
```
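For example, assuming an input file data/docs.txt and a stopword list data/stopwords.txt (hypothetical paths), a typical run looks like this; it writes the top-scoring phrases of each document, one line per document, to keywords.txt:

```
python rake.py data/docs.txt data/stopwords.txt -o keywords.txt -m 1 -a 3
```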

Experiments

The example directory in the source code contains two documents: the first is the sample given in the paper, and the second is a paragraph copied from the Wikipedia entry on NLP. They look as follows:

[Figure: the two example input documents]

The output is the key phrases extracted from each document, together with their scores:

[Figure: the extracted key phrases and their scores for each document]

Can RAKE Handle Chinese?

Applying RAKE to Chinese runs into problems. Using stopwords to split a sentence into phrases works far less well than in English, because Chinese is written without delimiters between words, so the results are poor.
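If you do want to try it, a necessary preprocessing step is word segmentation, so that the text becomes whitespace-delimited like English. A minimal sketch, assuming the third-party jieba segmenter (not part of this repository) and a separate Chinese stopword list:

```python
import jieba  # third-party Chinese word segmenter, assumed to be installed

text = "自然语言处理是人工智能领域的一个重要方向"
# insert spaces between segmented words so that the whitespace-based
# phrase and word splitting used by RAKE can be applied afterwards
segmented = " ".join(jieba.lcut(text))
print(segmented)
```

Even with segmentation, the quality of the extracted phrases still depends heavily on the segmenter and the stopword list.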

References

[1] Stuart Rose et al. Automatic Keyword Extraction from Individual Documents. 2010.

The source code is hosted on GitHub: https://github.com/laserwave/RAKE