Paper: Computational Linkuistics: Word Triggers Across Hyperlinks

ACL ID N04-4031
Venue Human Language Technologies
Session Short Paper
Year 2004

It is known that content words tend to be self-triggers; that is, the probability that a content word appears more than once in a document, given that it already appears once, is significantly higher than the probability of its first occurrence. We look at self-triggerability across hyperlinks on the Web. We show that the probability that a word w appears in a Web document d depends on the presence of w in documents pointing to d. In Document Modeling, we propose the use of a correction factor that indicates how much more likely a word is to appear in a document given that another document containing the same word is linked to it.
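
The abstract does not say how the correction factor is estimated. As a rough illustration only, the sketch below assumes it can be approximated by a ratio: the fraction of link targets that contain a word, among targets with at least one in-linking document containing that word, divided by the fraction of all documents containing the word. The function name `link_correction_factor`, the data layout, and the toy corpus are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def link_correction_factor(
    docs: Dict[str, Set[str]],
    links: List[Tuple[str, str]],
    word: str,
) -> float:
    """Estimate P(word in target | word in a linking doc) / P(word in doc).

    Assumed estimator for illustration; not the paper's definition.
    """
    # Baseline: fraction of all documents that contain the word.
    baseline = sum(word in words for words in docs.values()) / len(docs)

    # Link targets with at least one source document containing the word.
    targets = {dst for src, dst in links if word in docs[src]}
    if not targets or baseline == 0:
        raise ValueError("not enough data to estimate the ratio")

    # Fraction of those targets that also contain the word.
    conditional = sum(word in docs[dst] for dst in targets) / len(targets)
    return conditional / baseline


# Hypothetical toy corpus: document id -> set of words it contains.
docs = {
    "a": {"jaguar", "car"},
    "b": {"jaguar", "speed"},
    "c": {"weather"},
    "d": {"car", "road"},
}
# Hypothetical hyperlinks as (source, target) pairs.
links = [("a", "b"), ("c", "d")]

factor = link_correction_factor(docs, links, "jaguar")
print(factor)  # 2.0 on this toy corpus
```

On this toy corpus, half of all documents contain "jaguar", while the single document linked from a "jaguar" document also contains it, so the ratio is 2. A value above 1 would be consistent with the cross-link triggering effect the abstract describes; real estimates would need a much larger crawl and some smoothing for rare words.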