Paper: The Method of Improving the Specific Language Focused Crawler

ACL ID W10-4120
Title The Method of Improving the Specific Language Focused Crawler
Venue Joint Conference on Chinese Language Processing
Session Main Conference
Year 2010
Authors

In recent years, more and more CJK (Chinese, Japanese, and Korean) web pages have appeared on the Internet, and the information in these pages has become increasingly important. A web crawler is a tool for retrieving web pages. Previous research focused on English web crawlers, and web crawlers are generally optimized for English web pages. We found that such crawlers perform worse when retrieving CJK web pages. We tried to enhance the performance of the CJK crawler by analyzing the web link structure, anchor text, and host name of each hyperlink and by changing the crawling algorithm. We distinguish the top-level domain name and the language of the anchor text on hyperlinks. The method that distinguishes the language of the anchor text on hyperlinks is...
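
The two hyperlink signals named in the abstract, the top-level domain of the host name and the language of the anchor text, can be illustrated with a minimal sketch. This is not the authors' implementation: the TLD-to-language table, the Unicode-range heuristic, the function names (`tld_language`, `anchor_language`, `link_priority`), and the one-point-per-signal scoring are all assumptions made for illustration. In particular, treating Han-only text as Chinese is a crude simplification, since Japanese anchor text can also consist only of Han characters.

```python
from urllib.parse import urlparse

# Hypothetical mapping from country-code TLDs to CJK languages (an assumption).
CJK_TLDS = {"cn": "zh", "tw": "zh", "hk": "zh", "jp": "ja", "kr": "ko"}


def tld_language(url):
    """Return a language guess from the host's top-level domain, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CJK_TLDS.get(tld)


def anchor_language(text):
    """Guess the anchor text language from Unicode code point ranges."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana blocks
            return "ja"
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:  # Hangul
            return "ko"
        if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
            has_han = True
    # Crude assumption: Han characters without kana or Hangul are Chinese.
    return "zh" if has_han else None


def link_priority(url, anchor_text, target):
    """Score a hyperlink: one point per signal matching the target language."""
    score = 0
    if tld_language(url) == target:
        score += 1
    if anchor_language(anchor_text) == target:
        score += 1
    return score
```

A crawler frontier could then dequeue higher-scoring links first, for example visiting a link with `link_priority(url, text, "ja") == 2` before one that scores 0. How the paper actually combines these signals is not stated in the text above.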