ACL ID N03-2001
Title Automating XML Markup Of Text Documents
Venue Human Language Technologies
Session Short Paper
Year 2003

We present a novel system for automatically marking up text documents into XML and discuss the benefits of XML markup for intelligent information retrieval. The system uses the Self-Organizing Map (SOM) algorithm to arrange XML marked-up documents on a two-dimensional map so that similar documents appear closer to each other. It then employs an inductive learning algorithm, C5, to automatically extract markup rules from the nearest SOM neighbours of an unmarked document and apply them to that document. The system is designed to be adaptive: once a document is marked up, the system's behaviour is modified to improve accuracy. The automatically marked-up documents are again categorized on the Self-Organizing Map.
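
The neighbour-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`train_som`, `best_matching_unit`, `nearest_marked_neighbours`), the toy term-count vectors, the grid size, and the learning-rate and radius schedules are all assumptions made for illustration. The C5 rule-learning step is omitted; the sketch only shows how a small SOM could be trained on marked-up documents and how the marked-up documents closest on the map to an unmarked document could be selected as training material for rule induction.

```python
import math
import random


def best_matching_unit(weights, v):
    """Return the grid coordinate whose weight vector is closest to v."""
    return min(
        weights,
        key=lambda u: sum((a - b) ** 2 for a, b in zip(weights[u], v)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a tiny SOM; returns a dict mapping (row, col) -> weight vector."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)    # shrinking neighbourhood
        for v in vectors:
            bmu = best_matching_unit(weights, v)
            for unit, w in weights.items():
                grid_dist2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (v[i] - w[i])
    return weights


def nearest_marked_neighbours(weights, marked, query, k=2):
    """Names of the k marked-up documents nearest to the query on the map grid."""
    q = best_matching_unit(weights, query)

    def grid_dist2(name):
        u = best_matching_unit(weights, marked[name])
        return (u[0] - q[0]) ** 2 + (u[1] - q[1]) ** 2

    return sorted(marked, key=grid_dist2)[:k]


# Hypothetical term-count vectors for already marked-up documents.
marked = {
    "letter_a": [3.0, 0.0, 1.0, 0.0],
    "letter_b": [2.0, 1.0, 1.0, 0.0],
    "report_a": [0.0, 3.0, 0.0, 2.0],
    "report_b": [0.0, 2.0, 1.0, 3.0],
}
weights = train_som(list(marked.values()), rows=3, cols=3)

# An unmarked document, here given the same vector as "letter_a".
unmarked = [3.0, 0.0, 1.0, 0.0]
neighbours = nearest_marked_neighbours(weights, marked, unmarked, k=2)
print(neighbours)
```

In a full pipeline along the lines of the abstract, the returned neighbours would supply the examples from which markup rules are induced, and the newly marked-up document would then be added to the map.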