Efficient Integrated Tagging Of Word Constructs
We describe a robust text-handling component that can deal with free text in a wide range of formats and can successfully identify a wide range of phenomena, including chemical formulae, dates, numbers and proper nouns. As an example, we give the set of regular expressions used to capture German numbers in written form ("sechsundzwanzig", i.e. twenty-six). Proper noun "candidates" are identified by means of regular expressions and are then accepted or rejected on the basis of run-time interaction with the user. This tagging component is integrated in a large-scale grammar development environment and provides direct input to the grammatical analysis component of the system by means of "lift" rules which convert tagged text into partial linguistic structures.

1 Motivation

1.1 The...