Named Entity Extraction

The aim of this exercise is to extract named entities from a piece of news text (see the examples on the labs page) and mark them up in the output text. Examples of the markup are given below; further details, if required, can be found on the MUC6 pages (see the IE Lectures pages).

Type	Example	Markup
Numerical Expressions	2.5 per cent	<numex type=pc>2.5 per cent</numex>
Organisations	Morgan Stanley	<namex type=org>Morgan Stanley</namex>
Persons	Thierry Lacraz	<namex type=per>Thierry Lacraz</namex>
Time Expressions	1645	<timex type=time>1645</timex>
Locations	Wall Street	<namex type=loc>Wall Street</namex>

Preliminary Exercise

1. Tokenise the text. This could be written as a simple shell script.

2. Produce a frequency count of (a) single words (b) bigrams.

3. Produce a list of all single words that

begin with a capital letter
end with punctuation
are numbers

N.B. You will need to devise algorithms to strip off final punctuation where appropriate.
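
Steps 1 to 3 could be sketched in Python roughly as follows. This is only one possible approach, not a model answer. The function names, the set of trailing punctuation characters and the number pattern are all assumptions made for illustration. In particular, the stripping rule is deliberately naive: it would also strip the final full stop from an abbreviation such as "U.S.", which is exactly the kind of case the N.B. above asks you to think about.

```python
import re
from collections import Counter

# Assumed set of characters to treat as trailing punctuation.
TRAILING = ".,;:!?\"')"

# Assumed pattern for a number: digits, optionally with . or , groups (2.5, 1,000).
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def tokenise(text):
    """Split on whitespace; punctuation stays attached to its token."""
    return text.split()


def strip_final_punct(token):
    """Remove trailing punctuation (naive: abbreviations are not protected)."""
    return token.rstrip(TRAILING)


def frequency_counts(tokens):
    """Return (single-word counts, bigram counts) over stripped tokens."""
    words = [w for w in (strip_final_punct(t) for t in tokens) if w]
    unigrams = Counter(words)
    # Bigrams are adjacent pairs; this sketch does not stop at sentence ends.
    bigrams = Counter(zip(words, words[1:]))
    return unigrams, bigrams


def classify(tokens):
    """Return the three lists asked for in step 3."""
    capitalised = [t for t in tokens if t[:1].isupper()]
    punct_final = [t for t in tokens if t[-1:] in TRAILING]
    numbers = [w for w in (strip_final_punct(t) for t in tokens)
               if NUMBER_RE.fullmatch(w)]
    return capitalised, punct_final, numbers


if __name__ == "__main__":
    sample = "Morgan Stanley rose 2.5 per cent. Morgan Stanley fell at 1645."
    toks = tokenise(sample)
    uni, bi = frequency_counts(toks)
    print(uni.most_common(3))
    print(bi.most_common(3))
    print(classify(toks))
```

Keeping punctuation attached during tokenisation and stripping it in a separate function makes it easy to replace the stripping rule later without touching the rest of the code.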

4. Now transform output to marked up form (as opposed to writing a list).
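
As a rough illustration of step 4, the sketch below writes the text back out with candidate tokens wrapped in tags instead of collecting them in lists. The tags <cap> and <num> are placeholders invented for this example and are not MUC6 markup; deciding which candidates deserve a real namex, numex or timex tag is the job of the main exercise. The function names and the punctuation set are likewise assumptions.

```python
import re

# Assumed set of characters to treat as trailing punctuation.
TRAILING = ".,;:!?\"')"

# Assumed pattern for a number: digits, optionally with . or , groups.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def split_trailing(token):
    """Split a token into (core, trailing punctuation)."""
    core = token.rstrip(TRAILING)
    return core, token[len(core):]


def mark_token(token):
    """Wrap a token in a placeholder tag, leaving final punctuation outside."""
    core, tail = split_trailing(token)
    if not core:
        return token                      # token was punctuation only
    if NUMBER_RE.fullmatch(core):
        return f"<num>{core}</num>{tail}"
    if core[0].isupper():
        return f"<cap>{core}</cap>{tail}"
    return token


def mark_up(text):
    """Mark up every token; runs of whitespace collapse to single spaces."""
    return " ".join(mark_token(t) for t in text.split())


if __name__ == "__main__":
    print(mark_up("Shares on Wall Street rose 2.5 per cent."))
```

Note that sentence-initial words such as "Shares" are tagged as capitalised even though they are not names, so the output is a list of candidates rather than a finished annotation.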

Main Exercise

5. Mark up "per cent" phrases.
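
One possible way to approach the "per cent" step is a single regular expression over the raw text, wrapping each match in the numex markup shown in the table above. The pattern below is an assumption about what counts as a "per cent" phrase: a number written in digits followed by "per cent" or "percent". It would miss spelled-out numbers such as "five per cent", and the function name is hypothetical.

```python
import re

# Assumed pattern: digits (with optional . or , groups), then "per cent" or "percent".
PER_CENT_RE = re.compile(r"\b\d+(?:[.,]\d+)*\s+per\s*cent\b", re.IGNORECASE)


def mark_per_cent(text):
    """Wrap each number + 'per cent' phrase in <numex type=pc> ... </numex>."""
    return PER_CENT_RE.sub(r"<numex type=pc>\g<0></numex>", text)


if __name__ == "__main__":
    print(mark_per_cent("Shares rose 2.5 per cent, then fell 3 percent."))
```

Working on the raw text rather than on single tokens is a deliberate choice here, because the phrase spans several tokens and the trailing comma or full stop must stay outside the closing tag.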

6. Mark up other phrases involving the combination of a fixed word +