Named Entity Extraction

The aim of this exercise is to extract named entities from a piece of news text (see the examples on the labs page) by marking them up in the output text. Example markup is shown below; further details, if required, can be found on the MUC6 pages (see the IE Lectures pages).

Type                   Example          Markup
Numerical Expressions  2.5 per cent     <numex type=pc>2.5 per cent</numex>
Organisations          Morgan Stanley   <namex type=org>Morgan Stanley</namex>
Persons                Thierry Lacraz   <namex type=per>Thierry Lacraz</namex>
Time Expressions       1645             <timex type=time>1645</timex>
Locations              Wall Street      <namex type=loc>Wall Street</namex>

 

Preliminary Exercise

1. Tokenise the text. This could be written as a simple shell script.

2. Produce a frequency count of (a) single words and (b) bigrams.

3. Produce a list of all single words that

  1. begin with a capital letter
  2. end with punctuation
  3. are numbers

N.B. You will need to devise algorithms to strip off final punctuation where appropriate.
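
Steps 1 to 3 could be sketched as follows. This is only one possible approach, written in Python rather than as a shell script; the function names (tokenise, strip_final_punct, frequency_counts, classify) are made up for illustration, and whitespace tokenisation and the simple number pattern are assumptions that you may well want to refine.

```python
import re
import string
from collections import Counter

def tokenise(text):
    """Split on whitespace; punctuation stays attached to the word."""
    return text.split()

def strip_final_punct(token):
    """Remove trailing punctuation, e.g. 'Street,' -> 'Street'.
    Note this also turns 'U.S.' into 'U.S', which is why stripping
    should only be applied 'where appropriate'."""
    return token.rstrip(string.punctuation)

def frequency_counts(tokens):
    """Return (single-word counts, bigram counts) as Counters."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def classify(tokens):
    """Collect capitalised words, punctuation-final words, and numbers."""
    capitalised, punct_final, numbers = [], [], []
    for tok in tokens:
        core = strip_final_punct(tok)
        if tok[-1] in string.punctuation:
            punct_final.append(tok)
        if re.fullmatch(r"\d+(\.\d+)?", core):
            numbers.append(core)
        elif core[:1].isupper():
            capitalised.append(core)
    return capitalised, punct_final, numbers

tokens = tokenise("Morgan Stanley rose 2.5 per cent at 1645 on Wall Street.")
unigrams, bigrams = frequency_counts(tokens)
capitalised, punct_final, numbers = classify(tokens)
```

For the sample sentence, capitalised holds Morgan, Stanley, Wall and Street, punct_final holds "Street." and numbers holds 2.5 and 1645. Calling unigrams.most_common() or bigrams.most_common() gives the counts sorted by frequency.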

4. Now transform the output into marked-up form (as opposed to writing a list).
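
A minimal sketch of step 4, assuming whitespace tokenisation and that final punctuation should stay outside the tag. The element names cap and num are placeholders invented for this illustration, not MUC6 markup; choosing the correct numex, namex or timex type is the job of the main exercise.

```python
import re
import string

def split_final_punct(token):
    """Return (core, trailing punctuation), e.g. 'Street,' -> ('Street', ',')."""
    core = token.rstrip(string.punctuation)
    return core, token[len(core):]

def mark_up(text):
    """Wrap numbers and capitalised words in placeholder tags, in place."""
    out = []
    for tok in text.split():
        core, tail = split_final_punct(tok)
        if re.fullmatch(r"\d+(\.\d+)?", core):
            out.append(f"<num>{core}</num>{tail}")   # hypothetical tag
        elif core[:1].isupper():
            out.append(f"<cap>{core}</cap>{tail}")   # hypothetical tag
        else:
            out.append(tok)
    return " ".join(out)

print(mark_up("Shares on Wall Street, fell 2.5 per cent."))
# <cap>Shares</cap> on <cap>Wall</cap> <cap>Street</cap>, fell <num>2.5</num> per cent.
```

Note that joining on single spaces loses the original line breaks and spacing; if layout matters, substitute within the original string instead.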

Main Exercise

5. Mark up "per cent" phrases.
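
One way to approach the "per cent" step is a regular expression over the raw text. The sketch below assumes the phrase is always a number written in digits followed by the two words "per cent"; the names PER_CENT and mark_per_cent are illustrative only, and variants such as "percent", "%" or spelled-out numbers are not handled.

```python
import re

# A number (optionally with a decimal part) followed by "per cent".
PER_CENT = re.compile(r"\b\d+(?:\.\d+)?\s+per\s+cent\b")

def mark_per_cent(text):
    """Wrap each matched phrase in the numex markup from the table above."""
    return PER_CENT.sub(
        lambda m: f"<numex type=pc>{m.group(0)}</numex>", text
    )

print(mark_per_cent("Shares fell 2.5 per cent in London."))
# Shares fell <numex type=pc>2.5 per cent</numex> in London.
```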

6. Mark up other phrases involving the combination of a fixed word +