The overall aim of this exercise is to extract named entities from a piece of news text (see the examples on the labs page) and to mark them up in the output text. Example markup is shown below; further details, if required, can be found on the MUC6 pages (see the IE Lectures pages).
Type | Example | Markup |
Numerical Expressions | 2.5 per cent | <numex type=pc>2.5 per cent</numex> |
Organisations | Morgan Stanley | <namex type=org>Morgan Stanley</namex> |
Persons | Thierry Lacraz | <namex type=per>Thierry Lacraz</namex> |
Time Expressions | 1645 | <timex type=time>1645</timex> |
Locations | Wall Street | <namex type=loc>Wall Street</namex> |
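The markup in the table follows one pattern: an element name, a type attribute, and the entity text between the opening and closing tags. A minimal sketch of a helper that produces it, assuming Python is used for the exercise (the function name `mark_up` and its parameters are illustrative, not part of the exercise specification):

```python
def mark_up(text, tag, entity_type):
    """Wrap text in a tag of the form shown in the table above.

    tag is the element name (numex, namex or timex) and entity_type
    is the value of its type attribute (pc, org, per, time or loc).
    """
    return f"<{tag} type={entity_type}>{text}</{tag}>"


# Reproduce two rows of the table.
print(mark_up("Morgan Stanley", "namex", "org"))
print(mark_up("1645", "timex", "time"))
```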
1. Tokenise the text. This could be written as a simple shell script.
2. Produce a frequency count of (a) single words (b) bigrams.
3. Produce a list of all single words that are
N.B. You will need to devise algorithms to strip off final punctuation where appropriate.
4. Now transform output to marked up form (as opposed to writing a list).
5. Mark up "per cent" phrases.
6. Mark up other phrases involving the combination of a fixed word +
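
The first steps above (tokenising, stripping final punctuation, counting single words and bigrams, and marking up "per cent" phrases) can be sketched in Python as follows. This is one possible approach under stated assumptions, not a model answer: all function names are illustrative, tokens are split on whitespace only, every trailing punctuation mark is stripped (so an abbreviation such as "Inc." would lose its full stop, which may not be appropriate), and case is preserved because capitalisation is likely to be useful for spotting names later.

```python
import re
from collections import Counter

# Assumption: these are the characters treated as "final punctuation".
TRAILING_PUNCT = ".,;:!?\"')"


def strip_final_punct(token):
    """Remove punctuation from the end of a token only, so the full
    stop inside a number such as 2.5 is left alone."""
    return token.rstrip(TRAILING_PUNCT)


def tokenise(text):
    """Split on whitespace, strip final punctuation, drop empty tokens."""
    stripped = (strip_final_punct(t) for t in text.split())
    return [t for t in stripped if t]


def frequency_counts(tokens):
    """Return (single-word counts, bigram counts) as Counter objects."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# A number (optionally with a decimal part) followed by "per cent".
PER_CENT = re.compile(r"\b\d+(?:\.\d+)?\s+per\s+cent\b")


def mark_up_per_cent(text):
    """Wrap each "N per cent" phrase in numex tags, leaving the rest
    of the text unchanged (marked-up output rather than a list)."""
    return PER_CENT.sub(
        lambda m: f"<numex type=pc>{m.group(0)}</numex>", text
    )


if __name__ == "__main__":
    sample = "Shares rose 2.5 per cent. Morgan Stanley said shares rose again."
    words, pairs = frequency_counts(tokenise(sample))
    print(words.most_common(3))
    print(pairs.most_common(3))
    print(mark_up_per_cent(sample))
```

Note that the markup step works on the original text rather than on the token list, so that spacing and punctuation survive in the output.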