Sentences
What is a Sentence
- Something ending with a ".", "!", or "?".
- Sometimes sentences appear to end with other punctuation (e.g. ":") or
no punctuation (especially those that end with URLs).
- Quotation marks cause problems: a closing quote often follows the
sentence-final punctuation.
An Algorithm
- Place putative sentence boundaries after all occurrences of . ? !
(possibly ; : -- as well).
- Move the boundary after following quotation marks, if any.
- Cancel a period boundary under the following circumstances:
  - If it is preceded by a known abbreviation of a sort that does not normally occur sentence-finally, but which is commonly followed by a capitalised proper name (e.g. Prof.).
  - If it is preceded by a known abbreviation and not followed by an uppercase word (e.g. etc.). This will deal with most cases.
- Disqualify a boundary with ? or ! if it is followed by a lowercase letter
or a known name.
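
The steps above can be sketched in Python roughly as follows. This is a minimal illustration and not a reference implementation: it assumes whitespace tokenisation, the abbreviation and name lists are hypothetical placeholders, and the optional ; : -- boundaries are left out.

```python
# Hypothetical word lists, chosen only for illustration.
TITLE_ABBREVS = {"prof.", "dr.", "mr.", "mrs.", "vs."}  # rarely sentence-final
OTHER_ABBREVS = {"etc.", "jr.", "e.g.", "i.e."}         # may be sentence-final
KNOWN_NAMES = {"Mary"}

OPENERS = "\"'(["
CLOSERS = "\"')]"


def split_sentences(text):
    """Split text into sentences using the heuristic boundary rules."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        # Closing quotes stay attached to the token, so any boundary
        # placed here automatically falls after them.
        core = tok.rstrip(CLOSERS)
        if not core or core[-1] not in ".?!":
            continue  # no putative boundary at this token
        nxt = tokens[i + 1].lstrip(OPENERS) if i + 1 < len(tokens) else ""
        if core[-1] == ".":
            word = core.lstrip(OPENERS).lower()
            # Cancel: abbreviation that normally precedes a proper name.
            if word in TITLE_ABBREVS:
                continue
            # Cancel: other abbreviation not followed by an uppercase word.
            if word in OTHER_ABBREVS and not nxt[:1].isupper():
                continue
        else:
            # Disqualify ? or ! followed by a lowercase letter or known name.
            if nxt[:1].islower() or nxt.rstrip(".,;:!?") in KNOWN_NAMES:
                continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

For example, `split_sentences('Prof. Smith arrived late. "Stop!" he said.')` keeps "Prof. Smith" together and does not break after "Stop!", because the next word is lowercase.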
Other Approaches
- Statistical classification (Riley 1989) based on the case and length of
the words preceding and following a period, and the a priori
probability of different words occurring before and after a sentence
boundary.
- Part-of-speech (POS) distribution of preceding and following words + a
neural network to predict sentence boundaries (Palmer & Hearst 1997).
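
To make the classification view concrete, the sketch below extracts the kind of case and length features mentioned for the statistical approach. It is only an illustration: the function name and feature names are hypothetical, it is not the feature set of either cited paper, and the a priori word probabilities (which would have to be estimated from a labelled corpus) are left out.

```python
def period_features(tokens, i):
    """Features for the candidate boundary at tokens[i], which ends in '.'.

    Hypothetical feature names; a classifier would be trained on these
    together with corpus-derived word probabilities.
    """
    prev_word = tokens[i][:-1]  # word before the period
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "prev_len": len(prev_word),
        "prev_is_capitalised": prev_word[:1].isupper(),
        "next_len": len(next_word),
        "next_is_capitalised": next_word[:1].isupper(),
    }
```

Each period in a text yields one such feature dictionary, and the classifier then decides whether that period is a sentence boundary.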
Last modified: Wed Apr 5 10:33:21