This repository provides a text corpus of Maltese and
English extracted from the Laws of
This data provides both a substantial source of easily processable
Maltese text and a roughly structured bilingual corpus for English and
Maltese. It is hoped that it will prove a useful tool for natural
language processing research on the Maltese language, in topics such as
lexicon building, spell-checking and machine translation.
All files are plain text encoded in UTF-8 (Unicode), with a byte-order mark
(BOM) at the start of each file. This encoding is necessary to properly cater
for the non-ASCII Maltese characters (ċ, ġ, ħ, ż, etc.). Unicode-enabled
software is required to handle the files properly. Most modern software
supports Unicode, although in some cases it may be necessary to indicate
explicitly that the content is in UTF-8 format (such as when viewing the files
in a web browser).
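As an illustration of the point about the byte-order mark, the following is a minimal Python sketch of reading one of the files so that the BOM does not end up in the text. The function name `read_corpus_file` and the file name in the usage note are hypothetical, chosen only for this example; the `utf-8-sig` codec is part of the Python standard library.

```python
from pathlib import Path


def read_corpus_file(path):
    """Return the text of a UTF-8 file, dropping a leading BOM if present.

    `read_corpus_file` is a hypothetical helper name, not part of the corpus.
    """
    # The "utf-8-sig" codec decodes UTF-8 and removes an initial
    # byte-order mark (U+FEFF) when the file starts with one.
    return Path(path).read_text(encoding="utf-8-sig")
```

For example, `read_corpus_file("sample.txt")` (a placeholder file name) returns the file content as a Python string. Reading the same file with plain `utf-8` would also work, but would leave the character U+FEFF at the start of the string.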
The
data is provided in four different formats, to suit different processing
requirements:
| Type | Description |
|---|---|
| Unicode | All characters are encoded using their proper Unicode values. |
| Plain | The non-ASCII Maltese characters are converted to their plain ASCII counterparts (c, g, h, z, etc.). This caters for the type of Maltese text often encountered when there is no means to properly display these characters. |
| Maltilex | The non-ASCII Maltese characters are replaced with an escape sequence (_c, _g, _h, _z, etc.) used in some Maltilex-related works. |
| Tornado | The non-ASCII characters are represented using the codes of the obsolete, non-Unicode Tornado font family. Unless formatted with this particular font family, the characters will appear as punctuation symbols in most normal fonts. This format is provided for completeness' sake only, as it is the original encoding of the source files from which the text was extracted. |
In more detail, the main differences in character representation are given in the table below:
| Unicode | Plain | Maltilex | Tornado |
|---|---|---|---|
| ċ | c | _c | ` |
| Ċ | C | _C | ¬ or ~ |
| ġ | g | _g | [ |
| Ġ | G | _G | { |
| ħ | h | _h | ] |
| Ħ | H | _H | } |
| għ | gh | g_h | g] |
| Għ | Gh | G_h | G] |
| GĦ | GH | G_H | G} |
| ż | z | _z | \ (backslash) |
| Ż | Z | _Z | \| (vertical bar) |
| à | a | _a | à |
| À | A | _A | À |
| _ | _ | __ | _ |
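The mappings in the table above can be sketched in Python as follows. This is only an illustrative sketch, not the tool used to produce the corpus: the function names `to_plain`, `to_maltilex` and `from_maltilex` are hypothetical, and the code assumes that the accented letters are stored as single precomposed code points. The Tornado column is left out because its codes are ordinary punctuation characters, so decoding them cannot be done reliably without knowing the font of the source text.

```python
# Unicode -> Plain mapping, copied from the character table.
UNICODE_TO_PLAIN = {
    "ċ": "c", "Ċ": "C",
    "ġ": "g", "Ġ": "G",
    "ħ": "h", "Ħ": "H",
    "ż": "z", "Ż": "Z",
    "à": "a", "À": "A",
}

# Unicode -> Maltilex: an underscore followed by the plain letter.
# A literal underscore is doubled so that decoding stays unambiguous.
UNICODE_TO_MALTILEX = {ch: "_" + plain for ch, plain in UNICODE_TO_PLAIN.items()}
UNICODE_TO_MALTILEX["_"] = "__"

# Reverse lookup: two-character escape sequence -> Unicode character.
MALTILEX_TO_UNICODE = {esc: ch for ch, esc in UNICODE_TO_MALTILEX.items()}


def to_plain(text):
    """Convert Unicode Maltese text to the Plain (ASCII) format."""
    return text.translate(str.maketrans(UNICODE_TO_PLAIN))


def to_maltilex(text):
    """Convert Unicode Maltese text to the Maltilex escape format."""
    return text.translate(str.maketrans(UNICODE_TO_MALTILEX))


def from_maltilex(text):
    """Convert Maltilex escape sequences back to Unicode characters."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in MALTILEX_TO_UNICODE:
            # Known escape such as "_c" or "__": emit the decoded character.
            out.append(MALTILEX_TO_UNICODE[pair])
            i += 2
        else:
            # Anything else is copied through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

For instance, `to_plain("għażiż")` gives `"ghaziz"` and `to_maltilex("għażiż")` gives `"g_ha_zi_z"`. The digraph għ needs no special handling, since converting ħ alone already yields `gh` and `g_h`. The Plain conversion is lossy and cannot be reversed, whereas the Maltilex conversion can be undone with `from_maltilex`.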
The
corpus can be downloaded as zip files containing the documents in Unicode, Plain, Maltilex
and Tornado formats. The files can also be browsed
online here.
Despite
its origin, the data contained within this corpus is to be strictly interpreted as random
text, and has no legal value whatsoever. In certain cases, the text content or
formatting may not be coherent or well-structured. The data is freely available
for academic research and non-commercial use. Parties interested in developing commercial
applications involving the direct use of this data should seek authorisation.
We thank
the Ministry
of Justice and Home Affairs for the kind permission to make this data publicly available on-line.
Last update: January 23, 2005