Laws of Malta Text Corpus

Introduction

This repository provides a text corpus of Maltese and English extracted from the Laws of Malta PDF documents published at http://www.justice.gov.mt. It consists of plain-text (UTF-8 encoded) files, each containing the extracted text of a single chapter or legal notice. Each document is provided in a Maltese and an English version, in similarly named files residing under parallel directories.
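As a rough illustration of how the parallel layout might be used, the sketch below pairs Maltese and English files by relative path. The directory names ("mt", "en"), the ".txt" extension, the file names, and the helper name pair_parallel_files are assumptions made for the example; the corpus only promises "similarly named" files, so the matching rule may need adjusting.

```python
import tempfile
from pathlib import Path


def pair_parallel_files(maltese_dir, english_dir):
    """Pair files that share the same relative path under both directories.

    Assumption: matching files have identical relative paths and end in .txt.
    """
    mt_root, en_root = Path(maltese_dir), Path(english_dir)
    pairs = []
    for mt_file in sorted(mt_root.rglob("*.txt")):
        en_file = en_root / mt_file.relative_to(mt_root)
        if en_file.is_file():  # skip documents with no English counterpart
            pairs.append((mt_file, en_file))
    return pairs


# Small demonstration on a temporary, made-up directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("mt/chapt9.txt", "en/chapt9.txt", "mt/chapt16.txt"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("sample", encoding="utf-8")
    pairs = pair_parallel_files(root / "mt", root / "en")
    pair_names = [(m.name, e.name) for m, e in pairs]
    print(pair_names)
```

Files that exist in only one language are skipped here; a stricter script might report them instead.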

This data provides both a substantial source of easily processable Maltese text and a roughly structured bilingual corpus for English and Maltese. It is hoped that it will prove a useful tool for natural language processing research on the Maltese language, in areas such as lexicon building, spell-checking and machine translation.

Corpus Format

All files are plain text encoded in UTF-8 (Unicode), with a byte-order mark (BOM) at the start of each file. This encoding is necessary to cater properly for the non-ASCII Maltese characters (ċ, ġ, ħ, ż, etc.). Unicode-enabled software is required to handle the files properly. Most modern software supports Unicode, although in some cases it may be necessary to indicate explicitly that the content is in UTF-8 (for example, when viewing the files in a web browser).
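When reading the files programmatically, the leading BOM is worth handling explicitly. A minimal sketch in Python, using the standard "utf-8-sig" codec, which decodes UTF-8 and drops an initial BOM; the helper name read_corpus_file and the sample file are made up for the example:

```python
import codecs
import tempfile
from pathlib import Path


def read_corpus_file(path):
    """Return the text of a corpus file, without the leading BOM."""
    # "utf-8-sig" strips an initial byte-order mark if one is present.
    return Path(path).read_text(encoding="utf-8-sig")


# Demonstration on a temporary file written the way the corpus describes:
# UTF-8 bytes preceded by a BOM.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(codecs.BOM_UTF8 + "Liġijiet ta' Malta".encode("utf-8"))

    text = read_corpus_file(sample)
    # Decoding with plain "utf-8" keeps the BOM as a U+FEFF character.
    raw_text = sample.read_text(encoding="utf-8")

print(repr(text))
print(repr(raw_text[0]))
```

Opening the files with plain "utf-8" also works, but the first character of the text is then U+FEFF, which can upset tokenisers and string comparisons.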

The data is provided in four different formats, to suit different processing requirements:

Unicode: All characters are encoded using their proper Unicode values.

Plain: The non-ASCII Maltese characters are converted to their corresponding plain ASCII counterparts (c, g, h, z, etc.). This caters for the type of Maltese text often encountered when there is no means to display these characters properly.

Maltilex: The non-ASCII Maltese characters are replaced with an escape sequence (_c, _g, _h, _z, etc.) used in some Maltilex-related works.

Tornado: The non-ASCII characters are represented using the codes of the obsolete, non-Unicode Tornado font family. Unless formatted with this particular font family, the characters will appear as punctuation symbols in most normal fonts. This format is provided for completeness' sake only, as it is the original encoding of the source files from which the text was extracted.

In more detail, the main differences in character representation are given in the table below:

Unicode   Plain   Maltilex   Tornado
ċ         c       _c         `
Ċ         C       _C         ¬ or ~
ġ         g       _g         [
Ġ         G       _G         {
ħ         h       _h         ]
Ħ         H       _H         }
għ        gh      g_h        g]
Għ        Gh      G_h        G]
GĦ        GH      G_H        G}
ż         z       _z         \
Ż         Z       _Z         |
à         a       _a         à
À         A       _A         À
_         _       __         _
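The mappings in the table can be applied mechanically. The following is a minimal sketch, not the tool used to build the corpus: it converts Unicode text to the Plain and Maltilex forms and maps Tornado codes back to Unicode. The function names are invented for the example, and the input is assumed to use precomposed characters (the code normalises to NFC to be safe).

```python
import unicodedata

# Unicode character -> (Plain, Maltilex, Tornado), copied from the table.
# The digraph rows (gh, g_h, g]) follow from the single-character rows.
CHAR_TABLE = {
    "ċ": ("c", "_c", "`"),
    "Ċ": ("C", "_C", "¬"),  # the table also lists "~" for this letter
    "ġ": ("g", "_g", "["),
    "Ġ": ("G", "_G", "{"),
    "ħ": ("h", "_h", "]"),
    "Ħ": ("H", "_H", "}"),
    "ż": ("z", "_z", "\\"),
    "Ż": ("Z", "_Z", "|"),
    "à": ("a", "_a", "à"),
    "À": ("A", "_A", "À"),
    "_": ("_", "__", "_"),
}

_TO_PLAIN = str.maketrans({u: forms[0] for u, forms in CHAR_TABLE.items()})
_TO_MALTILEX = str.maketrans({u: forms[1] for u, forms in CHAR_TABLE.items()})

# Reverse Tornado mapping; rows where Tornado equals Unicode are skipped.
_tornado = {forms[2]: u for u, forms in CHAR_TABLE.items() if forms[2] != u}
_tornado["~"] = "Ċ"  # second Tornado code listed for Ċ
_FROM_TORNADO = str.maketrans(_tornado)


def to_plain(text):
    """Convert Unicode Maltese text to the Plain (ASCII-like) form."""
    return unicodedata.normalize("NFC", text).translate(_TO_PLAIN)


def to_maltilex(text):
    """Convert Unicode Maltese text to the Maltilex escape form."""
    return unicodedata.normalize("NFC", text).translate(_TO_MALTILEX)


def tornado_to_unicode(text):
    """Map Tornado codes to Unicode letters.

    Caveat: genuine punctuation such as [ ] { } | is converted too.
    """
    return text.translate(_FROM_TORNADO)


print(to_plain("għażla"))               # ghazla
print(to_maltilex("għażla"))            # g_ha_zla
print(tornado_to_unicode("g]a\\la"))    # għażla
```

Because Tornado reuses ordinary punctuation codes, the reverse mapping cannot tell a real bracket from a Maltese letter; this ambiguity is inherent in the encoding and is not resolved by the sketch.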

Download

The corpus can be downloaded as zip files containing the documents in Unicode, Plain, Maltilex and Tornado formats. The files can also be browsed online here.

Conditions of Use

Despite its origin, the data contained within this corpus is to be strictly interpreted as random text, and has no legal value whatsoever. In certain cases, the text content or formatting may not be coherent or well-structured. The data is freely available for academic research and non-commercial use. Parties interested in developing commercial applications involving the direct use of this data should seek authorisation.

Acknowledgements

We thank the Ministry of Justice and Home Affairs for the kind permission to make this data publicly available on-line.

Last update: January 23, 2005