John Kinder CITS3200 Project - Information from Letters

Extracting and storing information about letter writers and their readers

The project is to build software that automatically extracts predefined items from Word documents and compiles them into a searchable database. The Word documents were produced as part of a larger research programme that is developing a digital archive of private documents created by missionaries in colonial Australia, mainly letters, diaries and other private writings. For each original piece of correspondence, the Word documents record a set of metadata categories: for instance, name of the archive, archival code, author, recipient, place of writing, date of writing, language(s) used, and the names and places mentioned in the document. The metadata are not tagged in the Word files. The program to be developed will automatically extract the metadata under each category and store them in a searchable database.
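
As a rough illustration of the extraction step, the sketch below reads paragraph text out of a .docx file using only the Python standard library (a .docx file is a ZIP archive whose main text lives in word/document.xml) and then picks out metadata values. It rests on assumptions the brief does not confirm: that each category appears on its own line as "Label: value", and that the labels are those in the hypothetical LABELS list. The function names are also illustrative. Plain pattern matching stands in here for the "basic NLP" the brief mentions, and legacy .doc files are not handled.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical category labels (assumption); the real documents may differ.
LABELS = ["Archive", "Archival code", "Author", "Recipient",
          "Place", "Date", "Language"]


def read_docx_paragraphs(path):
    """Return the text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(t.text or "" for t in p.iter(W_NS + "t")).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


def extract_metadata(paragraphs, labels=LABELS):
    """Map each known label to its value, assuming 'Label: value' lines."""
    # Longest labels first so a short label never shadows a longer one.
    ordered = sorted(labels, key=len, reverse=True)
    pattern = re.compile(
        r"^\s*(" + "|".join(re.escape(l) for l in ordered) + r")\s*:\s*(.+)$",
        re.IGNORECASE,
    )
    canonical = {l.lower(): l for l in labels}
    record = {}
    for para in paragraphs:
        m = pattern.match(para)
        if m:
            record[canonical[m.group(1).lower()]] = m.group(2).strip()
    return record
```

Calling extract_metadata(read_docx_paragraphs("letter.docx")) would return a dictionary with one entry per label found; paragraphs that match no label are ignored, so free-form categories would need a more flexible approach.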

Ideally, the program will provide the following functions:

allow the user to import the original documents (doc or docx);
identify the metadata in the documents (through basic natural language processing, NLP);
create a standard database of the metadata;
allow the user to search for combinations of metadata in the database and point to the relevant source document.
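
The storage and search functions listed above could be sketched with SQLite, which ships with Python and runs on Windows, macOS and Linux. This is a minimal sketch under stated assumptions, not a proposed design: the table name letters, the column names, and the helper functions are all hypothetical, and only five of the metadata categories are modelled. A search combines any number of criteria with AND and returns the matching source documents.

```python
import sqlite3

# Hypothetical searchable columns (assumption); extend as categories are fixed.
SEARCHABLE = ("author", "recipient", "place", "date", "language")


def create_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS letters (
               id INTEGER PRIMARY KEY,
               source_file TEXT NOT NULL,
               author TEXT, recipient TEXT, place TEXT,
               date TEXT, language TEXT)"""
    )
    return conn


def _check(fields):
    # Column names are interpolated into SQL, so only allow known ones.
    unknown = set(fields) - set(SEARCHABLE)
    if unknown:
        raise ValueError(f"unknown metadata field(s): {sorted(unknown)}")


def add_letter(conn, source_file, **fields):
    """Store one letter's metadata together with its source document path."""
    _check(fields)
    cols = ["source_file", *fields]
    marks = ", ".join("?" for _ in cols)
    conn.execute(
        f"INSERT INTO letters ({', '.join(cols)}) VALUES ({marks})",
        [source_file, *fields.values()],
    )
    conn.commit()


def search(conn, **criteria):
    """Return source files whose metadata match ALL the given criteria."""
    _check(criteria)
    where = " AND ".join(f"{c} LIKE ?" for c in criteria) or "1=1"
    params = [f"%{v}%" for v in criteria.values()]  # substring match
    rows = conn.execute(
        f"SELECT source_file FROM letters WHERE {where} ORDER BY source_file",
        params,
    )
    return [r[0] for r in rows]
```

For example, search(conn, author="Smith", language="Italian") would list every source document whose author contains "Smith" and whose language contains "Italian". Values are passed as bound parameters, and column names are checked against a fixed list before being placed in the query text.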

The program must meet the following non-functional requirements:

be built in an open-source environment (preferably not written in Java);
provide a graphical user interface (GUI);
run on Windows and macOS (and possibly also Linux).

Client

Contact Persons: A/Prof John Kinder; Mr Francesco De Toni
Telephone: +61 8 6488 2192
Email: [email protected]; [email protected]
Preferred method of contact: email
Location: Room 2.06, Arts Building, Crawley Campus, UWA

Client Unavailability

None

IP Exploitation Model

The client wishes to use a Creative Commons CC BY-NC model to deal with IP embodied in the project.