Faraz Ahmed CITS3200 Project - Page Metadata Vision Models

Identifying Page Metadata Using Vision Models

Project Proposal

Our exploration geologists are required to read through hundreds of publicly available documents. They can only determine the value of a document after reading it, which is time-consuming and inefficient. We want to give the geologists more information upfront so that they can filter out irrelevant documents. To achieve this, we want to investigate using large or small language models with vision capabilities to generate tags that can be added to the documents and used for filtering.

The solution should be able to determine whether a page contains text, images and/or tables. It should also be able to describe the contents of each image or table.
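
As a rough illustration of what the per-page tags could look like once generated, the following Python sketch defines a hypothetical record and a simple keyword filter. The names PageTags and pages_matching, the field names, and the sample descriptions are all assumptions made for this example; the proposal does not prescribe a schema, and the model call that would populate these records is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class PageTags:
    """Hypothetical per-page metadata record; all field names are illustrative."""
    page_number: int
    has_text: bool = False
    has_images: bool = False
    has_tables: bool = False
    image_descriptions: list[str] = field(default_factory=list)
    table_descriptions: list[str] = field(default_factory=list)


def pages_matching(pages, keyword):
    """Return pages whose image or table descriptions mention the keyword."""
    needle = keyword.lower()
    return [
        p for p in pages
        if any(needle in d.lower()
               for d in p.image_descriptions + p.table_descriptions)
    ]


# Made-up sample tags, standing in for what a vision model might return.
pages = [
    PageTags(1, has_text=True),
    PageTags(2, has_text=True, has_images=True,
             image_descriptions=["Geological map of the survey area"]),
    PageTags(3, has_tables=True,
             table_descriptions=["Drill hole assay results"]),
]

matches = pages_matching(pages, "assay")
print([p.page_number for p in matches])
```

Keeping the three content flags separate from the free-text descriptions would let a geologist filter first on page type (for example, only pages with tables) and then on what the image or table shows.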

Additional Scope

We are also looking at extracting images from documents. Whilst images can be easily extracted from modern documents, older documents are provided as full-page scans, meaning a single page may contain multiple sub-images. Based on the output of the above solution, we may want to segment the page to extract these images.
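
One possible shape for the segmentation step is sketched below in Python, assuming the model can report a bounding box for each sub-image in pixel coordinates, which the proposal does not state. The function name crop_regions and the (left, top, right, bottom) box format are hypothetical, and the page is represented as a plain list of pixel rows to keep the example self-contained; a real implementation would more likely use an imaging library.

```python
def crop_regions(page, boxes):
    """Crop each (left, top, right, bottom) box out of a page.

    `page` is a list of rows, each row a list of pixel values.
    Right and bottom edges are exclusive, matching Python slicing.
    """
    crops = []
    for left, top, right, bottom in boxes:
        crops.append([row[left:right] for row in page[top:bottom]])
    return crops


# A tiny 4-row by 6-column stand-in for a scanned page.
page = [[10 * r + c for c in range(6)] for r in range(4)]

# Hypothetical bounding boxes, as a vision model might report them.
boxes = [(0, 0, 2, 2), (3, 1, 6, 4)]

sub_images = crop_regions(page, boxes)
for image in sub_images:
    print(len(image), "rows x", len(image[0]), "columns")
```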

Client

Contact: Faraz Ahmed
Phone: +61 8 6218 8888
Email: [email protected]
Preferred contact: Email
Location: Perth CBD

IP Exploitation Model

The IP exploitation model requested by the Client is: Creative Commons (open source) http://creativecommons.org.au/

Department of Computer Science & Software Engineering
The University of Western Australia
Last modified: 21 July 2024
Modified By: Michael Wise