CITS2002 Systems Programming

Lecture & Workshop
recordings on LMS

Schedule

FAQ

Unit outline

C textbooks

OS textbooks

Information resources

Look after yourself!

CITS2002 Systems Programming - Project 2 2022

Background

In recent years, computer disk sizes and densities have increased dramatically, with costs dropping to 4c/gigabyte (HDD) and 12c/gigabyte (SSD). As a consequence, we store an enormous number of files on our computers' disks, and finding required information is increasingly difficult. Historically, the names of files and directories alone reminded us of the location of information; today we require software tools to find information based on the content of files. In addition to file-system metadata storing information about our files and directories, we require an index of the content of our files.

As examples, the macOS operating system provides the Spotlight utility to index the contents of files, and there are a number of similar utilities developed for Linux.

The goal of this project is to design and develop a command-line utility program, named trove, to build and search a compressed index of the contents of files.

Successful completion of the project will develop and enhance your understanding of advanced features of the C11 programming language, core Linux and macOS system-calls, POSIX function calls, and the use of the make utility.

(Some of) the contents of all required files are to be stored in a trove-file. A trove-file will contain the absolute pathnames of the files that have been indexed, and each of the 'words' found in those files.

The words indexed by trove, are minimum-length sequences of alphanumeric characters. Some example words are: main, Australia, and GTHO351. The character sequence COVID-19 contains 2 words, but the second word might not be stored if indexed words must be at least 3 characters long.

Naturally, if many files are indexed, or if many words are found, the trove-file has the potential to become very large. Thus, the trove-file will be compressed when stored on disk.

Program invocation

From the command-line, the trove utility may be invoked in one of two ways, with command-line options indicating the actions to be performed.

prompt.1> ./trove [-f trovefile] word

The first invocation requests trove to list the absolute pathname of files containing the indicated word. Filenames are simply listed one-per-line (their order is not important).

If any file does not exist at the time of the search (it may have been deleted since the trove-file was built), then the file is not listed.

If any files containing the requested word are found, then trove will exit with success; otherwise with failure.
prompt.2> ./trove [-f trovefile] [-b | -r | -u] [-l length] filelist

The second invocation requests trove to build a new index, to remove information from an existing index, or to update an existing index.

A filelist is a sequence of one-or-more filenames or directory names. The contents of each named file should be indexed, and each named directory (and its subdirectories) should be traversed to locate files to be indexed.

Note that the files do not have to be text-files - they could equally be, for example, executable programs or object files.

If the requested action can be performed successfully, then trove will exit with success; otherwise with failure.

The following command-line options are to be supported:

-b	build a new trove-file from the contents of a filelist. The new trove-file will replace an existing trove-file of the same name.
-f trovefile	provide the name of the trove-file to be built or searched. If the -f option is not provided, the default name of the trove-file is /tmp/trove
-l length	only words of at least the indicated length should be added to the trove-file. length must be a positive integer, for example: -l 6. If the -l option is not provided, the default length is 4.
-r	if any of the files from the filelist appear in the trove-file, remove all of their information from the trove-file.
-u	update the trove-file with the contents of all files in the filelist. If the information from any of the files already exists in the trove-file, then that (old) information is first removed.

The trove-file format

While there is no required format for the trove-file your design should consider some important factors:

the filenames stored in the trove-file should be absolute pathnames - see man realpath
to reduce the size of the trove-file (and, thus, the time to read it), each indexed word and each absolute filename should (ideally) be stored only once.
the trove-file may, but does not have to be, a text-file.

Getting started

There is no required sequence of steps to undertake the project, and no sequence of steps will guarantee success. However, the following sequence is strongly recommended (and this is how the sample solution was built). It is assumed (considered essential for success!) that each step:

is extensively tested before you proceed to the next step, and
if any errors are found, the first error should be reported to stderr, and the program should immediately terminate.

The recommended initial steps:

find a project partner.
determine what activities are to be performed by the program. Determine what data needs to be stored during the execution of the program. Design one of more data types and data structures to store this long-term data, and create a new trove.h header file to define these constructs. These definitions will certainly evolve as your project develops.
break up the activities into fairly independent sets. Write a number of "empty" C files, each of which #includes your trove.h header file. Ensure that your main() function is implemented in a file named trove.c
write a Makefile to compile and link your C files, each of which will be dependent on the header file.
write the main() function, whose primary task is simply to receive and validate the command-line options and arguments. Write a usage() function to report the program's synopsis, should command-line processing detect any errors.
determine if trove is required to build/modify a trove-file, or to search an existing trove-file.
if building/modifying a trove-file, (recursively) traverse the filelist to find all of the files to be indexed. At this stage (as a debugging aid) just print each filename as it is found.
open each file found, and find all 'words' of at least the required length. At this stage (as a debugging aid) just print each word as it is found. Remember that the files do not have to be text-files.
store each word and each filename in which it is found in your data-structure(s).
....
initially, store your trove-file on disk without any compression.
learn how to compress files using the /usr/bin/gzip utility, and how to read the contents of a compressed file (without decompressing it) using the /usr/bin/zcat utility. Both utilities are available under macOS and Linux. When confident that your program works as required, add code to compress the file - see man gzip
....

It is anticipated that a successful project will very likely use (most of) the following Linux system-calls, and standard C11 & POSIX functions:

getopt(), malloc(), realloc(), strdup(), free(); opendir(), readdir(), closedir(), stat(), perror(); realpath(); access(), open(), read(), write(), close(); fork(), execl(), dup2(), wait().

Project requirements and fine print

Your project must be developed in multiple C11 source files and (at least one) header file.
Your submission will be compiled with the C11 compiler arguments -std=c11 -Wall -Werror, and marks will not be awarded if your submission does not compile.
Compilation and linking of your project must be supported by a Makefile, employing approriate variable definitions and automatic variables.
The default target of your Makefile must be named trove, and invocation of the make command must produce an executable program named trove.
The default name of the trove-file must be /tmp/trove, and the default minimum length of each indexed word must be 4
Each trove-file must be stored on, and read from, disk using the /usr/bin/gzip and /usr/bin/zcat utilities.
When performing a search, the names of files containing the required word should be listed one-per-line (commencing in the first column). You may leave 'debug printing' in your submitted program, but all debug printing should be indented with one-or-more space or tab characters.

Assessment

The project is due at 5pm Friday 21st October (end of week 12), and is worth 25% of your final mark for CITS2002.

It will be marked out of 50. The project may be completed individually or in teams of two (but not teams of three). The choice of project partners is up to you - you will not be automatically assigned a project partner.

You are strongly advised to work with another student who is around the same level of understanding and motivation as yourself. This will enable you to discuss your initial design together, and to assist each other to develop and debug your joint solution. Work together - do not attempt to split the project into two equal parts, and then plan to meet near the deadline to join your parts together.

25 of the possible 50 marks will come from the correctness of your solution. The remaining 25 marks will come from your programming style, including your use of meaningful comments; well chosen identifier names; appropriate choice of basic data-structures, data-types and functions; and appropriate choice of control-flow constructs.

During the marking, attention will obviously be given to the correctness of your solution. However, a correct and efficient solution should not be considered as the perfect, nor necessarily desirable, form of solution. Preference will be given to well presented, well documented solutions that use the appropriate features of the language to complete tasks in an easy to understand and easy to follow manner. That is, do not expect to receive full marks for your project simply because it works correctly. Remember, a computer program should not only convey a message to the computer, but also to other human programmers.

Good luck!

Chris McDonald.