School of Computer Science and Software Engineering

CITS3401 Data Warehousing and Data Mining — Project-1

Submission Instructions

Submit your document (PDF format) to cssubmit. Two deadlines are listed for CITS3401 Data Warehousing and Data Mining, select the first link "Data Warehousing using Jedox" for this project.

A suggestion on how the submission document should be structured
  1. An illustration and discussion of the design: the schema (star or snowflake), the lattic of cuboids and the concept hieararchies for each dimension. Use a starnet to explain how the design may help the OLAP analysis, and a justification of your choice among other possible alternatives you might have considered. (length: around 1-2 A4 pages)
  2. How the OLTP data are processed into the data warehouse for OLAP purposes: what pre-processing steps have been considered or taken. For example, how you designed the column headers to enable data-driven modelling and how some of the column values are processed to fit the concept hiearchy design. Brief procedure description of these steps is useful, possibly adding general discussions on the pros and cons of your choice. (length: around one A4 page)
  3. Example analysis: for a total of 10 example analysis, provide no less than 5 concrete examples. By concrete we mean the analysis needs to be carried out on data from BITRE, with results displayed as screenshots of tables and SUCCESS charts produced using JEDOX. Each analysis should be accompanied by a short description on what the analysis is about and what trends have been revealed. The rest of the examples could be hypothetical, i.e. not implemented, but as a way to provide recommendations to BITRE. (length: half an A4 page for each analysis)

News:

Data warehouse and Data cube design for Australian Road Deaths Database

Administrative Notes

This project is worth 20% of the total assessment of this unit, and is due April 15 17, 11:59 pm

You can work on this project either individually or with another group member. The group registration page as well as marking guideline will be available soon.

FAQ

Should this column be a dimension on its own or should it be a sub-level of an existing dimension?

The #1 rule to make such a decision is to check if the children adds up to the parent. Putting it into more practicle terms: to make a decision on whether "Gender" should be a sub-level for "Road User", you need to ask if all "Male" fatalities aggregate up to a certain type of "Road User" (e.g. "Driver"). If not, it is better to maintain them as separate dimensions. Then during queries ("Paste View"), you can interrogate the data to find out how many male drivers are dead as compared to female drivers.

How come I have millions of fatalities?

This shows that your concept hiearchies and dimension design encouraged double counting. For example, instead of parsing the "Date" column using Date[DATE,dd.MM.yyyy,EN,1989,2016], one may be tempted to model the "Day" and "Month" as subdimensions of Year, this will create extra combinations of day values with month values, in other words, the original unique date information is lost. For example, for Column "Day", you have 31 unique values, multiplied by the 12 unique values for "Month", we have counted the fatalties 12*31 times for a year, while there might be only a few for each month.

My column headers look all right, why does Jedox throw errors?

This could be to do with a few different reasons:

  • Subtle syntax errors - check to see if you have left space between the column name and the left square bracket, the dimension names should not contain space nor special characters;
  • Clash with reserved keywords - there is no clear documentation on what keywords are reserved, but we have spotted "ID", "USER" and "GROUP" have been causing problems. So avoid using these as your dimension names.
  • Multiple user trying to load large datasets at the same time - Jedox is using in-memeory multi-dimensional cube compuation. When multiple users all trying to load large datasets into the memory, the server will find it difficult to cope. No good solution to this, but as a compromise, marks will not be deducted if your design is sensible and a year worth of data is analysed. Also try to avoid the peak period, e.g. hours close to the submission deadline.

The Datasets

The Bureau of Infrastructure, Transport and Regional Economics (BITRE) of the Australian Government provides relevant data and analysis to inform both the federal government and wider community. One of the databases they provide in the public domain is the Australian Road Deaths Databases at:

https://www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx

BITRE has two different excel files that record the crashes that have fatalities from 1989 to 2016. One file records circumstances of the crash, e.g. date, location, and crash type. The other records information about the persons killed, e.g. age, gender and road user group. The two tables are linked through Crash ID.

Project Requirements

In this project, we are to use Jedox's data-driven modelling and the SUCCESS visualisation tool to prepare a document for Data Warehousing capability demonstration. Following the data warehouse design process explained in the lecture and documented in the first three labs,

  • ETL: Extract, Transform and Load these two datasets into an OLAP database;
  • Desgin the data warehouse schema and concept hierarchies for the dimensions;
  • Implement the data cubes;
  • Carry out OLAP analysis to answer business queries;
  • Prepare an OLAP report in a similar fashion to the Biker example;
  • Document the entire process concisely and convincingly in a final submssion as a PDF document.

The final submission is a PDF document that should consist of:

  • A good quality data warehouse schema that makes use of both datasets.
  • Detailed discussions with examples to demonstrate how your data cube can assist BITRE in analysing the data, advising the government for planning new infrastructure, and changing transport policies. A few examples of such analysis are given below, however these examples are by no means exhaustive and you should try to provide a good number of examples to improve the quality of your tender submission.
  • Implemented data cubes to support the above examples business queries. You do not need to submit the data cubes as the markers will have access to them on the Jedox Web Server. Relevant screenshots however, need to be included in the submission PDF file to show solutions for ten distinct example business queries your design will be able to answer.
  • You can discuss the general issues with anyone or on help forum, but not on specific details, such as the design of concept hiearchies and dimension tables.
  • You can makereasonable assumptions when implementing the prototype. For example, the public dataset is very high-level in terms of locations, only state information is provided. You can write scripts to generate finer grained data for road information, for example, which will allow for more detailed analysis at suburb level. To improve the services provided by BITRE, you can also suggest changing the process how BITRE collects its data. In fact, any other suggestions that you deem feasible and necessary can be included in your document. Explain from a data warehousing perspective, how the changes will help better decision making and transport infrastructure planning.

A few example analysis scenarios:

  • BITRE is interested in knowing how the time of the day (e.g. midnight, early morning, rush hour, morning, high-noon, afternoon, dusk, night and etc.) relates to crash patterns. Refer to this interesting thread about names of parts of day definition. You can come up with your own sensible groups as well. For example, time around sunrise, day time, time before sunset, and night.
  • Another analysis along the temporal dimension could be on seasons, spring, summer, fall and winter, or raining or non-raining days if you can look up Bureau of Meteorology data, or week days vs. weekends.
  • It should be possible to analyse trends at different levels of granularities, e.g., whether there is any trend if the same analysis is performed over different months, quarter, years, over 5 years intervals or across decades.
  • As you can see in the data, bus and trucks may be involved in a crash, BITRE is interested in higher level analysis of whether a big vehicle is involved in the crash, if so, is it a bus or a truck.
  • It should be possible to analyse the age group of fatalities, for example, young male or young female, whether they are drivers or not.
  • It should also be possible to group the states into Northern Australia and Southern Australia, or West Coast and East Coast, and do analysis on each region.
  • Speed limit can also be grouped into freeway speed, highway speed, residential area speed, school zone speed and etc.

Note: If you think the specification provided here is vague, imprecise or incomplete, you are correct. We wants you to suggest a quality solution.


This Page

Last Edited on:
Wednesday 13th of April 2016 03:23:02 AM

Website Feedback:
wei.liu@uwa.edu.au