Task 1.1 Choose a large organisation located within Australia that is publicly listed on the Australian Securities Exchange (ASX) and is already actively engaged in the Information Age. Briefly describe your chosen organisation, include the URL of its corporate website, and explain why you have chosen this organisation for Task 1 (about 250 words).
Honesty & transparency
Accuracy and access
to guide your analysis and discussion (about 1250 words)
Task 2 (Worth 35 Marks)
The goal of Task 2 is to predict whether a person has diabetes or not, based on data collected on 768 female Pima Indians contained in the diabetes.csv data set provided for Assignment 3 Task 2 (see Table 2.1 below for the Data Dictionary for the diabetes.csv data set). It is important you understand this data set in order to complete Task 2 and its four sub tasks.
Table 2.1 Data Dictionary for diabetes.csv
Number of times pregnant – gestational diabetes – age 25+
Plasma glucose concentration after 2 hours in an oral glucose tolerance test, normal when less than or equal to 110 mg/dL
Diastolic blood pressure (mm Hg): 60-80 mm Hg normal
Triceps skin fold thickness (mm) used to determine body fat percent – normal 23 mm
2-Hour serum insulin (mu U/ml): greater than 150 mu U/ml relates to insulin therapy
BMI: Body mass index (weight in kg/(height in m)^2). Ideal range between 18.5 and 24.9; less than 18.5 underweight, over 24.9 overweight – there is a link between obesity and diabetes
Diabetes Pedigree Function equates to history of diabetes in family: (a) 0.5 (50%) for parent, full sibling; (b) 0.25 (25%) for half sibling, grandparent, aunt, or uncle; (c) 0.125 (12.5%) for half aunt, half uncle, or first cousin
Age in years
Class variable (0 or 1): classification prediction of diabetes, 0 = False, 1 = True
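Note: the BMI entry in the data dictionary can be checked by hand. The following is a minimal sketch in Python, not part of the required RapidMiner work; the function names are hypothetical and the bands are the ones stated in the data dictionary.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by (height in m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Bands from the data dictionary: ideal range is 18.5 to 24.9."""
    if value < 18.5:
        return "underweight"
    if value <= 24.9:
        return "ideal"
    return "overweight"


if __name__ == "__main__":
    # Illustrative person (hypothetical values): 70 kg, 1.75 m tall.
    value = bmi(70, 1.75)
    print(round(value, 1), bmi_category(value))
```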
Task 2.1 Conduct an exploratory data analysis of the diabetes.csv data set using RapidMiner Studio data mining tool.
Provide the following for Task 2.1:
A screen capture of your final EDA process and briefly describe your final EDA process
Summarise the key results of your exploratory data analysis in a table named Table 2.1 Results of Exploratory Data Analysis for Diabetes.csv
Discuss the key results of your exploratory data analysis and provide a rationale for selecting your top 5-6 variables for predicting diabetes as the outcome based on the results of your exploratory data analysis and a review of the relevant literature on key factors contributing to likelihood of developing diabetes (About 500 words)
Table 2.1 should include the key characteristics of each variable in the diabetes.csv data set such as maximum, minimum values, average, standard deviation, most frequent values (mode), missing values and invalid values etc.
Hint: The Statistics Tab and the Chart Tab in RapidMiner provide a lot of descriptive statistical information and the ability to create useful charts, such as bar charts and scatterplots, for the EDA. You might also like to run some correlations and chi-square tests on the diabetes.csv data set to indicate which variables you consider to be the top 5-6 key variables that contribute most to predicting diabetes as an outcome.
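Note: to make clear what the descriptive statistics and correlations represent, the following is a minimal sketch in plain Python. It is not a substitute for the RapidMiner process. The function names are hypothetical, the sample readings are made up, and treating a value such as 0 as invalid is an assumption you would need to justify for each variable.

```python
import math
import statistics
from typing import Optional, Sequence


def summarise(values: Sequence[Optional[float]],
              invalid_value: Optional[float] = None) -> dict:
    """Descriptive statistics for one column.

    None counts as missing; entries equal to invalid_value count as invalid.
    Both are excluded before the other statistics are computed.
    """
    missing = sum(1 for v in values if v is None)
    invalid = sum(1 for v in values if v is not None and v == invalid_value)
    clean = [v for v in values if v is not None and v != invalid_value]
    return {
        "min": min(clean),
        "max": max(clean),
        "mean": statistics.mean(clean),
        "stdev": statistics.stdev(clean),  # sample standard deviation
        "mode": statistics.mode(clean),
        "missing": missing,
        "invalid": invalid,
    }


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Made-up readings; 0 is assumed here to mark an invalid measurement.
    readings = [72, 0, None, 80, 72, 64]
    print(summarise(readings, invalid_value=0))
```

The same quantities appear in RapidMiner's Statistics Tab; computing one column by hand is a useful check on how missing and invalid values affect the averages you report.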
Task 2.2 Build a Decision Tree model for predicting diabetes based on the diabetes.csv data set using RapidMiner and an appropriate set of data mining operators and a reduced set of variables from diabetes.csv determined by your exploratory data analysis in Task 2.1.
Provide the following for Task 2.2:
(1) Final Decision Tree Model process, (2) Final Decision Tree diagram, and (3) Decision tree rules.
Briefly explain your final Decision Tree Model Process, and discuss the results of the Final Decision Tree Model drawing on the key outputs (Decision Tree Diagram, Decision Tree Rules) for predicting diabetes. This discussion should be based on the contribution of each of the top five variables to the Final Decision Tree Model and relevant supporting literature on the interpretation of decision trees (About 250 words).
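Note: when interpreting why a variable appears near the top of a decision tree, it can help to see how a split is scored. The following is a minimal sketch of information gain for a single threshold split, assuming a numeric variable and a 0/1 class label. Information gain is only one common splitting criterion and the criterion selected in your RapidMiner operator may differ; the function names and sample values are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values: Sequence[float], labels: Sequence[int],
                     threshold: float) -> float:
    """Reduction in entropy from splitting at value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


if __name__ == "__main__":
    # Made-up glucose readings and class labels for illustration only.
    glucose = [90, 100, 150, 160]
    diabetic = [0, 0, 1, 1]
    print(information_gain(glucose, diabetic, threshold=110))
```

A variable whose best split gives a high gain separates the two classes well, which is why such variables tend to sit closer to the root of the tree.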
Task 2.3 Build a Logistic Regression model for predicting diabetes based on the diabetes.csv data set using RapidMiner and an appropriate set of data mining operators and a reduced set of variables determined by your exploratory data analysis in Task 2.1.
Provide the following for Task 2.3:
(1) Final Logistic Regression Model process, (2) Coefficients, and (3) Odds Ratios. Hint: you will need to install the Weka Extension in RapidMiner and use the W-Logistic Regression Operator for Task 2.3.
Briefly explain your final Logistic Regression Model Process and discuss the results of the Final Logistic Regression Model drawing on the key outputs (Coefficients, Odds Ratios) for predicting diabetes. This discussion should be based on the contribution of each of the top five variables to the Final Logistic Regression Model and relevant supporting literature on the interpretation of logistic regression models (About 250 words).
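Note: the relationship between coefficients and odds ratios can be sketched as follows. This is a minimal illustration in Python, assuming a standard logistic regression where each odds ratio is the exponential of the corresponding coefficient. The function names, variable names and coefficient values are hypothetical and are not results from diabetes.csv.

```python
import math
from typing import Dict


def odds_ratios(coefficients: Dict[str, float]) -> Dict[str, float]:
    """Odds ratio per variable: exp(coefficient).

    A value above 1 means the odds of the positive class rise as the
    variable increases by one unit; a value below 1 means they fall.
    """
    return {name: math.exp(beta) for name, beta in coefficients.items()}


def predicted_probability(intercept: float,
                          coefficients: Dict[str, float],
                          row: Dict[str, float]) -> float:
    """Probability of the positive class for one record."""
    log_odds = intercept + sum(beta * row[name]
                               for name, beta in coefficients.items())
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Hypothetical coefficients, for illustration only.
    example = {"glucose": 0.035, "bmi": 0.09}
    for name, ratio in odds_ratios(example).items():
        print(f"{name}: odds ratio {ratio:.3f}")
```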
Task 2.4 Conduct a comparative performance evaluation of your Final Decision Tree Model with your Final Logistic Regression Model for predicting diabetes. Note: you will need to use the Cross Validation Operator, Apply Model Operator and Performance (Binominal Classification) Operator in your final data mining process models (Decision Tree, Logistic Regression) to generate the model performance metrics (Accuracy, Misclassification Rate, True Positive Rate, False Positive Rate, Area Under ROC Curve (AUC), Precision, Recall, Lift, Sensitivity, F Measure) required for Task 2.4.
Provide the following for Task 2.4:
A screen snapshot of the Confusion Matrix and AUC for each Final Model (Decision Tree, Logistic Regression)
A table named Table 2.2 Results of Model Performance Evaluation (Decision Tree, Logistic Regression) that compares the key results of the performance evaluation for the Final Decision Tree Model and Final Logistic Regression Model in terms of Model Accuracy, Misclassification Rate, True Positive Rate, False Positive Rate, Precision, Recall, Lift, Sensitivity, F Measure.
Discuss and compare the key results of your performance evaluation of the two final models (Decision Tree, Logistic Regression) presented in parts i and ii of Task 2.4, indicate which model is better and explain why (About 500 words).
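Note: most of the performance metrics listed for Task 2.4 follow directly from the four cells of a confusion matrix. The following is a minimal sketch in Python with hypothetical counts. It assumes every denominator is non-zero and takes lift as precision divided by the overall proportion of positive cases, which is one common definition; check how your tool defines it. The function name is hypothetical.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Metrics derived from confusion matrix counts.

    tp/fn: actual positives predicted positive/negative.
    fp/tn: actual negatives predicted positive/negative.
    Assumes all denominators are non-zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity or true positive rate
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1.0 - accuracy,
        "true_positive_rate": recall,
        "false_positive_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "lift": precision / ((tp + fn) / total),
    }


if __name__ == "__main__":
    # Hypothetical counts, not results from diabetes.csv.
    for name, value in classification_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

Note that recall, sensitivity and true positive rate are the same quantity, so they should agree in Table 2.2.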
All important outputs from data mining analyses conducted using RapidMiner for Task 2 should be included in your Assignment 3 report to provide support for conclusions reached regarding each analysis conducted for Task 2.1, Task 2.2, Task 2.3 and Task 2.4.
Note: you will find the Sharda et al. (2018) and North textbooks useful references for the data mining process activities conducted in Task 2 in relation to the exploratory data analysis, decision tree analysis, logistic regression analysis and evaluation of the comparative performance of the Final Decision Tree model and the Final Logistic Regression model.
Task 3 (Worth 30 marks)
The aviation-wildlife.xlsx data set lists historical data recorded for the USA aviation industry regarding wildlife strikes with aircraft for the years 2000 to 2011. See Table 3.1, which provides the Data dictionary for the aviation-wildlife.csv data set. It is important you understand the variables in this data set in order to build the required Aircraft Wildlife Strikes (AWS) dashboard with four specified Tableau views.
Table 3.1 Data dictionary for aviation-wildlife.csv Data set
Name of Airport
< 1000 Metres, > 1000 Metres, Unknown
Make and Model of Aircraft
Wildlife: Number struck
Range of numbers
Effect: Impact to flight
None, Aborted Take-off, Engine Shut Down,
Precautionary Landing, Other
Text remarks recorded for flight
Location: Nearby if en route
Aircraft: Flight Number
Date of Flight
Record ID – unique integer number
Effect: Indicated Damage
No Damage, Caused Damage
Location: Freeform en route
Text remark recorded for flight
Aircraft: Number of engines?
1, 2, 3 or 4
Flight Origin State
When: Phase of flight
Take-off run, Approach, Climb, En-route,
Fog, None, Rain, Snow
Remains of wildlife collected?
Remains of wildlife sent to
Text remarks recorded regarding aviation –
Date aircraft collision with wildlife reported
Small, Medium, Large
No Cloud, Overcast, Some Cloud
Different types of wildlife mainly birds
When: Time (HHMM)
24 hour format
When: Time of day
Dawn, Day, Night, Dusk
Pilot warned of birds or wildlife?
Y = Yes, N = No
Cost: Aircraft time out of service
Cost: Other (inflation adj)
Cost: Repair (inflation adj)
Cost: Total $
Miles from airport
Feet above ground
Number of human fatalities
Number of people injured
Speed (IAS) in knots
Task 3 requires you to build a Tableau dashboard which includes four different views of the aviation-wildlife.csv data set for the years 2000-2011 as specified in sub Tasks 3.1, 3.2, 3.3 and 3.4.
Task 3.1 Create a Tableau View of the impact of wildlife strikes with aircraft over time for a specific origin state. Provide a screen capture of and describe the Tableau view you have created, and comment on the different types of impact to aircraft from wildlife strikes over time and whether this differs much for different origin states (About 125 words).
Task 3.2 Create a Tableau View of flight phase by time of day which shows when wildlife strikes with aircraft occur. Provide a screen capture of and describe the Tableau view you have created, and comment on which phase of a flight and time of day wildlife strikes with aircraft are more likely to occur (About 125 words).
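Note: a view of flight phase by time of day is, underneath, a count of records for each combination of the two fields. The following is a minimal sketch in Python of that cross-tabulation, offered only to clarify what the Tableau view aggregates. The field names are taken from the data dictionary; the function name and sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def crosstab(records: Iterable[Dict[str, str]],
             row_field: str, column_field: str) -> Counter:
    """Count records for each (row value, column value) combination."""
    counts: Counter = Counter()
    for record in records:
        key: Tuple[str, str] = (record[row_field], record[column_field])
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Made-up records using value labels from the data dictionary.
    sample = [
        {"When: Phase of flight": "Approach", "When: Time of day": "Day"},
        {"When: Phase of flight": "Approach", "When: Time of day": "Day"},
        {"When: Phase of flight": "Climb", "When: Time of day": "Night"},
    ]
    table = crosstab(sample, "When: Phase of flight", "When: Time of day")
    for (phase, time_of_day), count in table.most_common():
        print(phase, time_of_day, count)
```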
Task 3.3 Create a Tableau View that compares wildlife species in order of aircraft strike frequency and the chance of damage occurring. Provide a screen capture of the Tableau view you have created and comment on which wildlife species are most frequently involved in aircraft strikes and which wildlife species are most likely to have the most impact in terms of damage (total cost) when an aircraft strike occurs (About 125 words).
Task 3.4 Create a Tableau GeoMap View of flights by origin states that displays the number of wildlife strikes and total monetary cost for each origin state for different periods of time. Provide a screen capture of and describe the Tableau view you have created and comment on this Tableau GeoMap View in relation to the number of wildlife strikes by origin state and total monetary cost over time. A number of origin states cannot be plotted on the geomap view as these are outside USA, comment on how you can deal with this issue (About 125 words).
Note: you need to copy the four Text Table / Graph views and the dashboard you have created in Tableau using the Worksheet Menu Copy or Export Image option and include them in the Task 3 section where relevant or in Appendix 3 of the Assignment 3 report.
Task 3.5 Provide a screen snapshot of your AWS Dashboard and an accompanying rationale (drawing on the relevant literature for good dashboard design) for the graphic design and functionality that is provided by your AWS Dashboard for the four specified Tableau views for sub Tasks 3.1, 3.2, 3.3 and 3.4 (About 500 words).
Report presentation, writing style and referencing (worth 15 marks)
Presentation: Cover page, table of contents, page numbers, headings, sub headings, tables and diagrams, use of formatting, spacing, paragraphs.
Writing style: Use of English (correct use of language and grammar; evidence of spell-checking and proofreading).
Quality of research evident by appropriate referencing: Appropriate level of referencing in text where required for a sub task, reference list provided, correct use of Harvard Referencing Style.
Assignment 3 Report should be structured as follows:
Assignment 3 Cover page
Table of Contents
Task 1 Main Heading
Task 1 Sub Tasks – Sub headings for Tasks 1.1 and 1.2
Task 2 Main Heading
Task 2 Sub Tasks – Sub headings for Task 2.1, 2.2, 2.3 and 2.4
Task 3 Main Heading
Task 3 Sub Tasks – Sub headings for Task 3.1, 3.2, 3.3, 3.4 and 3.5
List of References
List of Appendices