Task 1.1 Choose a large organisation located within Australia that is publicly listed on the Australian Stock Exchange and is already actively engaged in the Information Age. Briefly describe your chosen organisation, include the URL link to its corporate website, and explain why you have chosen this organisation for Task 1 (about 250 words).

Task 1.2 Conduct desktop research to analyse your chosen organisation in terms of the security and privacy policy statements available on its website. Provide the URL links to the security and privacy policy statements available online in your answer to Task 1.2 and then discuss how governance of privacy and security of data is addressed in this organisation drawing on the nine core principles of the :

No-harm rule

Honesty & transparency

Fairness

Choice

Accuracy and access

Accountability

Stewardship

Security

Enforcement

to guide your analysis and discussion (about 1250 words)

Task 2 (Worth 35 Marks)

The goal of Task 2 is to predict whether a person has diabetes or not based on data collected on 768 female Pima Indians contained in the diabetes.csv data set provided for Assignment 3 Task 2 (see Table 2.1 below for the Data Dictionary for the diabetes.csv data set). It is important that you understand this data set in order to complete Task 2 and its four sub tasks.

Table 2.1 Data Dictionary for diabetes.csv

| Variable Name | Data Type | Description |
| --- | --- | --- |
| Pregnancies | Integer | Number of times pregnant – gestational diabetes – age 25+ |
| Glucose | Integer | Plasma glucose concentration after 2 hours in an oral glucose tolerance test, normal when less than/equal to 110 mg/dL |
| BloodPressure | Integer | Diastolic blood pressure (mm Hg): 60-80 mm normal |
| SkinThickness | Integer | Triceps skin fold thickness (mm) used to determine body fat percent – normal 23 mm |
| Insulin | Integer | 2-Hour serum insulin (mu U/ml); greater than 150 mu U/ml relates to insulin therapy |
| BMI | Real | Body mass index (weight in kg/(height in m)^2). Ideal range between 18.5 and 24.9; less than 18.5 underweight, over 24.9 overweight – there is a link between obesity and diabetes |
| DiabetesPedigreeFunction | Real | Diabetes Pedigree Function equates to history of diabetes in family: (a) 0.5 (50%) for parent, full sibling; (b) 0.25 (25%) half sibling, grandparent, aunt, or uncle; (c) 0.125 (12.5%) half aunt, half uncle, or first cousin |
| Age | Integer | Age in years |
| Outcome | Integer | Class variable (0 or 1), classification prediction of diabetes: 0 = False, 1 = True |

Task 2.1 Conduct an exploratory data analysis of the diabetes.csv data set using the RapidMiner Studio data mining tool.

Provide the following for Task 2.1:

A screen capture of your final EDA process and briefly describe your final EDA process

Summarise the key results of your exploratory data analysis in a table named Table 2.1 Results of Exploratory Data Analysis for Diabetes.csv

Discuss the key results of your exploratory data analysis and provide a rationale for selecting your top 5-6 variables for predicting diabetes as the outcome based on the results of your exploratory data analysis and a review of the relevant literature on key factors contributing to likelihood of developing diabetes (About 500 words)

Table 2.1 should include the key characteristics of each variable in the diabetes.csv data set such as maximum, minimum values, average, standard deviation, most frequent values (mode), missing values and invalid values etc.

Hint: The Statistics Tab and the Chart Tab in RapidMiner provide a lot of descriptive statistical information and the ability to create useful charts such as bar charts and scatterplots for the EDA analysis. You might also like to run some correlations and chi-square tests on the diabetes.csv data set to indicate which variables you consider to be the top 5-6 key variables which contribute most to predicting diabetes as an outcome.
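As a rough cross-check outside RapidMiner, the descriptive statistics requested for the EDA results table and a simple correlation can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `summarise` and `pearson`, the sample values, and the rule that treats a zero reading as invalid are all assumptions made for the example and are not part of the assignment brief or the diabetes.csv data set.

```python
import statistics


def summarise(values, invalid=lambda v: False):
    """Return descriptive statistics for one variable.

    `values` may contain None for missing entries. `invalid` is a
    caller-supplied rule for flagging implausible readings (an assumption
    for this sketch, not something defined by the brief).
    """
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "mode": statistics.mode(present),
        "missing": len(values) - len(present),
        "invalid": sum(1 for v in present if invalid(v)),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Made-up readings for illustration only; NOT taken from diabetes.csv.
glucose = [100, 120, 120, 0, 140, None]
# Assumed rule: a reading of zero or below is treated as invalid.
stats = summarise(glucose, invalid=lambda v: v <= 0)
```

The same summary can be repeated per variable to fill one row of the results table, and `pearson` can be applied between each candidate variable and Outcome to help rank variables before confirming the ranking in RapidMiner.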

Task 2.2 Build a Decision Tree model for predicting diabetes based on the diabetes.csv data set using RapidMiner and an appropriate set of data mining operators and a reduced set of variables from diabetes.csv determined by your exploratory data analysis in Task 2.1.

Provide the following for Task 2.2:

(1) Final Decision Tree Model process, (2) Final Decision Tree diagram, and (3) Decision tree rules.

Briefly explain your final Decision Tree Model Process, and discuss the results of the Final Decision Tree Model drawing on the key outputs (Decision Tree Diagram, Decision Tree Rules) for predicting diabetes. This discussion should be based on the contribution of each of the top five variables to the Final Decision Tree Model and relevant supporting literature on the interpretation of decision trees (About 250 words).

Task 2.3 Build a Logistic Regression model for predicting diabetes based on the diabetes.csv data set using RapidMiner and an appropriate set of data mining operators and a reduced set of variables determined by your exploratory data analysis in Task 2.1.

Provide the following for Task 2.3:

(1) Final Logistic Regression Model process, (2) Coefficients, and (3) Odds Ratios. Hint: you will need to install the Weka Extension in RapidMiner and use the W-Logistic Regression Operator for Task 2.3.

Briefly explain your final Logistic Regression Model Process and discuss the results of the Final Logistic Regression Model drawing on the key outputs (Coefficients, Odds Ratios) for predicting diabetes. This discussion should be based on the contribution of each of the top five variables to the Final Logistic Regression Model and relevant supporting literature on the interpretation of logistic regression models (About 250 words).
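When interpreting the logistic regression outputs, it can help to remember that an odds ratio is the exponential of the corresponding coefficient. The short Python sketch below illustrates that relationship; the function name `odds_ratios` and the coefficient values are hypothetical and chosen purely for illustration, not results from the diabetes.csv data set.

```python
import math


def odds_ratios(coefficients):
    """Map each logistic regression coefficient b to its odds ratio exp(b)."""
    return {name: math.exp(b) for name, b in coefficients.items()}


# Hypothetical coefficients for illustration only; not fitted to diabetes.csv.
example = {"Glucose": 0.03, "BMI": 0.09, "Age": 0.0}
ratios = odds_ratios(example)
```

An odds ratio above 1 indicates that a one-unit increase in the variable multiplies the odds of a positive outcome by that factor, holding the other variables constant; a ratio of exactly 1 (coefficient of 0) indicates no effect on the odds.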

Task 2.4 Conduct a comparative performance evaluation of your Final Decision Tree Model with your Final Logistic Regression Model for predicting diabetes. Note: you will need to use the Cross Validation Operator, Apply Model Operator and Performance (Binominal Classification) Operator in your final data mining process models (Decision Tree, Logistic Regression) to generate the required model performance metrics (Accuracy, Misclassification Rate, True Positive Rate, False Positive Rate, Area under the ROC Curve (AUC), Precision, Recall, Lift, Sensitivity, F Measure) required for Task 2.4.

Provide the following for Task 2.4:

(i) A screen snapshot of the Confusion Matrix and AUC for each Final Model (Decision Tree, Logistic Regression)

(ii) A table named Table 2.2 Results of Model Performance Evaluation (Decision Tree, Logistic Regression) that compares the key results of the performance evaluation for the Final Decision Tree Model and Final Logistic Regression Model in terms of Model Accuracy, Misclassification Rate, True Positive Rate, False Positive Rate, Precision, Recall, Lift, Sensitivity, F Measure.

(iii) Discuss and compare the key results of your performance evaluation of the two final models (Decision Tree, Logistic Regression) presented in parts i and ii of Task 2.4, indicate which model is better and explain why (About 500 words).
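Most of the metrics listed for Task 2.4 follow directly from the four cells of a confusion matrix, which may help when checking the values reported by the tool. The Python sketch below uses the standard textbook definitions; the function name `classification_metrics` and the example counts are hypothetical. Lift is taken here as precision divided by the proportion of actual positives, which is one common definition and is an assumption that should be checked against the definition used in RapidMiner's output.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion matrix counts.

    tp, fp, fn, tn = true positives, false positives, false negatives,
    true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)        # also called sensitivity or true positive rate
    precision = tp / (tp + fp)
    base_rate = (tp + fn) / total  # proportion of actual positives
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "true_positive_rate": recall,
        "false_positive_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "lift": precision / base_rate,  # assumed definition, see note above
    }


# Hypothetical counts for illustration only; not results from any model.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

Note that True Positive Rate, Recall and Sensitivity are three names for the same quantity, so they should report identical values for a given model.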

All important outputs from data mining analyses conducted using RapidMiner for Task 2 should be included in your Assignment 3 report to provide support for conclusions reached regarding each analysis conducted for Task 2.1, Task 2.2, Task 2.3 and Task 2.4.

Note: you will find the Sharda et al. (2018) and North textbooks useful references for the data mining process activities conducted in Task 2 in relation to the exploratory data analysis, decision tree analysis, logistic regression analysis and evaluation of the comparative performance of the Final Decision Tree model and the Final Logistic Regression model.

Task 3 (Worth 30 marks)

The aviation-wildlife.xlsx lists historical data recorded for the USA aviation industry regarding wildlife strikes with aircraft for the years 2000 to 2011. See Table 3.1, which provides the Data dictionary for the aviation-wildlife.csv Data set. It is important that you understand the variables in this data set in order to build the required Aircraft Wildlife Strikes (AWS) dashboard with four specified Tableau views.

Table 3.1 Data dictionary for aviation-wildlife.csv Data set

| No. | Variable Name | Data Type | Description |
| --- | --- | --- | --- |
| 1 | Aircraft:Type | Categorical | Aircraft, Helicopter |
| 2 | Airport:Name | Categorical | Name of Airport |
| 3 | Altitude-Bin | Categorical | < 1000 Metres, > 1000 Metres, Unknown |
| 4 | Aircraft:Make/Model | Categorical | Make and Model of Aircraft |
| 5 | Wildlife: Number struck | Categorical | Range of numbers |
| 6 | Effect: Impact to flight | Categorical | None, Aborted Take-off, Engine Shut Down, Precautionary Landing, Other |
| 7 | Effect: Other | Categorical | Text remarks recorded for flight |
| 8 | Location: Nearby if en route | Categorical | State Abbreviation |
| 9 | Aircraft: Flight Number | Real | |
| 10 | FlightDate | Date | Date of Flight |
| 11 | Record ID | Integer | Record ID – unique integer number |
| 12 | Effect: Indicated Damage | Categorical | No Damage, Caused Damage |
| 13 | Location: Freeform en route | Categorical | Text remark recorded for flight |
| 14 | Aircraft: Number of engines? | Integer | 1, 2, 3 or 4 |
| 15 | Aircraft: Airline/Operator | Categorical | Airline Operator |
| 16 | Origin State | Categorical | Flight Origin State |
| 17 | When: Phase of flight | Categorical | Take-off run, Approach, Climb, En-route, Landing Roll |
| 18 | Conditions: Precipitation | Categorical | Fog, None, Rain, Snow |
| 19 | Remains of wildlife collected? | Categorical | False, True |
| 20 | Remains of wildlife sent to Smithsonian | Categorical | False, True |
| 21 | Remarks | Categorical | Text remarks recorded regarding aviation – wildlife collision |
| 22 | Reported: Date | Date | Date aircraft collision with wildlife reported |
| 23 | Wildlife:Size | Categorical | Small, Medium, Large |
| 24 | Conditions: Sky | Categorical | No Cloud, Overcast, Some Cloud |
| 25 | Wildlife: Species | Categorical | Different types of wildlife, mainly birds |
| 26 | When: Time (HHMM) | Categorical | 24 hour format |
| 27 | When: Time of day | Categorical | Dawn, Day, Night, Dusk |
| 28 | Pilot warned of birds or wildlife? | Categorical | Y = Yes, N = No |
| 29 | Cost: Aircraft time out of service | Integer | (hours) |
| 30 | Cost: Other (inflation adj) | Integer | |
| 31 | Cost: Repair (inflation adj) | Integer | |
| 32 | Cost: Total $ | Integer | |
| 33 | Miles from airport | Integer | |
| 34 | Feet above ground | Integer | |
| 35 | Number of human fatalities | Integer | |
| 36 | Number of people injured | Integer | |
| 37 | Speed (IAS) in knots | Integer | |

Task 3 requires you to build a Tableau dashboard which includes four different views of the aviation-wildlife.csv data set for the years 2000-2011 as specified in sub Tasks 3.1, 3.2, 3.3 and 3.4.

Task 3.1 Create a Tableau View of the impact of wildlife strikes with aircraft over time for a specific origin state. Provide a screen capture of and describe the Tableau view you have created, and comment on the different types of impact to aircraft from wildlife strikes over time and whether this differs much between origin states (About 125 words).

Task 3.2 Create a Tableau View of flight phase by time of the day which shows when wildlife strikes with aircraft occur. Provide a screen capture of and describe the Tableau view you have created and comment on which phase of a flight and time of the day wildlife strikes with aircraft are more likely to occur (about 125 words).

Task 3.3 Create a Tableau View that compares wildlife species in order of aircraft strike frequency and the chance of damage occurring. Provide a screen capture of and describe the Tableau view you have created and comment on which wildlife species are most frequently involved in aircraft strikes and which wildlife species are most likely to have the most impact in terms of damage (total cost) when an aircraft strike occurs (about 125 words).
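The two quantities compared in this view, strike frequency and chance of damage, can be sketched outside Tableau as a count per species and the share of those strikes recorded as "Caused Damage". The Python sketch below is illustrative only: the function name `species_summary`, the simplified keys `species` and `damage` (standing in for the Wildlife: Species and Effect: Indicated Damage variables), and the sample records are all hypothetical.

```python
from collections import Counter


def species_summary(records):
    """Return (species, strikes, damage_rate) tuples, most frequent species first."""
    strikes = Counter(r["species"] for r in records)
    damaged = Counter(
        r["species"] for r in records if r["damage"] == "Caused Damage"
    )
    # damage_rate = share of a species' strikes that caused damage
    return [(s, n, damaged[s] / n) for s, n in strikes.most_common()]


# Hypothetical records for illustration only; not rows from the data set.
records = [
    {"species": "Species A", "damage": "No Damage"},
    {"species": "Species A", "damage": "No Damage"},
    {"species": "Species A", "damage": "Caused Damage"},
    {"species": "Species B", "damage": "Caused Damage"},
]
summary = species_summary(records)
```

In this made-up sample the most frequently struck species is not the one with the highest damage rate, which is the kind of contrast the Tableau view is meant to bring out.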

Task 3.4 Create a Tableau GeoMap View of flights by origin state that displays the number of wildlife strikes and total monetary cost for each origin state for different periods of time. Provide a screen capture of and describe the Tableau view you have created and comment on this Tableau GeoMap View in relation to the number of wildlife strikes by origin state and total monetary cost over time. A number of origin states cannot be plotted on the GeoMap view as these are outside the USA; comment on how you can deal with this issue (About 125 words).

Note: you need to copy the four Text Table / Graph views and the dashboard you have created in Tableau using the Worksheet Menu Copy or Export Image option and include them in the Task 3 section where relevant or in Appendix 3 of the Assignment 3 report.

Task 3.5 Provide a screen snapshot of your AWS Dashboard and an accompanying rationale (drawing on the relevant literature for good dashboard design) for the graphic design and functionality that is provided by your AWS Dashboard for the four specified Tableau views for sub Tasks 3.1, 3.2, 3.3 and 3.4 (About 500 words).

Report presentation, writing style and referencing (worth 15 marks)

Presentation: Cover page, table of contents, page numbers, headings, sub headings, tables and diagrams, use of formatting, spacing, paragraphs.

Writing style: Use of English (correct use of language and grammar; evidence of spell-checking and proofreading).

Quality of research evidenced by appropriate referencing: Appropriate level of referencing in text where required for a sub task, reference list provided, Harvard Referencing Style used correctly.

Assignment 3 Report should be structured as follows:

Assignment 3 Cover page

Table of Contents

Task 1 Main Heading

Task 1 Sub Tasks – Sub headings for Tasks 1.1 and 1.2

Task 2 Main Heading

Task 2 Sub Tasks – Sub headings for Task 2.1, 2.2, 2.3 and 2.4

Task 3 Main Heading

Task 3 Sub Tasks – Sub headings for Task 3.1, 3.2, 3.3, 3.4 and 3.5

List of References

List of Appendices