Table of Contents
INFS4018 Assignment 1
Project methodology
Business understanding
Data understanding
Data preparation
Assignment 1
Assignment 2
Business understanding
Analysis of data
Final report
Data is created as a by-product of practice or as a result of targeted activity. While this data is used for the primary business of the organisation/practice, it can also be used to extract business intelligence. One of the frameworks used in industry is the Cross-Industry Standard Process for Data Mining (CRISP-DM). This process has several phases:
Before you start any attempt to collect/analyse data you need to get a good idea why you are doing the exercise – understand the purpose. The main components are:
• Determine business objectives
– Initial situation/problem etc. (…we have crowded emergency departments (ED)…)
• Assess situation
– Inventory of resources (personnel, data, software)
– Requirements (e.g. deadline), constraints (e.g. legal issues), risks
Understanding your business will help you determine the scope of the project, the timeframe, budget etc.
The next step is to look at what data is needed (and available) and to write data definitions, so that we know exactly what we are talking about. This is very important when aggregating apparently identical data: the definitions may not be the same!
• Collect initial data
– Acquire data listed in project resources
– Report locations of data, methods used to acquire them, …
• Describe data
– Examine "surface" properties
– Report, for example, format, quantity of data, … → Data dictionary
• Explore data
– Examine central tendencies, distributions, …
– Report insights suggesting examination of particular data subsets (data selection)
• Verify data quality
– Is the data complete? (missing values)
– Is the data correct? (integrity constraints)
– Is the data noisy or are there outliers?
NB: this is an initial exploration – scouting the problem space. It helps you to understand what data is available and to align your approach with the business objectives and the data available. At the same time, this phase can help to verify whether the project is viable (feasibility) and to refine the project scope, budget, resources etc.
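The three quality questions above (completeness, correctness, outliers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the timestamp format and the rule "triage must not precede arrival" are invented for the example, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
from datetime import datetime
from statistics import quantiles

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format

# Invented sample: (visit id, arrival, triage). None marks a missing value.
ROWS = [
    ("V01", "2024-03-01 10:00", "2024-03-01 10:04"),
    ("V02", "2024-03-01 10:10", None),
    ("V03", "2024-03-01 10:20", "2024-03-01 10:12"),
    ("V04", "2024-03-01 10:30", "2024-03-01 10:35"),
    ("V05", "2024-03-01 10:40", "2024-03-01 10:46"),
    ("V06", "2024-03-01 10:50", "2024-03-01 10:56"),
    ("V07", "2024-03-01 11:00", "2024-03-01 11:07"),
    ("V08", "2024-03-01 11:10", "2024-03-01 11:18"),
    ("V09", "2024-03-01 11:20", "2024-03-01 11:29"),
    ("V10", "2024-03-01 11:30", "2024-03-01 15:30"),
]


def wait_minutes(arrival, triage):
    """Minutes from arrival to triage, or None if either value is missing."""
    if arrival is None or triage is None:
        return None
    delta = datetime.strptime(triage, FMT) - datetime.strptime(arrival, FMT)
    return delta.total_seconds() / 60


def quality_report(rows):
    """Return visit ids that fail each of the three quality checks."""
    missing, out_of_order, waits = [], [], {}
    for visit, arrival, triage in rows:
        wait = wait_minutes(arrival, triage)
        if wait is None:
            missing.append(visit)        # completeness: a value is missing
        elif wait < 0:
            out_of_order.append(visit)   # integrity: triage before arrival
        else:
            waits[visit] = wait
    # Outliers: waits further than 1.5 * IQR outside the quartiles.
    q1, _, q3 = quantiles(list(waits.values()), n=4)
    spread = 1.5 * (q3 - q1)
    outliers = [v for v, w in waits.items()
                if w < q1 - spread or w > q3 + spread]
    return {"missing": missing, "out_of_order": out_of_order,
            "outliers": outliers}


report = quality_report(ROWS)
print(report)
```

A flagged record is not necessarily wrong: a four-hour wait may be a genuine event, so the report is a list of records to examine, not a list of records to delete.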
Typically any data you get is not in the right format for analysis (it was collected for other purposes such as providing care or managing the practice) and needs to be pre-processed.
• Select data
– Relevance to the data mining goals
– Quality of data
– Technical constraints, e.g. limits on data volume
• Clean data
– Raise data quality if possible
– Selection of clean subsets
– Insertion of defaults
• Construct data
– Derived attributes (e.g. age = NOW – DOB)
• Integrate data
– Merge data in different sources
– Merge data within source (tuple merging)
• Format data
– Data must conform to the requirements of the initially selected mining tools (e.g. the input format for Weka differs from that for Disco).
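As a small illustration of the "Construct data" step, the derived attribute age = NOW – DOB can be computed as below. The function name and the choice to pass an explicit reference date (rather than reading the current date inside the function) are assumptions made for this sketch; a fixed reference date, such as the date of the visit, keeps the derived value reproducible.

```python
from datetime import date


def age_in_years(dob: date, on: date) -> int:
    """Completed years between a date of birth and a reference date."""
    # Has the birthday already occurred in the reference year?
    had_birthday = (on.month, on.day) >= (dob.month, dob.day)
    return on.year - dob.year - (0 if had_birthday else 1)


# Hypothetical patient born 15 June 1990, seen on two different days.
print(age_in_years(date(1990, 6, 15), date(2024, 6, 14)))
print(age_in_years(date(1990, 6, 15), date(2024, 6, 15)))
```

Whatever formula is chosen, it belongs in the data dictionary next to the new attribute, so that a reader can tell exactly how the value was constructed.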
This phase goes hand-in-hand with data preparation. Here you select which analytic techniques you plan to use, in which sequence etc. Once you have the analysis design, you execute it.
• Select modelling technique
– Make selection during business understanding phase concrete
– E.g., neural network with back propagation
• Generate test design
– Separate test data from training data (in case of supervised learning)
– Define quality measures for the model
• Build model
– List parameters and chosen values
• Assess model
At the end of the Data preparation-Modelling phase you have a set of results coming from the analysis (you have a model). NB: this needs to be assessed and evaluated from the technical point of view (to mitigate issues such as overfitting etc.).
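For supervised learning, the test design described above (separate test data from training data, then define a quality measure) might look like the following sketch. The function names, the 25 % test fraction and the use of accuracy as the quality measure are illustrative assumptions, not requirements of CRISP-DM.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed: repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Quality measure: share of predictions that match the true labels."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Twenty invented record ids stand in for real data.
train, test = train_test_split(range(20))
print(len(train), len(test))
```

Keeping the test rows out of model building is what allows the technical assessment to detect overfitting: a model that scores well on the training rows but poorly on the held-back rows has memorised the data rather than learned a pattern.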
Here you evaluate the results (model) from the business perspective (Did we learn something new? How do the results fit into knowledge we already have? etc.).
• Evaluate results from business perspective
– Test models on test applications if possible
• Review process
– Determine if there are any important factors or tasks that have been overlooked
• Determine next steps
NB: this phase may lead to a change in business understanding (which starts a new cycle of business intelligence), may conclude that the result is viable as-is and can be deployed, or both.
In this phase you conclude the project.
• Plan deployment
– Determine how results (discovered knowledge) are effectively used to reach business objectives
• Plan monitoring and maintenance
– Results become part of day-to-day business and therefore need to be monitored and maintained.
• Final report
• Project review
– Assess what went right and what went wrong; debriefing
The project will be split into 2 parts. Assignment 1 will be the planning part; in Assignment 2 you will execute your plan.
Assignment 1 will require you to achieve a workable level of business understanding, look at the data and achieve a workable level of data understanding. From this material you will write a project plan. This will contain:
• Justification of your project (why you propose to do the BI/data-mining) – here you demonstrate workable business understanding
• Plan what to do:
o With data (here you will be able to demonstrate your data understanding)
▪ Data transformations
▪ Data cleansing?
o Methods you plan to apply in your analysis (justify your choice!)
Assignment 2 will be execution of your plan. This report will contain:
• Background and motivation (recycled from the introduction of Assignment 1)
• Methods used with justification (mostly recycled from Assignment 1, but you may change/improve)
• Results of your analysis
• Interpretation of results in terms of:
o Data quality
• Conclusions and recommendations (we discuss this later)
How the components of the topic map to CRISP-DM – see below.
Emergency department overcrowding is a serious problem. Overcrowding is associated with inferior clinical outcomes (such as mortality), as well as with reduced quality and timeliness of therapy. In this semester you will be analysing processes in an emergency department. The basic model for a patient moving across the ED is as follows:
Write a brief justification of a project measuring times of arrival, triage, clinical care, and departure. This will require you to find a few resources and get a basic idea of how an emergency department operates.
Your work will serve as Introduction/Background section.
Analysis of data
You will be given a dataset of realistic data measuring the time points of ED processes in an institution in NSW. Your task will be to have a look at the data (with your understanding of ED processes from your previous reading) and:
• Extract a data dictionary from the NSW document (will be provided) and add description of any data you construct.
• Select which data you will be using for your analysis (and justify your choice)
• Construct data (e.g. duration between “triage” and “seen by clinician”) – justify why you need this data, and describe in detail (in data dictionary) how you are going to construct the data point (formulas, …)
• Explore the data (e.g. basic statistics, graphs…)
• Comment on data quality
• Make your choices on analytic methods (basic stats, process mining [Disco] etc.) and justify your choices (you are asked to use process mining – think about why it is useful – other methods are up to you).
• Formatting/re-formatting data – what changes need to be made for the methods you apply (statistical packages expect different input from Disco).
• Write an analysis plan
• Perform the analysis as you proposed it, taking into account any comments you may have received.
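Two of the tasks above, constructing a duration and re-formatting data for process mining, can be sketched as follows. Everything specific here is an assumption made for illustration: the column names, the timestamp format and the sample visits are invented, and the real names must come from the data dictionary. The event log layout (one row per event with a case identifier, an activity name and a timestamp) is the general shape process mining tools such as Disco work with; check the tool's import dialogue for the exact requirements.

```python
import csv
import io
from datetime import datetime
from statistics import mean, median

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format
# Hypothetical column names; the real ones come from the data dictionary.
STEPS = ["arrival", "triage", "seen_by_clinician", "departure"]

# Invented sample: one row per ED visit, one column per time point.
VISITS = [
    {"visit_id": "V1", "arrival": "2024-03-01 10:00",
     "triage": "2024-03-01 10:06",
     "seen_by_clinician": "2024-03-01 10:40",
     "departure": "2024-03-01 12:10"},
    {"visit_id": "V2", "arrival": "2024-03-01 10:15",
     "triage": "2024-03-01 10:25",
     "seen_by_clinician": "2024-03-01 11:15",
     "departure": "2024-03-01 13:00"},
]


def minutes_between(visit, start, end):
    """Constructed attribute: minutes elapsed between two time points."""
    t0 = datetime.strptime(visit[start], FMT)
    t1 = datetime.strptime(visit[end], FMT)
    return (t1 - t0).total_seconds() / 60


def to_event_log(visits):
    """Reshape one-row-per-visit data into one row per event (CSV text)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["case_id", "activity", "timestamp"])
    for visit in visits:
        for step in STEPS:
            if visit.get(step):  # skip time points that were not recorded
                writer.writerow([visit["visit_id"], step, visit[step]])
    return buf.getvalue()


# Basic statistics on the constructed duration "triage to seen by clinician".
waits = [minutes_between(v, "triage", "seen_by_clinician") for v in VISITS]
print("mean:", mean(waits), "median:", median(waits))
print(to_event_log(VISITS))
```

The same source rows therefore feed two differently shaped inputs: the wide table with constructed durations suits a statistical package, while the long event log suits process mining.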
In the final report you will take the results of your analysis, interpret what you have found (comment on what the results may mean in the context of what you learned about the ED; design visualisations of your results), and write conclusions and recommendations for action, a future analytic project, or both.
Your final document will contain ALL sections from all previous parts – you copy the material over, re-organise it, and – I suggest – improve it.
Your document is supposed to be aimed at the senior management of a hospital – adjust your style accordingly. While you have to do 2 assignments, these are stepping stones towards writing one document at the end – write the individual parts in a way that lets you recycle them in later stages.
Please do not write lengthy introductions (your audience is the senior management of a hospital!).
Use references only if you need them (no merit in “backfilling” references). Preferred format is Harvard, but you can use any other format if you use it consistently throughout the entire document.
Word count – use as many words as you need. There is no penalty for exceeding the word count.