DATA7202 Assignment 2 solved

Starting from:

~~$25~~

$17.50

Part I Re-analysis of the UCI online news popularity dataset
Reconsider the dataset from assignment 1.
Here you will fit a new model to the data to predict number of shares for a new article using
the same explanatory variables and data as you used in assignment 1. However, instead you
will use a generalised linear model and a non-negative distribution and a suitable link
function. Note: due to a question asking about a predictive interval, we recommend you do
not attempt to use the quasi-Poisson model.
1. Choose a suitable generalised linear model and suitable link function and explain these
decisions. Fit the model to the data and report summary results. [2 marks]
2. Produce a 95% predictive interval for each fitted model (ignore uncertainty with respect to
model parameters). Compare the results from using this model with those from using a
multiple linear regression model (= general linear model) which you would have run for
assignment 1. Include consideration of significant variables, measures of goodness of fit,
predictive intervals and related plots. Discuss theoretical and practical differences between
the two models. [4 marks]
Part II
Translink is a division of the Queensland government’s Department of Transport and Main
Roads. They operate the go card system which allows card-based access and payment for
South East Queensland public transport systems, including buses, trains, ferries and trams.
Some of the data collected via go cards has been analysed by UQ’s Dr. Jiwon Kim of the
School of Civil Engineering (see https://www.jiwonkim.co/ ). Translink is interested in
monitoring and predicting passenger numbers and demand to help with planning changes to
routes and their frequency, and how to best respond to unusual incidents and events.
A subset of the data is available to you via Blackboard covering passenger flow data from 25
Feb to 24 Mar 2013. Your main task is to construct predictive models based on this data for a
single high-traffic pair of regions, namely from region 1 to region 5 (Brisbane City to South
Brisbane).
You will create two types of model. The first model should be based entirely on reasonable
summaries (e.g. averages) from past data, without a stochastic model (see for example
https://otexts.com/fpp2/simple-methods.html). The second should involve the fitting of a
stochastic time series model (or a comparable model, if given permission by the lecturer).
The modelling process, choice of data to use and how to represent it, is up to you.
Please only consider the v0_num_traj column (not the v10 – v60 columns) – assume this is
the number of passenger trajectories (trips using all forms of public transport) occurring
during the listed time period (hour) for the listed route. The time_id shows the hour since
midnight, making the start time and end time redundant, since these are just in minutes since
midnight.
The detailed tasks are given below. Assessment will consider clarity of exposition,
presentation and statistical reasonableness. Where possible, give reasons for your decisions
and some discussion of results.
3. Check for stationarity and seasonality of the time-series data. Explain how you have done
this and include relevant graphs and numerical summaries. [4 marks].
Note: you have a number of tricky issues to deal with, including the presence of weekends,
and the fact that the range of hours in which people take trips differs from day to day. With
respect to the latter, the main interest of Translink is in the busier periods of the day. So, if
you wish, you can use only the time periods which are present for every day in the data.
This is linked to the ideas of structural and measured zeroes. In this case, a structural zero is a
zero count when it was not possible for any counts to be recorded, e.g. because there were no
services running. A measured zero could occur when services were available, but no one used
them. In this dataset, zero counts are entirely missing from the dataset. For such missing
times, we cannot know whether they were structural or measured zeroes, although
presumably Translink staff could find out.
4. Choose and detail each type of model used (including mathematical form and explanation
of notation) and some details of how it was fitted. Explain why each model may be suitable
for this type of data. [4 marks]
5. Report full details of each fitted model. [2 marks]
6. Discuss the limitations of each model with respect to this dataset. [3 marks]
7. Give and plot model predictions for each model over the observed data range (include the
observed values somehow for comparison). Also give 95% predictive intervals for the
stochastic model over this range. [3 marks]
8. Evaluate accuracy 1 and 2 hours ahead via final day. [3 marks]
As part of evaluation of the models, evaluate the accuracy of each model’s predictions ahead
in time 1 and 2 hours. Utilise only the last day of data for testing. However, you should use
all the available data available for prediction, so if predicting ahead 1 hr to 3pm, you should
train your model on data up to 2pm. Similarly for a 1 hr prediction to 4pm, you should
training your model on data up to 3pm. So you will have to produce a number of models for
the purpose of testing. It is possible you will find a way to re-use models without full
retraining, but either approach is ok. Suggested accuracy measure: mean squared error. Give
details of how you calculate this.
9. Make predictions for test day, which will be evaluated by RMSE vs true counts for that
day (held by lecturer). [3 marks]
You will need a version of at least one of your models to predict into the future. We provide
data below on counts of trips for the first few hours of travel on Monday 25/3/2013. You
should make predictions for the remainder of the day. These predictions will be compared
against the truth in marking – some marks depend upon the accuracy of these predictions.
The following data is from Monday 25/3/2013 from 5am to 9am:
Date time_id start_time end_time region_from region_to v0_num_traj
25/03/2013 6 300 360 1 5 2
25/03/2013 7 360 420 1 5 18
25/03/2013 8 420 480 1 5 34
25/03/2013 9 480 540 1 5 68
Make predictions of the passenger counts over this route for each hour for the remainder of
this day’s services (until midnight) using any of your models. You can use different models
for different time periods if you wish. Some of the marks for this question will be based on
the mean squared error of your predictions. Please list your predictions in a column for time
id 10-24 (15 rows, 1 predicted count per line, no other characters) in your report. We will
copy them out and evaluate them.
10. Paragraph to Translink staff [2 marks]
Include a paragraph aimed at a member of the Translink planning staff who may not have a
statistics background, explaining how your modelling could potentially help them make
decisions about how many bus/train/ferry services to run at various times to meet demand.
Notes:
Please store all the R commands you use and submit these via a separate file (This can be
within R markdown or a Jupyter notebook if desired – just make sure we can read the R
code). Please include your name in the filename for all files submitted. You should not
generally give R commands in your main report and should not include any raw output – i.e.
just include figures from R (each with a title, axis labels and caption below) and put any
relevant numerical output in a table or within the text.
As per http://www.uq.edu.au/myadvisor/academic-integrity-and-plagiarism, what you submit
should be your own work. Even where working from sources, you should endeavour to write
in your own words. Equations are either correct or not, but you should use consistent notation
throughout your assignment, define all of it and ensure that your report flows logically.
You are asked to use the R software environment for this assignment. This is free to install on
any of your own computers. Information and downloads are available from http://www.rproject.org/ . Rstudio https://www.rstudio.com/ is a quality free interface for R.
Submit your assignment report as a pdf file via TurnItIn on Blackboard. Any R programs or
scripts that you write to answer the assignment should be placed in a separate .zip file and
uploaded as a second file (see Blackboard).
References:
A. Agresti, Categorical Data Analysis, 2
nd edition, Wiley, 2002. (chapter 4 provides a
readable introduction to GLMs)
A. J. Dobson, and A. Barnett, An Introduction to Generalized Linear Models, 4th edition,
CRC Press, 2018. (Main course reference for GLMs)
J. J. Faraway, Extending the Linear Model with R, 2nd ed., CRC Press, 2016. (GLMs with R
examples and discussion)
K. Fernandes, P. Vinagre, and P. Cortez, A Proactive Intelligent Decision Support System for
Predicting the Popularity of Online News, in: Pereira F., Machado P., Costa E., Cardoso A.
(eds.) Progress in Artificial Intelligence, EPIA 2015, Lecture Notes in Computer Science,
vol. 9273, Springer, 2015. (the paper which first analysed the data used in Assignment 1)
R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice 2
nd ed., OTexts,
Melbourne, 2018. https://otexts.com/fpp2/ (Accessible text on time series analysis, esp:
forecasting)
J. Maindonald and J. Braun, Data Analysis and Graphics Using R - An Example-Based
Approach, 3rd edition, Cambridge University Press, 2010. (Intro to R with wide range of
examples)
R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications with R
Examples, Fourth Edition, Springer, 2017. (Detailed book on time series analysis, with R)
W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, Fourth Edition, Springer,
2002. (Classic book introducing S (the language implemented in R) and applications)
H. Wickham and G. Grolemund, R for Data Science, O'Reilly, 2017. http://r4ds.had.co.nz
(mainly about Hadley Wickham’s R packages, but also a good general introduction to using
R for data science)