TO FIND THE NEXT OR DIFFERENT PROJECT CLICK ON THE SEARCH BUTTON ON THE TOP RIGHT MENU AND SEARCH USING THE TITLE CODE OR PROJECT TITLE.
$15
Analysis of UCI online news popularity dataset: exploratory data analysis, prediction,
evaluation and inference with generalized linear models.
The dataset and associated information can be found at
https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
We will focus on predicting the popularity of online articles published by Mashable over two
years from 7th Jan 2013, as measured through the number of times an article was shared
online.
In particular, we try to predict popularity based on explanatory variables (features) which
would be available before the publication of the article, as mentioned in the accompanying
paper by Fernandes et al. (2015) who collected the dataset. The articles have already been
heavily processed to produce a large set of numerical and categorical attributes, and these are
what you should work with. Many more types of attributes could be extracted from the
articles and related information, but we will not attempt such processing.
We will try to predict popularity in two ways. The first target is as it is recorded, as the
number of times the articles is shared over the period. The second target is a binary variable
which discretises the above. In Fernandes et al. (section 3.1), an article is considered popular
if it exceeds 1400 shares. Here we will reduce that threshold slightly and define an article as
popular if it is shared 1000 or more times. You should create this binary variable. You should
use multiple linear regression for the first task and binary logistic regression for the second.
No model or variable selection is to be performed as part of this assignment (except as
specifically mentioned in questions), despite the reasonable temptation. However,
transformation of variables is allowed and variables can be removed to help compliance with
assumptions or due to other problems with them.
The aim of the assignment is to develop a solid theoretical and practical understanding of the
models, rather than to achieve the best possible performance. Even within the constraints
above, a large number of models are possible, and the Fernandes et al. article indicates some
of the many other options. I would also like you to consider the aims of the study. These are
somewhat vague, but two aspects mentioned by Fernandes et al. (p536) are:
(i) “Predicting such popularity is valuable for authors, content providers, advertisers and even
activists/politicians (e.g., to understand or influence public opinion).”
Why is not stated, but it seems likely that predicted popularity or probability of being popular
could be considered when deciding whether or not to publish a new article, for example one
submitted by a freelance journalist.
(ii) “allowing (as performed in this work) to improve content prior to publication”.
This considers how to potentially improve the (predicted) appeal of an article by considering
the effect of various changes to the article, and is discussed in section 3.2 of Fernandes et al.
1. (i) Give relevant equations and assumptions to describe the general linear model and the
logistic regression model for multivariate data (assume X has dimension p and Y is a single
variable in each case). Define any notation used.
(ii) Briefly describe the algorithms used to fit each of these models and their mathematical
basis.
(iii) Derive the maximum likelihood estimate of σ
2
for multiple linear regression (= the
general linear model). List any assumptions. Note: consider the matrix form of the model and
consider using some of the identities found in the Matrix Cookbook, available for example at
https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf .
(iv) In lectures, interactions have been described between two continuous variables and
between two categorical variables. Explain the interaction between a continuous variable and
a binary categorical variable. Give an example, including equations and a plot. Also attempt
to find evidence of one such interaction in this dataset - report your evidence and conclusions.
2.
(i) Perform exploratory data analysis as relevant to the construction of the regression models.
Investigate and highlight any apparent structure in the data.
(ii) Search for at least one reasonable use of an interaction term between two explanatory
variables and include it in your model. Note that the variables can be of any type, but the
interaction must be in addition to the one considered in 1(iv) and you must justify why you
considered this interaction (relevant graphs or tables) and explain the effect of including it in
the model.
(iii) Carefully evaluate whether or not the assumptions of a multiple regression model hold.
Consider, provide evidence, act to remedy (if reasonable) and discuss: standard assumptions
of multiple linear regression, collinearity, any outliers or influential observations. Note: you
may have to consider the conditions many times as you adjust the model (including for
transformations, as discussed below).
(iv) Attempt to choose and use at least some transformations of some of the variables to
improve model fit and compliance with assumptions. Justify these transformations and
discuss the results, with further discussion of the extent to which the assumptions are met
after this.
3. Use a general linear model (multiple regression) to attempt to predict the number of times
an article will be shared based on the other variables which are available before publication.
(i) List and explain which of the variables you think would and would not be available before
publication.
(ii) Give a table including all the estimated model parameters (including error variance),
confidence intervals, test statistics and p-values.
(iii) Interpret the two most significant slope parameters.
4. Use a logistic regression model to attempt to predict whether or not an article will be
popular (defined here as 1000 shares).
(i) You should attempt to check all assumptions of the model and report on this.
(ii) By default, re-use any transformations or other processing of the X variables attempted
with multiple linear regression. Comment on whether any further transformations or rerepresentations of the data may be useful. Use them if you think they help.
(iii) Give a table including all the estimated model parameters, confidence intervals, test
statistics and p-values.
(iv) Interpret the two most significant slope parameters.
5. Evaluate the predictive performance of each of the above two models using two
appropriate metrics. Include details of each metric and its advantages and disadvantages.
Compare your results with logistic regression to the reported results of Fernandes et al. with
random forests.
6. For the aims of the study, as first outlined by Fernandez, and any related aims which you
might reasonably ascribe to Mashable, which of the two models (multiple linear regression or
logistic regression) do you think is more useful and why? This is mainly an argument about
whether or not it is worthwhile to discretise the shares variable.
7. The model used by Fernandes et al. was a random forest. You may not know this model,
but it is essentially a large ensemble of decision tree models, with generally strong predictive
capabilities, but weak interpretability. It is hoped that the regression models you used will be
more interpretable.
(i) Explain how one can determine how to improve the predicted popularity of an article with
either of the regression models considered here.
(ii) Using your two fitted regression models, identify the article with the highest predicted
popularity and predicted probability of being popular.
(iii) For each of your two fitted regression models, list the attributes of two hypothetical
articles (fake news?) which would give the highest possible predicted popularity and
predicted probability of being popular, respectively. You should give values for every
attribute, but for each variable, keep them within the range seen in the dataset.
(iv) Based on your analysis of dependence between variables in the dataset, comment on
whether or not these hypothetical articles could be produced.
Notes:
Where possible, give reasons for your answers. Avoid thinking “you know what I mean”. Say
what you mean and don’t assume much.
You will need to submit two files:
1) A report (pdf format) answering the questions given here.
2) Code in a zip file.
Name your files something like Yourfirstname_Yoursurname_DATA7202_assn1.pdf to help
avoid confusion in marking.
You should de-emphasise any R commands and raw output in the report. R commands should
go in the code file. You should not be including all R output. Any output should be edited or
re-formatted to reasonably present the most important information. Use figures and tables
with numbers and captions for each and refer to them from the text. Try to make sure that any
figures you produce can be fairly easily understood.
As per http://www.uq.edu.au/myadvisor/academic-integrity-and-plagiarism, what you submit
should be your own work. Even where working from sources, you should endeavour to write
in your own words. Equations are either correct or not, but you should use consistent notation
throughout your assignment, define all of it and ensure that your report flows logically.
You are asked to use the R software environment for this assignment. This is available on all
computers in the Maths Department and is also free to install on any of your own computers.
Information and downloads are available from http://www.r-project.org/ . Rstudio
https://www.rstudio.com/ is a quality free interface for R.
Submit your pdf report via TurnItIn on Blackboard. Please make sure that the overall file size
is less than 10 MB. It is not necessary to include every data point on a typical plot. Using just
a sample of the data can be ok for most illustrative plots. The main exception is if you want to
show outliers along with all the data.
References:
A.J. Dobson, and A.G. Barnett, An Introduction to Generalized Linear Models, 4th edition,
CRC Press, 2018.
J. Faraway, Linear Models with R, 2nd edition, Chapman & Hall/CRC, 2014.
J. Faraway, Extending the Linear Model with R, Chapman & Hall/CRC, 2016.
K. Fernandes, P. Vinagre, and P. Cortez, A Proactive Intelligent Decision Support System for
Predicting the Popularity of Online News, in: Pereira F., Machado P., Costa E., Cardoso A.
(eds.) Progress in Artificial Intelligence, EPIA 2015, Lecture Notes in Computer Science,
vol. 9273, Springer.
J. Maindonald and J. Braun, Data Analysis and Graphics Using R - An Example-Based
Approach, 3rd edition, Cambridge University Press, 2010.
W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, 4th edition, Springer,
2002.
S. Weisberg, Applied Linear Regression, 3rd edition, Wiley, 2005.
H. Wickham, and G. Grolemund, R for Data Science, O'Reilly, 2017.