A6 – Prediction. Build a model to predict flight delays.
Fine print:
(0) Group assignment, two students.
(1) Your program takes as input 36 historical files. Each file is a month of data. You can use these to build a model.
(2) Your program also takes a single test file. This represents all the future flights we want to predict.
(3) Output a file containing predictions in the format <FL_NUM><FL_DATE><CRS_DEP_TIME>, logical. The first column uniquely identifies a flight and the second is TRUE if the flight will be late.
(4) Report execution time and the confusion matrix for the provided data.
(5) The choice of predictive model is open; you will be graded on the accuracy of your method as well as execution time. One possible choice is to use the Random Forest algorithm. R and Java implementations exist.
(6) Input data is to be found in bucket s3://mrclassvitek, in folders a6history and a6test. The folder a6validate contains a file that has the correct answers for most flights; use it to compute the confusion matrix.
(7) As a measure of accuracy, use the sum of the percentage of on-time flights misclassified as delayed and the percentage of delayed flights misclassified as on-time. Lower is better.
(8) Some hints for using random forests:
(a) Split the data and build models for subsets of the entire data set.
(b) Recode the data so that each column has at most 50 distinct values. In R, they should be factors.
(c) Delete columns that are not usable for predictions.
(d) Synthesize features that you think make sense. For example, you could create a column labelled “Holidays” that is true when a flight is close to Christmas, New Year, or Thanksgiving.
(9) A flight is delayed if ARR_DELAY > 0.
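The accuracy measure in (7) and the “Holidays” feature in hint (8)(d) can be sketched as below. This is only an illustration in Python, not a required implementation. The function names, the 3-day window around each holiday, and the use of the fourth Thursday of November for Thanksgiving are assumptions made for the example; the assignment does not define what "close to" a holiday means.

```python
from datetime import date, timedelta


def thanksgiving(year):
    """US Thanksgiving, assumed here to be the fourth Thursday of November."""
    first = date(year, 11, 1)
    # weekday(): Monday is 0, so Thursday is 3.
    days_to_first_thursday = (3 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_thursday + 21)


def near_holiday(flight_date, window_days=3):
    """True if the flight is within window_days of Christmas, New Year,
    or Thanksgiving. The window size is an assumption, not part of the spec."""
    holidays = []
    # Check neighbouring years so late December also matches the next Jan 1.
    for year in (flight_date.year - 1, flight_date.year, flight_date.year + 1):
        holidays += [date(year, 1, 1), thanksgiving(year), date(year, 12, 25)]
    return any(abs((flight_date - h).days) <= window_days for h in holidays)


def confusion_matrix(actual, predicted):
    """Count outcomes, where True means 'delayed'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif p:
            counts["fp"] += 1  # on-time flight predicted as delayed
        elif a:
            counts["fn"] += 1  # delayed flight predicted as on-time
        else:
            counts["tn"] += 1
    return counts


def error_score(cm):
    """Sum of the two misclassification percentages from item (7)."""
    on_time = cm["fp"] + cm["tn"]
    delayed = cm["tp"] + cm["fn"]
    fp_pct = 100.0 * cm["fp"] / on_time if on_time else 0.0
    fn_pct = 100.0 * cm["fn"] / delayed if delayed else 0.0
    return fp_pct + fn_pct


# Small made-up example; these values are not from the assignment data.
actual = [True, True, False, False, False, True, False, False]
predicted = [True, False, False, True, False, True, False, False]
cm = confusion_matrix(actual, predicted)
score = error_score(cm)
```

Because each percentage is computed within its own class, a model that always predicts "on-time" still scores 100 under this measure, so a large share of on-time flights in the data does not hide poor detection of delays.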