**Data structures: Data Mining**

*Course Section: 001 or 002*

A total of 1 question with 8 parts for 50 points.

1- This question involves concepts and practices from the Naïve Bayes and CART classifiers. Download and open the “Assignment4-Code.R” script and the “accidents1.csv” data file from Canvas. The data set contains 595 records of car accidents. Descriptions of the variables are shown in the following table:

| Variable | Description |
| --- | --- |
| RUSH_HR | Accident occurred during rush hour or not |
| WRK_ZONE | Accident occurred in a work zone or not |
| WKDY | Accident occurred on a weekday or on a weekend |
| INT_HWY | Accident occurred on an interstate highway or not |
| SPD_LIM | Speed limit in the accident zone |
| SUR_COND_dry | Road surface was dry or wet when the accident occurred |
| TRAF_two_way | Accident occurred on a two-way or a one-way traffic road |
| WEATHER_adverse | Weather was normal or adverse when the accident occurred |
| MAX_SEV | Maximum severity level: No-Injury or Non-Fatal |

In this data set, MAX_SEV is the target variable, with two levels. Open the R code in RStudio, read the comments, and follow the instructions to answer the following questions.

What is the probability of an accident with No-Injury?

What is the probability of having Wet-Road as the road surface condition?

What is the probability of an accident with No-Injury and Wet-Road condition?

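What is the conditional probability of No-Injury given that the roads are wet?

$$P(\text{No-Injury} \mid \text{Wet-Road}) = \frac{P(\text{No-Injury} \cap \text{Wet-Road})}{P(\text{Wet-Road})}$$

First, calculate $P(\text{Wet-Road})$ in part b and $P(\text{No-Injury} \cap \text{Wet-Road})$ in part c, and then plug the values into the above formula to calculate the conditional probability.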
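As a hint for the four probability questions, the sketch below estimates each quantity as a relative frequency on a small made-up data frame. The toy values, the level names, and the coding of `SUR_COND_dry` (1 = dry, 0 = wet) are assumptions for illustration only; check how the real file codes these columns before reusing the idea.

```r
# Toy stand-in for accidents1.csv; values and the 0/1 coding are assumptions.
toy <- data.frame(
  SUR_COND_dry = c(1, 1, 0, 0, 1, 0, 1, 0, 1, 1),  # assumed: 1 = dry, 0 = wet
  MAX_SEV = c("no-injury", "non-fatal", "no-injury", "non-fatal", "no-injury",
              "non-fatal", "no-injury", "no-injury", "non-fatal", "no-injury")
)

no_injury <- toy$MAX_SEV == "no-injury"
wet_road  <- toy$SUR_COND_dry == 0

p_no_injury <- mean(no_injury)             # P(No-Injury)
p_wet       <- mean(wet_road)              # P(Wet-Road)
p_joint     <- mean(no_injury & wet_road)  # P(No-Injury and Wet-Road)
p_cond      <- p_joint / p_wet             # P(No-Injury | Wet-Road)

# Cross-check: share of No-Injury accidents among the wet-road rows only
p_cond_direct <- mean(no_injury[wet_road])

c(p_no_injury = p_no_injury, p_wet = p_wet, p_joint = p_joint, p_cond = p_cond)
```

Dividing the joint probability by the probability of the condition gives the same number as restricting the data to wet-road rows and taking the share of No-Injury accidents, which is a quick way to check the result.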

Split the data into training and validation sets. Train a decision tree with full complexity (CP=0). How many leaves are in this tree?
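A minimal sketch of this step, assuming the `rpart` package is used for the CART model (the assignment script may do this differently). The data frame is synthetic, and the seed, the 60/40 split ratio, and the subset of predictors are all assumptions, so the leaf count printed here says nothing about the real data.

```r
library(rpart)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 595
# Synthetic stand-in for accidents1.csv (made-up values, three of the predictors)
acc <- data.frame(
  RUSH_HR = rbinom(n, 1, 0.5),
  SUR_COND_dry = rbinom(n, 1, 0.7),
  SPD_LIM = sample(seq(25, 70, by = 5), n, replace = TRUE)
)
p <- plogis(-0.5 + 0.8 * acc$SUR_COND_dry - 0.02 * (acc$SPD_LIM - 45))
acc$MAX_SEV <- factor(ifelse(runif(n) < p, "no-injury", "non-fatal"))

# 60/40 training/validation split; the ratio is an assumption
train_idx <- sample(n, size = round(0.6 * n))
train <- acc[train_idx, ]
valid <- acc[-train_idx, ]

# cp = 0 switches off the complexity penalty, so the tree grows as far as
# the other rpart defaults (e.g. minsplit) allow
full_tree <- rpart(MAX_SEV ~ ., data = train, method = "class",
                   control = rpart.control(cp = 0))

# Terminal nodes are marked "<leaf>" in the fitted tree's frame
n_leaves <- sum(full_tree$frame$var == "<leaf>")
n_leaves
```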

What is the accuracy of the model with full complexity on the training set? What is the accuracy on the validation set? Compare the accuracy values for the training set and validation set. Is there a sign of over-fitting?
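Accuracy is the share of records whose predicted class equals the actual class, which can be read off the diagonal of a confusion matrix. The sketch below shows the calculation on made-up label vectors; the labels and values are hypothetical and only illustrate the arithmetic.

```r
# Hypothetical actual and predicted labels (made-up values)
lv <- c("no-injury", "non-fatal")
actual <- factor(c("no-injury", "no-injury", "non-fatal",
                   "non-fatal", "no-injury", "non-fatal"), levels = lv)
predicted <- factor(c("no-injury", "non-fatal", "non-fatal",
                      "no-injury", "no-injury", "non-fatal"), levels = lv)

# Rows are actual classes, columns are predicted classes
conf_mat <- table(Actual = actual, Predicted = predicted)

# Correct predictions sit on the diagonal
acc_value <- sum(diag(conf_mat)) / sum(conf_mat)

# Same result without building the table
acc_direct <- mean(actual == predicted)

conf_mat
acc_value
```

With a fitted `rpart` classification tree, the predicted labels would come from `predict(fit, newdata = some_data, type = "class")`, where `fit` and `some_data` are placeholder names. A training accuracy that is clearly higher than the validation accuracy is the usual sign of over-fitting.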

From the full tree developed in the previous part, what is the CP value that corresponds to the minimum cross-validation error (xerror)?

Train a new decision tree with the optimum CP value that you found in the previous step. What is the accuracy of this new model on the training set? What is the accuracy on the validation set? Compare the accuracy values for the training set and validation set. Is there a sign of over-fitting?
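The CP selection and retraining steps can be sketched as follows, again assuming `rpart` and a synthetic data frame in place of the real file. The seed, split ratio, predictors, and the helper name `accuracy` are assumptions; the numbers it prints are not the answers for accidents1.csv.

```r
library(rpart)

set.seed(42)  # arbitrary; fixes the synthetic data and the cross-validation folds
n <- 595
# Synthetic stand-in for accidents1.csv (made-up values, three of the predictors)
acc <- data.frame(
  WRK_ZONE = rbinom(n, 1, 0.1),
  WEATHER_adverse = rbinom(n, 1, 0.3),
  SPD_LIM = sample(seq(25, 70, by = 5), n, replace = TRUE)
)
p <- plogis(0.3 - 1.0 * acc$WEATHER_adverse + 0.03 * (acc$SPD_LIM - 45))
acc$MAX_SEV <- factor(ifelse(runif(n) < p, "no-injury", "non-fatal"))

train_idx <- sample(n, size = round(0.6 * n))  # 60/40 split is an assumption
train <- acc[train_idx, ]
valid <- acc[-train_idx, ]

# Hypothetical helper: fraction of rows whose predicted class matches MAX_SEV
accuracy <- function(fit, data) {
  mean(predict(fit, newdata = data, type = "class") == data$MAX_SEV)
}

full_tree <- rpart(MAX_SEV ~ ., data = train, method = "class",
                   control = rpart.control(cp = 0))

# Pick the CP value in the row with the smallest cross-validated error
cp_table <- full_tree$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Retrain with the selected CP value
new_tree <- rpart(MAX_SEV ~ ., data = train, method = "class",
                  control = rpart.control(cp = best_cp))

results <- data.frame(
  model = c("full (cp = 0)", "selected cp"),
  train = c(accuracy(full_tree, train), accuracy(new_tree, train)),
  valid = c(accuracy(full_tree, valid), accuracy(new_tree, valid))
)
print(results)
```

Because `xerror` comes from random cross-validation folds, the selected CP can change with the seed. Calling `printcp(full_tree)` prints the same table for inspection, and `prune(full_tree, cp = best_cp)` is an alternative to refitting.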