
Data Mining for Business Applications: Using R programming in RStudio


Question

Data Mining for Business Applications: Using R programming in RStudio. Please help with parts a-c. Thank you! (A screenshot of the hint under part c is in the comments.)

Problem 3

In this problem, we will be working with the following data frame. Before attempting to answer the questions below, make sure to execute the code below to make the data frame available in your R session.

patient_data <- data.frame(gender = c("Male", "Male", "Female", "Female", "Male",
                                      "Female", "Female", "Male", "Female", "Male"),
                           age = c(20, 54, 68, 34, 45, 42, 43, 38, 50, 25),
                           cholesterol = c(110, 180, 173, 132, 140, 120, 134, 140, 190, 95),
                           stress_test = c("High", "Low", "Medium", "Low", "Very High",
                                           "Medium", "Low", "High", "Very High", "Medium"))

(a) (1 point) Create an indicator variable in the patient_data data frame called high_stress. This should be a numeric variable that is 1 if stress_test is either "High" or "Very High" and 0 otherwise. The resulting data frame should look like this when you are done:

> patient_data
   gender age cholesterol stress_test high_stress
1    Male  20         110        High           1
2    Male  54         180         Low           0
3  Female  68         173      Medium           0
4  Female  34         132         Low           0
5    Male  45         140   Very High           1
6  Female  42         120      Medium           0
7  Female  43         134         Low           0
8    Male  38         140        High           1
9  Female  50         190   Very High           1
10   Male  25          95      Medium           0

(b) (1 point) Create a categorical variable called cholest_cat from the cholesterol variable in patient_data that has the following values: "Less than 140", "140 - 160", and "161 or greater". The resulting data frame should look like this when you are done:

> patient_data
   gender age cholesterol stress_test high_stress    cholest_cat
1    Male  20         110        High           1  Less than 140
2    Male  54         180         Low           0 161 or greater
3  Female  68         173      Medium           0 161 or greater
4  Female  34         132         Low           0  Less than 140
5    Male  45         140   Very High           1      140 - 160
6  Female  42         120      Medium           0  Less than 140
7  Female  43         134         Low           0  Less than 140
8    Male  38         140        High           1      140 - 160
9  Female  50         190   Very High           1 161 or greater
10   Male  25          95      Medium           0  Less than 140

(c) (3 points) Write R code that uses map_df to replace the age and cholesterol variable values with their absolute mean-deviated values.
If we name our transformed vector w, then each element of w is calculated with the following formula, where x_min, x_max, and x_mean are the minimum, maximum, and average values in the numeric vector x:

w_i = |x_i - x_mean| / (x_max - x_min)

Explanation / Answer

(a) The given data is:

patient_data <- data.frame(gender=c("Male","Male","Female","Female",
                                    "Male","Female","Female","Male",
                                    "Female","Male"),
                           age= c(20,54,68,34,45,42,43,38,50,25),
                           cholesterol=c(110,180,173,132,140,120,134,140,190,95),
                           stress_test=c("High","Low","Medium","Low",
                                         "Very High","Medium","Low",
                                         "High","Very High","Medium"),
                           stringsAsFactors =FALSE
                           )

To access a particular column in an R data frame, use variable$column_name, where variable is the data frame and column_name is the column whose values we want to read or modify.

We can access the stress_test column of patient_data with patient_data$stress_test.

Search the stress_test column for the pattern "High", which covers both "High" and "Very High", using the grepl() function. grepl() returns TRUE for each element that matches and FALSE otherwise.

Use as.integer() to convert the TRUE or FALSE values returned by grepl() to 1 or 0,

i.e. as.integer(TRUE) = 1 and as.integer(FALSE) = 0.

So the final code to append the extra column high_stress is as follows:

patient_data$high_stress <- as.integer(grepl('High',patient_data$stress_test))

patient_data
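Because grepl('High', ...) matches any label that merely contains "High", an exact-match variant may be safer if other labels could ever appear in stress_test. The following is a minimal sketch, assuming the only labels to flag are exactly "High" and "Very High"; the data frame is rebuilt here with just the one column so that the snippet runs on its own.

```r
# Stand-alone copy of the stress_test column from the question
patient_data <- data.frame(
  stress_test = c("High", "Low", "Medium", "Low", "Very High",
                  "Medium", "Low", "High", "Very High", "Medium"),
  stringsAsFactors = FALSE
)

# %in% tests each value for exact membership in the set of flagged labels;
# as.numeric() turns TRUE/FALSE into 1/0
patient_data$high_stress <- as.numeric(
  patient_data$stress_test %in% c("High", "Very High")
)

patient_data
```

Both versions give the same column for this data; they would differ only if a label such as a hypothetical "Not High" were present.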

(b)

Here we need to create a categorical column with 3 values based on the value of the cholesterol column:

(i) Less than 140

(ii) 140 - 160

(iii) 161 or greater

Check the values of the cholesterol column one range at a time using comparison operators, and assign the matching label to a newly created column cholest_cat, as in the code below:

patient_data$cholest_cat[patient_data$cholesterol > 160] = "161 or greater"

# The line above selects the rows whose cholesterol value is greater than 160

patient_data$cholest_cat[patient_data$cholesterol >= 140 & patient_data$cholesterol <= 160] = "140 - 160"
# This line selects the rows whose cholesterol value is between 140 and 160, inclusive

patient_data$cholest_cat[patient_data$cholesterol < 140] = "Less than 140"

# Finally, the rows whose cholesterol value is below 140 are assigned the category "Less than 140"

patient_data

will display the data frame with newly added column cholest_cat
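The three assignments above can also be collapsed into a single call to base R's cut(), which bins a numeric vector and returns a factor. This is only a sketch of an alternative, not the method used in this answer; it assumes that a non-integer value such as 160.5 should fall in the middle category, which the labels themselves do not specify. The data frame is rebuilt with just the cholesterol column so the snippet is self-contained.

```r
# Stand-alone copy of the cholesterol column from the question
patient_data <- data.frame(
  cholesterol = c(110, 180, 173, 132, 140, 120, 134, 140, 190, 95)
)

# right = FALSE makes each interval closed on the left:
# [-Inf, 140), [140, 161), [161, Inf)
patient_data$cholest_cat <- cut(
  patient_data$cholesterol,
  breaks = c(-Inf, 140, 161, Inf),
  labels = c("Less than 140", "140 - 160", "161 or greater"),
  right  = FALSE
)

patient_data
```

The result is a factor with ordered levels rather than a character vector, which is often convenient for tables and plots of a categorical variable.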

(c)

Note: The method explained here does not use the map_df() function to find the mean deviation of age and cholesterol.

To find the mean deviation, we need the mean, the maximum, and the minimum of the column's values.

The functions to calculate the mean, maximum, and minimum values of a column are as follows:

(i) mean(variable$column_name)

(ii) max(variable$column_name)

(iii) min(variable$column_name)

where variable is patient_data and column_name is age or cholesterol, the columns whose values are to be replaced by their mean deviation values.

To find the absolute value, we use the abs() function.

The code to replace age with its mean deviation value is as follows:

patient_data$age<- abs((patient_data$age-mean(patient_data$age))/ (max(patient_data$age)-min(patient_data$age)))

Code to replace cholesterol with mean deviation value is as given below

patient_data$cholesterol<- abs((patient_data$cholesterol-mean(patient_data$cholesterol))/ (max(patient_data$cholesterol)-min(patient_data$cholesterol)))

To view the manipulated data use patient_data
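Since the question asks specifically for map_df, here is a minimal sketch of how the same transformation might be written with it. It assumes the purrr package is installed (map_df() also relies on dplyr being available), and abs_mean_dev is a hypothetical helper name introduced only for this illustration. The data frame is rebuilt with the two numeric columns so the snippet runs on its own.

```r
library(purrr)  # provides map_df(); dplyr must also be installed

# Stand-alone copy of the two numeric columns from the question
patient_data <- data.frame(
  age         = c(20, 54, 68, 34, 45, 42, 43, 38, 50, 25),
  cholesterol = c(110, 180, 173, 132, 140, 120, 134, 140, 190, 95)
)

# Hypothetical helper: absolute deviation from the mean, scaled by the range
abs_mean_dev <- function(x) {
  abs(x - mean(x)) / (max(x) - min(x))
}

# map_df() applies the helper to each selected column and returns a data frame
# with the same column names, which is assigned back over the originals
cols <- c("age", "cholesterol")
patient_data[cols] <- map_df(patient_data[cols], abs_mean_dev)

patient_data
```

Writing the formula once as a function avoids repeating it per column, so further numeric columns could be transformed by adding their names to cols.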