
Question

1. Please answer the following conceptual questions

a) Compare the following terms: supervised learning vs. unsupervised learning; classification vs. class probability estimation; clustering vs. association rule mining; data science vs. data mining.

b) Define Data-Analytic thinking.

c) Define CRISP data mining process.

d) What is the definition of predictive model?

e) What is the definition of entropy and information gain?

2. Come up with one business problem (real or hypothetical) for each of the following types of data science solutions: classification, class probability estimation, clustering and association rule mining.

3. As a concrete example, consider a set S of 14 people, eight of the non-write-off class and six of the write-off class. Based on Table 1 below, answer the following questions.

Table 1. “Write-off example for supervised segmentation”

Name      Balance    Age  Employed  Write-off
Mike      $40,000    35   Yes       No
John      $200,000   32   No        Yes
Matt      $60,000    53   No        No
Mark      $8,000     23   Yes       Yes
Mary      $100,000   43   No        No
Andy      $25,000    34   Yes       Yes
Dora      $39,000    18   Yes       No
Robert    $65,000    31   Yes       No
Bob       $8,200     27   Yes       Yes
Captain   $19,000    32   Yes       No
Michael   $72,000    43   Yes       Yes
Howard    $52,000    33   No        Yes
King      $105,000   36   No        No
Peter     $89,000    38   No        No

a) We are trying to predict whether a person is a loan write-off. What is the target variable in the dataset above, and which attributes can be used to predict it?

b) What is the entropy of the dataset with respect to the target variable identified in question a)?

c) How much information gain is obtained by introducing the Employed attribute for segmentation with respect to the target variable from question a)? Is the attribute informative for segmenting on the target variable?

d) If we categorize the Balance attribute into three bins using the cut points $10,000 and $50,000, how much information gain is obtained by introducing the Balance attribute for segmentation with respect to the target variable from question a)? Is the attribute informative for segmenting on the target variable?

e) Use a tree-induction model to visualize the segmentation result from question d).

Explanation / Answer

(a) Supervised learning and unsupervised learning: Supervised learning builds a model from examples labeled with a known target variable, so the model can predict that target for new instances. Unsupervised learning works on unlabeled data, finding structure such as groups or patterns without any predefined target.

Classification and class probability estimation: Classification assigns each instance to exactly one of a small set of discrete classes. Class probability estimation instead outputs, for each instance, a score representing the probability that it belongs to each class.

Clustering and association rule mining: Clustering groups instances that are similar to one another into segments, without predefined labels. Association rule mining finds co-occurrence relationships among attributes or items, such as products that are frequently purchased together.

Data science and data mining: Data science is the broad set of principles, processes, and techniques for extracting knowledge from data to support decision making. Data mining is the narrower, algorithmic activity of extracting patterns from data, and is a core component of data science.

(b) Data-analytic thinking: In recent years, organizations have invested heavily in collecting as much data as possible, for purposes ranging from company profitability to societal development, and nearly every business function and role now touches data. With so much information available, the challenge is to identify the trends and movements hidden in it. Data-analytic thinking is the ability to view problems from a data perspective and to reason systematically about how conclusions can be drawn from large sets of data.

(c) CRISP data mining process: The Cross-Industry Standard Process for Data Mining, usually written CRISP-DM, is a comprehensive, well-defined methodology that serves as a blueprint for data mining projects, so that anyone with data-analytic expertise can follow the same process. It is divided into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

(d) Predictive model: A predictive model is a formula or process that combines data mining and probability to estimate an outcome of interest. A number of predictor variables believed to influence future results are fed into the model; once the data have been collected and gathered in a systematic manner, statistical methods are used to draw conclusions about the target.

(e) Entropy and information gain: Entropy characterizes the purity or impurity of a set of examples with respect to the target variable; for class proportions p_i it is H(S) = -sum_i p_i log2(p_i). Information gain is the reduction in entropy achieved by partitioning the examples on an attribute. In other words, information gain controls how a decision tree chooses its splits, reducing uncertainty as the tree draws its decision boundaries.
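As a rough check on questions 3(b)-(d), these definitions can be applied directly to the Table 1 data. The sketch below is my own (function and variable names are not from the question); values are rounded to three decimals:

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class-count distribution, e.g. [yes, no]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Table 1 rows: (balance, employed, write_off)
data = [
    (40000, "Yes", "No"),  (200000, "No", "Yes"), (60000, "No", "No"),
    (8000, "Yes", "Yes"),  (100000, "No", "No"),  (25000, "Yes", "Yes"),
    (39000, "Yes", "No"),  (65000, "Yes", "No"),  (8200, "Yes", "Yes"),
    (19000, "Yes", "No"),  (72000, "Yes", "Yes"), (52000, "No", "Yes"),
    (105000, "No", "No"),  (89000, "No", "No"),
]

def class_counts(rows):
    return [sum(1 for r in rows if r[2] == "Yes"),
            sum(1 for r in rows if r[2] == "No")]

def info_gain(groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = sum(len(g) for g in groups)
    children = sum(len(g) / n * entropy(class_counts(g)) for g in groups)
    return entropy(class_counts(data)) - children

# (b) entropy of the whole set: 6 write-offs vs. 8 non-write-offs
H = entropy([6, 8])                                    # ~0.985

# (c) split on Employed (Yes / No)
emp = [[r for r in data if r[1] == v] for v in ("Yes", "No")]
ig_emp = info_gain(emp)                                # ~0.020, barely informative

# (d) split Balance at the $10,000 and $50,000 cut points
bins = [[r for r in data if r[0] < 10000],
        [r for r in data if 10000 <= r[0] < 50000],
        [r for r in data if r[0] >= 50000]]
ig_bal = info_gain(bins)                               # ~0.208, clearly more informative
```

The Balance split yields roughly ten times the gain of the Employed split, which is why the tree in question 3(e) would branch on the binned Balance attribute.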