No copying. Briefly explain why we use data mining, specifically on large data sets

Question

No copying. Briefly explain why we use data mining, specifically on large data sets (big data).

List and briefly explain the major data mining tasks, with examples.

Explain different approaches to handling missing attribute values during data cleaning.

Explain each of the following characteristics about the data warehouse mentioned in its definition:

“A data warehouse is a (1) subject-oriented, (2) integrated, (3) time-variant, and (4) nonvolatile collection of data in support of (5) management’s decision-making process”

Explanation / Answer

Data mining is used to extract useful information from very large sets of data. Big data involves volumes too large to inspect by hand, so it relies on the large-scale computation, analysis, and algorithms that data mining provides. Today, big data mining plays an important role in delivering accurate and relevant feedback, for example feedback discovered through social sensing networks, so that we can better understand society's needs as they arise. For example, if you mine a garment retailer's data for a particular item, the process checks that store's availability across the World Wide Web and then refines the search by location, color, size, design, and so on. If you are only looking for a particular item in one particular shop, there is no need for mining, because the owner can give you the details with a simple count. That is why data mining is applied specifically to big datasets.

The various data mining task primitives are -

How missing values should be handled during data cleaning depends on why they are missing. Two common missingness mechanisms are:

1. MCAR (Missing Completely at Random) - The missingness does not depend on the variable itself or on any other variable in the dataset.

2. MAR (Missing at Random) - An entry Xi is MAR if, after controlling for another observed variable, its missingness does not depend on the value of Xi itself.
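The difference between the two mechanisms above can be made concrete with a small simulation. This is only a sketch under stated assumptions: the `gender` and `income` columns, the missingness rates, and the helper names `make_mcar`, `make_mar`, and `missing_rate` are all hypothetical and chosen for illustration.

```python
import random

def make_mcar(rows, column, rate, rng):
    """Blank out `column` with the same probability for every row (MCAR)."""
    return [
        {**row, column: None} if rng.random() < rate else dict(row)
        for row in rows
    ]

def make_mar(rows, column, depends_on, rates, rng):
    """Blank out `column` with a probability set by another, observed column (MAR)."""
    out = []
    for row in rows:
        rate = rates[row[depends_on]]  # depends on the observed column only
        out.append({**row, column: None} if rng.random() < rate else dict(row))
    return out

def missing_rate(rows, column, where=None):
    """Fraction of rows (optionally filtered by `where`) in which `column` is missing."""
    subset = [r for r in rows if where is None or where(r)]
    return sum(r[column] is None for r in subset) / len(subset)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical data set: gender is always observed, income may go missing.
rows = [
    {"gender": rng.choice("MF"), "income": rng.randint(20, 100)}
    for _ in range(10_000)
]

mcar = make_mcar(rows, "income", 0.2, rng)
mar = make_mar(rows, "income", "gender", {"M": 0.05, "F": 0.35}, rng)

for name, data in (("MCAR", mcar), ("MAR", mar)):
    for g in "MF":
        rate = missing_rate(data, "income", lambda r, g=g: r["gender"] == g)
        print(f"{name} gender={g}: {rate:.1%} of income values missing")
```

In the MCAR data the missing rate should be roughly the same for both groups, while in the MAR data it differs by `gender` but, within each group, does not depend on the income value itself. This matters for cleaning: dropping incomplete rows generally leaves a representative sample under MCAR, whereas under MAR the observed variable that drives the missingness usually has to be taken into account.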

The characteristics of a data warehouse are:

1. Controlled data load.

2. Subject-oriented - A data warehouse is organized to analyze a particular subject area, for example sales.

3. Integrated - It combines data from multiple data sources under standardized coding conventions, for example M_Male and F_Female.

4. Time-variant - Historical data resides in the data warehouse. For example, you can retrieve data from 3 months ago, 12 months ago, or even earlier.

5. Nonvolatile - The historical data it keeps is never altered once loaded.

6.
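The standardized coding convention mentioned under "Integrated" can be sketched as follows. Only the target codes M_Male and F_Female come from the text above; the source encodings, the record layout, and the function names are assumptions made for illustration.

```python
# Hypothetical source encodings mapped to the warehouse convention.
# Only the targets M_Male / F_Female appear in the answer above.
GENDER_CODES = {
    "m": "M_Male",
    "male": "M_Male",
    "f": "F_Female",
    "female": "F_Female",
}

def standardize_gender(raw):
    """Map one source-specific gender value to the warehouse code."""
    key = str(raw).strip().lower()
    if key not in GENDER_CODES:
        raise ValueError(f"unknown gender code: {raw!r}")
    return GENDER_CODES[key]

def integrate(*sources):
    """Merge records from several sources, rewriting gender to one convention."""
    merged = []
    for source in sources:
        for record in source:
            merged.append({**record, "gender": standardize_gender(record["gender"])})
    return merged

# Two hypothetical operational systems that encode gender differently.
billing = [{"id": 1, "gender": "M"}, {"id": 2, "gender": "F"}]
web_shop = [{"id": 3, "gender": "female"}, {"id": 4, "gender": " Male "}]

for row in integrate(billing, web_shop):
    print(row)
```

Raising an error on an unknown code, rather than passing it through, is a design choice in this sketch: it keeps unstandardized values out of the merged data.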