Question
Data Preprocessing is an important area in order to have quality input data. The four methods discussed are data cleaning, data integration, data transformation and data reduction.
a. Suppose we were to develop software to implement each of these techniques. Discuss how easy or difficult it would be to develop software for each technique. Give this some thought and write a well-thought-out answer. Imagine that you had to write such software: how easy or hard would it be? (7 points)
b. After the software is developed, we are going to run the programs. Explain, for each of the programs implementing the four techniques, how much CPU processing time is required. I am not looking for an actual number; rather, I am looking for your analysis of how much CPU processing is needed for each technique. (7 points)
Please answer the question by writing about each of the data preprocessing methods.
Explanation / Answer
Let's start with data cleaning.
Data cleaning: If we are writing software for data cleaning, we must begin with requirements gathering. In data preprocessing, data cleaning detects corrupt or faulty records and removes them. The software must enforce consistency, accuracy, completeness, and uniformity in the data, and producing consistent, accurate results is the biggest development challenge. As for CPU time, modern CPUs are fast and data cleaning is not especially compute-intensive: a typical cleaning pass scans each record once, so the cost grows roughly in proportion to the size of the data set. Accuracy and consistency, not processing time, remain the main concerns.
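A minimal sketch of the single-pass cleaning idea described above. The field names (`name`, `age`) and the plausibility rule on `age` are hypothetical examples, not part of the question:

```python
def clean(records):
    """Keep only records whose fields are complete and plausible."""
    cleaned = []
    for rec in records:
        # completeness check: required fields present and non-empty
        if rec.get("name") in (None, "") or rec.get("age") is None:
            continue
        # accuracy check: reject obviously corrupt values
        if not (0 <= rec["age"] <= 130):
            continue
        cleaned.append(rec)
    return cleaned

raw = [
    {"name": "Ada", "age": 36},
    {"name": "", "age": 22},      # missing name -> dropped
    {"name": "Bob", "age": -5},   # corrupt age  -> dropped
]
print(clean(raw))  # only the Ada record survives
```

Note the single loop over the input: the CPU cost is one pass per record, which is why cleaning is cheap relative to the effort of deciding *which* rules make the result accurate and consistent.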
Data Integration: Data integration combines data from different sources into a single meaningful data set. The process becomes critical when, for example, two companies want to merge their databases. While performing data integration we must ensure the integrity, accuracy, and consistency of the data, and no data should be lost while the databases are merged. When writing software for data integration, the developer's biggest concern is consistency: the same entity may appear in both sources with conflicting or duplicate values that must be reconciled. As far as CPU processing is concerned, this technique will generally take more time than data cleaning, because records from each source must be matched against one another rather than simply scanned once.
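A sketch of that matching-and-reconciling step, assuming both hypothetical sources share an `id` key. The conflict rule here (first source wins, new fields are added) is one illustrative choice among many:

```python
def integrate(source_a, source_b):
    """Merge two lists of records keyed on 'id'."""
    merged = {}
    for rec in source_a:
        merged[rec["id"]] = dict(rec)
    for rec in source_b:
        if rec["id"] in merged:
            # conflict resolution: keep existing values,
            # only fill in fields the first source lacked
            for key, value in rec.items():
                merged[rec["id"]].setdefault(key, value)
        else:
            merged[rec["id"]] = dict(rec)
    return list(merged.values())

customers_a = [{"id": 1, "name": "Ada"}]
customers_b = [{"id": 1, "email": "ada@example.com"},
               {"id": 2, "name": "Bob"}]
print(integrate(customers_a, customers_b))
```

Even this toy version does a lookup for every record in the second source, which hints at why integration costs more CPU than a plain cleaning scan once the key matching involves large tables.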
Data Transformation: In data transformation, data is converted from one format or representation to another, for example normalizing numeric values or standardizing date formats. Here, too, accuracy is the biggest challenge while writing the software. In terms of CPU processing, this software will be fast, because each value is only converted, typically independently, in a single pass over the data.
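One common transformation is min-max normalization, which rescales numeric values into the range [0, 1]. This is offered as a sketch of the "convert each value in one pass" point, not as the only kind of transformation:

```python
def min_max(values):
    """Rescale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all values identical: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

Each element is transformed independently of the others (after one cheap pass to find the minimum and maximum), which is exactly why transformation tends to be CPU-light.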
Data Reduction: Data reduction is also a kind of data transformation, but here a large, unstructured, scattered data set is reduced to a more compact form that is easier to read. For example, a large numerical data set summarized as a graphical representation is easier to grasp: readability increases, even though some detail is necessarily lost in the aggregation. When writing this software, the developer can lean on existing libraries, for example Google's charting APIs or other REST APIs, which makes the software easier to write. As far as CPU processing is concerned, reduction is usually smooth: summarizing the data takes roughly one pass, and every step after that operates on the much smaller reduced form.
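A simple form of numerical reduction is histogram binning: collapsing a long series of values into a small table of bin counts that a chart could then display. The bin width of 10 below is an arbitrary illustrative choice:

```python
from collections import Counter

def histogram(values, bin_width):
    """Reduce a numeric series to {bin_start: count}."""
    bins = Counter()
    for v in values:
        # map each value to the start of its bin
        bins[(v // bin_width) * bin_width] += 1
    return dict(bins)

print(histogram([1, 2, 3, 11, 12, 25], 10))
# six raw values reduced to three bin counts
```

The six input values shrink to three (bin, count) pairs; the shape of the data is preserved while the individual values are discarded, which is the trade-off data reduction makes.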