1. [5 points] State the formula for computing the sample variance. What type of
Question
1.
[5 points] State the formula for computing the sample variance. What type of measure
(distributive, algebraic, or holistic) is the sample variance? What about the mode? Explain your
answer.
Before answering this question, please read the paper [Gray 1995] (Section 5 only) indicated in
2.
[15 points] Compute the similarity of the following two documents using the City Block, Euclidean,
Minkowski (with r = 3), and Cosine functions. Which of these measure distance and which measure
similarity between two objects? State the minimum and maximum values of each function on any two objects.
              Stock  Exchange  Market  Job  Computer  Science
Document #1     5       4        2      1      1         0
Document #2     0       0        2      2      7         3
For each of the above functions, can you suggest a way to normalize the scores to [0, 1] if the
original score given by the function does not fall into the range?
3.
[5 points] Write a SQL query for loading data from Dealer Two to the warehouse. The schemas
of Dealer Two and warehouse were described in Lecture (ETL, Data Cleaning & Data Reduction,
slide #8).
4.
[5 points] Give at least three examples each of schema-level and instance-level data
quality problems. Indicate whether each is a single-source or multi-source problem.
5.
[5 points] Explain the detect-code-apply process of data cleaning. Give at least two scenarios
where the data cleaning process needs to be iterated.
6.
[10 points] If we switch the number for the two cases for playing chess, that is, 50 people like
fiction & play chess, while 250 play chess but do not like playing chess, are the two
variables/attributes still correlated (at the same significance level)? Show your work.
7.
[5 points] How many possible ways are there to sample 3 records out of a table of 10 records under each
of the four sampling strategies discussed in the lecture (ordered with replacement, ordered
without replacement, etc.)?
8.[5 points] Describe a scenario where using clustered sampling is more appropriate than stratified sampling. Explain your answer.
9.[20 points] Consider the following measurements on weight (pound) and height (foot) of a group of people.
Draw a box plot for the weight and height data points using R (note that you should be familiar with R at this point; as announced in class, R is available at http://www.r-project.org/).
Draw a scatter plot using R to observe the relationship between weight and height. Are they correlated based on your visual inspection of the plot?
Compute the Pearson
Explanation / Answer
1)
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, where \bar{x} is the sample mean and n is the number of observations.
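As a rough check of the formula, here is a small Python sketch that computes the sample variance directly and compares it with the standard library's `statistics.variance`. The helper names (`sample_variance`, `partial`, `variance_from_partials`) are made up for this illustration, and the data is just the Document #1 row from question 2 reused as sample numbers. The second half shows that the variance can be rebuilt from three per-partition summaries (count, sum, sum of squares), which is the usual argument for classing it as an algebraic measure; the mode has no such fixed-size summary, which is why it is normally treated as holistic.

```python
import statistics

def sample_variance(xs):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def partial(xs):
    """Fixed-size summary of one partition: (count, sum, sum of squares)."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def variance_from_partials(partials):
    """Combine per-partition summaries into the overall sample variance."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    return (ss - s * s / n) / (n - 1)

# Sample data only: the Document #1 term counts from question 2.
data = [5, 4, 2, 1, 1, 0]

print(sample_variance(data))
print(statistics.variance(data))
# Split the data into two partitions and merge their summaries.
print(variance_from_partials([partial(data[:2]), partial(data[2:])]))
```

All three printed values should agree up to floating-point rounding.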
3)
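-- Table and column names are placeholders; the real schemas are on the lecture slide, which is not reproduced here.
-- SELECT * only works if both tables have the same columns in the same order; otherwise list and map the columns explicitly.
INSERT INTO Warehouse
SELECT * FROM DealerTwo;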
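To make the load step concrete, the following is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The schemas here are entirely hypothetical (the lecture slide is not available), so `dealer_two`, `warehouse`, and every column name are assumptions chosen only to show an `INSERT ... SELECT` that maps and converts columns during the load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source and target schemas, invented for this example.
conn.executescript("""
    CREATE TABLE dealer_two (car_id TEXT, model TEXT, price_cents INTEGER);
    CREATE TABLE warehouse  (car_id TEXT, model TEXT, price_dollars REAL, source TEXT);
    INSERT INTO dealer_two VALUES ('A1', 'Sedan', 1850000), ('B2', 'Coupe', 2299900);
""")

# Load step: select from the source, convert units, and tag the origin.
conn.execute("""
    INSERT INTO warehouse (car_id, model, price_dollars, source)
    SELECT car_id, model, price_cents / 100.0, 'dealer_two'
    FROM dealer_two
""")
conn.commit()

rows = conn.execute(
    "SELECT car_id, model, price_dollars, source FROM warehouse ORDER BY car_id"
).fetchall()
for row in rows:
    print(row)
```

Listing the target columns explicitly, rather than relying on `SELECT *`, keeps the load correct even when the source and warehouse tables differ in column order or units.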
4)
JSON Schema:
{
  "title": "Example Schema",
  "type": "object",
  "properties": {
    "firstName": {
      "type": "string"
    },
    "lastName": {
      "type": "string"
    },
    "age": {
      "description": "Age in years",
      "type": "integer",
      "minimum": 0
    }
  },
  "required": ["firstName", "lastName"]
}
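The constraints in the schema above (required fields, types, a minimum for age) are the kind of rules whose violations show up as data quality problems. As a hedged illustration, here is a hand-rolled Python check of a single record against those three rules. It is not a JSON Schema validator, and the function name `check_person` is made up for this sketch.

```python
def check_person(record):
    """Return a list of rule violations for one record (empty list means clean)."""
    problems = []

    # "required": ["firstName", "lastName"], both typed as strings.
    for field in ("firstName", "lastName"):
        if field not in record:
            problems.append(f"missing required field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"{field} must be a string")

    # "age" is optional, but if present it must be an integer >= 0.
    if "age" in record:
        age = record["age"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(age, int) or isinstance(age, bool):
            problems.append("age must be an integer")
        elif age < 0:
            problems.append("age must be >= 0")

    return problems

print(check_person({"firstName": "Alice", "lastName": "Smith", "age": 30}))
print(check_person({"firstName": "Alice", "age": -1}))
```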
XML purchase order (an instance document, not a schema definition):
<?xml version="1.0"?>
<purchaseOrder xmlns="http://tempuri.org/po.xsd" orderDate="1999-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Confirm this is electric</comment>
    </item>
    <item partNum="926-AA">
      <productName>Baby Monitor</productName>
      <quantity>1</quantity>
      <USPrice>39.98</USPrice>
      <shipDate>1999-05-21</shipDate>
    </item>
  </items>
</purchaseOrder>
5)
Data cleaning deals with data problems once they have occurred. Error-prevention strategies can reduce many problems but cannot eliminate them. We present data cleaning as a three-stage process, involving repeated cycles of screening, diagnosing, and editing of suspected data abnormalities. Figure 1 shows these three steps, which can be initiated at three different stages of a study.
Many data errors are detected incidentally during study activities other than data cleaning. However, it is more efficient to detect errors by actively searching for them in a planned way. It is not always immediately clear whether a data point is erroneous. Many times, what is detected is a suspected data point or pattern that needs careful examination. Similarly, missing values require further examination. Missing values may be due to interruptions of the data flow or the unavailability of the target information. Hence, predefined rules for dealing with errors and true missing and extreme values are part of good practice.
One can screen for suspect features in survey questionnaires, computer databases, or analysis datasets. In small studies, with the investigator closely involved at all stages, there may be little or no distinction between a database and an analysis dataset.
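The repeated screening, diagnosing, and editing cycle described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the plausible range, the "decimal shift" rule (a value ten times too large is divided by 10), and the function names are hypothetical, not rules taken from the lecture or any paper.

```python
def screen(values, low, high):
    """Screening: return indices of values that are missing or outside [low, high]."""
    return [i for i, v in enumerate(values) if v is None or not (low <= v <= high)]

def diagnose(value, low, high):
    """Diagnosing: label a suspect value using predefined (here, invented) rules."""
    if value is None:
        return "missing"
    if low <= value / 10 <= high:
        return "decimal_shift"  # e.g. 1725.0 entered instead of 172.5
    return "error"

def edit(value, label):
    """Editing: correct what can be corrected, otherwise mark the value as missing."""
    if label == "decimal_shift":
        return value / 10
    if label == "error":
        return None
    return value  # true missing values are left as they are

def clean(values, low, high, max_passes=5):
    """Repeat screen -> diagnose -> edit until a pass makes no further changes."""
    values = list(values)
    for _ in range(max_passes):
        changed = False
        for i in screen(values, low, high):
            new_value = edit(values[i], diagnose(values[i], low, high))
            if new_value != values[i]:
                values[i] = new_value
                changed = True
        if not changed:
            break
    return values

# Hypothetical heights in centimetres with a plausible range of 50 to 250.
heights = [172.5, 1725.0, None, -4.0, 168.0]
print(clean(heights, 50, 250))
```

The loop is iterated because an edit can itself produce values that need another look, so the data is screened again after every round of corrections.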
8)
The main difference between cluster sampling and stratified sampling lies in which clusters or strata are included in the sample.
In stratified random sampling, every stratum of the population is sampled, whereas in cluster sampling the researcher randomly selects only some of the clusters that make up the population. Only the selected clusters are sampled; all other clusters are left unrepresented.
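The contrast can be sketched in Python with the standard `random` module. The population below (four stores with five customers each) and the function names are hypothetical, chosen only to show that stratified sampling draws from every group while cluster sampling keeps a few whole groups.

```python
import random

def stratified_sample(groups, per_group, rng):
    """Draw per_group members from every group, so each stratum is represented."""
    return {name: rng.sample(members, per_group) for name, members in groups.items()}

def cluster_sample(groups, n_clusters, rng):
    """Pick n_clusters whole groups at random and keep all of their members."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return {name: list(groups[name]) for name in chosen}

# Hypothetical population: four stores, five customer ids each.
groups = {
    f"store_{s}": [f"store_{s}_customer_{c}" for c in range(5)]
    for s in range(4)
}

rng = random.Random(0)  # fixed seed so repeated runs give the same draw
stratified = stratified_sample(groups, 2, rng)
clustered = cluster_sample(groups, 2, rng)

print(sorted(stratified))  # every store appears
print(sorted(clustered))   # only the selected stores appear
```

Cluster sampling tends to be the more practical choice when reaching every group is expensive, for example when each store must be visited in person, since only the selected clusters need to be contacted.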
I will update the remaining answers.