Suppose we are given a database with the following schema. Users User IDINTEGER,
ID: 3579813 • Letter: S
Question
Suppose we are given a database with the following schema. Users User IDINTEGER, Name CHAR(30), Age INTEGER, ReviewCount INTEGER) Businesses (BusinessID INTEGER, BName CHAR(30), City CHAR(20), State CHAR(2) Checkins (BusinessIDINTEGER, Weekdays INTEGER, Weekends INTEGER) Reviews (ReviewID INTEGER, UserIDINTEGER, BusinessID INTEGER, Stars REAL) Reviews (UserID) is a foreign key referring to Users (UserID). Reviews (BusinessID) is a foreign key referring to Businesses (BusinessID). Checkins (BusinessID) is a foreign key referring to Businesses (BusinessID) A page is 8 kB in size. The RDBMS buffer pool has 10,000 pages, all of which are usable. Initially, the buffer pool is empty. The relation instances have the following statistics. Assume there are no NULL values. Each integer or real is 8B, and each character is 1B (so as an example CHAR(20) is 20B). Additionally, the record id of each tuple is 8B. Relation Number of Pages Number of Tuples 75,684 10m. Users 41,504 5mm Businesses 19,532 5mm Checkins Reviews 488,282 100mExplanation / Answer
Answer: The main goal towards processing a SQL query is to evaluate the query as quick as possible and return the result.
Factors that can affect the performance of processing the query are no. of disc accesses, no. of read/write operations(I/O, page transfer) etc.
Processing of an SQL query involves following steps:
1. Parsing and translation of query into parse tree.
2. Creation of relational algebra expression from parse tree.
3. Identification of physical query plan. It is also referred as execution plan of query. It depends on relational algebra expression created in step 2. Before evaluating, relational algebra needs to be optimized. There can be many equivalent expressions with respect to relational algebra expression. Optimization means finding the cheapest execution plan for the query.
4. Next step is the execution of (optimized) execution plan by execution engine and return the result.
Cost estimation:
Given:
1. No. of tuples in relations:
a) Users: 10m
b) Businesses: 5m
c) Checkins: 5m
d) Reviews: 100m
2. No. of pages per relation:
a) Users: 75684
b) Businesses: 41504
c) Checkins: 19532
d) Reviews: 488282
3. Size of a tuple per relation:
a) Users: 8+8+30+8+8=62 Bytes
b) Businesses: 8+8+30+20+2=68 bytes
c) Checkins: 8+8+8+8=32 bytes
d) Reviews: 8+8+8+8+8=40 bytes
4. Number of tuple per page per relation (paging factor):
a) Users: 10m/75684 = 10000000/75684 = 132.128
b) Businesses: 5m/41504 = 5000000/41504 = 120.470
c) Checkins: 5m/19532 = 5000000/19532 = 255.990
d) Reviews: 100m/488282 = 100000000/488282 = 204.799685 = 204.800
5. No. of distinct values of concerned attributes in query per relation:
a) UserID in Users: UserID is unique for each user. Hence, its distinct values will be 10m.
b) Stars in Reviews: Stars are uniformly distributed inclusivly between 0 and 5. Hence we can consider average counts of each rating number here. Total rating numbers ((0,1,2,3,4,5) are 6. Hence, average count per rating will be 100m/6 = 16.67m
c) Age in Users: Age values are also uniformly distributed between 10 and 99 inclusively. Hence, in this case also we can consider average count for each possible age value between 10 and 99. Total possible age values are 90 (99-10+1). Therefore, average count per age value will be 10m/90 = 0.11m
6. Selectivity of concerned attributes: Selectivity of an attribute in a relation refers to average number of tuple in relation satisfying equality condition on that attribute. So for:
a) UserID: Selectivity = Number of tuples in Users/distinct values of UserID = 10m/10m = 1.
b) Stars: Selectivity = Number of tuples in Reviews/average count per rating = 100m/16.67m = 5.9988 = 6
c) Age: Selectivity = Number of tuples in Users/average count per age value = 10m/0.11m = 99.909 = 91
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.