In this part of the project, you will various queries using Hive, Pig and Hadoop
ID: 3900091 • Letter: I
Question
In this part of the project, you will various queries using Hive, Pig and Hadoop streaming. The schema is available below, but don’t forget to apply the correct delimiter:
http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/SSBM_schema_hive.sql
The data is available at:
http://rasinsrv07.cstcis.cti.depaul.edu/CSC553/data/ (we will use Scale4)
Use Lineorder.tbl
Implement the following query: select sum(lo_revenue), p_brand1 from lineorder, part, supplier where lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_category = 'MFGR#12' and s_region = 'EUROPE' group by p_brand1;
using Hive and Pig .
Explanation / Answer
Step 1: Configure Hadoop and PigStorage
Step 2: Load data into the storage
Step 3: Load data for part and supplier tables as well
Step 4: Execute below in grunt
> lineparts = JOIN lineorder BY lo_partkey, part BY p_partkey;
> linesupply = JOIN lineorder BY lo_suppkey, supplier BY s_suppkey;
> filterlineparts = FILTER lineparts BY p_category = 'MFGR#12';
> filterlinesupply = FILTER linesupply BY s_region = 'EUROPE';
> filterData = JOIN filterlineparts BY lo_partkey, filterlinesupply BY lo_partkey;
> Group_data = GROUP filterData BY p_brand1;
> revenue_sum = foreach Group_data Generate
(p_brand1.name,SUM(lo_revenue);
Query in HiveQL:
select sum(lo.lo_revenue), p.p_brand1
from lineorder lo
join part p on (lo.lo_partkey = p.p_partkey)
join supplier s on (lo.lo_suppkey = s.s_suppkey)
where p.p_category = 'MFGR#12' and s_region = 'EUROPE'
group by p_brand1;
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.