


Question

(JAVA) Write code into the main function of class Analysis. You are free to modify Business.java if you use it and finally everything must be put in a package called Data.

Starter Code:

Business.java

package Data;

public class Business {

String businessID;

String businessName;

String businessAddress;

String reviews;

int reviewCharCount;

public String toString() {

    return "------------------------------------------------------------------------------- "

          + "Business ID: " + businessID + " "

          + "Business Name: " + businessName + " "

          + "Business Address: " + businessAddress + " "

          //+ "Reviews: " + reviews + " "

          + "Character Count: " + reviewCharCount;

}

}

Use the data set provided at the bottom of the post. It contains the following format:

{businessID, businessName, businessAddress, reviews}

{businessID, businessName, businessAddress, reviews}

The reviews consist of lowercase English letters and spaces, without any punctuation or non-English characters. The goal of this assignment is to process this data set and extract meaningful words that represent the businesses.

Read the file, and create a Business Object for each business. Each Business should contain its reviews as a String. You may use the class Business provided in Business.java.

We will determine whether or not a word meaningfully represents a business by the number of times it appears in the reviews. However, words like “a”, “that”, “is”, or “and” are the most frequent, and these words clearly do not say much about the business.

Therefore, we use the term frequency-inverse document frequency (tf-idf) score:

$$\text{tf-idf}(w, D) = \frac{\text{number of times word } w \text{ appears in document } D}{\text{number of documents in the entire corpus that contain word } w}$$

The tf-idf score is high if a rare word appears many times in a certain document. The numerator is the “term frequency”, and the denominator is the “document frequency”.
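As a rough illustration of the formula above, the sketch below computes the score on a made-up corpus of three short documents. The class name TfIdfSketch, its method names, and the sample reviews are assumptions chosen for illustration; none of them come from the starter code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class TfIdfSketch {

    // Document frequency: the number of documents that contain each word at least once.
    static Map<String, Integer> documentFrequency(List<String> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            // A set, so a word repeated inside one document is counted only once.
            for (String word : new HashSet<>(Arrays.asList(doc.split(" ")))) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    // Term frequency: the number of times the word occurs in one document.
    static int termFrequency(String word, String doc) {
        int count = 0;
        for (String token : doc.split(" ")) {
            if (token.equals(word)) {
                count++;
            }
        }
        return count;
    }

    // The assignment's score: term frequency divided by document frequency.
    // Assumes the word occurs somewhere in the corpus, so df contains it.
    static double tfIdf(String word, String doc, Map<String, Integer> df) {
        return (double) termFrequency(word, doc) / df.get(word);
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "the buffet was the best buffet",
                "the pasta was fine",
                "the view was great");
        Map<String, Integer> df = documentFrequency(corpus);
        String first = corpus.get(0);
        System.out.println("buffet: " + tfIdf("buffet", first, df)); // 2 occurrences / 1 document
        System.out.println("the: " + tfIdf("the", first, df));       // 2 occurrences / 3 documents
    }
}
```

Both words occur twice in the first document, but "the" is in every document while "buffet" is in only one, so "buffet" receives the higher score.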

For each word, count the number of documents the word appears in.

For the top 10 Businesses with the most characters in their reviews, output (to the command line) the top 30 words with the highest tf-idf scores.

When you do so, you’ll notice that some words with high tf-idf scores are so rare that they appear in only 1 or 2 documents. Some of these words are misspellings or slang that only makes sense to locals. Filter these out.

If a word appears in fewer than 5 documents, assign a tf-idf score of 0.
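The cutoff can be folded into the score calculation itself. The following is a minimal sketch with a hypothetical class MinDocsSketch and helper score(...); the numbers passed in main are invented.

```java
class MinDocsSketch {

    // Returns termCount / docCount, or 0 when the word occurs in fewer than minDocs documents.
    // Assumes minDocs >= 1, so the division never sees a zero document count.
    static double score(int termCount, int docCount, int minDocs) {
        if (docCount < minDocs) {
            return 0.0; // too rare: likely a misspelling or local slang
        }
        return (double) termCount / docCount;
    }

    public static void main(String[] args) {
        System.out.println(score(7, 2, 5));  // 0.0: only 2 documents contain the word
        System.out.println(score(7, 14, 5)); // 0.5: 7 occurrences over 14 documents
    }
}
```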

Poorly written code will take too long to process the full data set. Perform some optimization.

Optimize your code so that it runs on the full data set within 10 minutes.

Your code could look something like

public static void main(String[] args) {
    Map<String, Integer> corpusDFCount = ???;
    List<Business> businessList = ???;

    while (true) {
        Business b = readBusiness(???);
        if (b == null) // end of file; all businesses processed
            break;
        businessList.add(b);
    }

    for (Business b : businessList)
        addDocumentCount(corpusDFCount, b);

    // sort by character count
    Collections.sort(businessList, ???);

    // for the top 10 businesses with the most review characters
    for (int i = 0; i < 10; i++) {
        Map<String, Double> tfidfScoreMap = getTfidfScore(corpusDFCount, businessList.get(i), 5);
        // Entry is a static nested interface of class Map
        List<Map.Entry<String, Double>> tfidfScoreList = new ArrayList<>(tfidfScoreMap.entrySet());
        sortByTfidf(tfidfScoreList);
        System.out.println(businessList.get(i));
        printTopWords(tfidfScoreList, 30);
    }
}

This code can be further optimized using a PriorityQueue:

public static void main(String[] args) {
    Map<String, Integer> corpusDFCount = ???;
    PriorityQueue<Business> businessQueue = ???;

    while (true) {
        Business b = readBusiness(???);
        if (b == null) // end of file; all businesses processed
            break;
        addDocumentCount(corpusDFCount, b);
        businessQueue.add(b);
        if (businessQueue.size() > 10)
            businessQueue.remove();
    }

    // for the top 10 businesses with the most review characters
    for (int i = 0; i < 10; i++) {
        Business currB = businessQueue.remove();
        Map<String, Double> tfidfScoreMap = getTfidfScore(corpusDFCount, currB, 5);
        // Entry is a static nested interface of class Map
        List<Map.Entry<String, Double>> tfidfScoreList = new ArrayList<>(tfidfScoreMap.entrySet());
        sortByTfidf(tfidfScoreList);
        System.out.println(currB);
        printTopWords(tfidfScoreList, 30);
    }
}

(This code is just a suggestion provided to clarify the instructions.)
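The PriorityQueue variant relies on a min-heap capped at 10 entries: whenever an 11th element arrives, the smallest is evicted, so what remains is always the 10 largest seen so far. Here is a minimal sketch of that idea on plain integers, where the class name TopKSketch and the sample numbers are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopKSketch {

    // Keeps the k largest values seen, using a min-heap that never holds more than k + 1 items.
    static List<Integer> topK(int[] values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // natural order: smallest at the head
        for (int v : values) {
            heap.add(v);
            if (heap.size() > k) {
                heap.remove(); // evict the current smallest
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        Collections.sort(result, Collections.reverseOrder()); // largest first
        return result;
    }

    public static void main(String[] args) {
        int[] reviewLengths = {120, 3780749, 45, 3481415, 9000, 77};
        System.out.println(topK(reviewLengths, 3)); // [3780749, 3481415, 9000]
    }
}
```

Each insertion costs O(log k), so one pass over n items is O(n log k) rather than the O(n log n) of sorting everything. In the assignment's setting, an evicted Business that nothing else references can also be garbage collected along with its review text, which keeps memory use bounded.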

Your output should look something like:

Example 1

Business ID: 60454

Business Name: Bacchanal Buffet

Business Address: Caesars Palace Las Vegas Hotel And Casino 3570 Las Vegas Boulevard South The Strip Las Vegas NV 89109

Character Count: 3780749

(bacchanal,10.75) (bacchanals,2.73) (buffet,1.48) (baccahanal,1.17) (bachannal,1.07) (buffets,1.06) (bacchanel,1) (bachanal,0.92) (ginseng,0.91) (alvaro,0.89) (baccanal,0.89) (carving,0.84) (legsclaws,0.83) (macarons,0.74) (oysters,0.64) (platinumdiamond,0.6) (virkelig,0.6) (wicked,0.57) (bacchanalian,0.56) (crab,0.54) (dionysus,0.5) (fastpass,0.5) (wicket,0.5) (gelato,0.48) (legs,0.46) (bucchanal,0.45) (the,0.45) (maccaroons,0.44) (jonah,0.44) (vomitorium,0.44)

Example 2

Business ID: 20187

Business Name: Mon Ami Gabi

Business Address: 3655 Las Vegas Blvd S The Strip Las Vegas NV 89109

Character Count: 3481415

Hint. When optimizing your code, consider the following advice.

Instead of directly using java.io.FileInputStream, java.io.BufferedInputStream, or java.io.Reader, use java.io.BufferedReader with java.io.FileReader.
You can read a line of the data set with readLine() of BufferedReader. You can split a String on commas with split(…) of String.
Strings are immutable, and using + to concatenate Strings is inefficient. When you’re building up a large string, use java.lang.StringBuffer or java.lang.StringBuilder.
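The three hints above can be combined roughly as follows. This is a sketch under stated assumptions: each record sits on one line in the form {id, name, address, reviews}, and the first three fields contain no commas. The class ReadLineSketch, the helper parseRecord, and the two sample records are made up, and a StringReader stands in for the FileReader so the example runs without the data file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class ReadLineSketch {

    // Strips the surrounding braces and splits one record into its four fields.
    // The limit of 4 stops splitting after the third comma, so any later comma
    // would stay inside the reviews field.
    static String[] parseRecord(String line) {
        String body = line.substring(1, line.length() - 1);
        String[] fields = body.split(",", 4);
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for new BufferedReader(new FileReader(path)) on the real data set.
        String sample = "{1, Cafe A, 1 Main St, good coffee}\n"
                + "{2, Diner B, 2 Oak Ave, great pie great fries}\n";
        StringBuilder summary = new StringBuilder(); // avoids repeated String concatenation in the loop
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] f = parseRecord(line);
                summary.append(f[1]).append(": ").append(f[3].length()).append(" chars\n");
            }
        }
        System.out.print(summary);
    }
}
```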

If you add additional fields to class Business, make sure your additional fields aren’t too large. In particular, you’re doing something wrong if each Business owns (as a field) a Collection or a Map.

Remark. As a point of reference, my code runs in 90 seconds. Almost all of this time is spent in the while loop reading the Businesses and accumulating the document count.

Remark. The suggested code for this assignment is not very object-oriented, and that’s fine. Object-oriented programming is a tool, and a tool should only be used when it fits the task.

Remark. In our data analysis, we entirely ignore the grammar the reviews were written in. Natural language processing is a field of artificial intelligence that studies how to process human language with a computer. Techniques from natural language processing would use the grammatical information and produce better results.

Had issues uploading the data set as a file on here so I will post it in the question

Explanation / Answer

YelpAnalysis.java


import java.io.*;
import java.util.*;
import java.text.DecimalFormat;
//only perform the tf-idf for the top 10 business with the most character?
//how to sort? Need some motivation on sorting.
public class YelpAnalysis{
/* Min-heap ordered by review character count: the head is always the business with the
fewest review characters, so removing the head whenever the queue grows past 10 leaves
the 10 businesses with the most. The constructor takes an initial capacity and a Comparator.
The Comparator can be an anonymous class, as here, or a named class that implements
Comparator<Business> and its method public int compare(Business c1, Business c2).
*/
private static PriorityQueue<Business> businessQueue = new PriorityQueue<Business>(10, new Comparator<Business>() {
    public int compare(Business o1, Business o2) {
        return Integer.compare(o1.reviewCharCount, o2.reviewCharCount);
    }
});
  
//Create a hashmap to get the number of documents in the entire corpus that contain a specific string
private static Map <String,Integer> corpusDFCount = new HashMap<String, Integer>();
  
// Reads one line of the data set and returns it as a Business, or null at end of file.
private static Business readBusiness(BufferedReader br) throws Exception {
    String line = br.readLine();
    if (line == null) {
        return null; // tells the caller's while loop that the input is exhausted
    }
    // Drop the first and last characters (the enclosing braces), then split on commas.
    String[] fields = line.substring(1, line.length() - 1).split(",");
    Business b = new Business();
    b.businessID = fields[0];
    b.businessName = fields[1];
    b.businessAddress = fields[2];
    b.reviews = fields[3];
    b.reviewCharCount = b.reviews.length();
    return b;
}
  
// Adds one to the document count of every distinct word in this business's reviews.
private static void addDocumentCount(Map<String, Integer> corpusDFCount, Business b) {
    // A set keeps each word once, so a word repeated within one document counts once.
    Set<String> uniqueWords = new HashSet<>(Arrays.asList(b.reviews.split(" ")));
    for (String word : uniqueWords) {
        corpusDFCount.merge(word, 1, Integer::sum); // insert 1, or increment the existing count
    }
}
  
// Scores every distinct word in b's reviews as (occurrences in b) / (documents containing the word).
// Words found in fewer than k documents get a score of 0.
private static Map<String, Double> getTfidfScore(Map<String, Integer> corpusDFCount, Business b, int k) {
    // Count every word in a single pass. Rescanning the whole review text once per distinct
    // word costs (distinct words) x (total words) comparisons, which scales poorly when the
    // reviews run to millions of characters.
    Map<String, Integer> termCount = new HashMap<>();
    for (String word : b.reviews.split(" ")) {
        if (!word.isEmpty()) { // consecutive spaces produce empty tokens
            termCount.merge(word, 1, Integer::sum);
        }
    }

    Map<String, Double> tfidfScoreMap = new HashMap<>();
    for (Map.Entry<String, Integer> entry : termCount.entrySet()) {
        int docCount = corpusDFCount.getOrDefault(entry.getKey(), 0);
        double score = (docCount == 0 || docCount < k) ? 0.0 : (double) entry.getValue() / docCount;
        tfidfScoreMap.put(entry.getKey(), score);
    }
    return tfidfScoreMap;
}
  
// Sorts the (word, score) entries by score, highest first. Each Map.Entry<String, Double>
// is treated as an object whose score is read with getValue().
private static void sortByTfidf(List<Map.Entry<String, Double>> tfidfScoreList) {
    Collections.sort(tfidfScoreList, new Comparator<Map.Entry<String, Double>>() {
        public int compare(Map.Entry<String, Double> o1, Map.Entry<String, Double> o2) {
            return o2.getValue().compareTo(o1.getValue()); // reversed arguments give descending order
        }
    });
}
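// Prints up to num (word,score) pairs from the front of the list, then ends the line.
private static void printTopWords(List<Map.Entry<String, Double>> tfidfScoreList, int num) {
    DecimalFormat df = new DecimalFormat("0.##"); // e.g. 10.75, 0.6, 1, as in the sample output
    int limit = Math.min(num, tfidfScoreList.size()); // the list may hold fewer than num words
    for (int i = 0; i < limit; i++) {
        Map.Entry<String, Double> entry = tfidfScoreList.get(i);
        System.out.print("(" + entry.getKey() + "," + df.format(entry.getValue()) + ") ");
    }
    System.out.println();
}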
  
public static void main(String[] args) throws Exception {
    // try-with-resources closes the BufferedReader, which in turn closes the FileReader it wraps.
    try (BufferedReader br = new BufferedReader(new FileReader("full.txt"))) {
        while (true) {
            Business b = readBusiness(br);
            if (b == null) {
                break;
            }
            addDocumentCount(corpusDFCount, b);
            businessQueue.add(b);
            if (businessQueue.size() > 10) {
                businessQueue.remove(); // drop the business with the fewest review characters
            }
        }
    }

    // The queue is a min-heap, so the remaining businesses come out smallest first.
    // Looping until it is empty also handles data sets with fewer than 10 businesses.
    while (!businessQueue.isEmpty()) {
        Business currB = businessQueue.remove();
        Map<String, Double> tfidfScoreMap = getTfidfScore(corpusDFCount, currB, 5);
        // Entry is a static nested interface of class Map
        List<Map.Entry<String, Double>> tfidfScoreList = new ArrayList<>(tfidfScoreMap.entrySet());
        sortByTfidf(tfidfScoreList);
        System.out.println(currB);
        printTopWords(tfidfScoreList, 30);
    }
}
  
}

Business.java


import java.io.*;
public class Business {
String businessID;
String businessName;
String businessAddress;
String reviews;
int reviewCharCount;

public String toString() {
return "------------------------------------------------------------------------------- "
+ "Business ID: " + businessID + " "
+ "Business Name: " + businessName + " "
+ "Business Address: " + businessAddress + " "
//+ "Reviews: " + reviews + " "
+ "Character Count: " + reviewCharCount;
}

}