This question should be written in Python. Write a function getWebInfo( ) that t
Question
This question should be written in Python. Write a function getWebInfo( ) that takes a URL as input and calls three functions to print the following information:
1. The set of all absolute links already in the page, that is, links that start with 'http://'. Must use HTML parser class methods. Do not copy the code from the book, which uses urljoin to *make* every link absolute.
2. A set that contains all e-mail addresses appearing in the page. Must use regular expressions to detect e-mail addresses on the web page. Must remove duplicates. The pattern should match general e-mail addresses, not just depaul.edu ones; do not use cdm.depaul.edu in your pattern. Your program should work for any e-mail address on any web page.
3. A list of tuples (derived from a dictionary) that contains the 20 most frequent words and their frequencies, in order of frequency. Words must contain 5 or more characters. Discard any words of 4 characters or less. (6 points) There are several steps to follow on this part.
a) You need to construct a dictionary first, containing words and their frequencies.
b) Then the dictionary has to be reversed.
c) The reversed dictionary has to be sorted. Please note that the sorting method returns a list of tuples.
d) Print the first 20 tuples of the list of tuples.
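Steps a) through d) could be sketched roughly as follows. This is only one reading of the task: the function name top_words and its parameters are made up for illustration, a "word" is assumed to be a run of ASCII letters, and "reversing" the dictionary is interpreted as swapping each (word, frequency) pair into a (frequency, word) tuple, since a literal frequency-to-word dictionary would lose words that share a frequency.

```python
import re


def top_words(text, n=20, min_len=5):
    """Return up to n (frequency, word) tuples, most frequent first.

    Hypothetical helper: the name, the parameters and the definition
    of a "word" (a run of ASCII letters) are assumptions.
    """
    # a) Build a dictionary mapping each word to its frequency,
    #    discarding words shorter than min_len characters.
    freq = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) >= min_len:
            freq[word] = freq.get(word, 0) + 1

    # b) "Reverse" the dictionary: swap each pair to (frequency, word).
    reversed_pairs = [(count, word) for word, count in freq.items()]

    # c) Sort by descending frequency; ties are broken alphabetically.
    ordered = sorted(reversed_pairs, key=lambda pair: (-pair[0], pair[1]))

    # d) Keep only the first n tuples.
    return ordered[:n]


if __name__ == "__main__":
    sample = "apple apple banana hello hello hello tiny"
    for pair in top_words(sample):
        print(pair)
```

In a real page the text would first have to be separated from the markup, which this sketch does not attempt.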
Write one function for each of the three pieces of information, a total of three functions. Then assemble the three function calls (and headings) inside your main function getWebInfo( ). Include the call to getWebInfo at the bottom of your module (file).
Explanation / Answer
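from bs4 import BeautifulSoup
from urllib.request import urlopen
import re


def get_links():
    url = input('Enter a url: ')
    htmlScript = urlopen(url).read()
    soup = BeautifulSoup(htmlScript, 'html.parser')
    all_links = []
    # href=True skips <a> tags that have no href attribute
    for link in soup.find_all('a', href=True):
        if 'http' in link['href']:
            # keep the part of the href from 'http' onwards
            all_links.append(str(link['href'])[str(link['href']).find('http'):])
    return all_links


def get_emails():
    url = input('Enter a url: ')
    # decode the bytes so the text pattern below can be applied
    htmlScript = urlopen(url).read().decode('utf-8', errors='replace')
    # matches name@host as well as the obfuscated "name at host dot com" form
    email_pattern = re.compile(
        r"([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`"
        r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
        r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)")
    # findall returns one tuple per match; group 0 is the full address
    for email in re.findall(email_pattern, htmlScript):
        print(email[0])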
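The answer above relies on BeautifulSoup and prints every e-mail match, while the question asks for HTML parser class methods and for a set without duplicates. A minimal sketch of that variant, using only the standard library, might look like the following. The names AbsoluteLinkCollector, absolute_links and find_emails are hypothetical, both functions are assumed to receive the page as already-decoded text, and the e-mail pattern is a deliberately simple approximation rather than a complete address grammar.

```python
import re
from html.parser import HTMLParser


class AbsoluteLinkCollector(HTMLParser):
    """Collects href values of <a> tags that start with 'http://'.

    Hypothetical class name; it overrides HTMLParser.handle_starttag.
    """

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith("http://"):
                self.links.add(value)


def absolute_links(html_text):
    """Return the set of links that are already absolute in the page."""
    collector = AbsoluteLinkCollector()
    collector.feed(html_text)
    return collector.links


# Simplified pattern (an assumption): local part, '@', dotted domain.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(html_text):
    """Return the set of e-mail addresses found; the set removes duplicates."""
    return set(EMAIL_PATTERN.findall(html_text))


if __name__ == "__main__":
    page = ('<a href="http://example.com/a">a</a> <a href="/local">b</a> '
            'Write to someone@example.org or someone@example.org')
    print(absolute_links(page))
    print(find_emails(page))
```

Taking the page text as an argument, instead of prompting for the URL inside each function, lets a single getWebInfo fetch the page once and pass it to every helper.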