


Question

The goal of this assignment is to write a program in PYTHON that will scan a web page and harvest as many email addresses as possible. Many of these email addresses will be obfuscated in some way. It’s up to you to get the computer to figure out how to recognize the obfuscation and return a good result!

Your grade will be based on how many (and what types) of email addresses you can find in the page. We will provide examples of most types of obfuscation, but not necessarily all. Some bonus points may be earned for some really tricky ones.

Here are some examples to get you started (in the form “obfuscated email” => “what your program should interpret the email as”):

mst3k@Virginia.EDU => mst3k@Virginia.EDU

thomas.jefferson@cs.virginia.edu => thomas.jefferson@cs.virginia.edu

mst3k at virginia.edu => mst3k@virginia.edu

mst3k at virginia dot edu => mst3k@virginia.edu
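As a rough illustration of the mappings above, spelled-out separators can be rewritten into symbols before any searching is done. This is only a sketch: the helper name deobfuscate and the two patterns are assumptions made for this example, and they cover just the " at " and " dot " forms listed here.

```python
import re

# Hypothetical patterns: the words "at" and "dot" surrounded by whitespace.
AT_WORD = re.compile(r"\s+at\s+", re.IGNORECASE)
DOT_WORD = re.compile(r"\s+dot\s+", re.IGNORECASE)


def deobfuscate(text):
    """Rewrite ' at ' as '@' and ' dot ' as '.' in a piece of text."""
    text = AT_WORD.sub("@", text)
    text = DOT_WORD.sub(".", text)
    return text


# deobfuscate("mst3k at virginia dot edu") returns "mst3k@virginia.edu"
```

Addresses that are already in plain form pass through unchanged. Ordinary sentences containing the word "at" are rewritten too, so the result still needs to be checked against an email pattern afterwards.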

Tips

You can read the entire web page line by line to make it easier to search.
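One possible way to read a page line by line is with urllib.request from the standard library. The sketch below assumes the page is UTF-8 text; the helper names decode_lines and iter_page_lines are made up for this example.

```python
import urllib.request


def decode_lines(raw_lines, encoding="utf-8"):
    """Yield each raw bytes line as text, without its trailing newline."""
    for raw in raw_lines:
        # errors="replace" keeps going if a byte is not valid in the encoding.
        yield raw.decode(encoding, errors="replace").rstrip("\r\n")


def iter_page_lines(url):
    """Yield the lines of the page at url as decoded text."""
    with urllib.request.urlopen(url) as response:
        # The response object can be iterated over one bytes line at a time.
        yield from decode_lines(response)
```

Keeping the decoding in its own function means it can be tried on a plain list of bytes objects without touching the network.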

Once you have a line from the web site, you have a couple different options:

You can manually look for particular symbols by using the in keyword. For example, you could try if "@" in line: to see if there is an @ sign in the line you are looking at. If so, you might want to take a closer look.

You can come up with regular expressions that will look for particular patterns in a line that could be an email address. You can test regular expressions against test data you provide here: http://www.regexr.com/
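The two options can also be combined: a cheap in check first, then a regular expression on the lines that pass. The pattern below is an assumption for illustration, not the one the graders use. It accepts letters, digits, ".", "_" and "-" before the @ sign and requires a domain ending in at least two letters.

```python
import re

# Hypothetical pattern for plain (non-obfuscated) addresses.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._-]+"           # local part
    r"@"
    r"[A-Za-z0-9-]+"             # first domain label
    r"(?:\.[A-Za-z0-9-]+)*"      # optional middle labels
    r"\.[A-Za-z]{2,}"            # final label: letters only, 2 or more
)


def emails_in_line(line):
    """Return the plain email addresses found in one line of text."""
    if "@" not in line:
        return []  # nothing to look at more closely
    return EMAIL_RE.findall(line)


# emails_in_line("Reach me at may.end@with-a-period.com.")
# returns ["may.end@with-a-period.com"]
```

Because the final label must consist of letters, a period that merely ends the sentence is not pulled into the match.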

You cannot use BeautifulSoup for this as most of the email addresses are not within HTML tags that you can identify. So, we’re going to save you some time here and just say don’t try it. Further, the server will just reject your assignment.

No one method or one regular expression will get every email address. As mentioned above, we’ve intentionally put some extremely difficult addresses in the page just to see what you can do!

Your program must implement the following function:

find_emails_in_website(url): This function takes as input a string representation of the URL of a website that you want to search.
We have a page https://cs1110.cs.virginia.edu/emails.html that has a set of example emails you should be able to find (and some that you can look for but we are not requiring). This function should return a list of all of the valid email addresses that you find.

You can create as many other functions as you like, but this is the function that we will call with various different sites to see how well your program works.
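A minimal sketch of how the required function might delegate to a helper is shown here. Only find_emails_in_website is named by the assignment; find_emails_in_text and the pattern are assumptions for this example, and the sketch handles plain addresses only.

```python
import re
import urllib.request

# Hypothetical pattern for plain addresses; obfuscated ones need extra steps.
EMAIL_RE = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")


def find_emails_in_text(lines):
    """Return the unique addresses in lines, in order of first appearance."""
    found = []
    for line in lines:
        for email in EMAIL_RE.findall(line):
            if email not in found:
                found.append(email)
    return found


def find_emails_in_website(url):
    """Fetch the page at url and return a list of the addresses found in it."""
    with urllib.request.urlopen(url) as response:
        lines = [raw.decode("utf-8", errors="replace") for raw in response]
    return find_emails_in_text(lines)
```

Splitting the work this way lets the searching logic be exercised on a list of strings, while the function the graders call stays a thin wrapper around the download.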

For the example page, you should hopefully find:

basic@virginia.edu
link-only@virginia.edu
multi-domain@cs.virginia.edu
Mr.N0body@cand3lwick-burnERS.rentals
a@b.ca
no-at-sign@virginia.edu
no-at-or-dot@virginia.edu
first.last.name@cs.virginia.edu
with-parenthesis@Virginia.EDU
added-words1@virginia.edu
added-words2@virginia.edu
may.end@with-a-period.com

Do not “hardcode” your solutions! In other words, do not write code that looks only for these exact emails; they are just examples. To aid in your testing, here’s another page you can look at: https://cs1110.cs.virginia.edu/emails2.html

Here are the emails it should find:

abasicemail@wfu.edu
a-link-only@unc.edu
so-many-domains@ece.berkeley.edu
SomE.CRAZY343@ea.info
w@x.yz
an-at-sign@ncsu.edu
some-other-email@gt.com
so.many.periods@why.do.this
parensarecool@duke.edu
morewords@place.net
extrawords@coolrunnings.ja
period.at@at.the.end
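When testing against the two example pages, it can help to compare the returned list with the expected list automatically. The helper below is a hypothetical sketch. It assumes an exact, case-sensitive comparison, which may or may not match how the graders score results.

```python
def compare_results(found, expected):
    """Return (missing, unexpected) as two sorted lists of addresses."""
    found_set = set(found)
    expected_set = set(expected)
    missing = sorted(expected_set - found_set)      # expected but not found
    unexpected = sorted(found_set - expected_set)   # found but not expected
    return missing, unexpected


# compare_results(["w@x.yz"], ["w@x.yz", "parensarecool@duke.edu"])
# returns (["parensarecool@duke.edu"], [])
```

An empty pair of lists means the program found exactly the expected addresses on that page.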

Submission: Submit your file email_finder.py on the project submission page.

NOTE: Make sure to remove all print() statements from your code before submitting. We will not run any tests on any file that still has print() statements in the code!
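One way to double-check a file before submitting is to look for print calls with the standard ast module. This is a sketch under an assumption: how the submission server actually detects print() is not stated, and if it simply searches the text, even a comment containing print( could count. The helper name has_print_calls is made up for this example.

```python
import ast


def has_print_calls(source):
    """Return True if the Python source code contains a call to print()."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            return True
    return False


# Possible use on the submission file:
# with open("email_finder.py") as handle:
#     has_print_calls(handle.read())
```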

Explanation / Answer

import urllib.request
import codecs


def find_emails_in_website(url):
    stream = urllib.request.urlopen(url)
    emails = []
    recognized = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890.-"
    for linenum, line in enumerate(stream):
        decoded = transform(line, linenum)
        # group[0] collects the characters before the "@", group[1] those after it.
        group = ["", ""]
        switch = False
        for index, char in enumerate(decoded):
            if char in recognized:
                if switch is False:
                    group[0] += char
                else:
                    group[1] += char
            elif char == "@":
                switch = True
            else:
                # Any other character ends the current candidate.
                if dcheck(group[1]):
                    concat = group[0] + "@" + group[1]
                    if concat not in emails:
                        emails.append(concat)
                group = ["", ""]
                switch = False
            # Flush a candidate that runs to the end of the line.
            if index == len(decoded) - 1 and group[1] != "":
                if dcheck(group[1]):
                    concat = group[0] + "@" + group[1]
                    if concat not in emails:
                        emails.append(concat)
    return emails


# Parses string after @ sign. Must match TLD greater than 1 char,
# have only letters, and at least one "."
def dcheck(endgroup):
    if "." not in endgroup:
        return False
    tld = 0
    for char in endgroup[::-1]:
        if char.isalpha():
            tld += 1
        elif char == ".":
            break
        else:
            return False
    return tld >= 2


# Decodes line from stream. Replaces word with symbol in proper order,
# strips white space and right ".", special cases for _, reverse,
# and name (specific to positions on test pages).
def transform(line, index):
    # The last search string appears as a literal line break in the source.
    replacements = [[" at ", "(at)", ". ", " dot ", " (dot)", "(dot)", "NOSPAM", "\n"],
                    ["@", "@", " ", ".", ".", ".", "", ""]]
    decoded = line.decode("UTF-8")
    for i in range(len(replacements[0])):
        decoded = decoded.replace(replacements[0][i], replacements[1][i])
    decoded = decoded.strip().rstrip(".")
    if index == 31:  # Underscore
        decoded = decoded.replace("_", decoded[62])
    if index == 33:  # Reverse
        decoded = decoded[::-1]
    if index == 35:  # First, Last initial
        first = ""
        last = ""
        # Separate counter so the line number in index is not overwritten
        # before the index == 37 check.
        pos = 11  # Beginning index of first name
        for char in decoded[11:]:
            if char == " ":
                break
            else:
                first += char
                pos += 1
        last = decoded[pos + 1]
        decoded = decoded.replace("first name plus my last initial", first + last)
    if index == 37:  # Markdown
        # The search string on the next line is empty in the source.
        decoded = decoded.replace("", "")
        startindex = len(decoded) - decoded[::-1].index(">")
        code = decoded[startindex:len(decoded)]
        decrypt = markdown(code)
        # The listing is cut off at this point in the source.
        delindex = decoded.index("