
Python 2.7 regex code and HTML


Question


I am trying to get my code to print just the URLs of the watch images on a specific page, and I have located the div container where they are stored. My initial problem was that .* would not match across the newlines in that container, so I first used [\s\S]*? in my regex wherever I wanted it to match everything between the div container title and the image source, and then everything from the image source to the next div container title. Yet when I did this I only got one URL printed, even though I knew there were many more images inside that container.

So I did some research and found out how to use the DOTALL flag. Yet when I use it I get the same result (naturally, I guess, because in theory it does the same thing as the pattern above). I have asked around on the forums and some have suggested using .*? to make the match non-greedy, yet this hasn't helped at all. I am still getting just one result, only now it is the first image in the container instead of the last one. Can someone tell me what I am doing wrong and provide the code to fix this, please? The code I am using is below.

from urllib import urlopen

from re import findall

import re

dennisov_url = 'https://denissov.ru/en/'

dennisov_html = urlopen(dennisov_url).read()

# Print all images between div class="grid" and div class="orderplacebut"

# Because the regex spans over several lines, use DOTALL flag to include

# every character between, including new lines

watch_image_urls = findall('<div class="grid".*<img src="([^"]+)".*<div class="orderplacebut"', dennisov_html, flags=re.DOTALL)

print watch_image_urls

Explanation / Answer

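Explanation: The regex passed to findall, '<div class="grid".*<img src="([^"]+)".*<div class="orderplacebut"', matches the whole container exactly once. The first .* is greedy, so it runs as far as it can and then backtracks to the last <img src= that still leaves a closing <div class="orderplacebut" after it. That is why only the last image URL, "/files/collections/1f218a38b9d8427923ceba21fb93a4953.png", is returned. Switching to .*? does not help: the match then stops at the first image, but the second .*? still has to run on to <div class="orderplacebut", so the single match again swallows the entire container and findall has nothing left to search. One match with one capture group gives one URL. In my opinion it is better to first grab everything between the two divs (grid and orderplacebut) and then extract the image URLs from that text, which is what I did below, and it works fine. The details follow.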
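To make the one-match behaviour concrete, here is a minimal sketch that runs the greedy and the non-greedy patterns against a small stand-in for the page. The SAMPLE_HTML string and its file names are made up for illustration, and the real page's markup may differ.

```python
from __future__ import print_function  # keeps the sketch valid on Python 2.7 and 3
import re

# Hypothetical stand-in for the page; the real markup may differ.
SAMPLE_HTML = """
<div class="grid">
  <div class="hm"><img src="/files/a.png" alt=""></div>
  <div class="hm"><img src="/files/b.png" alt=""></div>
  <div class="hm"><img src="/files/c.png" alt=""></div>
</div>
<div class="orderplacebut">Order</div>
"""

# The pattern from the question (greedy) and the suggested non-greedy variant.
GREEDY = r'<div class="grid".*<img src="([^"]+)".*<div class="orderplacebut"'
LAZY = r'<div class="grid".*?<img src="([^"]+)".*?<div class="orderplacebut"'

# Each pattern spans the whole container, so findall finds a single match.
greedy_result = re.findall(GREEDY, SAMPLE_HTML, flags=re.DOTALL)
lazy_result = re.findall(LAZY, SAMPLE_HTML, flags=re.DOTALL)

print(greedy_result)  # ['/files/c.png']  (last image)
print(lazy_result)    # ['/files/a.png']  (first image)
```

In both cases the pattern contains one capture group and consumes the container in a single match, so the result list can never hold more than one URL per container.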

On another note, if we are sure that every image URL inside the required divs appears in the form <div class="hm"><img src="(.*?)", then the whole job can be done in one shot, without two findall calls, like this:

print re.findall(r'<div class="hm"><img src="(.*?)"', dennisov_html, flags=re.S)
['/files/collections/554047f513e867adb4e8a12582d6d1be3.png', '/files/collections/554047f513e867adb4e8a12582d6d1be5.png', '/files/collections/554047f513e867adb4e8a12582d6d1be8.png', '/files/collections/00d2820b69a4a696f5a4f13052ebbc9b6.png', '/files/collections/2de67103cc07d7da05a09f3ae1220e271.png', '/files/collections/32b5f974b36b97867bab345f82ffeb139.png', '/files/collections/d1956c3a8b8702dd2b89ca86fdaedbfb7.png', '/files/collections/e4a0b9404ce3b2f2494664e0431522fd3.png', '/files/collections/1f218a38b9d8427923ceba21fb93a4953.png']
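One caveat with the one-shot pattern is that it never looks at the grid container, so it only gives the right list if that markup appears nowhere else on the page. A small sketch of the difference, using a made-up SAMPLE_HTML string with a hypothetical banner image placed outside the container:

```python
from __future__ import print_function  # keeps the sketch valid on Python 2.7 and 3
import re

# Hypothetical markup: one "hm" image sits outside the grid container.
SAMPLE_HTML = """
<div class="hm"><img src="/files/banner.png" alt=""></div>
<div class="grid">
  <div class="hm"><img src="/files/a.png" alt=""></div>
  <div class="hm"><img src="/files/b.png" alt=""></div>
</div>
<div class="orderplacebut">Order</div>
"""

# The one-shot pattern matches every occurrence on the page,
# whether or not it is inside <div class="grid">.
one_shot = re.findall(r'<div class="hm"><img src="(.*?)"', SAMPLE_HTML)

print(one_shot)  # ['/files/banner.png', '/files/a.png', '/files/b.png']
```

If the page has no such stray images, as the output above suggests for this site, the one-shot version is the shorter option.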

Code:

from urllib import urlopen

from re import findall

import re

dennisov_url = 'https://denissov.ru/en/'

dennisov_html = urlopen(dennisov_url).read()

# Print all images between div class="grid" and div class="orderplacebut"

# Because the regex spans over several lines, use DOTALL flag to include

# every character between, including new lines

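# Step 1: grab the whole block between the two divs (findall returns a list of strings)

watch_image_raw = findall(r'<div class="grid">.*<div class="orderplacebut">', dennisov_html, flags=re.DOTALL)

# Step 2: join that list into one string and pull out every image URL

watch_image_urls = findall(r'<div class="hm"><img src="(.*?)"', ''.join(watch_image_raw))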

print watch_image_urls

Execution and output:

186590cb0725:Python bonkv$ python checking.py

['/files/collections/554047f513e867adb4e8a12582d6d1be3.png', '/files/collections/554047f513e867adb4e8a12582d6d1be5.png', '/files/collections/554047f513e867adb4e8a12582d6d1be8.png', '/files/collections/00d2820b69a4a696f5a4f13052ebbc9b6.png', '/files/collections/2de67103cc07d7da05a09f3ae1220e271.png', '/files/collections/32b5f974b36b97867bab345f82ffeb139.png', '/files/collections/d1956c3a8b8702dd2b89ca86fdaedbfb7.png', '/files/collections/e4a0b9404ce3b2f2494664e0431522fd3.png', '/files/collections/1f218a38b9d8427923ceba21fb93a4953.png']
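For anyone who wants to try the two-step idea without fetching the page, here is a minimal offline sketch. The function name extract_container_images and the SAMPLE_HTML string are made up for illustration, and it uses re.search with a non-greedy container pattern rather than the exact calls above, so treat it as one possible variant and not the only way to do it.

```python
from __future__ import print_function  # keeps the sketch valid on Python 2.7 and 3
import re


def extract_container_images(html):
    """Return the img src values found inside the first grid container."""
    # Step 1: isolate the text between the two divs. DOTALL lets "." cross
    # newlines; the lazy ".*?" stops at the first orderplacebut div.
    container = re.search(
        r'<div class="grid">(.*?)<div class="orderplacebut">',
        html,
        flags=re.DOTALL,
    )
    if container is None:
        return []
    # Step 2: each src value sits on one line, so no flag is needed here.
    return re.findall(r'<img src="([^"]+)"', container.group(1))


# Hypothetical markup with images before and after the container.
SAMPLE_HTML = """
<div class="hm"><img src="/files/banner.png" alt=""></div>
<div class="grid">
  <div class="hm"><img src="/files/a.png" alt=""></div>
  <div class="hm"><img src="/files/b.png" alt=""></div>
</div>
<div class="orderplacebut">Order</div>
<img src="/files/footer.png" alt="">
"""

urls = extract_container_images(SAMPLE_HTML)
print(urls)  # ['/files/a.png', '/files/b.png']
```

Only the images inside the container are returned, and a page without the container gives an empty list instead of an error.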