Question
Python 2.7 regex code and HTML
I am trying to get my code to print only the URLs of the watch images on a specific page, and I have located the div container where they are stored. My initial problem was that .* would not match across the newlines in that container, so I first used [\s\S]*? in my regex to match everything between the container's opening div and the image source, and then everything from the image source to the next div. Yet when I did this only one URL was printed, even though I knew there were many more images inside that container.
So I did some research and found out how to use the DOTALL flag. Yet with it I still get a single result (naturally, I guess, because in theory it does the same job as the pattern above). I have asked around on the forums and some have suggested using .*? to make it non-greedy, yet this hasn't helped either. I still get just one result, only now it is the first image in the container rather than the last. Can someone tell me what I am doing wrong and provide the code to fix this, please? The code I am using is below.
from urllib import urlopen
from re import findall
import re
dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()
# Print all images between div class="grid" and div class="orderplacebut"
# Because the regex spans over several lines, use DOTALL flag to include
# every character between, including new lines
watch_image_urls = findall('<div class="grid".*<img src="([^"]+)".*<div class="orderplacebut"', dennisov_html, flags=re.DOTALL)
print watch_image_urls
Explanation / Answer
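Explanation: The regex passed to findall, '<div class="grid".*<img src="([^"]+)".*<div class="orderplacebut"', produces a single match that spans the whole container, so findall returns only one captured URL. The first .* is greedy: it consumes as much text as it can and backtracks only as far as the last <img src=, which is why the last image URL, "/files/collections/1f218a38b9d8427923ceba21fb93a4953.png", is returned. Making it non-greedy (.*?) stops at the first image instead, but the match still runs through to the orderplacebut div, so there is still only one result. In my opinion it is better to first grab everything between the two divs (grid and orderplacebut) and then extract the image URLs from that text, which is what I did below, and it works fine. The details are below.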
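To make the single-match behaviour concrete, here is a minimal sketch run against a small inline sample instead of the live page. The sample_html markup and its file names are made up for illustration; the real page's markup may differ. The code sticks to the re module so it runs on Python 2.7 as well as Python 3.

```python
import re

# Hypothetical stand-in for the downloaded page (not the real site markup).
sample_html = '''<div class="grid">
<div class="hm"><img src="/files/a.png"></div>
<div class="hm"><img src="/files/b.png"></div>
<div class="hm"><img src="/files/c.png"></div>
</div>
<div class="orderplacebut">Order</div>'''

# Same shape as the pattern in the question: greedy .* on both sides.
greedy = r'<div class="grid".*<img src="([^"]+)".*<div class="orderplacebut"'
# Non-greedy variant suggested on the forums.
lazy = r'<div class="grid".*?<img src="([^"]+)".*?<div class="orderplacebut"'

# Each pattern matches the whole container exactly once, so findall
# returns one captured URL either way.
greedy_urls = re.findall(greedy, sample_html, flags=re.DOTALL)
lazy_urls = re.findall(lazy, sample_html, flags=re.DOTALL)

print(greedy_urls)  # ['/files/c.png'] (backtracks to the last image)
print(lazy_urls)    # ['/files/a.png'] (stops at the first image)
```

Either way the capture group sits inside one container-wide match, so the fix is to make the pattern that carries the capture group match once per image rather than once per container.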
On another note, if we are sure that every image URL between the required divs has the form <div class="hm"><img src="(.*?)" (and that nothing outside those divs matches it), then we can do this in one shot, without two findall calls, as below.
print re.findall(r'<div class="hm"><img src="(.*?)"', dennisov_html, flags=re.S)
['/files/collections/554047f513e867adb4e8a12582d6d1be3.png', '/files/collections/554047f513e867adb4e8a12582d6d1be5.png', '/files/collections/554047f513e867adb4e8a12582d6d1be8.png', '/files/collections/00d2820b69a4a696f5a4f13052ebbc9b6.png', '/files/collections/2de67103cc07d7da05a09f3ae1220e271.png', '/files/collections/32b5f974b36b97867bab345f82ffeb139.png', '/files/collections/d1956c3a8b8702dd2b89ca86fdaedbfb7.png', '/files/collections/e4a0b9404ce3b2f2494664e0431522fd3.png', '/files/collections/1f218a38b9d8427923ceba21fb93a4953.png']
Code:
from urllib import urlopen
import re
dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()
# Step 1: grab everything from div class="grid" to div class="orderplacebut".
# The block spans several lines, so the DOTALL flag lets . match newlines too.
watch_image_raw = re.findall(r'<div class="grid">.*<div class="orderplacebut">', dennisov_html, flags=re.DOTALL)
# Step 2: pull each image URL out of the captured block.
# findall returns a list, so join it into one string before searching it
# (this avoids running the regex over the list's str() representation).
watch_image_urls = re.findall(r'<div class="hm"><img src="(.*?)"', ''.join(watch_image_raw))
print watch_image_urls
Execution and output:
186590cb0725:Python bonkv$ python checking.py
['/files/collections/554047f513e867adb4e8a12582d6d1be3.png', '/files/collections/554047f513e867adb4e8a12582d6d1be5.png', '/files/collections/554047f513e867adb4e8a12582d6d1be8.png', '/files/collections/00d2820b69a4a696f5a4f13052ebbc9b6.png', '/files/collections/2de67103cc07d7da05a09f3ae1220e271.png', '/files/collections/32b5f974b36b97867bab345f82ffeb139.png', '/files/collections/d1956c3a8b8702dd2b89ca86fdaedbfb7.png', '/files/collections/e4a0b9404ce3b2f2494664e0431522fd3.png', '/files/collections/1f218a38b9d8427923ceba21fb93a4953.png']
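The difference between the two-step approach and the one-shot pattern only shows up when a matching image sits outside the container. The sketch below illustrates that on a made-up sample; sample_html, banner.png and the other file names are hypothetical, and re.search is used here in place of the first findall simply because only one container is expected.

```python
import re

# Hypothetical page: one matching image sits outside the "grid" container.
sample_html = '''<div class="hm"><img src="/files/banner.png"></div>
<div class="grid">
<div class="hm"><img src="/files/a.png"></div>
<div class="hm"><img src="/files/b.png"></div>
</div>
<div class="orderplacebut">Order</div>'''

image_pattern = r'<div class="hm"><img src="(.*?)"'

# One-shot: scans the whole page, so the banner is picked up as well.
all_urls = re.findall(image_pattern, sample_html)

# Two-step: first isolate the container, then search only inside it.
container = re.search(r'<div class="grid">.*<div class="orderplacebut">',
                      sample_html, flags=re.DOTALL)
scoped_urls = re.findall(image_pattern, container.group(0)) if container else []

print(all_urls)     # ['/files/banner.png', '/files/a.png', '/files/b.png']
print(scoped_urls)  # ['/files/a.png', '/files/b.png']
```

On the page in the question both approaches printed the same nine URLs, so the one-shot version is enough there; the scoped version is the safer choice if the page might contain similar markup elsewhere.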