July 6, 2020 at 7:29 pm #892Abhishek TyagiKeymaster
Following Links in Python
In this assignment you will write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= vaues from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link and repeat the process a number of times and report the last name you find.
We provide two files for this assignment. One is a sample file where we give you the name for your testing and the other is the actual data you need to process for the assignment
Sample problem: Start at http://py4e-data.dr-chuck.net/known_by_Fikret.html
Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
Last name in sequence: Anayah
Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Annick.html
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
Hint: The first character of the name of the last page that you will load is: M
# Enter URL: http://py4e-data.dr-chuck.net/known_by_Annick.html # Enter count: 7 # Enter position: 18 # Retrieving: http://py4e-data.dr-chuck.net/known_by_Annick.html # Retrieving: http://py4e-data.dr-chuck.net/known_by_Nicki.html # Retrieving: http://py4e-data.dr-chuck.net/known_by_Peebles.html # Retrieving: http://py4e-data.dr-chuck.net/known_by_Chantelle.html # Retrieving: http://py4e-data.dr-chuck.net/known_by_Kamila.html # Retrieving: http://py4e-data.dr-chuck.net/known_by_Domenico.html # Retrieving: http://py4e-data.dr-chuck.net/known_by_Nassir.html # Last Url: http://py4e-data.dr-chuck.net/known_by_Mhea.html ###########Final Answer is: #Name: Mhea import urllib.request as ur from bs4 import * current_repeat_count = 0 url = input('Enter URL: ') repeat_count = int(input('Enter count: ')) position = int(input('Enter position: ')) def parse_html(url): html = ur.urlopen(url).read() soup = BeautifulSoup(html, 'html.parser') tags = soup('a') return tags while current_repeat_count < repeat_count: print('Retrieving: ', url) tags = parse_html(url) for index, item in enumerate(tags): if index == position - 1: url = item.get('href', None) name = item.contents break else: continue current_repeat_count += 1 print('Last Url: ', url)
July 5, 2021 at 5:05 am #1535Pavel NgaybeParticipant
Thanks, this was helpful. The assignment is a true test of all what I have learnt in this course this far. If I may ask, what prompted you to use the line “def parse_html(url)”? As a beginner, it is kind of challenging for me to figure that out. Is there any other approach(code) to solve this problem?
- You must be logged in to reply to this topic.