Extracting Links – Intro to Computer Science


So now you know enough about Python to be able to solve the problem that we started with at the beginning of this unit, which is the problem of extracting a link from a web page. Before we get to the code, I want to describe a little more carefully what’s going on in a web page. We’ve talked about strings in Python, and all a web page really is, is a long string.
When you see a web page in your browser, it doesn’t look like that. So here’s an example web page, one of my favorite XKCD comics. And hopefully, you’re starting to learn enough about Python to appreciate the power of Python to make you fly. Probably the rest of the comic, if you haven’t used anything other than Python, is a little hard to relate to for now. But it’s making fun of other languages where there’s an awful lot of work to do something simple, like we’ve seen here, just being able to print out a string. But with Python, we can fly quickly, and you’re going to learn to fly very quickly in this class.
This doesn’t look like just a string. We’ve seen that a string is just a sequence of characters. When we look at a web page like this, we see images. We see buttons. We see some text. We see things that are links, and you can see from the underlines that these are all links. The browser renders the web page in a way that looks attractive. What was actually there, though, started just as a stream of text.
If you right-click on the web page, one of the options you see is View Page Source. When you click on that, you’ll see the actual source code. This is what came into the browser. Your browser sent a request to the URL that’s shown in the address bar, xkcd.com/355. It sent that request and this is what came back. What came back is just a stream of text. We can look at that text, and some of it is fairly hard to understand. What’s important is the links. Here’s an example of a link. The link starts with a tag like this. The language HTML uses these angle brackets.
And the angle bracket, a href equals (<a href=), is how we start a link. That’s followed by a string which is surrounded by double quotes, similar to a string in Python. So, we have a double quote. Between the double quotes is a URL. The URL is the way of locating content on the web, so here we have the URL starting with http colon, which means it’s a web request. We’ll talk more in a later class about what http means and the protocols used to request web pages. What’s important now is that it’s a location. If we open that in a web browser, that will give us another page.
What I’m looking at here is the link that is underneath the text for News/Blag. If we click on that link, that will take us to the page blag.xkcd.com. That was the page that we saw in the link here; it said blag.xkcd.com. When we click on the link, that’s where we went.
So to build our crawler, what we want to do for each web page is find these links in the page. We’re going to keep track of those links and we’re going to follow them to find more content on the web. This is similar to what someone would do if they’re browsing: clicking on every link of a page, following all the links they find, looking at all that content. That’s a really good way to waste a horrendous amount of time if you do it yourself. We’re going to build a web crawler that can do that automatically.
So our goal is to take the text that came back from a web request, find a link in that text, which is going to be a tag that starts with a href equals, and then extract from that tag the URL of the web page that it links to. Those are the URLs that we’re going to use in our crawler to make progress. By using what we’ve learned about strings, and what you’ve learned about variables, you know enough to be able to do that. What we want to do is find the beginning of a tag. And the beginning of a tag is this text, right? We’re looking for something that matches exactly the a href equals part.
That’s what the tags were here: they all start with a href equals. Not all web pages have the same structure. There are lots of other ways to make a tag. The a could be a capital letter, for example. There could be more spaces between the a and the href. The double quote doesn’t actually need to be there. For what we do now, we’re going to assume that all our web pages follow the same structure that we’re seeing here: that each link starts with an a href without any funny spaces or anything else, has an equals sign, has a double quote, has the URL following that, and then another double quote.
So that means we’re looking for strings like this. We’re looking to find the a href equals, followed by a double quote. After the double quote is the URL. This is what we actually care about; we want to find the URLs on the web page. That’s followed by a closing double quote, and then there’s more that closes the tag. And there’s lots of other stuff on both sides of this. But this is what we want to do. We want to find the tags that are links and then, within the tags that are links, we want to find the URLs.
So we’re going to assume that we start with the page contents in a variable. We’ll call that page, and we’re not going to worry today about how we got those page contents. We’re going to provide a function that does that. For the code that you have today, we’re going to assume that page is already initialized, that it contains the content of some web page stored as a string, and that our goal is to find the URL of the first link in the page. That’s going to involve a couple of steps. So what we want to do is find the start of the link. We want to find where we have the a href equals. We can’t just look for the first string we find, because there are lots of other strings on the page that aren’t URLs. I think you know enough to do that, so we’ll make it a quiz.
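
To make that assumption concrete, here is a minimal sketch of what page might hold. The contents below are made up for illustration (the example.com URL and the surrounding text are not from any real page), and a real page would be much longer.

```python
# Hypothetical page contents, made up for illustration only.
page = ('<div id="top">Some text before the link. '
        '<a href="http://example.com/next">Next page</a> '
        'Some text after the link.</div>')

# The page is an ordinary Python string.
print(type(page) is str)      # True

# The marker that starts a link occurs somewhere inside it.
print('<a href=' in page)     # True

# The first double-quoted string on the page is not a URL.
print(page.split('"')[1])     # top
```

Note that the first quoted string here is top, not a URL, which is why looking for the first double quote on its own is not enough.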
So your goal for this quiz is to write some Python code that will initialize the variable start_link to be the position where the first a href equals occurs in page. That is where the first tag that starts a link begins. You should assume that page already holds the content of some web page, and what we’re doing is looking for the place where the first a href equals occurs, and that’s the first link on the page.
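
One possible answer can be sketched as follows, assuming the string method find is the intended tool. The page value here is made up for illustration; in the quiz, page is already initialized for you.

```python
# Hypothetical page contents, made up for illustration only.
page = '<p>Hello!</p><a href="http://example.com/next">Next</a>'

# find returns the position of the first occurrence of its argument,
# or -1 if the argument does not occur in the string at all.
start_link = page.find('<a href=')

print(start_link)                          # 13
print(page[start_link:start_link + 8])     # <a href=
```

If the page contained no link of this form, start_link would be -1, so later code that uses it would need to check for that case.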
