How does the Google search engine work?


This is the information age . . . knowing how to find information will be a survival skill.

The other night I was watching one of those outdoor survival shows, where the guy simulates being the survivor of a plane crash and has to survive in the wilderness for a few days until the producers of the show pick him up. It was fun to watch him start a fire with some gasoline from the plane and two wires to the plane's batteries.

At some point in the show, I turned to my wife and said, that is sooo 20th century cave man stuff. An information age techie wouldn’t need to worry about starting a fire . . . his big concern would be how to find the phone number for the helicopter rescue service. Using his handy GPS-enabled smart phone, he could connect to the Internet, go to Google Mobile, search Google Local, and find the closest helicopter rescue team. He’d be dining on fine linen and silver in a local restaurant within the hour.

Cave man skills are not what we are going to need . . . information acquisition skills will be your Swiss Army knife of the future.

To that end, you had better know how it works. Here it is in Google’s own words.

How does Google collect and rank results?


One of the most common questions we hear from librarians is “How does Google decide what result goes at the top of the list?” Here, from quality engineer Matt Cutts, is a quick primer on how we crawl and index the web and then rank search results. Matt also suggests exercises school librarians can do to help students.

Crawling and Indexing
A lot of things have to happen before you see a web page containing your Google search results. Our first step is to crawl and index the billions of pages of the World Wide Web. This job is performed by Googlebot, our “spider,” which connects to web servers around the world to fetch documents. The crawling program doesn’t really roam the web; it instead asks a web server to return a specified web page, then scans that web page for hyperlinks, which provide new documents that are fetched the same way. Our spider gives each retrieved page a number so it can refer to the pages it fetched.
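The fetch, scan, and number loop described above can be sketched in a few lines of Python. This is only an illustration, not Googlebot's actual code: FAKE_WEB, LinkParser, fetch, and crawl are made-up names, and a small dictionary stands in for real web servers so the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical stand-in for the web: URL -> HTML a server would return.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://c.example/">C</a>',
    "http://c.example/": '<a href="http://a.example/">A</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Ask the pretend server for one page; None if it does not exist."""
    return FAKE_WEB.get(url)


def crawl(seed):
    """Fetch pages breadth-first, giving each retrieved page a number."""
    doc_ids = {}                      # URL -> document number
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in doc_ids:
            continue                  # already fetched this page
        html = fetch(url)
        if html is None:
            continue
        doc_ids[url] = len(doc_ids) + 1
        parser = LinkParser()
        parser.feed(html)
        queue.extend(parser.links)    # new documents, fetched the same way
    return doc_ids


print(crawl("http://a.example/"))
```

A real crawler would also need politeness delays, robots.txt handling, and error handling, none of which is shown here.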

Our crawl has produced an enormous set of documents, but these documents aren’t searchable yet. Without an index, if you wanted to find a term like civil war, our servers would have to read the complete text of every document every time you searched.

So the next step is to build an index. To do this, we “invert” the crawl data; instead of having to scan for each word in every document, we juggle our data in order to list every document that contains a certain word. For example, the word “civil” might occur in documents 3, 8, 22, 56, 68, and 92, while the word “war” might occur in documents 2, 8, 15, 22, 68, and 77.
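Here is a minimal Python sketch of that "inversion" step. The DOCS dictionary and the build_index function are hypothetical, invented for illustration; the document numbers and texts are not real crawl data.

```python
import re
from collections import defaultdict

# Hypothetical crawl output: document number -> page text.
DOCS = {
    2: "The war ended in 1865.",
    3: "Civil engineering basics",
    8: "The American Civil War",
    22: "Civil war battlefields",
}


def build_index(docs):
    """Invert doc -> words into word -> sorted list of document numbers."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Lowercase and split on anything that is not a letter or digit.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


index = build_index(DOCS)
print(index["civil"])  # [3, 8, 22]
print(index["war"])    # [2, 8, 22]
```

Once the index exists, finding every document that contains a word is a single dictionary lookup instead of a scan of every page.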

Once we’ve built our index, we’re ready to rank documents and determine how relevant they are. Suppose someone comes to Google and types in civil war. In order to present and score the results, we need to do two things:

  1. Find the set of pages that contain the user’s query somewhere
  2. Rank the matching pages in order of relevance

We’ve developed an interesting trick that speeds up the first step: instead of storing the entire index on one very powerful computer, Google uses hundreds of computers to do the job. Because the task is divided among many machines, the answer can be found much faster. To illustrate, let’s suppose an index for a book was 30 pages long. If one person had to search for several pieces of information in the index, it would take at least several seconds for each search. But what if you gave each page of the index to a different person? Thirty people could search their portions of the index much more quickly than one person could search the entire index alone. Similarly, Google splits its data between many machines to find matching documents faster.
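The "thirty people, thirty pages" idea can be imitated on one computer. In this sketch, each entry in SHARDS plays the part of one machine holding a slice of the index, and a thread pool asks all of them at once. The way the documents are split between shards is an assumption made for illustration; the article does not say how Google actually divides its data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each "machine" indexes a disjoint slice of the documents.
SHARDS = [
    {"civil": [3, 8], "war": [2, 8, 15]},
    {"civil": [22, 56], "war": [22]},
    {"civil": [68, 92], "war": [68, 77]},
]


def search_shard(shard, words):
    """Return the documents in one shard that contain every query word."""
    postings = [set(shard.get(word, [])) for word in words]
    return set.intersection(*postings) if postings else set()


def search(words):
    """Send the query to every shard at once and merge the partial answers."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, words), SHARDS)
        return sorted(set().union(*partials))


print(search(["civil", "war"]))  # [8, 22, 68]
```

Threads here only stand in for separate machines. The point is the shape of the work: each shard answers for its own documents, and the results are merged at the end.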

How do we find pages that contain the user’s query? Let’s return to our civil war example. The word “civil” was in documents 3, 8, 22, 56, 68, and 92; the word “war” was in documents 2, 8, 15, 22, 68, and 77. Let’s write the documents across the page and look for those with both words.

civil: 3 8 22 56 68 92
war: 2 8 15 22 68 77
both words: 8 22 68

Arranging the documents this way makes clear that the words “civil” and “war” appear in three documents (8, 22, and 68). The list of documents that contain a word is called a “posting list,” and looking for documents with both words is called “intersecting a posting list.” (A fast way to intersect two posting lists is to walk down both at the same time. If one list skips from 22 to 68, you can skip ahead to document 68 on the other list as well.)
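The "walk down both at the same time" trick from the parenthetical above looks like this in Python. It is a sketch that assumes both posting lists are sorted. The function name intersect is made up for illustration, and bisect_left from the standard library does the skipping ahead.

```python
from bisect import bisect_left


def intersect(a, b):
    """Intersect two sorted posting lists by walking both at once."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])       # document contains both words
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip ahead in a to the first entry that is >= b[j].
            i = bisect_left(a, b[j], i + 1)
        else:
            # Skip ahead in b to the first entry that is >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return result


civil = [3, 8, 22, 56, 68, 92]
war = [2, 8, 15, 22, 68, 77]
print(intersect(civil, war))  # [8, 22, 68]
```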

An exercise for students

Once you see how to intersect two words in an index, it’s not hard to do it for three or more words as well. Here’s a fun exercise: try to find all the documents below that contain the words “civil” and “war” and “reconstruction.”

civil: 1 9 15 19 22 35 38 48 53 55 65 68 73 78 82 88 91 99
war: 15 18 25 29 31 35 37 40 42 46 48 65 75 85 91 96
reconstruction: 35 42 48 64 73 91 95

The answer is at the end of the article.
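If you want to check your work by machine, here is one possible Python sketch for three or more words. The function names intersect and intersect_all are invented for illustration. Starting with the shortest list is a common shortcut, not something the article prescribes. The expected output is deliberately left out so the exercise is not spoiled.

```python
from functools import reduce


def intersect(a, b):
    """Walk two sorted posting lists together, keeping shared documents."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_all(posting_lists):
    """Intersect one or more posting lists, shortest first."""
    ordered = sorted(posting_lists, key=len)
    return reduce(intersect, ordered)


postings = {
    "civil": [1, 9, 15, 19, 22, 35, 38, 48, 53, 55, 65, 68, 73, 78, 82, 88, 91, 99],
    "war": [15, 18, 25, 29, 31, 35, 37, 40, 42, 46, 48, 65, 75, 85, 91, 96],
    "reconstruction": [35, 42, 48, 64, 73, 91, 95],
}

answer = intersect_all(postings.values())
print(answer)
```

Beginning with the shortest list keeps every intermediate result small, because the result can never be longer than the shortest list it came from.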


Read the entire article


        
    Companion site for Gary Fugere's online classes for those who earn a living without a job or those who would like to. News and resources for telecommuting, freelancing, time management, independent contracting, financial management, ecommerce, teaching on the Internet and much more.




      Creative Commons License
    Licensed to www.gsinet.org under a Creative Commons Attribution License.