How Google Works

Essay by review • January 16, 2011 • Essay • 1,275 Words (6 Pages) • 1,404 Views

How Google Works

If you aren't interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price's wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

* Googlebot, a web crawler that finds and fetches web pages.

* The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.

* The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
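To make the relationship between the indexer and the query processor concrete, here is a minimal sketch of a word index in Python. It is an illustration only, not Google's implementation: the function names build_index and search, the sample pages, and the rule that a page matches only when it contains every query word are all assumptions, and no relevance ranking is attempted.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the pages for the first word, then intersect with the rest.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical sample pages, invented for this example.
pages = {
    "a.html": "google runs on low-cost computers",
    "b.html": "the indexer sorts every word",
    "c.html": "google indexer and query processor",
}
index = build_index(pages)
print(sorted(search(index, "google indexer")))  # prints ['c.html']
```

Looking up a word in the index is a single dictionary access, which is why the index is built ahead of time rather than scanning every page for each query.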

Let's take a closer look at each part.

Googlebot, Google's Web Crawler

Googlebot is Google's web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It's easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn't traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google's indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it's capable of doing.
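One simple way to picture this deliberate slowdown is a scheduler that enforces a minimum gap between requests to the same host while leaving other hosts unaffected. The sketch below is an assumption-laden illustration: the class name PolitenessScheduler, the two-second delay, and the idea of passing the current time in as a number are all invented for the example and are not taken from Googlebot.

```python
from urllib.parse import urlparse

class PolitenessScheduler:
    """Track the earliest time each host may be contacted again."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed request time

    def schedule(self, url, now):
        """Return the time at which url may be fetched and reserve that slot."""
        host = urlparse(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        # The next request to this host must wait at least min_delay more.
        self.next_allowed[host] = start + self.min_delay
        return start

# Hypothetical delay of two seconds per host.
scheduler = PolitenessScheduler(min_delay_seconds=2.0)
t1 = scheduler.schedule("http://example.com/a", now=0.0)
t2 = scheduler.schedule("http://example.com/b", now=0.0)
t3 = scheduler.schedule("http://example.org/a", now=0.0)
print(t1, t2, t3)  # prints 0.0 2.0 0.0
```

The second request to example.com is pushed back by two seconds, while the request to example.org can go out immediately, so total throughput stays high even though each individual server sees a slow trickle.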

Googlebot finds pages in two ways: through the Add URL form, www.google.com/addurl.html, and by following the links it finds while crawling the web.

Unfortunately, spammers figured out how to create automated bots that bombarded the Add URL form with millions of URLs pointing to commercial propaganda. Google rejects URLs submitted through the Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. The Add URL form now also includes a test to stop spambots: it displays some squiggly letters designed to fool automated "letter-guessers" and asks you to enter the letters you see, something like an eye-chart test.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that covers broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page on the web. Because the web is vast, a deep crawl takes time, so some pages may be crawled only once a month.
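The link-harvesting step can be sketched with Python's standard library. This is only an illustration of the idea: the LinkExtractor class, the extract_links helper, and the sample HTML are hypothetical, and nothing is actually fetched from the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content standing in for a downloaded page.
html = '<p><a href="/about.html">About</a> <a href="http://example.org/">Other</a></p>'
queue = deque()
queue.extend(extract_links("http://example.com/index.html", html))
print(list(queue))
# prints ['http://example.com/about.html', 'http://example.org/']
```

Each URL taken from the front of the queue would be fetched and passed through the same extraction step, which is how a crawl spreads outward from a small set of starting pages.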

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of "visit soon" URLs must be constantly examined and compared with URLs already in Google's index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Second, Googlebot must determine how often to revisit a page. On the one hand, it's a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.
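The duplicate check on the "visit soon" queue can be sketched as a queue paired with a set of URLs that have already been seen. The Frontier class below is a hypothetical illustration, not a description of Googlebot's actual data structures; it assumes URLs are compared as exact strings.

```python
from collections import deque

class Frontier:
    """A 'visit soon' queue that refuses URLs it has already seen."""

    def __init__(self, already_indexed=()):
        # URLs already in the index count as seen from the start.
        self.seen = set(already_indexed)
        self.queue = deque()

    def add(self, url):
        """Queue url unless it is a duplicate; return True if it was queued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        """Return the next URL to fetch, in first-in, first-out order."""
        return self.queue.popleft()

    def __len__(self):
        return len(self.queue)

frontier = Frontier(already_indexed={"http://example.com/"})
candidates = [
    "http://example.com/",    # already indexed
    "http://example.com/a",
    "http://example.com/a",   # duplicate within the queue
    "http://example.com/b",
]
added = [frontier.add(url) for url in candidates]
print(added, len(frontier))  # prints [False, True, False, True] 2
```

A set gives a constant-time membership test on average, which matters when the queue is checked for every link harvested from every page.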

To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. These crawls are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google both to make efficient use of its resources and to keep its index reasonably current.
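A recrawl rate "roughly proportional to how often the pages change" can be illustrated with a small calculation. Everything specific in this sketch is an assumption made for the example: the function name recrawl_interval_days, the visits_per_change factor, the one-hour minimum, and the 30-day maximum (chosen to echo the once-a-month deep crawl). Google's real scheduling policy is not described in the text.

```python
def recrawl_interval_days(changes_observed, days_observed,
                          visits_per_change=1.0,
                          min_interval=1 / 24, max_interval=30.0):
    """Estimate how many days to wait before revisiting a page.

    The crawl rate is set proportional to the observed change rate,
    then the resulting interval is clamped to [min_interval, max_interval].
    """
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    change_rate = changes_observed / days_observed  # changes per day
    if change_rate == 0:
        # A page never seen to change is left to the slowest schedule.
        return max_interval
    crawl_rate = visits_per_change * change_rate    # visits per day
    return min(max_interval, max(min_interval, 1.0 / crawl_rate))

# Hypothetical observations over a 30-day window.
newspaper = recrawl_interval_days(changes_observed=30, days_observed=30)
stock_quotes = recrawl_interval_days(changes_observed=3000, days_observed=30)
static_page = recrawl_interval_days(changes_observed=0, days_observed=30)
print(newspaper, static_page)  # prints 1.0 30.0
```

Under these assumed numbers the newspaper page is revisited daily, the stock-quote page hits the one-hour floor, and the unchanged page waits for the monthly pass.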

Google's Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google's index database. This index

...

Only available on ReviewEssays.com