by [TC]²

May 2005


In Search of…

 By Walt McKinney, [TC]²

I was just sitting at my desk wondering what year Christopher Columbus was born (not really, but let's pretend, shall we?). Immediately, I launched my Web browser and Googled the phrase ‘Christopher Columbus birth date’.

Google served up the answer lightning fast, considering that hundreds of millions of Web pages stand at the ready, poised to deliver content covering just about any conceivable topic. In the old days, I would have had to find an encyclopedia, thumb through a multitude of pages, and possibly inflict a paper cut upon myself in order to achieve what a Google search did for me in two seconds flat.

Google is my personal search tool of choice, but there are certainly alternatives. In this article, I'd like to share a few basic things about search engines. I won't go into the history of the search engine; instead, I'll focus on its current state and how to get the most out of it.

What is it?

So what exactly IS a search engine, you ask? Basically, it's a program that searches documents for specified keywords and returns a list of the documents where the keywords were found.

Typically, a search engine works by sending out a spider (also called a Web crawler) to collect as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document. When a user submits a query, the engine looks up the query terms in that index and returns the matching documents, ideally only the relevant ones. Each search engine uses its own proprietary algorithms to build its index.
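To make the spider-and-indexer idea a little more concrete, here is a minimal Python sketch of an index and a lookup. It is only an illustration: the page names and text are made up, the "crawl" is a hard-coded dictionary rather than a real spider, and real engines use far more elaborate (and proprietary) methods.

```python
from collections import defaultdict

# Hypothetical mini "crawl": URL -> page text. A real spider would fetch these.
PAGES = {
    "example.com/columbus": "Christopher Columbus was born in Genoa",
    "example.com/genoa": "Genoa is a port city in Italy",
    "example.com/ships": "Columbus sailed with three ships",
}


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query, sorted."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


INDEX = build_index(PAGES)
print(search(INDEX, "columbus"))
print(search(INDEX, "columbus genoa"))
```

The point of the index is that a query never has to reread the pages themselves; it only looks up each word and combines the resulting sets.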

A search directory is somewhat different from a search engine. A directory is a collection of categorized information about Web pages, rather than an index built from the content of those pages. Smaller search directories may have automated means by which Webmasters can submit URLs under specified categories; at some of the larger directories, staff members evaluate submitted sites and place the information in the proper categories. Search directories do not use spiders or Web crawlers.

The most significant search directories are owned by Yahoo! (dir.yahoo.com) and the Open Directory Project (www.dmoz.org). Google actually has its own directory (directory.google.com), but it doesn't create the directory itself; it gets its listings from the Open Directory Project.

How Does it Work?

So just how does a search engine work? The aforementioned spider (or Web crawler) scans a Web page's content and extracts the key search words that enable online users to find the pages they're looking for. Today's top search engines index hundreds of millions of pages and respond to tens of millions of queries per day. Astonishing!

Indexed Web pages are scanned for words occurring in the title, subtitles, meta tags and other positions of relative importance. The Google spider, for example, was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.
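As a rough illustration of scanning the "important" positions on a page, the Python sketch below pulls words from the title, the headings and the keywords meta tag, and drops the articles "a," "an" and "the." The class name, the sample page and the choice of which tags count as important are all assumptions made for this example; it is not how any particular spider actually works.

```python
from html.parser import HTMLParser

ARTICLES = {"a", "an", "the"}  # words skipped during indexing in this sketch


class FieldExtractor(HTMLParser):
    """Collect words from the title, headings, and keywords meta tag."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "heading": [], "meta": []}
        self._current = None  # field we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag in ("h1", "h2", "h3"):
            self._current = "heading"
        elif tag == "meta":
            attr_map = dict(attrs)
            if (attr_map.get("name") or "").lower() == "keywords":
                content = attr_map.get("content") or ""
                self._add("meta", content.replace(",", " "))

    def handle_endtag(self, tag):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = None

    def handle_data(self, data):
        # Text outside the tracked tags (body paragraphs, etc.) is ignored.
        if self._current:
            self._add(self._current, data)

    def _add(self, field, text):
        words = [w.strip(".,!?").lower() for w in text.split()]
        self.fields[field] += [w for w in words if w and w not in ARTICLES]


def extract_fields(html):
    """Return a dict of field name -> list of significant words."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


SAMPLE = """<html><head><title>The Life of Columbus</title>
<meta name="keywords" content="Columbus, explorer, Genoa"></head>
<body><h1>An Explorer from Genoa</h1><p>Body text is ignored here.</p></body></html>"""

print(extract_fields(SAMPLE))
```

Keeping the fields separate would let an indexer give a word in the title more weight than the same word in a heading, which is one plausible way to act on "relative importance."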

Suppose you have a Web page and you don't want it to be indexed by a spider. That's where the robot exclusion protocol comes in. Placing a properly configured robots.txt file in the root directory of your Web site will ‘tell’ visiting spiders to stay away from specified directories. (Compliance is voluntary, but well-behaved spiders honor the file.) For example, you may have a special short-term campaign that points users to a temporary Web page, so being indexed by search engines is not a concern. If you want to monitor Web traffic to that page's folder without spider visits showing up in your Web stat reports, a robots.txt file can keep the spiders out, and the visit statistics will not be skewed.
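Here is a small Python sketch of what such a file might look like and how a polite spider could check it, using the standard library's urllib.robotparser module. The /campaign/ folder, the www.example.com address and the "ExampleBot" spider name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask every spider to skip the /campaign/ folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /campaign/
"""


def allowed(robots_txt, url, agent="ExampleBot"):
    """Return True if the rules permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(ROBOTS_TXT, "http://www.example.com/campaign/offer.html"))
print(allowed(ROBOTS_TXT, "http://www.example.com/index.html"))
```

The first check prints False and the second prints True: only the campaign folder is off limits, and the rest of the site can still be crawled.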

Of course, if you want to be found, the robot exclusion protocol is not for you!

For the most part, being found on the Web is what Webmasters want, and tweaking Web pages for maximum search engine friendliness can involve many things (but that's another topic!).

Neat Tricks

Web pages are constantly changing, and spiders must continually crawl sites and collect updated information for indexing. Curious as to when a specific site was last indexed? Go to Google and type cache:www.microsoft.com (replace www.microsoft.com with the address of any Web site of your choosing).

Hit the return key, or click ‘Google Search'. The page displayed is from the Google cache. Details in the upper portion of the page describe when the snapshot of that particular page was taken. There is also a hyperlink here that will take you to the current live Web page.

If a particular site doesn't show up using the ‘cache’ command, then the pages have not been found or indexed. That could be a red flag for poor keyword placement and page design that is unfriendly to search engines. (Or, it could mean that the robot exclusion protocol is working!)

Suppose you want to pull up a list of all pages indexed within a particular site. You can type the following: site:www.microsoft.com (again, replace www.microsoft.com with the address of any Web site of your choosing).
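If you check sites this way often, the operator queries are easy to build in a script. The Python sketch below only constructs the search address; the https://www.google.com/search endpoint and its q parameter are assumptions about how the query is passed, and whether a given engine honors a particular operator is up to that engine.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.google.com/search"  # assumed search endpoint


def operator_query(operator, site):
    """Build an operator query such as 'site:www.microsoft.com'."""
    return f"{operator}:{site}"


def search_url(query):
    """Return a search address with the query safely URL-encoded."""
    return SEARCH_URL + "?" + urlencode({"q": query})


print(operator_query("site", "www.microsoft.com"))
print(search_url(operator_query("site", "www.microsoft.com")))
```

Note that the colon is percent-encoded in the address, so the second line printed ends in q=site%3Awww.microsoft.com.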

If you get bored checking the Google cache or searching for stuff, you can always go Googlewhacking: hunting for a two-word query (no quotation marks) that returns exactly one result.

It's All in How You Ask

Finding the tiniest needle in the vast cyber haystack can be made a little easier through properly built queries. The query can be as simple as a single word. Building a more complex query requires the use of Boolean operators that allow you to refine and extend the terms of your search.

The Boolean operators most often seen are:

•  AND – All terms in the query joined by “AND” must appear on the pages returned

•  OR – At least one term in the query joined by “OR” must appear on the pages returned

•  NOT – The term (or terms) in the query following “NOT” must not appear on the pages returned. Some search engines use the operator “-” in place of the word NOT.

•  Quotation Marks – All words included between quotation marks are treated as a phrase, and that phrase must be found within the returned pages.
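One way to picture AND, OR and NOT from the list above is as operations on sets of pages. The Python sketch below assumes a tiny, made-up index that maps each word to the pages containing it; the page names and helper functions are hypothetical. Phrase searches in quotation marks are left out, because they also require knowing where each word sits on the page.

```python
# Hypothetical index: word -> set of page names containing that word.
INDEX = {
    "columbus": {"page1", "page2", "page4"},
    "ohio": {"page2", "page3"},
    "explorer": {"page1", "page4"},
}
ALL_PAGES = {"page1", "page2", "page3", "page4"}


def pages_with(term):
    """Pages that contain the term (empty set if the term is unknown)."""
    return INDEX.get(term.lower(), set())


def q_and(a, b):
    """AND: pages found in both result sets."""
    return a & b


def q_or(a, b):
    """OR: pages found in at least one result set."""
    return a | b


def q_not(a):
    """NOT: every known page that is absent from the result set."""
    return ALL_PAGES - a


# columbus AND explorer
print(sorted(q_and(pages_with("columbus"), pages_with("explorer"))))
# columbus OR ohio
print(sorted(q_or(pages_with("columbus"), pages_with("ohio"))))
# columbus AND NOT ohio
print(sorted(q_and(pages_with("columbus"), q_not(pages_with("ohio")))))
```

The last query is the handy one for our friend Columbus: it keeps pages about the explorer while dropping those that mention Ohio.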

As Web content continues to grow by leaps and bounds, search engine query building will be an important factor in finding precisely what you are looking for.


Want more information about search engines? Here are a few good online resources:

http://searchenginewatch.com/

http://www.marketposition.com/

http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html

 

Oh and by the way, Christopher Columbus was born in 1451!

