Types Of Search Engines
Search engines come in two major flavors:
· Web crawlers
· Web portals
These two types of search engines compete for your attention and loyalty, and they represent two different philosophies for finding things. Both have their strengths and weaknesses, so you should use the two to complement one another rather than arbitrarily choosing one over the other simply because it is easier to use.
Web Crawlers
A web crawler (also known as an "indexer") scours the Internet looking for pages to "index". It generally starts with a set of default web addresses and downloads them. For each page, it pulls out all of the addresses (links) contained within the page - it will visit these later. It then indexes all of the words in that page: it stores every word and phrase in a database so that you can later search that database for a phrase that might exist in this page. Some other information about the page is stored as well, e.g. the time when it was last downloaded, the time when it was last updated, summary words, the title of the page, etc. Every word in that page is searchable by a user once it is saved to the database - this is how you can search for phrases or keywords in any document the crawler has indexed.
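The two steps described above (pull out the links, then index every word) can be sketched for a single page as follows. This is only a minimal illustration using Python's standard html.parser module; the names PageParser and index_page, the sample page and the example.com address are assumptions made up for this sketch, and a real crawler would also record the download time, title and other details mentioned above.

```python
from html.parser import HTMLParser
import re


class PageParser(HTMLParser):
    """Collects link targets and text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []        # addresses found in <a href="..."> tags
        self.text_parts = []   # pieces of text found between tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def index_page(url, html, index):
    """Add every word of the page to index (word -> set of URLs).

    Returns the links found in the page so they can be visited later.
    """
    parser = PageParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9]+", " ".join(parser.text_parts).lower())
    for word in words:
        index.setdefault(word, set()).add(url)
    return parser.links


# Hypothetical page, used only to exercise the sketch.
index = {}
html = (
    "<html><title>Tennis Tips</title><body>Improve your serve. "
    '<a href="/rackets.html">Rackets</a></body></html>'
)
links = index_page("http://example.com/", html, index)
```

After this runs, links holds the addresses to crawl next, and index maps each word to the set of pages that contain it, which is what a keyword search consults.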
Note that the crawler pulls out all of the links within that page for future reference - this is where the "crawler" concept comes in. Theoretically, a web crawler could start with one page, grab all of the links from that page, search those pages in turn (thereby grabbing more links) and continue until it has searched all pages on the Internet. The problem with this strategy is clear - you can't get to all parts of the Internet from a single point. Some pages just aren't referenced by other pages, and the crawler sometimes isn't terribly smart about which "path" it takes in crawling the Internet.
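The crawling strategy, and the reason it cannot reach everything, can be shown with a tiny made-up web held in memory. This is a sketch only: the crawl function and the four page names are hypothetical, and the breadth-first order used here is just one of the "paths" a crawler might take.

```python
from collections import deque


def crawl(start, link_graph):
    """Breadth-first crawl; link_graph maps each page to the pages it links to."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:   # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical four-page web: no other page links to "orphan".
web = {
    "home": ["news", "about"],
    "news": ["home", "about"],
    "about": [],
    "orphan": ["home"],
}
visited = crawl("home", web)
unreached = sorted(set(web) - set(visited))
```

Starting from "home", the crawl never finds "orphan", even though "orphan" itself links back to "home". Only a link pointing to a page makes it discoverable.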
Also, this is a very intensive task, as you can probably imagine. Imagine taking every word in a multi-page document, saving it to a database and linking all the words together in order (e.g. word 1 was "bah", word 2 was "blah", ...) so that one can search for phrases. This is very time-consuming and computationally expensive. The crawler may also analyze the words to determine whether the content is adult, figure out the language (English, Spanish, etc.) and try to summarize the content to help prioritize search results.
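One common way to "link the words together in order" is to store the position of each word alongside the page it came from, so that a phrase matches only where its words sit at consecutive positions. The sketch below assumes that approach; the function names and the two sample documents are invented for illustration and say nothing about how any particular search engine does it.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} for every word in every document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents that contain the phrase's words in a row."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must sit exactly i positions after the first.
            if all(start + i in index[w].get(doc_id, ())
                   for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical documents, used only to exercise the sketch.
docs = {
    "a": "Subsidence of tectonic plates in California",
    "b": "Tectonic activity moves plates",
}
index = build_positional_index(docs)
matches = phrase_search(index, "tectonic plates")
```

Both documents contain "tectonic" and "plates", but only document "a" has them next to each other, so only "a" matches the phrase. Storing a position for every word of every page is part of what makes indexing so expensive.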
Web crawlers generally come in two flavors: global and local. Local crawlers have been configured so that they won't stray too far from the point at which they started. You may see these on some commercial sites, e.g. as "Search the site" features. These sites are fully indexed as described above, but if a link ever points off of the site (e.g. to www.fakeCompanyZZZ.com) then the crawler ignores it. Otherwise, it would start racing around the Internet and fill up that company's search database very quickly! Local search crawlers are generally provided as a service within a site so that users can quickly find information (e.g. a product, an answer to a question, etc.).
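The rule a local crawler applies (follow a link only if it stays on the site) can be sketched with Python's standard urllib.parse module. The function name same_site_links and the www.example.com addresses are assumptions for this illustration, and "same site" is taken here to mean "same host name", which is only one possible definition.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only those on the same host."""
    site = urlparse(page_url).netloc.lower()
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)   # turn relative links into full ones
        if urlparse(absolute).netloc.lower() == site:
            kept.append(absolute)
    return kept


# Hypothetical page with two on-site links and one off-site link.
links = same_site_links(
    "http://www.example.com/products/index.html",
    ["chairs.html", "/support/faq.html", "http://www.fakeCompanyZZZ.com/"],
)
```

The two on-site links are kept as full addresses, and the link to www.fakeCompanyZZZ.com is dropped, so the crawl never leaves the site.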
Global web crawlers are just that - they crawl around the Internet trying to find as much information as they can to add to their databases. The machines behind these crawlers are much larger - phenomenally large, in fact. Whereas your typical PC might have 64 MB of RAM, these machines will have 4000 MB of RAM or more, and their disk space may be on the order of 1,000 GB (yes, GB). Massive research and development has gone into making these sites very fast - and they are usually quite fast. Considering the amount of data that is being searched, it is incredible that they generally find search results faster than you could look up something in an encyclopaedia.
Web crawlers are typically automatic, with only a bit of human maintenance. As a result, the information is stored away as a bunch of keywords associated with the document, with no human summary or classification. This makes these types of search engines excellent for finding specific information, but not nearly as efficient for common information. If you type "tennis" into a crawler-based search engine you will find hundreds of thousands of hits. A portal, on the other hand, may have a nice category that organizes all information associated with tennis.
Portals
"Web portal" is another general term, but a portal is generally considered to be a site that organizes information by topic. Whereas a web indexer lets you define the search criteria and searches all pages, a portal organizes sites by topic to help you find what you're looking for. The problem with this is that someone else's idea of where a site belongs may not be the same as yours. For example, does "furniture refinishing" fit as a hobby, home repair, or antique topic? The large portals work hard to catch all of the intuitive categorizations of a given topic, but this doesn't always work. Portals will also let you search their archives much like a web indexer above, but you only search the summaries and titles of sites in the portal, not their contents.
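A portal's directory, and the way its search looks only at titles and summaries, can be sketched as below. Everything here is hypothetical: the Listing type, the category names, the sample site and the simple "every query word must appear" matching rule are assumptions chosen to keep the illustration short.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    title: str
    summary: str
    url: str


# Hypothetical entry that an editor has filed under two topics.
refinishing = Listing(
    title="Refinishing Old Furniture",
    summary="Step-by-step guide to stripping and staining",
    url="http://example.com/refinish",
)

directory = {
    "Hobbies/Woodworking": [refinishing],
    "Home/Repair": [refinishing],
    "Sports/Tennis": [Listing("Tennis Basics", "Rules and scoring", "http://example.com/tennis")],
}


def search_directory(directory, query):
    """Return (category, title) pairs whose title or summary has every query word."""
    terms = query.lower().split()
    results = []
    for category, listings in directory.items():
        for item in listings:
            haystack = (item.title + " " + item.summary).lower()
            if all(term in haystack for term in terms):
                results.append((category, item.title))
    return results


found = search_directory(directory, "refinishing furniture")
missed = search_directory(directory, "polyurethane")
```

The first query finds the site under both categories where it was filed. The second finds nothing, even if the site's pages discuss polyurethane at length, because the page contents were never indexed.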
Portals can be very useful for finding relatively straightforward information - e.g. if you type in "refinishing furniture", you will likely find what you want if you go to a major portal. However, if you're looking for occurrences of a phrase, or something relatively complex (e.g. a paper on a certain scientific topic), you may not have much luck.
The goal with a portal is to look for types of information, not for the information itself. This is an important point. Whereas you might want to query a web indexer with "subsidence of tectonic plates in California", you would want to query a web portal with "geophysical research" and manually search through the resulting sites yourself.
Some web portals constrain themselves to specific topics, e.g. everything about PC video games. These portals are generally maintained by a couple of administrators and supported by a large user community. These sites are frequently of the best quality, as the administrators have the time to check links, review their quality and accuracy, and make sure that things are laid out properly. Large portals, by contrast, will frequently get stale: placement of web sites within the portal may be non-intuitive, and frankly the site is just too big for anyone to make sure that it is maintained superbly.
Portals are very efficient for finding common information, but they are unable to organize everything, so specific information isn't nearly as easy to find. This is one of the first rules to know when deciding between a crawler and a portal in order to find information.