
Search Engines SEO

Meta Search Engines


 

 

In a meta-search engine, you submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages. Within a few seconds, you get back results from all the search engines queried. Meta-search engines do not own a database of Web pages; they send your search terms to the databases maintained by search engine companies.
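The simultaneous fan-out described above can be sketched roughly as follows. This is a minimal illustration, not how any particular meta-searcher is built: `engine_a`, `engine_b`, and the `example` URLs are hypothetical stand-ins for real search engines, and a real tool would make network requests where these functions return canned lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real search engines: each maps a query to a
# ranked list of URLs. A real meta-searcher would make network requests here.
def engine_a(query):
    return [f"http://a.example/{query}/1", f"http://shared.example/{query}"]

def engine_b(query):
    return [f"http://shared.example/{query}", f"http://b.example/{query}/1"]

ENGINES = {"engine_a": engine_a, "engine_b": engine_b}

def meta_search(query, engines=ENGINES):
    """Send the same query to every engine at once; return results per engine."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # Submit all queries first so the engines work concurrently.
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        # Then wait for each engine's answer.
        return {name: future.result() for name, future in futures.items()}

results = meta_search("recipes")
```

Note that the meta-searcher holds no pages of its own: everything in `results` comes from the engines it queried.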

 

In most cases, the idea of meta-searching is much better than the reality. You would expect to save a lot of time by searching in one place, without having to learn and use several separate search engines. Whether you actually do depends largely on which databases a meta-searcher queries and how it organizes the results. It cannot be better than the databases it queries.

 

This page used to include a number of meta-searchers that do no more than what the preceding paragraph describes. Advances in technology have caused two types of meta-search engines to rise far above the others, and I no longer list anything but a selection of these "smarter" search engines.

 

There are two families of smarter meta-search engines at this time:

 

·         Meta-searchers that search good databases, accept complex searches, integrate results well, eliminate duplicates, and offer additional features such as clustering by subjects within your search results.

 

·         Tools for serious digging in many resources, with powerful abilities to help you find what you seek within search results. These are appropriate for serious researchers who need to probe a topic in depth.
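As a rough illustration of what "integrate results well" and "eliminate duplicates" might involve, the sketch below interleaves ranked lists from several engines and drops repeated URLs. The `normalize` rule and the sample URLs are assumptions made for the example; real tools use more careful URL comparison and ranking.

```python
from itertools import zip_longest

def normalize(url):
    """Crude duplicate key (an assumption): ignore case, scheme and trailing slash."""
    key = url.lower()
    for prefix in ("http://", "https://"):
        if key.startswith(prefix):
            key = key[len(prefix):]
    return key.rstrip("/")

def merge_results(result_lists):
    """Interleave ranked lists (all first hits, then all second hits, ...) and drop duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is None:          # this engine returned fewer results
                continue
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged

merged = merge_results([
    ["http://a.example/1", "http://shared.example/page/"],
    ["https://shared.example/page", "http://b.example/1"],
])
# merged == ["http://a.example/1", "https://shared.example/page", "http://b.example/1"]
```

Interleaving by rank, rather than listing one engine's results after another's, keeps each engine's top hits near the top of the combined list.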

 

So there is all of this hype associated with metasearch engines - sites that will take your keywords, send them to a large number of search engines at once and return the results to you. What's the big deal? Well, they are supposed to make it more convenient - the justification is "why search for something at several sites one after the other, when a computer can combine the results for you?" Personally, I can think of a couple of reasons why I'd rather do it myself...

 

First off, a metasearch engine can only take inputs from you that are supported by all of the search engines it uses. So, the lowest common denominator of those sites' features determines what you can enter. Also, you are now trusting the meta-search engine to interact with those search engines properly.
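The "lowest common denominator" point amounts to a set intersection. In the sketch below, the engine names and feature sets are invented for illustration; real engines' capabilities will differ.

```python
# Hypothetical feature sets; real engines' capabilities will differ.
ENGINE_FEATURES = {
    "engine_a": {"keywords", "phrases", "boolean", "date_range"},
    "engine_b": {"keywords", "phrases", "boolean"},
    "engine_c": {"keywords", "phrases"},
}

def common_features(engine_features):
    """Features every engine supports: the intersection of all feature sets."""
    return set.intersection(*engine_features.values())

usable = common_features(ENGINE_FEATURES)
# usable == {"keywords", "phrases"}
```

Adding one weak engine to the mix shrinks what you can enter, even if every other engine supports richer queries.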

 

Secondly, what is the point of convenience? The real convenience is finding the best result quickly, not getting the largest number of bad results. You, as a human being, will be much more intelligent in how you search for something. Don't trust a meta-search engine which gives you the results of several sites in the order that they return results. You should go to the most likely search engine for your topic. If that doesn't show anything, you would then go to the next most likely. It's the thought process that makes searching the Internet more powerful. All a meta-search engine does is return more noise to you with less typing.
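The manual strategy described here (try the most likely engine first, then move on only if it finds nothing) can be written down as a short loop. The two engines below are hypothetical placeholders, and their order stands in for your own judgment about which engine suits the topic.

```python
def search_in_order(query, engines):
    """Try engines from most to least likely; stop at the first one with results."""
    for name, search in engines:
        hits = search(query)
        if hits:
            return name, hits
    return None, []

# Hypothetical engines, ordered by how likely each is to cover the topic.
def recipe_engine(query):
    return []  # pretend the topical engine finds nothing this time

def general_engine(query):
    return [f"http://general.example/?q={query}"]

source, hits = search_in_order(
    "poutine",
    [("recipes", recipe_engine), ("general", general_engine)],
)
# source == "general"
```

The ordering of the list is where the human thought goes. A meta-searcher that queries everything at once skips that step.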

 
How Search Engines Work


 

 

 

Before we get into some of the strengths and weaknesses of search engines, it is important to know a bit about search engine basics. A few concepts are worth revisiting, as they will help you understand some of the hows and whys of search engines later on.

Search Engine Objectives

 

What exactly is a search engine trying to achieve? That's a good question, and the answers are not necessarily obvious. Consider the following as an incomplete list of goals:

 

·         Make money

·         Maintain user loyalty

·         Let users access information quickly

·         Spin off other services (shopping, advertisements, etc)

·         Let users access information easily

·         Stay up to date

·         Cover as much of the Internet or topic as possible

·         Provide value-added search features (more intelligent or customizable searches)

 

Some search engines achieve these goals better than others. Unfortunately, this list is probably ranked by priority, meaning that things like search GUI improvements aren't high on their list. Coverage of the Internet is always a problem as well. As I mentioned earlier, these machines are already massive in memory, disk space, and CPU performance. However, few of the search engines cover more than 10% of the Internet and none cover more than 20% of its pages. If you find a topical search engine (e.g. Canadian recipes), you can bet that it has a much better chance of achieving its goal of covering the topic fully.

 

The fact that search engines can't achieve all of these objectives doesn't mean that they aren't useful. Of course they have to make money - who do you think pays for the expensive machines and high-bandwidth Internet connection? The trick is to know their weaknesses, know how they all fit together to complement one another, and then get comfortable at moving between the search engines as necessary.

 

Web Crawling

 

As I discussed briefly in a previous section of the book, web crawling is the technique of navigating around part or all of the web. The term "crawling" is a good one - from any point on a spider's web, you can trace paths that eventually cover the entire web. The trick is to pick a smart path so that you're not duplicating or retracing your steps too frequently.

 

Search engines have the same problem - starting from a single point, they want to get around as much of the Internet as possible. However, unlike a nicely structured spider's web, the Internet has all sorts of broken links, missing links, gaps and poor connections. Hence, navigating a spider's web is a lot easier than navigating the web we call the Internet.

With web crawling, the general steps that a crawler takes are as follows:

 

·         Check for the next page to download - the system keeps track of pages to download in a "queue"

·         Check to see if the page is "allowed" to be downloaded - this is done by checking a "robots exclusion" file and also reading the header of the page to see if any exclusion instructions were provided. Some people don't want their pages archived by search engines.

·         Download the whole page

·         Extract all links from the page (additional web site and page addresses) and add those to the queue mentioned above to be downloaded later

·         Extract all words, save them to a database associated with this page, and save the order of the words so that people can search for phrases, not just keywords

·         Optionally filter for things like adult content, language type for the page, etc.

·         Save the summary of the page and update the "last processed" date for the page so that the system knows when it should re-check the page at a later date.
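The steps above can be sketched as a small loop. To keep the example runnable without network access, it crawls a made-up in-memory "web" (`PAGES`), and `DISALLOWED` is a stand-in for robots exclusion rules. All URLs and names are assumptions for illustration. The optional filtering and "last processed" bookkeeping steps are left out for brevity.

```python
import re
from collections import deque

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://site.example/": 'Welcome home <a href="http://site.example/about">about</a> '
                            '<a href="http://site.example/private">private</a>',
    "http://site.example/about": 'About this site <a href="http://site.example/">home</a>',
    "http://site.example/private": "Secret notes",
}
DISALLOWED = {"http://site.example/private"}  # stand-in for robots exclusion rules

def crawl(start_url):
    queue = deque([start_url])   # pages waiting to be downloaded
    queued = {start_url}         # avoid queueing the same page twice
    index = {}                   # url -> words in the order they appear
    while queue:
        url = queue.popleft()                 # next page from the queue
        if url in DISALLOWED:                 # is the page allowed?
            continue
        html = PAGES.get(url)                 # "download" the whole page
        if html is None:                      # broken link
            continue
        for link in re.findall(r'href="([^"]+)"', html):   # extract links
            if link not in queued:
                queued.add(link)
                queue.append(link)
        text = re.sub(r"<[^>]+>", " ", html)                # strip tags
        index[url] = re.findall(r"[a-z0-9]+", text.lower()) # words, in order
    return index

index = crawl("http://site.example/")
```

Keeping the words in order is what later makes phrase searches possible. The excluded page is discovered through a link but never downloaded or indexed.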

 

These general steps describe how web crawlers work, although they are typically much more complex. The process of downloading the pages and indexing the contents is very difficult. For instance, if the system downloaded one page at a time, covering the Internet would take several years, as there are typically "bottlenecks" or slow-downs between the web crawler and the various sites it wants to analyze. So, the system generally downloads thousands of pages at the same time, and processes them in parallel.
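To give a feel for why parallel downloading matters, here is a minimal sketch that uses a thread pool. The `fetch` function only sleeps to imitate a slow connection and the URLs are invented. A production crawler would need real HTTP requests, politeness limits per site, and far more workers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for a slow network download (hypothetical; no real request is made)."""
    time.sleep(0.05)
    return url, f"<html>content of {url}</html>"

urls = [f"http://site{i}.example/" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    # All 20 simulated downloads wait on the "network" at the same time.
    pages = dict(pool.map(fetch, urls))
elapsed = time.perf_counter() - start
```

Downloaded one at a time, these twenty pages would take about a second in total (20 x 0.05 s). With the waits overlapping, `elapsed` is typically a small fraction of that.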

 

Also, when indexing the pages and saving the contents for future searches, this must be done very quickly, and the information must be saved such that future searches don't take huge amounts of time. How this is done is beyond the scope of this text although it is frequently the topic of discussion in many computer science graduate research centers.
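Without going into how real engines do it, one widely taught structure for this job is the inverted index: a map from each word to the pages and positions where it occurs. The sketch below is a toy version with made-up documents. It is not a description of any particular search engine's internals.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} so lookups avoid scanning every page."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents that contain the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Every following word must appear at the next position in the same page.
            if all(start + i in index[w].get(doc_id, []) for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

docs = {
    "page1": "to be or not to be",
    "page2": "local events to attend",
}
index = build_index(docs)
# phrase_search(index, "not to be") == {"page1"}
```

Because word positions are stored, a query only touches the entries for its own words, and phrases can be matched without re-reading the pages.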

 

Web crawling strives to cover as much of the Internet as possible, but some paths through the Internet are not complete and will frequently lead to dead ends. Let's say that you create a web page and it is not referenced by any other page on the Internet. Unless you submit your page to search engines to be indexed, no one will even know it exists! There are quite a few pages like this that may contain very useful information but are not indexed or searchable because they are not referenced by many (if any) pages. Even if you and several of your friends put up web pages and reference each other, if none of those pages is reached by a crawler, they will never be found by that crawler.

 

Web crawling is also a very time-consuming task - some search engines brag that their crawlers completely recheck their indexed pages at least once a month! This isn't very useful if you're expecting to find current information via that crawler. The problem is one of sheer volume - search engines have to go through billions of pages of information and this takes a huge amount of time. Unfortunately, there aren't good ways for search engines to know how frequently information is updated. So, they place equal importance on refreshing Shakespeare's online works and your city's local events listing.
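A quick back-of-the-envelope calculation shows why a monthly recheck is already demanding. The figures below (one billion pages, a 30-day cycle) are assumptions picked for illustration, not numbers reported by any search engine.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def required_pages_per_second(total_pages, refresh_days):
    """Average download rate needed to revisit every page once per refresh period."""
    return total_pages / (refresh_days * SECONDS_PER_DAY)

# Assumed figures for illustration: one billion pages, rechecked every 30 days.
rate = required_pages_per_second(1_000_000_000, 30)
# rate is roughly 386 pages per second, sustained around the clock
```

Rechecking the same pages daily instead of monthly would multiply that rate by thirty.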

 

Engine Maintenance

 

 

Search engine maintenance is a broad term - these search engines run on very powerful and critical machines. Very rarely do these machines go down for maintenance - these companies can't afford to lose their user following, nor can they afford to lose the revenue associated with shutting down their search engines. Consequently, the design of these large search systems allows them to take down parts of the system without interrupting the users. For instance, Yahoo isn't a single machine - it's a cluster of many machines. You don't realize this, as you just type in www.yahoo.com and assume that you're connecting to one machine. In reality you are sent to one of many machines - if that machine crashes, when you refresh your page you are automatically sent to another without being aware of the problem. This way, they can take down machines for maintenance without bringing down the service.
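The idea of routing users around a machine that is down can be sketched with simple round-robin selection. The `Cluster` class and the machine names are hypothetical. The text does not say how any real site implements this, and real deployments typically rely on DNS and dedicated load balancers.

```python
import itertools

class Cluster:
    """Hands out requests to healthy machines in turn (round-robin)."""

    def __init__(self, machines):
        self.machines = list(machines)
        self.down = set()
        self._cycle = itertools.cycle(self.machines)

    def take_down(self, machine):
        self.down.add(machine)

    def bring_up(self, machine):
        self.down.discard(machine)

    def route(self):
        """Return the next healthy machine, skipping any that are down."""
        for _ in range(len(self.machines)):
            machine = next(self._cycle)
            if machine not in self.down:
                return machine
        raise RuntimeError("no machines available")

cluster = Cluster(["web1", "web2", "web3"])
cluster.take_down("web2")                      # maintenance on one machine
served_by = [cluster.route() for _ in range(4)]
# served_by == ["web1", "web3", "web1", "web3"]
```

From the user's side nothing changes: requests keep being answered, just never by the machine under maintenance.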

 

Web crawling engines are typically self-maintaining, and humans only work on the system to remove bugs, enhance features, update the look-and-feel of the page, etc. These engines largely look after themselves because they go out and find the pages, process them, and update them periodically without human instruction. Occasionally humans may get involved in the process if they have been requested to remove inappropriate or illegal information - this rarely happens, however.

 

Portals require quite a bit more maintenance - portals are typically maintained by humans, perhaps with assistance from the online community. The reason that portals are typically more accurate (not necessarily complete) is that sites are added by humans into one or more appropriate categories. If the portal is well-maintained, new sites are added quickly, and old and outdated sites are frequently removed. As the portal becomes larger, this task becomes more and more daunting and runs into some of the problems that the web crawler encounters.

 

Because search engines are very complex and have a huge number of users accessing them at any one time, it is clear why these companies are often hesitant to update or upgrade the software significantly. Search engines don't change a lot in their basic functionality - occasionally you get new colors, logos, or services tacked on, but the basic searching functionality remains the same. This is unfortunate, as we now know much more about what makes a good search engine. However, updating the software and adding new features without breaking the existing system or slowing it down is a huge risk for the companies. They know that if they leave users unhappy for more than a few days, those users will typically move to a new search engine and be reluctant to return. Perhaps as people learn to work with multiple search engines rather than become blindly loyal to one or two, the companies will realize that they have to attract users with features, not flashiness.

 

Performance

 

Search engines have a few performance criteria that are critical to their success:

·         Speed in returning results based on a user's query

·         Speed in loading the page (modem users won't want long loads)

·         Compliance with a large number of browsers so that the engine isn't excluding any significant percentage of the users out there on the Net

·         Availability, i.e. how frequently the web site crashes or goes down for maintenance

·         Completeness of the results that are returned to the user - i.e. does the search engine know about many of the sites relevant to the topic the user is searching for?

·         Fiscal performance, i.e. does the search engine actually make any money?

 

Some of these performance criteria are of concern to the user, others are not. The first five criteria will certainly affect the sixth. As a user, you will want to be able to load the page quickly. This means that the site had better be on a fast network, and the company had better have designed the page for fast downloads (e.g. no massive images or bloated code). You will also want to be able to go to that site at any time - if the site isn't available to you, chances are you will go to another search engine that you decide you like every bit as much. You will want the search engine to be relatively current, i.e. not pointing you to a bunch of pages that no longer exist. Furthermore, you will want more than a few results if you know that there are a large number of pages out there that cover the topic for which you need information.

 

From the company's perspective, they have to turn a profit. How is this done? Well, they want to attract as many users as possible. Then, they have to figure out how to turn all of these loyal users into revenue. Generally this is done with advertising. I'm skeptical that people will ever pay money for standard searches - perhaps for credit reports, academic literature searches, etc., but not for general Internet information. Maybe this is not a good thing - perhaps more money needs to be invested in information searching to make it easy for people to find information. This could come from government, private corporations or individuals - but it is clear that as information grows exponentially, the search companies need to do more research into strategies for organizing and navigating this wealth of information.

 

At present, it is unclear how *.com sites are going to make money. Advertising can only get you so much money, and it is not clear how much on-line advertising impacts the average user. As a result, some of the *.com companies that have gone public on the stock exchanges are tremendously unprofitable yet their share prices are quite high. What money these companies do make goes towards salaries and advertising - very little of it seems to go into research and development. Hopefully this trend will change.

 

 