When search engines (such as Yahoo or Alta Vista) are used to find information on the Internet, the results normally come from giant indexes or databases rather than from the Internet itself in real time. Because the Internet changes constantly, a search engine's index must be continually updated. Spiders are the tools used to accomplish this critical task. They work in tandem with indexes and search software to comb the Web for information. Without spiders, it would be difficult to find new Web sites or current content on existing ones. Also called crawlers, ants, or wanderers, spiders are technically members of the bot family: software programs that operate unattended, usually on the Internet. For that reason, spiders are often referred to as bots. However, spiders are not the same as intelligent agents, another kind of bot that has a wider range of capabilities, including interactivity.
Spiders travel from server to server, visiting different areas of the Internet: normally sites on the World Wide Web, but also File Transfer Protocol (FTP) sites and Gopher archives. This process, known as discovery, can be performed blindly or in a more directed manner. When done blindly, a spider attempts to visit every possible Internet Protocol (IP) address, the unique number assigned to each machine on the Internet. This approach takes longer than a directed approach, which involves searching registered domain names, the names used to identify sites (such as Intel.com). Both approaches have advantages and drawbacks. Large search engines often employ many spiders at once, working in parallel on many different machines or servers to archive the online world in one database. After spiders report back to search engines with new information, it often takes additional time before the engine's index is updated and the information becomes available to end users in search results.
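The discovery process can be illustrated with a small sketch. The article does not say how a spider chooses its next stop, so the version below assumes a common approach in which the spider starts from a few known pages and follows the links it finds, visiting each page once. The link table `LINKS`, the function `discover`, and the example addresses are all hypothetical, and an in-memory table stands in for real network requests.

```python
from collections import deque

# Hypothetical link table standing in for the Web: page -> pages it links to.
LINKS = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def discover(seeds, links):
    """Breadth-first discovery: visit each reachable page once, in the order found."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real spider would fetch and report the page here
        for target in links.get(page, []):
            if target not in seen:  # skip pages already queued or visited
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    for page in discover(["a.example/"], LINKS):
        print(page)
```

Keeping a set of pages already seen is what stops the spider from looping forever between pages that link to each other. Running several copies of such a loop on different machines, each with its own share of the queue, is one way the parallel operation described above could be arranged.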
Some spiders record every word on a Web site for their respective indexes, while others report only certain keywords listed in meta tags. Although they usually aren't visible to someone using a Web browser, meta tags are special codes that provide keywords or Web site descriptions to spiders. Sometimes the information listed in meta tags is incorrect or misleading, which causes spiders to deliver inaccurate descriptions of Web sites to indexes. In any case, the placement of keywords, whether within actual Web site content or in meta tags, is important to online marketers. The majority of consumers reach e-commerce sites through search engines, and the right keywords increase the odds that a company's site will be included in search results.
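As an illustration of how a spider might read meta tags, the sketch below uses the HTML parser from Python's standard library to collect the name and content of each meta tag on a page. The class `MetaTagReader`, the function `read_meta`, and the sample page are hypothetical; real spiders differ in which tags they read and how much weight they give them.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collect name/content pairs from <meta> tags, as a keyword-only spider might."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)  # attribute names arrive lowercased from the parser
        name = fields.get("name")
        content = fields.get("content")
        if name and content:
            self.meta[name.lower()] = content


def read_meta(html):
    """Return a dict of meta tag names to their content for one page."""
    reader = MetaTagReader()
    reader.feed(html)
    reader.close()
    return reader.meta


# Hypothetical page; the meta tags are invisible in a browser but readable by a spider.
SAMPLE_PAGE = """<html><head>
<title>Example Hardware</title>
<meta name="keywords" content="hammers, nails, saws">
<meta name="description" content="A hypothetical hardware shop.">
</head><body>Welcome.</body></html>"""

if __name__ == "__main__":
    print(read_meta(SAMPLE_PAGE))
```

Because the spider takes the tag contents at face value, nothing in this sketch checks that the keywords match what the page actually says, which is how misleading meta tags can end up in an index.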
While spiders are critical elements of the online world, they were also a source of aggravation and controversy in the early 2000s. On the technical side, spiders sometimes slow down the performance of Web servers (the computers or applications that host Web sites) by visiting them over and over in a short period of time, sometimes as often as 100 times in a single minute. Examples include spiders that search for up-to-the-minute news, or for product or stock-market information. For companies without strong technical systems, spiders that behave this way can cause major problems.
Another concern centered on how information collected by spiders was gathered, redistributed to other parties, and ultimately used. Part of this concern, which raised related legal issues, involved security, because spiders sometimes uncover information a site's owner considers private or off-limits to the public. Another issue was misrepresentation, especially for distributors who risked having old, incorrect information about product availability or inventory displayed on other Web sites.
Because of these concerns, Web site administrators took measures to deny spiders access to their Web sites, or to certain areas within them. Administrators post specific rules about what spiders are allowed to access on their sites in an exclusion file called robots.txt, which spiders normally find and read. Anyone can view these rules by adding /robots.txt to the end of a site's root address, or uniform resource locator (URL). By looking at an access log, site administrators are able to determine which spiders have visited their sites and which pages they requested. In the log, spiders can be identified individually by the names their creators give them. This gives administrators the ability to exclude certain bots from visiting in the future, should they present a problem.
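The way a spider consults these rules can be sketched with the robots.txt parser in Python's standard library (`urllib.robotparser`). The file contents, the spider names GoodBot and BadBot, and the example addresses below are hypothetical. A real spider would download the file from the site rather than read it from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named spider is shut out entirely,
# and every other spider is kept out of the /private/ area.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_rules(robots_text):
    """Parse robots.txt text into a rule set a spider can query before each visit."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_rules(ROBOTS_TXT)
    checks = [
        ("GoodBot", "http://www.example.com/index.html"),
        ("GoodBot", "http://www.example.com/private/report.html"),
        ("BadBot", "http://www.example.com/index.html"),
    ]
    for agent, url in checks:
        verdict = "allowed" if rules.can_fetch(agent, url) else "excluded"
        print(agent, url, verdict)
```

The rules are only a request. They work because spiders normally read the file and check each address against it before visiting, as `can_fetch` does here; a spider that skips the check is not stopped by robots.txt itself.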
FURTHER READING:
Baljko, Shah. "Web Crawling: Sticky Issue For Distributors." Planet IT, November 1, 2000. Available from www.PlanetIT.com.
Champlin, Leslie. "E-firms Lure Search Spiders to Their Corner of the Web." The Business Journal of Kansas City, March 31, 2000. Available from www.kansascity.bcentral.com.
"Crawler." Tech Encyclopedia, February 1, 2001. Available from www.techweb.com/encyclopedia.
"How Search Engines Work." Search Engine Watch, February 1, 2001. Available from www.searchenginewatch.com.
Pallmann, David. Programming Bots, Spiders, and Intelligent Agents in Microsoft Visual C++. Redmond, Washington: Microsoft Press, 1999.
"Spider." Ecommerce Webopedia, February 1, 2001. Available from e-comm.webopedia.com.
"Spider." Netlingo, January 31, 2001. Available from www.netlingo.com/inframes.
"SpiderSpotting." Search Engine Watch, February 1, 2001. Available from searchenginewatch.internet.com.
"What's a Bot?" Internet.com, February 6, 2001. Available from www.bots.internet.com.