
Article Surfing Archive



How Web Crawlers Work


A web crawler (also known as a web spider or web robot) is a program or automated script that browses the internet looking for web pages to process.

Many applications, mostly search engines, crawl websites every day in order to find up-to-date data.

Most web crawlers save a copy of each visited page so they can index it later. The rest scan pages for one narrow purpose only, such as harvesting email addresses (for spam).

How does it work?

A crawler needs a starting point, which is a web address (a URL).

To browse the internet we use the HTTP network protocol, which lets us talk to web servers, download data from them and upload data to them.
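As a rough illustration of the download step, here is a minimal Python sketch that uses only the standard library. The user agent string and the function names are made up for this example, and a real crawler would also need error handling and should respect each site's robots.txt.

```python
import urllib.request

# Hypothetical name the crawler announces to web servers.
USER_AGENT = "ExampleCrawler/0.1"


def build_request(url):
    """Prepare an HTTP GET request for the given URL."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=10):
    """Download a page over HTTP and return its body as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset announced by the server, or fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch with the starting URL would return the HTML of that page as a string.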

The crawler fetches this URL and then looks for hyperlinks (the A tag in the HTML language).

The crawler then follows those links and repeats the same process on every page it reaches.

That is the basic idea. How we proceed from here depends entirely on the purpose of the software itself.
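The loop described so far can be sketched in Python roughly as follows. This is only one possible arrangement: the breadth-first order, the max_pages limit and the fetch parameter (any function that takes a URL and returns HTML text) are assumptions of this example, not something the basic idea requires.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every A tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, without #fragments."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Visit pages breadth-first from start_url; return the URLs fetched."""
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])  # URLs waiting to be visited
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be downloaded
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing fetch in as a parameter keeps the loop separate from the network code, so it can be tried against a small dictionary of pages before it is pointed at real sites.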

If we only want to grab email addresses, we search the text of each web page (including its hyperlinks) for them. This is the easiest type of software to develop.
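To show how little code this case needs, here is a small Python sketch. The regular expression is a deliberately simple approximation chosen for this example. It will miss some valid addresses and accept some invalid ones.

```python
import re

# Simplified pattern: local part, "@", then a domain with at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text):
    """Return distinct email-like strings in text, lowercased, in order found."""
    matches = (match.lower() for match in EMAIL_RE.findall(text))
    return list(dict.fromkeys(matches))  # dict keys keep first-seen order
```

Because the search runs over the raw page source, addresses inside mailto: hyperlinks are picked up along with those in the visible text.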

Search engines are much more difficult to develop.

When building a search engine we need to take care of a few other things.

1. Size - Some web sites are very large and contain many directories and files. Harvesting all of the data may take a lot of time.

2. Change Frequency - A web site may change very often, even a few times a day. Pages can be added and deleted each day. We need to decide when to revisit each site and each page per site.

3. How do we process the HTML output? If we build a search engine we want to understand the page structure rather than just treat it as plain text. We must tell the difference between a caption and a simple sentence. We must look for bold or italic text, font colors, font size, paragraphs and tables. This means we must know HTML very well and we need to parse it first. What we need for this task is a tool called an "HTML to XML converter". One can be found on my website. You can find it in the resource box or look for it on the Noviway website: www.Noviway.com.
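To illustrate the third point, the Python sketch below walks through a page and labels each piece of text with a weight that depends on the tag around it. It uses Python's built-in HTML parser instead of an HTML to XML converter, and the tag weights are invented for this example. A real search engine would choose its own.

```python
from html.parser import HTMLParser

# Hypothetical weights: how much more text counts inside each tag.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "h3": 2, "b": 2, "strong": 2}
SKIPPED = {"script", "style"}  # content that is not readable text


class WeightedTextExtractor(HTMLParser):
    """Collects (text, weight) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # tracked tags we are currently inside
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS or tag in SKIPPED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag; tolerate sloppy HTML.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or any(tag in SKIPPED for tag in self.open_tags):
            return
        # Plain text gets weight 1; otherwise use the strongest enclosing tag.
        weight = max((TAG_WEIGHTS[tag] for tag in self.open_tags), default=1)
        self.runs.append((text, weight))


def weighted_text(html):
    """Return the page text as a list of (text, weight) pairs."""
    parser = WeightedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.runs
```

An indexer could then count a word found in a title or heading more heavily than the same word found in an ordinary paragraph.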

That's it for now. I hope you learned something.

Submitted by:

Eran Aharonovich

Software Programmer
Home Page: http://www.Noviway.com
Web Crawler Page: http://www.noviway.com/Code/Web-Crawler.aspx
HTML To XML Converter Page: http://www.noviway.com/Code/HTML-To-XML.aspx









https://articlesurfing.org/computers_and_internet/how_web_crawlers_work.html

Copyright © 1995 - Photius Coutsoukis (All Rights Reserved).









