Monday, October 13, 2003

The Site was crawled by Google

The Everett Family Blog was visited today by the "crawler" from Google.

What does that mean?

Google is one of the most widely used search engine websites on the Internet. To allow people to search for sites, Google needs to effectively visit every website at least once to find out what is on the site. That work is done by a "crawler", as in a webcrawler.

A crawler is a program that simply goes and visits a website, making sure to visit each and every link on the site, recording information about each page.
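For the curious, here is a rough sketch of that "visit every link" idea in Python. It is not how Google's crawler actually works; it just walks a made-up, in-memory site (the `TOY_SITE` dictionary, the page names, and the `crawl` function are all invented for illustration) so it can run without touching the network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, start):
    """Visit every page reachable from `start`, breadth first.

    `pages` maps a path to its HTML; it stands in for a real website.
    Returns the paths in the order they were visited.
    """
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        path = queue.popleft()
        html = pages.get(path)
        if html is None:
            continue  # a link to a page that does not exist
        visited.append(path)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A made-up three-page site, including one broken link.
TOY_SITE = {
    "/": '<a href="/about.html">About</a> <a href="/photos.html">Photos</a>',
    "/about.html": '<a href="/">Home</a>',
    "/photos.html": '<a href="/about.html">About</a> <a href="/missing.html">Gone</a>',
}

print(crawl(TOY_SITE, "/"))
```

The `seen` set is what keeps the crawler from going around in circles when pages link back to each other.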

How do you know a crawler has visited the site?

When a crawler visits a website, it first tries to access a file called "robots.txt". If programmers were consistent, I would have expected the crawler to look for a file named in the motif of spiders (like the World Wide Web, crawler, etc.)... but nope, instead they went with a different idea for this particular thing.

robots.txt is a file that tells the crawler what it can and cannot crawl on a site. It allows a website to pick and choose what it wants searchable vs. what it wants to keep out of search results. If robots.txt doesn't exist, then the site is telling the crawler to feel free to visit every part of the site. Everett Family Blog does not have a robots.txt file.
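To make that a little more concrete, here is a small sketch using Python's standard `urllib.robotparser` module. The rules in `EXAMPLE_RULES` and the page paths are made up for illustration (this blog has no such file), and an empty rule set is used here as a stand-in for a site with nothing to restrict.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules: every crawler is asked to stay out of /private/.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

restricted = build_parser(EXAMPLE_RULES)
unrestricted = build_parser("")  # no rules at all

print(restricted.can_fetch("Googlebot", "/private/diary.html"))    # False
print(restricted.can_fetch("Googlebot", "/index.html"))            # True
print(unrestricted.can_fetch("Googlebot", "/private/diary.html"))  # True
```

Keep in mind the file is only a request. A well-behaved crawler honors it, but nothing forces a badly behaved one to.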

OK, I kinda understand that, but how do you know a crawler has visited?

That is simple. A log is kept every time somebody visits the site. I have another piece of software that does log analysis. In the log, information is stored including files that were requested that did not exist. Tonight for the first time, the file "robots.txt" was requested and it was displayed in the log analysis tool.
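The check itself boils down to something like the sketch below. It assumes the log is in the common Apache-style format, which may not match what my server or log analysis tool really uses, and the sample lines, host names, and the `missing_robots_requests` function are invented for illustration.

```python
import re

# One line of an Apache-style "common log format" access log (an assumption).
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def missing_robots_requests(lines):
    """Return (host, time) for each request for /robots.txt that got a 404."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if match.group("path") == "/robots.txt" and match.group("status") == "404":
            hits.append((match.group("host"), match.group("time")))
    return hits


# Made-up sample entries, not taken from the real log.
SAMPLE_LOG = [
    'crawler.example.net - - [01/Jan/2003:21:15:02 -0400] "GET /robots.txt HTTP/1.0" 404 208',
    'crawler.example.net - - [01/Jan/2003:21:15:03 -0400] "GET / HTTP/1.0" 200 5120',
    'visitor.example.com - - [01/Jan/2003:21:20:40 -0400] "GET /index.html HTTP/1.0" 200 5120',
]

for host, when in missing_robots_requests(SAMPLE_LOG):
    print(host, "asked for robots.txt at", when)
```

A 404 is the "file not found" status, which is why a request for a robots.txt that does not exist stands out in the log.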

The last piece of the puzzle is that I can tell what domain each request is from. A domain being, for example, aol.com, comcast.com, covad.net, etc. Today a request was made from the domain googlebot.com.
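If the log only holds an IP address, the host name usually comes from a reverse DNS lookup, and the domain is the tail end of that name. Here is a hedged sketch: `hostname_for` and `registered_domain` are names I made up, the sample host name is hypothetical, and the "last two labels" rule is a simplification that would get addresses like something.co.uk wrong.

```python
import socket


def hostname_for(ip_address):
    """Reverse DNS lookup; returns None when the address has no name."""
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return None


def registered_domain(hostname):
    """Naively reduce a host name to its last two labels."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


# Hypothetical host name of the kind a crawler's address might resolve to.
print(registered_domain("crawler-host-1.googlebot.com"))
```

The lookup can legitimately come back empty, since not every IP address has a reverse DNS entry.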

So what did we learn today? Programmers have a strong tendency to mix metaphors... In this case the mixture was between the "spider" metaphor and the "robot" metaphor. That makes all of this seem even more difficult than it actually is to people who are not as intimately familiar with it.

Enough computer crap for today...

1 comment:

Steve Everett said...

I noticed some activity on www.zone1000.com yesterday which is unusual since I haven't updated it in a while. The page hits were shortly after 7:00 pm eastern time. What time did Google visit your site? I'm thinking that it followed the links to my site, although I don't see the request for robots.txt. I did a reverse DNS lookup on the IP address, but couldn't find the domain.