Search Engine Enhancement

Wednesday, September 12, 2007

Getting timely search engine coverage of a site means people can find things soon after you change or post them.

Most search engines crawl linked pages every few days or so, either by following external links or after a manual URL submission. They won’t find unlinked pages or pages behind broken links, though, and the ranking and crawl efficiency are likely to be worse than for a site that is indexed for easy searching using a sitemap.

There are three basic steps to having a page optimally indexed:

• Generating a Sitemap
• Creating an appropriate robots.txt file
• Informing search engines of the site’s existence

Sitemaps
It seems like the world has settled on sitemaps for making search engines’ lives easier. There is no indication that a sitemap actually improves rank or search rate, but it seems likely that it does, or that it will soon. The format was created by Google and is supported by Google, Yahoo, Ask, and IBM, at least. The reference is at sitemaps.org.
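To give a feel for what a sitemap file actually contains, here is a minimal sketch in Python that writes the bare-bones form of the format: a urlset element in the sitemaps.org namespace with one url/loc pair per page. The function name build_sitemap is my own invention for illustration, and optional fields such as lastmod are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML as bytes, with one <url><loc> entry per address.

    Hypothetical helper for illustration; not part of Google's script.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when serializing.
        ET.SubElement(url, "loc").text = address
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(urlset, encoding="utf-8")


if __name__ == "__main__":
    print(build_sitemap(["http://www.your-site.com/"]).decode("utf-8"))
```

For anything beyond a handful of pages, the generator script described next is the easier route.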

Google has created a Python script to generate a sitemap through a number of methods: walking the HTML path, walking the directory structure, parsing Apache-standard access logs, parsing external files, or direct entry. It seems to me that walking the server-side directory structure is the easiest, most accurate method. The script itself is on SourceForge. The directions are good, but if you’re only using directory structure, the config.xml file can be edited down to something like:

<?xml version="1.0" encoding="UTF-8"?>
<site
base_url="http://www.your-site.com/"
store_into="/www/data-dist/your_site_directory/sitemap.xml.gz"
verbose="1"
>

<url href="http://www.your-site.com/" />
<directory
path="/www/data-dist/your_site_directory"
url="http://www.your-site.com/"
default_file="index.html"
/>

</site>
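To make the directory method concrete, here is a rough Python sketch of the idea behind the directory element: walk the files under a path and map each one to a URL under a base address. This is not the code from sitemap_gen.py. The function name urls_from_directory is hypothetical, and listing the directory URL in place of its default file is an assumption of this sketch that the real script may handle differently.

```python
import os
from urllib.parse import quote


def urls_from_directory(root_path, base_url, default_file="index.html"):
    """Map every file under root_path to a URL under base_url.

    Hypothetical helper: base_url is expected to end with a slash.
    """
    urls = []
    for dirpath, dirnames, filenames in os.walk(root_path):
        dirnames.sort()  # walk subdirectories in a stable order
        rel_dir = os.path.relpath(dirpath, root_path)
        prefix = "" if rel_dir == "." else rel_dir.replace(os.sep, "/") + "/"
        for name in sorted(filenames):
            # Assumption: the default file is represented by its directory URL.
            rel_url = prefix if name == default_file else prefix + name
            urls.append(base_url + quote(rel_url))
    return urls


if __name__ == "__main__":
    for url in urls_from_directory(".", "http://www.your-site.com/"):
        print(url)
```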

Note that this will index every file on the site, which can make for a large sitemap. If you use your site for media files or file transfer, you might not want to index every part of the site, in which case you can use filters to block the indexing of parts of the site or certain file types. If you only want to index web files, you might insert the following inside the site element:

<filter  action="pass"  type="wildcard"  pattern="*.htm"           />
<filter  action="pass"  type="wildcard"  pattern="*.html"          />
<filter  action="pass"  type="wildcard"  pattern="*.php"           />
<filter  action="drop"  type="wildcard"  pattern="*"               />
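The order of the filters matters, since the catch-all drop rule has to come last. The sketch below mimics that behavior in Python under two assumptions of mine: that the filters are tried top to bottom with the first match deciding the outcome, and that a URL matching no filter is kept. The names FILTERS and is_kept are hypothetical, and fnmatchcase stands in for the script's wildcard matching.

```python
from fnmatch import fnmatchcase

# (action, pattern) pairs in the same order as the filter elements above.
FILTERS = [
    ("pass", "*.htm"),
    ("pass", "*.html"),
    ("pass", "*.php"),
    ("drop", "*"),
]


def is_kept(url, filters=FILTERS):
    """Return True if the first matching filter passes the URL."""
    for action, pattern in filters:
        if fnmatchcase(url, pattern):
            return action == "pass"
    # Assumption: a URL that matches no filter stays in the sitemap.
    return True


if __name__ == "__main__":
    for url in ("http://www.your-site.com/about.html",
                "http://www.your-site.com/photos/cat.jpg"):
        print(url, "kept" if is_kept(url) else "dropped")
```

Under these assumptions a bare directory URL such as http://www.your-site.com/ matches none of the pass patterns and falls through to the final drop rule, which is worth checking for in the generated sitemap.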

Running the script with

python sitemap_gen.py --config=config.xml

will generate the sitemap.xml.gz file and put it in the right place. If the uncompressed file size is over 10 MB, you’ll need to pare down the files listed. This can happen if the filters are more inclusive than what I’ve given, particularly if you have large photo or media directories and index all the media and thumbnail files.
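Since the limit applies to the uncompressed size, the size of the .gz file on disk won’t tell you whether you are over it. A minimal Python sketch for checking, assuming the 10 MB limit mentioned above means 10,485,760 bytes (the function names are my own):

```python
import gzip

# 10 MB uncompressed limit, taken here as 10 * 1024 * 1024 bytes.
MAX_BYTES = 10 * 1024 * 1024


def uncompressed_size(path):
    """Return the number of bytes the gzip file at path expands to."""
    total = 0
    with gzip.open(path, "rb") as handle:
        # Read in 1 MB chunks so a large sitemap is never held in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            total += len(chunk)
    return total


def sitemap_too_large(path, limit=MAX_BYTES):
    """Return True if the uncompressed sitemap exceeds the limit."""
    return uncompressed_size(path) > limit
```

Calling sitemap_too_large("/www/data-dist/your_site_directory/sitemap.xml.gz") after a run tells you whether the filters need tightening.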

The sitemap will tend to get out of date. If you want to update it regularly, there are a few options. One is to use a WordPress sitemap generator (if that’s what you’re using and indexing), which does the right thing and generates a sitemap using relevant data available to WordPress and not to the file system (a good thing). Another, which can be combined with the first, is to add a cron job to regenerate the sitemap regularly. For example,

3  3  *  *  *  root python /path_to/sitemap_gen.py --config=/path_to/config.xml

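will update the sitemap daily at 3:03 a.m. The user field (root) means this line belongs in /etc/crontab rather than in a per-user crontab.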

robots.txt

The robots.txt file can be used to exclude certain search engines, for example MSN if you don’t like Microsoft for some reason and are willing to sacrifice traffic to make a point. It also points search engines to your sitemap file. There are tools that will generate a robots.txt file for you, but a simple one might look like:

User-agent: MSNBot                             # Agent I don't like for some reason
Disallow: /                                    # Path it isn't allowed to traverse

User-agent: *                                  # For everything else
Disallow: /cgi-bin/                            # Directory nobody can index

Sitemap: http://www.my_site.com/sitemap.xml.gz # Where my sitemap is
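It’s easy to get a robots.txt file subtly wrong, so it can be worth checking the rules before relying on them. Here is a small sketch using Python’s standard urllib.robotparser module, which reads the rules the way a well-behaved crawler would. The host www.example.com, the helper load_rules, and the bot name SomeOtherBot are placeholders of mine, the rules are written with # comments (the comment character robots.txt parsers recognize), and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Example rules in the spirit of the file above; the host is a placeholder.
ROBOTS_TXT = """\
User-agent: MSNBot
Disallow: /            # this agent may not crawl anything

User-agent: *
Disallow: /cgi-bin/    # nobody may crawl this directory

Sitemap: http://www.example.com/sitemap.xml.gz
"""


def load_rules(text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    for agent, url in [
        ("MSNBot", "http://www.example.com/index.html"),
        ("SomeOtherBot", "http://www.example.com/index.html"),
        ("SomeOtherBot", "http://www.example.com/cgi-bin/form.cgi"),
    ]:
        verdict = "allowed" if rules.can_fetch(agent, url) else "blocked"
        print(agent, url, verdict)
    print("Sitemaps:", rules.site_maps())
```

Real crawlers differ in how they resolve conflicting rules, so treat this as a sanity check rather than a guarantee of how any particular search engine will behave.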

Telling the world

Search engines are supposed to do the work; that’s their job. They should find your robots.txt file eventually, read the sitemap, and then parse your site without any further assistance. But to expedite the process and possibly enhance search results, there are some submission tools at Yahoo, Ask, and particularly Google that generally allow you to add meta information.
Ask.com allows you to submit your sitemap via URL (and that seems to be all they do).

Yahoo
Yahoo has some site submission tools and supports site authentication, which means putting a random string in a file they can find to prove you have write-access to the server. Their tools are at
https://siteexplorer.search.yahoo.com/mysites

with submissions at
https://siteexplorer.search.yahoo.com/submit.php

You can submit sites and feeds. I usually use the file authentication, which means creating a file with a random name (y_key_random_string.html) that has another random string as its only contents. They authenticate within 24 hours.
It isn’t clear whether submitting a feed also adds the site, but it looks like it does. If you don’t have a feed, you may not need to authenticate the site for submission.
Google has a lot of webmaster tools at

The verification process is similar, but you don’t have to put data inside the verification file, so

touch googlerandomstring.html

is all you need to get the verification file up. You submit the URL to the sitemap directly.
Google also offers blog tools at