Writing a good sitemap.xml file for Google
A Google XML Sitemap lets you inform search engines about the URLs on your website that are available for crawling. Put simply, a Sitemap that uses the Sitemap Protocol is an XML file that lists the URLs of a site. It also lets web developers include additional information about each URL (when it was last updated, how often it changes, and how important it is relative to other URLs on the site) so that search engines can crawl the site more intelligently.
Google now uses Sitemap Protocol 0.9 as defined by sitemaps.org. Sitemaps created for Google using Sitemap Protocol 0.9 are compatible with other search engines that support the protocol.
A Sitemap must begin with an opening urlset tag and end with a closing urlset tag. It must include a url entry for each URL as a parent XML tag, and a loc child entry inside each url parent tag. The file must be UTF-8 encoded.
In a Sitemap, urlset, url and loc are required, while changefreq, lastmod and priority are optional.
urlset is the root element that wraps the whole file, and a url tag wraps each individual URL entry.
loc is the address of the page. It has to start with the protocol (such as http://), it should end with a trailing slash "/" if your web server requires one, and it must be shorter than 2,048 characters.
lastmod is the date the URL was last modified. It must be written in W3C Datetime format (YYYY-MM-DD); hours, minutes and seconds can be added too.
changefreq is how often the URL is likely to change: always, hourly, daily, weekly, monthly, yearly or never.
priority is the priority of this URL relative to the other URLs on your site. The value must be between 0.0 and 1.0, so we use a value like 0.8; the default is 0.5. Setting the value to 1 does not affect your site's position on Google, only the priority of this URL among the other URLs of your site.
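To make the rules for the optional tags concrete, here is a minimal Python sketch that checks a value for each of them. The function names are my own, and the lastmod pattern is a simplification: it accepts a full date, optionally followed by a time and a timezone, and does not cover every form the W3C Datetime note allows.

```python
import re
from datetime import date

# Allowed <changefreq> values, as listed in the text.
CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}

# Simplified W3C Datetime pattern (an assumption of this sketch):
# YYYY-MM-DD, optionally followed by Thh:mm[:ss[.s]] and a timezone (Z or +hh:mm).
_LASTMOD_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?$"
)


def is_valid_lastmod(value):
    """True if value looks like a W3C Datetime and its date part really exists."""
    if not _LASTMOD_RE.match(value):
        return False
    try:
        date.fromisoformat(value[:10])  # rejects dates such as 2008-02-30
    except ValueError:
        return False
    return True


def is_valid_changefreq(value):
    """True if value is one of the seven allowed change frequencies."""
    return value in CHANGEFREQ_VALUES


def is_valid_priority(value):
    """True if value parses as a number between 0.0 and 1.0 inclusive."""
    try:
        number = float(value)
    except ValueError:
        return False
    return 0.0 <= number <= 1.0
```

Running every entry through checks like these before writing the file catches the most common mistakes, such as a date written as 12/03/2008 or a priority of 1.5.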
An example Sitemap for the http://urbanoalvarez.es domain:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://urbanoalvarez.es/blog</loc>
    <lastmod>2008-03-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>http://urbanoalvarez.es/portfolio.php</loc>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>http://urbanoalvarez.es</loc>
    <lastmod>2008-03-12</lastmod>
  </url>
</urlset>
As the example shows, only the urlset, url and loc tags are required; the others are optional. For a better result, though, I recommend using all of them: work out a hierarchy among your site's URLs, set their priority values accordingly, and fill in the last-modified date and change frequency for each one as described above.
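If you would rather not write the XML by hand, a file like this can be generated with a few lines of Python. This is only a sketch using the standard library: build_sitemap and the shape of the entries list are my own choices, not part of the Sitemap Protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build Sitemap XML text from a list of dicts.

    Each dict needs a 'loc' key; 'lastmod', 'changefreq' and 'priority'
    are optional and are written only when present.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    # ElementTree escapes characters such as & in URLs for us.
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # The entries from the example Sitemap in this post.
    xml_text = build_sitemap([
        {"loc": "http://urbanoalvarez.es/blog", "lastmod": "2008-03-12",
         "changefreq": "weekly", "priority": 1},
        {"loc": "http://urbanoalvarez.es/portfolio.php", "changefreq": "monthly"},
        {"loc": "http://urbanoalvarez.es", "lastmod": "2008-03-12"},
    ])
    # Write the file as UTF-8, as the protocol requires.
    with open("sitemap.xml", "w", encoding="utf-8") as handle:
        handle.write(xml_text)
```

Letting a library write the tags means special characters in URLs are escaped correctly, which is easy to get wrong by hand.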
After completing the Sitemap file, I recommend compressing it with gzip to save bandwidth. Don't forget that a Sitemap file cannot contain more than 50,000 URLs, and it cannot be larger than 50MB uncompressed (the limit was 10MB when this was first written). Compressing does not get you around the size limit, because it applies to the uncompressed file. The compressed file will be sitemap.gz (or whatever you name it; the file extension will be .gz after gzip).
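As a rough sketch of this step in Python: the helper names below are my own, and the default size limit is the 50MB uncompressed figure from the current sitemaps.org protocol (older versions of the protocol said 10MB), so it is passed as a parameter in case you want to check against a different value.

```python
import gzip

MAX_URLS = 50_000
MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes


def check_limits(sitemap_bytes, url_count, max_bytes=MAX_UNCOMPRESSED_BYTES):
    """Return a list of limit violations for an uncompressed Sitemap."""
    problems = []
    if url_count > MAX_URLS:
        problems.append(f"{url_count} URLs is more than the {MAX_URLS} allowed")
    if len(sitemap_bytes) > max_bytes:
        problems.append(f"{len(sitemap_bytes)} bytes is more than the {max_bytes} allowed")
    return problems


def compress_sitemap(sitemap_bytes):
    """Return the gzip-compressed form of the Sitemap bytes."""
    return gzip.compress(sitemap_bytes)


if __name__ == "__main__":
    # Assumes a sitemap.xml file already exists in the current folder.
    with open("sitemap.xml", "rb") as handle:
        data = handle.read()
    # Counting closing </url> tags is a crude but workable URL count.
    for problem in check_limits(data, data.count(b"</url>")):
        print("Warning:", problem)
    with open("sitemap.gz", "wb") as handle:
        handle.write(compress_sitemap(data))
```

The limits are checked on the uncompressed bytes, before gzip is applied.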
Now, where should the Sitemap file go in the site folder? Wherever you put sitemap.gz, it can only include URLs in the same folder or in its subfolders. You can't include a URL from a higher-level directory or from a different directory. For example, if we put the Sitemap file at "http://frihost.com/sitemap.gz", we can include all the URLs starting with "http://frihost.com/". But if we put it in a directory such as "http://frihost.com/tools/sitemap.gz", then we can include only URLs starting with "http://frihost.com/tools/". So it is best to put it in the root.
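The location rule is easy to check automatically. Here is a small Python sketch, with function names of my own choosing, that works out which URL prefix a Sitemap may cover and lists any URLs that fall outside it. The page names a.php and index.php in the usage lines are made up for illustration.

```python
from urllib.parse import urlsplit


def sitemap_scope(sitemap_url):
    """Return the URL prefix that a Sitemap at sitemap_url is allowed to cover."""
    parts = urlsplit(sitemap_url)
    # Drop the file name and keep the directory, always ending in a slash.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def urls_out_of_scope(sitemap_url, urls):
    """Return the URLs that the Sitemap at sitemap_url may not include."""
    prefix = sitemap_scope(sitemap_url)
    rejected = []
    for url in urls:
        parts = urlsplit(url)
        # Treat a bare host such as http://frihost.com as http://frihost.com/
        candidate = f"{parts.scheme}://{parts.netloc}{parts.path or '/'}"
        if not candidate.startswith(prefix):
            rejected.append(url)
    return rejected


if __name__ == "__main__":
    print(sitemap_scope("http://frihost.com/tools/sitemap.gz"))
    # Hypothetical pages, used only to show the check.
    print(urls_out_of_scope(
        "http://frihost.com/tools/sitemap.gz",
        ["http://frihost.com/tools/a.php", "http://frihost.com/index.php"],
    ))
```

With the Sitemap in the tools folder, the second URL is reported because it sits above that folder.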
Now it's time to validate our Sitemap file. Google uses an XML schema to define the elements and attributes that can appear in your Sitemap file. You can download this schema from http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd
There are some tools to help you validate the structure of your Sitemap. You can find a list of XML-related tools at each of the following locations:
To validate your Sitemap against the schema, the urlset tag in your XML file needs a few extra attributes. It looks something like this:
<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    ....
    <loc>....</loc>
  </url>
</urlset>
If you are using a Sitemap Generator, it probably does this part for you.
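Before reaching for a full schema validator, a quick sanity check can catch the most common mistakes. The Python sketch below is not a schema validation. It only uses the standard library to confirm that the file is well-formed XML, that the root is urlset in the Sitemap namespace, and that every url entry has a loc. The function name check_structure is my own.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def check_structure(xml_bytes):
    """Return a list of problems; an empty list means the basics look right."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if root.tag != NS + "urlset":
        return [f"root element is {root.tag!r}, expected urlset in the Sitemap namespace"]
    problems = []
    for index, url in enumerate(root.findall(NS + "url"), start=1):
        loc = url.find(NS + "loc")
        # A loc that is missing or contains only whitespace counts as absent.
        if loc is None or not (loc.text or "").strip():
            problems.append(f"url entry {index} has no loc")
    return problems


if __name__ == "__main__":
    # Assumes a sitemap.xml file already exists in the current folder.
    with open("sitemap.xml", "rb") as handle:
        for problem in check_structure(handle.read()):
            print("Problem:", problem)
```

A stray space in the XML declaration or a urlset tag that is closed too early both show up here as parse or structure problems.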
Our Sitemap file is ready. Google and other search bots should now be able to discover and crawl our pages more easily and more regularly, which can help the site get more traffic. That is what we were aiming for when we set out to make a Sitemap.
If you want to make sure that Google knows about your Sitemap, create an account in Google Webmaster Tools (now called Google Search Console) and submit your Sitemaps manually.
Original post by paskall, in frihost.com
Cheers
Yeah!! (Wrings hands)! Nice blog you have here. I’ve enjoyed much reading your last posts. Keep it that way.
Good one. It's very useful for me.
Thank you for the FYI about the filename being “gz” and not gzip – good call, and very easily missed by any production team.