Generate Sitemap for a Website using Python

2023

In this project, we will explore how to generate a sitemap for a website using Python. We'll use the requests library to fetch web pages, the BeautifulSoup library to parse HTML, the datetime module to stamp each entry with a last-modified date, and urllib.parse to normalize URLs. We'll explain each step of the code and provide the final working code at the end.


Generating a Sitemap

To begin, we need to import the necessary libraries:

import requests
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.parse import urljoin, urlparse

We import requests for making HTTP requests, BeautifulSoup for parsing HTML content, datetime for date-related operations, and urljoin and urlparse from urllib.parse for URL manipulation.
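
As a quick illustration of what the two URL helpers do (example.com is just a placeholder, not part of the project):

```python
from urllib.parse import urljoin, urlparse

# urljoin resolves a relative path against a base URL
print(urljoin("https://example.com/blog/", "post-1"))
# https://example.com/blog/post-1

# urlparse splits a URL into components; netloc is the host name
print(urlparse("https://example.com/blog?page=2").netloc)
# example.com
```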

Next, we define a function generate_sitemap(url) to generate the sitemap:

def generate_sitemap(url):
    root = url.rstrip("/")
    sitemap = set()
    visited = set()
    original_domain = urlparse(root).netloc

We normalize the root URL by stripping any trailing slash, create a set to collect the sitemap URLs, a set to track visited URLs, and record the original domain name, which is used later to filter out external links.

We define nested functions add_url(root, path) to add URLs to the sitemap and extract_urls(soup, root) to extract URLs from a BeautifulSoup object.
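
For reference, here is what those two helpers look like. In the project they are nested inside generate_sitemap and capture sitemap and original_domain from the enclosing scope; the module-level version below uses a placeholder domain purely for illustration:

```python
from urllib.parse import urljoin, urlparse

# Module-level stand-ins for the closure variables in generate_sitemap
sitemap = set()
original_domain = "example.com"  # placeholder domain

def add_url(root, path):
    # Resolve relative paths against the root, then record the absolute URL
    if not path.startswith("http"):
        path = urljoin(root, path)
    sitemap.add(path)

def extract_urls(soup, root):
    # Keep links that are relative or that point at the original domain
    for link in soup.find_all("a", href=True):
        url = link["href"]
        parsed_url = urlparse(url)
        if parsed_url.netloc == original_domain or not parsed_url.netloc:
            add_url(root, url)
```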

We maintain a stack to process URLs:

    stack = [url]

    while stack:
        current_url = stack.pop()
        if current_url in visited:
            continue

        visited.add(current_url)

        if not current_url.startswith(root):
            continue

We pop URLs off the stack one at a time, skip any we have already visited, and skip any that fall outside the root URL so the crawl stays on the original site.
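
The same stack-plus-visited pattern can be seen on a toy, made-up link graph:

```python
# Each page maps to the links it contains (all names are hypothetical)
links = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/about", "/contact"],
    "/contact": [],
}

stack = ["/"]
visited = set()
while stack:
    page = stack.pop()
    if page in visited:
        continue          # already crawled; cycles are harmless
    visited.add(page)
    stack.extend(links[page])

print(sorted(visited))  # ['/', '/about', '/blog', '/contact']
```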

We use requests.get() to fetch web pages:

        try:
            response = requests.get(current_url, timeout=10)
        except requests.exceptions.RequestException as e:
            print(f"Error occurred while accessing {current_url}: {e}")
            continue

We catch any request-level exceptions (connection errors, timeouts, and so on) so that a single failing page does not abort the whole crawl.

For valid responses (status code 200), we parse HTML and extract URLs:

        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            extract_urls(soup, root)

            for link in soup.find_all("a", href=True):
                next_url = link["href"]
                if not next_url.startswith("http"):
                    next_url = urljoin(root, next_url)
                stack.append(next_url)

We use BeautifulSoup to parse the HTML content and call extract_urls() to record the page's links in the sitemap. We then iterate through the <a> tags again, normalize relative URLs with urljoin(), and push every candidate onto the stack; external links get pushed too, but the startswith(root) check at the top of the loop discards them on the next pass.
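
If you ever want to drop the BeautifulSoup dependency, the same <a> extraction can be sketched with the standard library's html.parser. This is an alternative, not what the project itself uses:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect href values from <a> tags, mirroring soup.find_all("a", href=True)
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="/about">About</a> <a href="https://example.com/blog">Blog</a>')
print(collector.links)  # ['/about', 'https://example.com/blog']
```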

We generate the XML content for the sitemap:

    xml_content = '<?xml version="1.0" encoding="UTF-8"?>\n'
    xml_content += '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'

    for url in sitemap:
        xml_content += '  <url>\n'
        xml_content += f'    <loc>{url}</loc>\n'
        xml_content += f'    <lastmod>{datetime.now().date().isoformat()}</lastmod>\n'
        xml_content += '    <changefreq>weekly</changefreq>\n'
        xml_content += '    <priority>0.8</priority>\n'
        xml_content += '  </url>\n'

    xml_content += '</urlset>'

We create the XML content by iterating through URLs in the sitemap and adding various metadata such as location, last modification date, change frequency, and priority.
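
One caveat: the & character is not legal inside XML text, so a crawled URL containing a query string would make the sitemap invalid. The standard library's xml.sax.saxutils.escape can be applied to each URL before it is written; the URL below is hypothetical:

```python
from xml.sax.saxutils import escape

url = "https://example.com/search?q=python&page=2"  # hypothetical URL
print(escape(url))
# https://example.com/search?q=python&amp;page=2
```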

Finally, we write the sitemap to a file:

    with open("sitemap.xml", "w") as f:
        f.write(xml_content)

    print("Sitemap generated successfully.")

We write the generated XML content to a file named "sitemap.xml".
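
As an optional sanity check, the output can be re-parsed with the standard library to confirm it is well-formed XML. The sketch below uses a minimal one-entry sitemap and a separate demo file name so the real sitemap.xml is not overwritten:

```python
import xml.etree.ElementTree as ET

# A minimal one-entry sitemap written to a demo file
demo_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    '  <url>\n'
    '    <loc>https://example.com/</loc>\n'
    '  </url>\n'
    '</urlset>'
)
with open("sitemap-demo.xml", "w") as f:
    f.write(demo_xml)

# Re-parse the file; ET.parse raises ParseError if the XML is malformed
ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
tree_root = ET.parse("sitemap-demo.xml").getroot()
urls = tree_root.findall(f"{ns}url")
print(len(urls))  # 1
```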

Example Usage

We provide an example usage of the generate_sitemap function:

generate_sitemap("https://learndevminds.vercel.app/")

Here, we generate a sitemap for the "https://learndevminds.vercel.app/" website.


Final Code

Here's the complete Python code for generating a sitemap for a website:

import requests
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.parse import urljoin, urlparse

def generate_sitemap(url):
    root = url.rstrip("/")
    sitemap = set()
    visited = set()
    original_domain = urlparse(root).netloc

    def add_url(root, path):
        if not path.startswith("http"):
            path = urljoin(root, path)
        sitemap.add(path)

    def extract_urls(soup, root):
        for link in soup.find_all("a", href=True):
            url = link["href"]
            parsed_url = urlparse(url)
            if parsed_url.netloc == original_domain or not parsed_url.netloc:
                add_url(root, url)

    stack = [url]

    while stack:
        current_url = stack.pop()
        if current_url in visited:
            continue

        visited.add(current_url)

        if not current_url.startswith(root):
            continue

        try:
            response = requests.get(current_url, timeout=10)
        except requests.exceptions.RequestException as e:
            print(f"Error occurred while accessing {current_url}: {e}")
            continue

        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            extract_urls(soup, root)

            for link in soup.find_all("a", href=True):
                next_url = link["href"]
                if not next_url.startswith("http"):
                    next_url = urljoin(root, next_url)
                stack.append(next_url)

    xml_content = '<?xml version="1.0" encoding="UTF-8"?>\n'
    xml_content += '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'

    for url in sitemap:
        xml_content += '  <url>\n'
        xml_content += f'    <loc>{url}</loc>\n'
        xml_content += f'    <lastmod>{datetime.now().date().isoformat()}</lastmod>\n'
        xml_content += '    <changefreq>weekly</changefreq>\n'
        xml_content += '    <priority>0.8</priority>\n'
        xml_content += '  </url>\n'

    xml_content += '</urlset>'

    with open("sitemap.xml", "w") as f:
        f.write(xml_content)

    print("Sitemap generated successfully.")

# Example usage
generate_sitemap("https://learndevminds.vercel.app/")

That's it! You now have a Python script that generates a sitemap for a website using libraries like requests and BeautifulSoup. Feel free to modify the code or use it as a starting point for your own projects. Enjoy generating sitemaps for your websites!
