Open Source Wikipedia To Markdown Generator

Posted in category Software on July 14, 2023 at 1:37 AM

Eric David Smith

Software Engineer / Musician / Entrepreneur


If you want to convert a Wikipedia article to Markdown, you can use the open source package I wrote to do it in seconds.

I made it because I wanted to convert some Wikipedia articles to Markdown for my personal notes and some AI / ML projects. I couldn't find a simple script to do this, so I wrote one myself. I hope you find it useful.

This is a simple script to convert a Wikipedia article to Markdown and optionally download the images too.

Prerequisites

Python 3

Installation

git clone https://github.com/erictherobot/wikipedia-markdown-generator.git
cd wikipedia-markdown-generator
pip3 install -r requirements.txt

Usage

python3 wiki-to-md.py <topic_name>

Output

wiki-to-md.py writes a Markdown file named after the topic (with spaces replaced by underscores) into a newly created md_output directory. If you want to download images too, use wiki-to-md-images.py instead and the images will be placed inside md_output/images/.
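
To make the naming rule concrete, here is a minimal sketch of how the output path is derived from the topic. It mirrors the os.path.join and replace calls in the scripts shown later in this post. The helper name output_path is hypothetical and is not part of the repository.

```python
import os


def output_path(topic, directory="md_output"):
    # Hypothetical helper: mirrors the filename logic in wiki-to-md.py.
    # Spaces in the topic become underscores and ".md" is appended.
    return os.path.join(directory, f"{topic.replace(' ', '_')}.md")


# Prints the path the script would write to for this topic.
print(output_path("Alan Turing"))
```

Topics with spaces need to be quoted on the command line so that they arrive as a single argument.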

Note: eventually, wiki-to-md.py and wiki-to-md-images.py will be combined into one script with a flag to download images or not.
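
One way the combined script could look, sketched with argparse. The --images flag and the build_parser function are assumptions for illustration only; neither exists in the repository today.

```python
import argparse


def build_parser():
    # Hypothetical CLI for a combined script. The --images flag is an
    # assumption and is not implemented in the current repository.
    parser = argparse.ArgumentParser(
        description="Generate a markdown file for a provided topic."
    )
    parser.add_argument(
        "topic",
        type=str,
        help="The topic to generate a markdown file for.",
    )
    parser.add_argument(
        "--images",
        action="store_true",
        help="Also download the article's images into md_output/images/.",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["Alan Turing", "--images"])
print(args.topic, args.images)  # prints "Alan Turing True"
```

With store_true the flag defaults to False, so running the script without --images would keep the current wiki-to-md.py behavior.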

Why?

I wanted to convert some Wikipedia articles to Markdown for my personal notes. I couldn't find a simple script to do this, so I wrote one myself.

Is This Open Source?

Yes, I wouldn't have it any other way. I hope you find it useful.

Code

There are two scripts, one that downloads images and one that doesn't. I'll show you both.

Without Images

Here's the wiki-to-md.py file:

import os
import wikipedia
import argparse
import re


def generate_markdown(topic):
    try:
        page = wikipedia.page(topic)
    except wikipedia.exceptions.DisambiguationError as e:
        print(e.options)
        return None
    except wikipedia.exceptions.PageError:
        print(f"Page not found for the topic: {topic}")
        return None

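    markdown_text = f"# {topic}\n\n"

    # Convert "== Title ==" style headings to Markdown. The number of "="
    # signs (2 to 6) becomes the number of "#" signs.
    page_content = re.sub(
        r"^(={2,6}) (.+?) \1$",
        lambda m: f"{'#' * len(m.group(1))} {m.group(2)}",
        page.content,
        flags=re.MULTILINE,
    )

    # re.split with a capture group returns
    # [intro, heading, body, heading, body, ...].
    sections = re.split(r"\n(## .*)\n", page_content)
    markdown_text += f"{sections[0].strip()}\n\n"
    for i in range(1, len(sections) - 1, 2):
        # Skip sections whose body is blank.
        if sections[i + 1].strip():
            markdown_text += f"{sections[i]}\n{sections[i + 1].strip()}\n\n"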

    # Create a directory for markdown files
    directory = "md_output"
    os.makedirs(directory, exist_ok=True)

    filename = os.path.join(directory, f"{topic.replace(' ', '_')}.md")

    # Write as UTF-8 so non-ASCII article text works on every platform.
    with open(filename, "w", encoding="utf-8") as md_file:
        md_file.write(markdown_text)

    print(f"Markdown file created: {filename}")
    return filename


parser = argparse.ArgumentParser(
    description="Generate a markdown file for a provided topic."
)
parser.add_argument(
    "topic",
    type=str,
    help="The topic to generate a markdown file for.",
)

if __name__ == "__main__":
    args = parser.parse_args()
    generate_markdown(args.topic)
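
The heading conversion and the split into sections can be tried without a network connection. The following is a minimal sketch on a made-up sample string written in the plain-text style that page.content returns. The sample text and variable names are illustrative and are not taken from the repository.

```python
import re

# Made-up sample in the plain-text style returned by page.content.
sample = (
    "Intro text.\n\n"
    "== History ==\nEarly days.\n\n"
    "=== Origins ===\nDetails.\n\n"
    "== Empty ==\n\n"
    "== Legacy ==\nLater work."
)

# Turn "=== x ===" and "== x ==" headings into "### x" and "## x".
text = re.sub(r"=== ([^=]+) ===", r"### \1", sample)
text = re.sub(r"== ([^=]+) ==", r"## \1", text)

# Splitting on a captured group keeps the headings in the result:
# [intro, heading, body, heading, body, ...]
parts = re.split(r"\n(## .*)\n", text)
intro, rest = parts[0], parts[1:]

# Pair each heading with its body and drop sections whose body is blank.
sections = [
    (heading, body.strip())
    for heading, body in zip(rest[0::2], rest[1::2])
    if body.strip()
]

for heading, body in sections:
    print(heading)
```

Because the split pattern only matches lines that start with "## ", the "### Origins" subsection stays inside the body of "## History", and the blank "Empty" section is filtered out.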

With Images

Here's the wiki-to-md-images.py file (in case you want to scrape images too):

import os
import wikipedia
import argparse
import re
import requests
import urllib.parse


def generate_markdown(topic):
    try:
        page = wikipedia.page(topic)
    except wikipedia.exceptions.DisambiguationError as e:
        print(e.options)
        return None
    except wikipedia.exceptions.PageError:
        print(f"Page not found for the topic: {topic}")
        return None

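    markdown_text = f"# {topic}\n\n"

    # Convert "== Title ==" style headings to Markdown. The number of "="
    # signs (2 to 6) becomes the number of "#" signs.
    page_content = re.sub(
        r"^(={2,6}) (.+?) \1$",
        lambda m: f"{'#' * len(m.group(1))} {m.group(2)}",
        page.content,
        flags=re.MULTILINE,
    )

    # re.split with a capture group returns
    # [intro, heading, body, heading, body, ...].
    sections = re.split(r"\n(## .*)\n", page_content)
    markdown_text += f"{sections[0].strip()}\n\n"
    for i in range(1, len(sections) - 1, 2):
        # Skip sections whose body is blank.
        if sections[i + 1].strip():
            markdown_text += f"{sections[i]}\n{sections[i + 1].strip()}\n\n"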

    # Create a directory for markdown files
    output_directory = "md_output"
    os.makedirs(output_directory, exist_ok=True)

    # Create a directory for image files
    image_directory = os.path.join(output_directory, "images")
    os.makedirs(image_directory, exist_ok=True)

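    for image_url in page.images:
        image_filename = urllib.parse.unquote(os.path.basename(image_url))
        image_path = os.path.join(image_directory, image_filename)
        # Use a timeout, and skip images that fail to download instead of
        # writing an error page to disk.
        response = requests.get(image_url, timeout=30)
        if not response.ok:
            print(f"Skipping image (HTTP {response.status_code}): {image_url}")
            continue
        with open(image_path, "wb") as image_file:
            image_file.write(response.content)
        markdown_text += f"![{image_filename}](./images/{image_filename})\n"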

    filename = os.path.join(output_directory, f'{topic.replace(" ", "_")}.md')

    # Write as UTF-8 so non-ASCII article text works on every platform.
    with open(filename, "w", encoding="utf-8") as md_file:
        md_file.write(markdown_text)

    print(f"Markdown file created: {filename}")
    return filename


parser = argparse.ArgumentParser(
    description="Generate a markdown file for a provided topic."
)
parser.add_argument(
    "topic",
    type=str,
    help="The topic to generate a markdown file for.",
)

if __name__ == "__main__":
    args = parser.parse_args()
    generate_markdown(args.topic)
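
The image filename step is easy to check in isolation. Here is a minimal sketch of how a local filename is derived from an image URL, using the same os.path.basename and urllib.parse.unquote calls as the script. The URL below is made up for illustration and may not point at a real file.

```python
import os
import urllib.parse

# Made-up URL in the style of the entries in page.images.
image_url = (
    "https://upload.wikimedia.org/wikipedia/commons/a/a1/Caf%C3%A9_de_Flore.jpg"
)

# Take the last path component, then decode percent-escapes such as %C3%A9.
image_filename = urllib.parse.unquote(os.path.basename(image_url))

# The script saves the image under md_output/images/.
image_path = os.path.join("md_output", "images", image_filename)

print(image_filename)
```

Decoding the percent-escapes keeps the saved filenames readable, which is also why the Markdown file should be written as UTF-8.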

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

If you find this useful as is, please let me know. If you find any bugs, please feel free to submit a pull request or open an issue. If you have any questions, you can contact me.

Supporting My Work

Please consider Buying Me A Coffee. I work hard to bring you my best content and any support would be greatly appreciated. Thank you for your support!

Contact

Eric David Smith

Software Engineer / Musician / Entrepreneur
