Building an AI-Powered Web Scraper with Ollama & ChromaDB

A Step-by-Step Breakdown of a Python Script for Intelligent Web Scraping and Q&A

In this blog post, we’ll analyze a Python script that automates web scraping, text processing, and AI-powered question answering using Ollama and ChromaDB. The script extracts the content of a web page and lets users ask questions about it interactively.

View the script: RAG Web Scraper 6


What Does This Script Do?

This Python script:

  1. Detects and Stops Running Ollama Processes
  2. Restarts Ollama to Ensure a Fresh AI Model is Running
  3. Scrapes Webpage Content Dynamically
  4. Extracts and Displays the Webpage Title
  5. Processes and Stores Text Data in ChromaDB for Fast Retrieval
  6. Uses Ollama’s AI Model to Answer User Questions
  7. Allows Users to Change URLs & Scrape Different Pages Without Restarting

Let’s break down how each part of the script works.


1️⃣ System Information & Ollama Process Management

The script starts by printing system information and managing Ollama processes to avoid conflicts.

Checking System Information

Before starting, the script prints:

  • Operating System Name
  • Platform Details
  • CPU Core Count
  • Available Memory Details
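
As a rough illustration of this step, the sketch below gathers the same four details. The function name `system_summary` is hypothetical, and reading memory through the third-party `psutil` package is an assumption, since the article does not show which library the script uses.

```python
import os
import platform


def system_summary() -> dict:
    """Collect OS name, platform details, CPU core count, and memory figures."""
    info = {
        "os_name": platform.system(),     # e.g. "Linux", "Darwin", "Windows"
        "platform": platform.platform(),  # detailed platform string
        "cpu_cores": os.cpu_count(),      # logical cores; may be None
    }
    try:
        import psutil  # third-party; assumed here, not confirmed by the article

        memory = psutil.virtual_memory()
        info["memory_total_gb"] = round(memory.total / 1024**3, 2)
        info["memory_available_gb"] = round(memory.available / 1024**3, 2)
    except ImportError:
        # Without psutil the memory figures are simply left empty.
        info["memory_total_gb"] = None
        info["memory_available_gb"] = None
    return info


for key, value in system_summary().items():
    print(f"{key}: {value}")
```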

Stopping Any Running Ollama Processes

The script checks for existing Ollama processes and terminates them to ensure a clean restart.
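
One possible way to do this is sketched below. It is an assumption that the script shells out to `pkill` (Linux and macOS) or `taskkill` (Windows); the helper names `build_stop_command` and `stop_ollama` are hypothetical.

```python
import platform
import subprocess


def build_stop_command(process_name: str = "ollama") -> list[str]:
    """Return an OS-appropriate command that terminates processes by name."""
    if platform.system() == "Windows":
        # /F forces termination, /IM matches the executable (image) name.
        return ["taskkill", "/F", "/IM", f"{process_name}.exe"]
    # -x matches the exact process name, so this Python script is not
    # killed just because "ollama" appears somewhere in its command line.
    return ["pkill", "-x", process_name]


def stop_ollama() -> bool:
    """Run the stop command; True means at least one process was terminated."""
    try:
        result = subprocess.run(build_stop_command(), capture_output=True, text=True)
    except FileNotFoundError:
        # The kill utility itself is not available on this system.
        return False
    return result.returncode == 0
```

Calling `stop_ollama()` before the restart step gives the clean slate described above; a `False` result usually just means nothing was running.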

Restarting Ollama

Once all running instances are stopped, the script restarts Ollama to serve AI models.
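
A minimal sketch of the restart, assuming the script launches the `ollama serve` CLI command in the background. The `launch` and `wait_seconds` parameters are hypothetical additions that make the function easy to test without starting a real server.

```python
import subprocess
import time


def start_ollama(launch=subprocess.Popen, wait_seconds: float = 2.0):
    """Start `ollama serve` in the background and give it a moment to boot."""
    process = launch(
        ["ollama", "serve"],
        stdout=subprocess.DEVNULL,  # keep server logs out of the chat prompt
        stderr=subprocess.DEVNULL,
    )
    # Crude fixed wait; polling the local API until it responds is more robust.
    time.sleep(wait_seconds)
    return process
```

Calling `start_ollama()` with the defaults returns the `Popen` handle, which can later be used to terminate the server when the script exits.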

2️⃣ Web Scraping & Dynamic Content Extraction

The script asks the user for a URL and extracts content from the page.

Asking for the Webpage URL
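
The prompt might look roughly like the sketch below. The validation rule (only `http` and `https` URLs with a host) and the names `is_valid_url` and `ask_for_url` are assumptions for illustration, not taken from the script.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def ask_for_url(read=input) -> str:
    """Keep prompting until the user enters a usable URL."""
    while True:
        url = read("Enter the URL of the page to scrape: ").strip()
        if is_valid_url(url):
            return url
        print("Please enter a full URL starting with http:// or https://")
```

Validating up front avoids a confusing network error later when the page is fetched.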

Extracting the Page Title

The script looks for the <h1 class="post-title"> tag and extracts the article title.
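
The sketch below shows one way to pull out that title using only Python's standard-library `html.parser`. The original script may well use a library such as BeautifulSoup instead; the class `TitleParser`, the function `extract_title`, and the "Untitled" fallback are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <h1 class="post-title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "post-title" in classes and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later matching headings

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html: str, default: str = "Untitled") -> str:
    """Return the whitespace-normalized title, or a default if none is found."""
    parser = TitleParser()
    parser.feed(html)
    title = " ".join("".join(parser.parts).split())
    return title or default
```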

Extracting the Article Content

The script then extracts the main article content inside <div class="post-content">.
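
As an illustration, the sketch below collects the visible text inside that container with the standard-library `html.parser`, tracking nested `<div>` elements so it knows where the container ends. This is a stand-in under stated assumptions: the script itself may rely on BeautifulSoup or a similar library, and `ContentParser` and `extract_content` are hypothetical names.

```python
from html.parser import HTMLParser


class ContentParser(HTMLParser):
    """Collect visible text inside <div class="post-content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # 0 means we are outside the target div
        self.skip = 0    # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth == 0:
            if tag == "div" and "post-content" in classes:
                self.depth = 1
        elif tag == "div":
            self.depth += 1  # nested div inside the article body
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == "div":
            self.depth -= 1  # reaching 0 closes the target div
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.depth > 0 and self.skip == 0 and data.strip():
            self.parts.append(data.strip())


def extract_content(html: str) -> str:
    """Return the article text, one text fragment per line."""
    parser = ContentParser()
    parser.feed(html)
    return "\n".join(parser.parts)
```

Because each text fragment becomes its own line, inline tags such as `<em>` split a sentence across lines; for question answering this is usually harmless.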

Storing Extracted Text as a Document

To process the extracted text efficiently, it is wrapped in a Document object (from LangChain).
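
A minimal sketch of the wrapping step follows. `Document` is imported from `langchain_core.documents`; the fallback dataclass exists only so the example runs without LangChain installed, and the metadata keys `source` and `title` are assumptions rather than the script's confirmed choices.

```python
try:
    from langchain_core.documents import Document
except ImportError:
    # Fallback stand-in so the sketch runs without LangChain installed.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def to_document(text: str, url: str, title: str) -> Document:
    """Wrap scraped text and keep its origin alongside it as metadata."""
    return Document(page_content=text, metadata={"source": url, "title": title})
```

Keeping the URL in the metadata makes it possible to show where a retrieved chunk came from.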

3️⃣ Text Processing & ChromaDB Storage

The extracted text is split into smaller chunks and stored in ChromaDB for fast retrieval.

✂️ Splitting the Text into Chunks

To enable efficient question-answering, the script splits the text into small, searchable chunks.
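
To make the idea concrete, here is a simplified, pure-Python stand-in for chunking with overlap. The script most likely uses one of LangChain's text splitters, which also try to break on paragraph and sentence boundaries; the function name and the default sizes below are hypothetical.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size chunks that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval.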

Storing Data in ChromaDB

The text chunks are converted into vector embeddings and stored in ChromaDB.

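from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Embed each chunk with the all-minilm model served by Ollama
local_embeddings = OllamaEmbeddings(model="all-minilm")
# Build the vector store from the chunks and persist it in the chroma_db directory
vectorstore = Chroma.from_documents(documents=all_splits, embedding=local_embeddings, persist_directory="chroma_db")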

4️⃣ AI-Powered Question Answering

Once the content is processed and stored, the script allows the user to ask questions about the article.

Interactive Q&A Loop

Users can ask multiple questions without restarting the script.
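
A loop of this kind could be sketched as follows. The exit keywords (`exit`, `quit`), the prompt text, and the injectable `read`/`write` parameters are assumptions made for illustration; `answer_fn` stands for whatever function produces the answer.

```python
def qa_loop(answer_fn, read=input, write=print) -> None:
    """Keep answering questions until the user types 'exit' or 'quit'."""
    while True:
        question = read("Ask a question (or type 'exit' to quit): ").strip()
        if not question:
            continue  # ignore empty input and prompt again
        if question.lower() in ("exit", "quit"):
            write("Goodbye!")
            break
        write(answer_fn(question))
```

Passing `read` and `write` as parameters keeps the loop testable without a real terminal.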

Retrieving Relevant Information

When a question is asked, ChromaDB retrieves the most relevant text chunks.
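
In LangChain, vector stores such as Chroma expose a `similarity_search` method for this. The sketch below assumes the script uses it; the helper name `retrieve_context` and the choice of `k=4` chunks are hypothetical.

```python
def retrieve_context(vectorstore, question: str, k: int = 4) -> str:
    """Fetch the k chunks most similar to the question and join them."""
    docs = vectorstore.similarity_search(question, k=k)
    # Separate chunks with blank lines so the model can tell them apart.
    return "\n\n".join(doc.page_content for doc in docs)
```

A smaller `k` keeps the prompt short; a larger one gives the model more material at the cost of more noise.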

Answering the Question with AI

The retrieved text is sent to the Ollama model for generating a response.
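
A rough sketch of this step is shown below. The prompt wording and the names `PROMPT_TEMPLATE` and `answer_question` are assumptions; `llm` can be any LangChain model object with an `invoke` method.

```python
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def answer_question(llm, context: str, question: str) -> str:
    """Build a grounded prompt and return the model's reply as text."""
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    response = llm.invoke(prompt)
    # Chat models return a message object with .content; plain LLMs return a string.
    return getattr(response, "content", response)
```

With Ollama, `llm` might be created with `ChatOllama(model="llama3.1")` from the `langchain_ollama` package; the model name here is only an example and should match a model you have pulled locally.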

Displaying the Answer
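
Displaying the answer can be as simple as a `print` call. The sketch below additionally wraps long lines for the terminal; the "Answer:" label, the 80-column width, and the name `format_answer` are assumptions.

```python
import textwrap


def format_answer(answer: str, width: int = 80) -> str:
    """Wrap the answer to the terminal width and add a short label."""
    body = textwrap.fill(answer.strip(), width=width)
    return f"\nAnswer:\n{body}\n"


print(format_answer("The article explains how the scraper stores text chunks in ChromaDB."))
```

Note that `textwrap.fill` merges multiple paragraphs into one, so a model reply with deliberate line breaks would need to be wrapped paragraph by paragraph instead.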

5️⃣ Additional Features

Changing the URL Without Restarting

Users can type "change url" to scrape a new webpage dynamically without restarting the script.
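
One way to wire this command into the session is sketched below. Only the `change url` command comes from the article; the `exit`/`quit` keywords and the callables `ask_url`, `build_index`, and `answer` are hypothetical placeholders for the scraping, indexing, and answering steps.

```python
def parse_command(user_input: str) -> str:
    """Classify input as 'change_url', 'exit', or an ordinary 'question'."""
    text = user_input.strip().lower()
    if text == "change url":
        return "change_url"
    if text in ("exit", "quit"):
        return "exit"
    return "question"


def run_session(ask_url, build_index, answer, read=input, write=print) -> None:
    """Answer questions about one page, re-scraping when the user asks."""
    index = build_index(ask_url())  # scrape and index the first page
    while True:
        text = read("Question, 'change url', or 'exit': ")
        command = parse_command(text)
        if command == "exit":
            break
        if command == "change_url":
            index = build_index(ask_url())  # replace the index with the new page
            continue
        write(answer(index, text))
```

Rebuilding the index on each URL change keeps answers from mixing content from the old and new pages.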

Key Takeaways

✔️ Automates web scraping and AI-powered Q&A
✔️ Handles dynamic URL changes efficiently
✔️ Uses ChromaDB for fast text retrieval
✔️ Manages system processes, ensuring Ollama runs smoothly
✔️ Provides a continuous chatbot-like experience


Final Thoughts

This Python script is a powerful AI-driven tool that combines web scraping, vector search, and AI question-answering into one seamless workflow. It can be used for automated research, knowledge extraction, and real-time information retrieval.

Would you like to integrate this into your own projects? Send me an email: info@mindstorm.gr!
