
DevOps PM IPL25 LaTeX Converter


A comprehensive crawler system that transforms the devops-pm-25.educ8.se educational website into LaTeX/Quarto format with integrated LLM-powered relevance scoring, multimedia extraction, and chalkboard-enabled presentations.

Features

  • 🤖 AI-Powered Crawling: LLM-based relevance scoring for intelligent link prioritization
  • 📄 Multi-Format Output: HTML site replica + LaTeX documents + Quarto presentations
  • 🎨 Chalkboard: interactive presentations with saveable annotations
  • 🖨️ Print Handouts: print presentations to landscape A4 PDFs
  • 🎬 Media Extraction: download videos, podcasts, and embedded media
  • 🐳 Containerized: run the offline replica with Podman/nginx
  • 🔗 Selective Crawling: smart external link handling - crawl only what matters

The Story: How This Site Was Captured

This project began with a simple question: "Can we create an offline replica of an educational website for use in environments with limited internet connectivity?"

The Challenge

The original site at devops-pm-25.educ8.se is a Hugo-based educational platform containing:

  • 221+ pages of structured learning content
  • 37 presentations using Reveal.js
  • Video tutorials and embedded media
  • Complex navigation with collapsible menus

Standard web crawlers (wget, HTTrack) failed because:

  1. JavaScript-rendered navigation - The menu system loads dynamically
  2. Single-page application elements - Content loads on demand
  3. Relative URL rewriting - Links need adjustment for offline use

Our Solution

We built a custom crawler using Playwright that:

  1. Renders JavaScript - Uses a real browser to execute all client-side code
  2. Analyzes screenshots - Uses AI (Llama Vision) to understand page content
  3. Scores links intelligently - Uses LLM to determine which links are worth following
  4. Handles multimedia - Downloads videos and podcasts automatically
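
Step 1 above can be sketched with Playwright's sync API. This is a minimal illustration, not the project's actual interface: the helper names `absolutize` and `render_and_extract` are invented for this example.

```python
from urllib.parse import urljoin

def absolutize(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve raw href values against the page URL; drop fragment-only links."""
    return [urljoin(base_url, h) for h in hrefs if h and not h.startswith("#")]

def render_and_extract(url: str) -> list[str]:
    """Render a page in a real browser and return the absolute URLs it links to."""
    # Deferred import: needs `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Wait for network idle so dynamically loaded menus are in the DOM.
        page.goto(url, wait_until="networkidle")
        hrefs = page.eval_on_selector_all(
            "a[href]", "els => els.map(e => e.getAttribute('href'))")
        final_url = page.url  # may differ from url after redirects
        browser.close()
        return absolutize(final_url, hrefs)
```

Because the browser renders the page before link extraction, JavaScript-built navigation shows up just like static links.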

The Crawling Process

flowchart TD
    A[Start: Homepage] --> B[Playwright Renders Page]
    B --> C[Extract All Links]
    C --> D{Link Relevance AI}
    D -->|High Value| E[Crawl Full Page]
    D -->|Low Value| F[Crawl Summary Only]
    D -->|Navigation| G[Skip]
    E --> H[Convert to Static HTML]
    F --> H
    H --> I[Fix Relative URLs]
    I --> J[Build Search Index]
    J --> K[Package for Offline]
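
The "Fix Relative URLs" step amounts to rewriting same-site absolute links relative to the page that contains them, so the replica works from any directory. A minimal sketch, assuming the site base URL; `to_relative` is an illustrative name, not the project's API:

```python
import posixpath
from urllib.parse import urlparse

SITE = "https://devops-pm-25.educ8.se"  # assumed base URL for the crawled site

def to_relative(page_path: str, target_url: str, site: str = SITE) -> str:
    """Rewrite an absolute same-site URL as a path relative to the current page."""
    parsed = urlparse(target_url)
    if not parsed.netloc:
        return target_url  # already relative: nothing to do
    if f"{parsed.scheme}://{parsed.netloc}" != site:
        return target_url  # external link: leave untouched
    # Compute the hop count from the current page's directory to the target.
    return posixpath.relpath(parsed.path or "/", start=posixpath.dirname(page_path))
```

For example, a stylesheet reference on a page two levels deep becomes `../../css/style.css`, which keeps working when the site is served from a subdirectory or opened from disk.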

Key Discoveries During Crawling

During the process, we discovered several interesting things about the original site:

  1. 404 Pages - Many pages served 404 error content: a crawler bug placed index.html files at the wrong paths, e.g. /1-what-is-a-server/index.html instead of /1-what-is-a-server/what-is-a-server/index.html

  2. Missing CSS - The crawled HTML lacked <link> tags for its CSS files and contained inline CSS without enclosing <style> tags

  3. Search Index - Hugo serves a dynamic index.json for search that a static crawl cannot capture, so we generated our own

  4. 37 Presentations - Not just the 3 week recaps, but mini-lectures and infrastructure presentations as well
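
The path bug in discovery 1 reduces to a mechanical rename. A sketch of the rule, inferred from the single example path above, so treat the numeric-prefix stripping as an assumption:

```python
import re
from pathlib import PurePosixPath

def fixed_path(wrong: str) -> str:
    """Illustrative repair for the nesting bug: the page should live in a
    subdirectory named after its section with the numeric prefix stripped,
    e.g. /1-what-is-a-server/index.html ->
         /1-what-is-a-server/what-is-a-server/index.html."""
    p = PurePosixPath(wrong)
    section = p.parent.name
    slug = re.sub(r"^\d+-", "", section)  # "1-what-is-a-server" -> "what-is-a-server"
    return str(p.parent / slug / p.name)
```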

What Was Captured

  • HTML Pages: 221+
  • Presentations: 37
  • CSS Files: 6
  • Images: many
  • Total Size: ~50MB

Post-Processing Fixes

After crawling, we applied several fixes:

  • Fixed nested directory paths (200+ pages)
  • Added missing <style> tags for inline CSS
  • Added missing CSS <link> tags
  • Generated index.json for search functionality
  • Applied dark theme to presentations
  • Removed Google Analytics tracking
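
The search-index fix can be sketched with the standard library alone. The uri/title/content schema below is an assumption about what the site's search JavaScript expects, and the helper names are illustrative:

```python
import json
from html.parser import HTMLParser
from pathlib import Path

class _TitleText(HTMLParser):
    """Collect the <title> and the visible body text of one page."""
    def __init__(self):
        super().__init__()
        self.title, self.text = "", []
        self._in_title = self._skip = False
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip = True  # script/style contents are not searchable text
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style"):
            self._skip = False
    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text.append(data.strip())

def index_entry(html: str, uri: str) -> dict:
    """Build one search-index record from a page's HTML."""
    p = _TitleText()
    p.feed(html)
    return {"uri": uri, "title": p.title.strip(), "content": " ".join(p.text)}

def build_index(site_root: str) -> str:
    """Walk the crawled site and emit the index.json the search widget loads."""
    entries = [index_entry(f.read_text(encoding="utf-8"),
                           "/" + str(f.relative_to(site_root)))
               for f in sorted(Path(site_root).rglob("index.html"))]
    return json.dumps(entries, ensure_ascii=False)
```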

Quick Start

# Install dependencies
pip install -e .

# Crawl the site
PYTHONPATH=. python -c "from crawler import app; app()" crawl --max-pages 500

# Run local replica
podman build -t educ8-replica .
podman run -p 8080:80 educ8-replica

Architecture

graph TB
    subgraph "Crawler"
        C[Playwright Crawler] --> D[Deduplication]
        D --> L[LaTeX Converter]
        D --> E[External Handler]
        D --> M[Multimedia]
    end
    
    subgraph "LLM Layer"
        V[Vision Analyzer] --> R[Relevance Scorer]
    end
    
    subgraph "Output"
        L --> S[Site Replica]
        L --> Q[Quarto Docs]
        M --> VDO[Videos/Audio]
    end
    
    C --> V
    R --> D

Project Structure

devops-pm-ipl25/
├── src/                      # Core modules
│   ├── crawler.py            # Playwright-based crawler
│   ├── deduplication.py      # URL deduplication with SHA256
│   ├── html_to_latex.py      # HTML to LaTeX converter
│   ├── llm_client.py         # Ollama LLM integration
│   ├── vision_analyzer.py    # Screenshot analysis
│   ├── relevance_analyzer.py # Link relevance scoring
│   ├── external_handler.py   # Selective external crawling
│   ├── multimedia.py         # Video/audio extraction
│   └── presentation_handler.py # Reveal.js → Beamer
├── crawler/                  # CLI commands
├── docs/                     # Documentation
├── output/                   # Crawled content (gitignored)
│   ├── site/                 # HTML site replica (221+ pages)
│   ├── latex/                # LaTeX files
│   └── external/             # Downloaded media
├── presentations/            # Quarto presentations with dark theme
├── _quarto.yml               # Quarto config with chalkboard
├── Containerfile             # Podman definition
└── README.md                 # This file
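
src/deduplication.py is summarized above as "URL deduplication with SHA256"; one minimal way to realize that idea (the normalization rules here are assumptions, not the module's actual behavior):

```python
import hashlib
from urllib.parse import urlparse, urlunparse

def canonical(url: str) -> str:
    """Normalize a URL before hashing: lowercase host, drop fragment, trim trailing slash."""
    p = urlparse(url)
    path = p.path.rstrip("/") or "/"
    return urlunparse((p.scheme, p.netloc.lower(), path, "", p.query, ""))

class Seen:
    """Track visited URLs by the SHA256 of their canonical form."""
    def __init__(self):
        self._hashes = set()

    def add(self, url: str) -> bool:
        """Record a URL; return True only if it has not been seen before."""
        h = hashlib.sha256(canonical(url).encode()).hexdigest()
        if h in self._hashes:
            return False
        self._hashes.add(h)
        return True
```

Hashing the canonical form means `https://X.se/a/` and `https://x.se/a#top` count as the same page, so the crawler never fetches one page twice under different spellings.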

Documentation Index

  • 📖 README: this file - overview and quick start
  • 📋 Specification: detailed technical specification
  • 🔧 Architecture: system design and diagrams
  • 📖 API Reference: module documentation
  • 🚀 Usage Guide: detailed usage instructions
  • 🏗️ Development: contributing and development
  • 📜 CHANGELOG: version history

Key Modules

Crawler (src/crawler.py)

Playwright-based web crawler with JavaScript rendering support.

LLM Integration (src/llm_client.py, src/vision_analyzer.py)

Dual-LLM system:

  • Vision: llama3.2-vision:latest - Analyze page screenshots
  • Decision: qwen2.5:7b - Score link relevance
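
A sketch of the decision side against Ollama's /api/generate endpoint; the prompt wording, score parsing, and function names are illustrative, not the project's actual code:

```python
import json
import re
import urllib.request

OLLAMA = "http://10.32.64.105:11434"  # host taken from the Configuration section

PROMPT = ("On a scale of 0-3, how relevant is the link '{url}' "
          "(anchor text: '{text}') to a DevOps project-management course? "
          "Answer with a single digit.")

def parse_score(reply: str) -> int:
    """Pull the first digit 0-3 out of a model reply; default to 0 if none found."""
    m = re.search(r"[0-3]", reply)
    return int(m.group()) if m else 0

def score_link(url: str, text: str, model: str = "qwen2.5:7b") -> int:
    """Ask the decision model for a relevance score via /api/generate."""
    body = json.dumps({"model": model,
                       "prompt": PROMPT.format(url=url, text=text),
                       "stream": False}).encode()
    req = urllib.request.Request(f"{OLLAMA}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_score(json.load(resp)["response"])
```

Defensive parsing matters here: small models rarely answer with a bare digit, so the regex tolerates replies like "Score: 2, because ...".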

Relevance Scoring

flowchart LR
    A[URL] --> B{Allowed Domain?}
    B -->|No| C[Skip]
    B -->|Yes| D{Generic Link?}
    D -->|Yes| C
    D -->|No| E[Score 0-3]
    E --> F{Crawl?}
    F -->|High| G[Crawl Full]
    F -->|Low| H[Crawl Summary]

External links are handled selectively:

  • Only specific allowed domains are crawled (docs.microsoft.com, github, stackoverflow)
  • Generic/navigation links are skipped
  • Videos and podcasts are downloaded automatically
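
The flowchart and the domain rules can be sketched as a single decision function. The full domain spellings, the generic-text list, and the score threshold are assumptions for illustration:

```python
from urllib.parse import urlparse

# Assumed full hostnames for the allowed external domains.
ALLOWED = {"docs.microsoft.com", "github.com", "stackoverflow.com"}
# Assumed anchor texts that mark generic/navigation links.
GENERIC = {"home", "back", "next", "previous", "menu", "login"}

def crawl_decision(url: str, anchor_text: str, score: int) -> str:
    """Mirror the flowchart: skip disallowed or generic links,
    then crawl fully or summarize based on the relevance score."""
    host = urlparse(url).netloc.lower()
    if not any(host == d or host.endswith("." + d) for d in ALLOWED):
        return "skip"
    if anchor_text.strip().lower() in GENERIC:
        return "skip"
    return "full" if score >= 2 else "summary"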

LinkedIn Crawler

For Lars Appel Profile

Since LinkedIn blocks automated access, we provide a crawler that requires manual login:

# Install dependencies
pip install selenium

# Run the crawler (opens browser window)
python crawler/lars_appel_selenium.py

The crawler will:

  1. Open a visible Chrome window
  2. Wait for you to log into LinkedIn manually
  3. Navigate to the profile
  4. Extract: name, headline, about, location, external links
  5. Save to output/lars_appel/profile_data.json
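
Step 2 typically boils down to polling the browser until LinkedIn redirects away from the login page. A sketch where `driver` is any Selenium WebDriver; the /feed URL check and both helper names are assumptions, not the script's actual logic:

```python
import time

def is_logged_in(current_url: str) -> bool:
    """LinkedIn normally redirects to /feed after a successful login (assumption)."""
    return "linkedin.com/feed" in current_url

def wait_for_login(driver, timeout: int = 300) -> bool:
    """Poll the browser URL until the user has logged in manually, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_logged_in(driver.current_url):
            return True
        time.sleep(2)  # avoid hammering the WebDriver connection
    return False
```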

See docs/LARS_APPEL_CONTEXT.md for full details.

Presentations

Available Presentations

The project includes custom Quarto presentations with dark theme styling:

  • week-1-recap.html: Week 1 Technical Recap - server perspectives, Defense in Depth
  • week-2-recap.html: Week 2 Technical Recap
  • week-3-recap.html: Week 3 Technical Recap

Printing to PDF

Presentations are optimized for printing as A4 landscape handouts:

  1. Open a presentation (e.g., /presentations/week-1-recap.html)
  2. Press Ctrl+P (or Cmd+P on Mac)
  3. Print settings are automatically configured:
    • Orientation: Landscape
    • Paper Size: A4
    • Each slide prints on its own page

The print styles include:

  • Readable font sizes (24pt headings, 12pt body)
  • High contrast black text on white background
  • All fragments/animations visible
  • Tables and code blocks properly formatted

Building Presentations

# Render all presentations
quarto render presentations/

# Or render specific presentation
quarto render presentations/week-1-recap.qmd

Configuration

LLM Connection

export OLLAMA_HOST=10.32.64.105:11434

Crawler Options

--max-depth 2       # Crawl depth
--max-pages 500     # Max pages to crawl

Documentation

Detailed documentation is available in the docs/ directory:

  • docs/USAGE.md: installation, commands, configuration
  • docs/ARCHITECTURE.md: system design and components
  • docs/API.md: API specifications
  • docs/SPEC.md: project specification
  • PROMPTS.md: development prompts and workflow
  • WORKFLOW.md: development workflow
  • SPEC.md: technical specification

License

GNU General Public License v2.0 - See LICENSE file.

Copyright (C) 2026 LoopAware AB


Built with ❤️ by LoopAware