
DevOps PM IPL25 LaTeX Converter


A comprehensive crawler system that transforms the devops-pm-25.educ8.se educational website into LaTeX/Quarto format with integrated LLM-powered relevance scoring, multimedia extraction, and chalkboard-enabled presentations.

Features

  • 🤖 AI-Powered Crawling: LLM-based relevance scoring for intelligent link prioritization
  • 📄 Multi-Format Output: HTML site replica + LaTeX documents + Quarto presentations
  • 🎨 Chalkboard: interactive presentations with saveable annotations
  • 🖨️ Print Handouts: print presentations to landscape A4 PDFs
  • 🎬 Media Extraction: download videos, podcasts, and embedded media
  • 🐳 Containerized: run the offline replica with Podman/nginx
  • 🔗 Selective Crawling: smart external link handling - crawl only what matters

The Story: How This Site Was Captured

This project began with a simple question: "Can we create an offline replica of an educational website for use in environments with limited internet connectivity?"

The Challenge

The original site at devops-pm-25.educ8.se is a Hugo-based educational platform containing:

  • 221+ pages of structured learning content
  • 37 presentations using Reveal.js
  • Video tutorials and embedded media
  • Complex navigation with collapsible menus

Standard web crawlers (wget, HTTrack) failed because:

  1. JavaScript-rendered navigation - The menu system loads dynamically
  2. Single-page application elements - Content loads on demand
  3. Relative URL rewriting - Links need adjustment for offline use

Our Solution

We built a custom crawler using Playwright that:

  1. Renders JavaScript - Uses a real browser to execute all client-side code
  2. Analyzes screenshots - Uses AI (Llama Vision) to understand page content
  3. Scores links intelligently - Uses LLM to determine which links are worth following
  4. Handles multimedia - Downloads videos and podcasts automatically
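
Step 1 above can be sketched with Playwright's sync API. This is a minimal illustration, not the project's actual interface: the helper names `absolutize` and `render_and_extract` are invented for this example.

```python
from urllib.parse import urljoin

def absolutize(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve raw href values against the page URL; drop fragment-only links."""
    return [urljoin(base_url, h) for h in hrefs if h and not h.startswith("#")]

def render_and_extract(url: str) -> list[str]:
    """Render a page in a real browser and return the absolute URLs it links to."""
    # Deferred import: needs `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Wait for network idle so dynamically loaded menus are in the DOM.
        page.goto(url, wait_until="networkidle")
        hrefs = page.eval_on_selector_all(
            "a[href]", "els => els.map(e => e.getAttribute('href'))")
        final_url = page.url  # may differ from url after redirects
        browser.close()
        return absolutize(final_url, hrefs)
```

Because the browser renders the page before link extraction, JavaScript-built navigation shows up just like static links.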

The Crawling Process

flowchart TD
    A[Start: Homepage] --> B[Playwright Renders Page]
    B --> C[Extract All Links]
    C --> D{Link Relevance AI}
    D -->|High Value| E[Crawl Full Page]
    D -->|Low Value| F[Crawl Summary Only]
    D -->|Navigation| G[Skip]
    E --> H[Convert to Static HTML]
    F --> H
    H --> I[Fix Relative URLs]
    I --> J[Build Search Index]
    J --> K[Package for Offline]
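
The "Fix Relative URLs" step amounts to rewriting same-site absolute links relative to the page that contains them, so the replica works from any directory. A minimal sketch, assuming the site base URL; `to_relative` is an illustrative name, not the project's API:

```python
import posixpath
from urllib.parse import urlparse

SITE = "https://devops-pm-25.educ8.se"  # assumed base URL for the crawled site

def to_relative(page_path: str, target_url: str, site: str = SITE) -> str:
    """Rewrite an absolute same-site URL as a path relative to the current page."""
    parsed = urlparse(target_url)
    if not parsed.netloc:
        return target_url  # already relative: nothing to do
    if f"{parsed.scheme}://{parsed.netloc}" != site:
        return target_url  # external link: leave untouched
    # Compute the hop count from the current page's directory to the target.
    return posixpath.relpath(parsed.path or "/", start=posixpath.dirname(page_path))
```

For example, a stylesheet reference on a page two levels deep becomes `../../css/style.css`, which keeps working when the site is served from a subdirectory or opened from disk.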

Key Discoveries During Crawling

During the process, we discovered several interesting things about the original site:

  1. 404 Pages - Many pages served 404 error content: a crawler bug placed index.html files at the wrong paths, e.g. /1-what-is-a-server/index.html instead of /1-what-is-a-server/what-is-a-server/index.html

  2. Missing CSS - The crawled HTML lacked <link> tags for its CSS files and contained inline CSS without enclosing <style> tags

  3. Search Index - Hugo serves a dynamic index.json for search that a static crawl cannot capture, so we generated our own

  4. 37 Presentations - Not just the 3 week recaps, but mini-lectures and infrastructure presentations as well
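
The path bug in discovery 1 reduces to a mechanical rename. A sketch of the rule, inferred from the single example path above, so treat the numeric-prefix stripping as an assumption:

```python
import re
from pathlib import PurePosixPath

def fixed_path(wrong: str) -> str:
    """Illustrative repair for the nesting bug: the page should live in a
    subdirectory named after its section with the numeric prefix stripped,
    e.g. /1-what-is-a-server/index.html ->
         /1-what-is-a-server/what-is-a-server/index.html."""
    p = PurePosixPath(wrong)
    section = p.parent.name
    slug = re.sub(r"^\d+-", "", section)  # "1-what-is-a-server" -> "what-is-a-server"
    return str(p.parent / slug / p.name)
```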

What Was Captured

  • HTML Pages: 221+
  • Presentations: 37
  • CSS Files: 6
  • Images: many
  • Total Size: ~50MB

Post-Processing Fixes

After crawling, we applied several fixes:

  • Fixed nested directory paths (200+ pages)
  • Added missing <style> tags for inline CSS
  • Added missing CSS <link> tags
  • Generated index.json for search functionality
  • Applied dark theme to presentations
  • Removed Google Analytics tracking
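
The search-index fix can be sketched with the standard library alone. The uri/title/content schema below is an assumption about what the site's search JavaScript expects, and the helper names are illustrative:

```python
import json
from html.parser import HTMLParser
from pathlib import Path

class _TitleText(HTMLParser):
    """Collect the <title> and the visible body text of one page."""
    def __init__(self):
        super().__init__()
        self.title, self.text = "", []
        self._in_title = self._skip = False
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip = True  # script/style contents are not searchable text
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style"):
            self._skip = False
    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text.append(data.strip())

def index_entry(html: str, uri: str) -> dict:
    """Build one search-index record from a page's HTML."""
    p = _TitleText()
    p.feed(html)
    return {"uri": uri, "title": p.title.strip(), "content": " ".join(p.text)}

def build_index(site_root: str) -> str:
    """Walk the crawled site and emit the index.json the search widget loads."""
    entries = [index_entry(f.read_text(encoding="utf-8"),
                           "/" + str(f.relative_to(site_root)))
               for f in sorted(Path(site_root).rglob("index.html"))]
    return json.dumps(entries, ensure_ascii=False)
```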

Quick Start

# Install dependencies
pip install -e .

# Crawl the site
PYTHONPATH=. python -c "from crawler import app; app()" crawl --max-pages 500

# Run local replica
podman build -t educ8-replica .
podman run -p 8080:80 educ8-replica

Architecture

graph TB
    subgraph "Crawler"
        C[Playwright Crawler] --> D[Deduplication]
        D --> L[LaTeX Converter]
        D --> E[External Handler]
        D --> M[Multimedia]
    end
    
    subgraph "LLM Layer"
        V[Vision Analyzer] --> R[Relevance Scorer]
    end
    
    subgraph "Output"
        L --> S[Site Replica]
        L --> Q[Quarto Docs]
        M --> VDO[Videos/Audio]
    end
    
    C --> V
    R --> D

Project Structure

devops-pm-ipl25/
├── src/                      # Core modules
│   ├── crawler.py            # Playwright-based crawler
│   ├── deduplication.py      # URL deduplication with SHA256
│   ├── html_to_latex.py      # HTML to LaTeX converter
│   ├── llm_client.py         # Ollama LLM integration
│   ├── vision_analyzer.py    # Screenshot analysis
│   ├── relevance_analyzer.py # Link relevance scoring
│   ├── external_handler.py   # Selective external crawling
│   ├── multimedia.py         # Video/audio extraction
│   └── presentation_handler.py # Reveal.js → Beamer
├── crawler/                  # CLI commands
├── docs/                     # Documentation
├── output/                   # Crawled content (gitignored)
│   ├── site/                 # HTML site replica (221+ pages)
│   ├── latex/                # LaTeX files
│   └── external/             # Downloaded media
├── presentations/            # Quarto presentations with dark theme
├── _quarto.yml               # Quarto config with chalkboard
├── Containerfile             # Podman definition
└── README.md                 # This file
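
src/deduplication.py is summarized above as "URL deduplication with SHA256"; one minimal way to realize that idea (the normalization rules here are assumptions, not the module's actual behavior):

```python
import hashlib
from urllib.parse import urlparse, urlunparse

def canonical(url: str) -> str:
    """Normalize a URL before hashing: lowercase host, drop fragment, trim trailing slash."""
    p = urlparse(url)
    path = p.path.rstrip("/") or "/"
    return urlunparse((p.scheme, p.netloc.lower(), path, "", p.query, ""))

class Seen:
    """Track visited URLs by the SHA256 of their canonical form."""
    def __init__(self):
        self._hashes = set()

    def add(self, url: str) -> bool:
        """Record a URL; return True only if it has not been seen before."""
        h = hashlib.sha256(canonical(url).encode()).hexdigest()
        if h in self._hashes:
            return False
        self._hashes.add(h)
        return True
```

Hashing the canonical form means `https://X.se/a/` and `https://x.se/a#top` count as the same page, so the crawler never fetches one page twice under different spellings.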

Documentation Index

  • 📖 README: this file - overview and quick start
  • 📋 Specification: detailed technical specification
  • 🔧 Architecture: system design and diagrams
  • 📖 API Reference: module documentation
  • 🚀 Usage Guide: detailed usage instructions
  • 🏗️ Development: contributing and development
  • 📜 CHANGELOG: version history

Key Modules

Crawler (src/crawler.py)

Playwright-based web crawler with JavaScript rendering support.

LLM Integration (src/llm_client.py, src/vision_analyzer.py)

Dual-LLM system:

  • Vision: llama3.2-vision:latest - Analyze page screenshots
  • Decision: qwen2.5:7b - Score link relevance
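
A sketch of the decision side against Ollama's /api/generate endpoint; the prompt wording, score parsing, and function names are illustrative, not the project's actual code:

```python
import json
import re
import urllib.request

OLLAMA = "http://10.32.64.105:11434"  # host taken from the Configuration section

PROMPT = ("On a scale of 0-3, how relevant is the link '{url}' "
          "(anchor text: '{text}') to a DevOps project-management course? "
          "Answer with a single digit.")

def parse_score(reply: str) -> int:
    """Pull the first digit 0-3 out of a model reply; default to 0 if none found."""
    m = re.search(r"[0-3]", reply)
    return int(m.group()) if m else 0

def score_link(url: str, text: str, model: str = "qwen2.5:7b") -> int:
    """Ask the decision model for a relevance score via /api/generate."""
    body = json.dumps({"model": model,
                       "prompt": PROMPT.format(url=url, text=text),
                       "stream": False}).encode()
    req = urllib.request.Request(f"{OLLAMA}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_score(json.load(resp)["response"])
```

Defensive parsing matters here: small models rarely answer with a bare digit, so the regex tolerates replies like "Score: 2, because ...".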

Relevance Scoring

flowchart LR
    A[URL] --> B{Allowed Domain?}
    B -->|No| C[Skip]
    B -->|Yes| D{Generic Link?}
    D -->|Yes| C
    D -->|No| E[Score 0-3]
    E --> F{Crawl?}
    F -->|High| G[Crawl Full]
    F -->|Low| H[Crawl Summary]

External links are handled selectively:

  • Only specific allowed domains are crawled (docs.microsoft.com, github, stackoverflow)
  • Generic/navigation links are skipped
  • Videos and podcasts are downloaded automatically
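
The flowchart and the domain rules can be sketched as a single decision function. The full domain spellings, the generic-text list, and the score threshold are assumptions for illustration:

```python
from urllib.parse import urlparse

# Assumed full hostnames for the allowed external domains.
ALLOWED = {"docs.microsoft.com", "github.com", "stackoverflow.com"}
# Assumed anchor texts that mark generic/navigation links.
GENERIC = {"home", "back", "next", "previous", "menu", "login"}

def crawl_decision(url: str, anchor_text: str, score: int) -> str:
    """Mirror the flowchart: skip disallowed or generic links,
    then crawl fully or summarize based on the relevance score."""
    host = urlparse(url).netloc.lower()
    if not any(host == d or host.endswith("." + d) for d in ALLOWED):
        return "skip"
    if anchor_text.strip().lower() in GENERIC:
        return "skip"
    return "full" if score >= 2 else "summary"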

LinkedIn Crawler

For Lars Appel Profile

Since LinkedIn blocks automated access, we provide a crawler that requires manual login:

# Install dependencies
pip install selenium

# Run the crawler (opens browser window)
python crawler/lars_appel_selenium.py

The crawler will:

  1. Open a visible Chrome window
  2. Wait for you to log into LinkedIn manually
  3. Navigate to the profile
  4. Extract: name, headline, about, location, external links
  5. Save to output/lars_appel/profile_data.json
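
Step 2 typically boils down to polling the browser until LinkedIn redirects away from the login page. A sketch where `driver` is any Selenium WebDriver; the /feed URL check and both helper names are assumptions, not the script's actual logic:

```python
import time

def is_logged_in(current_url: str) -> bool:
    """LinkedIn normally redirects to /feed after a successful login (assumption)."""
    return "linkedin.com/feed" in current_url

def wait_for_login(driver, timeout: int = 300) -> bool:
    """Poll the browser URL until the user has logged in manually, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_logged_in(driver.current_url):
            return True
        time.sleep(2)  # avoid hammering the WebDriver connection
    return False
```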

See docs/LARS_APPEL_CONTEXT.md for full details.

Presentations

Available Presentations

The project includes custom Quarto presentations with dark theme styling:

  • week-1-recap.html: Week 1 Technical Recap - server perspectives, Defense in Depth
  • week-2-recap.html: Week 2 Technical Recap
  • week-3-recap.html: Week 3 Technical Recap

Printing to PDF

Presentations are optimized for printing as A4 landscape handouts:

  1. Open a presentation (e.g., /presentations/week-1-recap.html)
  2. Press Ctrl+P (or Cmd+P on Mac)
  3. Print settings are automatically configured:
    • Orientation: Landscape
    • Paper Size: A4
    • Each slide prints on its own page

The print styles include:

  • Readable font sizes (24pt headings, 12pt body)
  • High contrast black text on white background
  • All fragments/animations visible
  • Tables and code blocks properly formatted

Building Presentations

# Render all presentations
quarto render presentations/

# Or render specific presentation
quarto render presentations/week-1-recap.qmd

Configuration

LLM Connection

export OLLAMA_HOST=10.32.64.105:11434

Crawler Options

--max-depth 2       # Crawl depth
--max-pages 500     # Max pages to crawl

Documentation

Detailed documentation is available in the docs/ directory:

  • docs/USAGE.md: installation, commands, configuration
  • docs/ARCHITECTURE.md: system design and components
  • docs/API.md: API specifications
  • docs/SPEC.md: project specification
  • PROMPTS.md: development prompts and workflow
  • WORKFLOW.md: development workflow
  • SPEC.md: technical specification

License

GNU General Public License v2.0 - See LICENSE file.

Copyright (C) 2026 LoopAware AB


Built with ❤️ by LoopAware