# DevOps PM IPL25 LaTeX Converter
A comprehensive crawler system that transforms the devops-pm-25.educ8.se educational website into LaTeX/Quarto format with integrated LLM-powered relevance scoring, multimedia extraction, and chalkboard-enabled presentations.
## Features
| Feature | Description |
|---|---|
| 🤖 AI-Powered Crawling | LLM-based relevance scoring for intelligent link prioritization |
| 📄 Multi-Format Output | HTML site replica + LaTeX documents + Quarto presentations |
| 🎨 Chalkboard | Interactive presentations with saveable annotations |
| 🖨️ Print Handouts | Print presentations to landscape A4 PDFs |
| 🎬 Media Extraction | Download videos, podcasts, and embedded media |
| 🐳 Containerized | Run offline replica with Podman/nginx |
| 🔗 Selective Crawling | Smart external link handling - crawl only what matters |
## The Story: How This Site Was Captured
This project began with a simple question: "Can we create an offline replica of an educational website for use in environments with limited internet connectivity?"
### The Challenge
The original site at devops-pm-25.educ8.se is a Hugo-based educational platform containing:
- 221+ pages of structured learning content
- 37 presentations using Reveal.js
- Video tutorials and embedded media
- Complex navigation with collapsible menus
Standard web crawlers (`wget`, `httrack`) failed because:
- JavaScript-rendered navigation - The menu system loads dynamically
- Single-page application elements - Content loads on demand
- Relative URL rewriting - Links need adjustment for offline use
### Our Solution
We built a custom crawler using Playwright that:
- Renders JavaScript - Uses a real browser to execute all client-side code
- Analyzes screenshots - Uses AI (Llama Vision) to understand page content
- Scores links intelligently - Uses LLM to determine which links are worth following
- Handles multimedia - Downloads videos and podcasts automatically
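After a page is rendered, the crawler pulls out candidate links before scoring them. A minimal sketch of that extraction step using only the standard library (the `extract_links` helper and `LinkExtractor` class are illustrative names, not the project's actual API):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags in rendered HTML."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative hrefs against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

In the real pipeline the HTML would come from Playwright's `page.content()` after client-side rendering; the extraction itself is the same.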
### The Crawling Process

```mermaid
flowchart TD
    A[Start: Homepage] --> B[Playwright Renders Page]
    B --> C[Extract All Links]
    C --> D{Link Relevance AI}
    D -->|High Value| E[Crawl Full Page]
    D -->|Low Value| F[Crawl Summary Only]
    D -->|Navigation| G[Skip]
    E --> H[Convert to Static HTML]
    F --> H
    H --> I[Fix Relative URLs]
    I --> J[Build Search Index]
    J --> K[Package for Offline]
```
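The "Fix Relative URLs" step amounts to rewriting on-site absolute links into paths that resolve from each saved page's location, while leaving external links untouched. A hedged sketch of the idea, assuming saved pages mirror their URL paths (`rewrite_link` is an illustrative name and handles absolute-path links only):

```python
import posixpath
from urllib.parse import urlparse

SITE = "https://devops-pm-25.educ8.se"

def rewrite_link(href, current_page_path):
    """Rewrite an on-site link so it works offline, relative to
    the page that contains it (e.g. /week-2/index.html)."""
    parsed = urlparse(href)
    if parsed.netloc and not href.startswith(SITE):
        return href  # external link: leave untouched
    target = parsed.path or "/"
    base_dir = posixpath.dirname(current_page_path)
    # Compute a relative path from the containing page's directory
    return posixpath.relpath(target, base_dir)
```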
### Key Discoveries During Crawling
During the process, we discovered several interesting things about the original site:
- **404 Pages** - Many pages were returning 404 error content (the crawler bug placed `index.html` files at wrong paths like `/1-what-is-a-server/index.html` instead of `/1-what-is-a-server/what-is-a-server/index.html`)
- **Missing CSS** - The crawled HTML was missing `<link>` tags for CSS files and had inline CSS without `<style>` tags
- **Search Index** - Hugo generates a dynamic `index.json` for search that doesn't exist in static crawls, so we had to generate our own
- **37 Presentations** - Not just the 3 week recaps, but mini-lectures and infrastructure presentations as well
### What Was Captured
| Content Type | Count |
|---|---|
| HTML Pages | 221+ |
| Presentations | 37 |
| CSS Files | 6 |
| Images | Many |
| Total Size | ~50MB |
### Post-Processing Fixes
After crawling, we applied several fixes:
- Fixed nested directory paths (200+ pages)
- Added missing `<style>` tags for inline CSS
- Added missing CSS `<link>` tags
- Generated `index.json` for search functionality
- Applied dark theme to presentations
- Removed Google Analytics tracking
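Regenerating the search index is the least obvious of these fixes. A minimal sketch of building an `index.json` from crawled pages, assuming the site's search script expects `uri`/`title`/`content` fields (an assumption based on common Hugo search setups, not verified against the original theme):

```python
import json
import re

def build_search_index(pages):
    """pages: mapping of URI -> raw HTML. Returns JSON text for index.json."""
    records = []
    for uri, html in pages.items():
        m = re.search(r"<title>(.*?)</title>", html, re.S)
        title = m.group(1).strip() if m else uri
        # Strip tags and collapse whitespace for the searchable body text
        text = re.sub(r"<[^>]+>", " ", html)
        text = re.sub(r"\s+", " ", text).strip()
        records.append({"uri": uri, "title": title, "content": text})
    return json.dumps(records, ensure_ascii=False, indent=2)
```

The real fix would also need to match the exact field names and tokenization the site's JavaScript search expects.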
## Quick Start

```bash
# Install dependencies
pip install -e .

# Crawl the site
PYTHONPATH=. python -c "from crawler import app; app()" crawl --max-pages 500

# Run the local replica
podman build -t educ8-replica .
podman run -p 8080:80 educ8-replica
```
## Architecture

```mermaid
graph TB
    subgraph "Crawler"
        C[Playwright Crawler] --> D[Deduplication]
        D --> L[LaTeX Converter]
        D --> E[External Handler]
        D --> M[Multimedia]
    end
    subgraph "LLM Layer"
        V[Vision Analyzer] --> R[Relevance Scorer]
    end
    subgraph "Output"
        L --> S[Site Replica]
        L --> Q[Quarto Docs]
        M --> VDO[Videos/Audio]
    end
    C --> V
    R --> D
```
## Project Structure

```
devops-pm-ipl25/
├── src/                        # Core modules
│   ├── crawler.py              # Playwright-based crawler
│   ├── deduplication.py        # URL deduplication with SHA256
│   ├── html_to_latex.py        # HTML to LaTeX converter
│   ├── llm_client.py           # Ollama LLM integration
│   ├── vision_analyzer.py      # Screenshot analysis
│   ├── relevance_analyzer.py   # Link relevance scoring
│   ├── external_handler.py     # Selective external crawling
│   ├── multimedia.py           # Video/audio extraction
│   └── presentation_handler.py # Reveal.js → Beamer
├── crawler/                    # CLI commands
├── docs/                       # Documentation
├── output/                     # Crawled content (gitignored)
│   ├── site/                   # HTML site replica (221+ pages)
│   ├── latex/                  # LaTeX files
│   └── external/               # Downloaded media
├── presentations/              # Quarto presentations with dark theme
├── _quarto.yml                 # Quarto config with chalkboard
├── Containerfile               # Podman definition
└── README.md                   # This file
```
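The deduplication step (`src/deduplication.py` above) hashes page content so the same page reached via different URLs is stored only once. A sketch of the core idea, not the module's actual interface:

```python
import hashlib

class ContentDeduplicator:
    """Track SHA256 digests of page bodies; report whether content is new."""
    def __init__(self):
        self._seen = set()

    def is_new(self, content: str) -> bool:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Hashing the body rather than the URL catches cases like trailing-slash variants or tracking parameters that point at identical content.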
## Documentation Index
| Document | Description |
|---|---|
| 📖 README | This file - overview and quick start |
| 📋 Specification | Detailed technical specification |
| 🔧 Architecture | System design and diagrams |
| 📖 API Reference | Module documentation |
| 🚀 Usage Guide | Detailed usage instructions |
| 🏗️ Development | Contributing and development |
| 📜 CHANGELOG | Version history |
## Key Modules
### Crawler (`src/crawler.py`)
Playwright-based web crawler with JavaScript rendering support.
### LLM Integration (`src/llm_client.py`, `src/vision_analyzer.py`)
Dual-LLM system:
- **Vision**: `llama3.2-vision:latest` - analyzes page screenshots
- **Decision**: `qwen2.5:7b` - scores link relevance
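A hedged sketch of how the decision model might be queried through Ollama's standard `/api/generate` HTTP endpoint (the prompt wording and the `score_link`/`build_relevance_prompt` helpers are illustrative, not the project's actual code):

```python
import json
import os
import urllib.request

OLLAMA = os.environ.get("OLLAMA_HOST", "localhost:11434")

def build_relevance_prompt(url: str, link_text: str) -> str:
    """Prompt sent to the decision model (qwen2.5:7b)."""
    return (
        "Rate this link's relevance to a DevOps course on a 0-3 scale.\n"
        f"URL: {url}\nLink text: {link_text}\n"
        "Answer with a single digit."
    )

def score_link(url: str, link_text: str) -> int:
    """Ask the decision model for a 0-3 relevance score (network call)."""
    body = json.dumps({
        "model": "qwen2.5:7b",
        "prompt": build_relevance_prompt(url, link_text),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"http://{OLLAMA}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["response"]
    # Take the first digit in the reply; default to 0 if none
    digits = [c for c in answer if c.isdigit()]
    return int(digits[0]) if digits else 0
```

The vision model would be called the same way, with a base64-encoded screenshot attached via the API's `images` field.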
### Relevance Scoring

```mermaid
flowchart LR
    A[URL] --> B{Allowed Domain?}
    B -->|No| C[Skip]
    B -->|Yes| D{Generic Link?}
    D -->|Yes| C
    D -->|No| E[Score 0-3]
    E --> F{Crawl?}
    F -->|High| G[Crawl Full]
    F -->|Low| H[Crawl Summary]
```
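The flowchart above can be sketched as a small routing function. The allow-list, the generic-link words, and the score threshold below are assumptions based on this README, not the project's exact values:

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"devops-pm-25.educ8.se", "docs.microsoft.com",
                   "github.com", "stackoverflow.com"}
GENERIC_TEXT = {"home", "next", "previous", "back", "menu", "login"}

def route_link(url: str, link_text: str, score: int) -> str:
    """Decide how to handle a link: 'skip', 'summary', or 'full'."""
    host = urlparse(url).netloc
    if host not in ALLOWED_DOMAINS:
        return "skip"
    if link_text.strip().lower() in GENERIC_TEXT:
        return "skip"
    # score comes from the LLM relevance scorer (0-3)
    return "full" if score >= 2 else "summary"
```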
### External Link Handling
- Only crawl specific allowed domains (docs.microsoft.com, GitHub, Stack Overflow)
- Skip generic/navigation links
- Download videos/podcasts automatically
## LinkedIn Crawler

### For the Lars Appel Profile

Since LinkedIn blocks automated access, we provide a crawler that requires manual login:

```bash
# Install dependencies
pip install selenium

# Run the crawler (opens a browser window)
python crawler/lars_appel_selenium.py
```
The crawler will:
- Open a visible Chrome window
- Wait for you to log into LinkedIn manually
- Navigate to the profile
- Extract name, headline, about, location, and external links
- Save the results to `output/lars_appel/profile_data.json`
See `docs/LARS_APPEL_CONTEXT.md` for full details.
## Presentations

### Available Presentations

The project includes custom Quarto presentations with dark theme styling:
| Presentation | Description |
|---|---|
| `week-1-recap.html` | Week 1 Technical Recap - Server perspectives, Defense in Depth |
| `week-2-recap.html` | Week 2 Technical Recap |
| `week-3-recap.html` | Week 3 Technical Recap |
### Printing to PDF
Presentations are optimized for printing as A4 landscape handouts:
1. Open a presentation (e.g., `/presentations/week-1-recap.html`)
2. Press `Ctrl+P` (or `Cmd+P` on Mac)
3. Print settings are automatically configured:
   - Orientation: Landscape
   - Paper size: A4
   - Each slide prints on its own page
The print styles include:
- Readable font sizes (24pt headings, 12pt body)
- High contrast black text on white background
- All fragments/animations visible
- Tables and code blocks properly formatted
### Building Presentations

```bash
# Render all presentations
quarto render presentations/

# Or render a specific presentation
quarto render presentations/week-1-recap.qmd
```
## Configuration

### LLM Connection

```bash
export OLLAMA_HOST=10.32.64.105:11434
```
### Crawler Options

```
--max-depth 2     # Crawl depth
--max-pages 500   # Max pages to crawl
```
## Documentation
Detailed documentation is available in the docs/ directory:
| Document | Description |
|---|---|
| `docs/USAGE.md` | Installation, commands, configuration |
| `docs/ARCHITECTURE.md` | System design and components |
| `docs/API.md` | API specifications |
| `docs/SPEC.md` | Project specification |
| `PROMPTS.md` | Development prompts and workflow |
| `WORKFLOW.md` | Development workflow |
| `SPEC.md` | Technical specification |
## License
GNU General Public License v2.0 - See LICENSE file.
Copyright (C) 2026 LoopAware AB
Built with ❤️ by LoopAware