The Yocto Jester

Online home of that Mender, Yocto Project and OpenEmbedded guy

View My GitHub Profile

Extracting Text from Scanned PDFs with Marker

January 6, 2026

I recently needed to extract text from a 30 MB scanned book PDF (German language). After some research, I settled on marker-pdf, an ML-based tool that converts PDFs directly to Markdown with good structure preservation. This post documents the setup process on both macOS and Windows with NVIDIA GPU acceleration.

Why Marker?

For scanned documents, you need OCR (Optical Character Recognition). The two main local approaches are:

  1. Tesseract pipeline (traditional OCR) — Mature, lightweight, but outputs plain text without structure. Markdown formatting requires post-processing.

  2. Marker (ML-based) — Uses deep learning models for layout detection and text recognition. Outputs Markdown with preserved headings, paragraphs, lists, and tables. Heavier dependencies but better results for structured documents.

Marker supports 90+ languages (including German), extracts images, and can optionally use an LLM to improve accuracy.

macOS Setup

Prerequisites

Marker requires Python 3.10-3.12. Python 3.13+ is too new and will cause compatibility issues with dependencies.

If you’re using Homebrew and have a newer Python as default:

# Install Python 3.12 (won't affect your default python3)
brew install python@3.12

# Create venv using that specific version
/opt/homebrew/opt/python@3.12/bin/python3 -m venv ~/marker-env
source ~/marker-env/bin/activate

# Verify version
python --version

Installation

pip install marker-pdf

The first run will download several GB of ML models.

Basic Usage

marker_single /path/to/your/book.pdf --output_dir ./output --force_ocr

Key flags:

Note: On macOS without NVIDIA GPU, processing runs on CPU (or MPS on Apple Silicon). Expect slower performance compared to CUDA-accelerated systems.

Windows 11 Setup with NVIDIA GPU

Prerequisites

  1. Install Python 3.12 from python.org (not the Microsoft Store version)
  2. Check “Add to PATH” during installation
  3. Note your CUDA version: run nvidia-smi in cmd

Installation (Order Matters!)

The installation order is important. Install PyTorch with CUDA before marker-pdf to avoid dependency conflicts.

:: Create and activate venv
python -m venv C:\marker-env
C:\marker-env\Scripts\activate.bat

PowerShell Note: If using PowerShell and you get an execution policy error, either:

  • Use cmd.exe with activate.bat instead, or
  • Run as Admin: Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
:: Install PyTorch with CUDA FIRST (for CUDA 12.x)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

:: Then install marker
pip install marker-pdf

Fixing Dependency Conflicts

Marker’s installation may downgrade torch or cause version mismatches. After installation, check:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If CUDA is not available (False), reinstall PyTorch:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121 --force-reinstall

You may also need to pin Pillow to a compatible version:

pip install "pillow<11.0.0"

Dependency warnings about version mismatches are often not fatal—test the actual conversion before troubleshooting further.

Verify GPU Acceleration

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

Expected output: 2.5.1+cu121 True (or similar with True for CUDA)

Usage

marker_single "C:\path\to\book.pdf" --output_dir .\output --force_ocr

Output Structure

Marker creates a directory containing:

Handling Extracted Images

Marker extracts all images, including decorative elements like headers, footers, and separators. Options to clean this up:

Option 1: Disable Image Extraction

marker_single ... --disable_image_extraction

Option 2: Post-Process by Size/Aspect Ratio

Decorative elements tend to be small or have extreme aspect ratios. Simple cleanup:

# Delete images smaller than 10KB
find ./output -name "*.png" -size -10k -delete

More sophisticated filtering (Python):

from pathlib import Path
from PIL import Image

output_dir = Path("./output")
for img_path in output_dir.rglob("*.png"):
    with Image.open(img_path) as img:
        w, h = img.size
        aspect = max(w, h) / min(w, h) if min(w, h) > 0 else 999
        area = w * h
        
        # Remove if: tiny, or extremely wide/tall (likely decorative)
        if area < 5000 or aspect > 8:
            img_path.unlink()
            print(f"Removed: {img_path.name}")

Tune thresholds based on your specific document.

Common Issues

Problem Solution
Python 3.13+ compatibility errors Use Python 3.10-3.12
--langs flag not recognized Removed in recent versions; language detection is automatic
CUDA not available after install Reinstall torch with --force-reinstall from CUDA index
Pillow version conflict pip install "pillow<11.0.0"
PowerShell blocks activate script Use cmd.exe or change execution policy
Out of memory Use --page_range to process in chunks

Advanced Options

See marker_single --help for all options.

Performance Notes

Resources


This article was drafted with assistance from Claude.