Historical OCR Analysis

LLM Performance Visualization Platform for 18th-Century Russian Documents

Historical OCR Analysis is a web application for analyzing and visualizing OCR performance on 18th-century Russian documents printed in Civil Font. It shows in detail how modern Large Language Models handle the challenges of historical text recognition.

🎯 Research Context

This application supports the research paper “Why and Where LLMs Still Go Wrong: Gaps in Historical Linguistic Competence” by providing an interactive platform to explore OCR errors and model performance across a dataset of 1,029 historical Russian documents from 1752-1801.

📊 Dataset Overview

Scale and Scope

  • 1,029 images from 428 books
  • Time Period: 1752-1801 (18th-century Russian Civil Font)
  • Content Coverage: 28,662 text lines, 933K characters
  • Subject Distribution: Fiction, science, religion, history, geography, drama, education, poetry
  • Difficulty Range: Easy to hard, with the majority classified as medium difficulty

Historical Significance

  • Peak Coverage: 1780s-1790s, the decades most heavily represented in the dataset and a period of flourishing Russian literary output
  • Civil Font: The typeface that marks the transition from Church Slavonic to modern Russian typography
  • Literary Genres: Comprehensive coverage of 18th-century Russian literary production

🔧 Key Features

Analysis View (Dataset-Wide Insights)

  • Comprehensive Statistics: Overview of the 428 source books and their distribution over time
  • Performance Metrics: Model comparison with top-3 highlighting across evaluation metrics
  • Error Analysis: Detailed breakdown of character substitutions and foreign character insertions
  • Subject Analysis: Performance patterns across different literary genres
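
The two error categories named above, character substitutions and foreign character insertions, can be illustrated with a short sketch. It assumes that a "foreign" character is any letter outside the Cyrillic script, and it uses Python's standard difflib to find one-for-one substitutions. Both choices, and the function names, are assumptions made for illustration; the source does not describe how the platform computes these counts.

```python
import difflib
import unicodedata
from collections import Counter


def is_foreign(ch: str) -> bool:
    """Assumption: a 'foreign' character is a letter whose Unicode name is not Cyrillic."""
    return ch.isalpha() and "CYRILLIC" not in unicodedata.name(ch, "")


def substitutions(truth: str, output: str) -> Counter:
    """Count (expected, produced) pairs taken from equal-length replaced spans."""
    pairs = Counter()
    matcher = difflib.SequenceMatcher(None, truth, output, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map cleanly to character pairs.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.update(zip(truth[i1:i2], output[j1:j2]))
    return pairs


def foreign_characters(output: str) -> Counter:
    """Count every non-Cyrillic letter that appears in a model's output."""
    return Counter(ch for ch in output if is_foreign(ch))


if __name__ == "__main__":
    # A modernized spelling: the historical yat is replaced by a modern letter.
    print(substitutions("лѣсъ", "лесъ"))
    # A Latin "i" used where the Cyrillic "і" was expected.
    print(foreign_characters("мiръ"))
```

Unequal-length replacements are skipped here because they cannot be paired character by character; a fuller analysis would need a proper edit-distance alignment.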

Document View (Individual Analysis)

  • Interactive Image Viewer: Zoom, pan, click-to-select with ALTO XML overlay
  • Line-by-Line Comparison: Ground truth vs. model outputs
  • Smart Model Comparison: 3-column layout with intelligent line matching
  • Real-Time Metrics: Character preservation, case sensitivity, word accuracy
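
The case sensitivity and word accuracy metrics listed above could be computed along the following lines. This is a minimal sketch under stated assumptions: the character score is difflib's similarity ratio (not a formal character error rate), word accuracy is the share of ground-truth words matched in order, and all function names are hypothetical rather than taken from the platform.

```python
import difflib


def char_similarity(truth: str, output: str, ignore_case: bool = False) -> float:
    """Similarity in [0, 1] from difflib; a stand-in for a character accuracy score."""
    if ignore_case:
        truth, output = truth.casefold(), output.casefold()
    return difflib.SequenceMatcher(None, truth, output, autojunk=False).ratio()


def case_penalty(truth: str, output: str) -> float:
    """Score lost purely to letter-case differences."""
    return char_similarity(truth, output, ignore_case=True) - char_similarity(truth, output)


def word_accuracy(truth: str, output: str) -> float:
    """Fraction of ground-truth words that appear, in order, in the output."""
    truth_words, output_words = truth.split(), output.split()
    if not truth_words:
        return 1.0 if not output_words else 0.0
    matcher = difflib.SequenceMatcher(None, truth_words, output_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


if __name__ == "__main__":
    truth = "Въ началѣ было"
    output = "въ началѣ было"  # identical except for the first capital letter
    print(char_similarity(truth, output))
    print(case_penalty(truth, output))
    print(word_accuracy(truth, output))
```

Reporting the case-insensitive score next to the case-sensitive one separates genuine misreadings from capitalization differences.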

🤖 Models Evaluated

The platform analyzes performance across 12 state-of-the-art language models:

OpenAI Family

  • GPT-4.1, GPT-4o, o3, o4-mini

Google Family

  • Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.0 Flash

Anthropic Family

  • Claude 3.5 Sonnet, Claude 3.7 Sonnet

Meta Family

  • Llama 4 Maverick, Llama 4 Scout

Alibaba

  • Qwen 2.5 VL

🔍 Key Research Findings

Historical Character Preservation

  • Gemini models demonstrate superior preservation of historical Russian characters (ѣ, і, ъ)
  • Character-specific challenges: Old orthography poses consistent difficulties across model families
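
One simple way to quantify preservation of the historical letters is to compare how often each letter occurs in the ground truth and in a model's output. The sketch below assumes a count-based measure; it does not align the two texts, so it cannot tell whether a letter survived in the right position. The function name and the measure itself are illustrative assumptions, not the paper's definition.

```python
from collections import Counter

# The historical letters named in the findings above.
HISTORICAL_CHARS = "ѣіъ"


def preservation_rates(truth: str, output: str, chars: str = HISTORICAL_CHARS) -> dict:
    """Per letter: output count divided by ground-truth count, capped at 1.0."""
    truth_counts = Counter(truth.lower())
    output_counts = Counter(output.lower())
    rates = {}
    for ch in chars:
        expected = truth_counts[ch]
        if expected:  # skip letters that never occur in the ground truth
            rates[ch] = min(output_counts[ch], expected) / expected
    return rates


if __name__ == "__main__":
    # The output modernizes the spelling and so drops most historical letters.
    print(preservation_rates("въ лѣсъ міръ", "в лесъ мир"))
```

A rate of 1.0 means the letter occurs at least as often in the output as in the ground truth, and lower values indicate that the model tended to modernize the spelling.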

Performance Hierarchy

  • Latest generation models show 5-15% accuracy improvements over earlier versions
  • Model family differences: Distinct strengths in different aspects of historical text processing

Error Pattern Analysis

  • Case sensitivity remains a primary challenge across all models
  • Foreign character insertion shows systematic patterns
  • Dataset complexity: 880 medium-difficulty and 146 hard documents pose substantial OCR challenges

💻 Technology Stack

Frontend Architecture

  • React 18 with Vite for modern development experience
  • Tailwind CSS for responsive and maintainable styling
  • Recharts for interactive data visualization

Image Processing

  • Canvas API for ALTO XML overlays and annotations
  • Smart line matching algorithms for accurate model comparison
  • Interactive zoom/pan for detailed document inspection
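
The source does not describe the line matching algorithm, so the following is only one plausible approach: pair each ground-truth line with the most similar unused output line, and leave it unmatched when nothing is similar enough. The greedy strategy, the 0.5 threshold, and the function name are all assumptions made for illustration.

```python
import difflib


def match_lines(truth_lines, output_lines, threshold=0.5):
    """Pair each ground-truth line with its most similar unused output line.

    Returns a list of (truth_index, output_index or None) tuples.
    """
    used = set()
    pairs = []
    for ti, truth in enumerate(truth_lines):
        best_index, best_score = None, threshold
        for oi, output in enumerate(output_lines):
            if oi in used:
                continue
            score = difflib.SequenceMatcher(None, truth, output, autojunk=False).ratio()
            if score > best_score:
                best_index, best_score = oi, score
        if best_index is not None:
            used.add(best_index)
        pairs.append((ti, best_index))
    return pairs


if __name__ == "__main__":
    truth = ["въ началѣ", "сотвори богъ", "небо и землю"]
    # The model skipped one line, reordered the others, and modernized a letter.
    output = ["небо и землю", "въ начале"]
    print(match_lines(truth, output))
```

Greedy matching is easy to follow but can mispair lines that resemble each other; an optimal assignment or an order-preserving alignment would be more robust for pages with repetitive text.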

Deployment

  • GitHub Pages with automated CI/CD pipeline
  • Optimized builds for fast loading and performance

📈 Visualization Capabilities

Statistical Dashboards

  • Performance metrics with comparative analysis
  • Error distribution across different document types
  • Temporal analysis showing historical period coverage
  • Model comparison with highlighting of top performers
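
Highlighting top performers amounts to ranking models separately for each metric. The sketch below shows one way to do that; the function name is hypothetical, and the model names and scores in the example are invented placeholders, not results from the study.

```python
def top_k_per_metric(scores, k=3, higher_is_better=True):
    """Map {metric: {model: value}} to {metric: [models ranked best-first, at most k]}."""
    return {
        metric: sorted(by_model, key=by_model.get, reverse=higher_is_better)[:k]
        for metric, by_model in scores.items()
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example_scores = {
        "word_accuracy": {"model_a": 0.91, "model_b": 0.87, "model_c": 0.95, "model_d": 0.60},
    }
    print(top_k_per_metric(example_scores))
```

For error-style metrics where lower values are better, pass higher_is_better=False so the ranking is reversed.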

Interactive Document Analysis

  • ALTO XML overlay showing recognition boundaries
  • Side-by-side comparison of multiple model outputs
  • Character-level accuracy visualization
  • Error highlighting with detailed explanations
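
To draw recognition boundaries, an overlay first needs the bounding box of each text line from the ALTO XML. The sketch below extracts them with Python's standard library. It relies on the ALTO elements TextLine and String and the attributes HPOS, VPOS, WIDTH, HEIGHT, and CONTENT; the function names, the sample document, and the assumption that coordinates are in pixels are illustrative and not taken from the platform's code.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def text_line_boxes(alto_xml: str):
    """Return one dict per TextLine with its bounding box and joined text."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element.iter()
            if local_name(child.tag) == "String"
        ]
        lines.append({
            "x": float(element.get("HPOS", 0)),
            "y": float(element.get("VPOS", 0)),
            "width": float(element.get("WIDTH", 0)),
            "height": float(element.get("HEIGHT", 0)),
            "text": " ".join(words),
        })
    return lines


# A tiny hand-written ALTO fragment used only for demonstration.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine HPOS="120" VPOS="340" WIDTH="900" HEIGHT="48">
      <String CONTENT="въ"/><SP/><String CONTENT="началѣ"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for box in text_line_boxes(SAMPLE):
        print(box)
```

Each resulting box can then be scaled by the viewer's current zoom factor and drawn as a rectangle over the page image.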

🎓 Academic Impact

This platform enables:

Research Applications

  • Systematic evaluation of LLM performance on historical texts
  • Error pattern analysis for improving OCR accuracy
  • Dataset insights for historical corpus development
  • Model comparison for selecting optimal tools

Educational Use

  • Interactive exploration of OCR challenges
  • Visual demonstration of technological limitations
  • Historical linguistics teaching resource
  • Digital humanities methodology showcase

🌟 Innovation Highlights

  • First comprehensive platform for LLM OCR analysis on historical Russian texts
  • Interactive visualization of complex model performance data
  • Real-time analysis with immediate visual feedback
  • Scalable architecture for additional historical corpora
  • Open source approach enabling community contribution

This platform contributes to digital humanities research by providing tools for systematically analyzing how modern AI models handle historical text recognition, with a particular focus on preserving cultural and linguistic heritage through accurate digitization.