ALTO-to-TEI Converter
Document Processing Toolkit for eScriptorium Output
ALTO-to-TEI Converter is a comprehensive Python toolkit for converting eScriptorium ALTO XML files to TEI (Text Encoding Initiative) format, with full support for the Segmonto ontology for document structure classification. This tool bridges the gap between OCR/HTR output and structured scholarly markup.
π― Two Conversion Modes
π Page-Level Converter (alto2tei.py)
- Individual file processing: Converts ALTO files to corresponding TEI files
- Structure preservation: Maintains original document layout and line breaks
- Small collections: Ideal for analyzing individual pages or focused document sets
π Book-Level Converter (alto2teibook.py)
- Complete book conversion: Uses METS.xml for proper page ordering
- Advanced text processing: Cross-page paragraph merging with hyphenation handling
- Unified output: Generates single TEI document with optional facsimile zones
- Machine-readable texts: Perfect for computational analysis and digital editions
ποΈ Core Architecture Principles
1. Segmonto Ontology Compliance
The converter implements the Segmonto ontology - a controlled vocabulary for semantic annotation of historical document layouts:
- Zone Classification: Standardized zone types (
MainZone
,MarginZone:note
,GraphicZone
) - Hierarchical Names: Colon-separated classification system for precise semantic markup
- Semantic Consistency: Maintains relationships between document regions across manuscripts
2. YAML-Driven Configuration
All transformation rules are externalized in YAML files, enabling:
- Maintainability: Updates without code changes
- Flexibility: Support for different document types and ontologies
- Transparency: Clear, readable transformation rules
- Extensibility: Simple addition of new zone and line types
3. TEI Best Practices
Generates standards-compliant TEI XML with:
- Semantic Markup: Meaningful element names reflecting content structure
- Metadata Preservation: Page numbers, image references, document hierarchy
- Poetry Handling: Proper
<lg>
and<l>
element grouping - Footnote Management: Structured annotation with automatic symbol recognition
π§ Document Processing Pipeline
ALTO XML β Parse Tags β Classify Zones β Process Lines β Generate TEI
β β β β β
eScriptorium Segmonto YAML Rules Line Types TEI XML
π Zone Classification System
Zone Type | Segmonto Category | Processing | TEI Output |
---|---|---|---|
MainZone | Primary text content | Process all lines | <p> , <lg> , <head> |
MarginTextZone:note | Footnote annotations | Extract as footnotes | <note type="footnote"> |
MarginTextZone:outer | Outer margin content | Process lines | <p> or specialized elements |
NumberingZone | Page numbering | Extract page numbers | <pb n="..." facs="..."> |
RunningTitleZone | Running headers/titles | Process lines | <p> with context |
QuireMarksZone | Quire markings | Process lines | <p> with context |
π€ Line Type Processing
Line Type | Purpose | TEI Element | Container |
---|---|---|---|
HeadingLine | Section headings | <head> | Standalone |
CustomLine:paragraph_start | Begin paragraph | <p> | Auto-closed |
CustomLine:verse | Poetry lines | <l> | <lg type="verse"> |
CustomLine:speaker | Speaker names | <speaker> | Standalone |
CustomLine:catchword | Page catchwords | <fw type="catchword"> | Standalone |
CustomLine:signature | Technical marks | <fw type="signature"> | Standalone |
DefaultLine | Regular text | Add to <p> | Current paragraph |
βοΈ Configuration Example
block_types:
MainZone:
description: "Main text content blocks"
process_lines: true
skip_content: false
"MarginTextZone:note": # Segmonto hierarchical naming
description: "Footnote blocks"
process_lines: false
extract_footnote: true
tei_element: "note"
attributes:
type: "footnote"
line_types:
"CustomLine:paragraph_start":
description: "Beginning of paragraph"
action: "start_paragraph"
tei_element: "p"
closes:
- "poetry"
"CustomLine:verse":
description: "Poetry lines"
tei_element: "l"
container: "lg"
container_attributes:
type: "verse"
π Usage Examples
Page-Level Conversion
# Convert all ALTO files in 'alto' folder to 'tei' folder
python alto2tei.py
# Specify input and output folders
python alto2tei.py manuscripts output_tei
# Use custom configuration
python alto2tei.py --config custom_mapping.yaml
Book-Level Conversion
# Convert entire book using METS.xml for page ordering
python alto2teibook.py alto_book/
# Enable cross-page paragraph merging with facsimile zones
python alto2teibook.py alto_book/ --merge-lines True --facsimile True
# Process without facsimile for pure text output
python alto2teibook.py alto_book/ --merge-lines True --facsimile False
π Real-World Performance
Based on processing historical manuscript collections:
Block Types (6 actively used):
- MainZone: 100 files (102 instances) - Primary content
- NumberingZone: 89 files (91 instances) - Page numbers
- GraphicZone: 32 files (35 instances) - Graphics/decorations
- MarginTextZone:outer: 6 files (8 instances) - Margin content
- RunningTitleZone: 5 files (5 instances) - Running headers
- MarginTextZone:note: 4 files (4 instances) - Footnotes
Line Types (6 actively used):
- DefaultLine: 100 files (2,775 instances) - Regular text
- CustomLine:catchword: 75 files (75 instances) - Page catchwords
- CustomLine:verse: 2 files (50 instances) - Poetry lines
- CustomLine:paragraph_start: 7 files (30 instances) - Paragraph starts
- HeadingLine: 3 files (13 instances) - Section headers
- CustomLine:signature: 7 files (7 instances) - Signatures
π Key Features
- Segmonto Ontology Integration: Full support for hierarchical document structure classification
- Flexible Configuration: YAML-driven rules for different manuscript types
- Cross-Page Processing: Advanced paragraph merging across page boundaries
- Hyphenation Handling: Intelligent word reconstruction across page breaks
- TEI Compliance: Standards-compliant output with proper semantic markup
- Extensible Architecture: Easy addition of new document types and processing rules
This toolkit serves as essential infrastructure for digital humanities projects working with historical documents, providing a robust bridge between automated transcription tools and scholarly digital editions.