BertAlign API
Multilingual Sentence Alignment Service for DiScEPT Project
BertAlign API is a FastAPI-based web service for multilingual sentence alignment developed as part of the DiScEPT (Digital Scholarly Editions Platform and aligned Translations) project at the Istituto Italiano di Studi Germanici. The service uses sentence-transformers (LaBSE model) to support both plain text and TEI XML document alignment, deployed on Google Cloud Run with Docker containerization.
🎯 DiScEPT Project Context
This service is a key component of the DiScEPT project, which aims to develop a sustainable digital environment for producing and publishing Digital Scholarly Editions (DSE). DiScEPT is designed to create multilingual scholarly editions by integrating digital publishing tools with services that can align various versions of texts or entire corpora across multiple languages.
Project Coordination
- Coordinator: Hansmichael Hohenegger (IISG)
- Partners: Fabio Ciotti (Tor Vergata), Tiziana Mancinelli (IISG), Federico Boschetti (ILC), Angelo Mario Del Grosso (ILC)
- Collaborating Institutions:
- Istituto di Linguistica Computazionale “A. Zampolli” – CNR (ILC)
- Istituto per il Lessico Intellettuale Europeo e Storia delle Idee – CNR (ILIESI)
- Università Tor Vergata
- Università della Tuscia
🚀 Key Features
- Multilingual Support: Align sentences across 25 languages including English, French, German, Russian, Chinese, and more
- Semantic Alignment: Uses LaBSE (Language-agnostic BERT Sentence Embeddings) for high-quality alignments
- Flexible Alignment: Support for 1-1, 1-many, many-1, and many-many alignment patterns
- TEI XML Support: Specialized endpoint for aligning TEI documents with standOff annotations
- FastAPI Backend: Auto-generated OpenAPI documentation and request validation
- Cloud Ready: Dockerized for easy deployment on Google Cloud Run
- Fast Processing: Optimized response times ranging from 0.2-3 seconds
🌍 Supported Languages
The service supports alignment across 25 languages: Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Icelandic, Italian, Lithuanian, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, and Turkish.
🔧 Technical Implementation
Core Technology Stack
- FastAPI: Modern Python web framework for building APIs
- sentence-transformers: LaBSE model for multilingual sentence embeddings
- Docker: Containerization for consistent deployment
- Google Cloud Run: Serverless deployment platform
API Endpoints
1. Basic Text Alignment
POST /align
Aligns plain text inputs between source and target languages with configurable similarity thresholds.
2. TEI XML Document Alignment
POST /align/tei
Specialized endpoint for aligning TEI-encoded documents, preserving XML structure and generating standOff annotations.
Example Response Format
{
"alignments": [
{
"source_sentences": ["Hello world."],
"target_sentences": ["Bonjour le monde."],
"alignment_score": 0.89,
"source_indices": [0],
"target_indices": [0]
}
],
"processing_time": 0.23,
"total_source_sentences": 2,
"total_target_sentences": 2
}
🎯 Use Cases
Digital Humanities Applications
- Multilingual Digital Editions: Aligning source texts with their translations
- Comparative Literature: Analyzing translation patterns across languages
- TEI Document Processing: Creating aligned multilingual corpora with proper markup
Translation Studies
- Translation Quality Assessment: Comparing alignment patterns between different translations
- Corpus Linguistics: Building parallel corpora for linguistic analysis
- Educational Tools: Supporting language learning through aligned text pairs
🏗️ Architecture
The service is built with a microservices architecture:
- API Layer: FastAPI handles HTTP requests and response validation
- Processing Engine: sentence-transformers provides semantic embeddings
- Alignment Algorithm: Custom logic for flexible many-to-many alignments
- TEI Handler: Specialized XML processing for digital humanities workflows
📈 Performance
- Response Time: 0.2-3 seconds depending on text length
- Scalability: Horizontal scaling via Google Cloud Run
- Memory Efficient: Optimized for cloud deployment constraints
- Model Loading: Cached LaBSE model for fast inference
🔗 Integration within DiScEPT
The BertAlign API serves as a core infrastructure component within the DiScEPT ecosystem, specifically designed to support:
Digital Scholarly Editions
- Multilingual text alignment for scholarly editions across language barriers
- TEI-compliant processing maintaining XML structure and scholarly markup standards
- Translation quality assessment for scholarly publication workflows
DiScEPT Platform Integration
- Microservice architecture allowing seamless integration with other DiScEPT tools
- API-first design enabling flexible integration with editorial workflows
- Open access compatibility supporting both research and commercial publishing pathways
Research and Innovation Goals
- Social innovation: Supporting interaction between academic research and publishing industry
- Educational applications: Enabling use in secondary education and university teaching
- International collaboration: Facilitating multilingual scholarly communication
This service represents a crucial component in DiScEPT’s vision of creating a sustainable platform that bridges research, education, and publishing, providing reliable sentence alignment capabilities that support the production of high-quality multilingual digital scholarly editions while maintaining open access principles for the digital humanities community.