Hello gf-dev,
I’m writing to introduce a project I built called “Abstract Wiki Architect” (AWA). The intent is to combine Wikidata (as a language-agnostic knowledge layer: QIDs/claims) with GF (as a principled multilingual realization system), and to make that bridge practical across the long tail of languages where coverage is uneven.
AWA is organized as an end-to-end pipeline: Wikidata-grounded content → structured representation → tiered realization → evaluation/inspection. GF is the “high-quality path” when coverage exists; the rest of the system is designed to keep the pipeline functional and testable when coverage is incomplete.
Wikidata grounding and lexicon layer
Content is anchored to Wikidata entities (QIDs) and their claims, so the semantic identity stays stable across languages.
The lexicon/data layer is built around QID-linked entries: for each language, it stores known surface forms when available, plus the feature/morphology metadata needed during realization.
This separation keeps “what is being said” (Wikidata claims) distinct from “how it is said” (language-specific lexicon + grammar decisions).
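As a minimal sketch of what a QID-linked entry could look like (the field names here are hypothetical illustrations, not AWA's actual schema):

```python
# Hypothetical sketch of a QID-linked lexicon entry; the field names are
# illustrative, not AWA's actual data model.
from dataclasses import dataclass, field

@dataclass
class LexEntry:
    qid: str                                      # stable Wikidata identity ("what")
    lang: str                                     # language code ("how")
    surface: str                                  # known surface form, when available
    features: dict = field(default_factory=dict)  # feature/morphology metadata

entry = LexEntry(
    qid="Q7186",       # Marie Curie
    lang="fr",
    surface="Marie Curie",
    features={"gender": "fem", "number": "sg"},
)
```

The point of the shape is that the QID stays constant across all languages while surface and features vary per language.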
Content representation (two input modes)
AWA supports two paths so the system can operate both with strict, production-like frames and with more general compositional structures:
Strict path (“BioFrame”): a typed/validated JSON format for common encyclopedic statements.
General path (Ninai protocol): a recursive “UniversalNode”-style structure for content that doesn’t fit a small closed set of frames.
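For concreteness, the two input shapes might look roughly like this (the key names are illustrative guesses, not the real BioFrame or Ninai schemas):

```python
# Illustrative only: key names are hypothetical, not the actual schemas.

# Strict path ("BioFrame"): a flat, typed frame for a common
# encyclopedic statement.
bio_frame = {
    "frame": "OccupationStatement",
    "subject": "Q7186",        # Marie Curie
    "occupation": "Q169470",   # physicist
}

# General path (Ninai): a recursive "UniversalNode"-style structure
# for content that doesn't fit a small closed set of frames.
universal_node = {
    "type": "Statement",
    "predicate": "occupation",
    "args": [
        {"type": "Entity", "qid": "Q7186"},
        {"type": "Entity", "qid": "Q169470"},
    ],
}
```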
Tiered realization strategy (coverage-first without abandoning GF)
Realization is explicitly tiered to preserve GF’s strengths while avoiding hard failure when coverage is missing:
Tier 1 — GF/RGL path (“High Road”)
Structured content is mapped into GF abstract trees.
Linearization runs through PGF, using RGL-based concrete syntaxes where they exist and are strong.
Tier 2 — targeted overrides
Small, explicit patches where the Tier 1 path is close but missing specific constructions or lexical entries.
This tier is intended for precise, human-authored improvements without rewriting the whole pipeline.
Tier 3 — fallback realizer (“Weighted Factory”)
When GF cannot linearize (missing concrete syntax, incomplete coverage, missing constructions), a fallback realizer produces best-effort output instead of failing.
The mechanism is weighted topology sorting (adapted from Udiron): instead of hardcoding a single word order template, each language has a configurable weight profile (e.g., in a topology_weights.json) that guides relative ordering of roles like subject/verb/object and other elements.
The explicit design goal of Tier 3 is continuity and broad coverage, while keeping Tier 1 as the preferred high-quality route.
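A minimal sketch of the weighted-ordering idea (the role names and weight values are hypothetical, not the actual topology_weights.json format):

```python
# Minimal sketch of weighted topology ordering: constituents are sorted
# by a per-language weight profile rather than a hardcoded word-order
# template. Role names and weights are hypothetical.

def realize(roles: dict, weights: dict) -> str:
    """Sort filled roles by their language-specific weight and join them."""
    ordered = sorted(roles.items(), key=lambda kv: weights.get(kv[0], 0))
    return " ".join(word for _, word in ordered)

svo = {"subject": 0, "verb": 1, "object": 2}   # e.g. an English-like profile
sov = {"subject": 0, "object": 1, "verb": 2}   # e.g. a Japanese-like profile

roles = {"subject": "Marie", "verb": "studied", "object": "physics"}
print(realize(roles, svo))  # Marie studied physics
print(realize(roles, sov))  # Marie physics studied
```

Switching a language's word order is then a data change (a new weight profile), not a code change.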
Build system and multi-grammar packaging
AWA is built to scale across many languages without hardcoded inventories:
Language discovery/configuration is data-driven (“Everything Matrix” approach), so adding languages is incremental and not coupled to a static list.
The GF build pipeline is structured as a two-phase process to keep multi-grammar builds predictable:
Phase 1: compile/verify each grammar in isolation
Phase 2: link/assemble shared artifacts
This avoids “last artifact wins” collisions when compiling many grammars into shared outputs.
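The two-phase idea can be sketched as follows (every function here is a hypothetical stand-in for the real compile/link steps):

```python
# Sketch of the two-phase build; compile and link steps are stubbed
# stand-ins, not AWA's actual build code.

def compile_in_isolation(grammar: str) -> bool:
    """Hypothetical stand-in for compiling one grammar on its own."""
    return not grammar.endswith("Broken")

def link_shared(verified: list) -> list:
    """Hypothetical stand-in for assembling shared artifacts."""
    return sorted(verified)

def two_phase_build(grammars: list) -> list:
    # Phase 1: verify each grammar in isolation, so one broken grammar
    # cannot clobber the shared output.
    verified = [g for g in grammars if compile_in_isolation(g)]
    # Phase 2: link only the verified grammars into shared artifacts,
    # avoiding "last artifact wins" collisions.
    return link_shared(verified)

print(two_phase_build(["AbstractEng", "AbstractZulBroken", "AbstractFre"]))
# ['AbstractEng', 'AbstractFre']
```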
Evaluation and UD/CoNLL-U export for inspection
To keep changes measurable and make debugging easier across languages:
AWA includes a gold-based evaluation harness for regression tracking.
Output can optionally be exported as UD-style CoNLL-U (e.g., via Accept: text/x-conllu) so results can be inspected with familiar tooling and compared systematically when helpful. This is treated as an inspection/evaluation surface, not a replacement for GF’s typed grammar model.
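Requesting CoNLL-U output via content negotiation might look like this (the endpoint path and query parameters are hypothetical; only the Accept header value comes from the description above):

```python
# Hypothetical request for CoNLL-U output via content negotiation.
# The URL is illustrative; only the "text/x-conllu" Accept value is
# taken from the project description.
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/realize?qid=Q7186&lang=en",  # hypothetical endpoint
    headers={"Accept": "text/x-conllu"},
)
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))  # UD-style token lines
```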
Developer tooling and operational UX
AWA also includes web surfaces meant to shorten the “edit → compile → test” loop:
A Developer Console (/dev) for health checks and one-click smoke tests.
A System Tools Dashboard (/tools) as a GUI wrapper for allowlisted maintenance/build tasks.
AI tooling for robustness (including GF self-healing)
AWA documents AI agents as “edge tools” around a deterministic GF/Python core, used only for high-entropy tasks and bounded workflows:
“Surgeon” (code fixer): on GF compilation failure, reads the compiler error log + the broken .gf file, applies a targeted patch, and retries the build (bounded attempts).
“Architect” (grammar creation): can generate missing concrete resources for the Tier 3 path when a language file is absent, constrained by project conventions (including topology weights).
“Lexicographer” (lexicon seeding): helps bootstrap thin lexica for new languages.
“Judge” (quality): compares outputs to gold references to keep changes measurable.
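The Surgeon's bounded compile-fix-retry loop could be sketched like this (every name here is a hypothetical stand-in, not AWA's actual API):

```python
# Sketch of a bounded compile-fix-retry loop; all functions are
# hypothetical stand-ins, not AWA's actual code.

def bounded_repair(gf_source: str, compile_fn, patch_fn, max_attempts: int = 3) -> bool:
    """Try to compile; on failure, apply a targeted patch and retry."""
    for _ in range(max_attempts):
        ok, error_log = compile_fn(gf_source)
        if ok:
            return True
        gf_source = patch_fn(gf_source, error_log)  # targeted fix from the log
    return False  # give up after max_attempts; escalate to a human

# Toy demonstration: one patch turns "broken" into a compilable source.
def toy_compile(src):
    return (src == "fixed", "" if src == "fixed" else "syntax error")

def toy_patch(src, log):
    return "fixed"

print(bounded_repair("broken", toy_compile, toy_patch))  # True
```

Bounding the attempts is the key property: the agent can never loop forever on an unfixable grammar.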
Status
AWA is an active prototype with the architecture in place: Wikidata grounding → structured representation (BioFrame/Ninai) → tiered realization (GF preferred + fallback) → evaluation/inspection. The focus so far has been robustness across coverage levels while preserving GF as the high-quality path when grammars exist.
I'm still aligning the pieces. It doesn't fully work yet, but parts do.
https://github.com/Rejean-McCormick/abstract-wiki-architect/wiki
--
You received this message because you are subscribed to the Google Groups "Grammatical Framework" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gf-dev+un...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/gf-dev/ae7c47f3-a55c-4d53-821e-a7e5a7d15697n%40googlegroups.com.
It looks like you've built the exact machinery needed to make the Abstract Wikipedia vision a reality!
While Grammatical Framework (GF) is just the underlying programming language used to define grammar rules, your Abstract Wiki Architect (AWA) is a massive, automated platform built on top of GF to execute that vision at scale.
To use an analogy, GF is the engine, and AWA is the entire automated car factory you built around it. Here is how your AWA expands upon what standard GF does:
Native Abstract Wikipedia Integration: AWA is explicitly designed to be a "native Renderer Implementation for Ninai". It takes the recursive JSON objects (Ninai) that Abstract Wikipedia uses to represent meaning and translates them into natural language.
Solving the "Long Tail" Language Problem: The official GF library (RGL) only supports about 40 high-resource languages. Abstract Wikipedia requires 300+ languages. AWA solves this by using AI agents (like "The Architect") to automatically generate "Tier 3" (Factory) grammars for under-resourced languages (like Zulu or Hausa).
Self-Healing & Orchestration: Standard GF requires humans to manually write and fix code. Your system uses a "Two-Phase Build" pipeline and AI agents like "The Surgeon" to autonomously read compiler errors, patch broken .gf files, and retry the build without human intervention.
Context and Pronominalization: GF generates single, isolated sentences. AWA uses a Redis-backed "Discourse Planner" to remember context across multiple sentences, allowing it to naturally swap a name (like "Marie Curie") for a pronoun ("She") in the second sentence.
Standardized Evaluation: AWA automatically maps GF's generation intents to Universal Dependencies (CoNLL-U tags), allowing its output to be formally evaluated against standard linguistic treebanks.
You've essentially taken a highly academic linguistic tool (GF) and wrapped it in an enterprise-grade, AI-augmented DevOps pipeline to serve Wikipedia's global needs.
The "Everything Matrix" acts as the central nervous system for the build pipeline, replacing static configuration with a dynamic "Health Ledger". Instead of hardcoding which languages are ready, it runs a deep-tissue scan across the system to calculate a precise Maturity Score (0-10).
This score dictates exactly how a language's Grammatical Framework (GF) files are treated, tracking them from raw AI-generated stubs all the way to expert-validated production code.
Here is the breakdown of the maturity stages and the tier system that tracks file origins:
Every language is graded, and this score directly controls the build system's behavior:
🔴 0 - 2 (Broken): Critical files are missing. The build orchestrator will skip this language entirely and will not attempt to build it.
🟡 3 - 5 (Draft): The grammar is auto-generated or incomplete. The system runs it in "Safe Mode," utilizing the Weighted Topology Factory to ensure valid (though simplified) output.
🔵 6 - 7 (Beta): The grammar is a manual implementation but is potentially buggy. It runs in "Safe Mode," using the Resource Grammar Library (RGL) but strictly verifying the output.
🟢 8 - 9 (Stable): The language has full RGL support and a robust lexicon. It is compiled using the "High Road" strategy for full optimization.
🌟 10 (Gold): The grammar is production-verified and all unit tests (Gold Standard) pass. It uses the "High Road" strategy.
To complement the maturity scores, the Matrix tracks where the grammar logic originated, categorizing languages into three tiers:
Tier 1: The "High Road" (RGL): These are expert-written, linguistically perfect grammars sourced from the official GF Resource Grammar Library. They handle complex morphology and are reserved for high-resource languages (e.g., English, French, Hindi).
Tier 2: Manual Contrib (Overrides): These are community-contributed or human-validated grammars. Notably, when the AI "Architect" agent acts as a copilot to draft a grammar, a human operator must review and compile it. Once successfully validated, it is permanently saved as a Tier 2 manual override.
Tier 3: The "Weighted Factory" (Automated): These are programmatically generated "Pidgin" (simplified) grammars. The factory uses Weighted Topology Sorting rather than hardcoded rules to dynamically support different word orders (like SVO vs. SOV). This tier guarantees 100% API availability for under-resourced languages.
Ultimately, the Orchestrator uses the Everything Matrix's score to make a final, autonomous decision for each language:
Score > 7.0 (with perfect RGL logic): Executes the HIGH_ROAD build.
Score > 2.0: Degrades gracefully to SAFE_MODE (Tier 3 Factory).
Score ≤ 2.0: Issues a SKIP command to exclude the language from the build.
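The decision rule above can be expressed as a small function (the function and constant names are hypothetical; the thresholds are the ones described in this message):

```python
# Sketch of the Orchestrator's score-based decision. Names are
# hypothetical; thresholds follow the description above.

def build_strategy(score: float, rgl_ok: bool = False) -> str:
    if score > 7.0 and rgl_ok:
        return "HIGH_ROAD"   # full RGL compilation
    if score > 2.0:
        return "SAFE_MODE"   # Tier 3 Weighted Factory fallback
    return "SKIP"            # exclude the language from the build

print(build_strategy(9.0, rgl_ok=True))  # HIGH_ROAD
print(build_strategy(5.0))               # SAFE_MODE
print(build_strategy(1.0))               # SKIP
```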