Project idea: Scottish CivTech challenge


Michal Měchura

Aug 11, 2025, 10:39:43 AM
to gf-...@googlegroups.com

Hello all in the GF community,

I have an idea for an interesting thing to do with GF, hear me out.

The Scottish Government runs a programme called CivTech. This programme functions like a start-up incubator for people who want to solve important social problems through tech. Every year the organisers announce a list of “challenges” which people can “pitch” for.

This year one of the challenges is data sparsity in Scottish Gaelic. The motivation for this challenge is the worry that Gaelic and other minority languages are being left out of modern AI-driven language technologies due to a shortage of texts in these languages for training language models. The Scottish Government is looking for ideas to fill this gap.

My idea is that we build something with GF to generate a large-ish amount of synthetic texts. As you may know, synthetic data is a thing in machine learning, as a trick to fill (quantitative and qualitative) gaps in organic training data. We will do the same, not for “data” but for “texts”, NLG-style. These are the things we could do:

  • We piggyback on Abstract Wikipedia to generate lots of Wikipedia-style texts in Scottish Gaelic.

  • We do something I call pattern expansion: we start with FrameNet-style valency patterns such as ”[somebody] loves [somebody]” or ”[somebody] takes [transport] to [destination]” and we expand them into fully-formed sentences such as ”she will never stop loving her husband” and ”which train are we taking to Glasgow tomorrow?”. This way we end up with a large bank of sentences, either monolingual (Gaelic only) or bilingual (Gaelic plus English translations) or as a treebank (Gaelic plus UD dependency trees).
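
To make the pattern-expansion idea a little more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the pattern string, the slot names and the English fillers are all made up for the example, and simple string templating stands in for what a GF grammar would actually do (agreement, word order, morphology and Gaelic output are not handled here).

```python
from itertools import product
from string import Formatter

# Hypothetical valency pattern; the slot names are invented for illustration.
PATTERN = "{agent} {verb} {transport} to {destination}"

# Hypothetical fillers for each slot (English only, for readability).
FILLERS = {
    "agent": ["we", "she"],
    "verb": ["will take", "took"],
    "transport": ["the train", "the bus"],
    "destination": ["Glasgow", "Inverness"],
}


def slot_names(pattern):
    """Return the slot names of a pattern in order of appearance."""
    return [name for _, name, _, _ in Formatter().parse(pattern) if name]


def expand(pattern, fillers):
    """Expand a pattern into every sentence its fillers allow."""
    slots = slot_names(pattern)
    choices = [fillers[slot] for slot in slots]
    # The Cartesian product enumerates one filler per slot, in slot order.
    return [pattern.format(**dict(zip(slots, combo))) for combo in product(*choices)]


sentences = expand(PATTERN, FILLERS)
for sentence in sentences:
    print(sentence)
```

Even this toy pattern yields 2 × 2 × 2 × 2 = 16 sentences. With a real grammar behind it, each slot would be filled by a syntactic category rather than a fixed word list, so the output would grow much faster and stay grammatical.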

Of course these “robotic” texts will have all the downsides of synthetic data: they will lack the unbridled variety of organic texts. But they will have upsides too: they will be guaranteed to be grammatically correct (because GF), we can supply them with 100% reliable gold-standard tagging, and so on. In combination with organic texts, this could be high-protein fodder for machine learning.

Would anybody be interested in working with me on preparing a proposal like this for CivTech? The way CivTech works is:

  1. We write up our idea and, if we are selected, we are given three weeks and 5,000 GBP to explore the idea further – that’s the “Exploration Stage”.

  2. Then we resubmit a more fully worked-out proposal and, if they like it, we are given fifteen weeks and 35,000 GBP to actually develop a thing that works – that’s the “Accelerator Stage”.

  3. Finally, if our thing looks commercially viable, there’s an optional “Pre-Commercialisation” stage where the Scottish Government will invest up to 600,000 GBP to develop us as a start-up. For us specifically, this could mean developing the tech for other languages and peddling synthetic texts as a commodity to anyone who wants to buy them.

I’ve read up on the terms and conditions and it seems like a good deal to me. We don’t have to invest any of our own money, we get paid for our time, we are allowed to fail without consequences, and if we succeed we own the business 100%. All the Scottish Government wants in exchange is a perpetual licence for themselves.

The deadline for submissions to the “Exploration Stage” is 2 September – that’s in three weeks, right after the Summer School. I’m happy to drive everything and do most of the writing, and we can polish the proposal up together during the Summer School (I’m attending the second week only).

So… any thoughts, anyone?

Michal

Hamza Shezad

Aug 12, 2025, 3:44:30 AM
to gf-...@googlegroups.com
Hi Michal,

Such initiatives are especially valuable for low-resource languages like Gaelic. This idea resonates with me, as I had considered using GF for another low-resource language (Urdu) during my master’s thesis. However, I ultimately decided against it because a curated dataset of real-world Urdu sentences was available, which made the approach less necessary for my project. Moreover, as you mention, synthetic data lacks diversity.

I might have time, but only for the initial stages, so I can’t commit to anything. Out of curiosity, what sort of experience level with GF and perhaps more importantly Gaelic would you reckon be good for joining this project?

Hamza



Michal Měchura

Aug 12, 2025, 7:25:26 AM
to gf-...@googlegroups.com

> Out of curiosity, what sort of experience level with GF and perhaps more importantly Gaelic would you reckon be good for joining this project?

I guess we will need to write a resource grammar for Gaelic during the project, or some mini/midi approximation. And we will be writing application grammars on top of that resource grammar. And we will be using the PGF Python binding, or similar, to do stuff with the application grammars. So, those are the levels of GF expertise we would need. Also, someone who has a foot in the Abstract Wikipedia camp would be a good addition to the team.

For Gaelic language skills, I can “fake” a lot of that myself because I speak Irish, a language that’s similar enough. But we will need to recruit an actual Gaelic speaker later; we can figure out during the “Exploration Stage” who that should be. I’m reasonably well connected in these circles, so finding someone good probably won’t be a problem.

M.
