weathercan: https://github.com/ropensci/weathercan/pull/197 and https://github.com/ropensci/weathercan/pull/193
stopwords‑iso: https://github.com/stopwords-iso/stopwords-iso/pull/29
stopwords‑bn: https://github.com/stopwords-iso/stopwords-bn/pull/2
I am very interested in applying to GSoC 2026 with the R project and would like to propose BornoR: Bangla text cleaning, tokenization, and lexical resources for R. Several Python toolkits exist for Bangla NLP, but to my knowledge there is no cohesive, well‑documented R package that provides standard Bangla text preprocessing, tokenization, and lexicons. This makes it difficult for applied researchers in Bangladesh and elsewhere in South Asia, many of whom work primarily in R, to analyze Bangla text in a reproducible way.
My proposed scope for BornoR during GSoC is:
1. Text cleaning and stopwords
   - Unicode‑aware normalization for Bangla digits, punctuation, and common spacing issues.
   - Curated Bangla stopword lists (e.g., standard, formal, social‑media oriented variants).
   - Simple R helpers for stopword removal and frequency/keyword exploration.
2. Tokenization and integration with R text workflows
   - Functions that return:
     - quanteda::tokens and dfm objects for Bangla text, and
     - tidytext‑compatible tibbles for token‑level analysis.
3. Bangla text resources packaged for R
   - Packaged Bangla lexical resources such as:
     - stopword lists,
     - a small sentiment or polarity lexicon,
     - a small profanity or “toxic words” list, and
     - sample benchmark corpora for examples and testing.
4. Package infrastructure and documentation
   - Unit tests, CI, and basic benchmarks on real Bangla corpora.
   - 2–3 vignettes demonstrating end‑to‑end workflows (cleaning + tokenization + analysis).
   - Preparing the package for CRAN submission by the end of the project.
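To make the first two items more concrete, here is a minimal base-R sketch of the kind of normalization, tokenization, and stopword removal I have in mind. The function names (bn_normalize, bn_tokenize), the choice to map Bangla digits to ASCII digits, and the two-word stopword list are placeholder assumptions for illustration only, not a final API or a curated resource.

```r
# Illustrative sketch only: names and behaviour are hypothetical,
# not taken from any published package.

# Map Bangla digits (U+09E6 to U+09EF) to ASCII digits, collapse runs of
# whitespace to a single space, and trim both ends.
# Assumption: ASCII digits are the normalization target.
bn_normalize <- function(x) {
  x <- vapply(x, function(s) {
    cp <- utf8ToInt(s)                                  # code points of s
    is_bn_digit <- cp >= 0x09E6 & cp <= 0x09EF
    cp[is_bn_digit] <- cp[is_bn_digit] - 0x09E6 + 0x30  # shift to "0".."9"
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

# Very basic tokenizer for a single string: treat the danda (U+0964)
# as a separator, normalize, then split on spaces.
bn_tokenize <- function(x) {
  x <- gsub("\u0964", " ", x, fixed = TRUE)
  strsplit(bn_normalize(x), " ", fixed = TRUE)[[1]]
}

# Placeholder stopword list: "ami" (I) and "ebong" (and).
bn_stopwords <- c("\u0986\u09ae\u09bf", "\u098f\u09ac\u0982")

# Sample sentence, roughly "I will apply in 2026.", written with \u escapes
# so the script does not depend on the file encoding. It contains a double
# space, Bangla digits, and a trailing danda.
txt <- paste0(
  "\u0986\u09ae\u09bf  \u09e8\u09e6\u09e8\u09ec ",
  "\u09b8\u09be\u09b2\u09c7 \u0986\u09ac\u09c7\u09a6\u09a8 ",
  "\u0995\u09b0\u09ac\u0964"
)

tokens <- bn_tokenize(txt)
kept   <- tokens[!tokens %in% bn_stopwords]
print(kept)
```

The actual package would return quanteda or tidytext objects rather than a plain character vector; this sketch only shows the underlying steps.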
I am currently drafting the package design and plan to publish a minimal BornoR prototype on GitHub in the next 1–2 weeks (starting with normalization, stopwords, and a basic tokenizer) so that potential mentors can review and comment on an early implementation.
I am looking for two mentors (a primary and a secondary) who would be interested in supervising this project under the R organization for GSoC 2026. I would also welcome advice on refining the scope or level of ambition to better match R’s priorities and GSoC expectations. I am happy to adjust the deliverables to fit either a 175‑hour or a 350‑hour project.
Thank you very much for considering this idea. I would greatly appreciate any feedback or suggestions.
Best regards,
Mahjabin Oyshi
GitHub: https://github.com/mahjabinoyshi