Looking for mentors – BornoR: Bangla text cleaning, tokenization, and resources for R (GSoC 2026 idea)


Mahjabin Oyshi

Feb 26, 2026, 10:14:45 AM

to gsoc-r

Dear all,

My name is Mahjabin Siddika Oyshi, and I am a statistics undergraduate from Bangladesh. Over the last few months, I have contributed to several R packages and related projects. My merged pull requests cover examples, documentation, tests, and small fixes, including:

tidytext: https://github.com/juliasilge/tidytext/pull/252
forcats: https://github.com/tidyverse/forcats/pull/388
srvyr: https://github.com/gergness/srvyr/pull/197
weathercan: https://github.com/ropensci/weathercan/pull/197 and https://github.com/ropensci/weathercan/pull/193
stopwords‑iso: https://github.com/stopwords-iso/stopwords-iso/pull/29
stopwords‑bn: https://github.com/stopwords-iso/stopwords-bn/pull/2

I am very interested in applying to GSoC 2026 with the R project and would like to propose BornoR: Bangla text cleaning, tokenization, and lexical resources for R. There are several Python toolkits for Bangla NLP, but there is no cohesive, well‑documented R package that provides standard Bangla text preprocessing, tokenization, and lexicons. This makes it difficult for applied researchers in Bangladesh and South Asia, many of whom work primarily in R, to analyze Bangla text in a reproducible way.

My proposed scope for BornoR during GSoC is:

1. Text cleaning and stopwords

Unicode‑aware normalization for Bangla digits, punctuation, and common spacing issues.
Curated Bangla stopword lists (e.g., standard, formal, social‑media oriented variants).
Simple R helpers for stopword removal and frequency/keyword exploration.
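
As a rough sketch of what the cleaning and stopword helpers above might look like (base R only, assuming a UTF-8 session; the function names and the three-word stopword list are hypothetical placeholders, not an existing BornoR API, and punctuation normalization is left out):

```r
# Hypothetical helper names for illustration; not an existing BornoR API.

# Map Bangla digits (U+09E6 to U+09EF) to ASCII digits and tidy whitespace.
normalize_bn <- function(x) {
  x <- chartr("০১২৩৪৫৬৭৮৯", "0123456789", x)
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # drop leading/trailing whitespace
}

# Drop tokens that appear in a stopword vector.
remove_stopwords_bn <- function(tokens, stopwords) {
  tokens[!tokens %in% stopwords]
}

# Toy stopword list for illustration only; a curated list would be much larger.
toy_stopwords <- c("এবং", "ও", "এই")

text   <- "  আমি   ২০২৬ সালে  ভাত ও মাছ খাই "
clean  <- normalize_bn(text)
tokens <- strsplit(clean, " ", fixed = TRUE)[[1]]  # naive whitespace split
remove_stopwords_bn(tokens, toy_stopwords)
```

Keeping normalization and stopword removal as separate small functions would let users swap in their own stopword list without touching the cleaning step.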

2. Tokenization and integration with R text workflows

Sentence and word tokenization tailored to Bangla script.
Functions that return:
- quanteda::tokens and dfm objects for Bangla text, and
- tidytext‑compatible tibbles for token‑level analysis.
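
To illustrate the token-level output described above, here is a minimal sketch that relies on the stringi package's ICU word-boundary splitting. The function name is hypothetical, a plain data frame stands in for a tibble, and ICU's default rules are only an assumed starting point, not a tokenizer tuned for Bangla:

```r
library(stringi)

# Hypothetical function name for illustration; not an existing BornoR API.
# Splits text at ICU word boundaries and returns one row per token,
# the long format that tidytext-style workflows expect.
tokenize_words_bn <- function(x) {
  # Unicode NFC normalization so canonically equivalent forms compare equal.
  x <- stri_trans_nfc(x)
  # skip_word_none = TRUE drops spaces and punctuation such as the danda.
  toks <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)
  data.frame(
    doc   = rep(seq_along(toks), lengths(toks)),
    token = unlist(toks),
    stringsAsFactors = FALSE
  )
}

docs <- c("আমি বাংলায় গান গাই।", "তুমি কেমন আছো?")
tokenize_words_bn(docs)
```

A one-row-per-token table like this can be counted, joined against a stopword or sentiment lexicon, or converted to other token containers as needed.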

3. Bangla text resources packaged for R

Lexical resources bundled with the package, such as:
- stopword lists,
- a small sentiment or polarity lexicon,
- a small profanity or “toxic words” list, and
- sample benchmark corpora for examples and testing.

4. Package infrastructure and documentation

Unit tests, CI, and basic benchmarks on real Bangla corpora.
2–3 vignettes demonstrating end‑to‑end workflows (cleaning + tokenization + analysis).
Preparing the package for CRAN submission by the end of the project.

I am currently drafting the package design and plan to publish a minimal bornoR prototype on GitHub in the next 1–2 weeks (starting with normalization, stopwords, and a basic tokenizer) so that potential mentors can review and comment on the early implementation.

I am looking for two mentors (primary and secondary) who would be interested in supervising this project under the R organization for GSoC 2026, or advising on how to refine the scope or level of ambition to better match R’s priorities and GSoC expectations. I am happy to adjust the deliverables to fit either a 175‑hour or a 350‑hour project.

Thank you very much for considering this idea. I would greatly appreciate any feedback or suggestions.

Best regards,
Mahjabin Oyshi
GitHub: https://github.com/mahjabinoyshi

LinkedIn: https://www.linkedin.com/in/mahjabinoyshi
