Python code and tutorials for Japanese text available online


Molly Des Jardin

Aug 13, 2025, 2:56:14 PM
to Digital Humanities Japan
Hi everyone,

I'm writing to share that I've made some Python projects public for others to reuse the code more easily (without needing to contact me for files or permission). They are focused on the pre-analysis steps of data retrieval and prep, using two different text sources as examples:

Aozora Bunko -- https://github.com/mollydesjardin/aozora/
NINJAL Taiyo corpus (CD-ROM) -- https://github.com/mollydesjardin/taiyo-corpus-tools/
(BTW: Neither project includes actual source data as part of the repository)

I've overhauled the Aozora code more recently to modernize it a bit while trying to keep things simple. The Taiyo version is not as updated, but gets the job done. Both include a detailed guide and resource list, with the aim of being self-contained resources for people who are new-ish to working with Japanese text data (whether it's the programming or the Japanese that's less familiar).

If you have feedback on improving the documentation or parts of the code that aren't clear, I'd love to hear from you!

I'm also actively developing an AWS cloud version of the Aozora corpus builder, available in a simple form that just does individual file conversion* for the moment: https://github.com/mollydesjardin/aozora-lambda

The end result will be an interconnected set of components that automate tracking, downloading, and converting newly added HTML files from Aozora Bunko. Again, if you have feedback or thoughts, please do get in touch. I'll continue to add updates, more resources, and better documentation on GitHub.
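
For readers wondering what "converting" an Aozora Bunko HTML file can involve, here is a minimal sketch of one common prep step: dropping ruby (furigana) annotations so only the base text remains. This is an illustration using the Python standard library, not the code from the repositories above; the names `RubyStripper` and `strip_ruby` are hypothetical, and downloading and decoding the file are left out.

```python
from html.parser import HTMLParser


class RubyStripper(HTMLParser):
    """Collect text, skipping ruby readings (<rt>) and fallback parentheses (<rp>)."""

    SKIP_TAGS = {"rt", "rp"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            # Keep line breaks from the source as newlines.
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_ruby(html):
    """Return the text of an HTML fragment with ruby annotations removed."""
    parser = RubyStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample in the <ruby><rb>..</rb><rt>..</rt></ruby> style.
sample = (
    "<ruby><rb>吾輩</rb><rp>（</rp><rt>わがはい</rt><rp>）</rp></ruby>"
    "は猫である。<br />"
)
print(strip_ruby(sample))
```

A real file would also need its page header, footer, and editorial notes handled, which this sketch does not attempt.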

(*Another BTW: "Just file conversion" turned out to be neither trivial nor easy to migrate, contrary to what I had assumed, because of the large dictionary I use with MeCab. But this process made me grasp what we are actually doing just to put spaces between words: ML inference using a pretrained model. Feel free to tell that to your non-Japanese-using colleagues!)
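
To make the "inference" point concrete, here is a toy sketch of the general idea behind dictionary-based segmentation: build every way of cutting the sentence into known words, then pick the cheapest path. It is not MeCab. The dictionary, the cost values, and the function name `segment` are all made up for illustration, and a real analyzer also scores how well adjacent words connect, which this sketch omits.

```python
import math

# Toy "pretrained model": word -> cost, where lower means more plausible.
# All entries and values here are hypothetical.
WORD_COSTS = {
    "吾輩": 3.0,
    "は": 1.0,
    "猫": 3.0,
    "で": 1.0,
    "ある": 2.0,
    "である": 3.5,
}
UNKNOWN_COST = 10.0  # penalty for a single character not in the dictionary


def segment(text, word_costs=WORD_COSTS):
    """Split text into the lowest-total-cost sequence of words."""
    n = len(text)
    max_len = max(len(word) for word in word_costs)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost to cover text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word in that path
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_costs:
                cost = word_costs[piece]
            elif end - start == 1:
                cost = UNKNOWN_COST  # fall back to one unknown character
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start

    # Walk the back-pointers from the end to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return list(reversed(words))


print(" ".join(segment("吾輩は猫である")))  # 吾輩 は 猫 で ある
```

With the mecab-python3 package and a dictionary installed, the real thing is a call along the lines of `MeCab.Tagger("-Owakati").parse(text)`; the large dictionary is what supplies the costs that are hard-coded above.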


Best wishes and hope what I'm sharing can be useful!

Molly Des Jardin

Molly Des Jardin

Aug 15, 2025, 10:09:25 AM
to dhj...@googlegroups.com
Quick addition: I'm not sure whether my email address is fully displayed in the message I just sent out, after all my encouragement to get in touch with feedback! Just in case, you can contact me directly at mollyde...@gmail.com.

Molly
