Hi everyone,
I'm writing to share that I've made some of my Python projects public, so others can reuse the code more easily (without needing to contact me for files or permission). They focus on the pre-analysis steps of data retrieval and prep, using two different text sources as examples:
Aozora Bunko --
https://github.com/mollydesjardin/aozora/

NINJAL Taiyo corpus (CD-ROM) --
https://github.com/mollydesjardin/taiyo-corpus-tools/

(BTW: Neither project includes actual source data as part of the repository.)
I've overhauled the Aozora code more recently to modernize it a bit while trying to keep things simple. The Taiyo version hasn't been updated as recently, but it gets the job done. Both include a detailed guide and resource list, with the aim of being self-contained resources for people who are new-ish to working with Japanese text data (whether it's the programming or the Japanese that's less familiar).
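To give a flavor of the kind of prep step involved, here's a minimal sketch (not the actual code from either repo) that strips the ruby furigana markup from an Aozora Bunko HTML file, leaving just the base text. The filename is a placeholder, and the repos handle this more carefully:

    # Sketch: remove ruby reading glosses from Aozora Bunko HTML,
    # keeping only the base text. Requires beautifulsoup4.
    from bs4 import BeautifulSoup

    def strip_ruby(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        # Delete the reading glosses (<rt>) and parenthesis fallbacks (<rp>).
        for tag in soup.find_all(["rt", "rp"]):
            tag.decompose()
        # Unwrap <ruby> so the base characters stay inline.
        for ruby in soup.find_all("ruby"):
            ruby.unwrap()
        return soup.get_text()

    # Aozora Bunko files are typically Shift_JIS encoded.
    with open("example.html", encoding="shift_jis") as f:
        print(strip_ruby(f.read()))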
If you have feedback on improving the documentation or parts of the code that aren't clear, I'd love to hear from you!
I'm also actively developing an AWS cloud version of the Aozora corpus builder. For the moment it's available in a simple form that just does individual file conversion*:
https://github.com/mollydesjardin/aozora-lambda
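To sketch the shape of that step (a simplified stand-in, not the repo's exact code), a conversion Lambda triggered by an S3 upload might look something like this; the output bucket name and the convert() body are placeholders:

    import urllib.parse
    import boto3

    s3 = boto3.client("s3")

    def convert(html: str) -> str:
        # Placeholder for the real work (HTML cleanup + tokenization).
        return html

    def lambda_handler(event, context):
        # Standard S3-trigger event shape: one record per uploaded object.
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["object"]["key"])
        html = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("shift_jis")
        s3.put_object(
            Bucket="my-output-bucket",  # placeholder name
            Key=key + ".txt",
            Body=convert(html).encode("utf-8"),
        )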
The end result will be an interconnected set of components that automate tracking, downloading, and converting newly added HTML files from Aozora Bunko. Again, if you have feedback or thoughts, please do get in touch. I'll continue to add updates, more resources, and better documentation on GitHub.
(*Another BTW: "Just file conversion" turned out not to be as trivial to migrate as I'd wrongly assumed, because of the large dictionary I use with MeCab. But the process made me grasp what we're really doing just to put spaces between words: ML inference using a pretrained model. Feel free to tell that to your non-Japanese-using colleagues!)
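If you're curious, here's that idea in miniature, using the mecab-python3 package; the optional -d dictionary path is a placeholder for whatever large dictionary you've installed, and the exact segmentation can vary by dictionary:

    import MeCab

    # -Owakati outputs tokens separated by spaces ("wakachi-gaki").
    # Add e.g. '-d /path/to/unidic' to point at a specific dictionary.
    tagger = MeCab.Tagger("-Owakati")
    print(tagger.parse("吾輩は猫である。").strip())
    # -> 吾輩 は 猫 で ある 。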
Best wishes and hope what I'm sharing can be useful!
Molly Des Jardin