I have a conflict this afternoon and will likely miss the meeting.
I wanted to provide a little update on my experiments in generating a
rough draft of the annual report and RAG-augmented search in general.
After talking with Herb last week, I went back and revised my email
extraction code to pull out more of the message metadata (Subject, Date,
Message-ID, To, CC, etc.). I also spent some time improving the handling
of multipart MIME-encoded messages. This last part is still a work in
progress. I need to think a little more about the handling of email
bodies. Simple emails don't present a problem, but longer ones fail
when I import the JSON file I create into Google's Vertex AI RAG
Engine. The processor seems to break down on long lines, and there are
10 emails with over 10,000 characters in the body. Most of that is
HTML, such as what Fathom generates in its meeting summaries.
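
As a rough illustration of the extraction described above, here is a
minimal sketch using Python's standard email package. The function name
extract_message, the JSON field names, and the LONG_BODY_CHARS constant
are assumptions made for this example, not the actual extraction code.

```python
from email import message_from_string, policy

# Assumed threshold, taken from the "over 10,000 characters" figure above.
LONG_BODY_CHARS = 10_000


def _header(msg, name):
    """Return a header as a plain str, or None if it is absent."""
    value = msg[name]
    return str(value) if value is not None else None


def extract_message(raw: str) -> dict:
    """Parse one raw RFC 5322 message into a JSON-friendly dict (hypothetical schema)."""
    msg = message_from_string(raw, policy=policy.default)

    # For multipart messages, prefer the text/plain part and fall back to HTML.
    part = msg.get_body(preferencelist=("plain", "html"))
    body = part.get_content() if part is not None else ""

    return {
        "subject": _header(msg, "Subject"),
        "date": _header(msg, "Date"),
        "message_id": _header(msg, "Message-ID"),
        "to": _header(msg, "To"),
        "cc": _header(msg, "Cc"),
        "body": body,
        # Flag the bodies that are likely to trip up the importer.
        "long_body": len(body) > LONG_BODY_CHARS,
    }
```

Each returned dict can be written out with json.dumps. The long_body
flag only marks the oversized messages; it does not decide what to do
with them.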
Stripping out all the HTML is the simplest solution. However, the way
information is formatted carries context that may be valuable. I also
suspect the solution to this problem will carry over into how I handle
MIME-encoded attachments, another area I haven't touched yet.
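
One possible middle ground between keeping the raw HTML and stripping it
entirely is sketched below, using only Python's standard html.parser
module. It turns headings and list items into short marked lines, so
some of the structure survives and no single line gets very long. The
class name StructuredText, the function html_to_text, and the "# " and
"- " markers are all assumptions for illustration, and whether this
output suits the RAG importer is untested.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCKS = {"p", "div", "br", "tr", "ul", "ol", "table"}
SKIPPED = {"script", "style"}


class StructuredText(HTMLParser):
    """Collect text from HTML, keeping headings and list items as marked lines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag in HEADINGS:
            self.parts.append("\n# ")
        elif tag in BLOCKS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag == "li" or tag in HEADINGS or tag in BLOCKS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text of an HTML fragment, one block element per line."""
    parser = StructuredText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

For example, html_to_text("<h2>Action Items</h2><ul><li>Send draft</li></ul>")
gives two lines, "# Action Items" and "- Send draft". Tables and nested
lists are flattened by this sketch, so it keeps only part of the
formatting context.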
I'll keep plugging away on this over the next week. Ideally, once I get
this work further along, we will be able to perform queries on the email
archive from 2025 (and beyond) and get useful information back.
--
Bill Stumbo
wst...@charter.net