Hi @marmarek @ben-grande @marmarta,
I've been working on the file converter project for GSoC 2026 and have several PRs in progress across the ecosystem:
- PR #38 (pdf-converter): Added password-protected PDF support with zenity GUI prompts
- PR #39 (pdf-converter): Batch conversion for multiple files
- PR #463 (core-admin-client): Propagating device assign options
- PR #448 (qubes-manager): Template manager crash fixes
I've studied the codebase and the lessons from PR #9, and I want to ask for guidance on the architecture before implementing broader file format support.
Key Questions:
1. Format Scope: Should the initial GSoC work focus on Office documents (DOCX, ODT, XLSX, PPTX) and exclude audio/video? I understand FFmpeg's attack surface is very large.
2. qrexec Protocol Design: For handling multiple formats with different options (passwords, sheet selection, resolution), what's the preferred approach?
   a. Extend the current protocol with a format header: --format=docx --password=X\n[data]
   b. Create a new service: qubes.FileConvert
   c. Keep format-specific services separate?
   PR #9 was criticized for using "raw sockets"; what approach would you prefer?
3. Output Standardization: Should all formats (DOCX, XLSX, PPTX) convert to PDF via the existing bitmap pipeline, or preserve format?
4. File Manager Integration: Should the Nautilus extension use magic bytes for format detection (not extension), and return non-zero for unsupported formats?
5. What should I read? Beyond the original Qubes PDF converter blog post, are there specific security papers or design docs on file conversion risks I should study?
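To make the file manager question concrete, here is a rough sketch of the magic-byte check I have in mind. The names `sniff` and `exit_code_for` and the exit codes are my own placeholders, not anything from the existing converter. It only looks at the leading bytes, so it cannot tell DOCX, XLSX, PPTX and ODT apart (they all start with the ZIP signature); I assume telling them apart would mean opening the archive, which this sketch deliberately does not do.

```python
# Leading-byte signatures. Hypothetical table: only the formats
# discussed in this thread are listed.
MAGIC = {
    b"%PDF-": "pdf",
    # DOCX, XLSX, PPTX and ODT are all ZIP archives, so the signature
    # alone only tells us "some ZIP container".
    b"PK\x03\x04": "zip-container",
}


def sniff(head: bytes):
    """Return a coarse format name based on leading bytes, or None."""
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return kind
    return None


def exit_code_for(path: str) -> int:
    """Return 0 if the file looks supported, 1 otherwise.

    The file name and its extension are ignored on purpose.
    """
    with open(path, "rb") as f:
        head = f.read(8)  # the longest signature above is 5 bytes
    return 0 if sniff(head) is not None else 1
```

A caller would pass `exit_code_for(path)` to `sys.exit()` so that unsupported files produce a non-zero status.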
I want to get the architecture right before investing time in implementation. Your guidance would be really helpful.
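For the format-header option in the qrexec question, this is roughly how I imagine the receiving side reading the header line before the data. It is only a sketch under my own assumptions: the key names, the allow-lists and the size cap are hypothetical, and none of this is an existing qrexec convention.

```python
import io

ALLOWED_KEYS = {"format", "password"}             # hypothetical option names
ALLOWED_FORMATS = {"docx", "odt", "xlsx", "pptx"}  # hypothetical allow-list
MAX_HEADER_BYTES = 4096                            # hypothetical size cap


def parse_header(stream):
    """Read one line like b'--format=docx --password=X\\n' from a binary
    stream and return the options as a dict. The stream is left positioned
    at the first byte of the file data. Raises ValueError on anything
    unexpected instead of guessing.
    """
    line = stream.readline(MAX_HEADER_BYTES + 1)
    if len(line) > MAX_HEADER_BYTES or not line.endswith(b"\n"):
        raise ValueError("header missing, unterminated, or too long")
    options = {}
    # Non-ASCII input raises UnicodeDecodeError, a ValueError subclass.
    # Splitting on whitespace means values (including passwords) cannot
    # contain spaces in this sketch.
    for token in line[:-1].decode("ascii").split():
        if not token.startswith("--") or "=" not in token:
            raise ValueError("malformed option")
        key, value = token[2:].split("=", 1)
        if key not in ALLOWED_KEYS or key in options:
            raise ValueError("unknown or repeated option")
        options[key] = value
    if options.get("format") not in ALLOWED_FORMATS:
        raise ValueError("missing or unsupported format")
    return options


# Usage with an in-memory stream standing in for the service's stdin:
stream = io.BytesIO(b"--format=docx --password=X\nFILE-BYTES")
opts = parse_header(stream)
payload = stream.read()
```

Rejecting unknown keys and formats up front keeps the parser small, which seems to matter more here than flexibility.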
I think this goes along with another question: is the converter
meant to preserve the ability to edit the files? To the best of my
knowledge, this ability is what makes OpenDocument and OOXML better
than PDF in certain cases.
In my opinion, there's no point in preserving the format if the
output is not going to be something that would make sense to edit.
PDF is a better choice for an archival-only format. Marek, of course,
gets to make the decision, though.
(I'm not associated with the Qubes project anymore, just a potential
user of the feature.)
Hi Demi, Andrew,
Thanks for the input. Andrew sir, the "same format back" approach makes sense — a sanitized DOCX is more useful than a PDF if the user still needs to edit it.
I'll wait for Marek's thoughts before settling on the architecture. I can go either direction.
I did say that this was a personal preference,
although it isn't
clear to me WHY you would want to give back the same format if you have
sanitised all macros, images and the like. What's the final conversion
to docx adding, except making the output less readable?
https://redlib.perennialte.ch/r/libreoffice/comments/ydguan/is_odt_better_than_docx/