Re: File Converter - Architecture guidance for DOCX/XLSX support


Jayant Saxena

Apr 18, 2026, 2:58:11 AM
to qubes-devel

Hi @marmarek @ben-grande @marmarta,


I've been working on the file converter project for GSoC 2026 and have several PRs in progress across the ecosystem:


- PR #38 (pdf-converter): Added password-protected PDF support with zenity GUI prompts
- PR #39 (pdf-converter): Batch conversion for multiple files
- PR #463 (core-admin-client): Propagating device assign options
- PR #448 (qubes-manager): Template manager crash fixes

I've studied the codebase and the lessons from PR #9, and I want to ask for guidance on the architecture before implementing broader file format support.


Key Questions:


1. Format Scope: Should the initial GSoC work focus on Office documents (DOCX, ODT, XLSX, PPTX) and exclude audio/video? I understand FFmpeg's attack surface is very large.

2. qrexec Protocol Design: For handling multiple formats with different options (passwords, sheet selection, resolution), what's the preferred approach?

   - Extend the current protocol with a format header: --format=docx --password=X\n[data]
   - Create a new service: qubes.FileConvert
   - Keep format-specific services separate?

   PR #9 was criticized for using "raw sockets"; what approach would you prefer?

3. Output Standardization: Should all formats (DOCX, XLSX, PPTX) convert to PDF via the existing bitmap pipeline, or preserve format?

4. File Manager Integration: Should the Nautilus extension use magic bytes for format detection (not extension), and return non-zero for unsupported formats?

5. What should I read? Beyond the original Qubes PDF converter blog post, are there specific security papers or design docs on file conversion risks I should study?
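
To make the first protocol option in question 2 concrete, here is a minimal sketch of parsing a single header line of the form `--format=docx --password=X\n[data]` ahead of the payload. Everything in it is an assumption for illustration: the function name `read_header`, the option names, the allowlists, and the 256-byte limit are hypothetical and do not come from the existing converter or from qrexec itself.

```python
import io
from typing import BinaryIO, Dict

# All names and limits below are hypothetical, chosen only for illustration.
MAX_HEADER_BYTES = 256
ALLOWED_FORMATS = {"docx", "odt", "xlsx", "pptx", "pdf"}
ALLOWED_KEYS = {"format", "password", "sheet", "resolution"}


def read_header(stream: BinaryIO) -> Dict[str, str]:
    """Read one '--key=value ...' header line; leave the stream at the payload."""
    line = stream.readline(MAX_HEADER_BYTES + 1)
    if len(line) > MAX_HEADER_BYTES or not line.endswith(b"\n"):
        raise ValueError("header is missing, unterminated, or too long")

    options: Dict[str, str] = {}
    # Splitting on spaces means values (including passwords) cannot contain
    # spaces; a real protocol would need length-prefixed or encoded values.
    for token in line[:-1].decode("ascii").split(" "):
        if not token.startswith("--") or "=" not in token:
            raise ValueError(f"malformed option: {token!r}")
        key, value = token[2:].split("=", 1)
        if key not in ALLOWED_KEYS or key in options:
            raise ValueError(f"unknown or repeated option: {key!r}")
        options[key] = value

    if options.get("format") not in ALLOWED_FORMATS:
        raise ValueError("missing or unsupported format")
    return options


# Usage: parse the header, then treat everything after it as opaque data.
stream = io.BytesIO(b"--format=docx --password=X\n[data]")
options = read_header(stream)
payload = stream.read()
```

The sketch rejects anything it does not recognize instead of ignoring it, on the assumption that the receiving side should treat the header as untrusted input. Whether a header like this is preferable to separate services is exactly the open question above.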


I want to get the architecture right before investing time in implementation. Your guidance would be really helpful.
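
For question 4, content-based detection could look roughly like the sketch below. It relies on two facts: PDF files begin with `%PDF-`, and DOCX, XLSX, PPTX, and ODT are all ZIP containers beginning with `PK\x03\x04`. Telling the ZIP-based formats apart requires looking inside the archive, and the sketch assumes the conventional main-part names. The function names and the 128-byte limit are hypothetical. Opening an untrusted archive is itself attack surface, so a real design might check only the leading bytes on the client side and leave the rest to the disposable qube.

```python
import io
import zipfile
from typing import Optional

# Conventional main-part names inside an OOXML package (assumed, not enforced
# by the packaging standard), mapped to a short format label.
OOXML_MAIN_PARTS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}
ODT_MIMETYPE = b"application/vnd.oasis.opendocument.text"


def detect_format(data: bytes) -> Optional[str]:
    """Return a format label based on file content, or None if unsupported."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if not data.startswith(b"PK\x03\x04"):  # ZIP local file header signature
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as package:
            names = set(package.namelist())
            if "mimetype" in names:
                info = package.getinfo("mimetype")
                # Only read the entry if it is as small as a MIME string.
                if info.file_size <= 128 and package.read(info) == ODT_MIMETYPE:
                    return "odt"
                return None
            if "[Content_Types].xml" in names:
                for part, label in OOXML_MAIN_PARTS.items():
                    if part in names:
                        return label
    except Exception:
        # Any parse failure on untrusted input is treated as "unsupported".
        return None
    return None


def exit_code(data: bytes) -> int:
    """Map detection to a process exit status: 0 if supported, 1 otherwise."""
    return 0 if detect_format(data) is not None else 1
```

A file named `report.docx` that actually contains an executable would get `None` here and a non-zero exit status, regardless of its extension.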

Demi Marie Obenour

Apr 18, 2026, 3:06:35 AM
to Jayant Saxena, qubes-devel
On 4/18/26 02:55, Jayant Saxena wrote:
>
>
> Hi @marmarek @ben-grande @marmarta,
>
> I've been working on the file converter project for GSoC 2026 and have
> several PRs in progress across the ecosystem:
>
> - PR #38 (pdf-converter): Added password-protected PDF support with
> zenity GUI prompts
> - PR #39 (pdf-converter): Batch conversion for multiple files
> - PR #463 (core-admin-client): Propagating device assign options
> - PR #448 (qubes-manager): Template manager crash fixes
>
> I've studied the codebase and the lessons from PR #9, and I want to ask for
> guidance on the architecture before implementing broader file format
> support.
>
> *Key Questions:*
>
> 1. *Format Scope*: Should the initial GSoC work focus on Office documents
>    (DOCX, ODT, XLSX, PPTX) and exclude audio/video? I understand FFmpeg's
>    attack surface is very large.
> 2. *qrexec Protocol Design*: For handling multiple formats with different
>    options (passwords, sheet selection, resolution), what's the preferred
>    approach?
>    - Extend the current protocol with a format header: --format=docx
>      --password=X\n[data]
>    - Create a new service: qubes.FileConvert
>    - Keep format-specific services separate?
>
>    PR #9 was criticized for using "raw sockets"; what approach would you
>    prefer?
> 3. *Output Standardization*: Should all formats (DOCX, XLSX, PPTX) convert
>    to PDF via the existing bitmap pipeline, or preserve format?

I think this goes along with another question: is the converter
meant to preserve the ability to edit the files? To the best of my
knowledge, this ability is what makes OpenDocument and OOXML better
than PDF in certain cases.

In my opinion, there's no point in preserving the format if the
output is not going to be something that would make sense to edit.
PDF is a better choice for an archival-only format. Marek, of course,
gets to make the decision, though.

(I'm not associated with the Qubes project anymore, just a potential
user of the feature.)

> 4. *File Manager Integration*: Should the Nautilus extension use magic
>    bytes for format detection (not extension), and return non-zero for
>    unsupported formats?
> 5. *What should I read?* Beyond the original Qubes PDF converter blog post,
>    are there specific security papers or design docs on file conversion risks
>    I should study?
>
> I want to get the architecture right before investing time in
> implementation. Your guidance would be really helpful.
>
> Thanks!
>


--
Sincerely,
Demi Marie Obenour (she/her/hers)

Andrew Clausen

Apr 18, 2026, 3:11:24 AM
to Demi Marie Obenour, Jayant Saxena, qubes-devel
Hi all,

On Sat, 18 Apr 2026 at 08:06, Demi Marie Obenour <demio...@gmail.com> wrote:
> I think this goes along with another question: is the converter
> meant to preserve the ability to edit the files?  To the best of my
> knowledge, this ability is what makes OpenDocument and OOXML better
> than PDF in certain cases.
>
> In my opinion, there's no point in preserving the format if the
> output is not going to be something that would make sense to edit.
> PDF is a better choice for an archival-only format.  Marek, of course,
> gets to make the decision, though.
>
> (I'm not associated with the Qubes project anymore, just a potential
> user of the feature.)

I second this.  I also add: PDF has serious usability and accessibility problems.  For example, it is quite difficult to read on small screens, because text can't be reflowed into a new shape.

I would think a good default would be to give the same file format back again?

Best wishes,
Andrew

Jayant Saxena

Apr 18, 2026, 3:24:23 AM
to qubes-devel

Hi Demi, Andrew,

Thanks for the input. Andrew sir, the "same format back" approach makes sense — a sanitized DOCX is more useful than a PDF if the user still needs to edit it.

I'll wait for Marek's thoughts before settling on the architecture. I can go in either direction.

unman

Apr 18, 2026, 8:36:48 AM
to Jayant Saxena, qubes-devel
I already have something like this, but it's focussed on my needs. I don't
see a great advantage in giving back the same format, partly because I'm
not clear what a sanitised doc/docx would be.
I favor extracting the text and serving it in txt or rtf formats, but
that's a personal preference.

Andrew Clausen

Apr 18, 2026, 8:57:24 AM
to unman, Jayant Saxena, qubes-devel
Hi Unman,

An empty Word document produced in a clean dispVM is safe, not because Word documents are safe, but because the input was clean.

Conceptually, if you assume RTF is unable to express anything hostile, then converting docx -> rtf -> docx ought to do perfect sanitization.  Converting back to docx at the end is harmless in most use cases?

So I think the challenge amounts to: what intermediate format is appropriate for sanitization?

I would think that something like a minimalist version of Pandoc-markdown would be appropriate, but it is not an easy question to answer.  For example, how should hyperlinks and images be handled?  (Possible answers: external hyperlinks should not be clickable, and images should be sanitized by converting to a bitmap.)
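
One way to picture such an intermediate format is as a tiny data model that can express only plain paragraphs and raw bitmaps, with hyperlinks flattened to inert text. This is a sketch under stated assumptions, not a proposal for the actual format: the class names, the `link_as_text` helper, and the `MAX_PIXELS` cap are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union

MAX_PIXELS = 50_000_000  # hypothetical cap to bound memory use


@dataclass(frozen=True)
class Paragraph:
    text: str


@dataclass(frozen=True)
class Bitmap:
    width: int
    height: int
    rgb: bytes  # raw 8-bit RGB samples, row by row, no container format


Block = Union[Paragraph, Bitmap]


def link_as_text(label: str, url: str) -> Paragraph:
    """Flatten a hyperlink into inert text so nothing is clickable."""
    return Paragraph(f"{label} <{url}>")


def validate(blocks: List[Block]) -> None:
    """Raise if any block falls outside the tiny intermediate format."""
    for block in blocks:
        if isinstance(block, Paragraph):
            if any(ord(ch) < 0x20 and ch != "\t" for ch in block.text):
                raise ValueError("control character in paragraph text")
        elif isinstance(block, Bitmap):
            pixels = block.width * block.height
            if block.width <= 0 or block.height <= 0 or pixels > MAX_PIXELS:
                raise ValueError("bitmap dimensions out of range")
            if len(block.rgb) != pixels * 3:
                raise ValueError("bitmap data does not match its dimensions")
        else:
            raise TypeError(f"unsupported block type: {type(block).__name__}")
```

The idea is that the trusted side accepts only data that passes `validate` and then writes a fresh DOCX from it, so nothing from the original file's structure survives. Headings, tables, and lists are left out here, which shows how quickly the "what belongs in the intermediate format" question grows.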

Best wishes,
Andrew

unman

Apr 18, 2026, 9:19:39 AM
to Andrew Clausen, Jayant Saxena, qubes-devel
Hi Andrew

I did say that this was a personal preference, although it isn't
clear to me WHY you would want to give back the same format if you have
sanitised all macros, images and the like. What's the final conversion
to docx adding, except making the output less readable?

Best

unman


--
I never presume to speak for the Qubes team.
When I comment in the mailing lists I speak for myself.

Andrew Clausen

Apr 18, 2026, 9:41:15 AM
to unman, Andrew Clausen, Jayant Saxena, qubes-devel
Hi Unman,

On Sat, 18 Apr 2026 at 14:19, unman <un...@thirdeyesecurity.org> wrote:
> I did say that this was a personal preference,

Yes, you raised a good question: what does it mean for a docx document to be sanitized?  I attempted to answer it.

> although it isn't
> clear to me WHY you would want to give back the same format if you have
> sanitised all macros, images and the like. What's the final conversion
> to docx adding, except making the output less readable?

I would propose returning a DOCX document that is just as readable as before.  It would still have images, but they would be sanitized images, and so on.  You don't need macros, tracked changes, hyperlinks, or 99% of Word's other features to get a perfectly readable document.

I think there are a few useful goals:

1. DOCX readers have vulnerabilities on edge cases.  By restricting to a small subset of DOCX, you reduce the attack surface.

2. Removing malicious hyperlinks, macros, and other embedded objects (via OLE).

3. Removing hidden data, e.g. ensuring that redaction happens correctly ("what you get is what you see", not just "what you see is what you get").
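
As a rough illustration of goals 1 and 2: a DOCX file is a ZIP package, so one cheap first check is to list its members against a small allowlist. Macro-enabled Word files typically carry a `word/vbaProject.bin` part, and embedded OLE objects typically live under `word/embeddings/`, so both would be flagged. The allowlist and the name `unexpected_parts` are hypothetical and far from complete.

```python
import io
import zipfile
from typing import List

# Illustrative allowlist only; a real one would need careful review.
ALLOWED_PARTS = {
    "[Content_Types].xml",
    "_rels/.rels",
    "word/document.xml",
    "word/_rels/document.xml.rels",
    "word/styles.xml",
}
ALLOWED_PREFIXES = ("word/media/",)


def unexpected_parts(docx_bytes: bytes) -> List[str]:
    """List package members that fall outside the allowlist."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(
            name
            for name in package.namelist()
            if name not in ALLOWED_PARTS and not name.startswith(ALLOWED_PREFIXES)
        )
```

This inspects member names only. Hyperlinks, tracked changes, and hidden text live inside the XML parts themselves, and images under `word/media/` would still need re-encoding, so a name check like this could at most serve as an early rejection step ahead of a full rewrite through an intermediate format.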

Best wishes,
Andrew

qubist

Apr 18, 2026, 1:22:53 PM
to qubes...@googlegroups.com

Jayant Saxena

Apr 24, 2026, 2:13:26 PM
to qubes-devel
Thanks for sharing that link. The ODT vs DOCX compatibility discussion is exactly the kind of context that's useful for thinking about the right intermediate format for the sanitizer.

On Saturday, 18 April 2026 at 22:52:53 UTC+5:30 qubist wrote:
> https://redlib.perennialte.ch/r/libreoffice/comments/ydguan/is_odt_better_than_docx/
