Re: File Converter - Architecture guidance for DOCX/XLSX support

Jayant Saxena

Apr 18, 2026, 2:58:11 AM
to qubes-devel

Hi @marmarek @ben-grande @marmarta,


I've been working on the file converter project for GSoC 2026 and have several PRs in progress across the ecosystem:


- PR #38 (pdf-converter): Added password-protected PDF support with zenity GUI prompts
- PR #39 (pdf-converter): Batch conversion for multiple files
- PR #463 (core-admin-client): Propagating device assign options
- PR #448 (qubes-manager): Template manager crash fixes

I've studied the codebase and the lessons from PR #9, and I want to ask for guidance on the architecture before implementing broader file format support.


Key Questions:

1. Format Scope: Should the initial GSoC work focus on Office documents (DOCX, ODT, XLSX, PPTX) and exclude audio/video? I understand FFmpeg's attack surface is very large.

2. qrexec Protocol Design: For handling multiple formats with different options (passwords, sheet selection, resolution), what's the preferred approach?
   - Extend the current protocol with a format header: --format=docx --password=X\n[data]
   - Create a new service: qubes.FileConvert
   - Keep format-specific services separate?

   PR #9 was criticized for using "raw sockets". What approach would you prefer?

3. Output Standardization: Should all formats (DOCX, XLSX, PPTX) convert to PDF via the existing bitmap pipeline, or preserve format?

4. File Manager Integration: Should the Nautilus extension use magic bytes for format detection (not extension), and return non-zero for unsupported formats?

5. What should I read? Beyond the original Qubes PDF converter blog post, are there specific security papers or design docs on file conversion risks I should study?


I want to get the architecture right before investing time in implementation. Your guidance would be really helpful.
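
A minimal sketch of how the first option in question 2 (a one-line format header followed by the file data) could be parsed strictly. The function name `read_header`, the size limit, and the allowed option set are assumptions made for illustration; only `--format` and `--password` appear in the proposal above, and nothing here reflects an existing Qubes protocol.

```python
import io
import shlex

MAX_HEADER_BYTES = 4096                   # assumed limit, not from the thread
ALLOWED_OPTIONS = {"format", "password"}  # only the two options named above


def read_header(stream):
    """Read one newline-terminated option line and return it as a dict.

    The stream is left positioned at the first byte of the file data.
    Anything unexpected raises ValueError instead of being guessed at.
    """
    line = stream.readline(MAX_HEADER_BYTES + 1)
    if not line.endswith(b"\n") or len(line) > MAX_HEADER_BYTES:
        raise ValueError("header missing or too long")
    options = {}
    # UnicodeDecodeError is a ValueError subclass, so non-ASCII input
    # is rejected through the same exception type.
    for token in shlex.split(line.decode("ascii")):
        if not token.startswith("--") or "=" not in token:
            raise ValueError(f"malformed option: {token!r}")
        key, value = token[2:].split("=", 1)
        if key not in ALLOWED_OPTIONS or key in options:
            raise ValueError(f"unexpected option: {key!r}")
        options[key] = value
    return options


if __name__ == "__main__":
    stream = io.BytesIO(b"--format=docx --password=X\nPAYLOAD")
    print(read_header(stream), stream.read())
```

Rejecting unknown and duplicate options keeps the header grammar small, which matters if the parsing side handles untrusted input.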

Demi Marie Obenour

Apr 18, 2026, 3:06:35 AM
to Jayant Saxena, qubes-devel
On 4/18/26 02:55, Jayant Saxena wrote:
>
>
> Hi @marmarek @ben-grande @marmarta,
>
> I've been working on the file converter project for GSoC 2026 and have
> several PRs in progress across the ecosystem:
>
> - PR #38 (pdf-converter): Added password-protected PDF support with
> zenity GUI prompts
> - PR #39 (pdf-converter): Batch conversion for multiple files
> - PR #463 (core-admin-client): Propagating device assign options
> - PR #448 (qubes-manager): Template manager crash fixes
>
> I've studied the codebase and the lessons from PR #9, and I want to ask for
> guidance on the architecture before implementing broader file format
> support.
>
> *Key Questions:*
>
> 1.
>
> *Format Scope*: Should the initial GSoC work focus on Office documents
> (DOCX, ODT, XLSX, PPTX) and exclude audio/video? I understand FFmpeg's
> attack surface is very large.
> 2.
>
> *qrexec Protocol Design*: For handling multiple formats with different
> options (passwords, sheet selection, resolution), what's the preferred
> approach?
> - Extend the current protocol with a format header: --format=docx
> --password=X\n[data]
> - Create a new service: qubes.FileConvert
> - Keep format-specific services separate?
>
> PR #9 was criticized for using "raw sockets"—what approach would you
> prefer?
> 3.
>
> *Output Standardization*: Should all formats (DOCX, XLSX, PPTX) convert
> to PDF via the existing bitmap pipeline, or preserve format?

I think this goes along with another question: is the converter
meant to preserve the ability to edit the files? To the best of my
knowledge, this ability is what makes OpenDocument and OOXML better
than PDF in certain cases.

In my opinion, there's no point in preserving the format if the
output is not going to be something that would make sense to edit.
PDF is a better choice for an archival-only format. Marek, of course,
gets to make the decision, though.

(I'm not associated with the Qubes project anymore, just a potential
user of the feature.)

> 4.
>
> *File Manager Integration*: Should the Nautilus extension use magic
> bytes for format detection (not extension), and return non-zero for
> unsupported formats?
> 5.
>
> *What should I read?* Beyond the original Qubes PDF converter blog post,
> are there specific security papers or design docs on file conversion risks
> I should study?
>
> I want to get the architecture right before investing time in
> implementation. Your guidance would be really helpful.
>
> Thanks!
>


--
Sincerely,
Demi Marie Obenour (she/her/hers)

Andrew Clausen

Apr 18, 2026, 3:11:24 AM
to Demi Marie Obenour, Jayant Saxena, qubes-devel
Hi all,

On Sat, 18 Apr 2026 at 08:06, Demi Marie Obenour <demio...@gmail.com> wrote:
> I think this goes along with another question: is the converter
> meant to preserve the ability to edit the files?  To the best of my
> knowledge, this ability is what makes OpenDocument and OOXML better
> than PDF in certain cases.
>
> In my opinion, there's no point in preserving the format if the
> output is not going to be something that would make sense to edit.
> PDF is a better choice for an archival-only format.  Marek, of course,
> gets to make the decision, though.
>
> (I'm not associated with the Qubes project anymore, just a potential
> user of the feature.)

I second this.  I also add: PDF has serious usability and accessibility problems.  For example, it is quite difficult to read on small screens, because text can't be reflowed into a new shape.

I would think a good default would be to give the same file format back again?

Best wishes,
Andrew

Jayant Saxena

Apr 18, 2026, 3:24:23 AM
to qubes-devel

Hi Demi, Andrew,

Thanks for the input. Andrew sir, the "same format back" approach makes sense — a sanitized DOCX is more useful than a PDF if the user still needs to edit it.

I'll wait for Marek's thoughts before settling on the architecture. I can go either direction.

unman

Apr 18, 2026, 8:36:48 AM
to Jayant Saxena, qubes-devel
I already have something like this, but it's focussed on my needs. I don't
see a great advantage in giving back the same format, partly because I'm
not clear what a sanitised doc/docx would be.
I favor extracting the text and serving it in txt or rtf formats, but
that's a personal preference.

Andrew Clausen

Apr 18, 2026, 8:57:24 AM
to unman, Jayant Saxena, qubes-devel
Hi Unman,

An empty Word document produced in a clean dispVM is safe, not because Word documents are safe, but because the input was clean.

Conceptually, if you assume RTF is unable to express anything hostile, then converting docx -> rtf -> docx ought to do perfect sanitization.  Converting back to docx at the end is harmless in most use cases?

So I think the challenge amounts to: what intermediate format is appropriate for sanitization?

I would think that something like a minimalist version of Pandoc-markdown would be appropriate, but it is not an easy question to answer.  For example, how should hyperlinks and images be handled?  (Possible answers: external hyperlinks should not be clickable, and images should be sanitized by converting to a bitmap.)

Best wishes,
Andrew
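
A minimal sketch of the "restricted intermediate format" idea from the message above, assuming a tiny document model with only text, links, and already-rasterized images. The class names and the `sanitize` function are hypothetical; the two rules it applies (hyperlinks become inert text, images are accepted only as raw bitmaps) are the possible answers suggested in that message, not a settled design.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Text:
    content: str


@dataclass(frozen=True)
class Link:
    label: str
    url: str


@dataclass(frozen=True)
class Image:
    width: int
    height: int
    rgb: bytes  # raw 8-bit RGB pixels, assumed to be re-rendered upstream


Inline = Union[Text, Link, Image]


def _printable(text: str) -> str:
    """Drop control and other non-printable characters, keeping newlines and tabs."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def sanitize(inlines: List[Inline]) -> List[Inline]:
    """Map a list of inline nodes onto the restricted subset."""
    out: List[Inline] = []
    for node in inlines:
        if isinstance(node, Text):
            out.append(Text(_printable(node.content)))
        elif isinstance(node, Link):
            # The link target survives only as visible, non-clickable text.
            out.append(Text(_printable(f"{node.label} <{node.url}>")))
        elif isinstance(node, Image):
            if len(node.rgb) != node.width * node.height * 3:
                raise ValueError("pixel buffer does not match dimensions")
            out.append(node)
        else:
            # Anything outside the model is rejected rather than passed through.
            raise TypeError(f"unsupported node: {type(node).__name__}")
    return out
```

The point of the sketch is that the intermediate model, rather than the DOCX writer, defines what can survive conversion. Anything the model cannot represent is dropped or rejected before the output file is produced.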

unman

Apr 18, 2026, 9:19:39 AM
to Andrew Clausen, Jayant Saxena, qubes-devel
Hi Andrew

I did say that this was a personal preference, although it isn't
clear to me WHY you would want to give back the same format if you have
sanitised all macros, images and the like. What's the final conversion
to docx adding, except making the output less readable?

Best

unman


--
I never presume to speak for the Qubes team.
When I comment in the mailing lists I speak for myself.

Andrew Clausen

Apr 18, 2026, 9:41:15 AM
to unman, Andrew Clausen, Jayant Saxena, qubes-devel
Hi Unman,

On Sat, 18 Apr 2026 at 14:19, unman <un...@thirdeyesecurity.org> wrote:
> I did say that this was a personal preference,

Yes, you raised a good question: what does it mean for a docx document to be sanitized?  I attempted to answer it.

> although it isn't
> clear to me WHY you would want to give back the same format if you have
> sanitised all macros, images and the like. What's the final conversion
> to docx adding, except making the output less readable?

I would propose returning a DOCX document that is just as readable as before.  It would still have images, but they would be sanitized images. And so on.  You don't need macros, tracked changes, hyperlinks, or 99% of the features Word has to get a perfectly readable document.

I think there are a few useful goals:

1. DOCX readers have vulnerabilities on edge cases.  By restricting to a small subset of DOCX, you reduce the attack surface.

2. Removing malicious hyperlinks, macros, and other embedded objects (via OLE).

3. Removing hidden data, e.g. ensuring that redaction happens correctly.   ("What you get is what you see." not just "what you see is what you get")

Best wishes,
Andrew
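
A minimal sketch of goal 1 above (restricting to a small subset of DOCX), assuming the check is done at the package level: a DOCX file is a ZIP archive, so its member names can be compared against an allowlist. The allowlist and the function name `unexpected_parts` are assumptions for illustration. Part names are conventions rather than guarantees, and inspecting names alone does not make a file safe; this only shows the "small subset" principle.

```python
import io
import re
import zipfile

# Assumed allowlist for a text-and-PNG subset; real documents use many more parts.
ALLOWED_PARTS = re.compile(
    r"\[Content_Types\]\.xml"
    r"|_rels/\.rels"
    r"|word/document\.xml"
    r"|word/styles\.xml"
    r"|word/_rels/document\.xml\.rels"
    r"|word/media/[A-Za-z0-9_]+\.png"
)


def unexpected_parts(docx_bytes: bytes) -> list:
    """Return the sorted names of package members outside the allowlist."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(
            name for name in package.namelist()
            if not ALLOWED_PARTS.fullmatch(name)
        )
```

A package that contains, for example, `word/vbaProject.bin` (the conventional macro storage part) or files under `word/embeddings/` (conventional for OLE objects) would be reported and could then be refused or rebuilt without those parts.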

qubist

Apr 18, 2026, 1:22:53 PM
to qubes...@googlegroups.com

Jayant Saxena

Apr 24, 2026, 2:13:26 PM
to qubes-devel
Thanks for sharing that link. The ODT vs DOCX compatibility discussion is exactly the kind of context that's useful for thinking about the right intermediate format for the sanitizer.

On Saturday, 18 April 2026 at 22:52:53 UTC+5:30 qubist wrote:
> https://redlib.perennialte.ch/r/libreoffice/comments/ydguan/is_odt_better_than_docx/

Marek Marczykowski-Górecki

Apr 29, 2026, 9:05:45 AM
to Jayant Saxena, qubes-devel
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On Fri, Apr 17, 2026 at 11:55:19PM -0700, Jayant Saxena wrote:
>
>
> Hi @marmarek @ben-grande @marmarta,
>
> I've been working on the file converter project for GSoC 2026 and have
> several PRs in progress across the ecosystem:
>
> - PR #38 (pdf-converter): Added password-protected PDF support with
> zenity GUI prompts
> - PR #39 (pdf-converter): Batch conversion for multiple files
> - PR #463 (core-admin-client): Propagating device assign options
> - PR #448 (qubes-manager): Template manager crash fixes
>
> I've studied the codebase and the lessons from PR #9, and I want to ask for
> guidance on the architecture before implementing broader file format
> support.
>
> *Key Questions:*
>
> 1.
>
> *Format Scope*: Should the initial GSoC work focus on Office documents
> (DOCX, ODT, XLSX, PPTX) and exclude audio/video? I understand FFmpeg's
> attack surface is very large.

IMO Office formats are the hardest of those parts; better to move them to
the end and focus on simpler parts first.

> 2.
>
> *qrexec Protocol Design*: For handling multiple formats with different
> options (passwords, sheet selection, resolution), what's the preferred
> approach?
> - Extend the current protocol with a format header: --format=docx
> --password=X\n[data]
> - Create a new service: qubes.FileConvert
> - Keep format-specific services separate?

I think the service should be related to the output format. If the output is
going to be a PDF document (transferred as a raw image), use the current
service regardless of the input format. If the output is going to be
video, use another one and so on. Source detection can be done in the
service itself, to minimize required parsing on the client side.

>
> PR #9 was criticized for using "raw sockets"—what approach would you
> prefer?

That part was related to multiple files, not alternative formats, no?
Anyway, I see the `uno` Python module is used - maybe it provides a
more elegant interface? If not, I guess sockets can be used to control
LibreOffice programmatically...

> 3.
>
> *Output Standardization*: Should all formats (DOCX, XLSX, PPTX) convert
> to PDF via the existing bitmap pipeline, or preserve format?

This is a very important question, as it strongly influences how the file
is converted. In practice, I don't think it's realistic to keep the file
both (safely) editable and accurate, especially in terms of formatting.
There are a lot of files that use custom styles, fonts, and sometimes
even scripts (for example in spreadsheets) - sanitizing them with
sufficient confidence is a lot of work, and even then I think some
information will be lost in some cases.
Theoretically, there could be a mode that produces a safe editable output
file at the cost of lost formatting, but in practice it may not be that
useful. And in Qubes OS, the user can always open a file in a disposable
qube, having it both accurate and sandboxed.

So, I think it's okay to focus mainly on static output formats (like PDF
for documents), and just extend what source files can be used and maybe
make it more convenient (see below).

> 4.
>
> *File Manager Integration*: Should the Nautilus extension use magic
> bytes for format detection (not extension), and return non-zero for
> unsupported formats?

See above about filetype detection. I guess the simplest option would be
to allow conversion for all, but fail on unsupported formats.
I'd rather avoid parsing too much of the untrusted file on the client
side. Maybe filtering on file extension would be enough in practice?

> 5.
>
> *What should I read?* Beyond the original Qubes PDF converter blog post,
> are there specific security papers or design docs on file conversion risks
> I should study?
>
> I want to get the architecture right before investing time in
> implementation. Your guidance would be really helpful.

There is also the Dangerzone project, which has implemented (or is in the
process of implementing) some of the above. I would propose to:
1. Investigate the Dangerzone project. For example, I was told the server
part is significantly extended compared to qubes, and should be fully
compatible (at least in theory) with the qubes client part.

2. Then go to video formats

3. Bring in OCR support to text file formats (Dangerzone already has it,
but may require changes to use on qubes) - it isn't fully editable
output, but closes the critical limitation of the current PDF output -
copying text from the file.

4. And only at this point consider editable output formats.


- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEhrpukzGPukRmQqkK24/THMrX1ywFAmnyAiMACgkQ24/THMrX
1yyWjggAkTob4pLVlIFz00WxKGROpNBWvQOxBhqFSeOPxbXUWFacnKWTlaeRUawn
p8SDyeufeUhBuUv4yg1Lw/m/LDZK2//KfzeKaIDHTcjJVFtRlwLYUdiu+V28cf89
VXYanaU7LmcphtQtsWb4ojqOjiFrNfIcGya7hwN5bfOFjlpa3IAvfiDHnUlhWgnP
BpFkynY5SikpKIl/WEjlqMHGEMonVuSITVj6SYIfUXnqWU/oR6f3/9M8K2r6TiMQ
At6C+J1wlXCv/9AmZvjhB9UhxYiS8Hjq37rQ4w3YYDg0KeDzAqZZ57/6Yx2wwKGm
t0PpLuMSHHK1y8xL/1Xtz4nGGQrP/g==
=jG6q
-----END PGP SIGNATURE-----
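
A minimal sketch of the split suggested in the message above: the client picks a service by intended output type using only the file name, and the service side identifies the actual input type from its leading bytes. The service names, extension table, and function names are placeholders for illustration, and the magic-byte list is deliberately incomplete.

```python
import os

# Client side: filter on the file name only, without parsing untrusted content.
# Both service names are placeholders for illustration.
SERVICE_BY_EXTENSION = {
    ".pdf": "qubes.PdfConvert",
    ".docx": "qubes.PdfConvert",    # document input, static PDF-style output
    ".odt": "qubes.PdfConvert",
    ".mkv": "qubes.VideoConvert",   # hypothetical second service for video output
}


def pick_service(path):
    """Return the service for this file name, or None if unsupported."""
    extension = os.path.splitext(path)[1].lower()
    return SERVICE_BY_EXTENSION.get(extension)


# Service side (inside the disposable qube): detect the real input type
# from the first bytes instead of trusting the name supplied by the client.
MAGIC_PREFIXES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),   # OOXML and ODF packages are ZIP archives
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]


def detect_kind(head: bytes):
    """Return a coarse type label for the leading bytes, or None if unknown."""
    for prefix, kind in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return kind
    return None
```

With this split, a mislabeled file passes the client filter but still fails in the disposable qube if `detect_kind` returns `None`, which matches the "allow conversion for all, but fail on unsupported formats" option.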