GitLab continuous integration/deployment (CI/CD) and sphinx-doc's change detection

Denis Bitouzé

Jul 24, 2023, 5:19:40 PM
to sphinx-users
Hi,

I’m building a (French LaTeX) FAQ as a static website with Sphinx-doc. To let other people contribute easily, I'm trying to regenerate only the HTML pages corresponding to the changed source files, using:

- CI/CD
- GitLab Pages

on a gitlab.com instance.

That works well, except that even if only a single source file is changed, the HTML pages of all the source files are regenerated and, since there are more than 1200 source files, that takes too long (more than 15 minutes).

In order to have a look at this issue, one can consider this minimal Sphinx-doc content:
  • with mainly default Sphinx settings,
  • with a minimal conf.py config file,
  • but with many (100) test files in order to highlight the issue I'm facing.
I first asked for help on the dedicated Discourse GitLab CI/CD forum (https://forum.gitlab.com/t/is-it-possible-to-have-a-kind-of-persistent-docker-image-for-gitlab-ci/89542), and people there are pretty sure that sphinx-doc's cache should be restored and don't understand what's going on.

I then got help outside this forum from a guy (who has no clue about sphinx-doc, though). Here is what he said:

[Y]ou are shooting yourself in your foot because you use make. make is utterly ineffective in CI pipelines because at the beginning of each pipeline the repo is cloned afresh, meaning the file modification dates of the source files are usually newer than the cached output files even if nothing did change. Because make only considers file modification dates/timestamps, make is definitely not helpful for the first invocation. The second invocation obviously behaves correctly because the first make invocation rebuilds all outputs.
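
The timestamp argument above can be reproduced with a minimal sketch. The `is_stale` helper and the file names are hypothetical; the check only mimics the "source newer than target" rule described in the quote and is not make's actual implementation.

```python
import os
import tempfile
import time


def is_stale(source: str, target: str) -> bool:
    """Make-style check: rebuild if the target is missing or older than the source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "index.rst")   # hypothetical source file
    target = os.path.join(tmp, "index.html")  # hypothetical cached output
    for path in (source, target):
        with open(path, "w") as handle:
            handle.write("unchanged content\n")

    now = time.time()
    # The cached output was produced by an earlier pipeline, one hour ago.
    os.utime(target, (now - 3600, now - 3600))
    # A fresh clone writes the source "now", although its content is unchanged.
    os.utime(source, (now, now))
    stale_after_clone = is_stale(source, target)

    # With its original (older) timestamp, the same source would not trigger a rebuild.
    os.utime(source, (now - 7200, now - 7200))
    stale_with_old_mtime = is_stale(source, target)

print(stale_after_clone, stale_with_old_mtime)  # True False
```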

So you can reduce this minimal example by removing make and simply calling sphinx-build. And the pipelines for this new repository also show that a cache is pulled at the beginning and uploaded at the end of the pipeline. So GitLab’s caching is working. It’s now a question of what sphinx-build needs to determine whether to rebuild. If it works like make, you are out of luck because of file timestamps. If it works with another mechanism, you should look up where that is stored and check that all the information needed for the rebuild is cached (maybe more is needed than just the doctrees directory).
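
For contrast, here is a sketch of what "another mechanism" could look like: change detection based on content hashes, which a fresh clone does not disturb. This is purely illustrative. Nothing in this thread establishes that sphinx-build works this way, and the `changed_files` helper and the `manifest.json` file are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def file_digest(path: str) -> str:
    """Return the SHA-256 digest of a file's content."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def changed_files(paths, manifest_path):
    """Return the paths whose content differs from the stored digests, then update the manifest."""
    try:
        with open(manifest_path) as handle:
            previous = json.load(handle)
    except FileNotFoundError:
        previous = {}  # first run: everything counts as changed
    current = {path: file_digest(path) for path in paths}
    with open(manifest_path, "w") as handle:
        json.dump(current, handle)
    return sorted(p for p, digest in current.items() if previous.get(p) != digest)


with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "manifest.json")  # would have to be part of the CI cache
    sources = []
    for name in ("index.rst", "usage.rst"):
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write(f"{name} content\n")
        sources.append(path)

    first_run = [os.path.basename(p) for p in changed_files(sources, manifest)]

    os.utime(sources[0], (0, 0))  # the timestamp changes, the content does not
    after_touch = [os.path.basename(p) for p in changed_files(sources, manifest)]

    with open(sources[1], "a") as handle:
        handle.write("edited\n")
    after_edit = [os.path.basename(p) for p in changed_files(sources, manifest)]

print(first_run, after_touch, after_edit)
```

With such a scheme, only the manifest would need to survive between pipelines, and file timestamps would be irrelevant.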

I told him that:

  • I had already tried sphinx-build instead of make, but it didn’t work either,
  • I’m afraid sphinx-build doesn’t work with another mechanism and, regarding the cache, only the doctrees directory is involved.

He then answered:

So you need to use one of the solutions modifying the file timestamps to match the git commits to have an effective solution here. There are a variety of tools out there that do this. Just search the internet for git checkouts with mtime preserved.
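
The idea of resetting timestamps to match the git commits can be sketched as follows. This is a minimal illustration, not the code of any existing tool: the helper names are hypothetical, it assumes `git` is on the PATH, and it runs one `git log` call per file, which would be slow for 1200 files. In a shallow clone, the commit that last touched a file may be missing from the history, so the times can be wrong.

```python
import os
import subprocess
from typing import List, Optional


def tracked_files(repo: str) -> List[str]:
    """List the files tracked by git in the given working tree."""
    out = subprocess.run(
        ["git", "-C", repo, "ls-files", "-z"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [name for name in out.split("\0") if name]


def last_commit_time(repo: str, path: str) -> Optional[int]:
    """Return the committer timestamp of the last commit touching `path`, if any."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "-1", "--format=%ct", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return int(out) if out else None


def restore_mtimes(repo: str) -> int:
    """Set each tracked file's mtime to its last commit time; return the number of files updated."""
    restored = 0
    for name in tracked_files(repo):
        stamp = last_commit_time(repo, name)
        if stamp is None:
            continue  # e.g. a file that is staged but not yet committed
        os.utime(os.path.join(repo, name), (stamp, stamp))
        restored += 1
    return restored


if __name__ == "__main__":
    # Run from the root of the checkout, before calling the build.
    print(restore_mtimes("."), "files updated")
```

Such a step would have to run after the checkout and before the build, and it only helps if the build tool's change detection relies on source timestamps alone.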


I had hoped that `git-restore-mtime` would be the solution (https://forum.gitlab.com/t/deploy-only-changes-with-lftp/76180), but that hope was dashed:


The guy then told me: “Unfortunately, I have not been able to find any documentation on how sphinx's change detection works.”

Any help would be much appreciated.

Thanks.

Wols Lists

Jul 26, 2023, 8:27:04 PM
to sphinx...@googlegroups.com
On 24/07/2023 22:19, Denis Bitouzé wrote:
> [Y]ou are shooting yourself in your foot because you use make. make is
> utterly ineffective in CI pipelines because at the beginning of each
> pipeline the repo is cloned afresh, meaning the file modification dates
> of the source files are usually newer than the cached output files even
> if nothing did change.

How does it clone the repo? If you can clone using "cp -a", that MIGHT
work, as it's supposed to copy everything.

The other approach to try is "cp -lR", because it copies the directory
structure, but links to and does not change any files. That also might
work, and actually will be a lot faster than cp -a.

That's assuming you can control how the repo is copied, of course. I
suspect they often do a "git clone" which will lose all the target files
on the spot ...
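
The difference between the copy modes mentioned above can be illustrated in Python, used here only as an analogy: `shutil.copyfile` behaves like a plain copy (new timestamp), `shutil.copy2` tries to preserve metadata in the spirit of `cp -a`, and `os.link` creates a hard link in the spirit of `cp -l`. The file names are hypothetical, and the hard link assumes a filesystem that supports one.

```python
import os
import shutil
import tempfile
import time

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "index.rst")
    with open(original, "w") as handle:
        handle.write("content\n")
    # Pretend the file was last modified a day ago.
    old = time.time() - 86400
    os.utime(original, (old, old))

    plain = os.path.join(tmp, "plain.rst")
    preserved = os.path.join(tmp, "preserved.rst")
    linked = os.path.join(tmp, "linked.rst")

    shutil.copyfile(original, plain)   # content only: the copy gets a fresh mtime
    shutil.copy2(original, preserved)  # content plus metadata, including the mtime
    os.link(original, linked)          # same inode, hence the same mtime

    original_mtime = os.path.getmtime(original)
    plain_is_newer = os.path.getmtime(plain) > original_mtime
    preserved_matches = abs(os.path.getmtime(preserved) - original_mtime) < 1
    linked_matches = os.path.getmtime(linked) == original_mtime

print(plain_is_newer, preserved_matches, linked_matches)
```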

Cheers,
Wol

Denis Bitouzé

Jul 28, 2023, 8:17:01 AM
to sphinx-users
On 24/07/23 at 14:19, Denis Bitouzé wrote:

> I’m building a (French LaTeX) FAQ
> <https://dbitouze.gitlab.io/test-faq-fr/index.html> as a static website
> with Sphinx-doc <https://www.sphinx-doc.org/>. In order to let other people
> easily contribute, I'm trying to regenerate the HTML pages corresponding to
> the changed source files thanks to:
>
> - CD/CI
> - GitLab pages
>
> on a gitlab.com instance.
>
> That works well, except that even if only a single source file is changed,
> the HTML pages of all the source files are regenerated and, since they are
> more than 1200 source files, that takes too much time (more than 15
> minutes).

BTW, I followed (with some needed adaptations) the step by step
“Tutorial: Build your first project”:

┌────
https://www.sphinx-doc.org/en/master/tutorial/index.html
└────

including “Appendix: Deploying a Sphinx project online”, following the
“GitHub Pages” route in order to see whether my problem is
GitLab-specific or not:

┌────
https://github.com/dbitouze/lumache
└────

As you can see here:

┌────
https://github.com/dbitouze/lumache/commit/c4bcbd9c5fc239edb603e80a31b15e22ff574768
└────

only the `index.rst` file was changed but the build performed by the CI/CD:

┌────
https://github.com/dbitouze/lumache/actions/runs/5691725103/job/15427441621#step:4:25
└────

writes all the other source files:

- `api`
- `generated/lumache`
- `index`
- `usage`

(The deploy step fails, but that's not the point.)

Hence my problem is not a GitLab-specific one and seems to affect all
CI/CD ways of deploying a Sphinx project online :(
--
Denis

Denis Bitouzé

Jul 28, 2023, 8:37:31 AM
to sphinx-users
On 27/07/23 at 01:26, Wols Lists wrote:


> On 24/07/2023 22:19, Denis Bitouzé wrote:
>> [Y]ou are shooting yourself in your foot because you use make. make is utterly
>> ineffective in CI pipelines because at the beginning of each pipeline the repo
>> is cloned afresh, meaning the file modification dates of the source files are
>> usually newer than the cached output files even if nothing did change.
>
> How does it clone the repo? If you can clone using "cp -a", that MIGHT work, as
> it's supposed to copy everything.

I've no idea how GitLab recovers the repo: the corresponding line is (I
guess) “Getting source from Git repository”. See e.g.:

  ┌────
  │ https://gitlab.com/gutenberg1/minimal-sphinx-minimal/-/jobs/4747563014#L14
  └────


> The other approach to try is "cp -lR", because it copies the directory
> structure, but links to and does not change any files. That also might
> work, and actually will be a lot faster than cp -a.
>
> That's assuming you can control how the repo is copied, of course. I suspect
> they often do a "git clone" which will lose all the target files on the spot ...

I'm afraid that's the case.

Many thanks for your answer!

I hope I can find a way to fix this issue!

Cheers,
--
Denis

Tuncay Güzel

Jul 28, 2023, 8:52:27 AM
to sphinx...@googlegroups.com
İm at work now 3 hours later, thanx 

Wols Lists

Jul 29, 2023, 7:28:35 AM
to Denis Bitouzé, sphinx...@googlegroups.com
On 28/07/2023 13:03, Denis Bitouzé wrote:
>> That's assuming you can control how the repo is copied, of course. I suspect
>> they often do a "git clone" which will lose all the target files on the spot ...
> I'm afraid that's the case.
>
> Many thanks for your answer!
>
> I hope I could find a way to fix this issue!

The more I think about it, the more it seems the only sensible option is
git clone.

Isn't the idea of CI/CD that you start afresh for every update? In which
case you want a bare repository, which means you have no target files,
which means you need a full rebuild as your first step.

Sorry.

Cheers,
Wol