Overview of current scoutess architecture and philosophy

Jeremy Shaw

Nov 3, 2012, 4:47:30 PM
to scou...@googlegroups.com
Hello,

There are a lot of questions about why scoutess is the way it is, and whether that is even the way we want it. In this post I will try to outline some of the ideas behind scoutess.

The Purpose
===========

The primary goal of scoutess is to allow project maintainers to automate many of the tedious parts of maintaining a project. The name 'scoutess' was chosen to reflect the idea that a lot of the work done by scoutess is collecting information and creating reports about the findings. Scoutess includes buildbot-type functionality, but is not limited to buildbot tasks. Additionally, unlike general-purpose systems like Jenkins, scoutess is specialized for Cabal and Haskell development. Here are some of the many things we would like scoutess to be able to do:

 - watch source repositories and rebuild packages on new commits
 - watch for changes to build-dependencies (on hackage or commits to source repos) and test that your packages still build
 - automatically notify you when your upper bounds are out of date (similar to what packdeps does)
 - automatically build cross-linked haddock documentation and upload it to your project site
 - integration with an ircbot that reports important information
 - ability to test builds against multiple versions of the haskell platform and multiple OSes
 - automatically check that the *lower* bounds in your .cabal file are still valid
 - scan communities like reddit, irc, stackoverflow for relevant discussions that you may want to take part in
 - check for reverse dependencies that are broken as a result of changes to your package
 - check that a package you upload to hackage will actually build (e.g., is not missing dependencies)
 - notify you if you bumped the .cabal version in source, but forgot to upload a new version
 - notify you if you uploaded a new version to hackage but forgot to push your patch to the repo
 - commit notification on irc

Key Realizations
================

modularity
----------

When looking at the tasks, it is clear that there are a lot of reusable parts. If you want to build the haddock documentation for the latest dev sources, you have to start by checking out the sources. If you want to check that the latest dev sources build, you have to start by doing a checkout.

Now, let's say you want to create a new module that runs hlint on the source and reports coding 'violations'. Once again, you need to start by getting the source.

We would like to design scoutess so that it is very modular and not at all monolithic. We would like to provide a bunch of 'black boxes' that you can wire together in different ways to get customized functionality. If you want to add new functionality, it should be easy to create a new self-contained module that does just that.
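
To make that concrete, here is a minimal, entirely hypothetical sketch (none of these names are scoutess's real API): the checkout step is written once, and every task, including a brand new hlint module, is just another consumer of its output.

    -- hypothetical names; bodies stubbed out, types-first style
    newtype SourceDir = SourceDir FilePath

    getSource :: String -> IO SourceDir   -- the shared checkout step
    getSource repoUrl = undefined

    buildHaddock :: SourceDir -> IO ()    -- one consumer
    buildHaddock srcDir = undefined

    runHLint :: SourceDir -> IO ()        -- a new module just plugs in
    runHLint srcDir = undefined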

Implicit external state sucks, make it explicit
-----------------------------------------------

One beautiful aspect of purely functional programming is referential transparency. It is much easier to reason about a function when all the information it needs is explicitly passed in via its arguments. Not having to worry about global variables, or about getting a different answer for the same inputs, makes things much easier to understand.

This is especially useful when using libraries that you don't fully understand.

The problem with something like scoutess is that it involves loads of external, implicit state -- for example, whether a darcs repository is up to date or not.

The solution in scoutess is to always make implicit dependencies explicit in the type signature. For example, if a function 'foo' requires that a darcs repository be up to date, it should take as an argument a value that can only be produced by first running the function that updates the darcs repository.

The idea is simple: when someone wants to call 'foo', they probably have no idea what needs to be done before 'foo' can be run. And they aren't really going to read the docs to find out, either. However, since they can't call 'foo' without all its required arguments, they will have to look for the functions that generate those arguments. In that way, they are assured of always calling the prerequisite functions in the correct order.
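
Here is a hedged sketch of the pattern (the module and names are made up for illustration). The trick is to not export the constructor, so the only way to obtain a RepoUpToDate value is to call updateRepo first:

    module Repo (RepoUpToDate, updateRepo, foo) where

    -- constructor deliberately not exported, so callers cannot forge one
    data RepoUpToDate = RepoUpToDate FilePath

    updateRepo :: FilePath -> IO RepoUpToDate
    updateRepo path = do
        -- run 'darcs pull' (or similar) here, then hand back the proof
        return (RepoUpToDate path)

    -- can only be called with evidence that the repo was updated
    foo :: RepoUpToDate -> IO ()
    foo (RepoUpToDate path) = putStrLn ("working in " ++ path)

Because the type checker enforces the ordering, the prerequisite is documented and checked for free.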

Arrows? Really?
---------------

In the DataFlow module, we use the Arrow syntax. Yet if you look at the type, you will see that it just uses Kleisli to turn a monad into an Arrow. So what is the point?

The point is really just to make the idea of 'black boxes' being wired together even more explicit. You basically have something like:

  outputs <- blackbox arg1 arg2 -< inputs

This clearly shows the output from the black box, shows which inputs come from other black boxes, and shows which arguments are just normal arguments. We only use this at the top level for building configurations.
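
For reference, here is roughly what that looks like in compilable form (a sketch; the real Scoutess type may differ in its details):

    {-# LANGUAGE Arrows #-}
    import Control.Arrow

    type Scoutess a b = Kleisli IO a b

    pipeline :: Scoutess FilePath ()
    pipeline = proc path -> do
        contents <- Kleisli readFile -< path      -- one black box
        Kleisli putStrLn             -< contents  -- wired into the next

Running it is just runKleisli pipeline "some-file".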

Haskell as config language
--------------------------

Like xmonad and other projects, we do not have a separate configuration language. Instead, your configuration is written in a plain old Haskell file. This is because scoutess configs will often want to apply Haskell functions to generate values.
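
A hypothetical example (these record names are not scoutess's real API; it is just to show the flavor):

    -- the config is a normal Haskell value ...
    data Config = Config { configTargets :: [String] }

    -- ... so ordinary functions can compute parts of it
    myConfig :: Config
    myConfig = Config
        { configTargets = map ("happstack-" ++) ["server", "util", "data"] }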

build configurations
--------------------

If you look at Scoutess.DataFlow

http://hub.darcs.net/alp/scoutess/browse/Scoutess/DataFlow.hs

you will see that it contains a function 'standard'. This is meant to represent a fairly standard 'build' process, but it is not intended to be the only one. It is expected that advanced users will create their own functions, similar to this one, which set up the kind of workflow they want for different runs.

Architecture
============

We are currently thinking a lot about the design of scoutess to ensure that it can actually do all the things we want it to do. We don't want to spend a lot of time writing code, only to find out it can't actually do what we need.

build scenarios
---------------

One thing we are working on is creating a list of all the weird and/or useful things that someone might want to do. Then for each thing, we try to identify all the steps that would be needed to do that.

You can see that list here:


Contributing new tricky cases would be incredibly useful.

prototype with types
---------------------

As mentioned earlier, we want to make all the state explicit in the type signatures. Once we have our list of build cases, and the steps we think are required, we can start turning those into code. But our goal is to largely avoid writing function bodies at first. We want to start by writing down the names of the functions we will need and their types -- basically, create the haddock docs first, and implement the functions second. The reason behind this is that in trying to write down the types, you discover information you are going to need that you hadn't thought about. If we can get the types right first, then writing the function bodies will be a lot easier, and we won't have to rewrite as much code.
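
In practice that means files full of signatures with stub bodies, something like this (hypothetical names):

    -- placeholder types, fleshed out once the design settles
    data BuildSpec   = BuildSpec
    data BuildReport = BuildReport

    -- the signature is the design; the body comes later
    build :: BuildSpec -> IO BuildReport
    build _ = error "TODO: implement once the types are agreed on"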

'standard'
----------

Now we will walk through the 'standard' function and try to figure out what it does, why it does that, and whether it is actually going to work.

Here is 'standard' for reference:

standard :: Scoutess SourceSpec SourceSpec    -- ^ sourceFilter
         -> Scoutess VersionSpec VersionSpec  -- ^ versionFilter
         -> Scoutess (SourceSpec, TargetSpec, Maybe PriorRun) BuildReport
standard sourceFilter versionFilter = proc (sourceSpec, targetSpec, mPriorRun) -> do
    initialiseSandbox                                   -< targetSpec
    (availableVersions, targetVersion)
                       <- fetchVersionSpec sourceFilter -< (targetSpec, sourceSpec)
    consideredVersions <- versionFilter                 -< availableVersions
    buildSpec          <- produceBuildSpec              -< (targetSpec, targetVersion, consideredVersions, mPriorRun)
    buildSources       <- getSources                    -< buildSpec
    index              <- updateLocalHackage            -< (buildSpec, buildSources)
    build                                               -< (buildSpec, targetSource buildSources, index)

First we will look at the type signature:

standard :: Scoutess SourceSpec SourceSpec    -- ^ sourceFilter
         -> Scoutess VersionSpec VersionSpec  -- ^ versionFilter
         -> Scoutess (SourceSpec, TargetSpec, Maybe PriorRun) BuildReport

SourceSpec is a type that tells us where we can find Cabal packages. It is a Set of SourceLocation, where a SourceLocation can be an RCS repository like darcs, or a cabal package repository such as hackage.
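
Roughly, in simplified form (the actual constructors carry more detail):

    import Data.Set (Set)

    data SourceLocation
        = Darcs String   -- a repository URL
        | Hackage        -- the central package repository
        -- (plus other kinds of locations)

    newtype SourceSpec = SourceSpec (Set SourceLocation)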

standard takes two filters: one that filters sources and one that filters versions. We will see what those do in a moment. The result of standard is an arrow.

TargetSpec specifies what we actually want to build. For example, SourceSpec might just contain hackage, and TargetSpec would be the specific packages that we want to build. PriorRun provides information about what happened in previous runs; this is how we can determine whether a package needs to be rebuilt or not. Note that when we specify what targets we want to build, there are varying levels of precision available: we might provide just a package name, or a package name and version, or a package name and source location. That is why we need a resolution step -- if we only specify a package name, then we need to look at all the locations and find out what the most recent available version is.
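
Those precision levels could be expressed with a type along these lines (a hypothetical sketch, reusing SourceLocation from the sketch above):

    import Data.Version (Version)

    data Target
        = ByName     String                 -- just a package name
        | ByVersion  String Version         -- name plus an exact version
        | ByLocation String SourceLocation  -- name plus where to fetch it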

The result is a BuildReport that summarizes the findings.

    initialiseSandbox                                   -< targetSpec

The first line initialises the sandbox for the build. It needs the TargetSpec because the TargetSpec specifies where the temp directories are, where the localhackage server is, and other information like that. The next step is:
    (availableVersions, targetVersion)
                       <- fetchVersionSpec sourceFilter -< (targetSpec, sourceSpec)

fetchVersionSpec looks at the source locations and hackage repositories and finds out what is available to us. Next we have:

    consideredVersions <- versionFilter                 -< availableVersions

This allows us to eliminate certain available versions from consideration. The idea behind the versionFilter is to let you do something like: "check that this package builds even if only the most recent version of every package is available on hackage".
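
For instance, a 'most recent version only' filter might look like this sketch, assuming (hypothetically) that a VersionSpec is essentially a map from package name to available versions; it could then be lifted into the Scoutess arrow with arr:

    import Data.Version (Version)
    import qualified Data.Map as Map

    type VersionSpec = Map.Map String [Version]

    -- keep only the newest version of each package
    -- (assumes every package has at least one version)
    latestOnly :: VersionSpec -> VersionSpec
    latestOnly = Map.map (\vs -> [maximum vs])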

The next line is:

    buildSpec          <- produceBuildSpec              -< (targetSpec, targetVersion, consideredVersions, mPriorRun)

This is where we actually calculate what to build. At this point we have a list of targets to build, and a narrowed-down list of places to get those targets and their build-dependencies from. We also pass in an optional PriorRun. If a package has not changed, and none of its dependencies have changed, then we can skip rebuilding it. This is essential when you have hundreds of targets.
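
The skip test itself is simple in principle. A hypothetical sketch (assuming an acyclic dependency graph):

    data Package = Package
        { pkgChanged :: Bool       -- did its source change since the prior run?
        , pkgDepends :: [Package]  -- its build-dependencies
        }

    -- rebuild if the package changed, or anything it depends on must be rebuilt
    needsRebuild :: Package -> Bool
    needsRebuild p = pkgChanged p || any needsRebuild (pkgDepends p)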

Next we have:
    buildSources       <- getSources                    -< buildSpec

We now know exactly which packages are going to be required for the build, so next we need to get the cabal packages for them. getSources is probably not the best name for this function; I think getCabalPackages would be more accurate. I believe that for source repositories it runs cabal sdist to produce the .tar.gz files, and for Hackage it downloads the package (or gets it from a local cache). Next we have:

    index              <- updateLocalHackage            -< (buildSpec, buildSources)

Now that we have all the cabal packages available, we need to make them visible to cabal-install. We do this by installing them all into the localhackage repo. We can share the archives directory across builds, but each build starts with a fresh, empty package repo, and we only add the packages listed in buildSources.

Finally, we have:

    build                                               -< (buildSpec, targetSource buildSources, index)

We have prepared the sandbox and created a localhackage server that contains all the packages we need. So now we can build the targets we requested.

LocalHackage?
--------------

One question that has been raised is why we copy packages from hackage into LocalHackage. cabal allows you to specify a union of repositories -- so we could keep only the local sources in localhackage and get the remote sources from hackage. However, that requires some extra trickery. In the build stage we are currently just doing 'cabal install foo' to install the packages. Because LocalHackage only contains the packages we want to consider, we do not need any version constraints. If we included hackage + localhackage, then we would need to add a bunch of --constraint flags during the build stage so that cabal does not try to use packages from hackage that we eliminated in earlier steps.
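
To illustrate (the package names and versions here are hypothetical), the hackage + localhackage approach would turn a plain 'cabal install foo' into something like:

    cabal install foo --constraint='bar == 0.4.2' --constraint='baz == 1.1'

with one --constraint flag for every package whose version we pinned down in the filtering steps.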

So, we are currently using a whitelist approach instead of a blacklist approach. There are a few advantages to the whitelist approach. One is that the LocalHackage repository represents exactly what was available at the time of the build. In the case of build failures, this makes it easier for developers to diagnose the problem: they can simply point their cabal at that localhackage and run 'cabal install foo'. They do not need to also add a bunch of constraint flags.

Also, in the case of a build farm, the LocalHackage server could be on the same subnet as the buildbots. So, they would be able to get the source over a local connection instead of having to go to hackage when they don't already have the package cached locally.

It is not clear what the advantages of using remote hackage + constraints are. The packages have to be downloaded from hackage either way, and it seems more complicated. The time required to generate the 00-index.tar should not be significantly affected by having to insert all the packages instead of just the local ones.

Conclusion
==========

The current design is still very much open for discussion. If you think it is wrong, please let us know! One big problem has been the lack of information explaining the current design. I hope this message will serve as a starting point for understanding the design, goals, and philosophy. We are also very interested in poking holes in the design before we write too much code. Doing a healthy amount of upfront design and thought can save us a lot of pain in the long run. Adding new tricky cases would be a big help, especially ones that the current design fails on!

Obviously, writing code is more fun than writing design docs. But there will be plenty of code to write soon enough.

- jeremy