Travis, Jenkins, testing, coverage and technical debt

Don MacMillen

Dec 30, 2014, 3:31:38 PM

to juli...@googlegroups.com

First, I will make my apologies. I am not saying below that anyone

needs to work smarter or harder, for I don't think I have ever come

across a community that is as smart or hardworking as the Julia

community, but perhaps in a slightly different way.

I have followed with interest the recent exchanges about Julia and

testing. (https://groups.google.com/forum/#!topic/julia-users/GyH8nhExY9I

and https://github.com/JuliaLang/julia/issues/9493).

The thing that struck me as odd was the directions from Tim on how to

look at the coverage. I thought, shouldn't you just be able to

point me to the CI url with your coverage summaries? Shouldn't we all

have eyes on the same set of results from the same setup and run? It

has been my experience through the years that if you rely on

developers to run coverage you will be (mostly) disappointed. You can

make it a priority, there will be a flurry of activity, but then it

will gradually fall by the wayside. It absolutely needs to be part of

the CI system.

I am not familiar with Travis, but have used Jenkins quite a bit and

assumed Travis to be similar to it, so I took my first look at

the Travis results for Julia. The only thing I see is a consecutive

list of builds and clicking into one of them brings up the build log

file. Is there more to the story that I am missing? Usually the log

file is the last thing I want to be rummaging around in to find out

what's wrong.

There is a huge difference when _everything_ is available as a

dynamically updated url: an overall status dashboard, performance

tracking history, current coverage of all the source code (where you

can click into the files and see line by line coverage), artifact

promotion, etc. Again, we can all look at the same thing at the same

time. I am not saying that Jenkins is the only solution. Perhaps all

of this can be done with Travis, but the effort level is probably the

same.

Searching back through the dev mailing list, I see that the opinion on

Jenkins is that "it seems to be enterprisey bloatware". I believe you

have missed the mark on that one. In my last company with ~12

developers, one buildmeister and one QA, we could not have moved the

code base (C++ and Python) from a University project to a commercial

endeavor without Jenkins, and if you ask _anyone_ on the team, their

opinion would be the same. Yes, it is work to get what you want but

you do not have to get it all at once and the work is highly leveraged.

On the issue of technical debt, are there unit tests for the C code

and the lisp front end code? It wasn't obvious to me where they might

live. Do you have coverage numbers for that code base? If not, what

are the plans here? Yes, coverage is a flawed metric for code

quality. But until the coverage numbers are in a non-embarrassing

range (approaching 80%) then it is senseless to have a debate about

it. Coverage is best had by unit tests written concurrently with the

code. I am not just saying that but have lived it.

So what is to be done? First I applaud Tim Holy's call to action in

https://github.com/JuliaLang/julia/issues/9493 (Although you might

want to consider a little coordination there in order to avoid a

smash bros. melee.) Second, invest more in Travis (if feasible) or

move and invest in Jenkins (there is a "FOSS Free" program for hosted

Jenkins at Cloudbees https://www.cloudbees.com/resources/foss).

Third, encourage (or enforce) the habit of no code gets checked in

without coverage.
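
One way to read the third point is as a CI gate on changed lines. A minimal Python sketch, assuming the changed lines (from the diff) and the executed lines (from the coverage run) have already been collected; the function name, file paths, and data layout are hypothetical and not part of Travis, Jenkins, or any Julia tooling:

```python
def uncovered_changes(changed, covered):
    """Map each file to the changed lines that no test executed.

    changed: {path: iterable of line numbers touched by the commit}
    covered: {path: set of line numbers executed by the test run}
    """
    gaps = {}
    for path, lines in changed.items():
        missing = sorted(set(lines) - covered.get(path, set()))
        if missing:
            gaps[path] = missing
    return gaps


# Hypothetical input: one changed line in src/stats.py was never run.
changed = {"src/stats.py": [10, 11, 12], "src/io.py": [40]}
covered = {"src/stats.py": {10, 11}, "src/io.py": {40, 41}}

gaps = uncovered_changes(changed, covered)
for path, lines in gaps.items():
    print(f"{path}: changed but untested lines {lines}")
```

A CI job could fail the build whenever `gaps` is non-empty, which is the "enforce" end of the spectrum; printing it as a warning would be the "encourage" end.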

I would just like to end by saying that I am absolutely amazed at the

accomplishments of this group and congratulate you all for that. I

hope to be using Julia for many years to come and look forward to the

time when I can do that in production code.

Don

Stefan Karpinski

Dec 31, 2014, 11:36:00 AM

to Julia Dev

If we used Jenkins, where would it run and who would maintain those servers? Travis is a hosted CI service, which means that we don't need to manage or maintain any servers. In my experience, unless it's someone's full-time job to keep the servers up and running, things stop working after a few months and stay that way indefinitely.

Stefan Karpinski

Dec 31, 2014, 11:37:19 AM

to Julia Dev

+100 – let's just put this straight into a document. The question is just where it belongs.

Keno Fischer

Dec 31, 2014, 11:53:24 AM

to juli...@googlegroups.com

> I am not familiar with Travis, but have used Jenkins quite a bit and
> assumed Travis to be similar to it, so I took my first look at
> the Travis results for Julia. The only thing I see is a consecutive
> list of builds and clicking into one of them brings up the build log
> file. Is there more to the story that I am missing? Usually the log
> file is the last thing I want to be rummaging around in to find out
> what's wrong.

Travis is very easy to set up, free for our use case and integrates nicely with github. As a CI system "all" it does is run the build and the tests, which isn't a lot, but enough to make sure that changes build across the platforms we support and don't break any tests.

> There is a huge difference when _everything_ is available as a
> dynamically updated url: an overall status dashboard, performance
> tracking history, current coverage of all the source code (where you
> can click into the files and see line by line coverage), artifact
> promotion, etc. Again, we can all look at the same thing at the same
> time. I am not saying that Jenkins is the only solution. Perhaps all
> of this can be done with Travis, but the effort level is probably the
> same.

Julia packages have been using coveralls.io to aggregate the coverage info generated during Travis runs and display it (see e.g. https://coveralls.io/r/dcjones/Gadfly.jl). Regarding performance tracking there used to be http://speed.julialang.org/, but we weren't too satisfied with the software running it and I think we should just write a simple server (maybe even in julia) that takes the performance results and displays them. Of course Travis isn't ideal for performance comparisons, but we have some resources at MIT we have used and can use.

> Searching back through the dev mailing list, I see that the opinion on
> Jenkins is that "it seems to be enterprisey bloatware". I believe you
> have missed the mark on that one. In my last company with ~12
> developers, one buildmeister and one QA, we could not have moved the
> code base (C++ and Python) from a University project to a commercial
> endeavor without Jenkins, and if you ask _anyone_ on the team, their
> opinion would be the same. Yes, it is work to get what you want but
> you do not have to get it all at once and the work is highly leveraged.

I agree that Jenkins is a decent CI system and I actually spent quite a bit of time a while back trying to set it up for Julia (this was before Travis). The only problem with Jenkins is that its GitHub integration is pretty bad (or at least was when I was working on it) and that it's quite a nontrivial task to keep it running well. I'm not sure it's worth it over the Travis/Coveralls solution.

> On the issue of technical debt, are there unit tests for the C code
> and the lisp front end code? It wasn't obvious to me where they might
> live. Do you have coverage numbers for that code base? If not, what
> are the plans here? Yes, coverage is a flawed metric for code
> quality. But until the coverage numbers are in a non-embarrassing
> range (approaching 80%) then it is senseless to have a debate about
> it. Coverage is best had by unit tests written concurrently with the
> code. I am not just saying that but have lived it.

I don't think we need dedicated unit tests for the C and lisp code separate from the regular tests. When something is wrong in that code, it's usually a miscompilation or a crash that can be easily tested with regular Julia code.

> So what is to be done? First I applaud Tim Holy's call to action in
> https://github.com/JuliaLang/julia/issues/9493 (Although you might
> want to consider a little coordination there in order to avoid a
> smash bros. melee.) Second, invest more in Travis (if feasible) or
> move and invest in Jenkins (there is a "FOSS Free" program for hosted
> Jenkins at Cloudbees https://www.cloudbees.com/resources/foss).
> Third, encourage (or enforce) the habit of no code gets checked in
> without coverage.

I think it would be a good project for somebody to get back on the performance tracking thing, which is the one thing I think we're sorely missing. If somebody wants to do this, please coordinate with Elliot (staticfloat), who's already running a bunch of buildbots in addition to the CI system.

Stefan Karpinski

Dec 31, 2014, 12:06:35 PM

to Julia Dev

On Wed, Dec 31, 2014 at 11:36 AM, Stefan Karpinski <ste...@karpinski.org> wrote:

> +100 – let's just put this straight into a document. The question is just where it belongs.

Ah, this is weird – not sure how this comment ended up on this thread. Sorry.

Stefan Karpinski

Dec 31, 2014, 12:19:53 PM

to Julia Dev

I missed the bit about the Jenkins-based CI service at the bottom. That would be a viable option. But I'm not sure that reporting is our biggest issue when it comes to testing and coverage. I'm all for pushing towards 100% coverage of Base and then insisting that changes maintain that, but for that to work, we need a better way of measuring coverage. In particular, using lines doesn't work well – it seems to me that counting n-grams of basic blocks is much better, which gives different coverage rates for each n, the most basic being percentage of basic blocks reached, then next being the percentage of pairs of basic blocks traversed. For n > 1, of course, you need to make sure that the denominator only consists of feasible n-grams of basic blocks, which is a somewhat non-trivial analysis. Keno, do you think we could measure this (at least for n = 1) by instrumenting our LLVM code to record counts?
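
To make the n-gram idea concrete, here is a minimal Python sketch. It is not Julia's or LLVM's instrumentation: it assumes each test run has already produced a trace of basic-block ids, and it takes a hypothetical control-flow graph as a plain dictionary. It also treats every path in that graph as feasible, which sidesteps the non-trivial feasibility analysis mentioned above.

```python
from itertools import islice


def ngrams(seq, n):
    """Yield consecutive n-tuples from a sequence of basic-block ids."""
    return zip(*(islice(seq, i, None) for i in range(n)))


def feasible_ngrams(cfg, n):
    """Enumerate every length-n path through the control-flow graph.

    cfg maps a block id to the list of blocks that may follow it.
    Assumption: any path in the graph counts as feasible.
    """
    paths = {(block,) for block in cfg}
    for _ in range(n - 1):
        paths = {p + (succ,) for p in paths for succ in cfg.get(p[-1], [])}
    return paths


def ngram_coverage(cfg, traces, n):
    """Fraction of feasible n-grams of basic blocks seen in any trace."""
    feasible = feasible_ngrams(cfg, n)
    seen = set()
    for trace in traces:
        seen.update(ngrams(trace, n))
    return len(seen & feasible) / len(feasible)


# Hypothetical CFG for a single ternary: entry branches to "then" or
# "else", and both rejoin at "exit".
CFG = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
TRACES = [["entry", "then", "exit"]]  # one test, one path

cov1 = ngram_coverage(CFG, TRACES, 1)  # blocks reached: 3 of 4
cov2 = ngram_coverage(CFG, TRACES, 2)  # block pairs traversed: 2 of 4
print(cov1, cov2)
```

With a single test through the "then" side, three of the four blocks are reached but only two of the four block pairs are traversed, so the rate drops as n grows.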

Keno Fischer

Dec 31, 2014, 12:31:42 PM

to juli...@googlegroups.com

> Keno, do you think we could measure this (at least for n = 1) by instrumenting our LLVM code to record counts?

If I remember correctly, Clang's instrumentation is done in clang as opposed to being an instrumentation pass. I believe there's an asan based coverage pass, but I haven't looked into it. In either case, it doesn't seem too hard to do by hand.

Tim Holy

Dec 31, 2014, 5:51:08 PM

to juli...@googlegroups.com

If we want to rely on metrics, presumably we still have to solve the "how to
accurately measure coverage?" problem, and to do that we need to distinguish
between useful, compilable lines and junk. Right now I think that's a much
bigger problem than worrying about internal branches in functions.

If the strategy outlined here:
https://github.com/JuliaLang/julia/issues/7541
seems reasonable, I'd be happy to collaborate with someone who knows some
Scheme and can handle step 1 (which, practically speaking, I can't).

--Tim

Stefan Karpinski

Dec 31, 2014, 6:05:27 PM

to Julia Dev

The thing is that lines don't matter at all for coverage. When testing coverage, you want to know if all the possible paths through your code were taken. A branch-free 25-line function contains fewer possible code paths than a single line that includes a ternary operator or two. If you use line counts then both will look equally covered by a test that executes a single path through each but the one line that includes branches will actually not be well covered at all – and having a branch on a single line of code is not uncommon in Julia. A single test that covers the 25-line branch-free function also counts disproportionately much when you're doing line coverage, but really not that much is being tested. Basic blocks are the right level of abstraction for code coverage since they are units of straight-line code.
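
The effect is easy to reproduce outside Julia. Below is a small Python sketch (a stand-in, not Julia's coverage machinery) that records executed lines with `sys.settrace`. The two functions and the `lines_hit` helper are hypothetical examples written for this illustration.

```python
import sys


def sign_label(x):
    return "neg" if x < 0 else "zero" if x == 0 else "pos"  # three paths, one line


def straight_line(x):
    a = x + 1
    b = a * 2
    c = b - 3
    return c


def lines_hit(func, *args):
    """Return the line offsets (relative to the def line) executed by func(*args)."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only follow the frame of the function under test.
        if frame.f_code is code:
            if event == "line":
                hit.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return hit


# A single call marks the one-line function as fully line-covered,
# even though only one of its three paths ran.
print(lines_hit(sign_label, 5))
print(lines_hit(straight_line, 1))
```

One call to each function yields complete line coverage for both, yet `sign_label` has had only one of its three paths exercised while `straight_line` has had its only path exercised. A line count cannot tell those two situations apart.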

Don MacMillen

Dec 31, 2014, 11:44:12 PM

to juli...@googlegroups.com

Just a few more thoughts and responses on this topic before I go back

to my part time lurker identity.

The big win in using a tool like Jenkins is in automating the entire

development and deployment pipeline with total transparency. So yes,

this is much more than git hooks, miscellaneous scripts, Makefiles,

and cron jobs. There is a qualitative difference in development when

this information, (builds, regressions, coverage, performance,

valgrind, artifact creation and promotion) is built automatically and

available to all. But, again, you do not have to have the all singing,

all dancing version from day one.

As you start to flesh out more testing you will find that there are

tests that no longer belong with the build. Most of these should be

moved to the nightlies with results available in a dashboard. Not

only last night's results, but results going back a specified time

period. The smoke tests, or the level 0, or whatever nomenclature you

like, stays with the CI build.

Yes, code coverage can be a very distorted thing. It almost doesn't

matter to me exactly how code coverage is calculated, at least at

first. But automate it with the nightlies. Make it visible. Start

with something. Make it better. It is the process automation that

counts here.

I have to respectfully disagree (strongly, as it turns out)

with @keno on

"I don't think we need dedicated unit tests for the C and

lisp code separate from the regular tests."

I could spin an argument around "first line of defense", "developer

intent" (especially in an arcane code base), "localized errors", etc,

but there is plenty of literature out there on test driven development

that does that already. I am pretty sure I am not going to convince

anyone in a post to a group discussion, so I will just have to agree

to disagree.

Finally, there are multiple comments about the resources necessary to

do these things. Stefan thinks that it is a full time _job_ to keep

servers up. From my experience, that is not correct. It is, however,

a full time _responsibility_ for someone to keep them up. You will not

burn a full person year keeping servers up for a year. So hosted

services are great when you cannot amortize the support costs of doing

things yourself on EC2 over different tasks.

It has also been said by @keno that keeping Jenkins up is a non-trivial

task. I don't think that is really correct either. As you start to

automate the pipeline, you will see many opportunities to make

developers' lives easier and hence there is a long period of features

being added, searching for the right Jenkins plugins, writing your own

for specialized tasks, etc. But all of that has incremental and

immediate pay offs. The problem then becomes the resource usage of

Jenkins. But we kept it on an m3.large class machine for a long time

before we had to bump it up. If you just wanted to replicate what you

have today on Travis, that's a day's worth of work for an expert and I

doubt you would need to touch it much after that. But that would be

missing the point totally.

Fundamentally, it is the difference between continuous integration and

continuous delivery. There must be other folks on this list with a

perspective on this. Comments from those who have used Jenkins in a

continuous delivery shop?

Keno Fischer

Jan 1, 2015, 3:17:24 AM

to juli...@googlegroups.com

> I have to respectfully disagree (strongly, as it turns out)
> with @keno on
> "I don't think we need dedicated unit tests for the C and
> lisp code separate from the regular tests."
> I could spin an argument around "first line of defense", "developer
> intent" (especially in an arcane code base), "localized errors", etc,
> but there is plenty of literature out there on test driven development
> that does that already. I am pretty sure I am not going to convince
> anyone in a post to a group discussion, so I will just have to agree
> to disagree.

I don't disagree in principle, but I think it's less necessary in programming languages

than in a lot of other projects, precisely because in order to be useful it needs

to be exposed in the language. The other project with which I am familiar

is LLVM whose entire test suite is in LLVM bitcode that drives LLVM from

the command line and this is the preferred form of testing in that project.

Admittedly there are a couple C++ unit tests, but those are for APIs that are

not exposed by the tools, which we don't really have in julia (an exception to

this might be a test that tests the embedding APIs).

On that note though, if you have any specific tests in mind for the C and LISP code,

I'd love to be convinced otherwise on this point.

> Finally, there are multiple comments about the resources necessary to
> do these things. Stefan thinks that it is a full time _job_ to keep
> servers up. From my experience, that is not correct. It is, however,
> a full time _responsibility_ for someone to keep them up. You will not
> burn a full person year keeping servers up for a year. So hosted
> services are great when you cannot amortize the support costs of doing
> things yourself on EC2 over different tasks.

> It has also been said by @keno that keeping Jenkins up is a non-trivial
> task. I don't think that is really correct either. As you start to
> automate the pipeline, you will see many opportunities to make
> developers' lives easier and hence there is a long period of features
> being added, searching for the right Jenkins plugins, writing your own
> for specialized tasks, etc. But all of that has incremental and
> immediate pay offs. The problem then becomes the resource usage of
> Jenkins. But we kept it on an m3.large class machine for a long time
> before we had to bump it up. If you just wanted to replicate what you
> have today on Travis, that's a day's worth of work for an expert and I
> doubt you would need to touch it much after that. But that would be
> missing the point totally.

You might be right, all I remember was my experience setting up Jenkins being frustrating,

with lots of plugins being unmaintained and my bug reports ignored (though I just checked, and my blocking bug from back then was actually fixed a year and a half later, so I guess that's something).

Plus you don't just need to maintain a Jenkins install, you also need to maintain the virtualization setup, since you'll be building arbitrary pull requests, keep on top of security updates, etc. I'm not saying this is impossible or even a full time job, but from my experience it's a significant commitment

that should not be taken lightly. I'm not opposed to setting something like Jenkins up, but we'd have to have somebody come forward who's willing to do the maintenance and knows enough about the Jenkins internals to be able to fix bugs in the problems as they come up.

Although, quite frankly, I'd prefer those time resources being put into making julia's test suite better rather than wrangling with infrastructure.

I'm also still not convinced that setting up Jenkins would magically make our life easier as I don't think we've reached the limits of the existing tooling yet. If you (or anybody else - don't mean to put you on the spot ;) ), think we have, I'd love to hear the reason why and discuss the best way to address that. It might be to set up something else, or it might simply be a feature request to travis for example.

Tim Holy

Jan 1, 2015, 10:19:59 AM

to juli...@googlegroups.com

I take your point, and it's certainly valid. However, to make the opposite
point clearer: I don't care about lines, that's much too fine-grained. I would
be happy if we could just count the number of whole _methods_ that lack any
kind of test coverage. But currently we can't even do that, simply because our
counting is based entirely on things that get compiled, and it doesn't get
compiled unless it gets run.

--Tim

Don MacMillen

Jan 1, 2015, 5:41:06 PM

to juli...@googlegroups.com

OK, this is _really_ my final post. I hope you all have a great

New Year.

"I don't disagree in principle, but I think it's less necessary in

programming languages than in a lot of other projects, precisely

because in order to be useful it needs to be exposed in the language."

Well, maybe. But for that to be really true a couple of antecedents

need to be true as well. The first is that you actually measure the

coverage of the C code and the Lisp from the Julia test suite. The C

code should be easy. Can you do that on the Lisp code? That would be

very cool. The second is the ability to look at the uncovered C and

Lisp and be able to deterministically write small Julia tests to

target the uncovered lines. That sounds hard to me. It has been my

experience that the further away from the code you are trying to

exercise the more state you have to inject to get the controllability

and observability you need.

"On that note though, if you have any specific tests in mind for the C

and LISP code, I'd love to be convinced otherwise on this point."

Thanks for keeping an open mind, but it is your code. All I am asking

about are the zeroth level, non-controversial (I thought) quality metrics.

"I'm also still not convinced that setting up Jenkins would magically

make our life easier as I don't think we've reached the limits of the

existing tooling yet"

There is no magic, only investment. Yes, you can invest more in the

current tool chain, but then I think you may wind up reinventing the

wheel in many cases. You guys have to decide how you want to invest,

or not.

Finally for Tim, I didn't realize that you can't get at method level

coverage. That should be a simple output of the line level coverage

and indeed that was shown on our coverage summary url's. Easy to see

since uncovered methods stick out when everyone can see the same web

page.

But isn't the problem you describe easily solvable? Don't you just

need some post process coverage scripts that first walk the source

tree and record all the methods, then party on the raw coverage data

and output the appropriate xml? How hard can that be? But maybe I am

missing the point?
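
The post-processing step described here could be sketched roughly as follows, in Python. This assumes two inputs that something else must supply: a list of methods with their line ranges (from a parser, which is the hard part) and per-line execution counts from the raw coverage data. The `Method` type, the function name, and the sample data are hypothetical and do not reflect the .cov file format.

```python
from typing import NamedTuple


class Method(NamedTuple):
    name: str
    first_line: int
    last_line: int


def uncovered_methods(methods, line_hits):
    """Return the names of methods in which no line was ever executed.

    line_hits maps line number -> execution count. Lines that were never
    compiled are simply absent, which is why the method list has to come
    from the source tree rather than from the coverage data itself.
    """
    return [
        m.name
        for m in methods
        if not any(
            line_hits.get(n, 0) > 0
            for n in range(m.first_line, m.last_line + 1)
        )
    ]


# Hypothetical file with three methods; "scale" was never compiled,
# so the raw data has no entries for its lines at all.
methods = [Method("area", 1, 3), Method("perimeter", 5, 7), Method("scale", 9, 12)]
line_hits = {1: 4, 2: 4, 3: 4, 5: 0, 6: 0, 7: 0}

missing = uncovered_methods(methods, line_hits)
print(missing)
```

The sketch reports both the method that was compiled but never run and the one that never appears in the coverage data, provided the method list is built independently of that data.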

Every development team I have ever been a part of has built and

maintained a set of internal tools. (I don't mean to imply that any of

that was my idea, I have just had the wonderful opportunity to work

with some great folks). It is always a struggle in deciding to roll your

own or shoe-horn someone else's tool. But all that is needed is just

good judgement and good engineering. No magic required.

So invest in your future. Who is better to do it than you guys?

Don

Jameson Nash

Jan 1, 2015, 5:48:47 PM

to juli...@googlegroups.com

https://github.com/JuliaLang/julia/commit/88783ec1503f065cb271e6a7eb10274c8d7e6103#diff-75b23cc61d190dd99feab17b8e297ee3R2017

adding some tests for the C code turns out to be a good idea :)

Tobi

Jan 1, 2015, 6:07:27 PM

to juli...@googlegroups.com

Which shows that the C code is not that far away from the julia code :-)

Tim Holy

Jan 1, 2015, 6:17:10 PM

to juli...@googlegroups.com

On Thursday, January 01, 2015 02:41:06 PM Don MacMillen wrote:
> OK, this is _really_ my final post. I hope you all have a great
> New Year.

I hope it's not your final post!

> Finally for Tim, I didn't realize that you can't get at method level
> coverage. That should be a simple output of the line level coverage
> and indeed that was shown on our coverage summary url's. Easy to see
> since uncovered methods stick out when everyone can see the same web
> page.
>
> But isn't the problem you describe easily solvable? Don't you just
> need some post process coverage scripts that first walk the source
> tree and record all the methods, then party on the raw coverage data
> and output the appropriate xml? How hard can that be? But maybe I am
> missing the point?

As you say, the only missing component is parsing the source code to get a
list of all the methods (and their first line number). It's not entirely
trivial because (among other reasons) `end` is used for many things in julia.
I know nothing about parsing---I could probably figure it out, but there are
others who already know much more than me, so I'm looking for a collaborator
here.

I figure it's better to either leverage julia's built-in parser (written in
Lisp, which I don't speak) or use
https://github.com/jakebolewski/JuliaParser.jl.

--Tim

Tony Kelman

Jan 1, 2015, 9:15:36 PM

to juli...@googlegroups.com

Coverage is a not-exactly-easy question with Julia code as others have mentioned, but with more effort we can hopefully figure it out.

On infrastructure, we could just buy EC2 machines or whatever equivalent and do everything there. But given that those of us who've done most of the work in setting up and paying attention to our current Travis and AppVeyor CI services, and the buildbots we use for building nightlies and release binaries, have little or no experience with Jenkins, I don't think we're likely to experiment with it on our own. If someone really familiar with Jenkins wants to contribute by setting up a trial run I'm sure we'd be willing to entertain it and evaluate its reliability and ease-of-use versus what we're doing now.

Considering how well Elliot's buildbot (http://buildbot.e.ip.saba.us:8010/builders/) has been working for building binaries, I do think we should look into expanding what we use it for. There's an outstanding bug https://github.com/staticfloat/julia-buildbot/issues/3#issuecomment-66222260 that means the Windows builders can't run the tests right now; we should probably bother Jameson about that to see if he can help get it working. We should also promote the buildbot to an easier-to-remember URL like build.julialang.org, and start hooking it up to Github's status API.

I also think a few tweaks to process, having core developers go through PR's for a few more things we would normally commit straight to master, would also be useful for at least providing a CI buffer window as opposed to accidentally breaking master along with all PR's that get submitted in the meantime.

Keno Fischer

Jan 2, 2015, 4:22:56 AM

to juli...@googlegroups.com

Given Jameson's example I realize we may have been talking past each other. I am absolutely not opposed to adding tests like that, that just ccall into the runtime. What I didn't want was a totally separate testsuite using one of the C unittest frameworks to just test the C or lisp code.

Tobi

Jan 2, 2015, 5:30:33 AM

to juli...@googlegroups.com

Although one has to mention that this is only possible for the exported symbols.

My opinion is that creating a test suite for the C part is not the most critical issue currently. The most meaningful tests can be made from Julia anyway, so there is a large portion that is well tested. The same holds for the lisp code.

Stefan Karpinski

Jan 2, 2015, 5:46:32 PM

to Julia Dev

On Fri, Jan 2, 2015 at 4:22 AM, Keno Fischer <kfis...@college.harvard.edu> wrote:

> Given Jameson's example I realize we may have been talking past each other. I am absolutely not opposed to adding tests like that, that just ccall into the runtime. What I didn't want was a totally separate testsuite using one of the C unittest frameworks to just test the C or lisp code.

As it turns out, Julia is a pretty good way of unit testing C code :-)

Stefan Karpinski

Jan 2, 2015, 5:51:19 PM

to Julia Dev

On Fri, Jan 2, 2015 at 5:30 AM, Tobi <tobias...@googlemail.com> wrote:

> My opinion is that creating a test suite for the C part is not the most critical issue currently. The most meaningful tests can be made from Julia anyway, so there is a large portion that is well tested. The same holds for the lisp code.

I wholeheartedly agree. What we need, imo, is a way of measuring coverage such that 100% actually means that every basic block has been executed at some point. Once we reach that, we can start worrying about n-grams of basic blocks for n > 1. If we can reach 100% for n = 2, I think we will really be doing well for coverage. This is ambitious, but I think it would be transformative for the reliability of base Julia.

So the question is how do we measure coverage in a way that actually works? Since we lower every piece of Julia code before doing inlining or type inference, would it be most effective to add coverage instrumentation as part of lowering? I suspect that the lowered Julia AST is almost as conducive to basic block coverage counting as LLVM code is.

Leah Hanson

Jan 3, 2015, 2:09:08 PM

to Julia Dev

I don't think optimizing the coverage statistics first is that helpful. Picking any reasonable metric and getting it to 80% seems like a good way to improve test coverage. Whole methods or even whole functions would be reasonable starting places; whatever is easy/quick to implement is probably pretty good.

If you're not at say 80% for the sane-but-flawed metric, then you have to add more tests, which is the point. Spending a bunch of time figuring out and implementing a fancy metric doesn't increase the test coverage. Coverage metrics are never going to be an indication of perfect testing.

There's room for optimization of the metric in parallel with the adding-of-tests for the first metric and after the first metric reaches 80+%, but optimizing the metric when a weaker, easier to implement metric still causes more tests to get written, seems premature. The metrics on the ultimate, perfect side of the scale (for example path coverage or dataflow coverage) are super-hard (like academic research is getting done on them), so coverage metrics is not a thing where we're going to be perfect -- we just need to be good enough that we write more tests.

-- Leah

samoconnor

Jan 3, 2015, 5:04:11 PM

to juli...@googlegroups.com

On Sunday, January 4, 2015 6:09:08 AM UTC+11, Leah Hanson wrote:

> I don't think optimizing the coverage statistics first is that helpful.
> Picking any reasonable metric and getting it to 80% seems like a good way to improve test coverage.

+1

The invention-and-deep-thought effort of the core devs should be spent on getting the semantics of the core language, and the API patterns in the libraries right, not advancing the state of the art of test metrics.

Tools and metrics are just a means to an end.

In 10 or 20 years time, there will be a huge installed Julia codebase out in the wild.

If there are bugs lingering in Julia due to poor early testing now, they can be fixed, and the installed base will benefit.

If the core language semantics have lingering problems, or if the core libraries have inconsistent API patterns, these problems can't be fixed without breaking everyone's code.

Stefan Karpinski

Jan 5, 2015, 5:31:16 PM

to Julia Dev

This is a good point – more tests are good, even without any coverage metric. To me the value of measuring basic block coverage is having more confidence of correctness. If you have a good metric and good coverage numbers with that metric it becomes much harder to be wrong.

Tim Holy

Jan 6, 2015, 10:54:39 AM

to juli...@googlegroups.com

OK, I believe I've figured out how to solve the main problems with our current
(line-based) strategy for counting coverage:
https://github.com/IainNZ/Coverage.jl/pull/36

The results are looking interesting: I chose two packages which will be called
"A" and "B" to protect the authors from embarrassment (however, at least one
of these is very highly regarded by the community). Here are results from
analyzing coverage fraction*:

Package   Old coverage (from README badge)   New coverage
-------   --------------------------------   ------------
A         94%                                17%
B         91%                                71%

Looking through the .cov files manually (and knowing the actual tests of both
packages), it's 100% obvious that the new coverage metric presents a vastly
more accurate picture of the actual test coverage. While I endorse Stefan's
interest in even better ways of measuring coverage, I think this experiment
proves that we had some even bigger fish to fry.

Once that PR reaches consensus and gets merged, package authors should be
prepared for a substantial decrease in their coverage fraction.

Best,
--Tim

*(1) Tests were run on julia 0.4 with inlining disabled, so the .cov files
should be accurate. (2) Tests were run on my local checkout of these packages,
which may or may not reflect what people receive from Pkg.add().
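
To illustrate why the two numbers can differ so much, here is a small Python sketch of the arithmetic. It is not the code from the pull request: it assumes per-line counts where `None` means "no data", plus a set of line numbers that the source says are real code. The function names and the ten-line file are hypothetical.

```python
def coverage_fraction(counts):
    """Fraction of counted lines that ran at least once; None entries are ignored."""
    counted = [c for c in counts if c is not None]
    return sum(c > 0 for c in counted) / len(counted)


def amend_with_source(counts, code_lines):
    """Mark lines that are code, but carry no count, as unexecuted (0)."""
    return [
        0 if c is None and lineno in code_lines else c
        for lineno, c in enumerate(counts, start=1)
    ]


# Hypothetical 10-line file: lines 1-4 belong to a tested function,
# line 5 is blank, and lines 6-10 belong to a function that no test
# calls, so it was never compiled and has no counts.
raw = [3, 3, 3, 0, None, None, None, None, None, None]
code_lines = {1, 2, 3, 4, 6, 7, 8, 9, 10}

old = coverage_fraction(raw)                                  # 3 of 4 counted lines
new = coverage_fraction(amend_with_source(raw, code_lines))   # 3 of 9 code lines
print(f"old: {old:.0%}, new: {new:.0%}")
```

In this made-up file the untested function is invisible to the old calculation, so the fraction falls from 75% to about 33% once its lines are counted as unexecuted.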

Stefan Karpinski

Jan 6, 2015, 2:09:50 PM

to Julia Dev

I'm really excited by those extremely low coverage numbers because we *know* that our current high coverage stats are bogus.

John Myles White

Jan 6, 2015, 2:11:48 PM

to juli...@googlegroups.com

This is great. I've learned to completely ignore the coverage numbers, so this makes me think we're heading in the right direction.

-- John

Jiahao Chen

Jan 7, 2015, 8:30:09 PM

to juli...@googlegroups.com

> I chose two packages which will be called "A" and "B" to protect the authors from embarrassment

For the record, I will gladly allow Tim Holy to embarrass me about anything I have ever written in Julia. It's the least I can do in exchange for the Shostakovich.

For the record, coverage of RandomMatrices.jl went from 93% to a much more believable 12%.

Tobi

Jan 8, 2015, 4:53:33 AM

to juli...@googlegroups.com

So the coverage results for that package were "random" before :-)

Just kidding. I think this is a very important improvement. Especially given the dynamic nature of Julia, it is quite important to test all code paths so that problems are not only discovered at runtime. Static languages can do quite a lot of checking during compilation, so this really is an important thing for Julia.

Cheers,

Tobi

Tim Holy

Jan 8, 2015, 5:28:27 AM

to juli...@googlegroups.com

There's probably no way to test "all" code paths: you might have a test for
every method, but then not realize that your function doesn't work if given an
array of matrices rather than an array of Float64s as input.
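
The same trap exists in any dynamically typed language. A minimal Python stand-in (the `total` function is a hypothetical example, not code from any Julia package): one test gives every line full coverage, and the function still fails on nested input.

```python
def total(values):
    """Sum a list of numbers."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


# This single test executes every line of total().
assert total([1.0, 2.0, 3.0]) == 6.0

# Nested input (a list of "matrices") takes exactly the same lines
# but fails, and no line-based report would have hinted at it.
try:
    total([[[1.0, 2.0]], [[3.0, 4.0]]])
    failed = False
except TypeError:
    failed = True
print("nested input failed:", failed)
```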

But as you say, hopefully we're at least removing a rather ridiculous source
of "dishonesty" (not intentional, just hard to overcome) from our coverage
reports.

--Tim

Tobi

Jan 8, 2015, 6:04:53 AM

to juli...@googlegroups.com

Sure, "all" is probably not really possible. But I really think that testing and coverage are important tools for us, and we have to be honest with ourselves that due to the dynamic nature of Julia it can sometimes be hard to detect bugs. Or they are discovered quite late because one just has not used a function with a specific type. This is IMHO an area where it can be good to restrict the input types of functions (no duck typing). This, together with interfaces/traits, is an area where I think the language will evolve during the ride to 1.0.