Kritis taking care of multiple namespaces


Eduardo Munari

Nov 3, 2020, 3:36:10 PM
to Kritis users

Hey guys, I’m working on a setup and running into a rather odd issue.

For context, I’m deploying Kritis on my K8s cluster in the 'default' namespace.
I’m also deploying two GAPs/AttestationAuthorities, one for the 'default' namespace and one for the 'test' namespace.

Whenever I deploy a pod in the 'default' namespace (the same namespace Kritis is deployed in), everything works normally and Kritis is able to validate and allow/block all images/pods.
But whenever I try to deploy a pod in the 'test' namespace, the kritis-validation-hook pod logs get stuck and I get the following error in my terminal:


Error from server (InternalError): error when creating "pod.yaml": Internal error occurred: failed calling webhook "kritis-validation-hook.grafeas.io": Post https://kritis-validation-hook.default.svc:443/?timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

To make things weirder, if I delete the Occurrence related to the image in the 'test' namespace, the logs on the kritis-validation-hook pod get unstuck and print a message saying that the image is attested and is allowed to be deployed (although the image is attested, it doesn’t get deployed when I hit the timeout error).

Do you guys have any idea what could be causing this?
And btw, am I doing things right? I mean, is it OK to deploy Kritis in one namespace and deploy GAPs and AttestationAuthorities to check all other namespaces in the cluster?
Btw, I’m using the same secret (deployed in the 'default' namespace) for all Attestations.

Thanks!

Qifan Pu

Nov 3, 2020, 5:52:40 PM
to Eduardo Munari, Kritis users
Hi Eduardo,

Yeah, the policies are scoped to namespaces, so I don't think this setup is a problem in itself.
What you observed seems bizarre. Does it work when you only deploy to the "test" namespace?

Best,
Qifan


Eduardo Munari

Nov 4, 2020, 7:08:36 AM
to Qifan Pu, Kritis users
Hello Qifan! First of all, thank you for your answer, really appreciate it!

Yes, it works just fine when I deploy only to the 'test' namespace. Pretty weird, right?

--
Eduardo Munari

Bachelor of Computer Science
UNESP - IBILCE - São José do Rio Preto

Balázs Gyurák

Nov 4, 2020, 5:38:17 PM
to Kritis users
Hi Eduardo, are you able to post the Kritis logs here? I assume the error message above was the kubectl output, but I think the root cause will be in the Kritis logs. The fact that removing the Occurrence solves the problem makes me think the Occurrence itself might contain incorrect metadata that Kritis cannot handle, causing it to crash.

Eduardo Munari

Nov 5, 2020, 9:02:38 AM
to Balázs Gyurák, Kritis users
Hey Balázs!

This message is gonna be a bit long, sorry!

Yes, that's the kubectl output. My Kritis log gets stuck on the message "Validating against GenericAttestationPolicy my-gap", which comes from the review.go file, line 70 (in my current version).
After some digging, I think I found the problem, but I'm still struggling to fix it.

The thing is, I'm using Kritis according to the standalone docs.
I'm following the tutorial with some changes. One of them is that I'm using Grafeas with pgsql, and it isn't deployed on my K8s cluster; it runs on another machine (but I'm generating/using all the certs).
My problem happens whenever I use the create_attestation.go file to generate the note and occurrence that attest the image I'm deploying.

When I first run create_attestation, it creates the note and occurrence, and when I run a cURL against my Grafeas server to retrieve the notes, nextPageToken comes back empty.
(it creates note 'att' and occurrence inside the 'kritis' project)
Captura de Tela 2020-11-05 às 09.27.30.png

But when I run create_attestation again to create a new note/occurrence in a new project, the cURL on the notes list comes back with a nextPageToken that changes every time I run it.
(it creates note 'att-new' and occurrence inside the 'newkritis' project)

1st cURL
Captura de Tela 2020-11-05 às 09.28.34.png

2nd cURL
Captura de Tela 2020-11-05 às 09.28.54.png

So, when I try to deploy a pod in the 'newkritis' namespace, it gets stuck in the fetchAttestationOccurrence method inside the grafeas.go file.
Captura de Tela 2020-11-05 às 10.03.39.png
There is a for statement there that calls the ListNoteOccurrences method and, in this particular case, it never enters the if statement that breaks the loop: every time ListNoteOccurrences is called, it returns a new nextPageToken that is never empty, and because I have an occurrence, len(occs) is never 0 and keeps growing.

(I built a new kritis-server image and added lots of logs on this for statement and I believe it falls in an infinite loop because of the nextPageToken problem.)
Captura de Tela 2020-11-05 às 09.30.00.png
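
For readers without access to the screenshots, here is a minimal sketch of the loop as described above. It is a reconstruction from this description, not the actual Kritis source: the `page` type, `flakyServer`, and `fetchAll` are hypothetical stand-ins, and the `maxCalls` cap exists only so the demo stops.

```go
package main

import "fmt"

// page is a hypothetical stand-in for a ListNoteOccurrences response.
type page struct {
	Occurrences   []string
	NextPageToken string
}

// flakyServer imitates the behaviour described in the thread: it alternates
// between a full page and an empty page, and always returns a fresh,
// non-empty token.
type flakyServer struct{ calls int }

func (s *flakyServer) list(pageToken string) page {
	s.calls++
	p := page{NextPageToken: fmt.Sprintf("token-%d", s.calls)}
	if s.calls%2 == 1 {
		p.Occurrences = []string{"occurrence-1"}
	}
	return p
}

// fetchAll mirrors the loop as described: accumulate occurrences and break
// only when nothing has been accumulated or the token is empty.
// maxCalls is NOT part of the described loop; it only makes the demo finite.
func fetchAll(list func(string) page, maxCalls int) (occs []string, terminated bool) {
	token := ""
	for i := 0; i < maxCalls; i++ {
		resp := list(token)
		occs = append(occs, resp.Occurrences...)
		token = resp.NextPageToken
		if len(occs) == 0 || token == "" {
			return occs, true
		}
	}
	return occs, false
}

func main() {
	s := &flakyServer{}
	occs, terminated := fetchAll(s.list, 10)
	fmt.Println("with an occurrence: terminated =", terminated, "calls =", s.calls, "accumulated =", len(occs))

	// With no occurrences at all, len(occs) stays 0 and the loop exits at once.
	empty := func(string) page { return page{NextPageToken: "still-a-token"} }
	occs, terminated = fetchAll(empty, 10)
	fmt.Println("without occurrences: terminated =", terminated, "accumulated =", len(occs))
}
```

In this sketch the first case uses up all 10 calls without ever hitting the break, while the second exits on the first call.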

That's why it gets unstuck when I delete the occurrence and the note: len(occs) is then 0 and it finally enters the if statement.
Captura de Tela 2020-11-05 às 09.32.31.png
The log above is what happens when I delete the note and the occurrence (the pod is not deployed even though the logs say it's attested).

When I try to deploy a pod that depends on the 'att' note ('default' namespace), in the project 'kritis', it works fine.
Captura de Tela 2020-11-05 às 09.29.34.png

PS: If I run create_attestation.go to create the notes/occurrences for the 'newkritis' namespace first, then the problem occurs with the notes/occurrences related to the 'default' namespace instead.

Do you guys think it's a problem in create_attestation.go (and perhaps some/all of its dependencies)? Or a Grafeas problem? Or a pgsql problem?
I generated a new grafeas-server image just yesterday from the grafeas repo, but I'm still getting the error.

And last but not least, thanks again for the help!


Eduardo Munari

Nov 5, 2020, 2:54:33 PM
to Kritis users
Hey guys, btw, I made a fix that apparently solved my problem.
Let me show you what I did, and you can tell me whether you think it's OK or whether it's likely to cause bad side effects.

After taking a look at the logs I added to the kritis-server image, I figured out something:
Captura de Tela 2020-11-05 às 09.30.00.png

All of this happens in the grafeas.go file, in the fetchAttestationOccurrence method.
I found empty resp.Occurrences values there (resp being the response from the ListNoteOccurrences method).

I saw that, after a call to ListNoteOccurrences that returns a full response (occurrences and next page token), the next call will have only nextPageToken on resp, but the occurrences are empty.
If the occurrences are empty the first time the method is called, execution enters the if statement that runs the break at line 267.

After this "empty response", the next call goes back to the starting point (the same occurrences list as before), and it stays in an infinite loop like:
Full Occurrences List (Occurrences and NextPageToken) -> Empty Occurrences List (NextPageToken only) -> Full Occurrences List (Occurrences and NextPageToken) -> Empty Occurrences List (NextPageToken only) -> Full Occurrences List (Occurrences and NextPageToken) -> Empty Occurrences List (NextPageToken only) -> and so on

The original method looks like this:
Captura de Tela 2020-11-05 às 16.41.30.png

So I added an if statement at line 262 that checks resp.Occurrences, and removed the 'len(occs)' check from the if statement at line 267.
PS: if the first call returns an empty occurrences list, it previously entered the if at line 267; now it enters the if at line 262.
Captura de Tela 2020-11-05 às 16.44.27.png
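
Since the screenshots are not reproduced here, the change described above can be sketched roughly as follows. This is an approximation based on the description only, not the actual patch: the `page` type, the two fake listers, and `fetchAllPatched` are hypothetical names.

```go
package main

import "fmt"

// page is a hypothetical stand-in for a ListNoteOccurrences response.
type page struct {
	Occurrences   []string
	NextPageToken string
}

// alternating imitates the misbehaving server: full page, empty page,
// full page, ... always with a non-empty token.
func alternating() func(string) page {
	calls := 0
	return func(string) page {
		calls++
		p := page{NextPageToken: fmt.Sprintf("token-%d", calls)}
		if calls%2 == 1 {
			p.Occurrences = []string{"occurrence-1"}
		}
		return p
	}
}

// wellBehaved serves two pages and ends with an empty token.
func wellBehaved() func(string) page {
	return func(token string) page {
		if token == "" {
			return page{Occurrences: []string{"a", "b"}, NextPageToken: "p2"}
		}
		return page{Occurrences: []string{"c"}}
	}
}

// fetchAllPatched stops as soon as a page carries no occurrences (the added
// check) or the token is empty (the remaining part of the original check).
func fetchAllPatched(list func(string) page) []string {
	var occs []string
	token := ""
	for {
		resp := list(token)
		if len(resp.Occurrences) == 0 {
			break // added check: an empty page ends the loop
		}
		occs = append(occs, resp.Occurrences...)
		token = resp.NextPageToken
		if token == "" {
			break // normal end of pagination
		}
	}
	return occs
}

func main() {
	fmt.Println("misbehaving server:", fetchAllPatched(alternating()))
	fmt.Println("well-behaved server:", fetchAllPatched(wellBehaved()))
}
```

One side effect worth testing, assuming a server may legitimately return an empty page before the last one: this loop would stop early and miss the later occurrences.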

That's a workaround I found for the nextPageToken problem.
I ran the unit tests for this file and they all still pass.
I will try more scenarios (changing PageSize, etc.) to see whether I hit any bad side effects.

If you guys have any suggestions, I'd be happy to hear!
Thanks!

Balázs Gyurák

Nov 5, 2020, 3:45:32 PM
to Kritis users
Hi Eduardo,

I've never used the reference implementation of Grafeas (we've implemented our own), but I do recall discovering some potential issues with the ListNoteOccurrences method. In your particular case, based on your logs and analysis, it seems pretty likely to me that the issue is on the Grafeas side - it looks like it's not returning an empty page token when it should. We use Kritis as is in our solution, and it works correctly.

I recommend taking a look at the logic that generates the page token; in your case I suspect the execution never goes into this if block:

You could verify this without Kritis by calling Grafeas' ListNoteOccurrences method with the correct request object (same one as Kritis produces on your screenshot, with the filters). If you do discover a bug, I'm sure the maintainers will happily take a PR.
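
One way to run such a check outside Kritis is a small probe that follows nextPageToken over the same HTTP endpoint used for the cURL calls. The sketch below rests on assumptions: the `pageToken` query parameter name and the `occurrences`/`nextPageToken` JSON field names are assumed, `walkPages` and `probe` are hypothetical helpers, and the two handlers are fakes so the example runs without a Grafeas server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// listPage keeps only the two response fields that matter for paging.
type listPage struct {
	Occurrences   []json.RawMessage `json:"occurrences"`
	NextPageToken string            `json:"nextPageToken"`
}

// walkPages follows nextPageToken starting at listURL. It stops when the
// token comes back empty (second result true) or after maxPages requests.
// The returned slice holds the number of occurrences seen on each page.
func walkPages(client *http.Client, listURL string, maxPages int) ([]int, bool, error) {
	var sizes []int
	token := ""
	for i := 0; i < maxPages; i++ {
		u, err := url.Parse(listURL)
		if err != nil {
			return sizes, false, err
		}
		q := u.Query() // keeps any filter already present in listURL
		if token != "" {
			q.Set("pageToken", token) // assumed query parameter name
		}
		u.RawQuery = q.Encode()

		resp, err := client.Get(u.String())
		if err != nil {
			return sizes, false, err
		}
		var p listPage
		decodeErr := json.NewDecoder(resp.Body).Decode(&p)
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return sizes, false, fmt.Errorf("unexpected status %s", resp.Status)
		}
		if decodeErr != nil {
			return sizes, false, decodeErr
		}
		sizes = append(sizes, len(p.Occurrences))
		if p.NextPageToken == "" {
			return sizes, true, nil
		}
		token = p.NextPageToken
	}
	return sizes, false, nil
}

// healthyHandler (fake) serves two pages and then an empty token.
func healthyHandler(w http.ResponseWriter, r *http.Request) {
	if r.URL.Query().Get("pageToken") == "" {
		fmt.Fprint(w, `{"occurrences":[{}],"nextPageToken":"p2"}`)
		return
	}
	fmt.Fprint(w, `{"occurrences":[{}]}`)
}

// buggyHandler (fake) never returns an empty token and alternates between
// a full page and an empty page.
func buggyHandler(w http.ResponseWriter, r *http.Request) {
	if r.URL.Query().Get("pageToken") == "a" {
		fmt.Fprint(w, `{"occurrences":[],"nextPageToken":"b"}`)
		return
	}
	fmt.Fprint(w, `{"occurrences":[{}],"nextPageToken":"a"}`)
}

// probe starts a throwaway local server for h and walks its pages.
func probe(h http.HandlerFunc, maxPages int) ([]int, bool, error) {
	srv := httptest.NewServer(h)
	defer srv.Close()
	return walkPages(srv.Client(), srv.URL+"/occurrences", maxPages)
}

func main() {
	sizes, ended, err := probe(healthyHandler, 6)
	fmt.Println("healthy:", sizes, "ended:", ended, "err:", err)
	sizes, ended, err = probe(buggyHandler, 6)
	fmt.Println("buggy:  ", sizes, "ended:", ended, "err:", err)
}
```

Against a real server you would call `walkPages` directly with an `http.Client` configured with your certificates and the same list URL (including the filter) that you already query with cURL. A listing that never reports `ended` within a generous page cap points at the server side.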

Good luck!

Balazs

Balázs Gyurák

Nov 5, 2020, 3:55:16 PM
to Kritis users
Sorry, I sent my reply before I saw your latest message. I'm afraid I don't fully understand your explanation though. However, one of your sentences caught my attention:

> the next call will have only nextPageToken on resp, but the occurrences are empty.

This doesn't seem correct to me. As in, if you receive a response from Grafeas that has a nextPageToken but no occurrences, I think that's not correct. You know it's the last page when nextPageToken is empty, and if nextPageToken has a value, the size of the response list should equal the pageSize. If you see anything else from Grafeas, then I do believe it's a bug on the Grafeas side.
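
The rule above can be written down as a small per-page check. This is only a sketch of the expectation as stated in this message: `checkPage` is a hypothetical helper and the page size of 100 is an arbitrary example value. Some paginated APIs do allow short pages before the end, so a violation is a hint to investigate rather than proof of a bug.

```go
package main

import (
	"errors"
	"fmt"
)

// checkPage reports how a single list response deviates from the rule stated
// above, or returns nil if the response is consistent with it.
func checkPage(numItems, pageSize int, nextPageToken string) error {
	if nextPageToken == "" {
		return nil // empty token marks the last page; nothing more to check
	}
	if numItems == 0 {
		return errors.New("non-empty nextPageToken on a page with no items")
	}
	if numItems != pageSize {
		return fmt.Errorf("non-empty nextPageToken on a short page (%d of %d items)", numItems, pageSize)
	}
	return nil
}

func main() {
	fmt.Println("full page with token: ", checkPage(100, 100, "abc"))
	fmt.Println("empty page with token:", checkPage(0, 100, "abc"))
	fmt.Println("short last page:      ", checkPage(3, 100, ""))
}
```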

Your fix might solve your immediate issue, but if Grafeas handled pagination correctly, you wouldn't need to change Kritis.

Thanks,
Balazs

Eduardo Munari

Nov 5, 2020, 4:07:28 PM
to Kritis users
Hi Balázs! 

About what you said here:

> if you receive a response from Grafeas that has a nextPageToken but no occurrences, I think that's not correct
and here:

> if Grafeas handled pagination correctly, you wouldn't need to change Kritis.

I totally agree with you! And yeah, my explanation seems pretty confusing, but that's exactly what is happening: I'm getting a nextPageToken even when there's no more data!

Btw:

> Your fix might solve your immediate issue

I'm doing some tests and apparently it did, but I'll surely take a look at the code you sent me, thank you for that!!
If I figure out what's causing the real problem (maybe I'm doing something wrong on my Grafeas implementation), I'll let you guys know and, if it's really a bug, I'll try to fix it and send a PR.

Thanks,

