specs for data validation

Matthew Pocock

Jan 23, 2014, 1:38:17 PM
to specs...@googlegroups.com
Hi,

I have some complex data validation to do. Each data value tends to contain other nested data, and has a number of validation rules that need to be checked for it and for its children. A well-formed value passes all of these checks.

I got part way through writing a generic data validation DSL, and then started writing the specs tests for these, and in essence found that I was re-implementing specs.

So my question is: how would I use specs to validate these data structures? I need a function of type data -> (success | failures), and then I need to render the failures in a number of ways (text or JSON initially).

Thanks,
Matthew

--
Dr Matthew Pocock
Turing ate my hamster LTD

Integrative Bioinformatics Group, School of Computing Science, Newcastle University

skype: matthew.pocock
tel: (0191) 2566550

Bill Venners

Jan 23, 2014, 2:45:14 PM
to specs-users

Hi Matthew,

Am I correct in concluding that you want to do this validation in production code? If so, can you post a specific example showing the nesting?

Bill

Matthew Pocock

Jan 23, 2014, 3:07:42 PM
to specs...@googlegroups.com
On 23 January 2014 19:45, Bill Venners <bi...@artima.com> wrote:

Hi Matthew,

Am I correct in concluding that you want to do this validation in production code?

Yes, exactly. This would happen in a running application after data is loaded from file, for example, or as a document validation web service.

If so, can you post a specific example showing the nesting?

Sure. Forgive the ASCII - I'm just typing into the email here:

DnaSequence(id="my_sequence", nucleotides = "tacgatcgtagtcgtagtcgatcgta")
DnaSequence(id="sub_seq_2", nucleotides = "tag")

DnaComponent(id="my_component", sequenceRef="my_sequence", annotations=Seq(
  SequenceAnnotation(bioStart=3, bioEnd=6, subComponent="sub_component_1"),
  SequenceAnnotation(bioStart=9, bioEnd=11, subComponent="sub_component_2")
))

DnaComponent(id="sub_component_1")
DnaComponent(id="sub_component_2", sequenceRef="sub_seq_2")

So we have some constraints on SequenceAnnotation. bioStart <= bioEnd, for example.

Then we have some constraints between DnaComponents. If DnaComponent x contains a SequenceAnnotation y that refers to a sub-component z, and if x and z both have an associated DnaSequence (sx and sz respectively), then the substring of sx.nucleotides from bioStart to bioEnd must be the same as the whole of sz.nucleotides. Indices count from 1, not zero, and are inclusive at both ends, because we are biologists.

It is fairly easy to wrap up all the entities behind a little dictionary that dereferences the IDs so that it looks like you have a fully connected graph of all the objects. So in practice, each ref behaves like an Option[T].
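A minimal sketch of the two rules and the ID-dereferencing dictionary described above, in plain Scala. The case class fields follow the example data; `Registry`, `orderedBounds`, `bioSlice` and `subSequenceMatches` are hypothetical names chosen for illustration, and treating an out-of-range annotation as a failed match is an assumption rather than something stated in the thread.

```scala
object DnaModelSketch {
  // Hypothetical model; field names are taken from the example data above.
  final case class DnaSequence(id: String, nucleotides: String)
  final case class SequenceAnnotation(bioStart: Int, bioEnd: Int, subComponent: String)
  final case class DnaComponent(
    id: String,
    sequenceRef: Option[String] = None,
    annotations: Seq[SequenceAnnotation] = Seq.empty
  )

  // The "little dictionary": every reference is resolved by ID and behaves like an Option.
  final case class Registry(
    sequences: Map[String, DnaSequence],
    components: Map[String, DnaComponent]
  ) {
    def sequenceOf(c: DnaComponent): Option[DnaSequence] = c.sequenceRef.flatMap(sequences.get)
    def component(id: String): Option[DnaComponent] = components.get(id)
  }

  // Rule 1: bioStart <= bioEnd.
  def orderedBounds(sa: SequenceAnnotation): Boolean = sa.bioStart <= sa.bioEnd

  // 1-based slice, inclusive at both ends; None when the range falls outside the string.
  def bioSlice(s: String, bioStart: Int, bioEnd: Int): Option[String] =
    if (bioStart >= 1 && bioStart <= bioEnd && bioEnd <= s.length)
      Some(s.substring(bioStart - 1, bioEnd))
    else None

  // Rule 2: if both the parent and the sub-component have a sequence, the annotated
  // slice of the parent must equal the whole sub-component sequence.
  // The rule holds vacuously when either sequence cannot be resolved.
  def subSequenceMatches(reg: Registry, parent: DnaComponent, sa: SequenceAnnotation): Boolean = {
    val resolved = for {
      sx <- reg.sequenceOf(parent)
      z  <- reg.component(sa.subComponent)
      sz <- reg.sequenceOf(z)
    } yield (sx, sz)
    resolved.forall { case (sx, sz) =>
      bioSlice(sx.nucleotides, sa.bioStart, sa.bioEnd) == Some(sz.nucleotides)
    }
  }
}
```

With the example data, the annotation from 9 to 11 on my_sequence selects "tag", which equals the whole of sub_seq_2, so the second rule passes for sub_component_2; it passes vacuously for sub_component_1, which has no sequence.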

Does that help at all? These are just two of the validation rules. There are quite a few others, and we're in the process of building a much larger data model with many more validation rules.

Thanks,
Matthew

Bill Venners

Jan 23, 2014, 5:53:30 PM
to specs-users
Hi Matthew,

Sorry for the delay in responding. Today's a meetings day. I will think about this on the train. What will you want if validation fails? Error messages for each failure? One error message? An exception? What will you do, log the problem? Show a message to a user?

Bill
Bill Venners
Artima, Inc.
http://www.artima.com

Matthew Pocock

Jan 23, 2014, 6:01:31 PM
to specs...@googlegroups.com
Hi Bill,


On 23 January 2014 22:53, Bill Venners <bi...@artima.com> wrote:
Hi Matthew,

Sorry for the delay in responding. Today's a meetings day. I will think about this on the train.

No worries - we all have hectic lives
 
What will you want if validation fails? Error messages for each failure? One error message? An exception? What will you do, log the problem? Show a message to a user?

All the above, depending on the phase of the moon. Ideally I want back a data structure detailing the failures which I could then render as needed for that specific application. In future I may need to extend this to a lint-like tool that suggests fixes, but this isn't needed now.
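A rough sketch of what "a data structure detailing the failures" plus per-application rendering could look like. Everything here is hypothetical: the `Failure` fields, the `validate` signature and both renderers are illustrative assumptions, and the JSON escaping is deliberately minimal (a real JSON library would be the safer choice).

```scala
object FailureReportSketch {
  // Hypothetical failure record: which object failed, which rule, and some detail.
  final case class Failure(subjectId: String, rule: String, detail: String)

  // Run every rule (no short-circuiting) and return either all failures or the value.
  def validate[A](value: A, rules: Seq[A => Option[Failure]]): Either[Seq[Failure], A] = {
    val found = rules.flatMap(rule => rule(value).toList)
    if (found.isEmpty) Right(value) else Left(found)
  }

  // Plain-text rendering, one failure per line.
  def asText(failures: Seq[Failure]): String =
    failures.map(f => s"${f.subjectId}: ${f.rule} (${f.detail})").mkString("\n")

  // Minimal escaping: quotes, backslashes and newlines only.
  private def quote(s: String): String =
    s.toList.map {
      case '"'  => "\\\""
      case '\\' => "\\\\"
      case '\n' => "\\n"
      case c    => c.toString
    }.mkString("\"", "", "\"")

  // JSON rendering as an array of flat objects.
  def asJson(failures: Seq[Failure]): String =
    failures
      .map(f => s"""{"subject":${quote(f.subjectId)},"rule":${quote(f.rule)},"detail":${quote(f.detail)}}""")
      .mkString("[", ",", "]")
}
```

Keeping the failures as plain data means that logging, user messages and exceptions can all be derived from the same `Left` value.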

Matthew

Bill Venners

Jan 23, 2014, 6:15:07 PM
to specs-users
Hi Matthew,


What I'll think about on the train is if there is anything special needed to apply it to your given examples.

Bill

Matthew Pocock

Jan 23, 2014, 6:33:23 PM
to specs...@googlegroups.com
Thanks Bill. I'm sort of hoping for something like:

def wellFormedSequenceAnnotation(sa: SequenceAnnotation) = "A SequenceAnnotation" should {
  "Have a bioStart less than or equal to bioEnd" in {
    sa.bioStart <= sa.bioEnd
  }
}

Something like this that lets me describe each of the constraints, and then test for them for a specific data value. So, for a DnaComponent, part of the spec would be that all its SequenceAnnotation instances are well-formed:

def wellFormedDnaComponent(dc: DnaComponent) = "A DnaComponent" should {
  "Have well-formed annotations" in {
    all(dc.annotations) wellFormedSequenceAnnotation
  }

  "Conform to other constraints" in ...
}

Is some variant on this likely to be possible? I'm not wedded to using defs like this, it's just meant to illustrate the issue.

Matthew

Bill Venners

Jan 23, 2014, 6:35:54 PM
to specs-users

Hi Matthew,

No seats on the train, so I'm on my phone. Quick question: you want that kind of syntax in production code, not the tests, correct?

Bill

Matthew Pocock

Jan 23, 2014, 6:38:32 PM
to specs...@googlegroups.com
Yes, exactly. This is going into production code that will be run to validate data input and report violations of the data format, e.g. through a web page that validates user files and potentially highlights the offending bits, or as tool-tips in a CAD tool.

Standing on a train sucks.

Bill Venners

Jan 23, 2014, 6:43:57 PM
to specs-users
Hi Matthew,

In the meantime I got a seat. OK, this is interesting. So what I'm thinking is if this fails:


def wellFormedSequenceAnnotation(sa: SequenceAnnotation) = "A SequenceAnnotation" should {
  "Have a bioStart less than or equal to bioEnd" in {
    sa.bioStart <= sa.bioEnd
  }
}

You might get an error message of ... Well what error message would you want to see? Also, what error message would you want to see if this fails:


def wellFormedDnaComponent(dc: DnaComponent) = "A DnaComponent" should {
  "Have well-formed annotations" in {
    all(dc.annotations) wellFormedSequenceAnnotation
  }
}

Bill

Matthew Pocock

Jan 24, 2014, 5:27:41 AM
to specs...@googlegroups.com
Hi Bill,

Having thought about it further, what I really want back is the tree of tests performed, annotated with their pass/fail status. An internal node in this tree would have failed if any of its children failed. I could then process that tree, extract the human-readable text snippets, and render everything out as needed. It would be ideal if each test case captured the object that it had tested so that I could print out e.g. object IDs for the user to track or for a CAD tool to highlight.
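The tree described above could be sketched roughly as follows. This is a hypothetical structure, not the one specs2 uses internally: `Check`, `Leaf`, `Node`, `failures` and `render` are names invented for illustration.

```scala
object CheckTreeSketch {
  // A node in the tree of checks that were performed.
  sealed trait Check {
    def label: String
    def passed: Boolean
  }

  // One rule applied to one object; subjectId records which object was tested.
  final case class Leaf(label: String, subjectId: String, passed: Boolean) extends Check

  // An internal node: it fails if any of its children failed.
  final case class Node(label: String, children: Seq[Check]) extends Check {
    def passed: Boolean = children.forall(_.passed)
  }

  // Collect the failed leaves, e.g. so a tool can highlight the offending objects.
  def failures(c: Check): Seq[Leaf] = c match {
    case l: Leaf     => if (l.passed) Seq.empty else Seq(l)
    case Node(_, cs) => cs.flatMap(failures)
  }

  // Render the whole tree as indented text, marking each line "+" (pass) or "x" (fail).
  def render(c: Check, indent: Int = 0): String = {
    val mark = if (c.passed) "+" else "x"
    val pad = "  " * indent
    c match {
      case Leaf(label, id, _) => s"$pad$mark $label [$id]"
      case Node(label, cs)    => (s"$pad$mark $label" +: cs.map(render(_, indent + 1))).mkString("\n")
    }
  }
}
```

Because pass/fail status is derived from the children, a single failed leaf marks every ancestor as failed, while `failures` still reports only the leaf that actually broke a rule.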

When I run specs2 tests in IntelliJ IDEA (when I can get it to work), it prints out a tree of text. In SBT, it prints out an indented tree of ASCII art. So I'm guessing there's a data structure something like this under the hood.

Matthew

etorreborre

Jan 24, 2014, 6:45:40 AM
to specs...@googlegroups.com
Hi Matthew,

I've been following the discussion and still trying to think about a good solution.

specs2 displays a tree of results, but you have to declare the structure yourself in the specification. In your case you already have this structure: it is your data structure.

So I would probably use something like tree rewriting and / or remote attribute grammars in Kiama to replace the data structure with a recursive "validation" structure holding the results:

trait Result {
  val status: Status
  val elementId: Id
  val subResults: Seq[Result]
}

Maybe you could reuse some of the matching logic in specs2 or ScalaTest for the actual tests, but I think that it'd be better to use Kiama for the traversal and creation of your validation structure.

Eric.

Bill Venners

Jan 24, 2014, 9:44:21 AM
to specs-users
Hi Matthew and Eric,

I think I get what you're after, which is to have a validation API whereby you can put your error messages on the outside in a tree structure, and the checking logic on the inside, then evaluate the whole tree and out pops the accumulated errors. So that the whole thing reads like a specification in your production code.
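To make the "messages on the outside, checking logic on the inside" idea concrete, here is a small sketch in plain Scala. It does not use the specs2 or ScalaTest API; `Rule`, `rule`, `spec` and `forEach` are hypothetical helpers, and the two case classes are cut-down stand-ins for the real model.

```scala
object SpecStyleSketch {
  // A rule returns the messages of everything it found wrong; empty means "passed".
  type Rule[A] = A => Seq[String]

  // A single rule: message on the outside, checking logic on the inside.
  def rule[A](message: String)(check: A => Boolean): Rule[A] =
    value => if (check(value)) Seq.empty else Seq(message)

  // A named group of rules; all rules are evaluated and their errors accumulated.
  def spec[A](subject: String)(rules: Rule[A]*): Rule[A] =
    value => rules.flatMap(r => r(value)).map(msg => s"$subject: $msg")

  // Apply a child spec to every child of a value.
  def forEach[A, B](children: A => Seq[B])(childSpec: Rule[B]): Rule[A] =
    value => children(value).flatMap(childSpec)

  // Cut-down stand-ins for the model in this thread.
  final case class SequenceAnnotation(bioStart: Int, bioEnd: Int)
  final case class DnaComponent(id: String, annotations: Seq[SequenceAnnotation])

  val wellFormedSequenceAnnotation: Rule[SequenceAnnotation] =
    spec[SequenceAnnotation]("A SequenceAnnotation")(
      rule[SequenceAnnotation]("should have a bioStart less than or equal to bioEnd")(
        sa => sa.bioStart <= sa.bioEnd
      )
    )

  val wellFormedDnaComponent: Rule[DnaComponent] =
    spec[DnaComponent]("A DnaComponent")(
      forEach[DnaComponent, SequenceAnnotation](_.annotations)(wellFormedSequenceAnnotation)
    )
}
```

Evaluating `wellFormedDnaComponent` on a component returns one message per violated rule, each prefixed with the path of subjects it passed through, which is one possible shape for the tree of strings.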

But this would need to give good, user-friendly error messages. Can you post some examples of how you'd ideally like the error messages to look? What would make the most sense for users, and how might the tree of strings then look in your validation code?

Bill

etorreborre

Feb 11, 2014, 8:05:19 PM
to specs...@googlegroups.com
Hi Matthew,

Maybe you'll be interested in this.

Eric.