nodes and edges span several files


Dirk Roorda

Sep 13, 2013, 1:53:10 AM
to poio-d...@googlegroups.com
The resource I am working with has many nodes and edges, so I have spread them over multiple files.
Once a file grows beyond 200 MB, validation becomes a pain: even xmllint takes ages and then breaks.

But POIO complains with a KeyError if an edge points to a node in another file.

What should I do?

Related observation: the resource header lets you specify file dependencies. I have done so, but I point POIO at the primary data header, not at the resource header.
And in the primary data header you cannot specify file dependencies.

It would be more logical for POIO to start from the resource header, but for some reason that did not work for me. Maybe there is an error in my resource header file.

So what does POIO expect: the resource header or the primary data header?

Dirk Roorda

Sep 13, 2013, 2:18:39 AM
to poio-d...@googlegroups.com
Found a solution: add dependency information in the annotation files themselves:
<dependencies>
<dependsOn f.id="f_lingo.s"/>
</dependencies>

etcetera.
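
For reference, a minimal Python sketch of how such locally declared dependencies could be read back out of an annotation file. It uses only the standard library. The function name `declared_dependencies` and the enclosing `<header>` element in the sample are made up for illustration and are not taken from POIO or graf-python; matching on the local tag name is an assumption so that the sketch works whether or not the file declares a namespace.

```python
import xml.etree.ElementTree as ET


def declared_dependencies(xml_text):
    """Return the f.id values of all <dependsOn> elements in a document.

    Hypothetical helper, not part of POIO or graf-python.
    """
    root = ET.fromstring(xml_text)
    ids = []
    for element in root.iter():
        # Strip a possible "{namespace}" prefix from the tag name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "dependsOn" and "f.id" in element.attrib:
            ids.append(element.attrib["f.id"])
    return ids


# The <header> wrapper is a placeholder; only the <dependencies> part
# comes from the message above.
sample = """
<header>
  <dependencies>
    <dependsOn f.id="f_lingo.s"/>
  </dependencies>
</header>
"""
print(declared_dependencies(sample))
```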

Dirk Roorda

Sep 13, 2013, 3:22:43 AM
to poio-d...@googlegroups.com
Disadvantage of declaring dependencies locally: if A depends on C and B depends on C, then C will be parsed twice. With more files and more dependencies, this quickly becomes inefficient.
However, if you feed POIO the files in the right order: C, A, B, then C gets parsed only once and POIO is still happy.

So I stick to declaring the dependencies only globally, in the resource header file, and I will not declare them again locally in the individual annotation files.
I try to avoid cyclic dependencies.
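
A "right order" like C, A, B can be computed rather than maintained by hand. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the file names are the A, B, C of the example above, and the `depends_on` mapping is an assumed in-memory form of the dependency declarations, not something POIO provides.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each file to the files it depends on (illustrative names:
# A and B both depend on C).
depends_on = {
    "A": {"C"},
    "B": {"C"},
    "C": set(),
}

# static_order() yields every file after all of its dependencies and
# raises graphlib.CycleError if the declarations are cyclic.
load_order = list(TopologicalSorter(depends_on).static_order())
print(load_order)
```

C always comes first here; the relative order of A and B does not matter, since neither depends on the other.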

Dirk Roorda

Sep 13, 2013, 4:35:39 AM
to poio-d...@googlegroups.com
Now that I think of it: local dependencies make sense, logically, in case you want to process only a few annotation files and not all of them.
So I really should add all dependencies to all annotation files.
But then POIO should keep track of previously imported files and not try to load files multiple times.
What do you think of such an approach? 
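
A minimal sketch of that bookkeeping in Python, assuming the dependencies are available as a mapping from file id to the ids it depends on. The names `load_with_dependencies`, `depends_on` and `parse` are placeholders for illustration, not POIO or graf-python APIs.

```python
def load_with_dependencies(file_id, depends_on, parse, loaded=None):
    """Parse file_id after its dependencies, each file at most once.

    depends_on maps a file id to the ids it depends on; parse is a
    callback that does the actual parsing. Both are placeholders for
    this sketch, not POIO or graf-python APIs.
    """
    if loaded is None:
        loaded = set()
    if file_id in loaded:
        return loaded  # already parsed (or in progress): skip
    # Mark before recursing so a cyclic dependency cannot loop forever.
    loaded.add(file_id)
    for dependency in depends_on.get(file_id, ()):
        load_with_dependencies(dependency, depends_on, parse, loaded)
    parse(file_id)
    return loaded


depends_on = {"A": ["C"], "B": ["C"], "C": []}
parsed = []
seen = set()
for name in ("A", "B"):
    load_with_dependencies(name, depends_on, parsed.append, seen)
print(parsed)  # ['C', 'A', 'B']
```

Sharing one `seen` set across calls is what keeps C from being parsed a second time when B is loaded. With a cyclic declaration the function still terminates, but one file in the cycle is necessarily parsed before one of its dependencies.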

pbouda

Sep 13, 2013, 5:34:14 AM
to poio-d...@googlegroups.com
Yes, that should be the correct behaviour. Can you add an issue on GitHub for this? If you have some sample files to demonstrate the problem, could you also add those? Thanks!

graf-python currently supports parsing individual annotation files (plus their dependencies) and the primary data header (which will parse all annotation files mentioned in it). Resource headers are not supported yet; there is an issue for that on GitHub.

Peter

Dirk Roorda

Sep 13, 2013, 7:05:35 AM
to poio-d...@googlegroups.com
I have files with a dependency pattern like this:

LA1 < L < M < R
LA2 < L < M < R

MA1 < M < R
MA2 < M < R

L is a file with nodes corresponding to linguistic objects
M is a file with nodes corresponding to word objects (50 MB)
R is a file with regions (40 MB)

LA1, LA2, etc. are linguistic annotations and edges (no nodes; 6 files, up to 112 MB)
MA1, MA2, etc. are word annotations (no nodes; 11 files, up to 180 MB)

Shall I construct a minimal example?
If you import a file only once, cyclic dependencies are no longer problematic.