Loading documents which span multiple lines and loading XMI files

Yamen Ajjour

Jun 23, 2016, 10:33:44 AM
to dkpro-bigdata-users
Hello

I have a corpus in a specific format on HDFS which I want to process with a UIMA pipeline. A document in the corpus can span multiple lines. From previous issues on the mailing list I learned that I have to override DocumentTextExtractor, but that still doesn't give me the freedom to process multiple lines. After browsing the source code, I got the feeling that the only way to handle my files is to implement my own RecordReader. Is there an easier way to solve this problem? Another question is whether there is a way to read XMI files from HDFS.
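
For illustration, this is roughly the reader I have in mind. The class name and the blank-line delimiter are made up and just stand in for my actual format; since Hadoop 2.x, LineRecordReader accepts a custom record delimiter, so the whole thing can stay a thin wrapper:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MultiLineDocumentRecordReader extends RecordReader<LongWritable, Text> {

    // "\n\n" stands in for whatever actually separates two documents in my format
    private final LineRecordReader delegate = new LineRecordReader("\n\n".getBytes());

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // One key/value pair per document, however many lines it spans
        return delegate.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}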


Greetings from Bauhaus
Thanks

Richard Eckart de Castilho

Jun 23, 2016, 5:48:50 PM
to Yamen Ajjour, dkpro-bigdata-users
Hi,

I've just added an HDFS support module to DKPro Core proper, and there is a unit test illustrating its use:

https://github.com/dkpro/dkpro-core/blob/master/dkpro-core-fs-hdfs-asl/src/test/java/de/tudarmstadt/ukp/dkpro/core/fs/hdfs/HdfsResourceLoaderLocatorTest.java
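
In essence, you bind the locator to a reader as an external resource. A condensed sketch of what the test does follows; the filesystem URI and the input path are placeholders, and you should double-check the parameter names against the test itself. This should also cover your XMI question, since XmiReader is an ordinary DKPro Core reader:

import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
import static org.apache.uima.fit.factory.ExternalResourceFactory.createExternalResourceDescription;

import org.apache.uima.collection.CollectionReaderDescription;

import de.tudarmstadt.ukp.dkpro.core.fs.hdfs.HdfsResourceLoaderLocator;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiReader;

public class ReadXmiFromHdfs {
    public static void main(String[] args) throws Exception {
        CollectionReaderDescription reader = createReaderDescription(
                XmiReader.class,
                // Resolve "hdfs:" locations through the HDFS resource loader
                XmiReader.KEY_RESOURCE_RESOLVER,
                createExternalResourceDescription(
                        HdfsResourceLoaderLocator.class,
                        HdfsResourceLoaderLocator.PARAM_FILESYSTEM, "hdfs://localhost:8020"),
                XmiReader.PARAM_SOURCE_LOCATION, "hdfs:/corpus/**/*.xmi");
        // ... then run the pipeline, e.g. via SimplePipeline.runPipeline(reader, yourEngines)
    }
}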

Cheers,

-- Richard

Richard Eckart de Castilho

Jun 25, 2016, 7:47:09 PM
to Yamen Ajjour, dkpro-bigdata-users
On 25.06.2016, at 23:57, Yamen Ajjour <yamen...@gmail.com> wrote:
>
> I am using Eclipse. After adding the dependencies to the POM (like dkpro-bigdata and uimaFIT), I click on Export > Runnable JAR file, and then I run it via hadoop.

That does not work. Some frameworks, e.g. uimaFIT, require configuration files in well-known locations within the JAR file.

If you simply merge JAR files, these files overwrite each other, and the final merged JAR contains only part of the information.

You can safely merge JARs into a fat JAR using the Maven Shade Plugin. A configuration of that plugin that produces a JAR with the proper information for uimaFIT can be found here:

https://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.packaging
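
The key point is that the Shade Plugin must concatenate uimaFIT's auto-detection files instead of letting one JAR's copy overwrite the others. Condensed from the linked documentation (see there for the full, authoritative configuration):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Append uimaFIT's descriptor files from all JARs
               instead of keeping only one copy -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
            <resource>META-INF/org.apache.uima.fit/fsindexes.txt</resource>
          </transformer>
          <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
            <resource>META-INF/org.apache.uima.fit/types.txt</resource>
          </transformer>
          <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
            <resource>META-INF/org.apache.uima.fit/typepriorities.txt</resource>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>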

Cheers,

-- Richard

Richard Eckart de Castilho

Jun 26, 2016, 4:21:46 PM
to Yamen Ajjour, dkpro-bigdata-users
Hm, did you actually add the UKP SNAPSHOT repo to your POM?

<repositories>
  <repository>
    <id>ukp-oss-snapshots</id>
    <url>http://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-snapshots/</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

> On 26.06.2016, at 14:56, Yamen Ajjour <yamen...@gmail.com> wrote:
>
> I am getting very similar problems:
>
> [ERROR] Non-resolvable import POM: Could not find artifact org.dkpro.bigdata:dkpro-bigdata:pom:0.2.0-SNAPSHOT @ line 73, column 17 -> [Help 2]
> [ERROR] 'dependencies.dependency.version' for org.dkpro.bigdata:dkpro-bigdata-hadoop:jar is missing. @ line 83, column 15
>
> Thanks

Richard Eckart de Castilho

Jun 27, 2016, 5:48:15 AM
to Yamen Ajjour, dkpro-bigdata-users
On 27.06.2016, at 10:12, Yamen Ajjour <yamen...@gmail.com> wrote:
>
> Ah, yeah, sorry, the repository data was not available in the tutorial. This seems to solve the problem. The HDFS collection reader, however, doesn't seem to be able to access the files within the Hadoop job, even though I managed to do that in a regular application. Now I am getting a java.io.IOException: Cannot open stream for HDFS Resource for ....

Ok, I am not sure that is a problem we can help with, though.

Btw., please make sure to keep the list in CC when replying.

Best,

-- Richard