*Sorry for the previous, premature reply
The biggest problem was the availability of parseable journal articles. The best you can do is to parse all the journals that are open and provide an HTML representation. We restricted ourselves to HTML because text extracted from PDFs often comes out with the words permuted, and there is no reliable way of re-ordering it automatically. The second problem has more to do with copyright and the intellectual property of the publishers. For us this meant that we couldn't index most of the journals from the Nature and Cell publishing groups, which were off limits to us.
I won't elaborate much on the other problems, for this is not the point of your question.
What we found useful before abandoning this project was projecting the terms encountered in the articles, and the relations between them, onto the GO term and BioPAX level 3 ontologies. This let us extract the meaningful elements of an article and bind them more easily to other database-like data.
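To give an idea of what such a projection can look like, here is a minimal sketch. It is not the code we used: the lexicon is a tiny hand-written one, `LEXICON` and `project_terms` are hypothetical names, and a real pipeline would load synonyms from the ontology files and use a proper tagger instead of a regex.

```python
import re

# Toy lexicon, written by hand for illustration only:
# lower-cased surface form -> GO identifier.
LEXICON = {
    "apoptosis": "GO:0006915",
    "apoptotic process": "GO:0006915",
    "dna repair": "GO:0006281",
}


def project_terms(text, lexicon=LEXICON):
    """Return a sorted list of (surface form, ontology id) pairs found in text."""
    lowered = text.lower()
    hits = []
    for surface, term_id in lexicon.items():
        # Whole-word match, so a short term does not fire inside a longer word.
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            hits.append((surface, term_id))
    return sorted(hits)


if __name__ == "__main__":
    sentence = "DNA repair failure can trigger apoptosis."
    for surface, term_id in project_terms(sentence):
        print(surface, "->", term_id)
```

Once every article is reduced to a set of ontology identifiers, linking it to other databases becomes a join on those identifiers rather than a text-matching problem.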
A year ago when we were working on it, bio4j didn't yet support any of those (has it changed?). Thus we had to build our own neo4j-based biological database. You might want to implement those node types and associated functions in bio4j before you start using it as a backbone for your article aggregation service.
Another problem you might want to consider is that the underlying technology, neo4j, has limits on the number of nodes and relations, which might make it not very well-suited for data-intensive applications. You can find more about it here:
http://www.slideshare.net/andreikucharavy/graph-databases-in-biology-case-of . Because of this, I've chosen to use a language-specific Gremlin compiler in my projects (bulbs in the case of Python) to make the transition from neo4j to Titan or another graph database easier and less painful.
You might consider doing the same thing. One of the long-term advantages is that neo4j implements write locks, whereas in Titan the IO is fully parallel and can support thousands of simultaneous writes. With text processing, and even while using only open-access articles, you will most likely need several parsing servers just to keep up with everything that gets published, so parallel IO might not be a luxury.
Good luck with your project!
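The idea of keeping the backend swappable can be sketched roughly as below. This does not use the bulbs or Gremlin APIs; `GraphStore`, `InMemoryStore` and `index_article` are hypothetical names, and the in-memory class only stands in for a real neo4j or Titan adapter.

```python
from abc import ABC, abstractmethod


class GraphStore(ABC):
    """Minimal interface the parsing code talks to (hypothetical)."""

    @abstractmethod
    def add_node(self, node_id, **props):
        ...

    @abstractmethod
    def add_edge(self, src, label, dst):
        ...

    @abstractmethod
    def neighbours(self, node_id, label):
        ...


class InMemoryStore(GraphStore):
    """Stand-in backend; a neo4j or Titan adapter would implement the same methods."""

    def __init__(self):
        self.nodes = {}
        self.edges = set()

    def add_node(self, node_id, **props):
        # Create the node if needed, then merge in the given properties.
        self.nodes.setdefault(node_id, {}).update(props)

    def add_edge(self, src, label, dst):
        self.edges.add((src, label, dst))

    def neighbours(self, node_id, label):
        return sorted(d for s, l, d in self.edges if s == node_id and l == label)


def index_article(store, article_id, term_ids):
    """Link an article node to the ontology terms it mentions."""
    store.add_node(article_id, kind="article")
    for term_id in term_ids:
        store.add_node(term_id, kind="term")
        store.add_edge(article_id, "MENTIONS", term_id)


if __name__ == "__main__":
    store = InMemoryStore()
    index_article(store, "article:1", ["GO:0006915", "GO:0006281"])
    print(store.neighbours("article:1", "MENTIONS"))
```

Since `index_article` only depends on the interface, moving to another database means writing one new adapter class rather than touching the parsing code.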
Andrei