Packaging OpenRefine for Debian

Antonin Delpeuch (lists)

Feb 10, 2021, 3:05:13 AM
to openref...@googlegroups.com, Markus Koschany
Hi all,

Please welcome Markus Koschany (Cc:), who has kindly agreed to work on
improving our Linux packaging. As a Debian developer he will
specifically work on packaging OpenRefine for inclusion in the official
Debian repository, which should eventually make the tool available in
many Debian derivatives. Markus will be carrying out this work as a
Freexian contractor
(https://www.freexian.com/en/services/debian-packaging.html), funded by
our CZI-EOSS grant.

Beyond greatly simplifying the installation process for users, the hope
is that this project will also help us adopt better practices as
maintainers, for instance migrating away from non-free or obsolete
dependencies (such as the org.json migration done a few years ago) or
fixing security vulnerabilities more swiftly.

Indeed, Debian has pretty strict requirements concerning the packaging
of dependencies, the handling of non-free code and other packaging
topics. This should hopefully be a useful follow-up to our previous call
for projects on tackling technical debt.

One major question is which branch he should start working from. We
could aim to package the master branch (so, the 3.x series) or directly
the new-architecture branch (which should become the 4.x series).

I am personally edging towards 4.x for the following reasons:
- Debian's release cycles are not so quick, so if we work on 3.x the
risk is that when it reaches users in Debian stable, it is already
outdated and superseded by stable 4.x releases
- the current version of Jetty used in 3.x (6.1.26) is very old and not
available in Debian anymore. In the new architecture I have had to
migrate to Jetty 9 already (to avoid dependency conflicts with Spark),
so that is one less obstacle. We could try to backport this migration
to 3.x, but it could create conflicts with extensions (and generally
speaking it is not clear to me whether it is safe to do so now, as
we are thinking of cutting a 3.5 release).
- more generally, since the new architecture has not been released yet,
we can afford to make pretty arbitrary breaking changes there to comply
with Debian's guidelines. For instance, say we discover a non-free
dependency somewhere (similar to org.json), we can afford to get rid of
it quickly.
- the new architecture uses Maven modules more, which should make it
easier to maintain extensions outside our code base. This is relevant
for packaging, since we should make it possible for extensions to be
packaged independently too.

That being said, since the new architecture is still in its infancy, I
can totally understand if people prefer to ship the 3.x series instead.

Let me know what you think!

Antonin

Thad Guidry

Feb 10, 2021, 8:11:38 AM
to openref...@googlegroups.com, Markus Koschany
I think, for all the reasons you list, it makes sense not to rush a Debian package with 3.x but to have it alongside 4.x.

As far as licensing concerns...which might fall outside of the requirements for our Debian packaging, but still...

I would REALLY like to see SPDX (Software Package Data Exchange) from the Linux Foundation used on our project, and later even get a license check integrated into our GitHub Actions (I saw a few already available).
And here are some tools: https://spdx.dev/spdx-tools/
Specifically the SPDX Maven plugin: https://github.com/spdx/spdx-maven-plugin#usage

Markus - Welcome! Do you have familiarity with SPDX and have some thoughts on our adoption?



--
You received this message because you are subscribed to the Google Groups "OpenRefine Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine-dev/79ab21cd-250d-bf65-90c2-6dd560c6931d%40antonin.delpeuch.eu.

Tom Morris

Feb 10, 2021, 2:37:28 PM
to openref...@googlegroups.com, Markus Koschany
Welcome, Markus! Thanks for volunteering.

I think we should focus on packaging the current stable version, i.e. 3.4.1 / 3.5. We haven't even agreed on evaluation criteria for the prototype, let alone begun to measure it against those criteria. As soon as the packaging effort is complete, the work will be available in Debian Unstable.

I'm a little confused by the discussion of non-free dependencies. What dependencies do we think are in conflict with our license? We need to address those independent of Debian packaging.

re Jetty - Antonin, can you provide the hash of the commit with the Jetty 9 upgrade in it? That would help evaluate the complexity and potential impact on extensions.

re SPDX, if Debian doesn't require it, it should be a separate discussion. Let's stay focused.

Tom



Antonin Delpeuch (lists)

Feb 10, 2021, 4:24:39 PM
to openref...@googlegroups.com, Markus Koschany
On 10/02/2021 20:37, Tom Morris wrote:
>
> I'm a little confused by the discussion of non-free dependencies. What
> dependencies do we think are in conflict with our license? We need to
> address those independent of Debian packaging.

I don't think we are aware of any such non-free dependency at the
moment; it was just an example. We can always have bad surprises with
dependencies for one reason or another, and migrating away from them can
have all sorts of costs… I definitely hope we don't find another
org.json! Markus has already done a preliminary investigation and has
not found any huge red flags so far. Fingers crossed!

>
> re Jetty - Antonin, can you provide the hash of the commit with the
> Jetty 9 upgrade in it? That would help evaluate the complexity and
> potential impact on extensions.

The update is actually in Butterfly itself:
https://github.com/OpenRefine/simile-butterfly/commit/910ff67fc681ab91c2e8fd158e2a8e61ad75bb22
But I suspect extensions can rely on Jetty-specific things on their side
(I haven't investigated thoroughly).

Antonin

Markus Koschany

Feb 11, 2021, 7:13:19 PM
to openref...@googlegroups.com
Hi folks,

thanks for the warm welcome. I'm subscribed to the list now, so you don't need
to CC me anymore.

1. Packaging version 3 or 4 of OpenRefine

Debian's soft freeze for the upcoming Debian 11 "Bullseye" release is starting
today. That means new packages can no longer enter the testing distribution,
which will eventually become the next stable release. So we have roughly two
years before OpenRefine can officially be part of a stable release, namely
Debian 12. In the meantime all packages will be available in Debian unstable
as usual, of course.

Other distributions like Ubuntu will regularly sync the packages, making it very
likely that OpenRefine can be shipped with Ubuntu 21.10 already. We also have
the option to create backports for Debian stable. The bullseye-backports suite
will be available one month after the official stable release, which could
happen around June/July 2021.

So the question is: how useful will version 3.x of OpenRefine be in two years,
or in the meantime, and does it make sense to start with 4.x? How many breaking
changes are there between those two versions? How many new dependencies are
needed for 4.x if we start with 3.x first? In the worst case we would package
dependencies which are only useful in the context of OpenRefine (like the fork
of the simile butterfly server) but then they would be dropped and replaced
with something else. This extra time could probably be spent elsewhere. If
there aren't many breaking changes and new dependencies or if 4.x is really
just a bit too experimental right now, then it makes sense to start with the
stable 3.x and then upgrade to 4.x when it is feature complete and stable.

2. SPDX

We already have a similar convention in Debian, a machine-readable
debian/copyright file. [1] Examples: simple [2], more complex [3]. In [4] a few
differences are listed between Debian's copyright format (DEP5) and SPDX.


Our short-license identifiers are already very similar to the SPDX format.
Before a package is accepted into Debian it has to be reviewed by the
maintainer and our ftp-team for compliance with Debian's Free Software
Guidelines. [5] Should I discover any non-free dependencies or license
incompatibilities like Apache-2.0 code linked with GPL-2 only or vice versa, I
will inform you about my findings.



[1] https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
[2] https://tracker.debian.org/media/packages/j/jackson-databind/copyright-2.12.1-1
[3] https://tracker.debian.org/media/packages/u/ufoai-data/copyright-2.5-2
[4] https://wiki.debian.org/Proposals/CopyrightFormat
[5] https://www.debian.org/social_contract
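
For readers unfamiliar with the machine-readable debian/copyright format mentioned above, here is a minimal sketch of how the short license names could be pulled out of such a file. It assumes the usual layout in which each "License:" field carries the short name on its first line; the class name, method name and sample text are hypothetical, and a real checker would also need to handle field-name case and license expressions such as "GPL-2+ or MIT".

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of any Debian tool. */
final class CopyrightLicenseSketch {

    /** Collects the distinct short names that follow "License:" at the start of a line. */
    static Set<String> licenseShortNames(String copyrightText) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : copyrightText.split("\\R")) {
            // Continuation lines (the license text) start with a space, so they are skipped.
            if (line.startsWith("License:")) {
                String name = line.substring("License:".length()).trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up sample in the shape of a debian/copyright file.
        String sample =
                "Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/\n"
                + "\n"
                + "Files: *\n"
                + "Copyright: 2021 Example Author\n"
                + "License: Apache-2.0\n"
                + "\n"
                + "Files: debian/*\n"
                + "Copyright: 2021 Example Packager\n"
                + "License: BSD-3-clause\n";
        System.out.println(licenseShortNames(sample));
    }
}
```

A list like this is only a starting point for the manual review described above; it says nothing about whether the licenses found are compatible with each other.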

Antonin Delpeuch (lists)

Feb 13, 2021, 2:18:47 AM
to openref...@googlegroups.com
Hi Markus,

Thanks very much for this very thorough explanation!

In terms of breaking changes, it's pretty major:
* most extensions written for 3.x will not be compatible with 4.x
(because of a major change of server-side architecture)
* 4.x stores project data in a different format, that 3.x cannot read.
And so far, 4.x has only partial support for reading project data
written with 3.x

Best,
Antonin

Antonin Delpeuch (lists)

Feb 16, 2021, 12:07:11 PM
to openref...@googlegroups.com
Ok, so after thinking about this a bit I am now also leaning towards
packaging 3.x. I think it should be reasonably safe to upgrade Jetty
before 3.5. I will try to do this this week.

This should just give us another incentive to make the migration from
3.x to 4.x as soon as possible, and gives me more time to iron out
aspects of the new architecture which are not quite ripe yet.

Markus, what do you think about this?

Antonin

Tom Morris

Feb 16, 2021, 3:20:22 PM
to openref...@googlegroups.com
On Tue, Feb 16, 2021 at 12:07 PM Antonin Delpeuch (lists) <li...@antonin.delpeuch.eu> wrote:
> Ok, so after thinking about this a bit I am now also leaning towards
> packaging 3.x. I think it should be reasonably safe to upgrade Jetty
> before 3.5. I will try to do this this week.

I had a quick look at this and my (non-exhaustive) investigation seemed to indicate that it should be straightforward.

Tom

Markus Koschany

Mar 12, 2021, 6:04:30 PM
to openref...@googlegroups.com
On Tuesday, 16.02.2021 at 18:07 +0100, Antonin Delpeuch (lists) wrote:
> Ok, so after thinking about this a bit I am now also leaning towards
> packaging 3.x. I think it should be reasonably safe to upgrade Jetty
> before 3.5. I will try to do this this week.
>
> This should just give us another incentive to make the migration from
> 3.x to 4.x as soon as possible, and gives me more time to iron out
> aspects of the new architecture which are not quite ripe yet.
>
> Markus, what do you think about this?

Sorry for not replying earlier but I was waiting for the final go-ahead from
Freexian. Since all formalities are resolved now, I will start to package
OpenRefine 3.4.1 [1] and I expect to introduce all major components within the
next 3-4 weeks, although it may take a bit longer until Debian's ftp team
finally approves the packages. In any case you will get a report about the
current status in the first week of April.

To me it seems there was a preference to package the current stable release
instead of the more experimental 4.x series. Since Jetty 9 doesn't require new
dependencies, I can just package the existing components, and when 3.5 is finally
released I will simply upgrade to it, or apply a patch before that happens.

If there are more questions, I will just contact you on the list.

Markus

[1] https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1

Antonin Delpeuch (lists)

Mar 16, 2021, 10:28:50 AM
to openref...@googlegroups.com
On 13/03/2021 00:04, 'Markus Koschany' via OpenRefine Development wrote:
> Sorry for not replying earlier but I was waiting for the final go-ahead from
> Freexian. Since all formalities are resolved now, I will start to package
> OpenRefine 3.4.1 [1] and I expect to introduce all major components within the
> next 3-4 weeks, although it may take a bit longer until Debian's ftp team
> finally approves the packages. In any case you will get a report about the
> current status in the first week of April.
>
> To me it seems there was a preference to package the current stable release
> instead of the more experimental 4.x series. Since Jetty 9 doesn't require new
> dependencies I can just package the existing components and when 3.5 is finally
> released, I just upgrade to it or apply a patch before this happens.

Great! Note that 3.4.1 uses a lot of really old dependencies; the master
branch (which will become 3.5) should be much easier to work with
(hopefully!).

I have been trying to upgrade to Jetty 9 and I think I am pretty close,
but the CI still disagrees so far. Hopefully I will get this to work
before it becomes a blocker for you.

Do let us know if there is anything that holds you back, if you have any
feedback about how we should be doing things differently, and so on. We
are all ears!

Antonin

Antonin Delpeuch (lists)

Mar 18, 2021, 5:14:40 AM
to openref...@googlegroups.com
On 16/03/2021 15:28, Antonin Delpeuch (lists) wrote:
> I have been trying to upgrade to Jetty 9 and I think I am pretty close,
> but the CI still disagrees so far. Hopefully I will get this to work
> before it becomes a blocker for you.

This is now merged in master - let me know if it works for you!

Best,
Antonin

Markus Koschany

Apr 10, 2021, 5:01:46 PM
to openref...@googlegroups.com
Hello,

here is my first report on where we currently stand with regard to packaging
OpenRefine 3.5 for Debian. I have packaged and uploaded nine new source and
binary packages for now, upgraded libjuniversalchardet-java to the latest
upstream release and quickly packaged jdatapath and the opencsv fork. I'm not
sure if these two packages should really be uploaded to Debian, more about this
shortly.

The new packages are currently waiting in the NEW queue for further review by
Debian's ftp team. https://ftp-master.debian.org/new.html

The list of packages is as follows. The tracker.debian.org links return 404 at
the moment but (for future readers) after the packages enter Debian, you can
find all relevant information about a certain source package there.

1. openrefine-jdatapath
=======================

https://people.debian.org/~apo/openrefine/openrefine-jdatapath/

https://code.google.com/archive/p/jdatapath/

In my opinion jdatapath is only useful for Windows systems (if at all).
Apparently there was a user.home property bug on Windows systems (19 years ago)
and this package was meant to work around it. But according to

https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4787931

the bug was resolved eight years ago in Java 8.

server/src/com/google/refine/Refine.java is the only place that makes use of
the code. I could simply remove the few lines starting at line 358 and the
result for Debian would be the same. I suggest not depending on jdatapath in
the long term and using a patch for Debian as a workaround for now.
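
As an illustration of what a jdatapath-free lookup could look like, here is a minimal sketch that relies only on the standard user.home property and, on Windows, the APPDATA environment variable. This is not the code in Refine.java: the class name, the method signature and the directory layout (including the ".local/share" fallback) are assumptions made for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical helper for illustration; not OpenRefine's actual code. */
final class DataDirSketch {

    /**
     * Chooses a per-user data directory from plain strings, so the logic can be
     * exercised without touching the real environment.
     */
    static Path resolve(String osName, String userHome, String appData, String appName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        if (windows && appData != null && !appData.isEmpty()) {
            // On Windows, prefer APPDATA when it is set.
            return Paths.get(appData, appName);
        }
        if (windows) {
            // Otherwise fall back to the standard user.home property.
            return Paths.get(userHome, appName);
        }
        // Assumed layout for other systems: ~/.local/share/<lowercased name>.
        return Paths.get(userHome, ".local", "share", appName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Path dir = resolve(
                System.getProperty("os.name"),
                System.getProperty("user.home"),
                System.getenv("APPDATA"),
                "OpenRefine");
        System.out.println(dir);
    }
}
```

Passing the operating system name and directories in as parameters keeps the decision testable on any platform; only main reads the real system properties.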


2. openrefine-opencsv
=====================

This is an eight-year-old fork of the opencsv project on sourceforge.

https://people.debian.org/~apo/openrefine/openrefine-opencsv/

We currently have an almost up-to-date opencsv package in Debian and the
upstream project is still actively developed.

https://tracker.debian.org/pkg/opencsv

Ideally the fork should be dropped from the OpenRefine project. I'm not sure
how time-consuming a switch to the sourceforge opencsv project would be, but I
would recommend the switch to avoid code duplication. For now I'm planning on a
new openrefine-opencsv source package, but I will also investigate whether it is
feasible to depend on src:opencsv in Debian instead.

3. libdexx-java
===============

build-dependency of apache-jena

https://tracker.debian.org/pkg/libdexx-java
https://salsa.debian.org/java-team/libdexx-java
https://bugs.debian.org/986609

4. libthrift-java
=================

build-dependency of apache-jena

https://tracker.debian.org/pkg/libthrift-java
https://salsa.debian.org/java-team/libthrift-java
https://bugs.debian.org/986435

5. apache-jena
==============

It took a while to fix a few build failures because Debian's tools appear to
struggle with multi-module Maven projects which have nested modules inside of
modules as is the case with apache-jena. For now it works and I will discuss
these issues with the Debian Java team.

https://tracker.debian.org/pkg/apache-jena
https://salsa.debian.org/java-team/apache-jena
https://bugs.debian.org/986605


6. libmarc4j-java
=================

https://tracker.debian.org/pkg/libmarc4j-java
https://salsa.debian.org/java-team/libmarc4j-java
https://bugs.debian.org/986677


7. liblessen-java
=================

build-dependency of openrefine-butterfly

https://tracker.debian.org/pkg/liblessen-java
https://salsa.debian.org/java-team/liblessen-java
https://bugs.debian.org/986608

8. libsecondstring-java
=======================

build-dependency of openrefine-vicino

https://tracker.debian.org/pkg/libsecondstring-java
https://salsa.debian.org/java-team/libsecondstring-java
https://bugs.debian.org/986680


9. openrefine-arithcode
=======================

build-dependency of openrefine-vicino

https://tracker.debian.org/pkg/openrefine-arithcode
https://salsa.debian.org/java-team/openrefine-arithcode
https://bugs.debian.org/986678


10. openrefine-butterfly
========================

https://tracker.debian.org/pkg/openrefine-butterfly
https://salsa.debian.org/java-team/openrefine-butterfly
https://bugs.debian.org/986611

11. openrefine-vicino
=====================

https://tracker.debian.org/pkg/openrefine-vicino
https://salsa.debian.org/java-team/openrefine-vicino
https://bugs.debian.org/986679

12. libjuniversalchardet-java
=============================

I updated to version 2.4.0, uploaded the package to experimental because of the
current Debian freeze, relocated the Maven coordinates of the old package
(which was eight years old) and made sure that all reverse-dependencies can
still be built from source.

https://tracker.debian.org/pkg/libjuniversalchardet-java



Outlook:

I'm currently packaging odfdom and then the last major missing build-dependency
is org.sweble.wikitext:swc-parser-lazy where we also need three or four
additional build-dependencies to build this package from source. After that I
can start to tie everything together. Thanks for the Jetty update. It looks
fine to me. However testing the change will only be possible in one of the
later packaging stages.

More updates will follow soon.

Regards,

Markus



Antonin Delpeuch (lists)

Apr 11, 2021, 3:15:13 AM
to openref...@googlegroups.com
Hi Markus,

Thanks for this update! I can't wait to get rid of jdatapath, I haven't
looked into it so far but according to your analysis it looks simple.

For opencsv it's a bit more difficult, because to rely on the official
version we would have to drop support for multi-character separators in
CSV/TSV files. The OpenCSV maintainers are not keen to add support for
it upstream either. But we can think again about this issue.
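
To make the multi-character separator requirement concrete, here is a deliberately naive sketch of splitting one line on a separator such as "||". It is not how the opencsv fork does it: the class and method names are hypothetical, and unlike a real CSV parser it ignores quoting and escaping entirely.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; ignores quotes and escape characters. */
final class MultiCharSplitSketch {

    /** Splits a single line on a literal separator of any length. */
    static List<String> splitLine(String line, String separator) {
        // Pattern.quote makes the separator literal; limit -1 keeps trailing empty cells.
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }

    public static void main(String[] args) {
        System.out.println(splitLine("id||name||comment", "||"));
    }
}
```

The hard part in practice is combining such separators with quoted fields, which is where a parser that assumes a single separator character stops being a drop-in replacement.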

Best,
Antonin

Markus Koschany

Apr 19, 2021, 12:11:13 PM
to openref...@googlegroups.com
Hi,

On Sunday, 11.04.2021 at 09:15 +0200, Antonin Delpeuch (lists) wrote:
> Hi Markus,
>
> Thanks for this update! I can't wait to get rid of jdatapath, I haven't
> looked into it so far but according to your analysis it looks simple.
>
> For opencsv it's a bit more difficult, because to rely on the official
> version we would have to drop support for multi-character separators in
> CSV/TSV files. The OpenCSV maintainers are not keen to add support for
> it upstream either. But we can think again about this issue.

In the end I packaged the OpenCSV fork and uploaded it a few days ago. If
your changes cannot be upstreamed, then we had better use the fork to ease
maintenance.

openrefine-opencsv
==================

https://bugs.debian.org/987099
https://tracker.debian.org/pkg/openrefine-opencsv
https://salsa.debian.org/java-team/openrefine-opencsv


I believe I have packaged all major missing dependencies of OpenRefine now and
I will put everything together this week, let's see how it goes.

In addition to my last report I have also packaged the following new source
packages:

librdfa-java
============

https://bugs.debian.org/986857
https://tracker.debian.org/pkg/librdfa-java
https://salsa.debian.org/java-team/librdfa-java

libodfdom-java
==============

https://bugs.debian.org/986681
https://tracker.debian.org/pkg/libodfdom-java
https://salsa.debian.org/java-team/libodfdom-java

libxtc-rats-java
================

https://bugs.debian.org/986922
https://tracker.debian.org/pkg/libxtc-rats-java
https://salsa.debian.org/java-team/libxtc-rats-java

libsweble-common-java
=====================

https://bugs.debian.org/986926
https://tracker.debian.org/pkg/libsweble-common-java
https://salsa.debian.org/java-team/libsweble-common-java

libsweble-wikitext-java
=======================

https://bugs.debian.org/986924
https://tracker.debian.org/pkg/libsweble-wikitext-java
https://salsa.debian.org/java-team/libsweble-wikitext-java

maven-jflex-plugin
==================

https://bugs.debian.org/986921
https://tracker.debian.org/pkg/maven-jflex-plugin
https://salsa.debian.org/java-team/maven-jflex-plugin

httpcomponents-core5
====================

https://bugs.debian.org/987097
https://tracker.debian.org/pkg/httpcomponents-core5
https://salsa.debian.org/java-team/httpcomponents-core5

httpcomponents-client5
======================

https://bugs.debian.org/987098
https://tracker.debian.org/pkg/httpcomponents-client5
https://salsa.debian.org/java-team/httpcomponents-client5


Regards,

Markus



Antonin Delpeuch (lists)

Apr 21, 2021, 7:08:59 AM
to openref...@googlegroups.com
Thanks a lot Markus! I'm on the jdatapath issue.

Best,

Antonin