Hi,
There are a few options for ingesting data into the platform. One is the hopsworks-cli library, a Java client for the Hopsworks REST API, which lets users upload data through the DatasetService directly into their project's datasets. Another is to open the Kafka port in Hopsworks so that external applications, running for example on IoT devices, can push data directly to it using the Kafka client. It is then straightforward to run a Spark app in Hopsworks that ingests and processes the data from Kafka and optionally persists it to the Datasets, where it can also be shared. An example is shown on slide 18 here. Ingestion from AWS S3 buckets is also feasible using Spark.
On IoT, we have done work on ingesting data from Android devices into Hopsworks via Kafka. The mobile devices authenticate to Hopsworks using the project's X.509 certificate. We are also interested in adding MQTT support, and we are currently investigating how to reuse the authentication mechanism used by the Android devices.
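To make the certificate-based Kafka setup a bit more concrete, here is a minimal sketch of the client-side configuration an external application might use. It only assembles the standard Kafka client SSL properties. The class name, broker address, file paths and password are placeholders I made up, and I am assuming the project's X.509 certificate has been exported as JKS keystore/truststore files, so adjust to whatever your Hopsworks installation actually provides.

```java
import java.util.Properties;

public class HopsworksKafkaConfig {

    /**
     * Builds Kafka client properties for a TLS connection with client
     * authentication. All argument values are supplied by the caller;
     * nothing here is specific to a particular Hopsworks version.
     */
    public static Properties sslProducerProps(String brokers, String keystorePath,
                                              String truststorePath, String password) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        // Standard Kafka client settings for TLS with a client certificate.
        props.put("security.protocol", "SSL");
        props.put("ssl.keystore.location", keystorePath);
        props.put("ssl.keystore.password", password);
        props.put("ssl.key.password", password);
        props.put("ssl.truststore.location", truststorePath);
        props.put("ssl.truststore.password", password);
        // String keys and values; swap in other serializers as needed.
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values (assumptions), not real endpoints or credentials.
        Properties props = sslProducerProps("hopsworks.example.com:9092",
                "/path/to/keystore.jks", "/path/to/truststore.jks", "changeit");
        // Print only the configured keys, never the passwords.
        System.out.println(props.stringPropertyNames());
    }
}
```

With the kafka-clients library on the classpath, these properties can be passed to `new KafkaProducer<String, String>(props)` and records sent with `producer.send(new ProducerRecord<>(topic, value))`. The topic name depends on what has been created in your project.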
About the specific tools: support for Sqoop is coming to Hopsworks in February as part of Airflow on Hopsworks. We don't support Kafka Connect (part of the reason), but we are running Apache Kafka, so its satellite tools can be added to Hopsworks. We don't support NiFi.
In general, the thing to pay attention to when adding new services to Hopsworks is making them compliant with the project-user multi-tenancy model of Hopsworks. That is, a user working in one project should not be able to access, for example, the NiFi workflows of another project, even if the user is a member of both projects. Of course, the implementation details vary depending on the service itself: whether it has any notion of users and, if so, how access management is performed, where users are stored, etc.
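To illustrate the rule, here is a small sketch of an access check that is keyed on the project the user is currently acting in, rather than on the set of projects the user belongs to. This is not how Hopsworks implements it. The class, method and resource names are all hypothetical and only show the shape of the check a new service would need.

```java
import java.util.HashMap;
import java.util.Map;

public class ProjectScopedAccess {

    /** A user acting inside one specific project (hypothetical type). */
    public static final class ProjectUser {
        final String project;
        final String user;

        public ProjectUser(String project, String user) {
            this.project = project;
            this.user = user;
        }
    }

    // Maps a resource id (e.g. a workflow) to the project that owns it.
    private final Map<String, String> resourceProject = new HashMap<>();

    public void register(String resourceId, String project) {
        resourceProject.put(resourceId, project);
    }

    /**
     * Grants access only when the resource belongs to the project the
     * identity is currently acting in. Membership of other projects is
     * deliberately not consulted.
     */
    public boolean canAccess(ProjectUser identity, String resourceId) {
        String owner = resourceProject.get(resourceId);
        return owner != null && owner.equals(identity.project);
    }

    public static void main(String[] args) {
        ProjectScopedAccess acl = new ProjectScopedAccess();
        acl.register("workflow-1", "projectA");
        acl.register("workflow-2", "projectB");

        // The same person is a member of both projects but works in projectA.
        ProjectUser aliceInA = new ProjectUser("projectA", "alice");
        System.out.println(acl.canAccess(aliceInA, "workflow-1"));
        System.out.println(acl.canAccess(aliceInA, "workflow-2"));
    }
}
```

The point of the design is that the identity carries the project context, so a service never has to reason about a user's full membership list to decide whether a request is allowed.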
The most active Hopsworks repository is the one of Logical Clocks, so you could fork it if its AGPLv3 license is compatible with your project. Otherwise, there is also this repo with a different license. Also, this Hops repo is the most heavily maintained and developed, and it is the one Logical Clocks contributes to as well.
Thanks,
Theo