Hi Thad,
Of course the word "record" is used in a lot of database systems - but
that does not mean they have anything in common.
Flink could be interesting in the future to run OpenRefine workflows on
streams (which I would be keen to have), but that is a significant step
away from what OR currently does. Aiming for a data model that is even
further away from what OpenRefine currently is would make a migration
even riskier. I don't think we have the time or resources to venture
into that right now. The leap by going to Spark is big enough already.
If we end up making the data processing backend pluggable, then why not
develop a Flink implementation - but I do not think that should be a
priority.
I know you would be keen to "have SQL support", whatever that actually
means - it's easy to ask for some software to "support" some
technology; it's quite another to come up with a meaningful integration
that actually helps people in their workflows. Remember Data Packages…
If you mean being able to run a SQL query on an OpenRefine project (or
multiple projects, perhaps, with joins between them), then this is
something we could do with most engines, including Spark (and again, the
availability of user defined types is not going to be a blocker for
that). But I have yet to be convinced that this is something worth
developing in OpenRefine: we need to have a clear vision of which user
workflows we are talking about. *Concrete* workflows :)
I know you don't like when I ask for something "concrete", but that is
what it really comes down to: you cannot plan migrations like this on
the basis of a sales pitch enumerating the technologies supported by
some platform. You need to take a deep look into the architecture to
understand how that platform could fit in and cater for our users' needs.
> 2. Do we plan at some point in the future to be able to save « trees »
> in a project?
Do you mean representing hierarchical information in a more faithful way
than the records mode? I would be very keen to have that. As you know we
have been discussing this before:
https://groups.google.com/forum/#!searchin/openrefine/future|sort:date/openrefine/X9O8NBC1UKQ
If that's not what you mean, just expand :)
table["My-Table-1"].rows
tree["My-First-Tree".branches]
This month I have implemented the idea (suggested in this thread) of
making the datamodel implementation pluggable, so that we could easily
switch from Spark to something else. I am really happy about this move
because it does make things a lot cleaner.
I realize I have not replied to this! Very sorry about that.
Rereading some of the earlier discussion from the end of April, the details of which I kind of skimmed over in the flurry of discussion, I wanted to provide feedback on some of the perceived design constraints.
I don't take it as a given that functionality and form need to be preserved 100%. In particular, when there's a "nothing else in the world operates quite this way" moment, I take that as a sign to take a step back and consider whether OpenRefine should do it differently too. As an example, Fill Down and Blank Down, which require row state, exist principally as an artifact of how we create row groups aka "records". We could provide the same functionality by designating certain columns to be the grouping key (currently the leading columns) and, instead of using blank/null as the flag, saying that any rows with the same set of values for those cells are in the same group/record (see the sketch below). In the same vein, we should look at records with an eye towards what people use them for rather than as a literal implementation which can't be changed.
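A minimal sketch of that grouping idea, with rows modeled as plain maps (all names here are illustrative, not OpenRefine's actual API):

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    class RecordGrouping {
        // Rows sharing the same values in the designated key columns
        // fall into the same group/record; no blank cells or
        // fill-down/blank-down state is needed to encode the grouping.
        static Map<List<String>, List<Map<String, String>>> group(
                List<Map<String, String>> rows, List<String> keyColumns) {
            return rows.stream().collect(Collectors.groupingBy(
                    row -> keyColumns.stream()
                            .map(row::get)
                            .collect(Collectors.toList()),
                    LinkedHashMap::new,
                    Collectors.toList()));
        }
    }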
From an internal architecture point of view, rather than having facets create a filtered row set to operate on, we could have them generate a filter expression (a la SQL WHERE) that we can use to generate statistics and combine together to produce the sets of rows to operate on; a sketch follows. It's been years since I looked at it, but as I remember MassEdit was a candidate. It's kind of a generic catch-all that's used to implement a bunch of stuff.
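A rough sketch (the Facet interface and the Row placeholder are invented for illustration):

    import java.util.function.Predicate;

    // Placeholder for the project's row type.
    class Row { }

    // Each facet exposes a predicate (the moral equivalent of a SQL
    // WHERE clause) instead of materializing a filtered row set.
    interface Facet {
        Predicate<Row> toPredicate();
    }

    class FacetEngine {
        // AND-combine all active facets into a single row filter,
        // usable both for computing facet statistics and for selecting
        // the rows an operation should touch.
        static Predicate<Row> combine(Iterable<Facet> facets) {
            Predicate<Row> combined = row -> true;
            for (Facet facet : facets) {
                combined = combined.and(facet.toPredicate());
            }
            return combined;
        }
    }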
All that being said: I am very keen to make progress on the discussion
around the records mode to identify its successor (and have similar
discussions on other aspects of the tool that we want to revamp). It
just requires a lot of design effort that we have not been able to put
in so far.
Antonin
Great! I am easy to convince: just cook up a SQL schema to represent an
OpenRefine project, and then give some sample SQL queries which
represent the operations I listed.
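To fix ideas only - this is not a worked-out proposal - one naive
option would be to flatten a project grid into a table with one row per
cell and query it through Spark SQL. All names below are invented for
the example:

    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class ProjectSqlStrawman {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[*]").appName("or-sql-strawman")
                    .getOrCreate();

            // Strawman schema: one row per cell.
            StructType schema = new StructType()
                    .add("row_id", DataTypes.LongType)
                    .add("column_name", DataTypes.StringType)
                    .add("value", DataTypes.StringType);
            Dataset<Row> cells = spark.createDataFrame(Arrays.asList(
                    RowFactory.create(0L, "name", "a"),
                    RowFactory.create(0L, "extended", "v1"),
                    RowFactory.create(1L, "name", null)), schema);
            cells.createOrReplaceTempView("cells");

            // A "facet by blank" on column 'name', expressed as SQL:
            spark.sql("SELECT row_id FROM cells "
                    + "WHERE column_name = 'name' AND value IS NULL")
                 .show();
            spark.stop();
        }
    }

Whether queries like this actually map onto the operations in question
is exactly the part that needs to be demonstrated.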
>
> Is it the UI itself and how to give users controls on visualizing,
> transforming, querying records that you struggle with here?
> Or is it the internals and data models?
The records mode is fundamentally a data model problem. The question is:
how should users represent hierarchical data in OpenRefine projects? For
instance, how could workflows like this one be done without the records
mode:
https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Tutorials/Working_with_APIs
Antonin
Data publishing technologies can be compared along two orthogonal dimensions: flexibility of data modeling and flexibility of presentation. The former is a spectrum ranging from rigid schemas that cannot be changed to general data models that can fit any information. The latter ranges from predetermined layouts and styles to completely customizable presentations. Figure 2.1 lays out this space and frames the discussion that follows.
- when running an OpenRefine workflow in a pipeline (for instance as a
Spark job), I want external data to be fetched in a pipelined way: for
instance, if my workflow consists of reconciliation followed by data
extension, I expect that the corresponding HTTP requests will be done in
near-parallel: the first data extension request will not have to wait
for the last reconciliation request.
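(The Spark documentation excerpt just below refers to code that did not
survive the paste; reconstructed here as a minimal Java sketch of the
anti-pattern it describes:)

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    class ClosureDemo {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext("local[*]", "closure-demo");
            int[] counter = {0};
            JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
            // Wrong: each executor mutates its own deserialized copy
            // of 'counter'. In local mode this may appear to work; on
            // a cluster the driver's value stays 0. A LongAccumulator
            // is the supported way to aggregate across executors.
            rdd.foreach(x -> counter[0] += x);
            System.out.println("Counter value: " + counter[0]);
            sc.close();
        }
    }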
The behavior of the above code is undefined, and may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task's closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.
- similarly, if we implement runners for stream-based engines (such as
Flink), we could potentially run OpenRefine workflows on streams (such
as Kafka sources). Not all operations make sense in this setting (for
instance, reordering rows using a sorting criterion is not possible
without adding a windowing strategy), but this could still be useful
down the line. In this setting, if I am fetching URLs, I obviously want
these to be fetched as the stream is consumed.
- in the near future, we also want to be able to persist partial URL
fetching results (or reconciliation results) such that we can recover
from a crash during a long-running operation more efficiently, and also
let the user inspect partial results before the operation completes
(typically, for reconciliation: being able to inspect the first few
reconciliation candidates that are returned, and potentially start
judging them before all other candidates are fetched).
Do you agree with these use cases? Are there others I should keep in mind?
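To make the first use case concrete, here is a rough sketch of the kind
of pipelining I mean, using CompletableFuture; reconcile() and extend()
are hypothetical stand-ins for the two HTTP requests:

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    class PipelinedFetch {
        // Stand-ins for the reconciliation and data extension
        // requests of a single row.
        static CompletableFuture<String> reconcile(String row) {
            return CompletableFuture.supplyAsync(() -> row + ":reconciled");
        }
        static CompletableFuture<String> extend(String reconciled) {
            return CompletableFuture.supplyAsync(() -> reconciled + ":extended");
        }

        static List<String> run(List<String> rows) {
            // Each row's data extension starts as soon as its own
            // reconciliation completes: the first extend() does not
            // wait for the last reconcile().
            List<CompletableFuture<String>> futures = rows.stream()
                    .map(row -> reconcile(row)
                            .thenCompose(PipelinedFetch::extend))
                    .collect(Collectors.toList());
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        }
    }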
I would like to be able to load a new Excel/CSV file into the existing OpenRefine dataset and do cleaning on it. Ideally, this new dataset would come in as new rows that I can then start manipulating.
The goal is to reach a stage where I can merge the backlog of commits
that have accumulated on the master branch since I branched off to work
on the new architecture (this is going to be a lot of fun) and then stay
up to date by cherry-picking changes as they are merged in master.
Having a complete feature set soon will help run arbitrary workflows on
this branch, which is necessary for thorough debugging.
ChangeData objects can of course be persisted to disk and restored, with
an API similar to that of GridState.
In addition, projects are associated with a ChangeDataStore which holds
all the change data objects associated with operations in the project
history.
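In rough shape, the pieces look like this (hypothetical signatures
meant to illustrate the description above, not the actual code):

    import java.io.File;
    import java.io.IOException;

    // Change data produced by an operation (e.g. fetched URLs or
    // reconciliation candidates), persistable like a GridState.
    interface ChangeData<T> {
        void saveToFile(File file) throws IOException;
        T get(long rowId);
    }

    // Holds the change data of all operations in a project's history,
    // keyed by the history entry that produced it.
    interface ChangeDataStore {
        <T> void store(ChangeData<T> data, long historyEntryId)
                throws IOException;
        <T> ChangeData<T> retrieve(long historyEntryId)
                throws IOException;
    }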
I also (re)discovered one aspect of the data extension operation which
made its migration a bit more difficult. When you run this operation in
rows mode, the operation does not respect facets fully.
On 02/09/2020 20:57, Tom Morris wrote:
> If that's not a typo where you meant record mode, that just sounds like
> a bug. Is there a good reason to preserve the buggy behavior?
I did mean what you quoted. In records mode, facets are respected, but
not in rows mode.
For me, the current behaviour feels right, although it can be a bit
surprising at first. Taking the same example grid, what other behaviour
could we have?
We could only fetch the first value of each property when in rows mode,
which would give us:
a    b    d    v1
     c
But that is silently discarding data, which sounds pretty dangerous.
Or we could add extra values on new empty rows:
a    b    d    v1
               v2
     c
But that breaks the record structure, since b and c should intuitively
be on contiguous rows.
That being said, this last behaviour is perhaps okay for most purposes;
I could be convinced to switch to it. But the bottom line is that when
dealing with multi-valued properties, the user will have to switch to
the records mode anyway.
Note: when using custom objects as the key in key-value pair operations, you must be sure that a custom equals() method is accompanied with a matching hashCode() method. For full details, see the contract outlined in the Object.hashCode() documentation.
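For illustration, a key type that honours that contract (CellKey is a
made-up name):

    import java.io.Serializable;
    import java.util.Objects;

    // Safe to use as a key in keyBy()/groupByKey(): equals() and
    // hashCode() are overridden together, as the note requires.
    final class CellKey implements Serializable {
        private final long rowId;
        private final String columnName;

        CellKey(long rowId, String columnName) {
            this.rowId = rowId;
            this.columnName = columnName;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof CellKey)) return false;
            CellKey k = (CellKey) other;
            return rowId == k.rowId && columnName.equals(k.columnName);
        }

        @Override
        public int hashCode() {
            return Objects.hash(rowId, columnName);
        }
    }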
Another step I have in mind is adding a method in the datamodel
interface to parse a grid from an iterable stream (a stream that you can
iterate over multiple times). Even if by default Spark does not take
advantage of that, other datamodel runners could (our home-grown one,
Flink, others…)
This should make it possible to improve the efficiency of parsers such
as the JSON/XML parsers, for which we can read the file in a streaming
fashion.
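Roughly (the types are placeholders and the method name is illustrative
only):

    // Placeholders for the datamodel types discussed in this thread.
    class GridState { }
    class Row { }
    class ColumnModel { }

    interface DatamodelRunner {
        // The row stream must be iterable multiple times so that
        // runners needing several passes can re-read it, while
        // JSON/XML importers can still produce rows lazily on each
        // pass instead of buffering the whole file in memory.
        GridState loadGrid(Iterable<Row> rows, ColumnModel columnModel);
    }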
Another improvement I would also like to make is the ability to create
an OpenRefine project from a file, without duplicating the file contents
in the workspace: being able to work from the original file directly.
This would be useful for large datasets held in a cluster (for instance
a Hadoop file system), when they are in a format that is quick enough to
parse.
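In Spark terms the idea is simply to point the runner at the original
location and let it read lazily (a sketch; the path is invented):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.SparkSession;

    class DirectImportSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[*]").appName("direct-import")
                    .getOrCreate();
            // Read the dataset in place: nothing is copied into the
            // workspace, and Spark only scans the file when needed.
            Dataset<String> lines =
                    spark.read().textFile("hdfs://cluster/dataset.csv");
            System.out.println(lines.count());
            spark.stop();
        }
    }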
We should also discuss whether (and how) we would like to support
formats such as Parquet. Since the use of Spark in OpenRefine should be
optional, this means we should rely on other libraries to parse these
(when Spark is not available). It is not clear to me that this is an
urgent need of the OpenRefine community either, so I haven't dedicated
effort to that yet (but I have ideas of how to make it work). Perhaps it
would still be worth doing it for the sake of bringing visible changes
to users.
Then I will carry on with general fixes, catching as many bugs as I
notice (there will inevitably be quite a few). This can already be done
using the existing datamodel implementations (Spark and naive). In
parallel I will be developing a new datamodel implementation that is
suitable as a default implementation (as discussed earlier).
I have finished merging the new architecture and the latest changes on
the master branch.
The tests for grid sorting were failing, and that was due to some race
condition in the tests, which would fail more often on the new
architecture.
In the meantime Florian has been working on improving these
tests to make them less sensitive to timings.
I have added
support for importing projects exported from OpenRefine 3.x, but that
only imports the project in the state it was left in - the history is
discarded. This is because the architecture change makes it very hard to
preserve this history. This is potentially a pain point since arguably
one of the main benefits of project imports is to preserve this history
(even if the import is still useful to preserve things like
reconciliation data or overlay models). I can think of a few ways to
mitigate this:
- make it clearer in the UI that the history is discarded, by showing
the user a message, indicating that they should use OpenRefine 3.x if
they need to preserve the history.
- introduce a non-breaking change in OpenRefine 3.6 (for instance),
which serializes the initial grid state in addition to the current grid
state in every project. This would make it possible to preserve history
of 3.6 projects in 4.x. So if a user has a 3.4 project that they want to
open with 4.x, they could first upgrade to 3.6, open the project there,
export it to an archive and import it into 4.x.
- invest a lot of effort to re-implement old changes in OpenRefine 4.x
so that the new architecture is fully backwards-compatible. In any case,
this would not make it possible to work seamlessly with 3.x and 4.x on
the same workspace:
once a project has been converted to 4.x, it will no
longer be understood by 3.x.
The Hazelcast Apache Spark Connector allows Hazelcast Maps and Caches to be used as shared RDD caches by Spark, using the Spark RDD API. Both the Java and Scala Spark APIs are supported.
I forgot I owed you this info from our last call with SJ regarding in-memory data grid and how we used Ignite with Spark ...