row indexing

138 views
Skip to first unread message

Håkan Jonsson

unread,
May 9, 2011, 4:17:57 AM5/9/11
to google-refine
Hi,

Sorry for posting a newbie question, but I couldn't find the answer
to this simple question regarding Google Refine anywhere: Can I
reference different rows to create new column values? For example, I
want to create a new column where each cell contains the value of
row(i), column(x) minus row(i-1), column(x)?

Regards,
Håkan

Thad Guidry

unread,
May 9, 2011, 11:02:46 AM5/9/11
to google...@googlegroups.com
The row.index is a variable defined in Google Refine: http://code.google.com/p/google-refine/wiki/Variables

gives some examples of row.index usage.

If that doesn't help you along, let us know.

2011/5/9 Håkan Jonsson <haj...@gmail.com>

Håkan Jonsson

unread,
May 9, 2011, 1:20:29 PM5/9/11
to google...@googlegroups.com
Yes, I noticed that, but I couldn't figure out a way to reference a value on the previous row (rather than the current row)?

/Håkan

David Huynh

unread,
May 9, 2011, 1:24:51 PM5/9/11
to google...@googlegroups.com
Hi Håkan,

You can't reference the previous row like that ... Just to understand your use case a bit more, is it only the immediately preceding row that you want, or do you want to go an arbitrary number of rows backward?

By the way, rows within each record can reference one another. That solves some use cases, but not yours.

David

2011/5/9 Håkan Jonsson <haj...@gmail.com>

Håkan Jonsson

unread,
May 9, 2011, 2:14:07 PM5/9/11
to google...@googlegroups.com
Hi,

In this case I am only interested in the previous row, not any row. But for me it is a common case.

/Håkan

Randall Amiel

unread,
May 9, 2011, 10:18:23 PM5/9/11
to google-refine
David:

Could the feature include something that could process to the next row
that meets some criteria such as a blank row or a row that matches
some regular expression or column(s) expression(s). Then from there if
we think of refine as a paper feed, it would be nice to have access to
the previous x rows that meets the previous criteria and possibly the
next row (at a minimum). Furthermore, it would be even nicer to have
something like the regular expression look-ahead/look-behind where we
could also find the next break in advance, Since you guys are limiting
the scope of refine to approx 1million rows of data, it would be
pretty feasible to process, in the use cases that you don't have the
whole file as a break etc...... To note, could the numbers of columns
and amount of system memory could be of limiting factors? what other
limiting factors do you foresee?

This could be very useful in processing more complex datasets like
i've discussed with the forum before such as complex XML like datasets
that are unstructured and very hard to detect and moreover process.I
think that having pointer access to the previous (x) rows and next (x)
rows where (x) is a critieria matching row could be very useful in
refining these "types" of complex data structures that are loosely
defined.

-- Randall
On May 9, 1:24 pm, David Huynh <dfhu...@gmail.com> wrote:
> Hi Håkan,
>
> You can't reference the previous row like that ... Just to understand your
> use case a bit more, is it only the immediately preceding row that you want,
> or do you want to go an arbitrary number of rows backward?
>
> By the way, rows within each record can reference one another. That solves
> some use cases, but not yours.
>
> David
>
> 2011/5/9 Håkan Jonsson <haj...@gmail.com>
>
>
>
>
>
>
>
> > Yes, I noticed that, but I couldn't figure out a way to reference a value
> > on the previous row (rather than the current row)?
>
> > /Håkan
>

David Huynh

unread,
May 11, 2011, 12:30:28 AM5/11/11
to google...@googlegroups.com
Hi Randall,

I definitely agree that there's a need for accessing other rows. It's a bit tricky to implement in conjunction with faceted browsing, though, because with some facets or filters applied, "the nth preceding row" might mean 2 different things: either the row addressed by subtracting the current row's index by n, or the preceding nth row that has matched the facets and/or filters. The nth *following* row is even harder because we would need to look ahead.

We might need 2 different syntaxes for the 2 meanings

  rows[row.index - 1]  // this is the first meaning
  row.neighbors[-1]  // this is the second meaning

And what you're asking is yet another meaning, which perhaps we can express as

  row.neighbors.find(10, r, r.cells["column X"].value == 5)
  // look forward up to 10 rows, bind each row to variable "r",
  // evaluate the expression, if it's true, return that row)

I'd like to understand all the use cases we know before committing to this feature. We want the fewest and simplest building blocks that would solve all those cases.

But, note that with however we decide to implement this feature, there is an important complication: these operations that involve looking at other rows cannot be implemented in a map/reduce manner in which rows are processed in parallel.

David

drLization

unread,
Jun 6, 2011, 3:41:02 PM6/6/11
to google...@googlegroups.com
Dear David,

I have a similar request to this one. In my situation I'm using refine to clean data for Data Migration from 1 system to another system. We extract all kinds of data such as Vendor & Customer Details and Bank account records that have been collected in their old system over the past 7 years. All of this has been entered by several people who follow slightly different conventions and when I first heard about Google Refine, I realized how it will off-load a lot of what we currently require our customers to do while preparing data for us. Surely they will take the opportunity to filter out outdated data, but at the same time a lot of cleaning can be done as well. (where in the old system 1 field was used for multiple details, we can now use the powerful Cell transformations (with regex) and split Column functionality to get the data in the right format). Our datasets are usually between 500 and 1000 records (very small :)

The problem I'm facing now, I believe, is related to this discussion. 

What I'm trying to achieve is to get a running number for Vendor records based on the rowIndex within a Facet. I'm getting close with the facetCount(value[0,2],"value[0,2]","Vendor_Name")  but not quite there yet.

An Example might help, for instance I have the following Column Based on "Vendor_Name:":
rowvaluevalue[0,2]+facetCount(value[0,2],"value[0,2]","Vendor_Name")
1.3D CAM PAZ. SAN. TIC. LTD. STI3D1
2.ADANA HILTON ENT. OTEL.LTD.STIAD1
3.AGRUP BIYOLOJIK URUN.SAN.TIC.LTD.STIAG2
4.AGABEY GIDA TARIM VE HAY. TIC.SAN.LTD.STAG2

What I want to see:
rowvaluevalue[0,2]+facetCount(value[0,2],"value[0,2]","Vendor_Name")
1.3D CAM PAZ. SAN. TIC. LTD. STI3D1
2.ADANA HILTON ENT. OTEL.LTD.STIAD1
3.AGRUP BIYOLOJIK URUN.SAN.TIC.LTD.STIAG1
4.AGABEY GIDA TARIM VE HAY. TIC.SAN.LTD.STAG2

For now, I have to go back to Excel and use =SUMPRODUCT(($A$2:$A$7=A2)*(ROW($A$2:$A$7)<ROW(A2)))+1
Which will give an array of 0's and 1's for the first part wherever the value is equal to the current value multiplied with an array of 0's and 1's where the rows are before the current row, finally the resulting array of 0's and 1's for each row that pass both conditions is summed to give the total of records that matched both conditions.

My first approach was to try if the Facet Expression could handle this double condition within the facetCount (although I doubt this is possible now), my second thought was to see if I could get the rowIndex within a certain facet (which brought me to this thread).

As I see this discussion is touching on possible use cases in accessing other rows. I believe that my scenario would fall under the concept of "neighbor" rows within the same facet, although only looking back would suffice in my case.

If however you believe I can already achieve the above in the current Refine release, then I look forward to hear suggestions.

Best Regards

David Huynh

unread,
Jun 7, 2011, 1:33:53 AM6/7/11
to google...@googlegroups.com
Hi drLization,

Today I checked in a change that would support your use case. You could try checking out the code from trunk/ if you like, or wait for the 2.1 release. With this change, you can solve your problem by the following steps

1. Create a new column called "prefix" based on Vendor_Name, with expression:
        value[0,2]
2. Re-order the columns so that "prefix" is the first column.
3. Sort by "prefix" and invoke Re-order rows permanently.
4. Switch to "record" mode.
5. Create a new column based on column prefix, with the expression:
        value + (row.index - row.record.fromRowIndex + 1)

Let me know if that works!

David

Thad Guidry

unread,
Jun 7, 2011, 9:34:09 AM6/7/11
to google...@googlegroups.com
3. Sort by "prefix" and invoke Re-order rows permanently.

Didn't see where to Re-order rows permanently.  Something missing now in /trunk r2090 ?
 
4. Switch to "record" mode.
5. Create a new column based on column prefix, with the expression:
        value + (row.index - row.record.fromRowIndex + 1)
 
No addition effect.  Consecutive values stayed the same such as MI-1 and MI-1.
Minus sign is treated as a string ? or is that the intent ?
 
Let me know if that works!

David
 
--
-Thad
http://www.freebase.com/view/en/thad_guidry

Paul Makepeace

unread,
Jun 7, 2011, 9:51:27 AM6/7/11
to google...@googlegroups.com


On Tue, Jun 7, 2011 at 14:34, Thad Guidry <thadg...@gmail.com> wrote:
3. Sort by "prefix" and invoke Re-order rows permanently.

Didn't see where to Re-order rows permanently.  Something missing now in /trunk r2090 ?

It's at the end of the row "Show as: rows records  Show: 5102550 rows  Sort"

P

Thad Guidry

unread,
Jun 7, 2011, 10:00:51 AM6/7/11
to google...@googlegroups.com
David.... SERIOUSLY? haha... why the heck hidden under there ? hmm,
checkmark in the Sort... panel was were I was expecting.
(minor annoyance)

--
-Thad
http://www.freebase.com/view/en/thad_guidry

Thad Guidry

unread,
Jun 7, 2011, 10:05:30 AM6/7/11
to google...@googlegroups.com
On Tue, Jun 7, 2011 at 9:00 AM, Thad Guidry <thadg...@gmail.com> wrote:
> David.... SERIOUSLY?  haha... why the heck hidden under there ?  hmm,
> checkmark in the Sort... panel was were I was expecting.
> (minor annoyance)

Ah, never mind... I just found the usefulness of having a quick Sort
dropdown selector.
David HAS thought of everything. ;)

--
-Thad
http://www.freebase.com/view/en/thad_guidry

drLization

unread,
Jun 8, 2011, 10:06:25 AM6/8/11
to google-refine
Thank you, I love how quickly you replied. I'm sorry for not replying
earlier as I was holding back until I had the chance to try it..
unfortunately I'm on project and had to work on the outstanding tasks
at hand.

will try and come back to you as soon as I can.
> > rowvaluevalue[0,2]+facetCount(value[0,2],"value[0,2]","Vendor_Name")1.3D
> > CAM PAZ. SAN. TIC. LTD. STI3D12.ADANA HILTON ENT. OTEL.LTD.STIAD13.AGRUP
> > BIYOLOJIK URUN.SAN.TIC.LTD.STIAG24.AGABEY GIDA TARIM VE HAY.
> > TIC.SAN.LTD.STAG2
> > What I want to see:
> > rowvaluevalue[0,2]+facetCount(value[0,2],"value[0,2]","Vendor_Name")1.3D
> > CAM PAZ. SAN. TIC. LTD. STI3D12.ADANA HILTON ENT. OTEL.LTD.STIAD13.AGRUP
> > BIYOLOJIK URUN.SAN.TIC.LTD.STIAG14.AGABEY GIDA TARIM VE HAY.
> > TIC.SAN.LTD.STAG2

Magdmartin

unread,
Apr 9, 2012, 4:32:25 PM4/9/12
to google...@googlegroups.com
David,

I was looking for a way to get the row index within a record ... thanks for the tips.
In my case instead of adding 2 or 3, I want to add 0.2 or 0.3 
I've tried the following toNumber(row.index - row.record.fromRowIndex + 1)/10 but grefine returns 0.

After some testing (while typing this message) I found a work around: toNumber(row.index - row.record.fromRowIndex + 1)*0.01

I guess it is still worth reporting it ...

Martin

Le mardi 7 juin 2011 01:33:53 UTC-4, David Huynh a écrit :
Hi drLization,

Today I checked in a change that would support your use case. You could try checking out the code from trunk/ if you like, or wait for the 2.1 release. With this change, you can solve your problem by the following steps

1. Create a new column called "prefix" based on Vendor_Name, with expression:
        value[0,2]
2. Re-order the columns so that "prefix" is the first column.
3. Sort by "prefix" and invoke Re-order rows permanently.
4. Switch to "record" mode.
5. Create a new column based on column prefix, with the expression:
        value + (row.index - row.record.fromRowIndex + 1)

Let me know if that works!

David

Tom Morris

unread,
Apr 9, 2012, 6:24:20 PM4/9/12
to google...@googlegroups.com
On Mon, Apr 9, 2012 at 4:32 PM, Magdmartin <martin.m...@gmail.com> wrote:

> I was looking for a way to get the row index within a record ... thanks
> for the tips.
> In my case instead of adding 2 or 3, I want to add 0.2 or 0.3
> I've tried the following toNumber(row.index - row.record.fromRowIndex +
> 1)/10 but grefine returns 0.

That's a quirk of the way that computers/programming languages treat
numbers. Integer arithmetic is used if all the operands are integers.
If you divide by 10.0 instead of 10, you'll get the result that you
expect.

It could be argued that, as a tool for non-programmers, Refine would
be better off using floating point for everything, but that'd probably
have other unexpected side effects (e.g. weird roundoff effects due to
the way computers encode floating point numbers).

Tom

Paul Makepeace

unread,
Apr 9, 2012, 9:52:15 PM4/9/12
to google...@googlegroups.com

The biggest annoyance in my experience is that equality is effectively undefined for floating point owing to approximate representations. To non-programmers that's probably even more surprising :-/

Tom Morris

unread,
Apr 11, 2012, 10:15:24 AM4/11/12
to google...@googlegroups.com
On Mon, Apr 9, 2012 at 9:52 PM, Paul Makepeace <pa...@paulm.com> wrote:
> The biggest annoyance in my experience is that equality is effectively
> undefined for floating point owing to approximate representations. To
> non-programmers that's probably even more surprising :-/

Yes, "approximate representations" is what my "weird roundoff effects"
was referring to. Explaining the concept of comparing to an epsilon
in order to determine approximate equality seems likely to be even
more confusing to than explaining why integer and floating point
arithmetic work differently.

Tom

Reply all
Reply to author
Forward
0 new messages