Removing rows with decimal values in a column from a data frame


Ryan Byrnes

unread,
Apr 4, 2018, 5:27:35 PM4/4/18
to Davis R Users' Group
Hi All,

A silly question, perhaps, but here goes. I have a data frame with multiple columns; one of the columns has numerical values. I want to keep only the rows where this column of interest contains whole numbers, but a few thousand of the values unfortunately have decimals. I won't go into why I want to remove the decimal values, but is there a way to remove rows with decimal values from a data frame and keep the rows with whole numbers? All my searches have come up blank so far...

Thanks!
Ryan


Brandon Hurr

unread,
Apr 4, 2018, 5:31:05 PM4/4/18
to davi...@googlegroups.com
Maybe use filter and a modulo?

library(tibble)
library(dplyr)

tibble(x = c(1, 2, 3, 4.5, 5, 6.2, 7, 8.3)) %>% filter(., x %% 1 == 0)

--
Check out our R resources at http://d-rug.github.io/
---
You received this message because you are subscribed to the Google Groups "Davis R Users' Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to davis-rug+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/davis-rug.
For more options, visit https://groups.google.com/d/optout.

Vince S. Buffalo

unread,
Apr 4, 2018, 5:35:02 PM4/4/18
to davi...@googlegroups.com
Or trunc():

> tibble(x = c(1, 2, 3, 4.5, 5, 6.2, 7, 8.3)) %>% filter(trunc(x) == x)
# A tibble: 5 x 1
      x
  <dbl>
1  1.00
2  2.00
3  3.00
4  5.00
5  7.00

V
--
Vince Buffalo
@vsbuffalo :: vincebuffalo.com
Coop Lab :: Population Biology Graduate Group
University of California, Davis

Scott Devine

unread,
Apr 4, 2018, 5:59:57 PM4/4/18
to davi...@googlegroups.com
This base R approach should also work for you.  If you have NAs in your numeric column, you will need to modify this approach.

#test the approach of logical indexing.  returns a TRUE when a decimal point is not found in a numeric vector converted to a character vector
test <- c(1.1, 1, 2, 1.48, 5)
class(test)
!grepl('\\.', as.character(test))

#implement logical indexing approach on a data.frame
df1 <- data.frame(a=test, b=c('no', 'yes', 'yes', 'no', 'yes'))
df1
df2 <- df1[!grepl('\\.', as.character(df1$a)), ]
df2
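
As a sketch of the NA modification mentioned above (assuming, on my part, that NA rows should be dropped): grepl() treats NA input as a non-match (FALSE), so the negated test alone would quietly keep NA rows.

# Sketch assuming NA rows should be dropped (a different treatment may suit
# your data). grepl() returns FALSE for NA, so !grepl() alone keeps NA rows.
test <- c(1.1, 1, NA, 1.48, 5)
df1 <- data.frame(a = test, b = c('no', 'yes', 'yes', 'no', 'yes'))

keep <- !is.na(df1$a) & !grepl('\\.', as.character(df1$a))
df1[keep, ]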




--
Scott Devine
PhD candidate in Soils and Biogeochemistry
Dept of Land, Air, and Water Resources
University of California, Davis

Vince S. Buffalo

unread,
Apr 4, 2018, 6:18:02 PM4/4/18
to davi...@googlegroups.com
This isn't really a safe solution — what if you have large whole numbers that are displayed in scientific notation?

> test <- c(1.1, 1, 2, 1.48, 11313e14)
> test
[1] 1.1000e+00 1.0000e+00 2.0000e+00 1.4800e+00 1.1313e+18
> as.character(test)
[1] "1.1"        "1"          "2"          "1.48"       "1.1313e+18"
> !grepl('\\.', as.character(test))
[1] FALSE  TRUE  TRUE FALSE FALSE

That last FALSE should be TRUE, since 1.1313e+18 is a whole number. In general, converting numbers to characters for this kind of check is not a good idea.

Also, both Brandon's solution and mine can easily be implemented in base R, e.g.:

> df <- data.frame(x = c(1, 2, 3, 4.5, 5, 6.2, 7, 8.3))
> df[trunc(df$x) == df$x,]
[1] 1 2 3 5 7
> df[trunc(df$x) == df$x, ,drop=FALSE]  # or as dataframe, not vector
  x
1 1
2 2
3 3
5 5
7 7

V

Scott Devine

unread,
Apr 4, 2018, 6:39:55 PM4/4/18
to davi...@googlegroups.com
Nice catch!

However, in the event that there is a very small decimal part, trunc() also fails:
> trunc(1.0000000000000001) == 1.0000000000000001
[1] TRUE
> trunc(1.0000000000000002) == 1.0000000000000002
[1] FALSE
> trunc(1.00000000000000009) == 1.00000000000000009
[1] TRUE

Brandon Hurr

unread,
Apr 4, 2018, 6:50:19 PM4/4/18
to davi...@googlegroups.com
Scott,

Yes, I was going to mention this, but failed to. The same problem appears with modulo:

> 1.0000000000000001 %% 1
[1] 0

The reason is related to machine precision and is explained fairly well here (although slightly backwards):
https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal

If there are assumptions that can be made about the data Ryan has then trunc() and %% should work. If values like 1.0000000000001 exist in his data then perhaps a different approach is required. What that approach would be... I'm not sure.
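
One possible alternative (a sketch on my part, not something settled in this thread) is to compare each value against its rounded self with an explicit tolerance, so that tiny representation errors still count as whole numbers. The 1e-8 tolerance here is an assumption; pick one that suits how the data were produced.

library(dplyr)

# Treat x as "whole" if it is within tol of the nearest integer.
# tol = 1e-8 is an assumed default, not a universal constant.
is_wholeish <- function(x, tol = 1e-8) !is.na(x) & abs(x - round(x)) < tol

df <- data.frame(x = c(1, 2, 3, 4.5, 5, 6.2, 7, 8.3))
df[is_wholeish(df$x), , drop = FALSE]

# dplyr::near() does a similar tolerance-based comparison
# (its default tolerance is .Machine$double.eps^0.5):
filter(df, near(x, round(x)))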

DTL might have a good response if he's watching 👀?

B


Vince S. Buffalo

unread,
Apr 4, 2018, 6:53:00 PM4/4/18
to davi...@googlegroups.com
However, this is a known issue with floating-point arithmetic: the binary nature of a computer means we can't store certain decimals without rounding, e.g.

> print(1/10, digits=20)
[1] 0.10000000000000000555


> 0.3 + 0.6 == 0.9
[1] FALSE
>  0.3 + 0.6 - 0.9
[1] -1.110223e-16
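
The cutoff these examples run into can be inspected directly via .Machine$double.eps, the smallest positive x for which 1 + x is distinguishable from 1 in double precision (a quick sketch):

.Machine$double.eps
# roughly 2.220446e-16 on a typical 64-bit build

(1 + .Machine$double.eps) == 1       # FALSE: 1 + eps is a distinct double
(1 + .Machine$double.eps / 4) == 1   # TRUE: rounds back to exactly 1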

This is a commonly cited paper on these issues: What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg, 1991).

V

Matthew D Pagel

unread,
Apr 4, 2018, 6:53:34 PM4/4/18
to davi...@googlegroups.com
If you find base R and data.frame computations with dplyr and other hadleyverse functions too slow, I've been a big fan of data.table:

library(data.table)
dt = data.table(x = c(1, 2, 3, 4.5, 5, 6.2, 7, 8.3), y= c("do","re","mi","fa","sol","la","ti","do"))
dt[(x-as.integer(x))==0] # or dt[x==ceiling(x)]

this gives you
x y
1: 1 do
2: 2 re
3: 3 mi
4: 5 sol
5: 7 ti

you can even do fun stuff like
dt[,z:=paste0(as.character(x),y,as.character(x))][x==ceiling(x)]

x y z
1: 1 do 1do1
2: 2 re 2re2
3: 3 mi 3mi3
4: 5 sol 5sol5
5: 7 ti 7ti7

if your data is already in a data.frame (e.g. df), load the library and then
dt = setDT(df) # or if you've identified your "keyed" columns, dt = setDT(df, key="primaryKey,secondaryKey")

data.table is lightning fast, but does have a bit of a learning curve.

================
Matthew D. Pagel
Database Administrator: UCDavis Biotelemetry Lab
Data coordinator: multi-agency Enhanced Acoustic Telemetry group

1331 Academic Surge Building
Department of Wildlife, Fish and Conservation Biology (WFCB)
University of California, Davis
One Shields Ave
Davis, CA 95616

Office Ph: 530 752-5372
Lab Ph: 530 752-1156

Matthew D Pagel

unread,
Apr 4, 2018, 6:56:46 PM4/4/18
to davi...@googlegroups.com
I guess these issues can be ignored because

> 1.0000000000001
[1] 1

So, even if he did have 1.0000000000001 in his data, as soon as it's in R, he doesn't have it anymore.

Scott Devine

unread,
Apr 4, 2018, 7:07:22 PM4/4/18
to davi...@googlegroups.com
Hi Matthew, what you see printed on the console is not always what is being stored internally:

> a <- c(1, 1.000000000000002)
> a
[1] 1 1
> a[1] == a[2]
[1] FALSE

You can control what is printed on the console by setting digits in options(). The default is 7; it can't go higher than 22.
> options(digits = 22)
> a
[1] 1.000000000000000 1.000000000000002


--
Scott Devine
PhD candidate in Soils and Biogeochemistry
Dept of Land, Air, and Water Resources
University of California, Davis

Matthew D Pagel

unread,
Apr 4, 2018, 7:49:22 PM4/4/18
to davi...@googlegroups.com
Two minor corrections/augmentations and a comment

In the setDT line, you can just convert your existing data.frame to a data.table in place, without doing an assignment ("=" or "<-"). Just:

    setDT(df) # now df is a data.table

Another option for the comparison check is to test whether a value is an integer more directly, e.g.

    dt[(x==as.integer(x))]

This still seems to have all the issues inherent in floating-point precision, as pointed out by Scott:

> options(digits=22, scipen=999)
> (1+2^-53)==as.integer(1+2^-53)
[1] TRUE
> (1+2^-52)==as.integer(1+2^-52)
[1] FALSE
> 2^-52
[1] 0.00000000000000022204460492503131
> options(digits=22, scipen=0)
> 2^-52
[1] 2.2204460492503131e-16


Why the accuracy cutoff would be somewhere around there (in 64-bit R)... I have no idea. Note that the same cutoff appears to be in effect for ceiling, floor, and round as for the comparison of a value with the as.integer of that value.
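
A quick sketch (my own check, not from the thread) that the cutoff is indeed the same whether you compare against as.integer(), round(), floor(), or ceiling():

x1 <- 1 + 2^-53   # half a unit-in-the-last-place above 1: rounds to exactly 1
x2 <- 1 + 2^-52   # the smallest double strictly greater than 1

c(x1 == as.integer(x1), x1 == round(x1), x1 == floor(x1), x1 == ceiling(x1))
# all TRUE: x1 is stored as exactly 1

c(x2 == as.integer(x2), x2 == round(x2), x2 == floor(x2), x2 == ceiling(x2))
# all FALSE: x2 carries a real fractional part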

--Matt

Ryan Byrnes

unread,
Apr 5, 2018, 6:08:10 AM4/5/18
to Davis R Users' Group
Thanks everyone for the replies - I'll let you know which one of these works best for me!

Matthew D Pagel

unread,
Apr 5, 2018, 1:16:25 PM4/5/18
to davi...@googlegroups.com
Talking to myself here, I'm sure

64-bit floating point allows, I believe, binary exponents up to around ±1023, which takes an 11-bit exponent field; the sign of the overall value takes one more bit. 64 - 12 = 52 bits remain for the fraction.

Due to rounding concerns, you're probably only safe to 2^-51 if you're not strictly on-a-negative-power-of-2 above a whole number. Or 4.4408920985006262e-16.

However, if you have 10 plus some small fraction, the 10 itself will take up 4 bits (3 more bits than "1"), so you might expect 2^-48 to be the smallest such fraction that stays "accurate". Testing it, you're actually fine with 2^-49 (but in this case using the exact fraction):
> (10+2^-49)==as.integer(10+2^-49)
[1] FALSE
> (10+2^-50)==as.integer(10+2^-50)
[1] TRUE
> 10+2^(-49)
[1] 10.000000000000002

//end my tangent from the original question.