help wanted: scan and analyze your sp services bills

Paul Gallagher

Jul 20, 2012, 11:36:12 AM
to Singapore Ruby Brigade
Hi folks,

crazY! write a gem to scan SP Services bills when it only takes a few moments to manually transcribe them?

Why bother? A few reasons:
  • The SP Services website is very limited. They don't give you data, just your bill in PDF format
  • Other services like https://www.gogreenpost.com/ aren't focused on service data or usage insights either. They are more about (a) going paperless and (b) keeping track of paying on time [neither of which I find compelling (done already); I think they really should pivot to focus more on actionable insights - but that's another story!]
OK, to be honest I just dusted off some old code I had just for the fun of it, and because I was inspired to try some R by Sau Sheong's new book Exploring Everyday Things with Ruby and R.

So I've just pushed a first (very definitely alpha) version of the sps_bill gem. What does it do?
  • it provides a ruby library and CLI to scan SP Services bills (the PDF statements you can download from https://services.spservices.sg/) and extract the interesting data
  • I've added an initial example R script and sample data sets that produce analyses like this 

But I am really looking for help at this point to test it out more:

1) to verify it can handle the range of bills that SP Services send out. Attempting to read structured data from PDFs is a fraught exercise. I am able to read the past year of my own bills, but that doesn't prove much. Do you have very long bill statements? An unusual combination of services? A commercial account (if they are different)? A multi-property bill (if there is such a thing)?

I have set up the test suite so that it is a piece of cake to drop in your own bills and run a full test - see the README for details.

I'd *really* appreciate the help of everyone who could try this out. When you do - email me directly or post on the GitHub issues page.
  

2) or maybe you are more interested in the analytics. My R skillz are rudimentary, and I've only started to scratch the surface - just doing basic trending so far. Many more questions: how strongly correlated is the usage of different services? can we identify inflection points in trendlines?

Check out the scripts directory for more, including a sample data set. I'd appreciate it even if you can just help critique my R.
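
The correlation question above can be explored even before reaching for R. What follows is a minimal sketch in plain Ruby of a Pearson correlation between two monthly series. The method name `pearson` and the sample figures are assumptions made up for illustration; they are not taken from the sps_bill gem or its sample data sets.

```ruby
# Pearson correlation coefficient between two equal-length numeric series.
# Returns a value between -1 and 1 (NaN if either series is constant).
def pearson(xs, ys)
  raise ArgumentError, "series must be the same length" unless xs.size == ys.size
  raise ArgumentError, "need at least two points" if xs.size < 2

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n

  # Sum of products of deviations, and sums of squared deviations.
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }

  cov / Math.sqrt(var_x * var_y)
end

# Made-up monthly figures, for illustration only (not real bill data).
electricity_kwh = [520.0, 540.0, 610.0, 630.0, 480.0, 450.0]
water_m3        = [18.0, 19.0, 22.0, 23.0, 16.0, 15.0]

r = pearson(electricity_kwh, water_m3)
puts format("electricity vs water correlation: %.3f", r)
```

A value near 1 would suggest the two services rise and fall together; with only a year or so of monthly points, any such figure should be read with caution.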


So after all this, have I learned anything from taking a much closer look at my bills than I ordinarily would? Oh yes:

First: By looking at trends over a year+, I was able to see a major change in my usage.
  • You don't see this in the 6-month charts that SP Services include in your bill.
  • Specifically: I was astounded by a significant step down in water and electricity usage at a certain point ... that happens to correspond to when I had all our original (7+ year old) aircon units replaced. It seems like the new aircon really is much more efficient - but I never realised just how much better until now.

Second: It's clear that current meter data is too poor for meaningful use as an aid to optimising energy and water use

  • I'm still serviced by meters that are manually read every 3(?) months or so
  • The pro-rating and adjustments that SP Services apply are too inaccurate to be able to see fine trends in detail.
  • e.g. if you change a shower head and want to know if it is improving your water usage, you are better off checking your meter yourself than relying on the meter readings that show in your bill.
  • I believe we are getting new meters soon. Hopefully they are the smart kind that allow automated remote reading.
  • Now I'm hoping that SP Services will then give its customers access to download the raw daily meter data. That data really *would* be useful for monitoring behavioural changes to see the actual impact on energy and water usage. Anyone heard anything along these lines?
  • If they give us the data, I will most happily put the sps_bill gem into early retirement!
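
To see why pro-rated figures hide fine trends, here is a small worked example in Ruby. It assumes the simplest possible rule, an even spread of one quarterly reading across three months; the actual pro-rating method SP Services uses is not described in this thread and may well differ. All figures are made up.

```ruby
# Hypothetical "true" monthly water use (cubic metres): a step down in
# month 3, e.g. after fitting a new shower head. Figures are made up.
actual_monthly = [30.0, 30.0, 18.0]

# With a manual reading once a quarter, only the period total is known.
metered_total = actual_monthly.sum

# ASSUMED pro-rating rule: spread the total evenly across the months.
prorated_monthly = Array.new(actual_monthly.size, metered_total / actual_monthly.size)

actual_monthly.each_with_index do |actual, i|
  puts format("month %d: actual %.1f, pro-rated estimate %.1f",
              i + 1, actual, prorated_monthly[i])
end
```

Under that assumption every month is reported as the same average, so the drop in month 3 only shows up, diluted, when the next actual reading arrives.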

Cheers,
Paul


Sau Sheong Chang

Jul 20, 2012, 11:45:02 AM
to singap...@googlegroups.com
WOW!

I'm totally blown away by this, and also very flattered that I've been cited as an influence for you to start this project. I'll definitely check it out.

Paul Gallagher

Jul 20, 2012, 12:30:58 PM
to singap...@googlegroups.com
thanks Sau Sheong. I'm glad I got your attention;-) I was in fact planning to ask you for your thoughts on creative analyses that might be possible with R.

I'm only doing simple moving averages so far(!), but one of the key questions I'd like to be able to answer is "has there been a significant change in usage"? I'm not sure of the best way to go about answering that - for example: given a date, compare trends before and after; or even automatically identify inflections by analyzing the trend gradient.
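
As a rough illustration of the "given a date, compare before and after" idea, here is a minimal sketch in plain Ruby rather than R. The method names `moving_average` and `before_after` and the usage figures are hypothetical, chosen for this example only; the sketch compares means and does not test statistical significance.

```ruby
# Simple moving average over a fixed window; one value per full window.
def moving_average(series, window)
  series.each_cons(window).map { |slice| slice.sum / window.to_f }
end

# Compare mean usage before and after a given index (for example the
# month an appliance was replaced). Returns both means and the relative change.
def before_after(series, change_index)
  before = series.take(change_index)
  after  = series.drop(change_index)
  raise ArgumentError, "need data on both sides of the change" if before.empty? || after.empty?

  mean_before = before.sum / before.size.to_f
  mean_after  = after.sum / after.size.to_f
  { before: mean_before, after: mean_after,
    change: (mean_after - mean_before) / mean_before }
end

# Made-up monthly kWh figures with a step down after month 6.
usage = [600, 620, 610, 590, 605, 615, 450, 440, 460, 455, 445, 450]

smoothed = moving_average(usage, 3)
result   = before_after(usage, 6)

puts "3-month moving average: #{smoothed.map { |v| v.round(1) }.inspect}"
puts format("mean before: %.1f, mean after: %.1f, change: %.1f%%",
            result[:before], result[:after], result[:change] * 100)
```

Deciding whether such a change is "significant" would still need a proper test on top of this, which is where R is likely the better tool.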

It would be nice to build up a collection of R scripts for SP Services data - so again, invitation is open to all to hack away on the sample data (or your own), and contribute back anything that is interesting.

cheers,
paul

--
You received this message because you are subscribed to the Google Groups "Singapore Ruby Brigade" group.
To view this discussion on the web visit https://groups.google.com/d/msg/singapore-rb/-/w3pBQtkJp-MJ.
To post to this group, send email to singap...@googlegroups.com.
To unsubscribe from this group, send email to singapore-rb...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/singapore-rb?hl=en.

Sau Sheong Chang

Jul 20, 2012, 12:56:56 PM
to singap...@googlegroups.com
I think there are lots of possibilities but the more important part of it is probably getting the data. I need to check out what u have done first and see how to add the R bits on top. 

Patrick Haller

Jul 20, 2012, 2:36:31 PM
to singap...@googlegroups.com
On 2012-07-21 00:30, Paul Gallagher wrote:
> I'm only doing simple moving averages so far(!), but one of the key
> questions I'd like to be able to answer is "has there been a
> significant change in usage"?

It'll probably have some seasonality, so check out ARIMA in R's stats package.
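
Fitting an ARIMA model is a job for R (for instance `arima()` in the stats package), but the seasonal differencing step that seasonal models rely on is easy to sketch in plain Ruby. The method name `seasonal_difference`, the period of 4 and the figures below are assumptions for illustration only.

```ruby
# Seasonal differencing: subtract from each value the value one full
# season earlier, which removes a repeating seasonal pattern.
# The result has `period` fewer points than the input.
def seasonal_difference(series, period)
  raise ArgumentError, "series must be longer than one period" if series.size <= period

  # Each window spans period + 1 points, so last - first is x[t] - x[t - period].
  series.each_cons(period + 1).map { |window| window.last - window.first }
end

# Made-up quarterly usage over two years: same seasonal shape, lower level in year 2.
usage = [100, 120, 140, 110, 90, 110, 130, 100]

p seasonal_difference(usage, 4)
```

For monthly bills with an annual cycle the period would be 12, which needs more than a year of data before any differences can be computed.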

This will probably be followed shortly by you buying watt-meters and
monitoring power with more granularity. ;)

Meng Weng Wong

Jul 21, 2012, 9:03:19 AM
to singap...@googlegroups.com, Meng Weng Wong
On Jul 20, 2012, at 11:36 PM, Paul Gallagher wrote:
> Specifically: I was astounded by a significant step down in water and electricity usage at a certain point ... that happens to correspond to when I had all our original (7+ year old) aircon units replaced. It seems like the new aircon really is much more efficient - but I never realised just how much better until now.

Electricity I understand, but water? Is your aircon made by Hyflux?


Meng Weng Wong

Jul 21, 2012, 9:29:32 AM
to singap...@googlegroups.com, Meng Weng Wong
On Jul 20, 2012, at 11:36 PM, Paul Gallagher wrote:

> crazY! write a gem to scan SP Services bills when it only takes a few moments to manually transcribe them?

I really like this work.

Back in January one of the ideas floating around JFDI was a general purpose backoffice mail handler -- like procmail for snailmail.

Today, SP. Tomorrow:

- bank statements
- credit card statements
- other utility statements
- invoices
- and any other structured data which gets flattened to paper or PDF.

There are a handful of "open and scan" mail handler services like earthclassmail.com but none of them are smart. I'd love a service that would convert my snailmail to data structures available over a combination of push and pull APIs compatible with, say, ifttt.com. Then I could script it and trigger auto-payments, etc.

Your work also connects to a Big Data idea that came up at the Quantified Self meetup last week. Working title: the Fishbowl Flag.

Problem 1: There's a lot of data about me out there – public transit tap-in and tap-out logs; medical records; utility statements; geolocation records. In a sense it's "my data" because it's about me, but in another sense it's not, because it lives on external servers which I have no access to.

Problem 2: Researchers working with public data are limited to pared-down, anonymized datasets. The limits exist because the public haven't given informed consent to publication.

Inspiration: Esther Dyson said she'd publish her genome. Genomera.com is doing crowdsourced clinical trials.

Solution to Problem 1: Imagine a Data Transparency Act that requires all public and corporate records of individual data to be made available to that citizen. In the Web 1.0 era that data would be compressed into a zip file and shipped grudgingly upon request, much as Facebook offers a zip archive of your account. In the cloud era that data would be made available in the cloud not just through a dumb webpage but through some sort of JSON-standard API. UP Singapore's data wiki and Microsoft's Project Nimbus give a sense of the datasets available.

Solution to Problem 2: Imagine if you could instruct all those data sources to share your data at the anonymity level of your choice, with the recipients of your choice. Even if only one Quantified Self geek out of a hundred muggles signed up to flip their fishbowl flag, a lot of data would be liberated. One might even choose to share one's data only with other people who had similarly signed up to share their data: you show me yours, I'll show you mine. First the early adopters do it, then the network effects take over and everybody volunteers.

In most societies the Fishbowl Flag would require opt-in. But under the theory of "libertarian paternalism", and under the theory that Singaporeans are unusually sanguine about state intrusion into the private sphere, it might be possible for Singapore to be the first country in the world to turn the Fishbowl Flag on by default.

I have invited the Quantified Self people to continue this discussion on my interblog.


Paul Gallagher

Jul 22, 2012, 6:13:55 AM
to singap...@googlegroups.com
On Sat, Jul 21, 2012 at 9:29 PM, Meng Weng Wong <meng...@gmail.com> wrote:
> Today, SP. Tomorrow:
>
> - bank statements
> - credit card statements
> - other utility statements
> - invoices
> - and any other structured data which gets flattened to paper or PDF.

exactly;-)

Ideally data custodians will see the light and open up machine-readable access to the data they hold about us(!!) .. but until then, *re*-combobulating data with tools will help demonstrate people really do want the data.

Actually, to that end, I just extracted the PDF-parsing smarts from sps_bill to another gem: https://github.com/tardate/pdf-reader-turtletext

If anyone's interested in taking aim at parsing another PDF source, pdf-reader-turtletext gives a little more of a leg up than raw pdf-reader (and I have a few more DSL-like ideas I think I'll add)
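
To give a feel for the general approach of reading structured data out of a PDF by position, here is a self-contained sketch in plain Ruby. It does not use or reproduce the pdf-reader-turtletext API: the `Fragment` struct, the `region_text` method and the sample fragments are all hypothetical, and a real reader would obtain the positioned text from the PDF itself.

```ruby
# A piece of text plus the position at which it is drawn on the page.
# PDF coordinates have their origin at the bottom-left of the page.
Fragment = Struct.new(:x, :y, :text)

# Return the text inside a rectangular region as rows of strings,
# rows ordered top-to-bottom and each row ordered left-to-right.
# Fragments whose y values fall in the same y_precision-sized bucket are
# treated as one row (a crude tolerance; values near a bucket edge can split).
def region_text(fragments, xmin, xmax, ymin, ymax, y_precision = 5)
  inside = fragments.select { |f| f.x.between?(xmin, xmax) && f.y.between?(ymin, ymax) }
  rows = inside.group_by { |f| (f.y / y_precision.to_f).round }
  rows.sort_by { |bucket, _| -bucket }.map do |_, row|
    row.sort_by(&:x).map(&:text)
  end
end

# Hypothetical fragments loosely resembling a bill layout (made-up values).
fragments = [
  Fragment.new(40, 700, "Electricity"),
  Fragment.new(300, 701, "512 kWh"),
  Fragment.new(40, 680, "Water"),
  Fragment.new(300, 680, "18.4 Cu M"),
  Fragment.new(40, 100, "Page 1 of 2")
]

# Read only the band of the page where the usage table sits.
p region_text(fragments, 0, 600, 650, 750)
```

The row tolerance matters because text that looks aligned on the page is often drawn at slightly different y positions, which is part of what makes PDF scraping fragile.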

 
> There are a handful of "open and scan" mail handler services like earthclassmail.com but none of them are smart. I'd love a service that would convert my snailmail to data structures available over a combination of push and pull APIs compatible with, say, ifttt.com. Then I could script it and trigger auto-payments, etc.
>
> Your work also connects to a Big Data idea that came up at the Quantified Self meetup last week. Working title: the Fishbowl Flag.
 
I like the name;-) I should get along to a  Quantified Self meetup - sounds like my kind of bof


Paul Gallagher

Jul 22, 2012, 6:21:50 AM
to singap...@googlegroups.com
On Sat, Jul 21, 2012 at 2:36 AM, Patrick Haller <patrick...@gmail.com> wrote:

> This will probably be followed shortly by you buying watt-meters and
> monitoring power with more granularity. ;)


good prediction Patrick!
.. says he, studying the EKM-OmniMeter I v.3 and drooling over all the amazing stuff on alibaba.

Patrick Haller

Jul 22, 2012, 6:52:14 AM
to singap...@googlegroups.com
On 2012-07-22 18:21, Paul Gallagher wrote:
> good prediction Patrick!
> .. says he, studying the EKM-OmniMeter I v.3 and drooling over all
> the amazing stuff on alibaba.

If you want to monitor a bunch of circuits and other things, setting up
a MODBUS network with
http://csimn.com/CSI_pages/BBSPX.html
http://www.dentinstruments.com/power_meters_demand_response_branch_circuit_monitoring.html
will make it SNMP-monitorable.

I'm sure alibaba will have the same tech for much cheaper, I'm just
cutting and pasting from old notes.

Bas Vodde

Jul 22, 2012, 7:27:32 AM
to singap...@googlegroups.com

Hola Paul,

Nice job :)

Slightly unrelated question. Do you know any gems other than Prawn for PDF generation? Especially ones that can replace text but still take care of spacing and things like that.
Been looking for one for a while and procrastinating on writing one :) (this is another procrastination :P)

Bas

Paul Gallagher

Jul 22, 2012, 7:43:59 AM
to singap...@googlegroups.com
hola Bas.

sorry, no I haven't found a gem for that. I was looking for something similar in order to redact my SP Services bills so I could add them as samples without leaking PI, but I didn't find anything that quite met my needs (which was specifically to substitute text exactly while keeping everything else about the document structure intact).

The closest I came were the Multivalent tools - but that's Java. Depending on exactly what you need, you might be able to make use of things like uncompress (I think similar to what you can do with PDFBox).

If you come up with anything, let us know!

Bas Vodde

Jul 22, 2012, 7:45:32 AM
to singap...@googlegroups.com

Yah, too bad :( I was afraid of that already. It is sort of a non-trivial problem to do in a good way :(

Well, if I ever get to it, then I'll let you know :) If you find something earlier, please share :P

Bas

Keith Bennett

Jul 22, 2012, 12:59:15 PM
to singap...@googlegroups.com
I did some open source work on Apache Tika, which coordinates the parsing of multiple document types by various libraries, and PDFBox is the library to which PDF parsing was delegated.

I don't know anything about which PDF parsing library is best, but if using JRuby is an option you have access to PDFBox and a large number of mature libraries for this and other things.  (Of course, getting locked into using only JRuby may not be best for you.)

Here's their site in case you want to take a closer look:

http://pdfbox.apache.org/

If you're using rvm, JRuby is trivially easy to set up -- just do 'rvm install jruby'.  It assumes, of course, that you have a Java runtime somewhere it would expect to be found.  Once you 'rvm jruby' you can run JRuby commands like commands of other Ruby implementations (that is, 'ruby', 'irb', etc., no need for 'jruby', 'jirb', etc.).  Setting up PDFBox should hopefully just be a matter of putting the jar files where they can be found at runtime.

- Keith
-- 
Keith R. Bennett
http://about.me/keithrbennett
Available for project work, specialties are Ruby and JRuby.