Cost of running SnowPlow Analytics

Robert Kingston

Mar 3, 2013, 9:42:04 PM
to snowpl...@googlegroups.com
Hey guys,

Just wanted to get some kind of idea of how much SnowPlow costs to run on sites with ~10 million events per month or more. Beyond simple curiosity, I think it would really help adoption of the platform if we had these figures.

e.g. "Hey, we could save XYZ by choosing SnowPlow over Site Catalyst."

To kick things off, here's a rough cost breakdown for my blog for January, covering 2 EMR jobs and ~4 months of data (around 7,000-10,000 events in total, ~2,000 events per month):

Cloudfront: $0.32
EC2: $1.03
EMR: $0.21
S3: $0.97
Total: $2.53

Keep in mind that I also use Cloudfront as a CDN for my sites, so that figure isn't purely SnowPlow. As a rule of thumb it costs me about $1 per job - expensive for AWS, but still just a drop in the ocean.

As always, keep up the amazing work, Yali and Alex.

Alex Dean

Mar 4, 2013, 6:44:21 AM
to snowpl...@googlegroups.com
Hey Rob,

Many thanks for sharing those numbers - it's much appreciated. We totally agree - sharing cost information is really important to growing use of the SnowPlow platform, especially when you can compare our costs to those of the other enterprise-strength solutions out there, like SiteCatalyst and GA Premium.

Yali is working on an information page for the website with these kinds of numbers on it.

Is there anybody else in the community who would be happy to share their cost numbers? Either in this thread, or, if you would prefer to keep it private, email ya...@snowplowanalytics.com

Thanks all!

Alex


Simon Rumble

Mar 11, 2013, 8:37:24 PM
to snowpl...@googlegroups.com
On Mon, Mar 4, 2013 at 10:44 PM, Alex Dean <al...@snowplowanalytics.com> wrote:
> Many thanks for sharing those numbers - it's much appreciated. We totally agree - sharing cost information is really important to growing use of the SnowPlow platform, especially when you can compare our costs to those of the other enterprise-strength solutions out there, like SiteCatalyst and GA Premium.

Something that would have a big impact on pricing would be the ability to use spot pricing for the Hive EmrEtlRunner job. As an example, an m3.2xlarge in Sydney sits steadily around $0.175 in the spot price history, while an on-demand instance will cost you $1.40. When thinking about latency, it's also worth considering that you can throw cheap compute at the problem much of the time.
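
To put some illustrative (made-up) numbers on that - assuming a nightly run that keeps a single m3.2xlarge busy for two hours, and using the Sydney prices quoted above:

# Illustrative only: a 2-hour nightly run on one m3.2xlarge in Sydney, using the
# on-demand ($1.40/hr) and typical spot ($0.175/hr) prices mentioned above.
# (Ignores the EMR per-hour surcharge, which applies to spot and on-demand alike.)
HOURS_PER_RUN  = 2
RUNS_PER_MONTH = 30

on_demand = 1.40  * HOURS_PER_RUN * RUNS_PER_MONTH  # => $84.00 per month
spot      = 0.175 * HOURS_PER_RUN * RUNS_PER_MONTH  # => $10.50 per month

puts format("On-demand: $%.2f, spot: $%.2f, saving: $%.2f",
            on_demand, spot, on_demand - spot)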

I know EMR should cope just fine with instances disappearing halfway through a job, but how does it cope with a job being paused while waiting for the spot price to drop back below the bid price? And what happens if the master dies?

(Sydney might be a poor example since it's a reasonably new data centre, but even us-east-1 doesn't seem particularly volatile.)
 
> Is there anybody else in the community who would be happy to share their cost numbers? Either in this thread, or, if you would prefer to keep it private, email ya...@snowplowanalytics.com

Now that I've got real-world data flowing, I should be able to share some costs soon.

--
Simon Rumble <si...@simonrumble.com>

Yali Sassoon

Mar 12, 2013, 7:35:36 AM
to snowpl...@googlegroups.com
Rob - I strongly agree that giving prospective users better cost visibility will help with adoption. To that end, I'm going to start work today on a model for forecasting those costs. Once I've got that working, we'll publish an online calculator so prospective users can forecast costs based on event volumes, page views etc.

To make that work, though, I need as many data points as possible (to validate the model). Many thanks for sharing the costs you're seeing, Rob - and thanks in advance, Simon, for sharing your cost data.

Broadly, I think the model should work something like this:

Cloudfront costs should scale with page views, as `sp.js` is reloaded on every page view.

S3 costs should scale with the number of events saved. It should be straightforward to calculate the average length of a line of SnowPlow data, and therefore the cost per line. (Every line is saved to S3 multiple times - first in the collector logs, then in the archive bucket.)

EC2 costs should be reasonably straightforward - we need to work out what size of instance is appropriate to orchestrate EmrEtlRunner and the StorageLoader, and how long the instance is then kept live for. We actually don't use EC2 ourselves for this (we use Hetzner instead), so if anyone can share with us what EC2 setup you're using, that would be very helpful.

EMR costs are the most complicated piece of the model. Amazon bills based on normalized instance hours (always rounding up), so we need to work out the relationship between the volume of data processed by EmrEtlRunner and normalized instance hours. (An additional wrinkle is that EMR takes much longer to load lots of small log files, like those generated by the Cloudfront collector, than fewer, larger files, like those generated by the Clojure collector - so normalized instance hours are a function of average file size as well as the number of records processed.)
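
To make that concrete, here is a very rough Ruby sketch of the kind of model I have in mind. All the unit prices and the example inputs below are placeholders made up purely for illustration - they are not real AWS prices:

# Very rough sketch of the cost model described above. All unit prices are
# placeholders - plug in the current AWS price list for your region.
CLOUDFRONT_PER_10K_REQUESTS = 0.0075  # sp.js served once per page view (placeholder)
S3_PER_GB_MONTH             = 0.095   # each event line is stored several times (placeholder)
EC2_PER_RUN                 = 0.10    # small instance orchestrating EmrEtlRunner/StorageLoader (placeholder)
EMR_PER_NORMALIZED_HOUR     = 0.075   # EC2 + EMR surcharge per normalized hour (placeholder)

def monthly_cost(page_views:, events:, avg_line_bytes:, copies_in_s3:,
                 emr_runs:, normalized_hours_per_run:)
  cloudfront = page_views / 10_000.0 * CLOUDFRONT_PER_10K_REQUESTS
  s3         = events * avg_line_bytes * copies_in_s3 / 1e9 * S3_PER_GB_MONTH
  ec2        = emr_runs * EC2_PER_RUN
  emr        = emr_runs * normalized_hours_per_run * EMR_PER_NORMALIZED_HOUR
  { cloudfront: cloudfront, s3: s3, ec2: ec2, emr: emr,
    total: cloudfront + s3 + ec2 + emr }
end

# Example inputs (again, illustrative only)
p monthly_cost(page_views: 5_000_000, events: 10_000_000, avg_line_bytes: 800,
               copies_in_s3: 3, emr_runs: 30, normalized_hours_per_run: 6)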

What would be incredibly helpful to me in unpicking the relationship between data volumes and EMR costs would be if users shared the (rough) number of rows of data processed and the number of normalized instance hours each EMR job takes. To get hold of this data using the command line tools, run the following to list the jobs that have run:

./elastic-mapreduce --list

Then, to find out the number of normalized instance hours for a particular job, execute:

./elastic-mapreduce --describe --jobflow j-3MBH09MT8WMDT

(Substitute your own jobflow ID for the example above.) The CLI responds with JSON for the jobflow, which includes the number of normalized instance hours, e.g.:

{
  "JobFlows": [
    {
      "VisibleToAllUsers": false,
      "ExecutionStatusDetail": {
        "CreationDateTime": 1363068069.0,
        "LastStateChangeReason": "Steps completed",
        "ReadyDateTime": 1363068357.0,
        "State": "COMPLETED",
        "StartDateTime": 1363068356.0,
        "EndDateTime": 1363068680.0
      },
      "BootstrapActions": [],
      "AmiVersion": "2.3.3",
      "SupportedProducts": [],
      "LogUri": "s3n:\/\/snowplow-emr-logs\/snplow2\/",
      "Instances": {
        "InstanceGroups": [
          {
            "CreationDateTime": 1363068069.0,
            "LastStateChangeReason": "Job flow terminated",
            "ReadyDateTime": 1363068352.0,
            "InstanceRequestCount": 1,
            "InstanceGroupId": "ig-3PYWQOMD2TZD8",
            "LaunchGroup": null,
            "InstanceRole": "MASTER",
            "InstanceRunningCount": 0,
            "State": "ENDED",
            "Name": null,
            "Market": "ON_DEMAND",
            "InstanceType": "m1.small",
            "StartDateTime": 1363068270.0,
            "BidPrice": null,
            "EndDateTime": 1363068680.0
          },
          {
            "CreationDateTime": 1363068069.0,
            "LastStateChangeReason": "Job flow terminated",
            "ReadyDateTime": 1363068357.0,
            "InstanceRequestCount": 2,
            "InstanceGroupId": "ig-1KFMX42XPL7UB",
            "LaunchGroup": null,
            "InstanceRole": "CORE",
            "InstanceRunningCount": 0,
            "State": "ENDED",
            "Name": null,
            "Market": "ON_DEMAND",
            "InstanceType": "m1.small",
            "StartDateTime": 1363068357.0,
            "BidPrice": null,
            "EndDateTime": 1363068680.0
          }
        ],
        "KeepJobFlowAliveWhenNoSteps": false,
        "TerminationProtected": false,
        "MasterInstanceId": "i-0c096446",
        "MasterInstanceType": "m1.small",
        "Ec2SubnetId": null,
        "InstanceCount": 3,
        "Placement": {
          "AvailabilityZone": "eu-west-1a"
        },
        "MasterPublicDnsName": "ec2-54-228-39-26.eu-west-1.compute.amazonaws.com",
        "HadoopVersion": "1.0.3",
        "NormalizedInstanceHours": 3,
        "SlaveInstanceType": "m1.small",
        "Ec2KeyName": "etl-nasqueron"
      },
      "JobFlowRole": null,
      "Name": "SnowPlow EmrEtlRunner: Rolling mode",
      "JobFlowId": "j-3MBH09MT8WMDT",
      "Steps": [ ... ],
      "Properties": []
            },
            "Name": "Elasticity Hive Step (s3:\/\/snowplow-emr-assets\/hive\/hiveql\/mysql-infobright-etl-0.0.8.q)",
            "ActionOnFailure": "TERMINATE_JOB_FLOW"
          }
        }
      ]
    }
  ]
}
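
If it helps, here's a minimal sketch for pulling that figure out programmatically, assuming you've saved the --describe output to a file called describe.json (the filename is just an example):

require 'json'

# Print the normalized instance hours for each jobflow in the --describe output
job_flows = JSON.parse(File.read('describe.json'))['JobFlows']
job_flows.each do |jf|
  puts "#{jf['JobFlowId']}: #{jf['Instances']['NormalizedInstanceHours']} normalized instance hours"
end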

If SnowPlow users can share the (rough) number of lines processed by each EMR ETL job, along with the normalized instance hours, we can plot the results to understand the relationship. (We will, of course, share the plot back with the community in an anonymized form.)

Many thanks in advance, looking forward to getting stuck into the modelling and sharing the results...


Yali




Yali Sassoon

Mar 12, 2013, 7:41:58 AM
to snowpl...@googlegroups.com
Simon - to quickly answer your questions about spot vs on-demand instances for the EMR job:

If you run your EMR cluster entirely on spot instances, then if the price goes above your bid price mid-job, your cluster goes down and the job fails. For that reason, people typically provision on-demand instances for the master and core nodes, and spot instances for the task nodes. (It is commonplace to bid just a few pennies for spot instances, especially if you're using, say, US instances while it's nighttime in the US :-).) It then becomes cost-effective to fire up very large clusters, safe in the knowledge that your job won't fail (just slow down) if the spot price increases mid-job.

For more details see http://understandbigdata.wordpress.com/2012/12/26/using-spot-instances-in-amazon-emr-without-the-risk-of-losing-the-job/ and http://stackoverflow.com/questions/14014744/emr-leverage-using-spot-instances. For modelling purposes I'm going to assume on-demand instances, and leave it to individual users to be clever about how they leverage spot instances to reduce costs...

Simon Rumble

Mar 12, 2013, 8:03:56 PM
to snowpl...@googlegroups.com
Yeah, I didn't know about the instance group construct in EMR. That sounds like exactly what I need: use regular on-demand instances for master and core, so if the spot price is high the job just takes longer, but when the spot price is behaving normally the data churns out quicker.

From my very limited understanding of Ruby syntax, it appears that all the EmrEtlRunner config under :emr:jobflow: gets passed through to Elasticity. I'm trying to work out how I'd pass in the appropriate Elasticity bits to set up a task instance group as spot instances with a specific bid. But I don't think that will work, because JobFlow#set_task_instance_group needs to be passed the config. Or am I getting it wrong?

This would result in considerable savings and speed improvements for my jobs.

Alex Dean

Mar 13, 2013, 4:07:58 PM
to snowpl...@googlegroups.com
Hey Simon,

You're right - EmrEtlRunner simply configures Elasticity's hive_step with the fields from the :jobflow: section. To get spot instances working for tasks, it looks like we would need to change config.yml from:

:jobflow:
    :instance_count: 2
    :master_instance_type: m1.small
    :slave_instance_type: m1.small

To something like:


:jobflow:
    :core_instance_count: 2
    :master_instance_type: m1.small
    :core_instance_type: m1.small
    :task_instance_count: X
    :task_instance_type: m1.small
    :task_instance_bid: 0.25 # Or leave blank for non-spot task instances

And then add support in EmrEtlRunner for something like:


ig = Elasticity::InstanceGroup.new
ig.count = task_instance_count             # from :task_instance_count:
ig.type  = task_instance_type              # from :task_instance_type:
ig.set_spot_instances(task_instance_bid)   # from :task_instance_bid:

jobflow.set_task_instance_group(ig)
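
In other words, the wiring might look roughly like this (a sketch only - the config lookups and the guard below are assumptions based on the proposal above, not the final implementation):

# Only create a task instance group if the user has configured one
task_count = config[:emr][:jobflow][:task_instance_count]
if task_count && task_count > 0
  ig = Elasticity::InstanceGroup.new
  ig.count = task_count
  ig.type  = config[:emr][:jobflow][:task_instance_type]

  bid = config[:emr][:jobflow][:task_instance_bid]
  ig.set_spot_instances(bid) unless bid.nil?  # no bid => on-demand task instances

  jobflow.set_task_instance_group(ig)
end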

I have added a ticket for this:


Thanks for the suggestion,

Alex

Dimitris

Dec 24, 2015, 6:20:44 AM
to Snowplow
Hi Dean, 

Adding the possibility of running task nodes on spot instances was a great idea.

I'm wondering, are you planning to add this functionality for core nodes too? Something like:

:jobflow:
    :master_instance_type: m1.small
    :core_instance_count: 2
    :core_instance_type: m1.small
    :core_instance_bid: 0.25 # Or leave blank for non-spot core instances
    :task_instance_count: X
    :task_instance_type: m1.small
    :task_instance_bid: 0.25 # Or leave blank for non-spot task instances
Thank you in advance

Dimitris

Alex Dean

Dec 24, 2015, 6:38:36 AM
to Snowplow
Hi Dimitris,

Using spot instances for core nodes is possible in Elasticity (our underlying EMR library), so it could be done, but Amazon doesn't particularly recommend it:

> We do not recommend Spot Instances for master and core nodes unless the cluster is expected to be short-lived and the workload is non-critical. Also, Spot Instances are not recommended for clusters that are time-critical or that need guaranteed capacity.

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-spot-instances.html

But if you have a use case for it and submit a PR, we'd be happy to merge it in.

Cheers,

Alex


--
Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean

Dimitris

Dec 24, 2015, 7:35:23 AM
to Snowplow
Hi Dean

Thank you for your reply. Well, yes, I think it makes sense, so I'll give it a try. 

Thanx

Dimitris