Rob
- strongly agree that giving prospective users better cost visibility will help with adoption. To that end, I'm going to start work today on a model for forecasting those costs. Once I've got that working, we'll publish an online calculator so prospective users can forecast costs based on event volumes, page views etc.
To make that work, though, I need as many data points as possible to validate the model. Many thanks for sharing the costs you're seeing, and thanks in advance, Simon, for sharing your cost data.
Broadly, I think the model should work something like this:
CloudFront costs should scale with page views, as `sp.js` is reloaded on every page view.
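To give a feel for how that piece of the model might look, here's a minimal sketch in Python - note that the sp.js file size and the prices are illustrative placeholders I've made up for the example, not actual AWS rates:

# Rough CloudFront cost sketch: data transfer for sp.js plus per-request charges.
# The file size and prices below are placeholder assumptions, not real AWS pricing.
SP_JS_KB = 37.0                    # assumed size of sp.js in KB
PRICE_PER_GB_OUT = 0.12            # assumed data transfer price (USD per GB)
PRICE_PER_10K_REQUESTS = 0.0075    # assumed request price (USD per 10,000 requests)

def cloudfront_cost(page_views):
    transfer_gb = page_views * SP_JS_KB / (1024 * 1024)
    return transfer_gb * PRICE_PER_GB_OUT + (page_views / 10000.0) * PRICE_PER_10K_REQUESTS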
S3 costs should scale with the number of events saved. It should be straightforward to calculate the average length of a line of SnowPlow data, and therefore the cost per line. (Every line is saved to S3 multiple times - first in the collector logs, then in the archive bucket.)
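Again as a rough sketch, with the average line length, number of copies and price all placeholder assumptions:

# Rough S3 cost per month as a function of events stored.
# Average line length, copies kept and the price are placeholder assumptions.
AVG_LINE_BYTES = 800         # assumed average length of one line of SnowPlow data
COPIES_KEPT = 2              # e.g. raw collector logs plus the archive bucket
PRICE_PER_GB_MONTH = 0.095   # assumed S3 storage price (USD per GB-month)

def s3_cost_per_month(events_stored):
    stored_gb = events_stored * AVG_LINE_BYTES * COPIES_KEPT / (1024 ** 3)
    return stored_gb * PRICE_PER_GB_MONTH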
EC2 costs should be reasonably straightforward - we need to work out what size of instance is appropriate to orchestrate EmrEtlRunner and the StorageLoader, and how long the instance is then kept live for. We don't actually use EC2 ourselves for this (we use Hetzner instead), so if anyone can share the EC2 setup you're using, that would be very helpful.
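That piece of the model would then just be instance hours multiplied by an hourly rate, something like the sketch below (the price is a placeholder, and the instance size is exactly what we're hoping users can tell us):

# Rough EC2 cost for the box that runs EmrEtlRunner / StorageLoader.
# The hourly price is a placeholder assumption.
ORCHESTRATION_PRICE_PER_HOUR = 0.065   # assumed on-demand price (USD per hour)

def ec2_cost(hours_live_per_run, runs_per_month):
    return ORCHESTRATION_PRICE_PER_HOUR * hours_live_per_run * runs_per_month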
EMR costs are the most complicated piece of the model. Amazon bills based on normalized instance hours (always rounding up), so we need to work out the relationship between the volume of data processed by EmrEtlRunner and normalized instance hours. (An additional wrinkle is that EMR takes longer to load the many smaller log files generated by the CloudFront collector than e.g. the larger files generated by the Clojure collector, so normalized instance hours are a function of average file length as well as the number of records processed.)
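The shape I'd expect that piece of the model to take is sketched below. The normalization weights follow Amazon's published weighting of instance sizes, but the price is a placeholder, and the cluster-hours-per-run figure is precisely the thing we need the community's data points to estimate:

import math

# Rough EMR cost sketch. Amazon bills in normalized instance hours: each
# instance's hours are rounded up, then weighted by instance size.
NORMALIZATION = {"m1.small": 1, "m1.medium": 2, "m1.large": 4, "m1.xlarge": 8}
EMR_PRICE_PER_NIH = 0.015   # placeholder price per normalized instance hour (USD)

def normalized_instance_hours(cluster_hours, instance_counts):
    """instance_counts e.g. {"m1.small": 3} for one master plus two slaves."""
    hours = math.ceil(cluster_hours)   # always rounded up to whole hours
    return sum(hours * n * NORMALIZATION[t] for t, n in instance_counts.items())

def emr_cost(cluster_hours, instance_counts):
    return normalized_instance_hours(cluster_hours, instance_counts) * EMR_PRICE_PER_NIH

# The missing ingredient is cluster_hours as a function of rows processed and
# average file size - that is the relationship we want to fit from real jobs.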
What would be incredibly helpful for me in unpicking the relationship between data volumes and EMR costs would be if users shared the number of rows of data processed (roughly) and the number of normalized instance hours each EMR job takes. To get hold of this data using the command line tools, execute:
./elastic-mapreduce --list
to list the jobs that have run. Then, to find out the number of normalized instance hours for a particular job, execute:
./elastic-mapreduce --describe --jobflow j-3MBH09MT8WMDT
(Substituting your own jobflow ID for the example above.) The CLI responds with a JSON description of the JobFlow that includes the number of normalized instance hours, e.g.:
"VisibleToAllUsers": false,
"ExecutionStatusDetail": {
"CreationDateTime": 1363068069.0,
"LastStateChangeReason": "Steps completed",
"ReadyDateTime": 1363068357.0,
"StartDateTime": 1363068356.0,
"EndDateTime": 1363068680.0
"LogUri": "s3n:\/\/snowplow-emr-logs\/snplow2\/",
"CreationDateTime": 1363068069.0,
"LastStateChangeReason": "Job flow terminated",
"ReadyDateTime": 1363068352.0,
"InstanceRequestCount": 1,
"InstanceGroupId": "ig-3PYWQOMD2TZD8",
"InstanceRole": "MASTER",
"InstanceRunningCount": 0,
"InstanceType": "m1.small",
"StartDateTime": 1363068270.0,
"EndDateTime": 1363068680.0
"CreationDateTime": 1363068069.0,
"LastStateChangeReason": "Job flow terminated",
"ReadyDateTime": 1363068357.0,
"InstanceRequestCount": 2,
"InstanceGroupId": "ig-1KFMX42XPL7UB",
"InstanceRunningCount": 0,
"InstanceType": "m1.small",
"StartDateTime": 1363068357.0,
"EndDateTime": 1363068680.0
"KeepJobFlowAliveWhenNoSteps": false,
"TerminationProtected": false,
"MasterInstanceId": "i-0c096446",
"MasterInstanceType": "m1.small",
"AvailabilityZone": "eu-west-1a"
"HadoopVersion": "1.0.3",
"NormalizedInstanceHours": 3,
"SlaveInstanceType": "m1.small",
"Ec2KeyName": "etl-nasqueron"
"Name": "SnowPlow EmrEtlRunner: Rolling mode",
"JobFlowId": "j-3MBH09MT8WMDT",
"Name": "Elasticity Hive Step (s3:\/\/snowplow-emr-assets\/hive\/hiveql\/mysql-infobright-etl-0.0.8.q)",
"ActionOnFailure": "TERMINATE_JOB_FLOW"
If SnowPlow users can share with us the number of lines processed by each EMR ETL job (roughly) and the normalized instance hours, we can plot the results to understand the relationship. (We will, of course, share the plot back with the community in an anonymized form.)
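For the avoidance of doubt about what we'd do with those numbers, the analysis itself is simple - something like the sketch below, where the data points are made-up placeholders purely to show the idea, would give us a first-cut cost per million rows (with average file size added as a second variable if it turns out to matter):

import numpy as np

# Made-up placeholder data points, purely to illustrate the intended analysis:
# (rows processed per ETL run, normalized instance hours billed for that run).
rows = np.array([200000, 1500000, 4000000, 9000000])
nih = np.array([3, 4, 8, 15])

# First cut: a simple linear fit of normalized instance hours against rows.
slope, intercept = np.polyfit(rows, nih, 1)
print(f"~{slope * 1000000:.2f} normalized instance hours per million rows, "
      f"plus ~{intercept:.1f} hours of fixed overhead per run")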
Many thanks in advance, looking forward to getting stuck into the modelling and sharing the results...
Yali