A simple data processing pipeline example (for proximity)
Oren Lederman
May 20, 2019, 4:08:27 PM
to Rhythm Badges
Hi all,
I recently finished anonymizing an analysis environment that I used for some of my studies. It's not perfect, but it can give you some nice examples of how to handle your data. The scripts themselves can be found here - https://github.com/HumanDynamics/rhythm-public-analysis-deltav18-anon
What can you expect to find there?
The environment is structured based on the cookiecutter data science project template. It separates code (on GitHub) from data (stored in AWS S3). There are folders for reusable Python code (under /src), some scripts for synchronizing your data with Amazon S3, and a place for your notebooks (/notebooks).
A data processing "pipeline" with the following steps (rough sketches of the main stages are included at the end of this message):
download - pulls data files from the server. You'll need to dig into the settings and code and tweak it to your needs, but in general you'll need to use your project key, list the hubs that generated the data, and give it a range of dates you are interested in. The server maintains a daily file for each hub, so this script just iterates over these lists and pulls the files
gzip - a hacky stage where we gzip the files before the next steps. It saves a lot of storage space
group - creates per-hour files from all available data. Makes it easier for the next steps to run pre-processing in parallel
process - preprocessing. Creates the basic data structures (member-to-member, member-to-badge, etc.)
clean - a set of functions that clean up the data. Based on your configuration (in the config.py file), it will:
Remove data from before the experiment started and after it ended
For each participant (member), let you define when the person joined and left (see the members file example for reference)
Remove data from when I manually changed the batteries (Sunday night). That's project-specific, but I didn't have the chance to generalize it
analysis - various analysis steps that I was able to generalize. In particular:
Compliance - for each member, this step identifies when the badge was in use and when it wasn't. Based on this information, it keeps only the member-to-member records from when the badges were actually in use. There is no magic here - you need to structure your experiment in a very specific way to do this - I set up location beacons in the reception area (marked in the metadata as "board" type). When a badge is closest to one of these beacons, I assume it is not being used.
Connections - aggregates the member-to-member data frames to create connection tables.
You can set a list of RSSI cutoffs to use, and it will create a different table for each one.
It aggregates at two temporal resolutions (daily and overall/annual) and at several grouping levels (member-to-member, member-to-company, company-to-company). In my example, people belonged to different companies, so it was convenient to group the data and measure the interaction between companies. Companies are defined in the member.csv.
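
To make the steps above a bit more concrete, here are a few rough sketches in Python. They are not the actual code from the repo - server endpoints, file names and column names below are placeholders - but they show the general shape of each stage.

The download stage is basically two nested loops, hubs by dates, saving one daily file per hub. The URL pattern and header name here are made up for illustration; the real ones come from the project settings:

import os
import datetime
import requests

SERVER_URL = "https://my-badge-server.example.com"   # placeholder
PROJECT_KEY = "my-project-key"                        # placeholder
DATA_DIR = "data/raw"
HUBS = ["hub-01", "hub-02"]                           # hubs that generated data
START = datetime.date(2019, 5, 1)
END = datetime.date(2019, 5, 7)

def daterange(start, end):
    day = start
    while day <= end:
        yield day
        day += datetime.timedelta(days=1)

os.makedirs(DATA_DIR, exist_ok=True)
for hub in HUBS:
    for day in daterange(START, END):
        # hypothetical endpoint - adjust to whatever your server actually exposes
        url = "{}/data/{}/{}".format(SERVER_URL, hub, day.isoformat())
        resp = requests.get(url, headers={"X-Project-Key": PROJECT_KEY})
        resp.raise_for_status()
        out_path = os.path.join(DATA_DIR, "{}_{}.txt".format(hub, day.isoformat()))
        with open(out_path, "wb") as f:
            f.write(resp.content)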
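
The gzip stage just compresses every raw daily file in place (standard library only; the path pattern is an assumption):

import glob
import gzip
import os
import shutil

for path in glob.glob("data/raw/*.txt"):
    # write a compressed copy next to the original, then drop the original
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)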
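
The group stage splits the combined data into one file per hour so the pre-processing can run in parallel. This sketch assumes the records fit into a pandas DataFrame with a 'timestamp' column, which is a simplification of the real data format:

import pandas as pd

df = pd.read_csv("data/interim/all_records.csv", parse_dates=["timestamp"])  # placeholder input
for hour, chunk in df.groupby(df["timestamp"].dt.floor("H")):
    out_path = "data/interim/records_{}.csv".format(hour.strftime("%Y-%m-%d_%H"))
    chunk.to_csv(out_path, index=False)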
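
The clean stage is mostly filtering by time: drop everything outside the experiment window, then drop everything outside each member's own join/leave window. File and column names here are illustrative (in the repo they come from config.py and the members file), and a real member-to-member table has two member columns, so you would apply the per-member filter to both sides:

import pandas as pd

EXPERIMENT_START = pd.Timestamp("2019-05-01")
EXPERIMENT_END = pd.Timestamp("2019-05-31")

data = pd.read_csv("data/interim/records.csv", parse_dates=["timestamp"])           # placeholder
members = pd.read_csv("data/external/members.csv", parse_dates=["joined", "left"])  # placeholder

# keep only records inside the overall experiment window
data = data[(data["timestamp"] >= EXPERIMENT_START) & (data["timestamp"] <= EXPERIMENT_END)]

# keep only records inside each member's own participation window
data = data.merge(members[["member", "joined", "left"]], on="member")
data = data[(data["timestamp"] >= data["joined"]) & (data["timestamp"] <= data["left"])]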
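
For the compliance step, the idea is: for every member and time bin, find the beacon the badge hears most strongly; if that beacon is one of the reception-area "board" beacons, the badge is assumed to be off-body and the member-to-member records from that time are dropped. This sketch only filters one side of each member-to-member record and invents column names, so treat it as an illustration of the logic rather than the actual implementation:

import pandas as pd

m2b = pd.read_csv("data/interim/member_to_beacon.csv", parse_dates=["timestamp"])  # placeholder
beacons = pd.read_csv("data/external/beacons.csv")   # beacon metadata with a 'type' column
m2m = pd.read_csv("data/interim/member_to_member.csv", parse_dates=["timestamp"])  # placeholder

# for every member and time bin, keep the beacon with the strongest signal
idx = m2b.groupby(["member", "timestamp"])["rssi"].idxmax()
closest = m2b.loc[idx].merge(beacons[["beacon", "type"]], on="beacon")

# time bins where the closest beacon is a "board" beacon -> badge not in use
not_in_use = closest[closest["type"] == "board"][["member", "timestamp"]]

# drop member-to-member records from members whose badge was not in use
m2m = m2m.merge(not_in_use.assign(off=True),
                left_on=["member1", "timestamp"], right_on=["member", "timestamp"],
                how="left")
m2m = m2m[m2m["off"].isna()].drop(columns=["member", "off"])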
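
And the connections step is basically a groupby per RSSI cutoff. After attaching each member's company (from the members file), you can count co-detections per pair at whatever resolution and grouping level you need. Again, file and column names are placeholders:

import pandas as pd

RSSI_CUTOFFS = [-75, -70, -65]   # example cutoffs; pick whatever makes sense for your space

m2m = pd.read_csv("data/processed/member_to_member_clean.csv", parse_dates=["timestamp"])  # placeholder
members = pd.read_csv("data/external/members.csv")   # includes a 'company' column

# attach each member's company so we can also aggregate at the company level
m2m = (m2m
       .merge(members[["member", "company"]], left_on="member1", right_on="member")
       .rename(columns={"company": "company1"}).drop(columns="member")
       .merge(members[["member", "company"]], left_on="member2", right_on="member")
       .rename(columns={"company": "company2"}).drop(columns="member"))

for cutoff in RSSI_CUTOFFS:
    strong = m2m[m2m["rssi"] >= cutoff]
    # overall member-to-member connection counts
    overall = strong.groupby(["member1", "member2"]).size().reset_index(name="count")
    # daily member-to-member connection counts
    daily = (strong.groupby(["member1", "member2", strong["timestamp"].dt.date])
                   .size().reset_index(name="count"))
    # overall company-to-company connection counts
    companies = strong.groupby(["company1", "company2"]).size().reset_index(name="count")
    overall.to_csv("data/processed/m2m_overall_rssi{}.csv".format(cutoff), index=False)
    daily.to_csv("data/processed/m2m_daily_rssi{}.csv".format(cutoff), index=False)
    companies.to_csv("data/processed/c2c_overall_rssi{}.csv".format(cutoff), index=False)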