Hi,
Greetings!!!
Role: HPC Observability Engineer (Python, HPC)
Location: Remote
Duration: 6+ months
Description:
The client runs Grafana and InfluxDB services on Kubernetes (K8s) in house, on premises. Telegraf is used to ingest data from a GPU HPC cluster into InfluxDB. This engineer will help collect and visualize data for the “Terra” platform. The HPC Observability Engineer should have experience in:
- Developing Grafana Dashboards
- Utilizing Telegraf to ingest data into InfluxDB
The HPC Observability Engineer should become familiar with this environment and deliver the following:
- Grafana dashboard landing page for the HPC cluster
- “Drill down” dashboards for each server including:
  - Memory, Network, and CPU utilization
  - GPU metrics (power, utilization, memory used)
- Exploration of other useful metrics that exist “out of the box” with current InfluxDB data
- Documentation for how to write Python scripts to ingest external data into InfluxDB, with specific examples
  - Examples include:
    - Initially, a simple Python script to collect system load. Ensure its output is consistent with native collection, and use it as a POC and an example script to build from.
    - InfiniBand packet data (script written, needs to be ingested)
    - LSF jobs for each queue in various states (Pend, Run)
- Visualization in Grafana of various “non-native” resources:
  - Server-specific metrics like the IB packet data mentioned above
  - Cluster-wide metrics like the LSF jobs mentioned above
- Bonus: Explore and deliver plugins from third-party vendors like DDN’s Lustre, Mellanox fabric, etc.
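For candidates wondering what the Python ingestion examples above might look like, here is a minimal sketch, not the client's actual script. It assumes the script would be run by Telegraf (for instance through its exec input with the influx data format) and simply prints InfluxDB line protocol to stdout. The measurement name `load_poc`, the helper names, and the sample `bqueues` text are hypothetical and used for illustration only; the field names `load1`, `load5`, and `load15` are chosen to mirror what Telegraf's native system input reports, so the two can be compared.

```python
#!/usr/bin/env python3
"""Sketch: print metrics as InfluxDB line protocol (names are illustrative)."""
import os
import socket
import time

# Hypothetical sample of `bqueues` output, used only to exercise the parser.
SAMPLE_BQUEUES = """\
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
normal           30  Open:Active       -    -    -    -    10     4     6     0
gpu              40  Open:Active       -    -    -    -     3     1     2     0
"""


def _escape(value):
    # Line protocol needs commas, spaces, and equals signs escaped in tags.
    return str(value).replace(",", r"\,").replace(" ", r"\ ").replace("=", r"\=")


def to_line(measurement, tags, fields, ts_ns):
    """Format one point; ints get the 'i' suffix, everything else is a float."""
    tag_part = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f"{k}={v}i" if isinstance(v, int) else f"{k}={float(v)}"
        for k, v in fields.items()
    )
    return f"{measurement}{tag_part} {field_part} {ts_ns}"


def parse_bqueues(text):
    """Return {queue: {"pend": n, "run": n}} from bqueues-style tabular text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    pend_i, run_i = header.index("PEND"), header.index("RUN")
    counts = {}
    for ln in lines[1:]:
        cols = ln.split()
        counts[cols[0]] = {"pend": int(cols[pend_i]), "run": int(cols[run_i])}
    return counts


if __name__ == "__main__":
    now = time.time_ns()
    host = socket.gethostname()
    # System load, to be compared against the natively collected values.
    load1, load5, load15 = os.getloadavg()
    print(to_line("load_poc", {"host": host},
                  {"load1": load1, "load5": load5, "load15": load15}, now))
    # Cluster-wide LSF job counts, one point per queue (sample data here).
    for queue, states in parse_bqueues(SAMPLE_BQUEUES).items():
        print(to_line("lsf_jobs", {"queue": queue}, states, now))
```

In a real deployment the sample text would be replaced by the live output of the scheduler command, and the measurement and tag names would follow whatever naming the existing InfluxDB data already uses.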
Qualifications and Skills:
- B.Tech, MS, or PhD degree in Computer Science or a similar field.
- 5-8 years of strong hands-on experience in Grafana, InfluxDB, and Telegraf (the main tech stack).
- Hands-on experience in Python and Bash scripting would be a plus.
- Knowledge of Docker and Google Cloud (Compute Engine, GCS) would be a plus.
- HPC operations experience is good to have.
- Good communication skills and ability to work independently
- Expertise in understanding and analyzing requirements
- Ability to incorporate automated testing into development and maintenance procedures
- Ability to write efficient, secure, well-documented, and clean Python code
- Proficiency with modern development tools, like Git
- Experience with both consuming and designing pipeline setups
- Ability to suggest enhancements or changes needed to keep up with modern security and development best practices
- GCP experience is good to have
Responsibilities:
- Developing Grafana Dashboards
- Utilizing Telegraf to ingest data into InfluxDB.
- Leveraging the use cases of Grafana and Telegraf
- Build Grafana dashboard landing page for the HPC cluster.
- “Drill down” dashboards for each server including:
  - Memory, Network, and CPU utilization
  - GPU metrics (power, utilization, memory used)
- Exploration of other useful metrics that exist “out of the box” with current InfluxDB data
- Python scripts to ingest external data into InfluxDB.
- Documentation of all the things developed.
- Visualization in Grafana of various “non-native” resources:
  - Server-specific metrics like the IB packet data mentioned above
  - Cluster-wide metrics like the LSF jobs mentioned above
- [Bonus - Optional] Explore and deliver plugins from third-party vendors like DDN’s Lustre, Mellanox fabric, etc.
- Excellent teamwork and communication abilities
- Write backend code and scripts in languages like Python and Bash
- Maintains high standards of quality for code, functional specification documentation, and deliverables
- Self-motivated and self-managing, with strong organizational skills
- Ability to work with tight deadlines and multiple competing priorities
- Write efficient, secure, clean, scalable, and robust Python code
- Test and troubleshoot the pipeline to ensure its performance.
- Ability to optimize the pipeline for performance
- Interact with development teams to develop a strong understanding of the project and testing objectives.
- Participate in troubleshooting of issues with different teams to drive towards root cause identification and resolution
- Documentation skills to track the development and implementations
- Effective communication skills: regularly achieve consensus with peers and provide clear status updates.
Desired Skills:
- Grafana Dashboard [Must]
- InfluxDB [Must]
- Telegraf [Must]
- Python [Good to have]
- Bash scripting [Good to have]
- Docker [Must]
- LSF Jobs, HPC Operations work [Good to have]
- DDN’s Lustre, Mellanox fabric [Good to have]
- Google Cloud Platform [Good to have]
- Experience in building Google cloud solutions and/or microservices. [optional]
- Knowledge of Git [Must]
Thanks and Regards,
TestingXperts (Tx) – Next Gen Digital Assurance and Quality Engineering Company