Production Reliability Engineer - Trade Desk

Job Description

As part of our Trade Desk technical operations team, the Trading Systems Site Reliability Engineer will have primary responsibility for managing the real-time production trading environment for Jump Trading. The role requires deep technical and operational knowledge across all areas of the trading platform in order to proactively monitor and troubleshoot our trading system, deploy changes to our production environment while minimizing operational risk, and implement tools and processes to drive continuous improvement.

What you’ll do:

The Trading Systems SRE will solve complex problems that require both technical and business understanding. The engineer will work with traders, back-office teams, exchanges, and developers to optimize the trading environment and investigate and solve system issues.

  • Own the production environment, driving performance, reliability, and operability through continuous improvement
  • Proactively monitor and troubleshoot large-scale trading systems and exchange connectivity
  • Build and maintain a DevOps toolkit for the production trading system, including configuration management, process management, deployment, monitoring, data collection, and analysis
  • Leverage firm-wide metrics to improve scalability and system performance
  • Collaborate across the technology organization to analyze and troubleshoot complex system problems
  • Work closely with Risk Management and Operational Trading Support teams to coordinate changes and manage incidents
  • Interact directly with traders to communicate and drive technology changes, manage incidents, and troubleshoot problems
  • Work with Clearing team to reconcile trades and position breaks
  • Assess and manage the operational risk of changes to the production environment
  • Define and document processes and procedures
  • Provide mentorship and cross-training to other technical operations SREs
  • Other duties as assigned or needed

Skills you’ll need:

  • Degree in Computer Science, a related field, or equivalent professional experience
  • At least 5 years of relevant work experience in an IT ops role, such as DevOps, SRE, Linux Systems Engineering, or Network Engineering
  • Fluency in Python and shell scripting
  • Familiarity with C++ helpful but not required
  • A rigorous, detail-oriented approach to operations
  • Strong understanding of the Linux operating system, including network and system configuration, kernel internals, scheduling, and performance tuning
  • Strong understanding of networking concepts such as routing, multicast, LLDP, VLAN tagging, and Ethernet
  • A deep sense of ownership and urgency
  • Ability to handle shared operational and periodic on-call duties
  • Reliable and predictable availability