Senior Site Reliability Engineer Job at Mango, Inc., Los Angeles, CA

  • Mango, Inc.
  • Los Angeles, CA

Job Description

We are seeking a Senior Site Reliability Engineer to own and evolve the infrastructure that supports our on-premise instruments, data systems, and machine learning pipelines. This role combines systems-level engineering with software craftsmanship, requiring a deep understanding of how compute, storage, and networking layers interact under real workloads.

You will be the go-to expert for diagnosing performance issues in our on-prem systems, from kernel-level I/O bottlenecks to distributed service latency, and for building robust automation that keeps our systems consistent and observable.

Key Responsibilities

Infrastructure Design & Reliability
Design, deploy, and maintain our on-premise and hybrid infrastructure, which includes Dell PowerEdge and PowerVault servers, prosumer NAS units, and high-throughput data processing clusters. Implement fault-tolerant systems with reproducible deployments and clear observability.

Performance & Systems Analysis
Investigate complex performance issues across hardware, OS, and software boundaries. You will use Linux tooling in addition to in-house application-level metrics to uncover root causes in filesystems, caching layers, or I/O scheduling.

Automation & Tooling
Build automation for system provisioning, configuration management, and software deployment using Python, Go, Ansible, or similar frameworks. Develop lightweight services and tools that make reliability visible and maintainable.

Collaboration
Work closely with our software and hardware teams to co-design systems that meet the needs of high-resolution imaging and ML inference workloads. Translate hardware realities into software reliability guarantees.

Observability & Incident Response
Develop and maintain monitoring, alerting, and logging systems to ensure early detection of issues. Lead incident response and post-mortem efforts with a focus on learning and prevention.

Documentation & Communication
Produce clear documentation and communicate findings effectively to the broader team, from network topology diagrams to kernel tuning rationales.

General Qualifications

  • Deep understanding of Linux systems and performance (I/O schedulers, RAID, caching, NUMA, kernel parameters).
  • Hands-on experience designing and managing on-premise servers, storage arrays, or HPC clusters.
  • Comfort with automation and software development (Python, Go, Bash, or similar).
  • Strong diagnostic and analytical skills: ability to decompose performance problems across multiple layers.
  • Proven track record of improving system reliability, throughput, and maintainability in a fast-paced environment.
  • Excellent written and verbal communication skills for cross-disciplinary collaboration.
  • Self-driven, curious, and motivated by understanding systems deeply rather than just maintaining them.

Bonus Qualities (Not Required)

  • 5-10 years of relevant industry experience in systems engineering, SRE, or infrastructure software roles.
  • Experience tuning Linux filesystems (ext4, btrfs) and software RAID (mdadm).
  • Familiarity with containerization and orchestration (Docker, Compose, Kubernetes).
  • Knowledge of networking fundamentals (VLANs, bonding, LACP, 10 GbE/40 GbE).
  • Experience supporting data-heavy scientific or ML workloads.
  • Demonstrated technical leadership, mentoring others in debugging, reliability, or performance analysis.
