AI Hardware Systems Engineer, Annapurna Labs, Trainium Machine Learning Fleet Operations
Company: Amazon
Location: Austin
Posted on: April 1, 2026
Job Description:
Annapurna Labs designs silicon and software that accelerates
innovation. Customers choose us to create cloud solutions that
solve challenges that were unimaginable a short time ago—even
yesterday. Our custom chips, accelerators, and software stacks
enable us to take on technical challenges that have never been seen
before, and deliver results that help our customers change the
world. At Annapurna Labs, we are at the forefront of
hardware/software co-design, not just in Amazon Web Services (AWS)
but across the industry.

The Machine Learning Acceleration Fleet
Operations Team is looking for candidates interested in diving deep
into our fleet of ML servers deployed around the world. We are
seeking an engineer who is comfortable debugging emergent problems
in GPU and server hardware, writing scripts in languages such as
Python or Bash, running large-scale experiments on a fleet of
complex hardware, developing data infrastructure and analyzing
trends, and developing automation software to scale operations. Our
team has end-to-end ownership of some of the most advanced server
hardware in the world. We drive technical debug efforts and write
truly massive-scale autonomous software to monitor, optimize, and
remediate machine learning hardware. Come join us!

Key job responsibilities:
- Member of a team responsible for system remediation, operational excellence, and customer experience on bleeding-edge ML products
- Use data to root-cause hardware failures and identify live trends on the most complex systems in AWS
- Implement and improve system-level testing across the product lifecycle
- Develop software that can be maintained, improved upon, documented, tested, and reused
- Dive deep on issues at the intersection of hardware and software

A day in the life:
As a
Platform Development Engineer, you are the dedicated owner of an ML
server platform in our fleet. Your mission is to maximize its
health, sellability, and customer experience. You start each day
with eyes on the fleet — reviewing dashboards to identify trends
and triaging emergent issues, then partnering with hardware and
software engineering teams to debug, investigate, and translate
findings into permanent fixes. You own the end-to-end testing story
and manage tradeoffs between coverage and velocity. You direct new
automations, tooling, and data infrastructure to scale your
operations. You manage software deployments, debug issues with
them, and run status meetings to align all platform stakeholders on
how the product is performing.

About the team:
The MLA Fleet
Operations team was formed to maintain an exceptionally high
quality bar for our fleet of advanced machine learning accelerators
and server products. We perfect the customer experience by
developing scalable software for rapid incident response times and
data visualization as well as diving deep into hardware issues as
they arise.

Qualifications:
- 2 years of non-internship professional software development experience
- 1 year of experience designing or architecting (design patterns, reliability, and scaling) new and existing systems
- 1 year of administrative experience in networking, storage systems, and operating systems, plus hands-on systems engineering experience
- Knowledge of systems engineering fundamentals (networking, storage, operating systems)
- Experience programming with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, or Ruby
- Experience with Linux/Unix
- Experience with debugging and systems analysis to identify and quickly resolve or mitigate issues
- Bachelor's degree in Computer Science, Computer Engineering, or Electrical Engineering
- Experience in hardware design and validation of components, subsystems, and systems
- Experience with SOC bring-up and post-silicon validation
- Master's degree in Computer Science, Computer Engineering, or Electrical Engineering

Amazon is an equal opportunity employer and
does not discriminate on the basis of protected veteran status,
disability, or other legally protected status. Our inclusive
culture empowers Amazonians to deliver the best results for our
customers. If you have a disability and need a workplace
accommodation or adjustment during the application and hiring
process, including support for the interview or onboarding process,
please visit
https://amazon.jobs/content/en/how-we-hire/accommodations for more
information. If the country/region you’re applying in isn’t listed,
please contact your Recruiting Partner.

The base salary range for
this position is listed below. Your Amazon package will include
sign-on payments and restricted stock units (RSUs). Final
compensation will be determined based on factors including
experience, qualifications, and location. Amazon also offers
comprehensive benefits including health insurance (medical, dental,
vision, prescription, Basic Life & AD&D insurance and option
for Supplemental life plans, EAP, Mental Health Support, Medical
Advice Line, Flexible Spending Accounts, Adoption and Surrogacy
Reimbursement coverage), 401(k) matching, paid time off, and
parental leave. Learn more about our benefits at
https://amazon.jobs/en/benefits.

USA, TX, Austin - 136,000.00 - 184,000.00 USD annually
Keywords: Amazon, Killeen, AI Hardware Systems Engineer, Annapurna Labs, Trainium Machine Learning Fleet Operations, IT / Software / Systems, Austin, Texas