Failure Analysis Validation Engineer, Fremont, CA
We are seeking a highly motivated and skilled engineer to join our team. The ideal candidate will have a strong background in managing server hardware, including network, storage, compute, and AI hardware, as well as experience validating failed server hardware. (This is an onsite role; the candidate is expected to be present in person every day of the work week. There is no remote-work option.)
Roles and Responsibilities:
Manage and maintain a fleet of server racks from different OEMs (network, storage, compute, and AI hardware).
Interface with OEM vendors for firmware and driver update related maintenance.
Support failure analysis initiatives by using available HW resources to validate rack-level, system-level, and module-level failures from our client's different datacenters.
Manage and maintain network infrastructure for the lab, including switches, routers, and firewalls.
Configure and manage network protocols, such as TCP/IP, DNS, and DHCP.
Ensure network security and compliance with company policies and industry standards.
Work with LLMs and popular frameworks such as TensorFlow or PyTorch.
Design and implement containerized applications using Docker and Kubernetes.
Manage and maintain virtual machines using popular hypervisors, such as VMware or KVM.
Support the failure analysis labs, including inventory management, safety audits, and maintaining access controls to critical server hardware.
Support root cause analysis and diagnosis of hardware/software issues. Isolate failures in platform, firmware, BIOS, CPLD, and other applications.
Work with DediProg tools (FW/BIOS debug).
Provide regular updates to the failure analysis lead and collaborate with the team on different mission-critical projects.
Qualifications:
Bachelor’s or master’s degree in Computer Science, Electrical Engineering, or a related field.
5+ years of experience in server rack management, lab infrastructure management, and/or related fields.
Experience with debugging and troubleshooting complex hardware issues, including storage, compute, and AI.
Strong experience with Linux (RedHat, Fedora, CentOS, etc.) or Unix operating systems.
Experience with scripting languages, such as Python, PowerShell, PHP, Perl, etc.
Experience working with containerization (Docker, Kubernetes) and virtual machine management.
Experience with failed server hardware validation, including BIOS/CPLD FW debug.
Knowledge of network protocols, including TCP/IP, DNS, and DHCP.
Strong knowledge of server hardware components, including motherboards, power distribution boards, and storage systems.
Strong problem-solving skills and ability to work independently.
Excellent communication and documentation skills.
Contact Kory Kiviharju with TEKsystems directly at kkiviharju@teksystems.com. Thank you!
Pay and Benefits
The pay range for this position is $60.00 - $90.00.
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to specific elections, plan, or program terms. If eligible, the benefits available for this temporary role may include the following:
Medical, dental & vision
Critical Illness, Accident, and Hospital
401(k) Retirement Plan – Pre-tax and Roth post-tax contributions available
Life Insurance (Voluntary Life & AD&D for the employee and dependents)
Short and long-term disability
Health Spending Account (HSA)
Transportation benefits
Employee Assistance Program
Time Off/Leave (PTO, Vacation or Sick Leave)
Workplace Type
This is a fully onsite position in Fremont, CA.
Application Deadline
This position will be accepting applications until Feb 3, 2025.
About TEKsystems:
We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.
The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.