Staff Engineer, Reliability
Job Details
- Location: Hyderabad, Telangāna, IN
- Category: Information Technology
- Employment Type: Full time
- Job Ref: R2624804-333
We’re determined to make a difference and are proud to be an insurance company that goes well beyond coverages and policies. Working here means having every opportunity to achieve your goals – and to help others accomplish theirs, too. Join our team as we help shape the future.
The Cloud Services Team is searching for a Reliability Engineer. The candidate must have hands-on experience operating and engineering services on Google Cloud Platform (GCP), including data, compute, and observability services. The team is accountable for the operations, engineering, and governance of 200+ cloud technologies across a multi-cloud environment. The role involves helping mature operational practices for GCP workloads as part of our multi-cloud strategy. This is an excellent opportunity for someone interested in a mix of strategy and hands-on work. The ideal candidate should feel comfortable working with teammates at all levels of the organization, including leadership.
Key Responsibilities
Assist in the development, maintenance, and operation of 200+ infrastructure services across our cloud transformation landscape.
Develop enterprise solutions such as Cyber Protection, Disaster Recovery, and Security enhancements, and drive their adoption across line-of-business teams.
Use automation to make software delivered as a service more efficient and simpler to operate.
Provide clear operational documentation and construction/support specifications to the IT user base.
Provide insight into operational metrics across the entire cloud environment.
Consult with customers on new requirements, design questions, and functionality configurations for environments on and off premises.
Deliver the tooling and capabilities needed to enable the roadmap and strategy for cloud compliance, metrics and reporting, and cost management.
Participate in incident resolution and change implementation as necessary. This may occasionally include support during non-standard hours.
Operate and improve reliability for production workloads running on Google Cloud Platform (GCP), focusing on availability, scalability, and operational readiness rather than application development.
Own day‑to‑day operational concerns for core GCP services including Compute Engine, GKE, Cloud Run, BigQuery, Cloud Storage, and supporting platform services.
Provide operational support for BigQuery platforms including job performance troubleshooting, capacity planning, quota management, dataset permissions, and cost optimization (slot usage, reservations, and quotas).
Support Vertex AI platforms from an operations and reliability standpoint, including environment readiness, access controls, monitoring, pipeline execution health, and incident response (not model development).
Build and maintain observability standards using Cloud Monitoring, Cloud Logging, Error Reporting, and custom SLI/SLO dashboards for GCP workloads.
Implement alerting strategies aligned to error budgets and production reliability goals; reduce alert noise and prevent toil.
Execute incident response, triage, and post‑incident analysis for GCP services, contributing to PIRs and corrective actions.
Develop and maintain runbooks, operational playbooks, and escalation workflows for GCP services.
Drive automation-first operations, including self‑healing patterns using Cloud Functions, Cloud Run jobs, Scheduler, and event‑driven remediation.
Enforce and operate GCP security and governance controls, including IAM, service accounts, Org Policies, VPC Service Controls, KMS, Secret Manager, and networking guardrails.
Partner with engineering and data teams to review designs for operability, resiliency, and supportability, ensuring workloads meet production readiness standards before launch.
Required Skills & Experience
Expert understanding of how applications should be engineered, following best practices for fault tolerance, separation of duties, observability, and operator friendliness.
Self-motivated and results-oriented, with the ability to work both independently and in a team environment.
Strong hands-on experience with BigQuery, including performance tuning, cost management, and governance.
Experience with Vertex AI, including pipelines, model deployment, model monitoring, and integration with BigQuery.
Deep knowledge of Cloud IAM, service accounts, Workload Identity Federation, and principle-of-least-privilege controls.
Experience with GKE operations (clusters, node pools, autoscaling, workload identity, Istio/Anthos optional).
Understanding of Cloud Storage, Pub/Sub, Dataflow, Dataproc, and Cloud Composer for data/ML workflows.
Experience building CI/CD pipelines targeting GCP using Cloud Build, Artifact Registry, and Terraform.
Ability to troubleshoot GCP networking: VPCs, firewall rules, private service access, interconnects/VPN.
Nice to Have
Intermediate knowledge of Terraform and CloudFormation.
Intermediate Microsoft Office skills.
Hands-on experience with advanced GCP services such as Vertex AI, BigQuery, Dataflow, Pub/Sub, Cloud Run, and GKE.
Experience creating org-level policies, security baselines, and automation patterns for GCP environments.
What We Offer
Collaborative work environment with global teams.
Competitive compensation and comprehensive benefits.
Continuous learning and growth opportunities in geospatial and risk analytics technologies.
About Us
We believe every day is a day to do right.
And that belief has guided us for over 200 years. Showing up for people isn’t just what we do, it’s who we are. We’re devoted to finding innovative ways to serve our customers, communities and employees – continually asking ourselves what more we can do.
And while how we contribute looks different for each of us, it’s these values that drive all of us to do more and to do better every day.