Sr. Site Reliability - ElasticSearch/Graphite/Prometheus - Meraki
Location: San Francisco, California, US
Area of Interest: Engineer - Software
Technology Interest: Cloud and Data Center, Internet of Everything
The Meraki cloud serves millions of customer devices from 8 datacentres around the world. As a Senior Site Reliability Engineer on the Observability team you will be responsible for designing useful, scalable and secure monitoring systems that make sure we stay online. You’re passionate about data, and about using automation to raise the bar. You will lead the design, development and operational aspects of the monitoring, log/event collection, and metric processing systems which support our private cloud. We believe in automating manual tasks with the right tools.
As SREs at Meraki we are responsible for building and scaling the cloud that supports millions of Meraki devices across the world. Meraki’s customer base has grown by a factor of 2-3 every year, serving more than 4 billion HTTP requests per day across six datacentres. Our customers depend on our products to run their critical infrastructure of network switches, security appliances, wireless APs and security cameras. We embrace the *nix way, automate away tedious tasks and build infrastructure as code.
Example projects of a Senior Site Reliability Engineer (Observability):
● Lead the discussion on evolving our Graphite architecture to handle the next five years of metric growth.
● Design and build ElasticSearch clusters holding 10-1000TB of data, for a variety of use cases.
● Gather requirements, then design and build an alerting system that lets developers construct alerts and alerting workflows from multiple data sources.
● Develop comprehensive meta-monitoring tools that provide new insights into our complex event and metric pipelines.
● Write libraries and APIs that provide a simple, unified interface to other developers when they use our monitoring, logging and event processing systems.
● Automate cluster scaling so monitoring resources can be requested and deployed automatically.
You are an ideal candidate if you:
● Have 6+ years of experience designing, deploying and operating mid- to large-scale enterprise or cloud environments.
● Have 3+ years of experience scripting or coding in languages like Ruby, Scala, Python, or Bash.
● Fearlessly dive into other people's source code to solve a problem.
● Know your way around *nix systems. We run Debian.
● Consult with other teams on how they can better monitor their services, and evangelize best practices.
● Automate all the things.
● Care about and empathise with the customer experience, and have experience supporting an externally facing production environment, ideally in a team that follows the sun.
● Bonus points for experience with: ElasticSearch, Logstash, Kibana, Graphite, Grafana, statsd, collectd, Snowflake, Ansible, Ruby.
Keywords: Observability, Monitoring, SRE, Site Reliability Engineering, DevOps, ElasticSearch, Logstash, Kibana, ELK, Grafana, Graphite, statsd, collectd, Snowflake, Ansible, Ruby.
Cisco is an Affirmative Action and Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis.
At Cisco Meraki, we don't just accept difference - it's one of our key values. Everybody In means we listen to each other's opinions. Everybody is accepted and valued here, and we are a team that works as one towards our goals. We recognize that diverse teams make the strongest teams, and we encourage people from all backgrounds to apply.