Sr. Site Reliability Engineer - ElasticSearch/Graphite/Prometheus - Meraki

  • Location
    San Francisco, California, US
  • Area of Interest
    Engineer - Software
  • Job Type
  • Technology Interest
    Cloud and Data Center, Internet of Everything
  • Job Id
The Meraki cloud serves millions of customer devices from 8 datacentres around the world. As a Senior Site Reliability Engineer on the Observability team you will be responsible for designing useful, scalable and secure monitoring systems that make sure we stay online. You’re passionate about data, and about using automation to raise the bar.

You will lead the design, development and operational aspects of the monitoring, log/event collection, and metric processing systems which support our private cloud. We believe in automating manual tasks with the right tools.

As SREs at Meraki we are responsible for building and scaling the cloud that supports millions of Meraki devices across the world. Meraki’s customer base has grown by a factor of 2-3 every year, serving more than 4 billion HTTP requests per day across six datacentres. Our customers depend on our products to run their critical infrastructure of network switches, security appliances, wireless APs and security cameras. We embrace the *nix way, automate away tedious tasks and build infrastructure as code.

Example projects of a Senior Site Reliability Engineer (Observability): 
● Lead the discussion around our Graphite architecture to handle the next five years of metric growth.
● Design and build ElasticSearch clusters holding 10-1000TB of data, for a variety of use cases.
● Gather requirements, design and build an alerting system that allows developers to construct alerts - from multiple data sources and alerting workflows.
● Develop comprehensive meta-monitoring tools that provide new insights into our complex event and metric pipelines.
● Write libraries and APIs that provide a simple, unified interface to other developers when they use our monitoring, logging and event processing systems.
● Automate cluster scaling so monitoring resources can be requested and automatically deployed.

You are an ideal candidate if you:
● Have 6+ years of experience designing, deploying and operating mid- to large-scale enterprise or cloud environments.
● Have 3+ years of experience scripting or coding with languages like Ruby, Scala, Python, or Bash.
● Fearlessly dive into other people's source code to solve a problem.  
● Know your way around *nix systems. We run Debian.  
● Consult with other teams on how they can better monitor their service. Evangelize best 
● Automate all the things.
● Care about and empathise with the customer experience, and have experience supporting an externally-facing production environment, ideally in a team that follows the sun.
● Bonus points for experience with: ElasticSearch, Logstash, Kibana, Graphite, Grafana, statsd, collectd, Snowflake, Ansible, Ruby.

Keywords: Observability, Monitoring, SRE, Site Reliability Engineering, DevOps, ElasticSearch, Logstash, Kibana, ELK, Grafana, Graphite, statsd, collectd, Snowflake, Ansible, Ruby.

Cisco is an Affirmative Action and Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis.

At Cisco Meraki, we don't just accept difference - it's one of our key values. Everybody In means we listen to each other's opinions. Everybody is accepted and valued here, and we are a team that works as one towards our goals. We recognize that diverse teams make the strongest teams, and we encourage people from all backgrounds to apply.