- Don’t just restart things to “fix issues”. You need to know what’s wrong, debug the issue, understand the problem.
- Humans should not be paged in the middle of the night if they can’t do anything about the problem
- If you want someone to babysit an application, you are doing something wrong. Please, have proper metrics and automation instead of human beings looking at charts.
- If something goes wrong 0.1% of the time, for something as huge as AWS that means several times a day! Fix all the bugs or build automation to recover your software automatically.
- If a human has to type a command, one day they will type it wrong and destroy something
- You automate with scripts (you have code review, you have the same process every time). You don’t automate humans with documentation
- Code is more important than documentation
- Documentation is more important than tribal knowledge
- If it can wait a few hours, push back!
- You need to understand how Linux/Unix works!!!
- You must know how shell scripting works and write a couple of scripts yourself. Automate everything!! Do you need to start your local env every morning? Better have a script for that! Do you need to run tests before pushing code? Automate it! If you type the same set of commands twice, automate it!
- If you have shell scripts that will help your team, share them!!
- Share your automation!
- Ask people to share their automation!
- Build tools to build automation!
- Build tools to build tools!
- Try first to build frameworks / generic software
- Better to recover from a problem automatically than to ask a human to evaluate it!
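
To make the "recover automatically, page a human only when nothing else works" idea from the list above a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not how any real system did it: `recover_or_escalate`, `check`, `recover`, and `page_human` are hypothetical names, and the retry count and wait time are arbitrary assumptions.

```python
import time
from typing import Callable


def recover_or_escalate(
    check: Callable[[], bool],
    recover: Callable[[], None],
    page_human: Callable[[str], None],
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
) -> bool:
    """Try a scripted recovery a few times; page a human only as a last resort.

    All three callables are hypothetical hooks supplied by the caller:
    - check: returns True when the service is healthy
    - recover: runs one reviewed, scripted recovery action
    - page_human: sends a page with a short message
    """
    for _ in range(max_attempts):
        if check():
            return True  # healthy, nobody needs to wake up
        recover()  # same reviewed action every time, no hand-typed commands
        time.sleep(wait_seconds)  # give the service time to come back

    # Final check after the last recovery attempt.
    if check():
        return True

    # Automation is out of options: now a human can actually do something.
    page_human(f"automated recovery failed after {max_attempts} attempts")
    return False


if __name__ == "__main__":
    # Toy service that becomes healthy after two recovery actions.
    state = {"failures_left": 2}

    def check() -> bool:
        return state["failures_left"] == 0

    def recover() -> None:
        state["failures_left"] -= 1

    recovered = recover_or_escalate(check, recover, page_human=print)
    print("recovered without paging:", recovered)
```

The point of the design is that the page is the exception path: the human is only involved once the scripted action has been tried and has demonstrably not helped.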
It was January 2015; I was in the middle of my master’s. I had just started working again for CI&T when I received a message from an Amazon recruiter. There would be a hiring event in São Paulo, and they wanted to invite me to the process. I was not interested in moving to the USA, but working for a big company would allow me to work in a different environment. Instead of using frameworks to build services for banks, I would be able to write those frameworks/tools myself.
I decided to try out the tests/interviews.
I ended up passing, and the salary they offered would allow me to live a completely different life. And I don’t mean a life with cars, houses and that type of shit, but instead I would have some real financial independence. I decided to accept the offer.
Because the whole US work visa process is random, I ended up not being selected. So Amazon offered me a position in Europe instead (YAY). I could pick Germany or Ireland. I did some research, and at that time, the work in Ireland seemed more interesting to me.
That’s how I ended up working on the Redshift team.
The first year was a hard one: not only was I starting a new job, but I was also outside my home country, speaking a foreign language. But at the same time, I was lucky. Almost everybody on my team was new as well. We were all getting used to the whole thing, so we bonded. The team next to ours was also pretty new, so we also spent time outside work together. Those people were excellent, and I owe them a lot. Also, during our oncalls I had to deal mainly with hardware/system issues, so I learnt a lot during that time.
In my first two months, our team was basically doing oncall and trying to deliver something between the shifts. Every member was primary oncall for a whole week every month and “ops” oncall for another week (which meant working on the tasks no one wanted, to “keep the lights on”). Oncall was heavy, a couple of tickets per hour, but our shift would last only from 8:00 to 16:00 every day. A bit better than a 24-hour setup.
The environment was very nice: 8 hours of work and that’s it, no long hours, no crazy amount of performance reviews. Everybody was entitled to make mistakes and learn from them.
As is publicly known, life at Amazon is not always easy. The team next to ours had pretty bad managers. People were having a lot of trouble, and it was no secret that the whole team was not happy. Most of them ended up leaving the company, including very good friends of mine.
Our team had some problems in the past as well, especially regarding people in the USA. At the beginning, they saw our team as “the folks who do oncall during our night”, and for a long time, we had to fight for good projects. This made some folks on our team leave the company or look for new organizations inside Amazon.
That changed with new upper management; even the product changed. When I joined, the engineers were 100% focused on delivering new features. With the new managers, we started focusing on making Redshift stable first (our oncalls were pretty hard, with hundreds of tickets per week). Also, automation was finally a priority, and we were given the green light to implement our ideas for improving oncall.
Our local manager was also pretty good, so we never had the same issues as the team next to ours. He was the type of manager that you can trust.
Our team also grew, and a new office was opened in Berlin. I had the opportunity to train some folks there. Oncall was reduced to a few days every 2 months. Sometimes not even that, since we were onboarding more and more people in Berlin.
I mainly worked on three fronts of the system (more details below), but I also wrote code for various other issues/features, mainly using Java.
The three fronts:
- OpsConsole - Lead developer: Rails application that helped oncall engineers perform actions on Redshift clusters or check their state
- Autorecovery: I was part of the team building auto-recovery actions and the tooling around them for Redshift clusters. The main idea was to use an event processor to act based on the events emitted by the cluster or adjacent systems (such as EC2)
- Canary - Lead developer: I took over an old Ruby application that ran a small number of API calls against our systems and turned it into a canary app that, besides frequently checking all the APIs, is also aware of the system’s workflows (so it also performs actions in a logical flow to catch bugs, such as: create a cluster, run operations on the cluster, and delete the cluster)
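
The Autorecovery idea in the list above (an event processor that acts on events emitted by the cluster or by adjacent systems) can be sketched roughly as follows. This is a toy illustration under my own assumptions, not the actual Redshift implementation: the `Event` fields, the event names, the `EventProcessor` class, and the recovery action are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    source: str      # hypothetical label, e.g. "cluster" or "ec2"
    kind: str        # hypothetical event name, e.g. "instance_retired"
    cluster_id: str  # which cluster the event is about


# A handler takes an event and returns a description of the action it took.
Handler = Callable[[Event], str]


class EventProcessor:
    """Maps (source, kind) pairs to recovery actions."""

    def __init__(self) -> None:
        self._handlers: Dict[Tuple[str, str], Handler] = {}
        # Events nobody knows how to handle yet; candidates for new automation.
        self.unhandled: List[Event] = []

    def register(self, source: str, kind: str, handler: Handler) -> None:
        self._handlers[(source, kind)] = handler

    def process(self, event: Event) -> Optional[str]:
        handler = self._handlers.get((event.source, event.kind))
        if handler is None:
            self.unhandled.append(event)
            return None
        return handler(event)


def replace_node(event: Event) -> str:
    # Placeholder: a real action would call the system's own tooling here.
    return f"replace-node:{event.cluster_id}"


if __name__ == "__main__":
    processor = EventProcessor()
    processor.register("ec2", "instance_retired", replace_node)

    print(processor.process(Event("ec2", "instance_retired", "cluster-1")))
    print(processor.process(Event("cluster", "something_new", "cluster-2")))
    print("unhandled events:", len(processor.unhandled))
```

Keeping the unhandled events around is a deliberate choice in this sketch: every event that still needs a human is a hint about which recovery action to automate next.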