“The way that Spotify develops software was first described by Henrik Kniberg and Anders Ivarsson in Scaling Agile @ Spotify with Tribes, Squads, Chapters & Guilds. Their 2012 paper provided a snapshot of the way of working at that time at Spotify.” Ben Linders
Spotify Engineering Culture - Part 1
One of the big success factors here at Spotify is our agile engineering culture.
Culture tends to be invisible. We don't notice it because it's there all the time, kind of like the air we breathe. But if everyone understands the culture, we're more likely to be able to keep it and even strengthen it as we grow, so that's the purpose of this video.
When our first music player was launched in 2008, we were pretty much a scrum company. Scrum is a well-established agile development approach and it gave us a nice team-based culture. However, a few years later we had grown into a bunch of teams and found that some of the standard scrum practices were actually getting in the way, so we decided to make all this optional.
Rules are a good start, but then break them when needed.
We decided that agile matters more than scrum, and agile principles matter more than any specific practices, so we renamed the scrum master role to agile coach, because we wanted servant leaders more than process masters. We also started using the term squad instead of scrum team and our key driving force became autonomy.
So what is an autonomous squad? A squad is a small, cross-functional, self-organizing team, usually less than eight people. They sit together and they have end-to-end responsibility for the stuff they build - design, commit, deploy, maintenance, operations - the whole thing.
Each squad has a long-term mission, such as “make Spotify the best place to discover music” or internal stuff like “infrastructure for A/B testing”. Autonomy basically means that the squad decides what to build, how to build it and how to work together while doing it.
There are of course some boundaries to this, such as the squad mission, the overall product strategy for whatever area they are working on, and short term goals that are renegotiated every quarter.
Our office is optimized for collaboration. Here's a typical squad area. The squad members work closely together, with adjustable desks and easy access to each other's screens. They gather over here in the lounge for things like planning sessions and retrospectives, and back there is a huddle room for smaller meetings or just to get some quiet time. Almost all walls are whiteboards.
So why is autonomy so important? Well, because it's motivating, and motivated people build better stuff. Also, autonomy makes us fast by letting decisions happen locally in the squad, instead of via a bunch of managers and committees and stuff. It helps us minimize hand-offs and waiting, so we can scale without getting bogged down with dependencies and coordination.
Although each squad has its own mission, they need to be aligned with product strategy, company priorities and other squads. Basically, be a good citizen in the Spotify ecosystem.
Spotify’s overall mission is more important than any individual squad, so the key principle is really: be autonomous, but don't sub-optimize. It's kind of like a jazz band. Although each musician is autonomous and plays his own instrument, they listen to each other and focus on the whole song together. That's how great music is created.
So our goal is loosely coupled but tightly aligned squads. We're not all there yet, but we experiment a lot with different ways of getting closer. In fact that applies to most things in this video. This culture description is really a mix of what we are today and what we are trying to become in the future.
Alignment and autonomy may seem like different ends of a scale, as in more autonomy equals less alignment. However, we think of it more like two different dimensions.
Down here is low alignment and low autonomy. A micromanagement culture, no high-level purpose, just shut up and follow orders.
Up here is high alignment but still low autonomy. So leaders are good at communicating what problem needs to be solved, but they are also telling people how to solve it.
High alignment and high autonomy means leaders focus on what problem to solve, but let the teams figure out how to solve it.
What about down here then? Low alignment and high autonomy means teams do whatever they want and basically all run in different directions. Leaders are helpless and our product becomes a Frankenstein.
We're trying hard to be up here, aligned autonomy, and we keep experimenting with different ways of doing that.
Alignment enables autonomy. The stronger alignment we have, the more autonomy we can afford to grant. That means the leader's job is to communicate what problem needs to be solved and why, and the squads collaborate with each other to find the best solution.
One consequence of autonomy is that we have very little standardization. When people ask things like “which code editor do you use?” or “how do you plan?”, the answer is mostly “depends on which squad”. Some do scrum sprints, others do Kanban, some estimate stories and measure velocity, others don't. It's really up to each squad.
Instead of formal standards, we have a strong culture of cross-pollination. When enough squads use a specific practice or tool, such as Git, that becomes the path of least resistance and other squads tend to pick the same tool. Squads start supporting that tool and helping each other, and it becomes like a de-facto standard. This informal approach gives us a healthy balance between consistency and flexibility.
Our architecture is based on over a hundred separate systems, coded and deployed independently. There's plenty of interaction, but each system focuses on one specific need, such as playlist management, search, or monitoring.
We try to keep them small and decoupled, with clear interfaces and protocols. Technically each system is owned by one squad. In fact most squads own several, but we have an internal open source model and our culture is more about sharing than owning.
Suppose squad one here needs something done in system B, and squad two knows that code best. They'll typically ask squad two to do it. However, if squad two doesn't have time or they have other priorities, then squad one doesn't necessarily need to wait. We hate waiting. Instead they are welcome to go ahead and edit the code themselves and then ask squad two to review the changes.
So anyone can edit any code, but we have a culture of peer code review. This improves quality and, more importantly, spreads knowledge. Over time, we've evolved design guidelines, code standards and other things to reduce engineering friction, but only when badly needed, so on a scale from authoritative to liberal we're definitely more on the liberal side.
Now, none of this would work if it wasn't for the people. We have a really strong culture of mutual respect. I keep hearing comments like “my colleagues are awesome”. People often give credit to each other for great work and seldom take credit for themselves. Considering how much talent we have here, there is surprisingly little ego.
One big ah-ha for new hires is that autonomy is kind of scary at first. You and your squad-mates are expected to find your own solution. No one will tell you what to do, but it turns out if you ask for help, you get lots of it and fast. There's genuine respect for the fact that we're all in this boat together and need to help each other succeed.
We focus a lot on motivation. Here's an example, an actual email from the head of people operations:
“Hi everyone, our employee satisfaction survey says 91% enjoy working here and 4% don't.” Now that may seem like a pretty high satisfaction rate, especially considering our growth pain. From 2006 to 2013 we've doubled every year and now have over 1200 people. But then he continues, “This is of course not satisfactory and we want to fix it. If you're one of those unhappy 4%, please contact us. We're here for your sake and nothing else.” So good enough isn't good enough. Half a year later things had improved and the satisfaction rate was up to 94%.
This strong focus on motivation has helped us build up a pretty good reputation as a workplace, but we still have plenty of problems to deal with. So yeah, we need to keep improving.
Okay, so we have over 50 squads spread across four cities, so some kind of structure is needed.
Currently squads are grouped into tribes. A tribe is a lightweight matrix. Each person is a member of a squad, as well as a chapter. The squad is the primary dimension, focusing on product delivery and quality. The chapter is a competency area, such as quality assistance, agile coaching or web development. As a squad member, my chapter lead is my formal line manager, a servant leader focusing on coaching and mentoring me as an engineer, so I can switch squads without getting a new manager.
It's a pretty picture, huh, except that it's not really true. In reality, the lines aren't nice and straight and things keep changing. Here's a real-life example from one moment in time for one tribe and of course, it's all different by now and that's okay. The most valuable communication happens in informal and unpredictable ways.
To support this, we also have guilds. A guild is a lightweight community of interest where people across the whole company gather and share knowledge within a specific area, for example leadership, web development or continuous delivery. Anyone can join or leave a guild at any time. Guilds typically have a mailing list, biannual conferences and other informal communication methods.
Most organizational charts are an illusion, so our main focus is community rather than hierarchical structures. We found that a strong enough community can get away with an informal, volatile structure. If you always need to know exactly who is making decisions, you're in the wrong place.
One thing that matters a lot for autonomy is how easily we can get our stuff into production. If releasing is hard, we'll be tempted to release seldom to avoid the pain. That means each release is bigger and therefore even harder. It's a vicious cycle. But if releasing is easy, we can release often. That means each release is smaller and therefore easier. To stay in this loop and avoid that one, we encourage small frequent releases and invest heavily in test automation and continuous delivery infrastructure. Releasing should be routine, not drama.
Sometimes we make big investments to make releasing easier. For example, the original Spotify desktop client was a single monolithic application. In the early days with just a handful of developers that was fine, but as we grew, this became a huge problem. Dozens of squads had to synchronize with each other for each release and it could take months to get a stable version.
Instead of creating lots of processes and rules and stuff to manage this, we changed the architecture to enable decoupled releases. Using Chromium Embedded Framework, the client is now basically a web browser in disguise. Each section is like a frame on a website and squads can release their own stuff directly.
As part of this architectural change we started seeing each client platform as a client app and evolved three different flavors of squads: client app squads, feature squads and infrastructure squads.
A feature squad focuses on one feature area, such as search. This squad will build, ship and maintain search-related features on all platforms.
A client app squad focuses on making releases easy on one specific client platform, such as desktop, iOS, or Android.
Infrastructure squads focus on making other squads more effective. They provide tools and routines for things like continuous delivery, A/B testing, monitoring and operations.
Regardless of the current structure, we always strive for a self-service model, kind of like a buffet. The restaurant staff don't serve you directly, they enable you to serve yourself, so we avoid hand-offs like the plague. For example, an operations squad or client app squad does not put code into production for people. Instead, their job is to make it easy for feature squads to put their own code into production.
Despite the self-service model, we sometimes need a bit of sync between squads when doing releases. We manage this using release trains and feature toggles. Each client app has a release train that departs on a regular schedule, typically every week or every three weeks depending on which client. Just like in the physical world, if trains depart frequently and reliably, you don't need much upfront planning; just show up and take the next train.
Suppose these three squads are building stuff, and when the next release train arrives, features A, B and C are done, while D is still in progress. The release train will include all four features, but the unfinished one is hidden using a feature toggle.
It may sound weird to release unfinished features and hide them, but it's nice because it exposes integration problems early and minimizes the need for code branches. Unmerged code hides problems and is a form of technical debt. Feature toggles let us dynamically show and hide stuff in tests as well as production.
In addition to hiding unfinished work, we use this to A/B test and gradually roll out finished features. All in all, our release process is better than it used to be, but we still see plenty of improvement areas, so we'll keep experimenting.
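The release-train scenario above can be sketched in a few lines of Python. This is only an illustration of the general feature-toggle idea, not a description of Spotify's actual tooling; the `FeatureToggles` class, the `visible_features` helper and the in-memory dictionary are all hypothetical stand-ins for whatever configuration system a real client would use.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureToggles:
    """Hypothetical in-memory toggle store (a real one would likely be remote config)."""
    enabled: dict[str, bool] = field(default_factory=dict)

    def is_on(self, feature: str) -> bool:
        # Unknown features default to hidden, so unfinished work stays dark.
        return self.enabled.get(feature, False)

    def set_state(self, feature: str, on: bool) -> None:
        self.enabled[feature] = on


def visible_features(shipped: list[str], toggles: FeatureToggles) -> list[str]:
    """Return the shipped features that a user can actually see."""
    return [name for name in shipped if toggles.is_on(name)]


# The release train carries all four features; D is merged but hidden.
toggles = FeatureToggles({"A": True, "B": True, "C": True, "D": False})
train = ["A", "B", "C", "D"]
print(visible_features(train, toggles))

# Once D is finished, flipping the toggle shows it without another release.
toggles.set_state("D", True)
print(visible_features(train, toggles))
```

Defaulting unknown features to "off" is a deliberate choice in this sketch: code for an unfinished feature can ride along on the train and get integrated early, while nothing becomes visible until someone explicitly turns it on.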
This may seem like a scary model, letting each squad put their own stuff into production without any form of centralized control, and we do screw up sometimes, but we've learned that trust is more important than control. Why would we hire someone who we don't trust? Agile at scale requires trust at scale, and that means no politics. It also means no fear. Fear doesn't just kill trust, it kills innovation, because if failure gets punished, people won't dare try new things.
So let's talk about failure. Actually no, let's take a break. Get on your feet, get some coffee, let this stuff sink in for a bit, and then come back when you're ready for part two.
Spotify Engineering Culture - Part 2
Hey! You're back. Great.
Now you've probably forgotten all about Part One, so let's do a quick recap:
Our culture is based on agile principles.
All engineering happens in squads and we try to keep them loosely coupled and tightly aligned.
We like cross-pollination and have an internal open source model for code.
Squads do small and frequent releases, which is enabled by decoupling.
Our self-service model minimizes the need for handoffs.
We use release trains and feature toggles to get stuff into production early and often.
Since culture is all about the people, we focus on motivation, community and trust rather than structure and control.
That was part one and now I'd like to talk about failure.
Our founder Daniel put it nicely: “We aim to make mistakes faster than anyone else”.
Yeah, I know it sounds a bit crazy, but here's the idea. To build something really cool, we will inevitably make mistakes along the way, right? But each failure is also a learning, so when we do fail, we want it to happen fast so we can learn fast and therefore improve fast. It's a strategy for long-term success.
It's like with kids. You can keep a toddler in the crib and she'll be safe but she won't learn much and won't be very happy. If you instead let her run around and explore the world, she'll fail and fall sometimes, but she'll be happier and develop faster, and the wounds? Well, they usually heal.
So Spotify is a fail-friendly environment. We're more interested in fast failure recovery than failure avoidance. Our internal blog has articles like ‘celebrate failure’ and stories like ‘how we shot ourselves in the foot’. Some squads even have a fail wall, where people show off their latest failures and learnings.
Failing without learning is, well, just failing, so when something goes wrong we usually follow up with a postmortem. This is never about whose fault it was; it's about: what happened? What did we learn? What will we change?
Postmortems are actually part of our incident management workflow, so an incident ticket isn't closed when the problem is solved, it's closed when we've captured the learnings to avoid the same problem in the future. Fix the process, not just the product.
In addition, all squads do retrospectives every few weeks to talk about what's working well and what to improve next.
All-in-all, Spotify has a strong culture of continuous improvement driven from below and supported from above. However failure must be non-lethal or we don't live to fail again, so we promote the concept of ‘limited blast radius’. The architecture is quite decoupled so if a squad makes a mistake it will usually only impact a small part of the system and not bring everything down. And, since the squad has end-to-end responsibility for their stuff without handoffs they can usually fix the problem fast.
Most new features are rolled out gradually starting with just a tiny percent of all users and closely monitored. Once the feature proves to be stable we gradually roll it out to the rest of the world, so if something goes wrong it normally only affects a small part of the system, for a small number of users for a short period of time. This ‘limited blast radius’ gives squads courage to do lots of small experiments and learn really fast instead of wasting time trying to predict and control all risk in advance.
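One common way to implement this kind of gradual rollout is to hash each user into a stable bucket and expose the feature only to buckets below the current percentage. The sketch below assumes that approach; it is not necessarily how Spotify does it, and the function names, the `new-radio` feature name and the `user-N` ids are invented for the example.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` buckets."""
    return bucket(user_id, feature) < percent


# Hypothetical user ids, used only to show how exposure grows with the percentage.
users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 10, 50, 100):
    exposed = sum(in_rollout(u, "new-radio", percent) for u in users)
    print(f"{percent:>3}% rollout -> {exposed} of {len(users)} users exposed")
```

Because the bucket depends only on the user and the feature, a user who sees the feature at 1% keeps seeing it at 10% and beyond, and setting the percentage back to 0 hides it from everyone, which is what keeps the blast radius small.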
Mario Andretti puts it nicely: “if everything is under control, you're going too slow”.
Alright, let's talk about product development. Our product development approach is based on Lean Startup principles and is summarized by the mantra “think it, build it, ship it, tweak it”. The biggest risk is always building the wrong thing, so before deciding to build a new product or major feature we try to inform ourselves with research. Do people actually want this? Does it solve a real problem for them?
Then we define a narrative, kind of like a press release or an elevator pitch, showing off the benefits. For example, “radio you can save” or “follow your favorite artists”. We also define hypotheses: how will this feature impact user behavior and our core metrics? Will they share more music? Will they log in more often? And we build various prototypes and have people try them out to get a sense of what the feature might feel like and how people react.
Once we feel confident this thing is worth building, we go ahead and build an MVP. Minimum Viable Product. Just enough to fulfill the narrative, but far from feature complete. You might call it the minimum lovable product. The next stage of learning happens once we put something into production, so we want to get there as quickly as possible.
We release the MVP to just a few percent of all users and use techniques like A/B testing to measure the impact and test the hypotheses. The squad monitors the data and continues tweaking and redeploying until they see the desired impact. Then they gradually roll out to the rest of the world, while taking the time needed to sort out practical stuff like operational issues and scaling.
By the time the product or feature is fully rolled out, we already know it's a success, because if it isn't, we don't roll it out. Impact is always more important than velocity, so a feature isn't really considered done until it has achieved the desired impact.
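To make "measure the impact" a bit more concrete, here is a minimal sketch of one standard way to compare two groups in an A/B test, a two-proportion z-test. The choice of test, the function names and all the numbers are assumptions made for illustration; the video does not say which statistics the squads actually use.

```python
import math


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-statistic for the difference between two conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


def p_value(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up data for the hypothesis "will they share more music?":
# group A is the control, group B has the new feature.
shares_a, users_a = 400, 5000
shares_b, users_b = 460, 5000

z = two_proportion_z(shares_a, users_a, shares_b, users_b)
print(f"share rate A = {shares_a / users_a:.3f}, B = {shares_b / users_b:.3f}")
print(f"z = {z:.2f}, two-sided p = {p_value(z):.4f}")
```

A small p-value suggests the difference in share rate is unlikely to be noise, which is the kind of evidence a squad might want before widening the rollout. Real experiments also need to think about sample size and about running many tests at once, which this sketch ignores.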
Note that like most things in this video, this is how we try to work, but our actual track record of course varies.
Now, with all this experimentation going on, how do we actually plan? How do we know what's going to be released by which date? Well, the short answer is we mostly don't. We care more about innovation than predictability, and 100% predictability means 0% innovation. On a scale we'd probably be somewhere around here.
Of course, sometimes we do need to make delivery commitments, like for partner integrations or marketing events, and that sometimes involves standard agile planning techniques, like velocity and burn-up charts. But if we have to promise a date, we generally defer that commitment until the feature is already proven and close to ready.
By minimizing the need for predictability, squads can focus on delivering value instead of being a slave to someone's arbitrary plan. One product owner said, “I think of my squad as a group of volunteers that are here to work on something they are super passionate about.”
So where do ideas come from? An amazing new product always starts with a person and a spark of inspiration, but it will only become real if people are allowed to play around and try things out, so we encourage everyone to spend about 10 percent of their time doing hack days or hack weeks.
That's when people get to experiment and build whatever they want, like this Dial-a-song product. Just pick it up and dial the number of the song you want to listen to. Is it useful? Does it matter? The point is, if we try enough ideas we're bound to strike gold from time to time, and quite often the knowledge gained is worth more than the actual hack itself. Plus, it's fun.
As part of this we do a Spotify-wide hack week twice per year. Hundreds of people hacking away for a whole week. The mantra is “make cool things real”. Build whatever you want, with whoever you want, in whatever way you want, and then we have a big demo and party on Friday.
We're always surprised by how much cool stuff can be built in just a week with this kind of creative freedom. Whether it's a helicopter made of lollipop sticks or a whole new way of discovering music, turns out that innovation isn't really that hard. People are natural innovators so just get out of their way and let them try things out.
In general, our culture is very experiment-friendly. For example, should we use tool A or tool B? Don't know, let's try both and compare. Or do we really need sprint planning meetings? Don't know, let's skip a few and see if we miss them. Or should we show 5 or 10 top songs on the artist page? Don't know, let's test both and measure the impact. Even the Spotify-wide hack week started as an experiment and now it's part of the culture.
So instead of arguing an issue to death, we try to talk about things like: what’s the hypothesis? What did we learn? What will we try next? This gives us more data-driven decisions and fewer opinion-driven, ego-driven, or authority-driven decisions.
Although we are happy to experiment and try different ways of doing things, our culture is very waste-repellent. People will quickly stop doing anything that doesn't add value. If it works, keep it, otherwise dump it. For example, some things that work for us so far are retrospectives, daily stand-ups, Google Docs and guild conferences.
Some things that don't work for us are time reports, handoffs, separate test teams or test phases, and task estimates. We mostly just don't do these things. We're also strongly allergic to useless meetings and anything remotely near corporate BS.
One common source of waste is what we call ‘big projects’. Basically anything that requires a lot of squads to work tightly coordinated for many months. Big project means big risk, so we are organized to minimize the need and instead try hard to break projects into a series of smaller efforts.
Sometimes, however, there is a good reason to do a big project and the potential benefits outweigh the risks. In those cases we've found some practices to be essential: visualize progress using various combinations of physical and electronic boards; do a daily sync meeting where all squads involved meet up to resolve dependencies; and do a demo every week or two where all the pieces come together, so we can evaluate the integrated product together with stakeholders. These practices reduce risk and waste because of the improved collaboration and short feedback loop.
We've also found that a project needs a small, tight leadership group to keep an eye on the big picture. Typically we have a tech lead, a product lead and sometimes a design lead, working tightly together. On the whole, we're still experimenting a lot with how to do big projects and we're not so good at it yet.
One thing we keep wrestling with is growth pain. As we grow we risk falling into chaos, but if we overcompensate and add too much structure and process, we risk getting stuck in bureaucracy instead, and that's even worse. So the key question is really: what is the “minimum viable bureaucracy”? The least amount of structure and process we can get away with to avoid total chaos. Both sides cause waste, but in different ways, so the waste-repellent culture and agile mindset help us stay balanced. The key thing about reducing waste is to visualize it and talk about it often.
So in addition to retrospectives and postmortems, many squads and tribes have big visible improvement boards that show things like: what's blocking us? And what are we doing about it?
We also like to talk about the definition of Awesome. For example, Awesome for this squad means things like really finishing stuff, easily ramping up new team members, and no recurring tasks or bugs. And our definition of Awesome Architecture includes: I can build, test, and ship my feature within a week, I use data to learn from it, and my improved version is live in week 2.
Keep in mind, though, that Awesome is a direction, not a place, so it doesn't even have to be realistic. But if we can agree on what Awesome would look like, it helps focus our improvement efforts and track progress.
Here's an example of an improvement tracking board inspired by a lean technique called Toyota Kata. Top left shows the current situation; in this case the squad was having quality problems. Bottom left shows the definition of Awesome; in a perfect world we'd have no quality problems at all. Top right is a realistic target condition: “if we were one step closer to Awesome, what would that look like?” And finally, the bottom right shows the next three concrete actions that will move us towards the target condition. As these get done, the squad fills it up with new actions. Boards like this live on the wall in the squad room and are typically followed up at the next retrospective.
All right, I realise that maybe this video makes it seem like everything at Spotify is just great. Well, truth is we have plenty of problems to deal with and I could give you a long list of pain points, but I won't because they would go out of date quickly. We grow fast and change fast, and quite often a seemingly brilliant solution today will cause a nasty new problem tomorrow, just because we've grown and everything is different. However, most problems are short-lived because people actually do something about them. This company is pretty good at changing the architecture, process, organization or whatever is needed to solve a problem, and that's really the key point - healthy culture heals broken process.
Since culture is so important, we put a lot of effort into strengthening it. This video is just one small example. No one actually owns culture, but we do have quite a lot of people focusing on it: groups such as people operations and about 30 or so agile coaches spread across all squads. And we do boot camps where new hires form a temporary squad. They get to solve a real problem while also learning about our tech stack and processes and learning to work together as a team, all in one week. It's an intense but fun way to really get the culture. They often manage to put code into production in that time, which is impressive, but again, failing is perfectly okay as long as they learn.
Mainly though, culture spreads through storytelling, whether it happens on the blog, at a postmortem, a demo or at lunch. As long as we keep sharing our successes and failures and learnings with each other, I think the culture will stay healthy. At the end of the day, culture in any organization is really just the sum of everyone's attitudes and actions. You are the culture, so model the behavior you want to see.
That's it, we're done. I hope you enjoyed this story. Thanks for listening.