Ivory Jenga - how organizations struggle with granularity
- Abstract
- Department of DevOps, Division of Jenkins
- We need it, we reinvent it
- Hierarchical structure, hierarchical outcomes
- One deployment to rule them all
- Summary
Abstract
Building distributed systems is easier said than done. Putting aside the technical complexity and challenges that come with designing, developing and maintaining them, organizations need to figure out how to coordinate the efforts of their engineering teams, so that results are delivered in finite time, at an acceptable cost, and make at least some sense for the business. This is hard to get right, and there is a reason why Conway’s Law is one of the most frequently cited terms in the microservices world. Time and again, reality proves that the outcomes of an organization’s efforts reflect its internal structure’s efficiency - or deficiency.
In this article, we will look at a few examples of how organizational structure and internal regulations shaped the way engineering teams built distributed systems, and what effects this had on system maintainability, reliability, and efficiency.
Department of DevOps, Division of Jenkins
Anybody who has worked for a large organization with a headcount in the thousands, if not tens of thousands, has probably experienced this kind of org structure. Rather than having the resources needed to get the job done at hand, one needs to jump through hoops to perform even routine tasks:
- Create a DBA ticket to provision a database for this new application,
- Spend a month’s worth of e-mail threads and meetings with the Network Team to establish a connection between the new database and the servers your application is going to run on,
- Get stuck for a week because the CI/CD pipeline got misconfigured - the DevOps Team misunderstood the request and prepared a JavaScript pipeline, so now your Java application cannot be built: it needs Maven… but got NPM.
Sadly, this way of working is not only inefficient and frustrating, but also prevalent to the point that it is taken for granted in large organizations. Oftentimes, it is justified with statements such as:
We have always worked this way, you have to get used to this
This is a serious business, we cannot just get things done like a startup
Challenging the status quo is difficult enough, and the privilege to do so is usually beyond the reach of common folks and the lower-to-middle-level managers who experience these struggles first-hand. At the same time, the organization keeps accumulating inefficiencies:
- Everybody is perpetually stuck, with their tickets waiting in line to be handled by another division,
- Stretched feedback loops, leading to poor technical decisions not being corrected in a timely manner,
- Corporate power struggles and a “big man” mindset can easily impair SDLC processes and delay the delivery or maintenance of critical products,
- In extreme cases, customers can be lost due to dissatisfaction with the time-to-delivery and/or Quality of Service the organization is able to provide.
Everyone is waiting for each other
The mutual dependencies between such tech-oriented departments tend to run in all directions:
- The Development team needs the QA team to sign off on their changes for production deployments,
- The DBA team reaches out to the Development team to update the database connection configuration, as they plan to migrate to a more scalable DB cluster and decommission the legacy one,
- The QA team chases the Development team because the latest changes slowed down the application significantly, and now the Development team must sync with the DBA team to improve the indexes on the affected tables,
- The Security team approaches the Development team to apply security patches, but this requires alignment with the QA and DBA teams because other deliverables are already underway, and there is no easy way to roll out a simple library version bump with so much work already piled up.
Ineffective communication
Imagine a hierarchical, siloed org structure where a particular Development Team needs the Infrastructure Team’s help.
Unfortunately for the Development Team, it is not the only one reaching out to Infrastructure to get something done. There are a number of things, then, that often happen in such a scenario.
Scenario A: No capacity
In this scenario, the Infrastructure Team simply rejects or postpones the request for help, as they are already too overwhelmed to take on yet another task - and everyone claims their request is the most urgent.
Obviously, the Development Team is dissatisfied with the dismissal - from their perspective, the matter may be genuinely critical and urgent, block their progress, or even have a severe organization-wide impact. Meanwhile, the Infrastructure Team needs to spend considerable time triaging and responding to numerous requests. This often leads straight to the other two scenarios.
Scenario B: No response
Sometimes, this communication becomes so overwhelming that some teams stop responding to requests altogether - either due to requests being simply lost in the endless stream of tickets and emails, or because the team would be practically paralyzed if they replied to everyone.
This situation is even more unhealthy than perpetual rejection - in the case of the latter, the team reaching out for help at least knows that the other side acknowledged their request. If the recipient never responds to tickets, emails and meeting invites, follow-ups and escalations ensue.
Scenario C: Escalation
Sometimes escalation is used as a last resort, and sometimes it is the default way of getting work done in an organization, because direct communication between individual teams is virtually impossible. The escalation can go up several levels before the situation gets addressed.
Going through such an escalation path not only takes time, but also creates needless tension and may exert excessive pressure on either party involved. Moreover, it drags multiple levels of management into fine-grained initiatives, stretching their capacity and possibly undermining trust in the team’s competence and its ability to deliver results.
Solution: Autonomous domain teams
As you can see from these few examples, such conundrums are not easily solved, especially when they compound. The problem would not exist if each team owning a particular domain were mostly self-sufficient - with Software Engineers, QA Engineers, DBAs and Security Engineers on board - and could own its piece of the cake end-to-end. Moreover, it would make it easier for an organization to see what kind of value each individual team delivers. Frankly, this is not a new concept, as it dates back to at least 2003, when Eric Evans coined the term Domain-Driven Design. Over 20 years later, we are still wondering whether what we do is proper DDD, or where we should draw the lines between domains - though what matters most in DDD is its core concept: instead of owning layers, organizational units should own particular areas of the business as bounded contexts.
In this setup, most of the time the teams are self-sufficient, and external interactions are required when concerns cross the boundaries of a single domain. Examples of such situations include:
- Architecture alignment initiatives,
- Organization infrastructure security hardening efforts,
- One domain interacting with another as part of a higher-level customer journey.
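To make the bounded-context idea from the previous paragraph more concrete, here is a minimal, illustrative sketch - the Orders and Shipping contexts and all names in it are hypothetical. Each context owns its own model end-to-end and exposes only a narrow, explicit contract to the other:

```java
import java.util.List;

// Hypothetical sketch: two bounded contexts, each owned end-to-end by a single team.
// They share no entities and no database schema; they interact only through a narrow,
// explicit contract (a plain Java interface here - in practice often a REST or messaging API).

// --- Orders context (owned by the Orders team) ---
record Order(String orderId, String customerId, List<String> itemIds) {}

// The only thing the Orders context knows about Shipping is this contract.
interface ShipmentRequests {
    void requestShipment(String orderId, String customerId);
}

class OrderService {
    private final ShipmentRequests shipments;

    OrderService(ShipmentRequests shipments) {
        this.shipments = shipments;
    }

    void placeOrder(Order order) {
        // ...persist the order using the Orders team's own storage...
        shipments.requestShipment(order.orderId(), order.customerId());
    }
}

// --- Shipping context (owned by the Shipping team) ---
// Shipping keeps its own model (e.g. its own view of delivery addresses) and merely
// implements the contract; neither team needs the other in order to change its internals.
class ShippingService implements ShipmentRequests {
    @Override
    public void requestShipment(String orderId, String customerId) {
        // ...resolve the delivery address from Shipping's own data and schedule the delivery...
    }
}
```

The specific technology does not matter here; what matters is that each team can evolve its internals without raising a ticket to - or scheduling a lock-step release with - the other team.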
We need it, we reinvent it
This is an experience most Software Engineers can relate to. We start a project, we start running into challenges… if we are lucky, we find out that someone in the company has already addressed this before; otherwise, we get our hands dirty and proudly solve problems nobody told us did not need solving. This indicates a number of problems:
- Lack of alignment across engineering teams, as all of them have their own in-house solutions to address similar problems,
- Considerable effort is committed by multiple teams to address the same challenge independently,
- Individual solutions are of varying quality, as a smaller audience and narrower scope mean fewer opportunities for feedback. Some teams may arrive at brilliant solutions, while others will not,
- Communication gaps, as most of the time the teams are unaware of each other’s struggles, and thus cannot cooperate to arrive at a solution together.
Solution: Chapters, Guilds and Communities of Practice
It doesn’t really matter how exactly an organization names it, or what particular framework it is built around. The crucial aspect is to give engineering teams room to gather and discuss particular topics of interest. It also creates an opportunity to showcase what kind of challenges a team ran into and how they were addressed - this way, such solutions can spread and receive more substantial feedback, and it increases the chances that teams will re-use each other’s solutions, rather than keep themselves busy reinventing the wheel.
The most successful solutions - the ones that got the most traction, proved to be reliable, and are liked by the engineering teams - can be further refined and adopted company-wide. This can lead to the gradual creation of a company platform, maintained and owned by the platform team(s) as an enabler for domain teams. As an added benefit, in my experience the ability to participate in such initiatives enhances engagement and helps build a sense of belonging to the organization.
Solution: Platform team catering to everyone
These kinds of problems were gradually addressed as the company invested in building a platform team delivering enablers for the rest of us - such as high-level Terraform modules to provision required resources, dedicated Kubernetes clusters in AWS EKS to be used by domain teams, and a high-level CLI that made it easier to integrate the local environment with cloud resources. While some of these platform solutions were not great - the microservices generator was almost universally contested as insanely impractical - in general, having such a team greatly increased our productivity and alignment across teams.
One of the most important aspects of this approach was that it was built with self-service in mind - rather than sending tickets over to the Platform Team to do the necessary work for us, the team delivered technical enablers that allowed us to manage the infrastructure on our own, faster and with less effort, while adhering to the company’s security standards and keeping the infrastructure roughly uniform.
Hierarchical structure, hierarchical outcomes
This kind of attitude is typical of strongly hierarchical organizations, and large companies with long traditions are quite susceptible to a hierarchical mindset. Unfortunately, this has manifold negative effects, such as:
- Rejecting feedback based on position in the hierarchy, rather than merit, means that it is no longer important whether a decision is right - what matters is who made it,
- Conflicts of interest arise, as one needs to choose between taking the risk of challenging the hierarchy, and sacrificing one’s own and organization’s opportunities for the sake of maintaining peace,
- It creates a non-inclusive work environment, where team members are not able to contribute according to their individual skills,
- It limits the growth of individual contributors, as they miss the opportunities to learn, contribute and be recognized,
- In a work environment where individual engineers have no say regardless of their merit, they are unlikely to be as committed as they would be if they could get more involved.
Controller Service, Service Service, Database Service
Needless to say, the Software Architect dismissed the feedback received from the engineering teams, as he was the one to both create and approve the architecture design. It did not take long for all the concerns raised by the teams to become reality - first, in the MVP version, a single API call from the enterprise system led to no less than a dozen internal API requests, some of them cascaded and some looped. A few months later, a dozen became almost a hundred, and the system was visibly struggling with increasing latency and low reliability, as the operation could only succeed if every single one of those roughly 100 API calls succeeded.
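To see why such a chain is fragile, assume - purely for illustration - that the internal calls are roughly independent and each succeeds 99.9% of the time. A dozen chained calls then succeed about 98.8% of the time (0.999^12), but a hundred of them succeed only about 90.5% of the time (0.999^100) - roughly one in ten operations fails even though every individual service looks highly reliable, and that is before accounting for the latency each extra hop adds.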
This situation could have been avoided if the Software Architect had been more open to feedback from those below him and had included them in the design sessions in a more collaborative manner.
Solution: Collaborative design
The most robust, reliable and maintainable systems I have ever built were ones where all of the team members - regardless of their seniority level - were included in the system design and encouraged to participate. In some cases, the design was a result of a peaceful consensus; sometimes it was born in heated discussions and arguments about various side effects, trade-offs and priorities that happened to be mutually exclusive. In either case, what mattered was that everyone felt included and was aligned with the design, rather than suppressed, muttering “I knew this would happen” to themselves.
There are many ways the team can collaborate on distributed system design:
- A team member may become a subject matter expert, driving particular initiatives while receiving design review from remaining team members,
- Architecture Decision Records or similar design documents help preserve the rationale for certain decisions, and can even be checked into version control (a minimal outline is sketched after this list),
- Brainstorming sessions are a great way to discuss and confront various ideas and designs before committing to a particular decision,
- Most of the design work can be split into separate tasks, spikes or other actionable items, which can be worked on individually or collaboratively.
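For teams that have not used them before, a minimal Architecture Decision Record - in the spirit of Michael Nygard’s widely adopted format - needs only a handful of sections:
- Title and status (e.g. proposed, accepted, superseded),
- Context: the forces and constraints that made a decision necessary,
- Decision: what was chosen, stated in a sentence or two,
- Consequences: the resulting trade-offs, both positive and negative.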
This approach works well both within the boundary of a single team, as well as within initiatives crossing the team boundaries, such as internal tech communities.
One deployment to rule them all
Orchestrating all deployments of a distributed system as a single, coordinated release is one of the most dangerous anti-patterns - the Red Flag Law of software delivery. This deployment strategy has multiple disadvantages in the context of distributed systems:
- Big-bang deployments involving multiple deployment units create room for human error, which in turn can lead to severe outages if deployment units strongly depend on each other,
- Multiple, not necessarily related changes build up to become part of a single deployment, and only one of them needs to be defective for the deployment to fail or cause an outage,
- In case of a deployment-related outage or a failed deployment, it becomes harder to investigate the root cause of the failure,
- Breaking changes can be accidentally introduced, not only creating further opportunities for human error, but also creating a risk that a failed or partial deployment may turn out to be difficult to roll back.
Solution: Independently deployable components
Similar to the idea of having mostly autonomous teams, software components in a distributed system should be deployed independently of each other. In order to achieve this, deployments should be atomic, and any changes, including breaking changes, should be rolled out in such a way that they do not require immediate orchestration with the deployments of other deployment units. Some of the deployment strategies that enable this include:
- Breaking up a breaking change into a series of mutually compatible changes (see the sketch after this list) - so that in case of a deployment failure, the deployment unit can always be reverted to its previous version,
- If changes in multiple deployment units require coordination, the deployment effort should again be broken into compatible stages, which preserve some order of introducing changes, but do not require strict timing for the system to remain available,
- Oftentimes, leveraging feature flags is useful, as related changes can be deployed without having an immediate effect on the system. Thanks to this, deployments of related changes no longer need to occur in a particular order, and the changes can remain inactive until it is needed - or possible - to enable them.
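To make the first two points more concrete, here is a minimal, hypothetical sketch of such a rollout - the message and class names are made up. A field rename that would otherwise be a breaking change is split into three small, individually deployable and reversible steps:

```java
// Hypothetical "expand - migrate - contract" rollout of a field rename
// (customerName -> recipientName) in a message exchanged between two deployment units.

// Step 1 - producer (EXPAND): publish both the old and the new field.
// Consumers running the current version keep working; this deployment can be reverted freely.
record ShipmentRequestedV1(String orderId, String customerName, String recipientName) {}

// Step 2 - consumer (MIGRATE): prefer the new field, but tolerate its absence while
// older messages may still be in flight (a "tolerant reader"). Also independently revertible.
class ShipmentRequestHandler {
    void handle(ShipmentRequestedV1 message) {
        String recipient = message.recipientName() != null
                ? message.recipientName()
                : message.customerName();   // fallback during the transition
        // ...process the shipment for `recipient`...
    }
}

// Step 3 - producer (CONTRACT): stop sending the old field once all consumers are
// confirmed to rely on the new one.
record ShipmentRequestedV2(String orderId, String recipientName) {}
```

None of the steps requires the producer and the consumer to be deployed at the same moment, which is exactly what keeps the deployment units independent.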
Solution: Frequent, unscheduled deployments
To prevent multiple changes from compounding, it is reasonable to deploy each change as soon as it reaches production readiness, even if that means multiple deployments a day. One may, however, ask a number of valid questions:
- How do we know a change is production-ready?
- What do we do if some changes are production-ready, and some are not?
The answer to the first question is thorough testing and rigorous quality gates - ideally before even letting the change be merged into the deployment unit’s mainline branch. A crucial aspect of this approach is extensive automation, minimizing the labour-intensiveness of enforcing the quality gates, and removing opportunities to bypass them:
- Strong branch protection, with no exception for administrators,
- Required code review before merging,
- Required CI quality gates that must pass, including passing tests, security scans, linters and deployment dry-runs,
- Pre-merge quality gates should be as extensive as possible, to reduce the risk of finding out that the change has defects only after merging. Deploying to staging / pre-production from the PR branch for E2E tests is advisable. Additionally, it helps to require all PR branches to be up to date with the base branch before merging - this way, unexpected and untested interactions between independent changes are avoided.
With thorough validation of a change before integrating it into the mainline branch, we gain a high level of confidence that the mainline branch is always deployable.
In case not all changes on the mainline branch are deployable, or we simply do not want them to take effect just yet, we can avoid blocking the mainline branch - or rendering it non-deployable - by disabling such changes with feature flags. Once a change is considered good to deploy, we can re-deploy with the feature flag turned on - and we are one flag away from reverting the change in case it has severe defects causing incidents. Lastly, once we are certain the change is going to stay and no longer needs its feature flag, the flag can be removed altogether.
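As a minimal sketch of how such a flag guard might look in code - the flag name, the FeatureFlags interface and the pricing example are made up, and in practice the flag would typically be backed by configuration, a database entry, or a dedicated flag service:

```java
// Hypothetical feature-flag guard around a change that is already merged and deployed,
// but not yet active. Flipping the flag enables it without another deployment;
// flipping it back is the one-step rollback mentioned above.
interface FeatureFlags {
    boolean isEnabled(String flagName);
}

interface PricingEngine {
    long priceInCents(String basketId);
}

class CheckoutService {
    private static final String NEW_PRICING_FLAG = "checkout.new-pricing-engine";

    private final FeatureFlags flags;
    private final PricingEngine legacyPricing;
    private final PricingEngine newPricing;

    CheckoutService(FeatureFlags flags, PricingEngine legacyPricing, PricingEngine newPricing) {
        this.flags = flags;
        this.legacyPricing = legacyPricing;
        this.newPricing = newPricing;
    }

    long priceInCents(String basketId) {
        // Dark-launched code path: shipped with a regular release, enabled later,
        // and removed (along with the legacy branch) once the change is here to stay.
        if (flags.isEnabled(NEW_PRICING_FLAG)) {
            return newPricing.priceInCents(basketId);
        }
        return legacyPricing.priceInCents(basketId);
    }
}
```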
Summary
Defining effective ways of working on a distributed system is a challenging endeavor, and it only becomes harder if the organization itself is large and already has its own procedures. It is not impossible, though, to adopt good practices and gradually improve, leading to better results for the organization and a better experience for its engineering teams.