Architecture – Scaling Design

My position on architecture is different from most people's. My view works really well at scale (much better than the traditional view, in my experience), but it is very different.

I start with one traditional definition of architecture: any decision that would be costly to change. However, I also know that I (and those on my teams) learn quickly. Every problem that I once found difficult has later turned out to be easy…once we learned more. This gets me to my final definition.

Architecture is any decision you make that you are not yet smart enough to be able to change on a whim.

Under my definition, the best large systems have no architecture at all.

Instead, these systems have good design. Good design comes from the following:

  • The cost and risk of local design change is low. Each design flaw, when found, will be cheaper and less risky to fix than to work around. Mechanized refactoring (via tools) gives you this.
  • Each local team is empowered and able to execute arbitrary design changes. Ability comes from proficiency at local refactoring (the first bullet). Empowerment comes from your org structure and source control sharing practices.
  • Each design change is made by people with detailed knowledge, and in a way that will, over time, converge with other design changes. Pairing gives you this within any one team. Encapsulation and marketplaces reduce the need for this between teams. Cross org networks and transparent, high-bandwidth communications give you this between teams.
  • Design changes tend to be local. Encapsulation, duplicate implementations, and independent teams give you this.
  • The cost of integrating design change is low. Single-branch development and ultra-high-frequency full-system integration give you this (see the sketch just below this list).
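
One enabling practice for that last bullet (my illustration, not something the list itself names) is to keep an unfinished design change dark behind a flag, so every team keeps integrating to the single shared branch many times a day without waiting for anyone's in-progress work:

```python
# Hypothetical sketch: an in-progress design change stays integrated but dark,
# so single-branch, high-frequency integration never waits for it to finish.
NEW_PRICING_ENGINE = False  # flipped to True only once the new design is ready


def quote(order) -> float:
    if NEW_PRICING_ENGINE:
        return _quote_with_new_engine(order)    # the half-built new design
    return _quote_with_legacy_rules(order)      # current shipping behavior


def _quote_with_legacy_rules(order) -> float:
    return sum(line.price * line.quantity for line in order.lines)


def _quote_with_new_engine(order) -> float:
    raise NotImplementedError  # safe to merge daily; nothing in production calls it
```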

Good design techniques can eliminate all architecture in a system. This is harder to do as the number of people increases. It is easier to do as discipline (work tiny + prove it) increases (and errors decrease). It is easier to do as decentralization and transparency increase.

Great. How do we get this design?

I could call on Conway’s Law here. Most people would. But people are still creating architecture instead of design after more than a decade, so I’m going in a different direction.

First, I’m going to assume that we can deliver small software well in small teams. If this is a challenge in your organization, then go do Extreme Programming. It is a solved problem. There’s a recipe for it, and those that follow the recipe by the book for at least 3 months (I prefer James Shore’s book) before they start customizing it get the same result every time. So if delivery is your challenge, just go fix it.

Now that your organization can reliably deliver single-team projects, let's discuss larger projects. If you've done XP, at this point your organization consists of cross-functional teams of 8±2 people each. Each of these teams is delivering one or more products end-to-end. They ship to the market on the market cadence with almost no bugs. They each own a value stream, and each is generating value for your business (not necessarily revenue, but probably that too). They each produce well-designed software and can change that design whenever they want.

There is significant duplication of effort (inefficiency): often multiple teams will be doing the same thing. This is especially true with dev tools: each team has hand-crafted its own. Many utility classes will be re-implemented by multiple teams. Teams may copy and paste code from each other, but they don’t share it.

In this fabled (but straightforward to attain) large company, there is not yet any way to ship huge projects. There is no way for the teams to come together to do something massive. So there are no large projects. But there are lots of small ones, each of which is independently shippable, provides value to its customers, and is (hopefully) profitable.

What is the effective architecture we have here?

In other words, what is hard to change?

Well, first let’s look at the overall picture. We see…complete independence. Lots of little apps. Looks a lot like an app marketplace, actually.

Pieces of business value for customers are discovered / invented in a decentralized fashion. This, combined with transparency of information in the organization, results in the maximum number and value of innovations (see SemCo, Dutch Sociocracies, the Open Source ecosystem, and the ecosystem of mobile apps for examples; I think Amazon's marketplace of services is also an example, but I have not seen it myself).

Each project guards its source from other projects. They don’t have a way to share. This makes systemic changes both easier and harder.

Assume that we wanted to make a global change. For example, assume we were back in the time before cloud providers. Each app has been storing its data in different places. Now cloud providers enter the picture. We want to move all of our data to AWS in order to decrease our costs. There are hundreds or thousands of apps, and each does data in its own way. We have a lot of work to do.

However, we also have a lot of people to do that work. If we can convince everyone that it is worthwhile, then everyone will independently move their app to use AWS. Each change is easy, because there are no dependencies between apps. Everyone makes the change in a decentralized fashion, in their own timeframe. The ecosystem ends up making a gradual but rapid transition to the new storage provider. If it doesn’t work for some teams, those teams don’t move. This is a system that is very good at trying experiments.
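
To make that concrete, here is a minimal sketch of how one team might keep the switch cheap (the names, and the use of boto3, are my hypothetical illustration, not anything from a specific team): the app talks to a small storage port, and moving to AWS is just one more adapter plus a construction change.

```python
# Hypothetical sketch: a tiny storage "port" plus two adapters.
# Swapping providers is a local, per-team decision because callers
# only ever see the BlobStore interface.
from abc import ABC, abstractmethod
from pathlib import Path


class BlobStore(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalDiskStore(BlobStore):
    """The pre-cloud implementation: blobs on the local filesystem."""
    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class S3Store(BlobStore):
    """The post-migration implementation (assumes boto3 is available)."""
    def __init__(self, bucket: str) -> None:
        import boto3  # imported lazily so teams that never migrate don't need it
        self.bucket = bucket
        self.client = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self.client.put_object(Bucket=self.bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self.client.get_object(Bucket=self.bucket, Key=key)["Body"].read()


def make_store(use_cloud: bool) -> BlobStore:
    # From this one app's point of view, the whole migration is flipping this
    # choice once the team decides the move is worthwhile.
    return S3Store(bucket="my-team-data") if use_cloud else LocalDiskStore(root="/var/data")
```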

But we glossed over the hard bit: getting everyone to make that shift. And the next shift. And the shift from 500 ways to do things to only 3, where you have to pick the right one of the 3 for your context. So our architecture here is in a place where people don't think to look for it. It isn't in the code.

Our architecture is the time & effort it takes for the teams to take concerted action. Our architecture is poor information exchange between teams. It is each team's opacity that prevents others from seeing what it does. And it is the lack of a system-wide decision-making approach (whether that be a charismatic leader with passionate ideas, a set of shared values and a shared understanding of the economics, or whatever else).

So let’s fix the architecture

Optimally, we want to replace this architecture with good design, which means we need a system that is both flexible and changeable by local action.

Well, good thing people have already found some solutions (several, actually) for just this situation. They've even tested them out in a business context.

As a reminder, the architecture bits to solve are:

  1. Hard to see what others are doing
  2. Hard to share efforts to reduce development costs (tools & code)
  3. Hard for information about successes and failures to flow between teams (hard to learn from each other)
  4. Hard to coordinate efforts to achieve larger objectives

Well, #2 sounds a lot like open source software to me. OSS communities solve this problem between companies; let’s apply the same solution within the company. We just need to have a good, single repository. It needs to hold shared code, shared binary packages, and ways to communicate between producers and consumers. GitHub is a great solution. We can just ape it (better yet, use it).
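
For the shared-package half of that, consumption can look exactly like open source (the index URL and package name below are made up for illustration): the producing team publishes a versioned package to the internal index, and every other team just declares a dependency on it.

```
# requirements.txt -- hypothetical consumer of an internal shared package,
# pulled from the company index exactly as if it were open source
--extra-index-url https://packages.internal.example.com/simple
shared-billing-client==2.3.0
```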

#1 and #3 are susceptible to the same solution: working out loud. This means that the work products of each small team are located in a place where anyone in the organization can see them and (optimally) contribute to them—probably the same repository we use for our internal open source stuff. The conversations around those work products are held where anyone can join. Both conversations and products are searchable and maintain history. And each team curates its conversations and products, making the best stuff available to others in a condensed form (with back-links to the full detail).

This leaves #4. Interesting note: until this point, we have not yet needed any specific organization between teams at all. In fact, we had the same org structure and roles whether we assumed this to be a single massive organization or a marketplace of little app companies.

This is an organization where 100% of the people work on small, cross-functional XP teams directly delivering value into the marketplace and getting paid to do so. There are no managers or human cross-connections. There are communication systems (the working out loud stuff). Those allow each individual group to make local decisions based on non-local data. But we haven’t needed anyone to staff those communication channels. After all, they just move data.

So how do we get #4? There are at least 3 options:

  • Charismatic people with clear broadcast channels (Apple led its app ecosystem this way)
  • Decision-making economics to define shared understanding of trade-off costs (Donald Reinertsen’s books, and the teams which he describes, do this)
  • Decision-making process that transforms system problems into local ones (Sociocracies do this. So does James Shore’s large-scale agile by using kanban to coordinate the work of XP teams)

Pick one. I don’t care which. Personally, I love the third option (especially Sociocracies). I find the second pretty good. And I find the first problematic (charismatic people, like any other individuals, are more often wrong than right). But I’m an individual here and you all aren’t, so you’ll make a better decision than I would.

So where are we now?

Each piece of value is delivered, full-stack, by a small team. These small teams have visibility into each other’s work. They can choose to share efforts / code when that would speed them up (after including blocking costs). They can also choose to duplicate efforts when that would speed them up. Because they all refactor (aka design) all the time, they can reverse the make/buy decision at any time with tiny cost.

Such an organizational system results in a system-wide emergent design. Each part of the system can independently go in a new direction or continue to pursue its current direction, with approximately equal cost. There is no architecture, but there is design all the way up. The design is not the result of thinking by some small group of people who are isolated from the details of particular problems (even if you chose the Charismatic Leader option, since he only has influence, not control). It is the result of design thinking on an ongoing basis by every single person in the ecosystem. It tends towards very loose coupling (as each team is incented to isolate itself from others’ problems) and very high cohesion (as each team has its own profit motive and seeks to deliver real business value, not just some technical component).

There is a small cost in efficiency. Parts of the ecosystem will be redoing work that others have already done. But there are huge gains in throughput and effectiveness (more stuff gets done per unit time, and more value gets delivered per unit time). Higher profit, with some duplicate labor.

And because each individual group is constantly re-evaluating and has a way to share work, the scope of each inefficiency is limited. If it gets large, then removing it becomes a profitable endeavor and someone creates a product to eliminate it. The system, as a whole, accepts small inefficiencies to maximize throughput and encapsulation, but guards against large inefficiencies.

What this means for business

These ecosystems beat large projects nearly every time that we’ve seen them compete (read The Starfish and the Spider for a bunch of examples). Over time, I expect that to continue.

This is a disruptive change. The question for large companies (including mine – my day job is at Microsoft (all views expressed here are solely my own)) is how they can organize to start running internal ecosystems rather than large projects.

Wait, I’ve seen this before…

Yup, this is the same advice I gave last week, in Scaling Agile – The Easy Way. It’s also describing one way to shift from fluency at 2-star to fluency at 3-star.

First you excel at shipping. Then you solve the few remaining problems by following your nose. Excelling at shipping locally gives you a powerful, decentralized ecosystem. Add transparency and decision support and you’ve got everything you need for a decentralized organization.

In the end, scaling design to encompass large systems is exactly the same problem as enabling your business to optimize value (across multiple teams). Once you enable each team to do good design, your architecture is the organization. Not “your architecture reflects your organization” (a rephrasing of Conway’s Law). Your architecture is your organization: the only decisions that are hard to reverse are the organizational ones.

So you fix those.

8 thoughts on “Architecture – Scaling Design”

  1. It seems that there may be an assumption here that apps are independent of each other. Any thoughts on when apps (or really components within a large system) depend on each other and the API/messaging is volatile?

    1. Yeah. The jerky / short-form answer: don't do that.

      Seriously. It's optional. Most people seem to treat dependencies as a requirement at scale, and to treat bad APIs as somehow not possible to encapsulate. Most people also treat bugs as a necessary work product of developers. We have found ways to be effective without incurring these problems.

      First, if you already have a dependency, encapsulate it. Use hexagonal architecture. You can have a different adapter for each version or supplier or whatever. You have a simulator; the unit tests for that simulator define your expectations of the dependency: they can be used as platform tests / ops dependency monitoring / whatever.
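
      A minimal sketch of what that can look like (the rate-service names here are my hypothetical illustration, not from a real team): the same contract tests run against the simulator in fast unit tests and, pointed at the real supplier, double as platform tests / dependency monitoring.

      ```python
      # Hypothetical sketch: contract tests shared by the simulator and the real
      # dependency, so the simulator encodes our expectations of the supplier.
      import unittest


      class SimulatedRateService:
          """In-memory stand-in used by all of our fast unit tests."""
          def rate_for(self, currency: str) -> float:
              return {"EUR": 1.0, "USD": 1.1}[currency]


      class RateServiceContract:
          """Our expectations of *any* implementation, real or simulated."""
          def make_service(self):
              raise NotImplementedError

          def test_rates_are_positive(self):
              assert self.make_service().rate_for("USD") > 0

          def test_supported_currency_does_not_raise(self):
              self.make_service().rate_for("EUR")


      class SimulatorMeetsContract(RateServiceContract, unittest.TestCase):
          def make_service(self):
              return SimulatedRateService()


      # A RealSupplierMeetsContract(RateServiceContract, TestCase) subclass,
      # pointed at the live dependency and run on a schedule, turns these same
      # tests into ops monitoring of that dependency.

      if __name__ == "__main__":
          unittest.main()
      ```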

      Next, it isn't necessary to take dependencies. Teams in large companies often worry way too much about efficiency and not nearly enough about productivity / throughput / effectiveness. They see the waste in "duplicating something between teams" but not the waste in blocking multiple teams on each other and creating wait states. This is especially common with central groups of architects. They often spend great effort consolidating key functionality, then give it to one team. This simply guarantees one chance for a team to screw up many other teams.

      Distributed ecosystems of independent actors default to duplicating all efforts. This wastes a ton of code. But it often saves time: it reduces waiting, responding to irrelevant upstream changes, time spent adapting the general-purpose component to your needs, time spent coordinating & planning, and wait time & coordination effort around release timing / version compatibility. Amazingly often, it is faster and cheaper to just have each group write all of its own code.

      The purpose for the transparency is to allow the ecosystem to identify cases where it is no longer worth duplicating efforts. At some point, some set of code will really become a common component. At that point, it is likely to be very stable. It will have been implemented & refined dozens of times, and the best design will be well-known. The cost to manage the dependency will be cheaper than the cost to write it again. So some team creates that component as a stable product and sells it to the dev community.

      The result is a system that explores rapidly and without blocking. Each group makes their own decisions and can move fast. But it also adapts once it has fully mapped some piece of terrain. Anything well-understood becomes a commodity, which some person creates and then everyone maintains & shares. It becomes a cheap open-source component.

      1. Great post Arlo. Aligns with my intuitions and gives me further reading to explore. I will be sharing it.

        IMHO, the topic of "duplicating code between teams" deserves its own focused post. It is an idea that many developers (ones working on cross-team monolithic code bases) dislike, and I lack the persuasion/data to evangelize it well. The only reference I can draw from is that Domain-Driven Design allows duplicated code across bounded contexts, but it all too often falls on deaf ears. I would title it "When DRY is a DUMB decision" to follow the stop making dumb decisions post. 🙂

  2. Sorry, I find this a bit confusing. What's "architecture" compared to "design"? Or maybe easier to explain: what's the task of an "architect" compared to that of a "designer"?

    If I show you a structure (elements (e.g. subroutines, classes, libraries) plus relationships between elements (e.g. inherits, uses, aggregates)), how do you know, if it belongs in the realm of architecture or design?

    1. You cannot know from the static structure what is architecture and what is design.

      The difference is simple: design is any decision that we can change at the same cost later as it took to make it now. Architecture is anything that, once decided, becomes fixed. One team's architecture is another team's design. The core difference is not the code or technology, it is how good that team is at changing code without introducing cost or risk.

      I'm not sure that there is a role for either architects or designers. But then I don't think there is a role for product owners or blocker-removing scrum masters either. So take the following with a grain of salt. (I'll write sometime about changing the rate of informal learning in a team & so changing the cost/benefit ratio between specialization and multi-specialization. I'll also write about the difference between group/no accountability, individual accountability, and full-team accountability & show how Designated Responsible Individuals is a great way to decrease the accountability & ability for a thing within your company).

      The point is that I want to make every person into a designer. 100% of them. Even the non-devs. Have a tester? Designer. Have a business analyst? Designer. Have a UX guy? Designer.

      Designers are people who can change the product without incurring bugs. In particular, they can make the product better on one axis (often tech debt, but not always) without altering its state on any other axis (bug-for-bug compatibility, same scaling & perf, same usability, etc: whatever axes weren't being intentionally changed).

      Once I have 100% of my team doing design, then everyone can easily get their ideas into the product. And I work in rapidly-rotating pairs, so I don't get individual flights of fancy. I get solid ideas from the whole team, polished by the team and implemented well.

      The result is emergent design. At all times we have a design. At all times we observe which parts of that design meet our needs and which parts don't. And any time we find a gap, we make a change. Because the cost & risk of a design change is no higher later than it is earlier, we are free to delay design until the moment when we have the most data – at the end. Because the full system works this way, we can make these changes at any level of scope – as long as we have figured out a way to break that change into small chunks, execute them sequentially and prove & commit each one before deciding whether to do the next.
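
      To picture one of those sequences of small chunks (my hypothetical example, one common shape of it is sometimes called branch by abstraction): introduce a seam, move callers onto it one at a time, then swap what sits behind it, shipping after every step.

      ```python
      # Hypothetical sketch: a persistence change made as tiny, individually
      # proven-and-committed steps, none of which blocks the ability to ship.

      # Step 1 (commit & ship): introduce a seam in front of the old code.
      class OrderRepository:
          def save(self, order) -> None:
              _legacy_save_to_files(order)  # existing behavior, unchanged


      # Steps 2..n (one commit each): move callers onto OrderRepository.

      # Step n+1 (commit & ship): add the new implementation behind the seam.
      class SqlOrderRepository(OrderRepository):
          def __init__(self, connection) -> None:
              self.connection = connection

          def save(self, order) -> None:
              self.connection.execute(
                  "INSERT INTO orders (id, total) VALUES (?, ?)",
                  (order.id, order.total),
              )


      # Step n+2 (commit & ship): flip construction sites to SqlOrderRepository.
      # Step n+3 (commit & ship): delete _legacy_save_to_files; nothing calls it.

      def _legacy_save_to_files(order) -> None:
          ...  # the original persistence strategy, on its way out
      ```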

      So far, I have yet to see a problem that every team treated as architecture. I've seen language choice be a design decision. I've seen system architecture (tiered app vs client/server vs peer-to-peer ecosystem vs whatever) be a design decision. I've seen persistence strategy / schema / whatever be a design decision. As in: I've seen each of these be changed over time, with no single step that took longer than about an hour from start to finish, and without ever blocking the ability to ship.

      I'm getting increasingly confident that architecture doesn't exist. All that exists is limits that the team chooses to set for itself.

  3. Hi Arlo,

    I really liked your article. I have worked for such a central architecture/devtools team in a large traditional enterprise, and it is interesting to read an alternative way of doing things. I have to admit, however, that the world you describe is pretty much unknown to me, and that may be the reason why I didn't fully understand some of the sections.

    Maybe the biggest remark I have on your model is that it sets assumptions that I have never seen in the wild in large enterprises. Cross-functional teams, for example. A lot of the architectural rules existed exactly to protect the operations teams from the local optimisations by the dev teams that proved deadly once the system went live.

    Giving the teams a completely blank slate to start with will also introduce a lot of unnecessary diversity within the organisation. Lots of technologies, dev tools, libraries, etc. are functionally very similar to one another, so in these cases life is so much simpler if you have one or a couple of "default" software stacks. Needless to say, many of the applications that are developed in these enterprises are administrative applications that have little or no specific technology needs. Of course, if there is a good reason to diverge from the default then that should be possible. And this default stack is not fixed in time either. As soon as a better technology is found the old one should be replaced.

    Such an approach can establish consistency at the organisational level that makes it easier to switch teams, integrate applications, implement global changes (like moving to AWS in your example), share build and deployment infrastructure, etc., without compromising individual freedom too much. This approach by itself doesn't require a dedicated architecture team. It could be accomplished by the dev teams themselves as a joint effort. The only requirement to make it happen is that the teams are able to think on an organisational level. This is very hard in large enterprises and I guess that may be one of the reasons why dedicated architecture teams are created. And I'm very well aware of the destructive forces this can create when this team moves into an ivory tower and starts abusing its power.

    So rather than "converge where appropriate" this style is more "diverge where necessary".

    You also seem to believe that the local optimisations that result from the teams' self-organisation will automatically lead to global organisation-wide optimisation ("Decision-making process that transforms system problems into local ones"). Again, this is not what I have seen in the wild, but then again that may be due to the assumptions not being met (cross-functional teams, etc.).

    Regards,

    Niek.

    1. I set assumptions that I have seen in every scale of company. Yes, there are fewer competent teams in large enterprises (no matter what your definition of competence), but they still exist. Small companies get killed if their software team can't deliver. Large companies just lose a bit of potential profit and eventually do a re-org. So the big guys don't get external forces to drive their evolution. More short-term stability; bigger die-offs when they do happen.

      That said, there are teams at my company (Microsoft) that operate as fully cross-functional XP teams. Just like there are teams (or, usually, work groups) that operate in every other manner known to mankind. Microsoft does not do development in one true way. It experiments (mostly unintentionally, but it still experiments).

      There is a critical difference between having a central system that you have to justify abandoning and having an ability to band together to create uniform practices where that would help. Well, several differences, but one critical one: how many teams actually end up using the thing that would work best for the system as a whole?

      Central systems end up being optimized for people in the center. Each individual contributor takes a productivity hit in order to make life easier for the managers. Each team takes a hit in order to make life easier for the central ops team.

      Distributed choice makes life easier for the teams. They can fix transactional costs locally. They don't need to go get permission to fix a problem. They just fix it. The result is that problems get solved. And that central authorities get left in the dark – they can't handle the vast amount of data and decisions that are going on at the leaves.

      This gives rise to a set of problems, especially if the central authorities have power. When that is the case, they overpower the leaves. They get a system that they can understand, at the expense of curtailing innovation from 95% of the employees.

      This, I would assert, is why hierarchical enterprise companies have so much difficulty innovating. And why small to medium companies, or enterprises which are organized as networks instead of hierarchies, routinely innovate markets away from the big guys. If you want to go fast, you need to avoid blocking on the central authorities.

      You do still need ways for the company to seek alignment. But there are tons of good ways to do this. Start by reading up on Sociocracies (Sociocracy.info) and on SemCo (anything written by Ricardo Semler).

      1. Thanks for the pointers, I will definitely check them out.

        This is a radically different way of thinking for me so I will give it some time to sink in now.

        Niek.
