Is there life beyond Scrum, part 2

“Startup mode development (SMD) is only possible for very small teams. It is not suitable for large projects.” This is what I often hear as a justification for introducing more formal development processes. So let’s play through a large-project scenario for Waterfall, Scrum and SMD and see how exactly SMD will lose this battle (I’m sure it will).

For starters, we need an interesting large project. Creating a big new product for a completely new market as one large project is not a very suitable example, because it is often a bad idea economically. It is a much better idea to build something small, see the market reaction, pivot, and iterate until you’ve found the product-market fit. Unless you’re the second Steve Jobs. Playing our scenarios for economically bad projects is not very interesting, because in these cases, typically, nobody is really interested in efficiency, quality, time-to-market, or any other metric of a development process.

No, let’s take an interesting case.

A website rewrite is such a case. Imagine we have a cool new product called Witter*, which has created a whole new market of micro-blogging. (*Note that I have no idea how the real Twitter was made and later rewritten. I’m just using it as an example.) We have written our Witter in a decent dynamic language, which has allowed us to find the market before burning through all of our investment money. Unfortunately, as a consequence we have all the issues of dynamic languages, including problems with run-time performance, and we have an ad-hoc architecture limiting our scalability and availability. We want to rewrite Witter to solve these problems.

Waterfall

1. Specification Phase
Basically, we would start by describing the current state of the product, but some stakeholders (operations, usability specialists, designers, PMs) will add some new requirements. Because the process requires a specification in the form of a written document, PMs will spend time not only describing what should be better, but also describing the current state.

Efforts: 200 man-days (MD)
Calendar time: 3 months, because all stakeholders will use the chance to add features they have desired for so long. Therefore, a lot of stakeholder meetings will be required.

2. Architecture Design Phase
If I were designing Witter for maximal run-time performance, I would first consider the data structure holding the (t)weet feed of a single user. It must be extremely quick both for appending and for reading, and memory-efficient. To avoid memory fragmentation issues and VM overhead, I would develop it in C (or C++). So, basically, the core of Witter would be a service app written in C. When this core service starts, it would grab all available RAM and provide two interfaces: one to add a weet, and the other to subscribe to a weet stream.

Core services will be sharded AND doubled, so that the weet feed of one user is stored on exactly two hardware servers (for availability). When a new weet enters the system, it will be multicast via UDP to all listening core services. Each service would check whether it stores the feed of either the weet creator or one of his followers, and if so, it would add the weet to the corresponding feed. If some consumers are subscribed to changes of those feeds, the core service would push the change to all of them in order. I would model the persistence simply as one of the clients of this core service, which is subscribed to all feeds and always gets the changes pushed first. When a change is pushed to the persistence cluster, it would use some standard DB to store it.

I’d provide a public-facing web service with the API required by a number of frontends, including the web frontend, mobile web frontend, iOS app, Android app and WP8 app. Internally, this web service would communicate with the core service and the persistence. The frontends would be developed fully separately, and share, if at all, only some basic CI and UX guidelines. On the server side, two more major building blocks should be mentioned: an OAuth authentication service, and a tracking and monitoring system presenting the health and stats of all systems in a single dashboard.

Of course, the task of managing thousands of servers has to be considered as well, along with a number of scripts and solutions for commissioning and installing new servers, and for software deployment. Most of the services would be sharded and load-balanced, and the principles and ideas of “Release It” will be applied.

I would insist on building a working prototype of the most technically important parts of the system (the core service and the persistence) and measuring its throughput and other run-time parameters to ensure they are at an acceptable level.

Efforts: 30 MD for the architecture document, 30 MD for prototyping and measuring performance
Calendar time: 3 months
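
To make the core service idea a bit more tangible, here is a minimal sketch in Python (so not the C implementation the architecture calls for). It only illustrates the logic described above: a shard keeps the feeds it owns in RAM, every multicast weet is checked against those feeds, and subscribers are notified in order, with persistence first. The names `CoreService`, `on_weet` and `subscribe`, the follower map and the bounded `deque` are assumptions made for illustration; UDP, replication and distribution over real machines are left out.

```python
from collections import defaultdict, deque


class CoreService:
    """One shard of a hypothetical core service, holding some users' feeds in RAM."""

    def __init__(self, owned_users, followers, feed_capacity=1000):
        # followers maps an author to the set of users following them
        # (assumed here to be available on every shard).
        self.followers = followers
        # A bounded deque gives O(1) appends and caps the memory used per feed.
        self.feeds = {user: deque(maxlen=feed_capacity) for user in owned_users}
        # Subscribers per feed, notified in list order.
        self.subscribers = defaultdict(list)

    def subscribe(self, user, callback, first=False):
        """Register a consumer for one feed; persistence would use first=True."""
        if first:
            self.subscribers[user].insert(0, callback)
        else:
            self.subscribers[user].append(callback)

    def on_weet(self, author, text):
        """Handle one multicast weet; return the feeds updated on this shard."""
        targets = {author} | self.followers.get(author, set())
        updated = []
        for user in sorted(targets):
            feed = self.feeds.get(user)
            if feed is None:
                continue  # this feed is stored on another shard
            feed.append((author, text))
            for callback in self.subscribers[user]:
                callback(user, author, text)
            updated.append(user)
        return updated


if __name__ == "__main__":
    shard = CoreService({"alice", "bob"}, {"alice": {"bob", "carol"}})
    log = []
    shard.subscribe("bob", lambda user, author, text: log.append("web"))
    shard.subscribe("bob", lambda user, author, text: log.append("persistence"), first=True)
    print(shard.on_weet("alice", "hello"))  # ['alice', 'bob']
    print(log)                              # ['persistence', 'web']
```

Carol follows Alice too, but her feed is not stored on this shard, so the weet is simply ignored for her here and handled by whichever shard owns her feed.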

3. Development planning
As a result of the architecture phase, the interfaces between the parts of the system have been defined, so that teams can be built for developing the parts of Witter, and they can start working in parallel. In total, 9 teams will work on the rewrite:

1) Core service team
2) Persistence team
3) Authentication service team
4) Web and mobile web frontend team
5) iOS frontend team
6) Android frontend team
7) WP8 frontend team
8) Monitoring and tracking subsystem team
9) Build, release, and rollout infrastructure team

Total team size: 40 persons.

Efforts for planning and kickoffs: 120 MD
Calendar time: 3 days

4. Development Phase
Here, all teams develop independently and in parallel. If needed, they create mock services or mock clients to be able to test their parts.

Efforts: 1500 MD
Calendar time: 2 months

5. Integration phase
All subsystems will be integrated and bug-fixed.

Efforts: 800 MD
Calendar time: 2 months

6. Rollout
Because this is such a large project and this is the first rollout, there is no need to wait for the next release window.

Efforts: 100 MD
Calendar time: 7 days

Total project efforts: 2780 MD
Calendar time: 10.5 months
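
For those who like to check the numbers, the effort total can be reproduced by adding up the phase estimates from above. The snippet is just that sum written down in Python; the phase labels are shortened by me, the figures are the ones from the text.

```python
# Effort estimates per Waterfall phase, in man-days (MD), as given in the text.
phases = [
    ("Specification", 200),
    ("Architecture document", 30),
    ("Prototyping and measuring", 30),
    ("Planning and kickoffs", 120),
    ("Development", 1500),
    ("Integration", 800),
    ("Rollout", 100),
]

total_md = sum(md for _, md in phases)

for name, md in phases:
    # Show each phase with its share of the total effort.
    print(f"{name:28s}{md:5d} MD {md / total_md:6.1%}")
print(f"{'Total':28s}{total_md:5d} MD")  # Total is 2780 MD
```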

Scrum

Let’s assume that the current implementation of Witter is a monolithic web app, consisting of a single Ruby file handling all requests, a bunch of static files, and a MySQL database, and that there is only a web frontend. When the frontend posts a weet using Ajax, it will be received by the server side. Then, all feeds that have to be updated will be determined by querying the database, then the weet will be saved into the receiving feeds. And when the frontend polls the feed using Ajax, the latest weets will be retrieved from the database and sent to it. Simple, trivial, and does its job. The whole thing is deployed on a single server.
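
Here is a rough sketch of that request flow. It is written in Python with an in-memory SQLite database standing in for the Ruby file and MySQL, and the table layout and the function names `post_weet` and `read_feed` are invented for illustration, so take it as a picture of the idea rather than of any real code.

```python
import sqlite3


def create_schema(db):
    # Hypothetical schema: who follows whom, and one row per weet per receiving feed.
    db.executescript("""
        CREATE TABLE followers (followee TEXT, follower TEXT);
        CREATE TABLE feeds (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            owner TEXT, author TEXT, body TEXT
        );
    """)


def post_weet(db, author, body):
    """Determine the receiving feeds by querying the DB, then save the weet into each."""
    rows = db.execute(
        "SELECT follower FROM followers WHERE followee = ?", (author,)
    ).fetchall()
    owners = [author] + [row[0] for row in rows]
    db.executemany(
        "INSERT INTO feeds (owner, author, body) VALUES (?, ?, ?)",
        [(owner, author, body) for owner in owners],
    )
    db.commit()
    return len(owners)


def read_feed(db, owner, limit=20):
    """Return the latest weets of one feed, newest first (what the Ajax poll would get)."""
    return db.execute(
        "SELECT author, body FROM feeds WHERE owner = ? ORDER BY id DESC LIMIT ?",
        (owner, limit),
    ).fetchall()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    create_schema(db)
    db.execute("INSERT INTO followers VALUES ('alice', 'bob')")
    post_weet(db, "alice", "hello")
    print(read_feed(db, "bob"))  # [('alice', 'hello')]
```

Every post and every poll goes through the one database, which is exactly why it becomes the bottleneck once the user numbers grow.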

Using Scrum for a team of 40 persons is usually not the best idea, so realistically one would do a little architecture consideration and planning upfront, and split the project into sub-projects. On the other hand, the power of Scrum is in the incremental delivery of business value. Therefore, the existing Witter solution will be rewritten incrementally.

For software developers, after some consideration, the following strategy will be agreed upon. All devs are divided into the infrastructure and application groups. The infrastructure group works on improving scalability and availability. The application group first introduces a public web API, and then several smaller teams work on implementing frontends in parallel.

The sprints will be set to 4 weeks, and will cost 80 MD each for the infrastructure group, the core app team, the web frontend team, the iOS team, the Android team and the WP8 team.

During the first two sprints, only technical changes will be made and the existing UX will be re-implemented as is, giving the web frontend PM and designers 8 weeks to prepare user stories for the improvements. The mobile frontend PMs and designers are given 4 weeks to prepare the first UX deliverables.

The sprints will probably look like this:

1. Sprint
Infrastructure: posting a weet will send a message over a message bus to a new core service written in Java, which will determine the proper target feeds and store the weet in MySQL. Reading weet feeds will also go through that new service, to improve decoupling between the app logic and the database.
Core app: the first version of the API will be introduced in the form of a simple “if” statement in the Ruby file; responding with XML will be done in the same way as responding with HTML.
Web Frontend: re-creating the existing UI, moving a lot of server-side code into the client and preparing for using the API.
Mobile Frontends: installing the development environment, building prototypes for designers, registering in the corresponding marketplaces, being idle.
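
The “simple if statement” of the core app team could look roughly like the following. The sketch is in Python rather than Ruby, and the function name `render_feed`, the `fmt` parameter and the element names are made up for illustration.

```python
from html import escape as html_escape
from xml.sax.saxutils import escape as xml_escape


def render_feed(weets, fmt="html"):
    """Render the same feed either as an HTML fragment or as XML for API clients."""
    if fmt == "xml":
        # The whole "API v1": same data, different markup.
        items = "".join(f"<weet>{xml_escape(w)}</weet>" for w in weets)
        return f"<feed>{items}</feed>"
    items = "".join(f"<li>{html_escape(w)}</li>" for w in weets)
    return f"<ul>{items}</ul>"


if __name__ == "__main__":
    print(render_feed(["a<b"]))          # <ul><li>a&lt;b</li></ul>
    print(render_feed(["a<b"], "xml"))   # <feed><weet>a&lt;b</weet></feed>
```

It is quick to ship, which is the point of the first sprint, but it also ties the API to the page-rendering code, which is what the later sprints have to untangle.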

2. Sprint
Infrastructure: a huge memcached cluster will be introduced to help with the problem of MySQL being the bottleneck. The structure of users and followers is kept completely in RAM, so that the proper target feeds can be determined without MySQL queries.
Core app: pushing in the public web API is implemented by polling the core service. The API is rewritten in Java.
Web Frontend: making the frontend actually use the API; first experiments with live pushing of weets.
Mobile Frontends: starting to implement the UI.

3. Sprint
Infrastructure: as the number of users continues to rise exponentially, the team proactively introduces sharding and load-balancing of all servers.
Core app: previously, the web API relied on an authentication cookie set by the Ruby app; a new OAuth service will be employed and API v2 will be released.
Web Frontend: switching to API v2, starting to implement the frontend improvements desired by PMs and designers.
Mobile Frontends: finishing the UI work, introducing animations and running performance tests on real devices.

4. Sprint
Infrastructure: the core service introduces subscriptions, so that an incoming weet will be pushed to the clients connected to the web API, without polling. Also, incoming weets will first be stored in the memcached cluster, and then picked up and saved into MySQL asynchronously by a separate process.
Core app: switching from polling to the new pushing scheme; the public API doesn’t change. Also, eliminating the last parts still implemented in Ruby and rewriting them in Java.
Web Frontend: implementing frontend improvements. Communicating with the core app team to implement API v3 with the enhancements needed for the new features.
Mobile Frontends: coupling the apps with the web API.
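
The infrastructure change of this sprint (store the weet in the cache first, save it to the database later) is a classic write-behind scheme. A minimal sketch in Python, with a plain dict standing in for memcached, a background thread standing in for the separate process, and `persist` being a hypothetical callback that would write to MySQL:

```python
import queue
import threading


class WriteBehindStore:
    """Serve reads and writes from a cache; persist asynchronously in the background."""

    def __init__(self, persist):
        self.cache = {}               # stand-in for memcached: owner -> list of weets
        self.pending = queue.Queue()  # weets not yet written to the database
        self.persist = persist        # hypothetical callable(owner, weet) writing to the DB
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def add_weet(self, owner, weet):
        # The caller returns as soon as the cache is updated.
        self.cache.setdefault(owner, []).append(weet)
        self.pending.put((owner, weet))

    def read_feed(self, owner):
        return list(self.cache.get(owner, []))

    def _drain(self):
        while True:
            item = self.pending.get()
            try:
                if item is None:  # shutdown marker
                    return
                self.persist(*item)
            finally:
                self.pending.task_done()

    def close(self):
        """Flush everything still pending and stop the background worker."""
        self.pending.put(None)
        self.pending.join()
        self.worker.join()


if __name__ == "__main__":
    saved = []
    store = WriteBehindStore(lambda owner, weet: saved.append((owner, weet)))
    store.add_weet("bob", "hello")
    print(store.read_feed("bob"))  # ['hello']
    store.close()
    print(saved)                   # [('bob', 'hello')]
```

The price of this speed-up is that a weet can be visible to readers before it is durable, so a crash between the two steps loses data unless the cache itself is replicated.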

5. Sprint
Infrastructure: as network congestion between the core services and the memcached cluster is determined to be the new bottleneck, instances of core services and memcached will be co-located on the same physical servers, and the system will be re-sharded so that one core server only talks to one local memcached instance. Multicasting of incoming weets will be introduced.
Core app: switching to the new message bus supporting UDP multicast. Starting to implement the dashboard for monitoring and tracking. Implementing web API v3.
Web Frontend: finishing the frontend improvements.
Mobile Frontends: implementing push notifications, wrapping the apps and submitting them to the corresponding marketplaces.

6. Sprint
Infrastructure: the Java VM, the Java GC and memory fragmentation are identified as the new bottleneck. Core services are being rewritten in C without changing their interfaces.
Core app: finishing the monitoring dashboard.
Web Frontend: profiling the site with YSlow and improving performance by doing magic with CSS and JavaScript files.
Mobile Frontends: the iOS team implements an iPad version of the app utilizing more screen space; the Android team is busy testing the app UI on a huge variety of devices; the WP8 team prepares a Win8 version of the app.

Total project efforts: 2760 MD
Time: 6 months

Although there was no need to spend time and effort on creating up-front architecture documents, the cost savings were almost cancelled out by additional work (the infrastructure team first trying Java before finally resorting to C; the core app team having to upgrade to new versions of interfaces). The time-to-market is significantly better than in Waterfall, not only because the project is finished after 6 months, but also because the first scalability and availability improvements were already live after the first sprint (4 weeks), and monthly improvements have allowed the company to invite more and more users every month. In contrast, in the Waterfall model the company would have to wait 10 months before leaving private beta status, which would severely endanger its market standing.

SMD

Note that according to the requirements of this scenario, we’re not allowed to grow the team (at least not too much), because otherwise the process would stop being SMD.

1. Jeanne: guys, I’ve analyzed the latest stats, and WE’RE GOING VIRAL NOW!
Mark: HURRAY!
Tim: F*CK, WE’RE UNPREPARED AND IN DEEP SHIT NOW!
An ad-hoc crisis meeting follows, where the following decisions are taken:

Development of mobile frontends will be outsourced. This means worse quality, but time-to-market is more important.
All the servers and databases will be hosted in the Cloud. This is more expensive, but time-to-market is more important.
Tim’s buddy Andy will be head-hunted. Andy works in the Acebook infrastructure team, so he has a lot of technical knowledge helpful for Witter.
They will hire Brendan, who is a JavaScript and HTML5 guru.
Tim will concentrate on the API and ongoing operations, Andy will provide infrastructure, Brendan will work on the web frontend, Jeanne will lead UX development for all frontends, and Mark will be responsible for the project management of the contractors.

2. Next 2 weeks:

Jeanne is preparing the first mobile UX sketches
Tim releases the very first draft of the web API and uploads it to the Cloud
Mark is sourcing good contractors for the mobile frontends
The team is waiting for the new guys to come

Efforts: 30 MD
Time: 2 weeks

3. Andy will develop the core service in C. First, he will create all the needed data structures and algorithms, including writing a custom malloc, patching the Linux kernel and measuring performance. This will take him two months. He will then proceed with the networking code. Unfortunately, POSIX contains an API for networking, so he will use POSIX. I worked with POSIX after having had experience with WinAPI, and the feeling was not unlike touching fossilized mammoth shit. When POSIX was created, they didn’t know the word “usability” yet. This means Andy will need another month for the actually trivial networking code. Finally, it will take him another month to add the business logic and the code needed for logging, tracing and monitoring.

Efforts: 100 MD
Time: 4 months (in parallel with others)

4. Brendan will reimplement the web frontend using the latest HTML5 and JavaScript frameworks. This will speed up the frontend and make it appear more responsive, so that the scalability problems the team has won’t look so bad. It will take him a month, and another month for the mobile web frontend. He will then start implementing the desired new features, which will take another two months.

Efforts: 100 MD
Time: 4 months (in parallel with others)

5. Tim’s job is the most stressful. He has to create the web API to unblock Brendan and the contractors, and at the same time try to extinguish the most rampant fires caused by the lack of scalability.

First, he will split the MySQL database into one containing only the social graph and another with the weet feeds. He will shard the latter based on userid. He will also split the Ruby app into the “Forwarder” and “Feed server” parts. Incoming weets will hit the Forwarder system, which will be connected to the social graph DB. The Forwarder will decide which feeds have to be updated, and forward the incoming weet to the corresponding sharded Feed servers. Requests to read weet feeds will be sent directly to the corresponding Feed server. Having this basic scalability in place, Tim will deploy the system into the Cloud. In parallel, he will introduce a first version of the API (without push yet, but with some quick and dirty OAuth implementation). This will take him one month.

Next, he will replace the social graph database with a MongoDB cluster, provision several identical Forwarder systems speaking with the same MongoDB cluster, and use the Cloud feature to round-robin the incoming traffic between the Forwarders. In parallel, he will implement push for the API, because it just means that the Feed server has to trigger a push via COMET each time it saves a new weet in MySQL. This will take him another month.

Finally, he has to integrate with Andy’s infrastructure. When he built the Forwarder, he already explicitly decoupled the web API code from the forwarding logic, anticipating Andy’s work. Now he will throw away the forwarding logic and just multicast weets to Andy’s core service cluster. He will add code to the Feed servers so that they subscribe to changes pushed from the core services. And he will need to re-implement the push functionality in the API. This is a bigger change, and it will take him 2 months.

Efforts: 100 MD
Time: 4 months (in parallel with others)
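
Tim’s first-month split into a Forwarder and sharded Feed servers can be sketched like this. It is Python instead of Ruby, in-memory dicts stand in for the social graph DB and the MySQL shards, and the integer user ids with modulo sharding are my assumption, since the scenario only says “based on userid”.

```python
class FeedServer:
    """One shard: stores and serves the feeds of the users assigned to it."""

    def __init__(self):
        self.feeds = {}  # stand-in for one sharded MySQL feed database

    def save(self, owner, author, text):
        self.feeds.setdefault(owner, []).append((author, text))

    def read(self, owner):
        return list(self.feeds.get(owner, []))


def shard_for(user_id, shard_count):
    """Pick the Feed server for a user (assumed scheme: integer userid modulo shards)."""
    return user_id % shard_count


class Forwarder:
    """Receives incoming weets and forwards them to the shards owning the target feeds."""

    def __init__(self, social_graph, feed_servers):
        self.social_graph = social_graph  # author id -> set of follower ids
        self.feed_servers = feed_servers

    def post(self, author, text):
        targets = {author} | self.social_graph.get(author, set())
        for owner in targets:
            shard = shard_for(owner, len(self.feed_servers))
            self.feed_servers[shard].save(owner, author, text)
        return len(targets)


if __name__ == "__main__":
    servers = [FeedServer(), FeedServer()]
    forwarder = Forwarder({1: {2, 3}}, servers)
    forwarder.post(1, "hello")
    # Reads bypass the Forwarder and go straight to the owning shard.
    print(servers[shard_for(2, 2)].read(2))  # [(1, 'hello')]
```

Because the Forwarder keeps no state of its own, several identical copies can later sit behind a round-robin load balancer, which is what the second month relies on.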

6. The 3 contractors will need 4 months each to implement the corresponding mobile apps. They will bill 100000€ each. Mark will need to spend at least half of his time managing them.

Efforts: 300000€ + 50 MD
Time: 4 months (in parallel with others)

7. Jeanne will spend her 4 months creating and discussing UX for three different mobile platforms, and creating and discussing web frontend UX improvements. In her “spare” time she has to work as community manager, trying to respond to angry mails from users who are not satisfied with the service’s current stability and availability.

Efforts: 100 MD
Time: 4 months (in parallel with others)

Total project efforts: 480 MD + 300000€ + expenses for the Cloud
Time: 4.5 months

SMD costs just around 30% of the Waterfall or Scrum expenses, and its time-to-market is a quarter shorter than that of Scrum (4.5 instead of 6 months). One of the reasons for this efficiency is a tiny team co-located in one small room. This reduces the time wasted on technical discussions, misunderstandings, creation of documents, and office politics. Another reason is the partial use of contractors, because their business model is streamlined for efficient creation of mobile apps, and they typically have building blocks they can reuse from customer to customer. Also, because of such a small team, many things will be done in a quick-n-dirty way, meaning it will be hard to maintain and evolve them in the future, unless more is invested in rewriting them yet again. But the main reason, and at the same time a showstopper for many startups, is hiring programming gods to accomplish the task. Andy, Brendan and Tim from this example are all extremely efficient, extremely knowledgeable gurus. It is very hard to hunt such persons for your small startup, especially if you’re not located in Silicon Valley. And it is even harder to get these guys on 2 weeks’ notice, exactly at the moment you’ve recognized that your startup is getting traction.

And if you cannot afford to work with the best of breed, you are forced to work with whatever software developers are available. Those devs are by no means less smart than the gurus; they just don’t have as much experience yet and, perhaps, not as much dedication. This means that, on average, you need more of them. And at the moment when you have too many devs to fit comfortably in a single room, you have to introduce processes, and here you go: you have Scrum in your company, even though your product is just a small website.

I remember looking into the eyes of a manager who was so happy telling me he had grown the company and hired a lot of new devs, and I was thinking, like, shouldn’t you feel a bitter drop of failure in your jar of honey?

Summary

Even in this example of a larger project, SMD was still the preferable way to develop software. Unfortunately, it is very hard to follow, and it has additional implications (like a heavier reliance on contractors and Cloud services, as well as a slightly worse truck factor).

Just to remind you of a real-world example: Instagram reached 30 million users with only 13 employees (now they have 100 million). I have no idea what process they are using, but I doubt it is a formal one.

Maxim Fridental
