Driving Surgical Refactoring: A WeTransfer Android Engineering case study

hariutd

9 months ago

Refactoring plays a crucial role in ensuring mobile applications remain robust and scalable, especially in an ever-evolving mobile landscape. In this post, we’ll dive into some of the interesting stories and challenges I faced when driving the refactoring initiatives both at the technical level and with onboarding stakeholders at WeTransfer.

When I joined the company, the codebase was a few months old. It was using Jetpack Compose with Navigation component – meaning, a combo of Fragment and Compose. Some of the core user journeys were built using only XML and Fragments. It was a mix of many things, with minimal use of programming patterns, leaking states but it was okay as we were only focusing on hitting the market ASAP and make it shiny later. Explaining this history is important because every refactoring initiative has different intentions and needs. Knowing the past decisions and pivots in the codebase is a crucial knowledge that will help you in identifying the opportunities for improvements, and more importantly narrowing the scopes.

What is Surgical Refactoring?

Surgical refactoring refers to a highly targeted and precise code refactoring process. Instead of broad, sweeping changes across the codebase, surgical refactoring focuses on improving or modifying specific areas of the code, often to address a particular issue such as performance bottlenecks, bug fixes, or code optimization.

Key characteristics of surgical refactoring include:

Minimal disruption: Changes are limited to a specific module, function, or component without altering the broader structure.
Targeted improvements: The refactoring aims to improve maintainability, readability, or performance in a small, focused area.
Preservation of behavior: The overall functionality of the system remains intact, with no new features introduced.
Low risk: Because the changes are isolated, there is a reduced risk of introducing bugs or creating regressions in unrelated areas.

This approach is often used when teams need to refactor quickly or with minimal resource allocation, such as during urgent bug fixes or when cleaning up technical debt in critical paths.

When do you refactor?

There is a famous quote – not sure who said it but,

A good chef washes the dishes as they cook

This may apply to most of the cases. Yes, we refactor things as we develop new features. But this may not always be true. Before we launched global, we improved somethings here and there but those were quite minimal and we were not touching major core journeys. For WeTransfer, it is mainly, sending, receiving and previewing files. But after launching the app, we reached 1 million downloads within six months, largely driven by brand awareness. So, this put us in a situation where we need to do major refactoring like an open heart surgery. In a less graphic way,

We needed to refactor major parts of the app while keeping it operational, akin to changing the tires on a moving car.

Planning

This was quite an important phase, where I spent a lot of time identifying opportunities and reasons why something needs to change. Here are some of the things I collected

Product road map – Knowing the future or at least a couple of steps ahead helped me to identify which new features can cause a lot of technical debt. For example, if a new feature is added to a user journey that requires clean-up, it can create dependencies that contribute to technical debt. Without knowing what needs to be modified or cleaned up when implementing the new feature, the situation could worsen instead of improving. Knowing this information allowed me to pick the feature that needs the more love.
Crashes, ANRs, and performance—After narrowing down the feature, checking its health helped me onboard many stakeholders. Data is your strongest weapon when you want decisions and approvals. I started diving deep into the Google Play Console and narrowed down the issues linked to the feature. Using tools like DataDog (not a sponsorship—any reliable reporting tool will do) helped me visualise all the errors and filter those that “refactoring” would be solving them automatically rather than spending time to fix them like a band-aid. If there is an opportunity to identify what we can improve – performance, scalability is a value addition.

Collaboration with PMs, Designers and QE

For technical refactor, this may not be necessary but if you are making changes to the whole flow or a user journey, it is quite important include PMs, Designers and QE/QA. They can shed more information on perfecting the refactoring. PMs and Designers often will be a little ahead, working on a prototype or future feature for the product. Once I mentioned that I’m going to make some changes, they suggested a few modification that helped us reach the future state of the product soon. This is a win-win situation for both of us – as Engineers need not touch this file again in the future and for them this has been delivered quickly. Talking to the QE/QA helped me understand some of the bugs that were lingering around that zone. Of course, there is no pressure in adding everything what they suggest but a fruitful negotiation and consultation is much encouraged.

Minimal Disruption with scope definition

Developers can most of the time spiral out on making improvements. It is a rabbit hole that we don’t want to dive in to. Most of the Engineering manager’s job would involve bringing the Engineers back to the reality. Talking to many Managers, I realised that often they don’t approve refactoring as the Engineers go out of control making too many changes, take a lot of time and that would affect the product roadmap. Time is money, yes!

Addressing this concern helped me in my career to accelerate the refactoring initiatives decisions. After identifying what needs to be changed, we write a TDD (Technical Design Document). This contains the following

Problems in the current system, this would help everyone understand why we need to change
Solution to make it better
Jira epic with all the tickets – this is quite important and the team should agree on deviation, say 10%. Practically, we would discover a lot of things when we get our hands dirty but limiting this is important. In such cases, we split the refactoring initiative in 2 or more phases. We release the first phase, comeback and do the second phase.
Approximate time frame – having a time frame with deviation would help the PMs plan the feature and resources accordingly.
Releasable entities – ensure that the tickets or the scope of phases of refactoring is scoped to release-able entities. For example, if a screen is changed to a new UI, we ship new components on the old screen giving a slow onboarding to the new experience. Of course, this cannot be applied to all the cases. What if there is an active development on the screen, leading to technical debt? so, plan according to your team’s work

Keeping the stakeholders informed for preservation of behavior

The interaction between all the stakeholders in the team usually be minimum during a technical refactoring. There will be interaction only among the Engineers. But it is quite important to over communicate with the team during this phase. We would setup weekly syncs with all Engineers and invite other team members to take a peek. We would also setup a separate Slack channel for this initiative and post our update every week in async. This would include information that is digestable by non tech members as well – like our Support Agents. We will include metrics, demo videos and screen shots etc. This will keep the whole team at ease and know for sure that the momentum is rolling. This can also help everyone informed that something is changing and if they have an overlap they would reach out to you. For a migration, it is always a good idea to Deprecate functions or classes to let everyone know the replacement or alternative. I’ve talked separately how this approach made us faster here

The art of shipping faster

We also setup design reviews and testing sessions to incrementally get feedback as we are touching core user journeys.

Phased refactoring for targeted improvements

We already covered a little bit about this in the scope definition section. Phased refactoring strategy has helped us to think in the context “What is the actual problem here? what is our risk tolerance?”. Sometimes the solution can be simple and sometimes for the sake of scalability we may need to do more. But there will always be an urge to do more, improve more, make it more concise, detached, efficient and trust me, I’ve often wrestled with the urge to make things all better, but it’s crucial to stay within the defined scope. But it is also necessary to keep a boundary around the initiative and not touch anything outside of the scope. What we usually do is create a new epic as phase 2 of the particular initiative and add the tickets there. You may ask that how to ensure that this new epic and tickets don’t become a black hole where no one cares later? Here’s is where having a proper vision and Engineering Excellence backlog plays a huge role. I’ve written a separate snippet for this, please take a peek if you are interested 👇🏽

Empowering Teams: The importance of crafting your team’s vision

Reducing risk via Feature flag

When we refactor components, we release them on the fly and validate but if we are changing the whole core journey, we always build it behind a feature flag. With mobile, there is always a rollback challenge. Once we release a new build, it is out there and cannot make changes instantly as the Web does. Having a feature flag ensures that we can rollback the new journey if it caused serious issues. Not only with rollback, they also ensure that the new changes are rolled out in a phased way, so that we can monitor and measure the rate of success or failure gradually. Without the feature flag, if things go south, we would be having incidents and users complaining in the Play Store. This is like changing one tyre at a time and measuring if the car is doing okay. Feature flag is a tool that help us swap back the tyre if the new one is not doing any good

Final thoughts

Refactoring is more than just cleaning up code—it’s a crucial practice that ensures the long-term scalability, performance, and maintainability of your application. My experience at WeTransfer demonstrated how surgical, targeted refactoring can improve core features while maintaining stability, especially in high-traffic, rapidly evolving products. Successful refactoring requires not only technical execution but also careful planning, collaboration with cross-functional teams, and thoughtful communication with stakeholders. By narrowing the scope, using phased approaches, and leveraging tools like feature flags, teams can mitigate risks while delivering valuable improvements without disrupting the user experience.