I come across many businesses that are not grounded in basic high-level architecture and design goals. They simply leap from features to building stuff, most often because everyone is in a mad rush to get something working as soon as possible. The fundamental flaw in this approach is the belief that spending the time to build it “the right way” takes longer and is more expensive. I’ve found the exact opposite to be true. And for the record, the idea that you can just slop something together quickly for a minimally viable product and later insert a good architecture is pure nonsense. It never happens, ever.
Here are a few basic guidelines that apply to almost anything you are building. I’m certain there are plenty more that are specific to your particular challenge and to the technology choices you make along the way.
Don’t cripple configuration automation and unnecessarily burden your operations team with configuration management. Design configuration points hierarchically, so that every deployment derives from a base configuration and supplies only the deployment-specific settings that override the defaults. Wherever possible, configuration should be modifiable at runtime and should persist across restarts.
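To make the hierarchy concrete, here is a minimal sketch using Python's standard `collections.ChainMap`. The setting names, deployment names and the `build_config` helper are all hypothetical, chosen only for illustration; a real system would load these layers from files or a configuration service.

```python
from collections import ChainMap

# Hypothetical defaults shared by every deployment.
BASE_CONFIG = {"db_pool_size": 10, "log_level": "INFO", "cache_ttl_seconds": 300}

# Hypothetical per-deployment overrides; only the differences are listed.
DEPLOYMENT_OVERRIDES = {
    "staging": {"log_level": "DEBUG"},
    "production": {"db_pool_size": 50},
}


def build_config(deployment, runtime_overrides=None):
    """Layer runtime changes over deployment overrides over base defaults."""
    runtime = runtime_overrides if runtime_overrides is not None else {}
    return ChainMap(runtime, DEPLOYMENT_OVERRIDES.get(deployment, {}), BASE_CONFIG)


config = build_config("production")
print(config["db_pool_size"])    # 50, taken from the production override
print(config["log_level"])       # INFO, inherited from the base
config["log_level"] = "WARNING"  # a runtime change lands in the top layer only
print(config["log_level"])       # WARNING
```

Lookups fall through the layers from most specific to least specific, and writes only touch the runtime layer, so the base defaults stay intact. Persisting that top layer (for example, by writing `config.maps[0]` to disk) is one way to make runtime changes survive a restart.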
Facilitate Production Troubleshooting
Don’t count on your development team having access to production; allowing a live debug session attached to your production environment isn’t good practice. When log messages record an exception, they should include as much contextual information as possible, so that production support staff can recreate the conditions present at the time of the exception or undo any data corruption that resulted from the error.
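As a rough illustration, the sketch below logs the inputs and the pre-failure state alongside the stack trace. The `apply_discount` function and the order fields are hypothetical; only the standard `logging` and `json` modules are used.

```python
import json
import logging

logger = logging.getLogger("orders")


def apply_discount(order, percent):
    """Hypothetical business operation that can fail on bad input."""
    if not 0 <= percent <= 100:
        raise ValueError(f"discount out of range: {percent}")
    order["total"] = round(order["total"] * (1 - percent / 100), 2)
    return order


def apply_discount_logged(order, percent):
    """Run the operation and, on failure, log what is needed to reproduce it."""
    before = dict(order)  # snapshot of the state before any change
    try:
        return apply_discount(order, percent)
    except ValueError:
        context = {"order_before": before, "percent": percent}
        # logger.exception records the stack trace together with the message.
        logger.exception("apply_discount failed: %s", json.dumps(context, sort_keys=True))
        raise
```

With the snapshot and the arguments in the log line, support staff can replay the call in a test environment without ever attaching to production.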
Don’t unnecessarily retry what you already know won’t ever work. When an exceptional condition occurs that retrying cannot fix, the system should not be configured or coded to retry the action that failed. This rule is particularly important where interfaces into third-party APIs are being configured. If a third-party API is failing and we have no expectation that it should ever fail (say, a pool API that provides us with database connections), there should be no attempt at reconnection. The exception should be logged as a high-priority (fatal) exception and sent to a management system.
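One way to encode this rule is to classify failures up front. The sketch below assumes two hypothetical exception classes, `TransientError` and `PermanentError`, and a hypothetical `alert` callback standing in for the management system; none of these names come from a real library.

```python
class TransientError(Exception):
    """A failure that may succeed on retry, such as a timeout."""


class PermanentError(Exception):
    """A failure that retrying cannot fix, such as an unavailable pool."""


def call_with_policy(action, max_attempts=3, alert=print):
    """Retry transient failures only; escalate permanent ones immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except PermanentError as exc:
            # Known to be unrecoverable: report as fatal and stop at once.
            alert(f"FATAL: {exc}")
            raise
        except TransientError:
            # Worth another try, unless the attempts are used up.
            if attempt == max_attempts:
                raise
```

A production version would normally wait between transient retries (with backoff); that is left out here to keep the control flow visible.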
Automate Test Environment Management
Avoid accumulating manual test drag; there is a huge return on investing in test automation. The design of the test infrastructure should include a framework in which the overall test “suite” setup (configuration, code, results and data) is brought to a well-known state, and individual unit tests are otherwise atomic in their own setup and execution.
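Python's standard `unittest` module maps onto this split fairly directly: `setUpClass` for suite-level setup and `setUp` for per-test setup. The sketch below is one possible arrangement, using an in-memory SQLite database and a hypothetical `accounts` table.

```python
import sqlite3
import unittest


class AccountTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Suite-level setup: one in-memory database with a known schema.
        cls.db = sqlite3.connect(":memory:")
        cls.db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")

    @classmethod
    def tearDownClass(cls):
        cls.db.close()

    def setUp(self):
        # Per-test setup: reset the data so no test depends on another.
        self.db.execute("DELETE FROM accounts")
        self.db.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")

    def test_withdraw(self):
        self.db.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        (balance,) = self.db.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
        self.assertEqual(balance, 70)

    def test_starts_from_known_state(self):
        (balance,) = self.db.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
        self.assertEqual(balance, 100)
```

Saved to a file, this runs with `python -m unittest`. Because `setUp` restores the data before each test, the two tests pass in either order.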
Lots of stuff happens in the real world, stuff that you will never anticipate. Ensure that anything developed has at least one additional instance in the event of failure. There should never be fewer than two of anything.
Design for Rollback
Your release will fail, no doubt about it. Any new design should be backwards compatible with previous releases. Test your rollback before every release or you will get caught and the impact may be fatal.
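One common way to keep releases compatible in both directions is a tolerant reader: each release keeps the fields it knows, defaults the ones that are missing, and ignores the rest. The sketch below assumes hypothetical record layouts for a "v1" and a "v2" release.

```python
# Hypothetical field sets: each release maps the fields it knows to a default.
FIELDS_V1 = {"id": None, "name": ""}
FIELDS_V2 = {"id": None, "name": "", "email": ""}


def read_record(raw, known_fields):
    """Keep known fields, default missing ones, ignore anything unknown."""
    return {field: raw.get(field, default) for field, default in known_fields.items()}


written_by_v2 = {"id": 1, "name": "Ada", "email": "ada@example.com"}
written_by_v1 = {"id": 2, "name": "Bob"}

# After a rollback, the old release still reads data the new release wrote.
print(read_record(written_by_v2, FIELDS_V1))  # {'id': 1, 'name': 'Ada'}
# After the upgrade, the new release still reads data the old release wrote.
print(read_record(written_by_v1, FIELDS_V2))  # {'id': 2, 'name': 'Bob', 'email': ''}
```

This only covers additive changes. Renaming or removing a field needs a staged migration, which is exactly the kind of thing a rollback test should catch.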
Design to Be Disabled
Enable efficient maintenance and minimize outages – planned or unplanned. Any system or service endpoint should be designed to be capable of being “marked down” or disabled.
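A minimal sketch of a "mark down" switch follows. The `ServiceEndpoint` class and its method names are hypothetical; in practice the flag would usually live in shared configuration so operators can flip it without a deploy.

```python
class ServiceEndpoint:
    """An endpoint that operators can disable without stopping the process."""

    def __init__(self, name):
        self.name = name
        self.enabled = True

    def mark_down(self):
        self.enabled = False

    def mark_up(self):
        self.enabled = True

    def handle(self, request):
        if not self.enabled:
            # Fail fast with a clear status instead of attempting the work.
            return 503, f"{self.name} is temporarily disabled"
        return 200, f"{self.name} processed {request}"


search = ServiceEndpoint("search")
search.mark_down()
print(search.handle("q=shoes"))  # (503, 'search is temporarily disabled')
search.mark_up()
print(search.handle("q=shoes"))  # (200, 'search processed q=shoes')
```

Callers see a deliberate, well-defined response while the endpoint is down, which is far easier to handle than a timeout.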
Design to be Monitored
There are typically signs that a failure will occur soon, so make sure you know when bad things are accumulating. The system should be able to identify when it is performing differently from how it normally operates, in addition to alerting when it is not functioning properly. An example of this principle is instrumenting the application to report performance statistics on page render times or query execution times.
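As one simple way to detect "different from normal", the sketch below flags a measurement that sits more than a few standard deviations from the recent mean. The function name and the threshold of 3.0 are assumptions for illustration, not a recommendation for any particular monitoring product.

```python
import statistics


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` deviates from the mean of `history`
    by more than `threshold` sample standard deviations."""
    if len(history) < 2:
        return False  # not enough data to define "normal" yet
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold


# Hypothetical recent query times in milliseconds.
query_times_ms = [100, 102, 98, 101, 99]
print(is_anomalous(query_times_ms, 103))  # False
print(is_anomalous(query_times_ms, 140))  # True
```

The point is that 140 ms may still be well within any hard failure limit, yet it is far enough from the usual range to deserve a look before it becomes an outage.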
While it is nice to count on quick and efficient compute pathways, highly scalable platforms often benefit from offloading and distributing work, and that usually relies on asynchronous designs. Wherever possible, systems should communicate asynchronously.
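The sketch below shows the basic shape with the standard `queue` and `threading` modules: the producer hands work to a queue and returns immediately, while a worker processes it in the background. The in-process queue is only a stand-in; between real systems this role would be played by a message broker, and the squaring "job" is purely illustrative.

```python
import queue
import threading

work_queue = queue.Queue()
results = []


def worker():
    """Consume jobs until a None sentinel arrives."""
    while True:
        job = work_queue.get()
        if job is None:
            work_queue.task_done()
            break
        results.append(job * job)  # stand-in for slow, offloaded work
        work_queue.task_done()


def submit(job):
    """Enqueue the job and return without waiting for it to be processed."""
    work_queue.put(job)


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()

for n in (1, 2, 3):
    submit(n)

work_queue.join()     # wait here only so the demo can print the results
print(results)        # [1, 4, 9]
work_queue.put(None)  # ask the worker to shut down
consumer.join(timeout=1)
```

The caller's latency is now independent of how long the work takes, and more workers can be added without changing the producer.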
Atomic Compute and Stateless Systems
Don’t attempt to store state outside of your persistent data storage – you’ll unnecessarily create scalability obstacles and cripple your ability to build resilient platforms.
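To illustrate, the sketch below keeps all session data in a shared store so that any service instance can handle any request. `SharedStore` is a hypothetical in-memory stand-in for your persistent data storage, and `CartService` is an invented example service.

```python
class SharedStore:
    """Stand-in for persistent storage shared by all instances."""

    def __init__(self):
        self._rows = {}

    def load(self, key):
        return list(self._rows.get(key, []))

    def save(self, key, value):
        self._rows[key] = list(value)


class CartService:
    """A stateless instance: it holds a reference to the store and nothing else."""

    def __init__(self, store):
        self.store = store

    def add_item(self, session_id, item):
        cart = self.store.load(session_id)
        cart.append(item)
        self.store.save(session_id, cart)
        return cart


store = SharedStore()
instance_a = CartService(store)
instance_b = CartService(store)

instance_a.add_item("session-1", "book")
cart = instance_b.add_item("session-1", "pen")  # a different instance, same session
print(cart)  # ['book', 'pen']
```

Because neither instance remembers anything between calls, either one can be restarted, replaced or added behind a load balancer without losing a customer's cart.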
Scale Out Not Up
You’ll eventually not be able to buy a big enough server. The system should be able to be horizontally split in terms of data, transactions and customers.
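A simple way to split by customer is to hash the customer identifier onto a shard. The shard names and the `shard_for` helper below are hypothetical; the hashing itself uses the standard `hashlib` module.

```python
import hashlib

# Hypothetical shard names; in practice these would be connection targets.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(customer_id, shards=SHARDS):
    """Map a customer to a shard using a stable hash of its identifier."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# The same customer always lands on the same shard.
print(shard_for(42) == shard_for(42))  # True
```

Note that plain modulo hashing moves most customers to a different shard when the number of shards changes. Schemes such as consistent hashing or a lookup directory reduce that churn, at the cost of more moving parts.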