Stories from Platform Team
Since I joined the Platform Team at our company, one of our main tasks has been building a flexible and reliable CI pipeline for our 100+ developers (and 150+ services).
In this article I will share the challenges we faced and the solutions we came up with!
CI = Continuous Integration
What is CI? A process that automates testing code and integrating it into the main branch.
Basically we want to build, test, push, and deploy to staging. Easy enough, right?
Our company decided to have a single generic CI for all teams, to save developers time.
Our tech stack
- We use Tilt to build and test our services
- We use GitHub Actions to orchestrate our workflows
- We use Bash + Python to write the code of the workflows
Challenges
Our CI had to adapt to many requirements. I will share a few of them here:
Monorepo Support
Right when we started, we had to make our CI work for a medium-sized repository with more than 40 services.
The Problem: CI for a monorepo is slowwww. By default it builds and tests all the services for every commit, which is wasteful and slow 😵
Solutions:
- Use a base image: services share a base image that rarely changes. We rebuild it only when it changes, which saves quite a bit of time on every service build.
- Filter using git diff: to save more time, we build a service only if the PR changed it. We track git diffs, and this was a very easy optimization to make.
- Use code hashing: to optimize further, we hash the entire service folder with a hashing library and compare the result to the hashes of builds we already have. If the code of the service has not changed, we don’t build it!
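A rough sketch of how the diff filter and the folder hashing could look in Python. It is illustrative rather than the actual CI code: the `services/<name>/...` layout, the function names, and the `known_hashes` dictionary are all assumptions made for the example.

```python
import hashlib
from pathlib import Path, PurePosixPath


def changed_services(changed_paths, services_root="services"):
    """Map changed file paths to service names.

    `changed_paths` is assumed to come from something like
    `git diff --name-only <base>...HEAD`, one path per entry.
    """
    root_parts = PurePosixPath(services_root).parts
    services = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Only files under <services_root>/<service>/... count as service changes.
        inside_root = parts[: len(root_parts)] == root_parts
        if inside_root and len(parts) > len(root_parts) + 1:
            services.add(parts[len(root_parts)])
    return services


def hash_directory(directory):
    """Return a SHA-256 digest over the relative paths and contents of a folder."""
    directory = Path(directory)
    files = sorted(
        (path.relative_to(directory).as_posix(), path)
        for path in directory.rglob("*")
        if path.is_file()
    )
    digest = hashlib.sha256()
    for relative_name, path in files:
        # Include the file name so that renames also change the hash.
        digest.update(relative_name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_build(service, directory, known_hashes):
    """Build only when the folder hash differs from the last hash we built."""
    return known_hashes.get(service) != hash_directory(directory)
```

The two checks complement each other: the diff filter is cheap and runs first, while the hash also catches cases where a branch was already built with identical code.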
Things were starting to feel easy for our CI pipeline. Maybe it’s not so hard?
But then came the requirement we all feared:
Multi-repo Support
We had a short monorepo vs. multi-repo discussion. The result: we had to support both monorepo and multi-repo with the same CI 😅
The Problem: how can we support multiple code repositories with the same CI pipeline?
Solutions:
We got inspired by this great video: https://www.youtube.com/watch?v=-5_ZNmeTTvg&ab_channel=EuroPythonConference
Every PR event, in any repository, sends a workflow dispatch to a single generic “handle_event” workflow.
That workflow then triggers all the workflows the PR needs.
Each workflow reports its results as GitHub checks on the PR commit, so the user can see them (apparently you need a GitHub App to do that).
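To make the flow more concrete, here is a small sketch of the decisions a “handle_event” step could make. It only builds the requests and sends nothing. The two URLs follow GitHub’s REST endpoints for creating a workflow dispatch event and a check run; the event fields, the routing table, and the workflow file names are hypothetical and not taken from our real setup.

```python
def workflows_for_event(event, routing):
    """Pick the workflow files to trigger for one PR event.

    `routing` maps a repository name (or "default") to a mapping of
    PR action -> list of workflow file names. This shape is an assumption.
    """
    repo_rules = routing.get(event["repository"], routing.get("default", {}))
    return list(repo_rules.get(event["action"], []))


def dispatch_request(ci_repo, workflow_file, ref, event):
    """Build the URL and JSON body for a workflow_dispatch call."""
    url = (
        f"https://api.github.com/repos/{ci_repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = {
        "ref": ref,  # branch of the CI repository to run, e.g. "stable"
        "inputs": {
            # Inputs are sent as strings.
            "repository": event["repository"],
            "pr_number": str(event["pr_number"]),
            "head_sha": event["head_sha"],
        },
    }
    return url, body


def check_run_request(event, name, conclusion):
    """Build the URL and JSON body that reports a result on the PR commit."""
    url = f"https://api.github.com/repos/{event['repository']}/check-runs"
    body = {
        "name": name,
        "head_sha": event["head_sha"],
        "status": "completed",
        "conclusion": conclusion,  # e.g. "success" or "failure"
    }
    return url, body
```

Keeping these as pure functions means the routing logic can be unit tested without touching the GitHub API; the actual HTTP calls and authentication live elsewhere.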
See the diagram below:
To make multi-repo work we had to rewrite all of our CI code, which caused a lot of downtime and failures for users. Our next task was to make the CI stable.
CI Stability
The Problem: Our CI code changes constantly and breaks often. How can we make the failures less painful for users?
Solutions:
- Testing: add more tests for our CI code!
- Versioning: when we merge a CI feature to master, it is not released to users immediately. Only after a week do we bump our “stable” branch, which is what users get, to the latest master.
- Monitoring: we created a CI dashboard to monitor issues.
- Feature flags: we hide new CI features behind feature flags, controlled through config.
- Using your own tools: we develop the CI code using the CI itself, which makes it easier to test and helps us understand the user experience.
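As an illustration of the feature flag idea, here is a minimal sketch in Python. The flag names, the `feature_flags` config key, and the `run_tests` step are invented for the example; the point is only that a new code path stays off unless the config turns it on.

```python
# Hypothetical flags; every new CI feature starts switched off.
DEFAULT_FLAGS = {"parallel_tests": False, "new_cache": False}


def resolve_flags(config, defaults=DEFAULT_FLAGS):
    """Merge the flags from a config dict over the defaults."""
    flags = dict(defaults)
    for name, value in config.get("feature_flags", {}).items():
        if name not in defaults:
            # Fail loudly on typos instead of silently ignoring them.
            raise ValueError(f"unknown feature flag: {name}")
        flags[name] = bool(value)
    return flags


def run_tests(services, flags):
    """Choose the old or the new code path based on a flag."""
    mode = "parallel" if flags["parallel_tests"] else "sequential"
    return [f"{mode}:{service}" for service in services]
```

With this shape, a team can opt in to a new feature by changing one line of config, and rolling back is just as small a change.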
Even when our code was great, teams found the CI logs hard to read and understand, and we got many requests for help with CI failures. We knew we had to make the CI easier to work with.
Debugging CI Failures
The Problem: CI runs remotely, so we can’t debug it directly. What should we do when an error occurs?
Our Solutions:
- CI is consistent with the local env: CI uses the same tools as the local environment, so most CI issues can be reproduced locally.
- A large portion of the CI code is written in Python: the CI workflows use testable Python modules, which we can unit test.
- Allow SSH to the runner: we have a guide for SSH-ing into the CI runner and testing there, which helps in rare “WTF” moments.
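To show what “testable Python modules” can mean in practice, here is a tiny, made-up example: a helper a workflow step might call, plus a unit test for it. The `image_tag` function and its naming scheme are assumptions for illustration, not our real code.

```python
import unittest


def image_tag(registry, service, commit_sha):
    """Build the image reference a workflow step would push (scheme is assumed)."""
    if not commit_sha:
        raise ValueError("commit_sha must not be empty")
    # Use a short SHA so the tag stays readable in logs.
    return f"{registry}/{service}:{commit_sha[:12]}"


class ImageTagTest(unittest.TestCase):
    def test_uses_short_sha(self):
        tag = image_tag("registry.example.com", "payments", "0123456789abcdef")
        self.assertEqual(tag, "registry.example.com/payments:0123456789ab")

    def test_rejects_empty_sha(self):
        with self.assertRaises(ValueError):
            image_tag("registry.example.com", "payments", "")
```

Saved as a module, tests like these run locally with `python -m unittest`, so a bug in the workflow logic can be caught without waiting for a remote run.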
At this point the number of microservices in the company had doubled, and we saw many requests coming from teams, each needing custom behaviour for its CI. We decided we needed to make the CI more flexible.
CI Templating
The Problem: each team has different requirements, linters, and ways of running tests. How can we make one CI pipeline work for all of them?
Solutions:
- Config file: we give users a ci_config.json file for controlling which workflows to run and with what parameters, and which services and modules to run.
- Build configurations: we use Tilt to build each service in a k8s cluster. Helm charts are very powerful, and each team can tweak them according to its needs.
- Custom test scripts: each team controls how its tests run using bash scripts!
- Linter scripts: each team chooses and configures its own linters using bash scripts!
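Here is a hedged sketch of how a ci_config.json file could be read and merged with defaults. The keys (`workflows`, `services`, `enabled`, `timeout_minutes`) and the default values are assumptions; the real file format is not shown in this article.

```python
import json

# Hypothetical defaults applied when a team does not override them.
DEFAULT_CONFIG = {
    "services": [],
    "workflows": {
        "build": {"enabled": True},
        "lint": {"enabled": True},
        "test": {"enabled": True, "timeout_minutes": 30},
    },
}


def load_config(text, defaults=DEFAULT_CONFIG):
    """Parse ci_config.json content and merge it over the defaults."""
    user = json.loads(text)
    user_workflows = user.get("workflows", {})
    unknown = set(user_workflows) - set(defaults["workflows"])
    if unknown:
        raise ValueError(f"unknown workflows: {sorted(unknown)}")
    merged = {
        "services": list(user.get("services", defaults["services"])),
        "workflows": {},
    }
    for name, params in defaults["workflows"].items():
        # Team settings win over the defaults, key by key.
        merged["workflows"][name] = {**params, **user_workflows.get(name, {})}
    return merged


def enabled_workflows(config):
    """Return the names of the workflows that should run."""
    return [
        name
        for name, params in config["workflows"].items()
        if params.get("enabled", True)
    ]
```

For example, a team that skips linting and wants a shorter test timeout could write `{"services": ["payments"], "workflows": {"lint": {"enabled": false}, "test": {"timeout_minutes": 10}}}` and leave everything else at the defaults.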