A roach in the system#
Many developers have come across a similar issue. You have an application running and your customers are reporting issues. After a round of communications, getting any and all relevant information you can, you boot up the application on your localhost to figure out what is going wrong and you start working on fixing it. Maybe you add a few logs here and there to double-check that everything is working as expected. You fix the issue, deploy the application, and everything is good. All is right with the world, and you feel like an actual professional engineer. But what if you run it on your machine and can't even replicate the issue? That is what I recently had the privilege of experiencing at my day job.
Let me paint you a spookier picture. The application in question might be running on Google App Engine, with autoscaling enabled, the works. No logs anywhere to be seen (maybe there wasn't time to add logging), and the customers (rightfully so) are unable to give you much information about the specifics of the issue. All you have to go on is that the application is running slow in a few key areas, like a specific page. Oh, and the codebase is older than your employment period by an order of magnitude. Much of the code was written by people who are no longer part of the team, and you need to decipher what is going on in a timely manner, because the slowdown is starting to affect the customers.
Leaving a trail#
The first thing you do is quickly add some logs to the methods used by the specific page. Hopefully, that gives you a good idea of what is happening. For my “example” above, Google Cloud has a built-in logging system that houses the logs for your entire App Engine application. All the logs for all the instances, right there in one place. It looks something like this:
I took this screenshot directly from the Google Cloud documentation, so you can ignore the arrow, but you get the basic idea. The logs are all there, nice and neatly arranged. You have all the filters you could ever need, and even the ability to filter for all logs in a specific time frame. One nice thing about the Logs Explorer for App Engine is that it automatically captures your stdout and stderr as well. Another fun thing is that your API requests are also automatically logged with App Engine's request_log, giving you some useful debugging information like the request method, latency, response code, etc. We quickly found out that the issue was related to some APIs not performing well, but how did we find that out?
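Part of the answer was the latency data already sitting in request_log. A Logs Explorer filter roughly along these lines surfaced the slow requests for us; this is only a sketch, YOUR_PROJECT_ID is a placeholder, and the latency threshold is whatever counts as slow for your app:

resource.type="gae_app"
logName="projects/YOUR_PROJECT_ID/logs/appengine.googleapis.com%2Frequest_log"
protoPayload.latency>="1s"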
Time in a bottle#
In JavaScript, you can do some fun things with the console. Because we knew that the issue was related to API performance, we started by adding timing logs to the problematic APIs in an attempt to identify the portion of the logic causing the biggest slowdown. console.time() allows you to do exactly that. Below is a simple example of how it is used:
// generate a unique request ID for the current request
// (using Node's built-in crypto module for the helper)
const { randomUUID } = require('crypto');
const generateUniqueRequestId = () => randomUUID();
const requestId = generateUniqueRequestId();
// start the timer for this section of the request
console.time(`businessLogic-${requestId}`);
// ... some business logic ...
// end the timer; this prints the label and the elapsed time (e.g. "businessLogic-<id>: 152.2ms")
console.timeEnd(`businessLogic-${requestId}`);
This gets you a lot of useful information in the logs. You have the exact duration each section of your code took to execute, and you can identify the specific request the logs are tied to because you smartly added the requestId, so you can quite easily find all the logs for a specific request. But remember that our App Engine application is running with autoscaling, across multiple instances. This means a lot of logs are being written at the same time, and our logs were all over the place, with some messages for the same request coming into the stream much later than others.
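One small refinement worth mentioning: if the business logic can throw, ending the timer in a finally block makes sure the duration still gets logged. A minimal sketch, where runBusinessLogic is just a hypothetical stand-in:

const timeSection = async (label, requestId, fn) => {
  console.time(`${label}-${requestId}`);
  try {
    // run the actual work being measured
    return await fn();
  } finally {
    // always end the timer, even if fn() throws
    console.timeEnd(`${label}-${requestId}`);
  }
};
// usage: const result = await timeSection('businessLogic', requestId, runBusinessLogic);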
So now, in order to find the logs for a specific request, you need to filter by the requestId. The Logs Explorer obviously lets you do that, but there is an even better way to do it: traces.
Tracing your steps#
Google gives you the ability to group your logs by request with a traceId. Every single request that hits your application will have one. But how do you get access to it from your code? By using the X-Cloud-Trace-Context header.
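For reference, the value of that header is a trace ID followed by a span ID and an options flag, roughly in this shape (the actual IDs will of course differ per request):

X-Cloud-Trace-Context: TRACE_ID/SPAN_ID;o=OPTIONS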
You can read more about it in the Trace Context docs here, but let's just go through a quick way to extract this information and structure our logs in the way that Google Cloud expects.
const getTraceId = (req) => {
  // Node lowercases incoming header names, so we read the lowercase key
  const header = req.headers['x-cloud-trace-context'];
  if (!header) return null;
  // the value looks like "TRACE_ID/SPAN_ID;o=OPTIONS"; we only need the trace ID
  const match = header.match(/^([0-9a-f]+)(?:\/([0-9]+))?/);
  return match ? match[1] : null;
};
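To make that concrete, here is a quick usage sketch with a made-up request object and header value, purely for illustration:

// hypothetical request object, just to show what the helper returns
const fakeReq = {
  headers: { 'x-cloud-trace-context': '105445aa7843bc8bf206b120001000/1;o=1' }
};
console.log(getTraceId(fakeReq)); // -> '105445aa7843bc8bf206b120001000'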
All we are doing in getTraceId is getting the x-cloud-trace-context header from the request and extracting the traceId from it. Then we can pass that to another function to log our messages with the extracted traceId in the format that is required:
// one way to get the project ID: App Engine sets the GOOGLE_CLOUD_PROJECT environment variable
const getProjectId = () => process.env.GOOGLE_CLOUD_PROJECT;

const createStructuredLog = (message, traceId, severity = 'INFO') => {
  const projectId = getProjectId();
  // structured log entry: the Logs Explorer picks up the trace and severity fields automatically
  const logObject = {
    message,
    'logging.googleapis.com/trace': traceId ? `projects/${projectId}/traces/${traceId}` : undefined,
    severity
  };
  console.log(JSON.stringify(logObject));
};
The function above lets us pass a message, the traceId that we can now easily extract from the request header, and even the severity of the message we want to log, for better readability. All we need to do is call the function in any API method we want to log, and it will structure a JSON object with the proper keys and values so that the Logs Explorer can automatically group the entries under a given request. There are other ways to achieve this, like using the OpenTelemetry framework, but this works just fine for our use case and brings in zero additional dependencies.
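To tie the pieces together, here is a rough sketch of how the two helpers can be wired into a request handler. This assumes an Express-style app; the /report route and the fetchReportData call are hypothetical placeholders, not the actual code from our application.

const express = require('express'); // assumed dependency for this example
const app = express();

app.get('/report', async (req, res) => {
  // pull the trace ID out of the incoming request (helper defined above)
  const traceId = getTraceId(req);
  createStructuredLog('Generating report', traceId);

  // time the suspect section and log the duration through the same structured logger
  const start = Date.now();
  const data = await fetchReportData(req.query); // hypothetical business logic
  createStructuredLog(`fetchReportData took ${Date.now() - start}ms`, traceId);

  res.json(data);
});

app.listen(process.env.PORT || 8080);

Routing the timing message through createStructuredLog, rather than console.time, keeps it in the same structured format, so it gets grouped under the same trace as every other log for that request.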
Now, you can click on a specific request in the Logs Explorer and see all of the logs for that particular request, grouped under its trace.
And hopefully, the next time we encounter a similar issue, we will be able to better identify the problem and fix it with the help of this enhancement. Since this experience is still relatively fresh in my head, the next post will cover the next steps I took to improve our code quality by exploring and implementing load testing so that we can better identify performance issues like this before they even reach production.