Sunday 27 September 2020

Kernel Debugging - Nightmare


This is about an interesting but strenuous exercise I went through as a software engineer. It happened many years ago, so some parts are blurry.

Disclaimer: The methods followed here might not be the best way to debug such an issue.

Scene

So the story goes like this. We were developing a gateway product and were in the last stages of running performance tests on it. The performance expectations were so well documented that we had a fully detailed input/output model that predicted exactly what kind of load the gateway could handle (probably one of the best performance models I have ever worked with). During the performance tests we observed that the system CPU usage of the Linux kernel went far beyond what was expected as we loaded the system towards the peak load it could handle. The details will be explained once I finish the architecture overview in the next section.

Architecture

From an architecture point of view, at a really high level, this gateway includes two network processors.
  1. Data plane - One processor runs a simple while loop that handles each incoming packet, makes some decisions and routes it along the appropriate path. If this logic cannot determine what to do with a packet, it throws it over to the other processor, where it lands on a standard Linux kernel that handles it.
  2. Control plane - The second processor serves as the control plane for the data plane and sets up all the routing rules so that the data plane can do its job.

Additional highlights:

  • The processors used were network processors, which essentially means the CPU itself can decode packets up to L7 and make routing decisions based on that.
  • This was a customised Linux from Wind River that supported these additional CPU instructions for decoding network packets.
  • Each processor had 16 cores.
  • The processor was capable of handling 600,000 packets per second.
  • The input/output model calculated that we should be able to support 20,000 edge devices per gateway (15 packets/second/device/direction, if I remember correctly - 20,000 devices × 15 packets/second × 2 directions works out exactly to the 600,000 packets/second capacity).

Functionality

  • The gateway terminates what I will call edge devices (I am not naming the actual devices) that sit in the customer premises and connect to this gateway over the internet. These edge devices provide mobile services for the end user.
  • To ensure data encryption, each edge device creates an IPSec tunnel to the gateway, over which traffic flows to and from the edge device.

Problem Definition

As we briefly saw in the beginning, the issue manifested as an abnormal increase in the system CPU usage of the Linux kernel. The tests we performed on the gateway used simulator nodes in batches of 1,000, establishing the IPSec tunnels and, within each tunnel, the connection to the gateway. As the test progressed beyond 5,000 connections we started seeing the behaviour, and by the time it reached 15,000 connections the system was in very bad shape.

Fun fact: In order to analyse the behaviour of the system, we developed a small Perl script that periodically dumps the output of various Linux utilities like mpstat, iostat etc., which is fed into another analysis script that converts it into well-formed CSV files. These files were loaded into an Excel macro that generates all the needed graphs. Btw, stop laughing, all you folks who are used to modern monitoring tools like DataDog 😆. The world was not that easy in those days.

Challenges

  • None of the advertised proc variables gave any indication of abnormal behaviour; no parameter in the exposed kernel state was changing.
  • There was practically no way to analyse what was happening in the kernel by adding logs. The kernel executes so much code each cycle that adding logs is impractical, and it would cause the kernel to hang while emitting to the console. Even though we had access to the whole source code (it came from Wind River) and had the means to rebuild the kernel, we were helpless.
  • There were millions of lines of code, and we were unsure where to start and where to end.

The Long Route

As none of the experiments was yielding results and the release date was nearing, we had no choice but to take the long route of getting a hold on the kernel code. We decided to understand what happens when a packet enters the system by traversing down the code path. Believe me, it is not easy, but we had no other choice. This reading made us realize the following about packet interrupt handling in the Linux kernel (a minimal sketch of the pattern follows after this list).
  • Every time an interrupt happens, the interrupt routine essentially sets a flag in a 16-bit-sized element (I think it was 16 bits) and leaves the routine (hard-irq).
  • The Linux kernel has a timer that fires 4 times a second (every 250ms) per core, which is essentially a timer interrupt (IRQ).
  • This causes the kernel to pre-empt what is running on the core, save its state and start doing many things like scheduling, dealing with the interrupts that came in, etc.
  • The flag that is set in the above element decides which routine is invoked. This is done through the softirq handlers registered for that bit position.
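To make this concrete, here is a minimal sketch of the hardirq/softirq split, modelled loosely on the kernel-internal open_softirq()/raise_softirq() APIs. The handler names are hypothetical, and in the real kernel the NET_RX softirq action is registered by the networking core itself; this only illustrates the pattern, not the actual driver code.

    #include <linux/init.h>
    #include <linux/interrupt.h>

    /* Softirq action: runs later, outside hard-interrupt context,
     * and drains the packets that arrived in the meantime. */
    static void my_rx_softirq_action(struct softirq_action *h)
    {
        /* ... walk the per-CPU receive queue and process packets ... */
    }

    /* Hard IRQ: do almost nothing, just mark the softirq bit pending. */
    static irqreturn_t my_nic_hardirq(int irq, void *dev_id)
    {
        raise_softirq(NET_RX_SOFTIRQ);
        return IRQ_HANDLED;
    }

    static int __init my_init(void)
    {
        /* Register the action to invoke when the NET_RX bit is found set. */
        open_softirq(NET_RX_SOFTIRQ, my_rx_softirq_action);
        return 0;
    }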
Now the next question was: how do we measure which part of the code is taking time while packet processing happens inside the kernel? As mentioned above, in kernel space we do not have the flexibility of recording start and end times and printing logs (as we would in an application language like Java).

We managed to find a solution by doing the following (a rough sketch of the instrumentation follows after this list).
  1. Define an array of size 16 to capture the tick-counts that elapse in a function. Why size 16? Because we need to record the result for each core of the CPU; the value at each index represents one core.
  2. We store the starting tick-count and, when the routine finishes, the difference, to measure the execution cost. We are not really interested in the absolute meaning of this value, only in the relative values during our tests.
    • A tick-count is the number of times the timer event has occurred since system boot. It can be treated as a unit of time.
  3. Also note that the tick-count difference does not exactly measure how long the routine took; it also includes the time the kernel spent during the timer events re-scheduling threads on each core. But that overhead can be treated as fixed for a comparison that is looking at linear growth, and hence we ignored that aspect.
  4. Now comes the hardest part. To divide and conquer the code base of the kernel, the following was done.
    • Start this tick-count difference calculation from the beginning of the softirq method till the end.
    • Re-compile the kernel.
    • Perform the test and record the results.
    • Divide the whole function into two (kind of a binary search) and perform the tests again.
    • Whichever half shows a tick-count increase similar to that of the full method becomes our next target.
    • Repeat the whole exercise again till we narrow down to the actual piece of code that is causing this slowness.
  5. Now, you might wonder why we created an array at the beginning and recorded the tick-count difference per core into it. This is for observing the system. We defined an entirely new /proc entry that simply prints the whole array as a table. While the test executes, we periodically read this path to extract the tick-counts being consumed by the method under test. This is how we managed to let the kernel run without being interrupted while still extracting the information any time we wanted. Every read gives us the last recorded tick-count difference.
  6. This entire test cycle was repeated for almost 1.5 months before we could conclude anything. Many times I lost hope in the process, but we had no choice except to go down this path and figure out the issue.
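Here is a rough sketch of what that instrumentation looked like, reconstructed from memory with hypothetical names. jiffies is the kernel's global tick counter and smp_processor_id() identifies the core the code is running on.

    #include <linux/jiffies.h>
    #include <linux/smp.h>
    #include <linux/seq_file.h>

    #define NR_CORES 16

    /* Last measured tick-count difference, one slot per core. */
    static unsigned long tick_delta[NR_CORES];

    static void instrumented_section(void)
    {
        unsigned long start = jiffies;      /* tick-count at entry */

        /* ... the half of the softirq routine currently under test ... */

        /* Record the elapsed ticks for this core; read later via /proc. */
        tick_delta[smp_processor_id()] = jiffies - start;
    }

    /* Backing function for the custom /proc entry: dump the array as a table. */
    static int tick_delta_show(struct seq_file *m, void *v)
    {
        int cpu;

        for (cpu = 0; cpu < NR_CORES; cpu++)
            seq_printf(m, "core %2d: %lu\n", cpu, tick_delta[cpu]);
        return 0;
    }

Reading the /proc path is cheap, so the kernel keeps running undisturbed between reads.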

Solution

So what did we finally find? It's quite interesting.
  • Telecom systems use the SCTP protocol (as opposed to TCP, which is what the internet world usually uses). SCTP allows the use of redundant paths to increase resilience and reliability (I borrowed this line from the Wikipedia article). Remember this point; we will need it later when we discuss the fix.
  • The Linux kernel has support for SCTP (I am not sure whether Wind River supported it earlier than the standard Linux kernel), and as with any other protocol, the kernel maintains a list of all active SCTP connections.
  • What we observed was that whenever a packet came into the stack, the method that retrieves the connection object from the stored connections took more and more time as the number of connections increased. By design, this method is supposed to have O(1) time complexity.
  • This made us curious about the kind of data structure used for storing the connections. It was a hash table, where each entry is a linked list of the connections whose keys happen to produce the same hash value (hash collisions).
  • I guess by now you have got the hint. The hashing function used to identify connections was producing the same hash value for all of them, which essentially means the table had just one entry, and that entry was a linked list with 20,000 connections in sequence 😲😲.
  • As a result the kernel was traversing the entire linked list of 20,000 connections for every single packet that came in, and this was what drove the system CPU so high. There is really no metric, absolutely nothing, that can reflect this code execution happening inside the kernel.

The Fix

  • So why was the hash function returning the same value for every connection? Interestingly, the SCTP code was using the client-side port number to generate the hash key. For the devices in this network, the client-side port was not a random value but a fixed port number (the client-side port can legitimately be the same on all devices, since each is a distinct endpoint in the network). Because of this, the hashing function always returned the same value.
  • This made us ask: why did the implementors choose the client port as an identifier instead of a parameter of the actual path, like the IP address? Let's go back to the definition of SCTP. The whole reason SCTP exists is to use multiple paths and provide resiliency over path failures, which means one cannot tie the connection identifier to a path-related parameter such as the IP address.
  • In our case, since our use-case involved only a single path and we were not really leveraging that strength of SCTP, we decided to go with a hack: introduce the client IP address into the hashing method as well. And snap, everything was back to normal, with the retrieval function back to O(1) time complexity (a toy illustration follows after this list).
  • This was submitted back to Wind River and got integrated into the code base. But I guess it was not really the right fix, and we should have found another path-independent parameter to add to the hashing function. I suppose that was not the priority given our use-case.
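A toy illustration of the failure mode and the hack, assuming a simple modulo-based table; this is not the actual SCTP kernel code, and the names are made up.

    #include <stdint.h>

    #define HTABLE_SIZE 4096

    /* Original scheme (sketch): key derived from the client port only.
     * Every device uses the same fixed client port, so all 20,000
     * associations land in one bucket and every lookup walks the
     * whole collision list: O(n) per packet. */
    static unsigned int hash_by_port(uint16_t client_port)
    {
        return client_port % HTABLE_SIZE;
    }

    /* Our hack (sketch): fold the client IP into the key as well.
     * Distinct device IPs spread the entries across buckets,
     * restoring the O(1) lookup. */
    static unsigned int hash_by_port_and_ip(uint16_t client_port,
                                            uint32_t client_ip)
    {
        uint32_t key = client_ip ^ ((uint32_t)client_port << 16);
        return key % HTABLE_SIZE;
    }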




Sunday 16 February 2020

Minimalist Mesh for Micro Services

So the story goes like this:
  • You have container workloads running in production (nomad).
  • You are on a bare metal environment.
  • Multiple container networking software solutions are in use in different data-centers - contiv and flannel.
  • The perimeter is secured for the cluster (firewalls, WAF).
  • Service-to-service communication within the cluster is non-secure (the journey started before service mesh concepts were in place).
  • The customer insists that service-to-service communication within the cluster be over HTTPS whenever it crosses machine boundaries, even inside the perimeter.
  • An incremental, service-by-service migration approach is mandatory.

Options

Introduce a full-fledged service mesh

  • A complete networking and software stack upgrade is impossible without downtime.

Replace the existing container networking with one that supports encryption

  • Do we have one such solution which is usable in production?

Solution

“Introduce a lightweight sidecar proxy that can do this job”

Details

  • Nginx as a sidecar.
  • We added it as part of the base images we supply, and with a single flag in the Dockerfile the sidecar is launched.
  • For launching the sidecar we extended the configuration file of containerpilot, which we were already using.
  • The certificates are auto-created during the launch of the container. How did we achieve this without a control plane? We off-loaded certificate generation into the startup scripts of the container itself.
  • The next question: how can we use the same certificate authority across the cluster? The answer is to inject the intermediate CA certificate and key into the container during the startup sequence using Vault, and to use that intermediate key to sign the dynamically created certificate.
  • The sidecar used the existing variables that specify the service ports to run a reverse proxy forwarding traffic to the application process inside the container (a rough sketch of such a configuration follows after this list).
  • The containerpilot configuration also switches its behaviour to register the service on the new TLS port instead of the non-TLS port it was using before.
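To give a feel for what the sidecar does, here is a minimal sketch of a reverse-proxy server block along these lines. The ports, paths and file names are hypothetical, not our actual configuration.

    # nginx sidecar (sketch): terminate TLS and forward to the app process.
    server {
        listen 8443 ssl;                      # new TLS port registered for the service

        ssl_certificate     /etc/sidecar/certs/service.crt;  # created at container startup
        ssl_certificate_key /etc/sidecar/certs/service.key;  # signed by the Vault-injected intermediate CA

        location / {
            proxy_pass http://127.0.0.1:8080; # existing non-TLS application port
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }

And the startup-time certificate creation could look roughly like this, assuming the intermediate CA material has already been injected by Vault (again, file names and variables are illustrative):

    # Generate a key and CSR for this container, then sign it with the intermediate CA.
    openssl req -new -newkey rsa:2048 -nodes \
        -keyout /etc/sidecar/certs/service.key \
        -out /tmp/service.csr -subj "/CN=${SERVICE_NAME}"
    openssl x509 -req -in /tmp/service.csr \
        -CA /etc/sidecar/ca/intermediate.crt -CAkey /etc/sidecar/ca/intermediate.key \
        -CAcreateserial -out /etc/sidecar/certs/service.crt -days 365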

Overall, we got our first service up in production within a week, running TLS, without the application team doing anything other than setting a single variable in their configuration file.

Tuesday 20 February 2018

Etiquette in Workspace Collaboration Tools


These are a set of guidelines that help everyone play nice and fair while communicating, especially on mediums like Slack, Teams or Mattermost, be it at your workplace or in other online communities.

Remember, while we communicate over these mediums, neither our body language nor our tone of voice reaches the individuals reading our messages. This requires us to put in effort while writing replies or initiating new conversations.

Direct vs Indirect Messages

Always try to avoid overly direct English. Readers might find your directness rude and possibly offensive. By adjusting your tone, you are more likely to get a positive response from your reader.

Consider these:
  • Direct – You are doing it wrong.
  • Indirect and polite – The way it is being done might not be the right way. Do you mind checking it once?
  • Direct – You are using a non standard practice. You need to follow the steps as documented.
  • Indirect – I’m afraid that what is being followed might not be the standard practice. It will be beneficial to follow the steps as documented.
  • Direct – It’s a bad idea
  • Indirect – To be honest, I’m not sure if that would be a good idea.

We instead of You

Try to avoid the usage of “you” and substitute it with “we”. It can work wonders. The moment we use “we”, the conversation becomes inclusive, the team feeling grows and collaboration increases.

Consider these:
  • You – You need to look into the logs and figure out.
  • We – We need to look into the logs and figure out.
  • You – You have to request for it with another team. Raise a JIRA ticket.
  • We – We have to request for it with another team. Please raise a JIRA ticket.

And instead of But

Avoid the usage of “but” and substitute it with “and”. Try it out and see how the negative connotation of “but” just disappears.

Consider this:
  • But – The logs are placed in a directory but you need to browse to the path to get the details.
  • And – The logs are placed in a directory and we need to browse to the path to get the details.

Complete and Simple

Write complete and simple sentences. Do not pollute the conversation thread with one-liners, one after the other. It is not ideal for the reader to get 10 notifications on their phone, one for every line you type before pressing enter. Instead, send a single message with multiple lines and complete sentences.

Basic English Rules

Be very careful of capital letters, punctuation, spelling and basic grammar.

Nothing is more unpleasant than a reply the reader has to go over multiple times to figure out where each sentence starts and ends, riddled with grammatical and spelling mistakes.

Sunday 12 November 2017

Story Points :facepalm:

Why is it that many leaders still don’t get what story point based estimation is?

An attempt to explain the basics.

What Are Story Points?

A story point is a relative measure of the size of a story in comparison with another story. A team can maintain multiple such base stories to ease estimation, since every team member might not be familiar with every type of work. The absolute values we assign are unimportant; what matters is only the relative values.

What Values To Use For Story Points?

In practice, we can use a modified Fibonacci series (pronounced fib-uh-nah-chee), which takes these values: 0, 1, 2, 3, 5, 8, 13.

What If The Story Point is Higher Than 13?

That indicates the story is too big to be handled as a single item. Discuss and work with the product owner and split it into multiple stories as functional slices. Do not split it into pieces like "Write Code" and "Testing" as two different stories; that introduces waterfall into Scrum.

Who Does Story Point Estimates?

It's the team. There are multiple ways to do this when there are many scrum teams. The core idea is that it is a relative value, and hence, to maintain uniformity, it needs to be done in a uniform way across sprints.
  • Team representatives sit together and do these estimates during the grooming meeting for the top 10-15 items across 3-4 scrum teams. (Use planning poker.)
  • Delegate this to an individual who is well aware of the complexities and technical details of the project.

What's the Use of Story Point Estimates?

This is to achieve what is known as the burn down of Epics and Stories (this is very different from the Sprint burn down, which is based on task effort estimates in hours). This is critical for Release planning, which in turn helps in making customer commitments.

Let's take an example.
  • An Epic has a total of 100 story points. (Remember, this value of 100 itself changes over time as the team starts the implementation and more and more stories surface.)
  • The team is able to burn an average of 10 story points per sprint over the last three sprints (it's always a rolling average of the last three sprints). This is also called the velocity of the team.
  • Let's assume that out of the 100 story points, 40 are already completed.
  • So to burn the remaining 60, it will take 60/10 = 6 more sprints to complete the Epic.

Monday 21 August 2017

The so called Hi messages

Many times we get greeted by just "Hi" or "Hey" messages from our colleagues and friends on our various messaging systems, be it Slack, Mattermost, Teams, Hangouts or whatever you use. This is one of the most irritating (sorry to use that word, I couldn't find a subtler one) messages you can ever receive.


Messaging systems are meant for asynchronous messages. Expecting the person at the other end of the conversation to always be available to answer hi/hey messages is too much of an expectation. Remember, messaging someone is NOT like making a phone call. You cannot expect an immediate response.


Follow these simple etiquettes next time you message your friend or colleague.
  1. Write the complete content that you are planning to ask the person in a single message (as much as possible).
  2. Maybe you can start with a Hi, but make sure it is a multi-line message with the actual content. There is nothing more irritating than receiving 10 notifications on your mobile or laptop as someone messages you line by line.
  3. If you do not get a response, do not start nudging them immediately. There are many reasons why someone might not respond: they might be busy; they might have to check something before they respond; they might be away from their mobile/laptop; they might be on vacation.
Happy messaging.

Wednesday 1 March 2017

Testing Thoughts

I am not the original author of this post’s contents.

I recently found this email reply from John Mitchell from a discussion about the testing aspects of microservices in our project. It is so informative that I didn't want it to get lost in email. Adding it here.

(Names have been changed to keep the focus on the topic under discussion)


Hi PersonA,

I think our discussions about the Quality Criteria were pretty clear…
  • And to be even more precise, they must be able to be automatically run as part of the appropriate phases in the pipeline.
  • APIs
  • Message / file formats, etc.
  • UIs which they provide
  • Telemetry which they generate for business/application/operation/etc. monitoring/alerting/tracking/etc.
Martin’s slide that you pointed to is no excuse to get out of any of the above. He’s pointing out, as I did, that false metrics such as “code coverage” percentages are misleading and inefficient ways to measure the actual quality of the software (as opposed to trying to force the team to spend time creating “tests”).

Again, as we discussed in our last meeting, if you achieve anything even remotely close to my list above (and on the quality criteria page) you will have created software that is quite high quality.

As I noted, people must stop over-indexing on this so-called end-to-end testing to the exclusion of the rest of the criteria. That’s what’s led to the current mess that we have with the AppA bundle today. It’s precisely the fact that each of those various angles/levels of testing is done (in an automated fashion) that provides what we have talked about: being able to bracket problems for much faster triage and debugging, achieving the sorts of testability needed for multiple uses, and providing reality-based assurances to InfoSec, the app teams, the management, etc. that our stuff really does work.

Thanks,

John

Thursday 22 September 2016

Using GCViewer For GC Analysis

GCViewer, developed by Tagtraum Industries, is one of the free tools that you can use to visually analyse garbage collection logs.

Running GCViewer

The latest version is now maintained by chewiebug and available for download at http://sourceforge.net/projects/gcviewer/files/
  • Download the gcviewer-x.xx.jar to your local system
  • Launch the viewer by running java -jar path_to_gcviewer-x.xx.jar
  • You need Java version >= 1.7
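If you still need to produce a GC log in the first place, JVMs of the Java 7/8 era (which this post deals with) can write one using flags along these lines; the log and application names are just examples:

    java -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -jar your-app.jar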

Using GCViewer

  • Open the GC log file from the test run using GCViewer
  • Adjust the zoom levels (dropdown at the top) so that the graph fits the window and there are no scroll bars (to get an overview)

  • Check the trend of the 'Total Heap' usage of the VM. As long as it does not show an upward trend, the VM can be considered fine.
  • Check the right side of the tool for the information related to the run - Summary, Memory and Pause.





Summary

  • The 'Number of full gc pauses' is of concern; a healthy VM should generally not be doing full GCs at all (ideally this number is zero). Full GCs leave the VM inactive until the collection completes.

Memory

  • The 'Total heap' gives an indication on how much the VM memory is loaded.

Pause

  • The total time spent on 'gc pauses' is of interest.