Sunday, August 5, 2018

Tech Field Experiences 01 - ISP NTU output queue issue

Hi, "Tech Field Experiences" is a new series on my blog in which I will talk about real troubleshooting cases from my daily work and the theory behind them.

In the first post of this series, I would like to talk about an issue I had when working with a new Layer 2 leased line circuit provided by one of the well-known ISPs in Australia.

This is a 100Mbps Layer 2 link. Site A is the remote site and Site B is in the Data Center. The ISP provides a single-mode fibre connection at the B end, and the A end has a 100M copper hand-off. On the ISP NTU, a 100Mbps IN policer is applied, and on the CE interface, a 100Mbps OUT shaper is applied. The topology is shown in the accompanying diagram.

When the link was delivered, we tested end-to-end ping connectivity and it was good. Then we did the bandwidth testing. Upload tests with iperf from the A end to the B end were good with both TCP and UDP: we got 98 to 99Mbps throughput. However, download tests from the B end to the A end had problems. The UDP test looked acceptable at about 98 to 99Mbps, but with about 1.2% packet loss. The TCP test was very bad: we only got about 10 to 13Mbps.

As usual, we raised it with the ISP. The ISP sent a technician to site A to do a loopback test. However, the loopback test results were good for both upload and download: 100Mbps with no packet loss. After that, we got the standard ISP answer: "Cannot find any issues on the link. It's not our fault" :-)

So the hard work began: we needed to find the fault ourselves. We did a lot of isolation tests to rule out the L2 and L3 devices between the test machines on both ends. Finally, we connected a laptop at each end of the circuit and tested again (on the B end, we still used an L2 switch as a fibre-to-copper converter, as the laptop has no fibre port). At first, the result was as bad as before: TCP download speed was about 10Mbps in this PC-to-PC scenario. Then we tried hard-setting the speed/duplex on the PC at the B end to 100Mbps/full. After that, we got a good TCP download result. At last we had found the root cause, which is on the ISP NTU at the A end.

The ISP NTU at the A end only has a 100M copper interface facing the customer devices. I believe this 100M copper interface must have a very shallow output queue, which causes packet drops when it has to transmit oversubscribed traffic.

If we are not familiar with the shaping mechanism, we may say that a 100Mbps shaper has already been applied at the B end, so there is no oversubscription. But we need to look into how token-based shaping works. When shaping is applied to an interface in the OUT direction, a certain number of "tokens" is made available for packet transmission in every time cycle. The number of tokens is defined by the configured shaping rate. As packets are sent out, the number of tokens decreases, and the tokens are not refilled until the next time cycle.
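The token behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not how any particular router implements shaping: the class name `TokenBucketShaper`, the one-second cycle, and the 1500-byte packet size are all assumptions made for the example, and unused tokens are simply discarded at each refill.

```python
class TokenBucketShaper:
    """Simplified per-cycle token shaper (illustrative only)."""

    def __init__(self, rate_bps, cycle_s=1):
        # Tokens (in bits) granted at the start of every time cycle
        self.tokens_per_cycle = int(rate_bps * cycle_s)
        self.tokens = self.tokens_per_cycle

    def refill(self):
        # Start of a new time cycle: reset to the full allowance
        self.tokens = self.tokens_per_cycle

    def try_send(self, packet_bits):
        # A packet may leave only if enough tokens remain in this cycle
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False


shaper = TokenBucketShaper(rate_bps=100_000_000)  # 100Mbps shaper
packet_bits = 1500 * 8                            # assumed packet size

sent = 0
while shaper.try_send(packet_bits):
    sent += 1

print(f"packets sent in one cycle: {sent}")
print(f"tokens left over: {shaper.tokens} bits")
```

Note that nothing in this logic limits how fast the packets leave within a cycle. Once tokens are available, the interface sends at its physical line rate until the tokens run out.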

In our case, we configured a 100Mbps OUT shaper on the B end router interface, so we can only transmit 100Mbits per second. If we take the time cycle to be one second for simplicity, the 1G interface at the B end will transmit those 100Mbits within the first 100ms and then sit idle for 900ms until the next time cycle. In other words, the 100Mbits are sent as a burst at the 1G interface line rate, not spread evenly across the second.
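The busy and idle times follow directly from the ratio of the shaped rate to the line rate. The short sketch below works through that arithmetic. The function name `burst_profile` is made up for this example, and the 4ms cycle in the second call is only an illustrative assumption (real shapers typically use a much shorter cycle than one second); the busy/idle proportion stays the same either way.

```python
def burst_profile(shaped_rate_bps, line_rate_bps, cycle_s=1.0):
    """Return (busy_seconds, idle_seconds) for one shaping cycle."""
    bits_per_cycle = shaped_rate_bps * cycle_s
    busy_s = bits_per_cycle / line_rate_bps  # time spent sending at line rate
    idle_s = cycle_s - busy_s                # time waiting for the next refill
    return busy_s, idle_s


# The simplified one-second cycle used in the text
busy, idle = burst_profile(100e6, 1e9, cycle_s=1.0)
print(f"1s cycle:  busy {busy * 1000:.1f} ms, idle {idle * 1000:.1f} ms")

# An assumed 4ms cycle: same 10% busy / 90% idle split
busy, idle = burst_profile(100e6, 1e9, cycle_s=0.004)
print(f"4ms cycle: busy {busy * 1000:.1f} ms, idle {idle * 1000:.1f} ms")
```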

The NTU at the A end receives 100Mbits within 100ms. But since the interface facing the customer router is a 100M interface, it takes 1 second to transmit 100Mbits out. So oversubscription occurs in this case. To deal with oversubscription, network devices use an output buffer (output queue) to store the packets that cannot be sent out immediately. As the diagram illustrates, if the ISP NTU at the A end has enough output buffer, the 90Mbits that cannot be sent during the first 100ms are stored in the output queue and sent out afterwards.
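The 90Mbits figure is simply the burst size minus what the slower interface can drain while the burst is arriving. A small sketch of that calculation follows; the function name `required_buffer_bits` is hypothetical and the numbers use the simplified one-second cycle from the text, not measured values from the NTU.

```python
def required_buffer_bits(ingress_bps, egress_bps, burst_bits):
    """Buffer needed to absorb one burst without dropping anything."""
    burst_s = burst_bits / ingress_bps   # how long the burst takes to arrive
    drained = egress_bps * burst_s       # bits sent out during that time
    return max(0.0, burst_bits - drained)


# 100Mbits arriving at 1G, leaving on a 100M interface
need = required_buffer_bits(1e9, 100e6, 100e6)
print(f"buffer needed: {need / 1e6:.0f} Mbits ({need / 8 / 1e6:.2f} MB)")

# Same burst arriving at 100M: the egress keeps up, so no buffer is needed
need = required_buffer_bits(100e6, 100e6, 100e6)
print(f"buffer needed: {need / 1e6:.0f} Mbits")
```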

But in our case, the ISP NTU doesn't have enough output buffer, so some packets are dropped when oversubscription occurs (as the diagram for this case shows). The dropped TCP packets are retransmitted, but the TCP window never reaches its ideal size, which results in very poor throughput.

When the ISP technician came to site A and did the loopback test, he just pushed traffic into the link at a 100Mbps line rate, so no oversubscription happened. Likewise, when we configured the laptop interface at the B end as 100M/full, oversubscription didn't happen either. That is why we didn't see the poor download performance at the A end in these two scenarios.
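The difference between these scenarios can be reproduced with a toy queue simulation. This is only a sketch under stated assumptions: the function `simulate_queue`, the 1ms time step, and the buffer sizes are invented for illustration, the one-second cycle is the simplified model from the text, and the resulting drop counts are not the real loss figures of the NTU.

```python
def simulate_queue(ingress_bps, egress_bps, burst_bits, buffer_bits,
                   cycle_ms=1000):
    """Return the number of bits dropped during one shaping cycle.

    Time advances in 1ms steps. In each step, traffic arrives, the
    egress interface drains what it can, and anything still queued
    beyond buffer_bits is tail-dropped.
    """
    ingress_per_ms = ingress_bps // 1000
    egress_per_ms = egress_bps // 1000
    remaining = burst_bits
    queue = 0
    dropped = 0
    for _ in range(cycle_ms):
        arriving = min(remaining, ingress_per_ms)
        remaining -= arriving
        queue += arriving
        queue -= min(queue, egress_per_ms)      # egress drains the queue
        overflow = max(0, queue - buffer_bits)  # what the buffer cannot hold
        queue -= overflow
        dropped += overflow
    return dropped


BURST = 100_000_000  # 100Mbits per one-second cycle

# 1G burst into a 100M port with an assumed shallow 1Mbit buffer
shallow = simulate_queue(1_000_000_000, 100_000_000, BURST, 1_000_000)
# Same burst, but with a buffer deep enough to hold 90Mbits
deep = simulate_queue(1_000_000_000, 100_000_000, BURST, 90_000_000)
# Traffic arriving at 100M line rate (loopback test, hard-set laptop)
paced = simulate_queue(100_000_000, 100_000_000, BURST, 1_000_000)

print(f"shallow buffer, 1G ingress:   {shallow} bits dropped")
print(f"deep buffer, 1G ingress:      {deep} bits dropped")
print(f"shallow buffer, 100M ingress: {paced} bits dropped")
```

Only the first case drops traffic. Either a deeper buffer or an ingress rate that matches the egress rate removes the loss, which matches what we saw in the loopback test and the hard-set laptop test.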

There are two options to fix this issue. One is to get the ISP to replace the NTU at the A end with one that has a 1G customer-facing interface. However, this may take weeks or even months.

The other option is to force the traffic transmitted from the B end down to a 100Mbps line rate by adding an extra hop between the router and the switch, as shown in the diagram.


On this extra hop, 100/full is hard-set on both the switch interface and the router interface. The oversubscription now happens on the B end router instead of the NTU, and in our test the output queue on the B end router handled the oversubscribed traffic well. TCP download performance was good and we got 98 to 99Mbps as the test result.

Conclusion:

When the inbound interface speed is higher than the outbound interface speed, oversubscription will happen. In this case we need to pay attention to the output queue depth. If we control the device, we can increase the output queue depth to avoid packet loss. If we don't control the device, we may need to configure the upstream devices to send at the same line rate as the downstream interface.
