I am a software developer building an IoT water treatment product that is currently in soft release. (https://dropconnect.com) The core of the system is a WiFi-enabled hub that communicates with Amazon's AWS IoT Core product. We have several customers using HughesNet Gen5 HT2000W equipment whose connections prevent our product from communicating with AWS. The TLS handshake gets about 80% of the way through and then stalls when an expected response from AWS never arrives. The problem only happens over a HughesNet satellite link, so we purchased Gen5 equipment and entered a two-year HughesNet contract specifically to diagnose this problem.
So far, I've proven that network latency is not a contributing factor, DNS lookups are not part of the problem, and the state of the firewall and web acceleration features on the HT2000W have no effect on the problem. I've tested our device on a HughesNet Gen4 connection and it works just fine, so the hangup appears to be specific to the Gen5 platform. Basic communication with the server at Amazon is not a problem; by the time the TLS handshake fails, there have already been several rounds of back-and-forth communication with the server. The TLS handshake is performed using the Mbed TLS library (https://tls.mbed.org/) which is solid and commonly used on embedded IoT products for encryption. I can provide much more information about exactly what is happening during the TLS handshake, but for now I'll save that for someone who is interested...
Can anyone suggest features or behavior of the HughesNet Gen5 service that might be contributing to this TLS handshake failure? What is the best method for getting the attention of technical engineers at HughesNet that could help diagnose and solve this problem? Any ideas that could help me chase down and solve this problem would be appreciated.
Chandler Systems, Inc.
Just adding an update for closure on this thread...
In the end, the problem was corruption of the TLS handshake caused by a default receive buffer in Microchip's TCP/IP library being too small. I haven't completely studied the cause yet, but it appears that traffic received over a HughesNet link arrives with a larger-than-typical MTU, or something along those lines, so incoming segments were overrunning the default buffer. Simply enlarging that buffer made the problem disappear.
I'm glad you found the community, thank you for posting. Wow, I admire your dedication to finding a solution. Let me send this over to our engineers for their input. I'll post back once I hear anything.
Good news, I got a quick turnaround from engineering and they are interested in looking at this. They'd like to get in touch with you. Although I have your SAN, it looks like it's for Mr. Chandler. Please privately message me your contact information so an engineer can reach out.
If I could chime in...
I noticed something similar when going to my bank's web page yesterday (I just didn't have time to report it), which also hung on a TLS handshake to AWS. To add to this, it seems to happen very sporadically (or transiently, as I explain below) and eventually clears.
AWS switches their IPs around very often (a pure annoyance from a web security standpoint, imo). It's very possible that when they do that, the IP caching used in the DNS acceleration may get confused and try to handshake with the wrong IP, thus causing a TLS error. If that's the case, there might need to be exceptions made for AWS and any other cloud/server farms that tend to do the same thing, like DigitalOcean, etc.
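The stale-cache scenario described above can be sketched as follows (a toy model, not HughesNet's actual DNS acceleration; the hostname and IPs are made up): a cache that holds a resolved IP past the point where the provider rotates addresses will keep handing back an IP that no longer serves that hostname until the TTL expires.

```python
class DnsCache:
    """Toy DNS cache with a fixed TTL, driven by an explicit clock for clarity."""
    def __init__(self, ttl: int):
        self.ttl = ttl
        self.entries = {}  # hostname -> (ip, time_cached)

    def resolve(self, hostname: str, authoritative: dict, now: int) -> str:
        ip, cached_at = self.entries.get(hostname, (None, -self.ttl - 1))
        if now - cached_at > self.ttl:
            ip = authoritative[hostname]        # entry expired: refresh from the real resolver
            self.entries[hostname] = (ip, now)
        return ip

authoritative = {"api.example.com": "52.0.0.1"}
cache = DnsCache(ttl=300)

print(cache.resolve("api.example.com", authoritative, now=0))    # 52.0.0.1
authoritative["api.example.com"] = "52.0.0.2"                    # provider rotates the IP
print(cache.resolve("api.example.com", authoritative, now=100))  # still 52.0.0.1 (stale window)
print(cache.resolve("api.example.com", authoritative, now=400))  # 52.0.0.2 after TTL expiry
```

A handshake attempted against the stale IP during that window would fail and then clear on its own once the cache refreshes, which would match the sporadic behavior described here.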
I'd venture to guess this is part of the problem people were having going to amazon.com recently, as well.
Thanks for chiming in, I just noticed your post. Let me also send this over to the engineers for their information.
I have been in communication with an engineer at the ARM Mbed TLS group, and his latest response, after looking at the detailed logging of a failed connection over a HughesNet link, is that the server side at Amazon likely terminated the connection silently due to an inconsistency in a TLS handshake packet. His guess was a bad MAC or something similar, and my interpretation is that some sort of caching or acceleration mechanism may be getting in the way. (No conclusive evidence at all, of course.)
However, one of the tests I performed was to look up a valid IP for our API endpoint at Amazon and then hard-code that IP into one device trying to connect over a HughesNet link. It spent the better part of a day trying to connect to that one IP and never succeeded. (I proved at the beginning and the end that the IP was functional by switching to a non-HughesNet internet link; it connected immediately.) Also, if it were a caching issue, you'd expect at least an occasional success. I have two units in the field on HughesNet connections, and they've been trying to connect to Amazon every few minutes for over two months. In that timeframe, neither one of them has ever finished a single TLS handshake. That 100% failure rate makes my issue feel different than what you've observed...
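A small sketch of why in-transit corruption or truncation shows up to the receiver as a "bad MAC" (a toy HMAC model in Python, not actual TLS record protection; the key is made up): a peer that receives only part of a protected record ends up checking a MAC over the wrong bytes and rejects the record, typically by silently dropping the connection.

```python
import hashlib, hmac

KEY = b"toy-session-key"  # stands in for the keys negotiated during the handshake

def protect(payload: bytes) -> bytes:
    # Append a MAC over the payload, loosely mimicking TLS record protection.
    return payload + hmac.new(KEY, payload, hashlib.sha256).digest()

def verify(record: bytes) -> bool:
    payload, mac = record[:-32], record[-32:]
    return hmac.compare_digest(mac, hmac.new(KEY, payload, hashlib.sha256).digest())

record = protect(b"\x0b" + b"\x00" * 1459)   # toy Certificate handshake message
assert verify(record)                         # intact record verifies

mangled = record[:1024]                       # tail lost somewhere along the path
print(verify(mangled))                        # False: the receiver sees a "bad MAC"
```

This is only a model of the symptom, not evidence of where the corruption happens; it just shows how any loss or alteration of record bytes looks identical to a MAC failure from the server's side.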
Good morning Patrick,
Thank you for PMing me your contact info. One of our engineers informed me he'll be reaching out to you soon. Looking forward to some productive findings!
...That 100% failure rate makes my issue feel different than what you've observed...
Just a thought... could it be a timeout? I'm seeing an increased level of delays out of CenturyLink again, especially wrt TLS-related oauth timeout failures between my webserver hosted in PA and Twitter. That route is outside of HughesNet, but CL is an upstream provider to many HN groundstations.
I can't rule that out, of course, but it doesn't feel like a timeout problem. I simulated a network connection with 2,500ms latency and the TLS handshake completed normally over a landline internet connection. I also added an arbitrary 15 second delay between TLS handshake steps, and the handshake completed normally despite taking a very long time to finish. When our device tries to connect to AWS over a HughesNet connection, the steps that lead up to the missing packet don't take any longer than expected, aside from the overhead of the latency inherent to the link.
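The delay test described above can be reproduced in miniature (a toy handshake state machine in Python, not the Mbed TLS code; step names and delays are illustrative): adding a pause between flights only slows the handshake down, and only an explicit timeout shorter than the delay can abort it.

```python
import time

STEPS = ["ClientHello", "ServerHello", "Certificate",
         "ServerHelloDone", "ClientKeyExchange", "Finished"]

def run_handshake(step_delay_s: float, timeout_s: float) -> str:
    """Toy handshake: each flight takes step_delay_s; abort if a flight exceeds timeout_s."""
    for step in STEPS:
        time.sleep(step_delay_s)          # simulated link latency or injected delay
        if step_delay_s > timeout_s:
            return f"timeout waiting for {step}"
    return "handshake complete"

print(run_handshake(step_delay_s=0.05, timeout_s=10.0))   # slow link, generous timeout: completes
print(run_handshake(step_delay_s=0.05, timeout_s=0.01))   # a real timeout aborts at the first flight
```

A timeout failure would show a flight arriving late and then an abort; in the traces described here, the flights arrive on schedule and then one packet simply never comes, which is why a timeout seems unlikely.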
I'm just beginning a conversation with an engineer at HughesNet, and I'll ask about the possibility of a timing problem if we don't head down a more likely path first. Thanks for continuing to add ideas to the mix...