Forum Discussion
TLS handshake failure between IoT product and AWS
- 7 years ago
Mark,
I have been in communication with an engineer at the ARM Mbed TLS group. After looking at the detailed logging of a failed connection over a HughesNet link, his latest response is that the server side at Amazon likely silently terminated the connection due to an inconsistency in the TLS handshake packet. His guess was that it might be a bad MAC or something similar, and my interpretation is that it might be due to some sort of caching or acceleration mechanism that's getting in the way. (No conclusive evidence at all, of course.)
However, one of the tests I performed was to look up a valid IP for our API endpoint at Amazon and then hard-code that IP into one device trying to connect over a HughesNet link. It spent the better part of a day trying to connect to that one IP and never succeeded. (I proved at the beginning and the end that the IP was functional by switching to a non-HughesNet internet link; it connected immediately.) Also, if it were a caching issue, you'd think we would see at least an occasional success while trying to connect. I have two units in the field on HughesNet connections, and they've been trying to connect to Amazon every few minutes for over two months. In that timeframe, neither one of them has ever finished a single TLS handshake. That 100% failure rate makes my issue feel different from what you've observed...
Patrick
pfrazer wrote: ...That 100% failure rate makes my issue feel different than what you've observed...
Interesting.
- MarkJFine (Professor), 7 years ago
Just a thought... could it be a timeout? I'm seeing an increased level of delays out of CenturyLink again, especially wrt TLS-related oauth timeout failures between my webserver hosted in PA and Twitter. That route is outside of HughesNet, but CL is an upstream provider to many HN groundstations.
- pfrazer (Freshman), 7 years ago
I can't rule that out, of course, but it doesn't feel like a timeout problem. I simulated a network connection with 2,500 ms latency, and the TLS handshake completed normally over a landline internet connection. I also added an arbitrary 15-second delay between TLS handshake steps, and the handshake still completed normally despite taking a very long time to finish. When our device tries to connect to AWS over a HughesNet connection, the steps that lead up to the missing packet don't take any longer than expected, aside from the overhead of the latency inherent to the link.
I'm just beginning a conversation with an engineer at HughesNet, and I'll ask about the possibility of a timing problem if we don't head down a more likely path first. Thanks for continuing to add ideas to the mix...
Patrick
- pfrazer (Freshman), 7 years ago
Just adding an update for closure on this thread...
In the end, the problem was corruption of the TLS handshake caused by a default buffer size in Microchip's TCP/IP library being too small. I haven't completely studied the cause yet, but it appears that traffic received via a HughesNet link uses a larger-than-typical MTU setting or something along those lines. Simply resizing that buffer made the problem disappear.
Patrick
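
For readers hitting something similar, here is a minimal sketch of how an undersized receive buffer could leave a TLS record incomplete. It is only an illustration of the general failure mode, not Microchip's code: the thread does not describe the actual mechanism inside the library. The function names, the 2048 and 4096 byte capacities, the 1460 byte segment size, and the "silently drop what doesn't fit" behavior are all assumptions made for the example. The only fixed facts used are the 5-byte TLS record header (type, version, length) and handshake content type 22.

```python
import struct

TLS_HEADER_LEN = 5  # content type (1 byte) + version (2 bytes) + length (2 bytes)


def make_record(payload: bytes, content_type: int = 22) -> bytes:
    """Build a TLS 1.2-style record (content type 22 = handshake)."""
    return struct.pack("!BHH", content_type, 0x0303, len(payload)) + payload


def receive_into_fixed_buffer(segments, capacity):
    """Hypothetical receive path: copy segments into a fixed-size buffer.

    Assumption for illustration: bytes that do not fit are silently dropped.
    Returns the buffered bytes and the number of bytes dropped.
    """
    buf = bytearray()
    dropped = 0
    for seg in segments:
        room = capacity - len(buf)
        buf += seg[:room]
        dropped += max(0, len(seg) - room)
    return bytes(buf), dropped


def record_is_complete(data: bytes) -> bool:
    """Check whether the buffer holds the full record its header announces."""
    if len(data) < TLS_HEADER_LEN:
        return False
    _type, _version, length = struct.unpack("!BHH", data[:TLS_HEADER_LEN])
    return len(data) >= TLS_HEADER_LEN + length


# A 3000-byte handshake payload (roughly the size of a certificate chain),
# delivered in segments of up to 1460 bytes (an assumed segment size).
record = make_record(b"\x0b" * 3000)
segments = [record[i:i + 1460] for i in range(0, len(record), 1460)]

for capacity in (2048, 4096):  # hypothetical "default" vs. "resized" buffer
    data, dropped = receive_into_fixed_buffer(segments, capacity)
    print(capacity, len(data), dropped, record_is_complete(data))
```

With the smaller capacity the header still announces 3000 bytes of payload, but the buffer never holds them all, so a TLS layer reading from it would see a truncated or inconsistent handshake message. That would be consistent with a handshake that fails every time rather than intermittently.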