  • ShoreTel switch connectivity issues over VPN

    We have a client with 5 sites that are all connected via VPN. The VPNs are terminated on Juniper SSG firewalls. We have an ongoing issue with the switches losing connectivity on the Switch Connectivity screen (we call it the Minesweeper screen). Network connectivity is there among the switches (we can ping), but we can't complete lsp_pings, so the sites show up as red on the Switch Connectivity screen.

    We can clear the sessions on the SSGs to reestablish connectivity, but at random times the switches will lose connectivity again. There is traffic shaping in place to give voice precedence over data. We initially thought there may have been large amounts of data passing through the VPNs, but we have lost connectivity after hours when nothing is going on.

    We are a Juniper reseller, so we have implemented this at other client sites and have been successful in the past. We have had some issues, but we have always been able to work through them and get it working.

    We are using route-based VPNs between sites.

    If anyone has had similar experiences, please advise.

    Thanks,

  • #2
    What are the event IDs in the HQ server's event log? There is a bug in earlier builds of 7.5 that is fixed in the latest build.

    Charles



    • #3
      What version of ScreenOS? What are your VPN settings?


      Be careful with QoS on SSGs; the ScreenOS devices do not implement QoS the way the JunOS-based devices do. Generally, we've seen it do more harm than good on ScreenOS (we are an Elite reseller and Implementation Specialist across all the product lines).



      • #4
        Never had problems with switch connectivity between sites (at least none that can be attributed to the Juniper route-based VPNs, ScreenOS versions 5.0.0r9 through 6.0). But if the lsp_pings use larger packets you may need to set/lower tcp-mss and make sure flow path-mtu is set; also, if your VPN tunnels terminate in different zones, make sure you don't have NAT enabled on the corresponding permit policies... JD
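
        For reference, a rough sketch of the ScreenOS knobs being described here (the MSS value is only an example; pick one that fits your tunnel overhead, and confirm the commands against your ScreenOS version):

        set flow tcp-mss 1350    << clamp TCP MSS on traffic through the box (example value)
        set flow path-mtu        << enable path-MTU discovery for tunneled traffic
        save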



        • #5
          If you keep a constant ICMP ping from one side of the VPN to the other (say for 24 hours) do the SG switches still lose connectivity?
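
          For example, from a Windows machine at one site toward a switch address at the other site (the address below is a placeholder), just leave this running for the day:

          ping -t 10.2.1.20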



          • #6
            Thanks for the replies:

            I'll get all the info together for the next post. Standby.



            • #7
              The event IDs that are generated are Event ID 116.

              ShoreTel version is 7.5 build 12.15.8200.0. This is one build below the latest that is out.

              The Juniper SSG 20s are on 6.1.0r4.0 using route-based VPNs. Traffic shaping had been in place previously, but we have been changing it for troubleshooting purposes.

              I have not tried the constant ping for 24 hours. I'll try that out.

              It is not always the same switches that lose connectivity.



              • #8
                This is a problem I've been working on for a long time. I'm almost loath to share what I know, since I've spent 2 years working on it, and I hate to share my hard-won knowledge with the world without any guarantee of getting any benefit for myself.

                But here goes nothing.

                So what happens is that when a tunnel goes down and comes back up, the session gets hung in the firewall. Switch A might have sessions to switches B, C, and D; when the tunnel goes down, those sessions somehow get stuck and never time out. The firewall may or may not be trying to create a new session, but the end result is the same: the old session is stuck and no traffic can pass.

                I am 99% sure the trigger for this is a tunnel going down. I am not sure whether it has any relation to routing protocols. In all the cases where we see it, the customer is using OSPF on their network, but it may or may not have an OSPF tie-in.

                We've worked with Juniper TAC on it a number of times over the last 2 years, and they have never gotten anywhere toward fixing it. I find it astounding that nobody else has these problems.

                So on to solutions. As you may have discovered, the initial fix was rebooting the firewall. We then learned we could clear sessions, but that can be intrusive to users. Then we learned about clearing only the sessions matching the source or destination IP addresses of the ShoreTel switches (clear session src-ip x.x.x.x, clear session dst-ip y.y.y.y).

                Then I found my best solution yet: clear session src-port 5440. This clears the sessions for any traffic associated with the ShoreTel LSP keepalive port, which is UDP 5440, so it fixes the symptom without disconnecting any important user traffic. The real problem is that the hung sessions occur in the first place, and for that I have not found a solution.
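
                Putting those together, the workarounds from least to most targeted look roughly like this on the SSG CLI (IPs are placeholders; the get session filter may not exist on every ScreenOS build):

                get session src-port 5440       << check for stuck LSP keepalive sessions
                clear session src-ip x.x.x.x    << clear everything from one switch
                clear session dst-ip y.y.y.y    << clear everything to one switch
                clear session src-port 5440     << clear only the LSP keepalive sessions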

                So I've shown you mine, now show me yours...



                • #9
                  Another thing

                  One thing I also forgot:

                  Another thing that helps is stopping tunnels from going down in the first place. The number one time tunnels go down is when they re-key. It is less secure, but if you increase the IKE lifetimes from 1 hour to 8 hours or longer, it will generally greatly reduce how often the problem occurs.
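
                  On ScreenOS that is the Phase 2 (AutoKey) lifetime, which defaults to 3600 seconds. A rough sketch, assuming a custom P2 proposal (the proposal, VPN, and gateway names are made up, and the exact keywords can vary between ScreenOS builds, so verify against your running config with get config | include p2-proposal):

                  set ike p2-proposal "g2-esp-3des-sha-8h" group2 esp 3des sha-1 second 28800
                  set vpn "vpn-to-siteB" gateway "gw-siteB" proposal "g2-esp-3des-sha-8h"
                  save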



                  • #10
                    NOW I'm having the hung session problem...

                    OK, I take it back - now we've seen this issue as well; we had to RMA out an SG50 switch at the HQ site and installed the replacement, but then none of the remote site switches would connect back to it (i.e., bad vibes from the Minesweeper screen).

                    The link between the sites is over a private circuit which remained up throughout (of course the switch itself was replaced). There are several Juniper/NetScreen devices in the traffic path, all of which had active sessions for port 5440:

                    [replacement switch]
                    [LAN]
                    early SSG device firmware 5.0.1p1.6_ssg <<<<< had to clear the sessions here
                    [LAN]
                    older NS device firmware 5.3.0r3 << clearing had no effect
                    [private WAN]
                    newer SSG firmware 6.0.0r4 << clearing had no effect
                    [LAN]
                    [remote site switches reporting replacement switch down]

                    One suggestion (which we'll try) is to define a custom service in the Juniper units for tcp source 5440 dest 5440, set a low timeout value for that service, and then add a high-ranking policy specifically allowing the new service so that its sessions are marked with a short timer, say 3 minutes (instead of the default 30).
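
                    A rough sketch of that idea in ScreenOS CLI terms (the service, zone, and address names are placeholders; TCP is suggested above, but the keepalive traffic discussed earlier in the thread is UDP 5440, so match whichever protocol your stuck sessions actually show; the service timeout is in minutes):

                    set service "ShoreTel-5440-short" protocol udp src-port 5440-5440 dst-port 5440-5440 timeout 3
                    set policy top from "Trust" to "Untrust" "SiteA-LAN" "SiteB-LAN" "ShoreTel-5440-short" permit
                    save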

                    Lots of other issues after last week's 8.1 and ECC 5 upgrades, but no others we can blame on the Juniper/NetScreen boxes yet!

                    JD



                    • #11
                      Does the new SG50 switch have the correct firmware on it? Can you push a firmware update to all of the existing switches? We had a problem where a switch would lose connectivity for no reason; ST had us re-install the same firmware version and it "fixed" the issue for us.



                      • #12
                        SG50

                        The SG50 flashed fine to the latest firmware (whatever is being distributed by our 8.1 install); the remote 90s and T1s all flashed fine as well. The reason we replaced the SG50 was that post-upgrade (and after multiple re-flashes) the one switch wouldn't allow outbound external calls (reorder tone). We replaced the SG50 and one of the T1 units and it all started working fine again. At the same site another SG50 and another T1 upgraded and kept working fine. We didn't get deeply into troubleshooting since we had the option to just replace the switches, and we were (and still are) more worried about problems we found after upgrading ECC to v5.



                        • #13
                          I have the same problem with a customer with 5 sites, all connected via Adtran 1335s with VPN and GRE tunnels. There are a few differences, one being that the remote site can hit the main site (voicemail, etc.) when the tunnel comes back up, but a remote site can't extension-dial another remote site. Also, QuickLook in Director at the main site sees all devices as green and happy.

                          If I reboot the ShoreGear switch, it all works fine. It's not like the routes change or anything.



                          • #14
                            To all the contributors in this thread ...

                            Is anyone still experiencing this issue? I'm on 7.5 build 12.14.6904.0, and with a recent CheckPoint upgrade cycle I've experienced the same type of symptoms.

                            The issue started with CheckPoint dropping UDP 5440 (when dport & sport are the same) inside a VPN tunnel while all other traffic was passing. I theorized that the firewall was getting the connection confused because it is very unusual for connections to have dst.port = src.port. I put a long timeout value on the service and traffic immediately started to flow, and kept flowing as monitored with a tcpdump utility. In short, CheckPoint aggressively ages UDP connections, and I theorized that the connections were being ditched from the table before communication was complete.

                            Now the ShoreTel strangeness … even after restoring traffic to the site, the switches would not respond. So I rebooted the switches and they still didn't respond. I telnet into the switches and execute an lsp_ping to another switch across a VPN tunnel: all packets get a response and 0 packets are lost. I turn up the debug level and watch each packet hit the remote switch and get a response, but ShoreWare Director still shows the switches as unable to communicate. Sites affected by the upgrade would show an "Unknown" status on all connections to other switches and a "Not Connected" status on a switch's connection to itself. The only way to clear it up was to down the entire site (switches and DVM) and bring it back up.

                            I've attached a screen shot of my Switch Connectivity screen so you can see what I'm speaking of … I still have one site out. I find it very strange to see that the system thinks a switch cannot communicate with itself.

                            A few questions that come to mind, per Charles's comment on 3/9 about the bug in 7.5:
                            What release # is the bug fixed in?
                            Is there a knowledge base article that details the bug?
                            Is this also an issue in later releases (8.x, 9.x)?


                            Is anyone still listening??? I hope so … I am continuing to work this issue to get the UDP 5440 Location Service Protocol traffic to flow smoothly through the VPN tunnel and will update with any new findings.


                            Thanks to all,
                            BK



                            • #15
                              Make sure that the VPN is set up to be UDP-aware.

                              The ShoreTel keepalive is not ICMP but UDP, so you can ping until hell freezes over; even though ICMP says there is a (Layer 1) electrical entity responding, that does not mean the SG is responding to the LSP keepalive.

                              Below is a list of the ports needed, and the transport for each, for ShoreTel; a rough ScreenOS sketch for the switch-to-switch ports follows the list.

                              Details:
                              ShoreWare Server:
                              UDP: 5004 for voice packets
                              UDP: 5440 (request and response) Location Service Protocol
                              UDP: 5441 (request and response) ShoreSIP
                              UDP: 5443 (request) UDP 5445 (response) Bandwidth Reservation Protocol
                              UDP: 5442 and 5446 DRS

                              SMTP: 25
                              HTTP TCP: 5440 for CSIS
                              MS RPC: Ranges from port 1024 through 65535 (not configurable)

                              ShoreGear Switch or Teleworker:
                              UDP: 5004 for voice packets
                              UDP: 5440 (request and response) Location Service Protocol
                              UDP: 5441 (request and response) ShoreSIP
                              UDP: 5442 Call Routing Service (DRS)
                              UDP: 5443 (request) UDP 5445 (response) Bandwidth Reservation Protocol
                              UDP: 5444 Bandwidth Reservation Service

                              UDP: 2427 IP phones listen on this port
                              UDP: 2727 Switches listen on this port
                              UDP: 67 BOOTP
                              UDP: 68 BOOTP
                              UDP: 111 RPC used to negotiate TCP ports for network call control
                              UDP: 161 SNMP
                              TCP: 21 FTP
                              TCP: 23 Telnet
                              TCP: 111 RPC
                              TCP: 513 rlogin
                              TCP: 5555 Shoreline diagnostic port (ipbxctl –diag)

                              IP Phone:
                              UDP: 5004 for voice packets
                              UDP: 2427 IP phones listen on this port
                              UDP: 2727 Switches listen on this port

                              Distributive Server:

                              UDP: 5004 for voice packets
                              UDP: 5440 (request and response) Location Service Protocol
                              UDP: 5441 (request and response) ShoreSIP
                              UDP: 5443 (request) UDP 5445 (response) Bandwidth Reservation Protocol
                              UDP: 5442 and 5446 DRS

                              SMTP: 25
                              HTTP TCP: 5440 for CSIS
                              MS RPC: Ranges from port 1024 through 65535

                              PCM Client:
                              HTTP TCP: 5440 for CSIS
                              MS RPC: Ranges from port 1024 through 65535

                              ShoreConference:
                              UDP: 5004 for voice packets
                              UDP: 2427 IP phones listen on this port
                              UDP: 2727 Switches listen on this port
                              HTTPS: 443 and 8443
                              HTTP: 80
                              SMTP 25
                              SSH: 22 for monitoring
                              NTP: 123
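
                              If Juniper boxes sit between your sites, one way to express the switch-to-switch portion of that list on an SSG is with custom services and a service group, roughly like this (names, zones, and addresses below are placeholders, and you would add further services from the list above as your deployment needs them):

                              set service "ShoreTel-Media" protocol udp src-port 0-65535 dst-port 5004-5004
                              set service "ShoreTel-Control" protocol udp src-port 0-65535 dst-port 5440-5446
                              set group service "ShoreTel"
                              set group service "ShoreTel" add "ShoreTel-Media"
                              set group service "ShoreTel" add "ShoreTel-Control"
                              set policy top from "Trust" to "Untrust" "SiteA-LAN" "SiteB-LAN" "ShoreTel" permit
                              save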

                              Another item to test is to be logged into the SG as well as the HQ switch and run lsp_ping from the telnet prompt with lsp_debug = 2. This will give you a very good indication of the direction in which a possible bottleneck may be occurring.
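
                              Pieced together from this thread, the test looks roughly like this (the addresses are placeholders for the local and far-end switches, and the exact shell syntax depends on the switch firmware):

                              telnet 10.1.1.20      << ShoreGear switch at the local site
                              lsp_debug = 2         << turn up LSP debug output
                              lsp_ping 10.2.1.20    << LSP ping the remote switch; look for responses and 0 packets lost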
                              Last edited by Jlorenz; 07-13-2009, 02:27 PM.

