Troubleshooting the Distributed Cache Service in SharePoint 2013

Before you start troubleshooting, review the following articles and prerequisites:

a. Plan for feeds and the Distributed Cache service in SharePoint Server 2013: https://technet.microsoft.com/en-us/library/jj219572.aspx

o One of the most important things to remember – DO NOT use dynamic memory on the SharePoint 2013 servers

Extract:

The Distributed Cache service can run on either a physical or virtual server. When using virtualization, do not use Dynamic Memory to manage shared memory resources among other virtual machines and the Distributed Cache servers. The memory allocation for virtualized Distributed Cache servers must be fixed.

b. Install the latest AppFabric CU. The complete procedure is described in the following article: http://blogs.msdn.com/b/calvarro/archive/2013/08/29/points-to-consider-with-distributed-cache-on-sharepoint-2013.aspx

c. Remove any hosts file entries (%systemroot%\system32\drivers\etc\hosts) from the distributed cache servers during the first configuration, because they can cause a lot of blocking issues.

d. Firewall: open ports 3389, 135, 445, 22233, 22234, 22235 and 22236, and unblock ‘AppFabric Caching Service’ and ‘Remote Service Management’. The ICMP protocol must be allowed as well.
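
The port list above can be checked from another server with a short script. The following is only a sketch: the function name Test-TcpPort and the host names in the usage comment are hypothetical, and a closed port can also simply mean that the AppFabric Caching service is not running on the target host.

```powershell
# Sketch: test whether a TCP port accepts connections. It uses only .NET classes,
# so it does not depend on Test-NetConnection being available.
function Test-TcpPort {
    param(
        [Parameter(Mandatory = $true)][string]$HostName,
        [Parameter(Mandatory = $true)][int]$Port,
        [int]$TimeoutMs = 2000
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $pending = $client.BeginConnect($HostName, $Port, $null, $null)
        # Give up if the connection is not established within the timeout.
        if (-not $pending.AsyncWaitHandle.WaitOne($TimeoutMs)) { return $false }
        $client.EndConnect($pending)   # throws if the connection was refused
        return $true
    }
    catch { return $false }
    finally { $client.Close() }
}

# Usage (hypothetical host names, replace with your own cache hosts):
# foreach ($h in 'cachehost1.domain.local', 'cachehost2.domain.local') {
#     foreach ($p in 22233, 22234, 22235, 22236) {
#         "{0}:{1} open = {2}" -f $h, $p, (Test-TcpPort -HostName $h -Port $p)
#     }
# }
```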

Troubleshooting options

1. We will start with the following two commands:

Use-CacheCluster

Get-CacheHost

These commands return the AppFabric reference of the cache cluster. You can run them on any server that is a cache host. If there are multiple cache hosts in the cluster, run the commands on all of them and confirm that the output is identical.

If Use-CacheCluster throws an error, the node is NOT part of the AppFabric cluster.
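
If you repeat this check on several servers, the two commands can be wrapped so that a failure is reported as a warning instead of a raw error. This is a minimal sketch; the function name Test-CacheClusterMembership is hypothetical, and it assumes the AppFabric cmdlets are already loaded in the session (for example in the SharePoint 2013 Management Shell).

```powershell
# Sketch: report whether this server can connect to the AppFabric cache cluster.
function Test-CacheClusterMembership {
    try {
        # -ErrorAction Stop turns any failure into an exception handled below.
        Use-CacheCluster -ErrorAction Stop
        Get-CacheHost -ErrorAction Stop
    }
    catch {
        Write-Warning ("{0} does not appear to be part of the cache cluster: {1}" -f $env:COMPUTERNAME, $_.Exception.Message)
    }
}

# Usage, on each cache host:
# Test-CacheClusterMembership
```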

2. The next step is to get the SharePoint reference of the cache cluster from the SharePoint configuration database:

$Farm = Get-SPFarm

$cacheClusterName = "SPDistributedCacheCluster_" + $Farm.Id.ToString()

$cacheClusterManager = [Microsoft.SharePoint.DistributedCaching.Utilities.SPDistributedCacheClusterInfoManager]::Local

$cacheClusterInfo = $cacheClusterManager.GetSPDistributedCacheClusterInfo($cacheClusterName)

$cacheClusterInfo

$cacheClusterInfo.CacheHostsInfoCollection | fl

The cache hosts returned in step 1 should match the cache hosts returned in step 2.
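
With more than a couple of hosts, the comparison is easier to do in code than by eye. The sketch below compares two lists of host names and reports the differences. The function name Compare-CacheHostList is hypothetical, and the HostName property used in the usage comment is an assumption about the objects returned in steps 1 and 2; check it against your own output.

```powershell
# Sketch: compare the host list seen by AppFabric with the one stored by SharePoint.
function Compare-CacheHostList {
    param([string[]]$AppFabricHosts, [string[]]$SharePointHosts)

    # Normalize to lower-case short names so FQDN and short forms compare equal.
    # (Assumes the inputs are host names, not IP addresses.)
    $normalize = { param($n) ($n -split '\.')[0].ToLowerInvariant() }
    $a = @($AppFabricHosts  | Where-Object { $_ } | ForEach-Object { & $normalize $_ } | Sort-Object -Unique)
    $s = @($SharePointHosts | Where-Object { $_ } | ForEach-Object { & $normalize $_ } | Sort-Object -Unique)

    New-Object PSObject -Property @{
        OnlyInAppFabric  = @($a | Where-Object { $s -notcontains $_ })
        OnlyInSharePoint = @($s | Where-Object { $a -notcontains $_ })
    }
}

# Usage (property names are assumptions):
# $appFabric  = Get-CacheHost | ForEach-Object { $_.HostName }
# $sharePoint = $cacheClusterInfo.CacheHostsInfoCollection | ForEach-Object { $_.HostName }
# Compare-CacheHostList -AppFabricHosts $appFabric -SharePointHosts $sharePoint
```

Both result lists are empty when the two views agree.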

If you want to check the results in the configuration database you can execute the following query:

SELECT * FROM [SharePoint_Config].[dbo].[CacheClusterConfig] WHERE [EntryType] LIKE '%hosts%'

3. Remove the AppFabric host on each Cache Host

Stop-SPDistributedCacheServiceInstance -Graceful

Remove-SPDistributedCacheServiceInstance

Then check the results using:

Use-CacheCluster

Get-CacheHost

4. Add Distributed Cache on the server:

Add-SPDistributedCacheServiceInstance

Things to check after running the above command on a host:

Central Admin > Services on server > Server > Distributed cache service needs to be Started.

Host > services.msc > AppFabric Caching Service needs to be Started, and it should stay running rather than crash after a few seconds or minutes.
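
Instead of refreshing services.msc by hand, you can poll the Windows service for a couple of minutes. This is a sketch under stated assumptions: the function name Watch-ServiceStability is hypothetical, the service name AppFabricCachingService is taken from the instance name used later in this article, and the two-minute window is an arbitrary choice.

```powershell
# Sketch: return $true if the service stays in the Running state for the whole window.
function Watch-ServiceStability {
    param(
        [string]$ServiceName = 'AppFabricCachingService',
        [int]$DurationSeconds = 120,
        [int]$IntervalSeconds = 10
    )
    $deadline = (Get-Date).AddSeconds($DurationSeconds)
    do {
        try   { $svc = Get-Service -Name $ServiceName -ErrorAction Stop }
        catch { $svc = $null }   # service not found (or not queryable)

        if (-not $svc -or $svc.Status -ne 'Running') {
            Write-Warning "$ServiceName is not running"
            return $false
        }
        Start-Sleep -Seconds $IntervalSeconds
    } while ((Get-Date) -lt $deadline)
    return $true
}

# Usage, on the cache host right after Add-SPDistributedCacheServiceInstance:
# Watch-ServiceStability
```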

5. Refresh the distributed cache service

Use-CacheCluster

Get-CacheHost

Stop-CacheCluster

Start-CacheCluster

6. If the issue is still present, go further by unprovisioning and deleting the service instance on the faulty server. DO NOT execute the following commands in a healthy environment.

$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"

$serviceInstance = Get-SPServiceInstance | ? {($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq $env:computername}

$serviceInstance.Unprovision()

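Then list the service instances, note the ID of the Distributed Cache instance on this server, and delete that instance:

Get-SPServiceInstance | fl TypeName, ID

$s = Get-SPServiceInstance <ID>

$s.Delete()

Then add the server back to the cluster: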

Add-SPDistributedCacheServiceInstance

7. The previous step can fix the issue but can, at the same time, generate orphan entries in the configuration database. If an orphan entry is generated, you have to reinstall the faulty server and add it back to the farm with a different name.

a. Export-CacheClusterConfig -File c:\config.xml

b. Remove the entry of the faulty host from the exported file

For one cache host in the cluster, the hosts section usually looks like this:

<hosts>

<host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"

hostId="id" size="400" leadHost="true" account="domain\account"

cacheHostName="AppFabricCachingService" name="servername.domain.local"

cachePort="22233" />

</hosts>

c. Stop-CacheCluster

d. Import-CacheClusterConfig -File "path\file.xml"

e. Start-CacheCluster

To check whether you have any orphans after executing these commands, use the following query:

SELECT TOP 1000 [Id], [ClassId], [ParentId], [Name], [Status], [Version], [Properties] FROM [SharePoint_Config].[dbo].[Objects] WHERE [Properties] LIKE '%cachehost%'

In the result of this query, each host needs to have two entries. If a server has only one entry, you have generated an orphan item. From here you can invest weeks in reapplying different tips and tricks if you want, but the result will be the same: reinstall the server with a different name and add it back to the farm.
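
The two-entries-per-host rule can be checked mechanically once you have one host name per returned row. The sketch below only does the counting. The function name Find-OrphanCacheHost is hypothetical, and how you extract the host name from each row is left open because it depends on your query output.

```powershell
# Sketch: list hosts whose number of rows differs from the expected two entries.
function Find-OrphanCacheHost {
    param(
        [string[]]$HostNamePerRow,      # one host name per row returned by the query
        [int]$ExpectedEntries = 2
    )
    $HostNamePerRow | Where-Object { $_ } | Group-Object |
        Where-Object { $_.Count -ne $ExpectedEntries } |
        Select-Object @{ Name = 'Host'; Expression = { $_.Name } }, Count
}

# Usage (illustrative values): srv2 has a single row and is reported.
# Find-OrphanCacheHost -HostNamePerRow 'srv1', 'srv1', 'srv2'
```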