NetApp ONTAP 9.8 – FlexCache SMB Overview

NetApp FlexCache added SMB support and more in ONTAP 9.8. NFSv3 was already supported; the new capabilities, including SMB, are listed below. This is one of my favorite additions to ONTAP because it lets you scale out a central CIFS share natively. Some of the information in this blog is consolidated from NetApp Docs at https://docs.netapp.com and the NetApp FlexCache Technical Report (currently at version 9.7) at https://www.netapp.com/pdf.html?item=/media/7336-tr4743pdf.pdf

FlexCache Overview

  • FlexCache is a persistent read/write cache of a volume that can improve performance by providing load distribution, reduced latency by locating data closer to the point of client access, and enhanced availability by serving cached data in a network disconnection situation.  
  • A FlexCache volume is a sparse copy where some files from the origin volume are cached. When a FlexCache volume is created, a FlexGroup volume is created by default with 4x constituent volumes. 
  • The cache is instant with no data transfer to create the cache
  • Similar to SnapMirror, the cache mechanism communicates over InterCluster LIFs and uses cluster and SVM (vserver) peering when caching to a different cluster and/or SVM. The cache can be remote on another ONTAP cluster, or local on the same cluster and even the same SVM. InterCluster peering supports TLS for encryption on the wire, and both the source and destination can encrypt at rest with NSE, NVE or NAE.
  • In FlexCache terms, the origin is the source volume, and the caches are the remote volumes.
  • FlexCache works as an origin or cache on any ONTAP cluster, hardware and software, on AFF, FAS, ONTAP Select (OTS) and Cloud Volumes ONTAP (CVO).  
  • You can mix different disk and tier types, for example, the origin/cache can be any mix of HDD, SSD, FlashPool (HDD+SSD), and FabricPool (performance tier + object capacity tier).
  • FlexCache supports disconnected mode which allows reads but not writes to cached files while disconnected.
  • FlexCache enables operational efficiencies for backup and disaster recovery since the origin is the only site that needs backup and replication.
  • FlexCache has been around for many years for NFS, even on legacy 7-Mode systems. FlexCache in ONTAP 9 uses a more efficient and faster Remote Access Layer (RAL) protocol compared to the legacy 7-Mode NetApp Remote Volume (NRV) protocol.

New ONTAP 9.8 Features

  • FlexCache for SMB 2.x and 3.x shares, with file locking. Locks are managed by the origin (source volume) and honored both locally and across all FlexCache volumes
  • A SnapMirror destination (DP) volume can now be the origin of a FlexCache volume
  • Block Level Invalidate (BLI) for more efficiency in the cache (prior ONTAP releases only invalidated at the file level). Note that BLI is disabled by default on origin volumes.
  • Fan-out from 1x origin volume to 100x cache volumes.  Prior to ONTAP 9.8, the ratio was 1 to 10
  • Pre-populate of directories in the cache. Prior to ONTAP 9.8, or as another option, you can use the NFS “find” or Windows PowerShell “Measure-Command” commands to populate the cache. I will cover this in my next blog with examples from ONTAP and the clients
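
As a rough sketch of what a client-side walk such as “find” does, the Python below traverses a mounted path and stats every file it finds. The function name warm_cache and the example mount point are illustrative assumptions, not NetApp tooling. It only touches metadata, does not read file contents, and is not the ONTAP 9.8 pre-populate feature itself.

```python
import os
import tempfile


def warm_cache(root):
    """Walk a mounted path and stat every file, similar to a client-side `find`.

    Returns (directories_visited, files_seen). Only metadata is requested;
    file contents are never read.
    """
    dirs_visited = 0
    files_seen = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        dirs_visited += 1
        for name in filenames:
            try:
                os.stat(os.path.join(dirpath, name))
                files_seen += 1
            except OSError:
                # Skip entries that vanish or cannot be accessed mid-walk.
                continue
    return dirs_visited, files_seen


if __name__ == "__main__":
    # Demo against a throwaway local tree. In practice `root` would be the
    # client mount point of the cache volume (hypothetical, e.g. /mnt/cache).
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "a", "b"))
        for rel in ("top.txt", os.path.join("a", "b", "deep.txt")):
            with open(os.path.join(root, rel), "w") as fh:
                fh.write("x")
        print(warm_cache(root))
```

Whether a metadata-only walk is enough, or whether file data also needs to be read, depends on what you want warm in the cache.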

Free Licensing!

  • One of the best things about this technology is that the FlexCache feature is free!  Starting from ONTAP 9.7, FlexCache no longer needed a license.  
  • For ONTAP 9.5 and 9.6, there is a free master license key for up to 400TB, valid through 2099, at https://mysupport.netapp.com/NOW/knowledge/docs/olio/guides/master_lickey/
  • For ONTAP prior to 9.5, work with your NetApp sales and support teams for the license, but I highly recommend you upgrade to 9.8 for the new features and the free use without a license

FlexCache Limits (check the FlexCache Power Guide in NetApp Docs for the latest)

Best Practices

  • To avoid invalidations on files that are cached when there is only a read at the origin, turn off last accessed time updates on the origin volume
    • volume modify -vserver origin-svm -volume vol_origin -atime-update false
  • Try not to use applications that confirm writes with a read-after-write. The write-around nature of FlexCache can cause delays for such applications
  • Set the following bootarg to revert the RAL or FlexGroup behavior to the previous behavior so that the “ls” command does not hang in disconnected mode
    • node run <node> "priv set diag; flexgroup set fast-readdir=false persist"
  • Create the FlexCache with the -aggr-list option so it creates the prescribed number of constituents (the default is 4x constituents)
    • Always use the -size option for the FlexCache create to specify the FlexCache volume size
  • Cache size should be larger than the largest file
    • Because a FlexCache is a FlexGroup, a single constituent should not be any smaller than the largest file that must be cached. There is one constituent by default, so the FlexCache size should be at least as large as the largest file to be cached.
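
The largest-file rule above can be checked with a little arithmetic. This is a minimal sketch that assumes the FlexCache size is split evenly across its constituents; the function names are hypothetical, and the constituent count is a parameter so you can plug in whatever your FlexCache actually has.

```python
def min_cache_size_gb(largest_file_gb, constituents):
    """Smallest FlexCache size (GB) at which one constituent can hold the largest file.

    Assumption: the FlexCache size is divided evenly across constituents.
    """
    if constituents < 1:
        raise ValueError("constituents must be at least 1")
    return largest_file_gb * constituents


def largest_file_fits(cache_size_gb, largest_file_gb, constituents):
    """True if a single constituent is at least as large as the largest file."""
    if constituents < 1:
        raise ValueError("constituents must be at least 1")
    # Equivalent to cache_size_gb / constituents >= largest_file_gb,
    # written without division to avoid rounding surprises.
    return cache_size_gb >= largest_file_gb * constituents


if __name__ == "__main__":
    # Hypothetical numbers: a 50GB largest file and 4 constituents.
    print(min_cache_size_gb(50, 4))
    print(largest_file_fits(100, 50, 4))
```

With one constituent the minimum is simply the largest file size; with more constituents the minimum grows in proportion.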

Sizing – it depends (on what? see the TR-4743 examples)

  • The cache size can be the same or smaller than the origin volume
    • Best practice is at least 10% of the origin size (see sizing examples below)
  • The working set determines the cache size.  Auto Grow may be useful (see setup of Autogrow in my next blog)
    • Working set – If the origin volume has 1TB of data in it, but a particular job only needs 75GB of data, then the optimal size for the FlexCache volume is the working set size (75GB) plus overhead (approximately 25%). 
      • In this case, 75GB + 25% = 93.75GB or 94GB
    • The other method to determine optimal cache volume size is to take 10-15% of the origin volume size and apply it to the cache. For a 50TB origin volume size, a cache should be 5TB to 7.5TB in size. You can use this method in cases where the working set is not clearly understood and then use statistics, used sizes, and other indicators to determine the overall optimal cache size
      • 100GB for 1TB origin for example
    • Read/Write %s – The rule of thumb for FlexCache is a read/write mix of at least 80% reads and no more than 20% writes at the cache. This ratio works because of the write-around nature of FlexCache: writes incur a latency penalty because each write operation is forwarded to the origin. FlexCache does allow a higher write percentage, but it is not optimal for the way FlexCache in ONTAP processes writes
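
The sizing rules above reduce to simple arithmetic. Here is a minimal sketch using the numbers from the examples: working set plus about 25% overhead, 10-15% of the origin size, and the 80% read rule of thumb. The function names and default parameters are my own illustration, not an ONTAP or NetApp API.

```python
def size_from_working_set(working_set, overhead_percent=25):
    """Cache size = working set plus overhead (same unit as the input)."""
    return working_set * (100 + overhead_percent) / 100


def size_from_origin(origin_size, low_percent=10, high_percent=15):
    """Cache size range as a percentage of the origin volume size."""
    return (origin_size * low_percent / 100, origin_size * high_percent / 100)


def read_heavy_enough(reads, writes, min_read_percent=80):
    """True if reads make up at least min_read_percent of all operations."""
    total = reads + writes
    if total == 0:
        raise ValueError("no operations to evaluate")
    return reads * 100 >= min_read_percent * total


if __name__ == "__main__":
    print(size_from_working_set(75))   # 75GB working set -> 93.75 (GB)
    print(size_from_origin(50))        # 50TB origin -> (5.0, 7.5) (TB)
    print(read_heavy_enough(80, 20))   # 80/20 mix meets the rule of thumb
```

Use the working-set method when you know what a job reads, and fall back to the percentage method when you do not.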

Global File Cache (GFC) Comparison

  • NetApp Global File Cache (GFC) comes from the NetApp acquisition of Talon and is another SMB caching option worth discussing, since the two products are similar. Below are some comparisons, and my opinion on when one or the other is the best option. GFC is SMB only and is licensed per remote-site virtual Windows Server 2016 or 2019 instance at about 4K list price per year.
  • You can mix and match GFC and FlexCache. For example you may have a FAS8700 serving CIFS at a central location with an ONTAP FlexCache at a remote site with a FAS2720, another remote FlexCache with ONTAP Select, and another remote site with a GFC virtual Windows instance. The FAS8700 can Mirror and Vault to another location for DR and Backup without backup needed at the remote sites.
  • When to use ONTAP FlexCache
    • When all sites already have ONTAP storage
    • When you need to cache NFS
  • When to use GFC
    • When remote sites do not have ONTAP storage
    • When the origin (source) is Cloud Volumes Service (CVS) or Azure NetApp Files (ANF)
      • CVS and ANF are storage-as-a-service native cloud offerings that are not supported with FlexCache
  • When it depends
    • At a new remote site, you can use an ONTAP Select (OTS) VM or a Global File Cache (GFC) Windows Server instance. With GFC you also need a VM at the origin (source) site. There is no right answer and your NetApp team can provide budget options to see which is the best fit, and in some cases you may want to use both.

Use cases (credit to the NetApp 9.8 EAP docs for the use cases and images)

  • Global File System using sparse cache volumes to remote sites
  • Large data sets
  • No need for full replication or multiple copies.  Keep 10-15% (sparse copy) cached where needed with a single master copy
  • Best fit for workloads that are at least 80% reads
  • Hot volume performance load balancing
  • Software build (Git)
  • Common tool distribution
  • Cloud bursting, acceleration, caching
  • Stretched NAS on MCC
  • Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL)
  • ASIC Electronic Design Automation (EDA)
  • Media and computer generated imagery (CGI) rendering

FlexCache extends the volume namespace beyond the cluster that currently serves the volume. This brings the data physically closer to the resources that need it through a sparse mechanism, so that only the data being requested is cached in the remote cluster. The remote resource can then bypass the WAN and read the data from the local cluster that hosts the FlexCache volume

  • Caching to and from Cloud
    • FlexCache in the cloud with the origin volume on-prem, OR FlexCache on-prem with the origin volume in the cloud, OR both the FlexCache volume and the origin volume in the cloud
  • FlexCache with ONTAP Select
  • FlexCache for Cloud Bursting
  • When using the cloud for a compute resource, you can utilize FlexCache to bring your data to the cloud immediately.  No waiting for replication, no waiting for an initial sync, just create the FlexCache and go. Create FlexCaches in multiple clouds to balance resources and leverage less expensive resources

FlexCache for Cloud Acceleration

  • For those employing a “cloud forward” strategy, cloud caching and acceleration can provide a way to get at data in the cloud faster. Setting up a FlexCache does not require DR or backup of the cache. Primary data is still in the cloud, covered by the backup/DR strategy already in place there, so the cache is a good zero-touch way of getting your data to users quicker
  • Limit your egress charges in the cloud by reading data only once from the cloud, with subsequent reads served from the cache

FlexCache for Cloud Caching

  • Cache from cloud to cloud or region to region

In my next blog, I will demonstrate detailed setup and features of FlexCache for both NFS and SMB in my VSIM lab.
