

Best Practices for Deployments
with Large Data Volumes
Salesforce Spring

Last updated February

Copyright salesforce.com, inc. All rights reserved. Salesforce is a registered trademark of salesforce.com, inc., as are other names and marks. Other marks appearing herein may be trademarks of their respective owners.



Underlying Concepts
Infrastructure for Systems with Large Data Volumes
Techniques for Optimizing Performance
Best Practices
Large Data Volumes Case Studies



Who Should Read This
This paper is for experienced application architects who work with Salesforce deployments that contain large data volumes.

A large data volume is an imprecise, elastic term. If your deployment has tens of thousands of users, tens of millions of records, or hundreds of gigabytes of total record storage, you have a large data volume. Even if you work with smaller deployments, you can still learn something from these best practices.
To understand the parts of this paper that deal with details of the Salesforce implementation, read the Underlying Concepts section.

Salesforce enables customers to easily scale up their applications from small to large amounts of data. This scaling usually happens automatically, but as data sets get larger, the time required for certain operations grows. The ways in which architects design and configure data structures and operations can increase or decrease those operation times by several orders of magnitude.
The main processes affected by differing architectures and configurations are:
Loading or updating of large numbers of records, either directly or with integrations
Extraction of data through reports and queries, or through views
The strategies for optimizing those main processes are:
Following industry-standard practices for accommodating schema changes and operations in database-enabled applications
Deferring or bypassing business rule and sharing processing
Choosing the most efficient operation for accomplishing a task

What's in This Paper
Techniques for improving the performance of applications with large data volumes
Salesforce mechanisms and implementations that affect performance in less-than-obvious ways
Salesforce mechanisms designed to support the performance of systems with large data volumes

Salesforce Big Objects

Salesforce provides big data technology called Big Objects. A big object stores and manages massive amounts of data on the Salesforce platform. You can archive data from other objects or bring massive datasets from outside systems into a big object to get a full view of your customers. A big object provides consistent performance whether you have 1 million records, 100 million, or even 1 billion. This scale gives a big object its power and defines its features.


This paper focuses on optimizing large data volumes stored in standard and custom objects, not big objects. For optimal performance and a sustainable long-term storage solution for even larger data sets, use Bulk API or Batch Apex to move your data into big objects.


See also: Salesforce Developers: Big Objects Implementation Guide

Underlying Concepts
This section outlines two key concepts, multitenancy and search architecture, to explain how Salesforce:
Provides its application to customers (instances and organizations)
Keeps supported customizations secure, self-contained, and high performing
Tracks and stores application data
Indexes that data to optimize searching


Multitenancy and Metadata Overview
Search Architecture

Multitenancy and Metadata Overview
Multitenancy is a means of providing a single application to multiple organizations, such as different companies or departments within a company, from a single hardware/software stack. Instead of providing a complete set of hardware and software resources to each organization, Salesforce inserts a layer of software between the single instance and each organization's deployment. This layer is invisible to the organizations, which see only their own data and schemas, while Salesforce reorganizes the data behind the scenes to perform efficient operations.
Multitenancy requires that applications behave reliably even when architects are making Salesforce-supported customizations, which include creating custom data objects, changing the interface, and defining business rules. To ensure that tenant-specific customizations do not breach the security of other tenants or affect their performance, Salesforce uses a runtime engine that generates application components from those customizations. By maintaining boundaries between the architecture of the underlying application and that of each tenant, Salesforce protects the integrity of each tenant's data and operations.
When organizations create custom objects, the platform tracks metadata about the objects and their fields, relationships, and other object definition characteristics. Salesforce stores the application data for all virtual tables in a few large database tables, which are partitioned by tenant and serve as heap storage. The platform's engine then materializes virtual table data at runtime by considering the corresponding metadata.


Instead of attempting to manage a vast, ever-changing set of actual database structures for each application and tenant, the platform storage model manages virtual database structures using a set of metadata, data, and pivot tables. Thus, if you apply traditional performance-tuning techniques based on the data and schema of your organization, you might not see the effect you expect on the actual underlying data structures.

Note: As a customer, you also cannot optimize the SQL underlying many application operations, because it is generated by the system, not written by each tenant.

Search Architecture
Search is the capability to query records based on free-form text. The Salesforce search architecture is based on its own data store, which is optimized for searching for that text.
Salesforce provides search capabilities in many areas of the application including
The sidebar
Advanced and global searches
Find boxes and lookup fields
Suggested Solutions and Knowledge Base
Web-to-Lead and Web-to-Case
Duplicate lead processing

Salesforce Object Search Language (SOSL) for Apex and the API

For data to be searched, it must first be indexed. The indexes are created using the search indexing servers, which also generate and asynchronously process queue entries of newly created or modified data. After a searchable object's record is created or updated, it could take about 15 minutes or more for the updated text to become searchable.


Salesforce performs indexed searches by first searching the indexes for appropriate records, then narrowing down the results based on access permissions, search limits, and other filters. This process creates a result set, which typically contains the most relevant results. After the result set reaches a predetermined size, the remaining records are discarded. The result set is then used to query the records from the database to retrieve the fields that a user sees.

Tip: Search can also be accessed with SOSL, which in turn can be invoked using the API or Apex.

Infrastructure for Systems with Large Data Volumes
This section outlines
Salesforce components and capabilities that directly support the performance of systems with large data volumes
Situations in which Salesforce uses those components and capabilities
Methods of maximizing the benefits you get from the Salesforce infrastructure


Lightning Platform Query Optimizer
Database Statistics
Skinny Tables

Lightning Platform Query Optimizer
The Salesforce multitenant architecture uses the underlying database in such a way that the database system's optimizer can't effectively optimize search queries. The Lightning Platform query optimizer helps the database's optimizer produce effective queries by providing efficient data access in Salesforce.

The Lightning Platform query optimizer works on automatically generated queries that handle reports, list views, and SOQL queries. The optimizer also handles other queries that rely on these generated queries.
Specifically, the optimizer:
Determines the best index from which to drive the query, if possible, based on filters in the query
Determines the best table from which to drive the query if no good index is available
Determines how to order the remaining tables to minimize cost
Injects custom foreign key value tables that are required to create efficient join paths
Influences the execution plan for the remaining joins, including sharing joins, to minimize database input and output (I/O)
Updates statistics

Creating Efficient Queries

When you're working with large volumes of data, it's important to build efficient SOQL queries, reports, and list views, which all depend on selectivity and indexes. The Lightning Platform query optimizer determines filter condition selectivity for a SOQL query, report, or list view. With some simple SOQL queries, you can easily get the necessary statistics to determine whether a specific filter condition is selective. After you index a selective field, your queries with related filter conditions can execute more efficiently, and your users can be more productive.
Review the following practical considerations for gauging the selectivity of your filter conditions.

Determine the Selectivity of Your Filter Condition

To better understand selectivity, let's use an example. You build a SOQL query, report, or list view for the opportunity object, one of the largest objects in your org. You have a filter condition, for example, a WHERE clause, that fetches only the rows you want from the object.
Is your filter condition selective enough for the optimizer to use an available index?
With a simple SOQL query, you can quickly get the statistics that help you figure out whether a value in a given field is selective.

Use SOQL to Determine the Selectivity of a Filter Condition

Consider a query that uses a basic unary WHERE clause condition:

WHERE StageName = 'Closed Won'

Using the query tool of your choice, for example, the Developer Console Query Editor or Workbench, execute the following query to get some statistics related to your filter condition's selectivity. The following is a sample query for the StageName field:

SELECT StageName, COUNT(Id) FROM Opportunity GROUP BY ROLLUP(StageName)


Note: If the query times out because your object has many records, work with Salesforce Customer Support to carry out this investigation.
The result set shows you the distribution of records for each value of the StageName picklist field, including the total number of records in the object. You now have the statistics necessary to determine the selectivity of a filter condition involving the StageName field.
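The distribution returned by a rollup query like the one above can be turned into selectivity percentages with simple arithmetic. Here is a minimal sketch in plain Python (not a Salesforce API; the stage names and record counts are hypothetical) showing how to compute each value's share of the total record count:

```python
def selectivity(counts):
    """Given a mapping of field value -> record count, return each
    value's share of the total record count (0.0 to 1.0)."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical StageName distribution, as the rollup query might report it:
stage_counts = {
    "Prospecting": 500_000,
    "Negotiation": 350_000,
    "Closed Won": 100_000,
    "Closed Lost": 50_000,
}
shares = selectivity(stage_counts)
```

A value whose share falls under the index thresholds discussed later in this section is a good candidate for a selective filter.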

Determine the Selectivity of More Complex Filter Conditions

Using GROUP BY ROLLUP in a query similar to the previous example makes it easy to get the necessary statistics to evaluate the selectivity of various conditions.

Here's an example of a query with a more complex filter condition. This query uses a date field along with the AND operator:

SELECT Id, Name FROM Opportunity
WHERE StageName = 'Closed Won'
AND CloseDate = THIS_WEEK


You now know the statistics for the StageName field. To get the same statistics for the CloseDate field, grouped by week for each year, use the following query. SOQL has some date functions that make getting this data easy:

SELECT CALENDAR_YEAR(CloseDate), WEEK_IN_YEAR(CloseDate), COUNT(Id)
FROM Opportunity
GROUP BY ROLLUP(CALENDAR_YEAR(CloseDate), WEEK_IN_YEAR(CloseDate))

The output of the query returns statistics about the distribution of opportunity records across every week of every year by Close Date.
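To reason about the same kind of week-by-year distribution offline, you can group dates locally. A small sketch in plain Python (the dates are made up; ISO week numbering is used as an approximation of SOQL's week-in-year function):

```python
from collections import Counter
from datetime import date

def week_distribution(close_dates):
    """Count records per (year, week) pair, mirroring the grouped
    statistics the SOQL rollup returns. ISO week numbering is an
    approximation and can differ near year boundaries."""
    return Counter(d.isocalendar()[:2] for d in close_dates)

# Hypothetical close dates:
dist = week_distribution([date(2023, 1, 2), date(2023, 1, 3), date(2023, 6, 15)])
```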

For filter conditions that combine two or more conditions using AND, the query optimizer considers the overall filter condition selective when the filter targets less than the following thresholds:
Twice the selectivity thresholds for each filter
The selectivity thresholds for the intersection of those fields
For the CloseDate example, this means:
StageName = 'Closed Won' is selective
CloseDate = THIS_WEEK is selective
Overall, the filter condition is selective for both these reasons.
If one of the filter conditions is nonselective, for example, if StageName = 'Closed Won' matches more records than the selectivity threshold allows, two possibilities can make the overall filter condition selective:
Each filter condition matches fewer records than twice the selectivity threshold for each filter
The intersection of StageName = 'Closed Won' AND CloseDate = THIS_WEEK matches fewer records than the selectivity threshold
The filter condition in the example falls below these limits, so the overall condition is selective.

Note: With the OR operator, each filter must meet the threshold individually.

Understand the Impact of Deleted Records on Selectivity
When gathering selectivity statistics, you can incorporate or exclude deleted records using the Boolean field IsDeleted. This field is available in every standard and custom object.
The previous query that uses the ROLLUP function collects data for all opportunity records; it doesn't matter whether IsDeleted is true or false. To collect the selectivity statistics for Opportunity.StageName and explicitly exclude deleted records, try the following query:

SELECT StageName, COUNT(Id) FROM Opportunity WHERE IsDeleted = false GROUP BY StageName

Note: If you're using Workbench, select Exclude for the Deleted and archived records option list.
Make Sure Fields Have Indexes
Before putting a query, report, or list view into production, you want to confirm that a given filter condition is selective. However, the fields that make the condition selective must have indexes to be effective. Without the necessary indexes, the query optimizer must perform a full scan to fetch the target rows. The index can help make your query execute faster and improve the productivity of your org's users.
The Database Query & Search Optimization Cheat Sheet lists the standard fields that have an index by default, including Id, Name, OwnerId, CreatedDate, SystemModstamp, and RecordType, as well as all master-detail and lookup fields.
If your filter condition involves a custom field, work with Customer Support to create a custom index on the field your filter uses. Not all fields can have an index, such as nondeterministic formula fields.

Database Statistics
Modern databases gather statistics on the amount and types of data stored inside of them, and they use this information to execute queries efficiently. Because of Salesforce's multitenant approach to software architecture, the platform must keep its own set of statistical information to help the database understand the best way to access the data. As a result, when large amounts of data are created, updated, or deleted using the API, the database must gather statistics before the application can efficiently access data. Currently, this statistics-gathering process runs on a nightly basis.


Skinny Tables
Salesforce can create skinny tables to contain frequently used fields and to avoid joins. This can improve the performance of certain read-only operations. Skinny tables are kept in sync with their source tables when the source tables are modified.
If you want to use skinny tables, contact Salesforce Customer Support. When enabled, skinny tables are created and used automatically where appropriate. You can't create, access, or modify skinny tables yourself. If the report, list view, or query you're optimizing changes, for example, to add new fields, you'll need to contact Salesforce to update your skinny table definition.

How Skinny Tables Can Improve Performance
For each object table that's visible to you, Salesforce maintains other, separate tables at the database level for standard and custom fields. This separation, which is invisible to customers, ordinarily requires a join when a query contains both kinds of fields. A skinny table contains both kinds of fields and also omits soft-deleted records.
This table shows an Account view, a corresponding database table, and a skinny table that can speed up Account queries.

Read-only operations that reference only fields in a skinny table don't require an extra join and can consequently perform better. Skinny tables are most useful with tables containing millions of records, to improve the performance of read-only operations such as reports.
Important: Skinny tables aren't a magic wand to wave at performance problems. There's overhead in maintaining separate tables that hold copies of live data. Using them in an inappropriate context can lead to performance degradation instead of improvement.
Skinny tables can be created on custom objects and on Account, Contact, Opportunity, Lead, and Case objects. They can enhance performance for reports, list views, and SOQL.
Skinny tables can contain the following types of fields:
Date and time


Picklist (multi-select)
Text area
Text area (long)


Skinny tables and skinny indexes can also contain encrypted data.
Here is an example of how a skinny table can speed up queries. Instead of using a date range in your filter, which entails an expensive, repeated computation to create an annual or year-to-date report, you can use a skinny table to include a Year field and to filter on Year.

Skinny tables can contain a maximum of 100 columns.
Skinny tables can't contain fields from other objects.
For Full sandboxes: skinny tables are copied to your Full sandbox orgs.
For other types of sandboxes: skinny tables aren't copied to your sandbox organizations. To have production skinny tables activated for sandbox types other than Full sandboxes, contact Salesforce Customer Support.

Salesforce supports custom indexes to speed up queries, and you can create custom indexes by contacting Salesforce Customer Support.
Note: The custom indexes that Salesforce Customer Support creates in your production environment are copied to all sandboxes that you create from that production environment.
The platform maintains indexes on the following fields for most objects
SystemModstamp (LastModifiedDate)
Email (for contacts and leads)
Foreign key relationships (lookups and master-detail)
The unique Salesforce record ID, which is the primary key for each object

Salesforce also supports custom indexes on custom fields, except for multi-select picklists, text areas (long), text areas (rich), non-deterministic formula fields, and encrypted text fields.
External IDs cause an index to be created on that field. The query optimizer then considers those fields.
You can create External IDs only on the following fields:
Auto Number
Email
Number
Text


To create custom indexes for other field types, including standard fields, contact Salesforce Customer Support.

Index Tables
The Salesforce multitenant architecture makes the underlying data table for custom fields unsuitable for indexing. To overcome this limitation, the platform creates an index table that contains a copy of the data, along with information about the data types.
The platform builds a standard database index on this index table. The index table places upper limits on the number of records that an indexed search can effectively return.
By default, the index tables do not include records that are null (records with empty values). You can work with Salesforce Customer Support to create custom indexes that include null rows. Even if you already have custom indexes on your custom fields, you must explicitly enable and rebuild them to get the empty-value rows indexed.

Standard and Custom Indexed Fields
The query optimizer maintains a table containing statistics about the distribution of data in each index. It uses this table to perform prequeries to determine whether using the index can speed up the query.
For example, assume that the Account object has a field called AccountType__c, which can take the value Large, Medium, or Small, and that the field has a custom index.
Suppose that Salesforce generates a query like:

SELECT Id FROM Account
WHERE AccountType__c = 'Large'

The query optimizer performs a prequery to its internal statistics table to determine the number of records with Large in the AccountType__c field. If this number exceeds 10% of the object's total records or 333,333 records, the query does not use the custom index.
The query optimizer determines when an index is used as follows.
Standard Indexed Fields
Used if the filter matches less than 30% of the first million records and less than 15% of additional records, up to a maximum of one million records.
For example, a standard index is used if:

A query is executed against a table with 1 million records, and the filter matches 300,000 or fewer records.

A query is executed against a table with 2 million records, and the filter matches 450,000 or fewer records.


Custom Indexed Fields
Used if the filter matches less than 10% of the total records, up to a maximum of 333,333 records.
For example, a custom index is used if:

A query is executed against a table with 500,000 records, and the filter matches 50,000 or fewer records.

A query is executed against a table with 5 million records, and the filter matches 333,333 or fewer records.
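The standard and custom index thresholds reduce to simple formulas. This sketch (plain Python; the function names are my own, not a Salesforce API) reproduces the documented limits so you can estimate whether a filter's match count is likely to qualify for an index:

```python
def standard_index_threshold(total_records):
    """Maximum matching records for a standard index: 30% of the first
    million records plus 15% of the records beyond one million,
    capped at one million records overall."""
    first_million = min(total_records, 1_000_000)
    remainder = max(total_records - 1_000_000, 0)
    return min(first_million * 30 // 100 + remainder * 15 // 100, 1_000_000)

def custom_index_threshold(total_records):
    """Maximum matching records for a custom index: 10% of the total
    records, capped at 333,333 records."""
    return min(total_records * 10 // 100, 333_333)
```

For example, standard_index_threshold(2_000_000) is 450,000 and custom_index_threshold(5_000_000) is 333,333, matching the examples above.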

If the criteria for an indexed field are not met, only that index is excluded from the query. If the other fields in the WHERE clause are indexed and meet the thresholds, those indexes are sometimes used.
The query optimizer uses similar considerations to determine whether to use indexes when the WHERE clause contains AND, OR, or LIKE.

For AND, the query optimizer uses the indexes unless one of them returns more than 20% of the object's records or 666,666 total records.

For OR, the query optimizer uses the indexes unless they all return more than 10% of the object's records or 333,333 total records.

Note: All fields in the OR clause must be indexed for any index to be used.

For LIKE, the query optimizer does not use its internal statistics table. Instead, it samples up to 100,000 records of actual data to decide whether to use the custom index.
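The AND and OR rules can be sketched as predicates. This is an illustrative sketch in plain Python with hypothetical function names, reproducing the documented optimizer limits; the LIKE case is omitted because it depends on sampling actual data:

```python
def and_filter_uses_indexes(match_counts, total_records):
    """AND: indexes are used unless any single filter returns more than
    20% of the object's records or 666,666 records."""
    limit = min(total_records * 20 // 100, 666_666)
    return all(count <= limit for count in match_counts)

def or_filter_uses_indexes(match_counts, total_records):
    """OR: indexes are used unless every filter returns more than
    10% of the object's records or 333,333 records. All fields in
    the OR must be indexed for any index to be used."""
    limit = min(total_records * 10 // 100, 333_333)
    return any(count <= limit for count in match_counts)
```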
Custom indexes can be created on deterministic formula fields. Because some values vary over time or change when a transaction updates a related entity, the platform cannot index nondeterministic formulas.
Here are examples of things that make formula fields nondeterministic.
Nondeterministic formula fields can:
Reference other entities, like fields accessible through lookup fields
Include other formula fields that span over other entities
Use dynamic date and time functions, for example, TODAY() and NOW()
These formula fields are also considered nondeterministic:
Owner, autonumber, divisions, or audit fields (except for CreatedDate and CreatedByID fields)

References to fields that Lightning Platform cannot index:
Multi-select picklists
Currency fields in a multicurrency organization
Long text area fields
Binary fields (blob, file, or encrypted text)
Standard fields with special functionalities:
Opportunity: Amount, TotalOpportunityQuantity, ExpectedRevenue, IsClosed, IsWon
Case: ClosedDate, IsClosed
Product: ProductFamily, IsActive, IsArchived
Solution: Status
Lead: Status
Activity: Subject, TaskStatus, TaskPriority
Note: If the formula is modified after the index is created, the index is disabled. To re-enable an index, contact Salesforce Customer Support.


Cross-object indexes are typically used if specified using the cross-object notation, as they are in the following example:

SELECT Id FROM Score__c
WHERE CrossObject__r.CrossObject__r.IndexedField__c = 'some value'

You can use this approach to replace formula fields that cannot be custom-indexed because they reference other objects. As long as the referenced field is indexed, the cross-object notation can have multiple levels.

TwoColumn Custom Indexes
Two-column custom indexes are a specialized feature of the Salesforce platform. They are useful for list views and situations where you want to use one field to select records to display and another field to sort them. For example, an Account list view that selects by State and sorts by City can use a two-column index with State in the first column and City in the second.
When a combination of two fields is a common filter in the query string, two-column indexes typically help you sort and display records.
For example, for the following SOQL, which appears in pseudo code, a two-column index on f1__c and f2__c is more efficient than single indexes on f1__c and f2__c:


SELECT Name FROM Account
WHERE f1__c = 'foo'
AND f2__c = 'bar'

Note: Two-column indexes are subject to the same restrictions as single-column indexes, with one exception: two-column indexes can have nulls in the second column, whereas single-column indexes can't, unless Salesforce Customer Support explicitly enabled the option to include nulls.

Divisions

Divisions are a means of partitioning the data of large deployments to reduce the number of records returned by queries and reports. For example, a deployment with many customer records might create divisions called US, EMEA, and APAC to separate the customers into smaller groups that are likely to have few interrelationships.
Salesforce provides special support for partitioning data by divisions, which you can enable by contacting Salesforce Customer Support.

Techniques for Optimizing Performance
This section outlines
Techniques for optimizing Salesforce performance
The arrangements features mechanisms and options underpinning those techniques
Circumstances in which you should use those techniques and tailor them to your needs


Using Mashups
Defer Sharing Calculation

Using SOQL and SOSL

Deleting Data