Best Practices for Deployments
with Large Data Volumes
Salesforce Spring

Last updated February

Copyright © salesforce.com, inc. All rights reserved. Salesforce is a registered trademark of salesforce.com, inc., as are other names and marks. Other marks appearing herein may be trademarks of their respective owners.



Underlying Concepts
Infrastructure for Systems with Large Data Volumes
Techniques for Optimizing Performance
Best Practices
Large Data Volumes Case Studies



Who Should Read This
This paper is for experienced application architects who work with Salesforce deployments that contain large data volumes.

A large data volume is an imprecise, elastic term, but if your deployment has tens of thousands of users, tens of millions of records, or hundreds of gigabytes of total record storage, then you can use the information in this paper. A lot of that information also applies to smaller deployments, and if you work with those, you might still learn something from this document and its best practices.
To understand the parts of this paper that deal with details of the Salesforce implementation, read Underlying Concepts.

Salesforce enables customers to easily scale their applications up from small to large amounts of data. This scaling usually happens automatically, but as data sets get larger, the time required for certain operations might grow. The ways in which architects design and configure data structures and operations can increase or decrease those operation times by several orders of magnitude.
The main processes affected by differing architectures and configurations are the:
Loading or updating of large numbers of records, either directly or with integrations
Extraction of data through reports and queries, or through views
The strategies for optimizing those main processes are:
Following industry-standard practices for accommodating schema changes and operations in database-enabled applications
Deferring or bypassing business rule and sharing processing
Choosing the most efficient operation for accomplishing a task

What's in This Paper
Techniques for improving the performance of applications with large data volumes
Salesforce mechanisms and implementations that affect performance in less-than-obvious ways
Salesforce mechanisms designed to support the performance of systems with large data volumes

Underlying Concepts
This section outlines two key concepts, multitenancy and search architecture, to explain how Salesforce:
Provides its application to customers using instances and organizations
Keeps supported customizations secure, self-contained, and high performing
Tracks and stores application data
Indexes that data to optimize searching



Multitenancy and Metadata Overview
Search Architecture

Multitenancy and Metadata Overview
Multitenancy is a means of providing a single application to multiple organizations, such as different companies or departments within a company, from a single hardware/software stack. Instead of providing a complete set of hardware and software resources to each organization, Salesforce inserts a layer of software between the single instance and each organization's deployment. This layer is invisible to the organizations, which see only their own data and schemas, while Salesforce reorganizes the data behind the scenes to perform efficient operations.

Multitenancy requires that applications behave reliably, even when architects are making Salesforce-supported customizations, which include creating custom data objects, changing the interface, and defining business rules. To ensure that tenant-specific customizations do not breach the security of other tenants or affect their performance, Salesforce uses a runtime engine that generates application components from those customizations. By maintaining boundaries between the architecture of the underlying application and that of each tenant, Salesforce protects the integrity of each tenant's data and operations.

When organizations create custom objects, the platform tracks metadata about the objects and their fields, relationships, and other object definition characteristics. Salesforce stores the application data for all virtual tables in a few large database tables, which are partitioned by tenant and serve as heap storage. The platform's engine then materializes virtual table data at runtime by considering the corresponding metadata.

Instead of attempting to manage a vast, ever-changing set of actual database structures for each application and tenant, the platform storage model manages virtual database structures using a set of metadata, data, and pivot tables. Thus, if you apply traditional performance-tuning techniques based on the data and schema of your organization, you might not see the effect you expect on the actual underlying data structures.


Note: As a customer, you also cannot optimize the SQL underlying many application operations, because it is generated by the system, not written by each tenant.

Search Architecture
Search is the capability to query records based on free-form text. The Salesforce search architecture is based on its own data store, which is optimized for searching for that text.
Salesforce provides search capabilities in many areas of the application, including:
The sidebar
Advanced and global searches
Find boxes and lookup fields
Suggested Solutions and Knowledge Base
Web-to-Lead and Web-to-Case
Duplicate lead processing

Salesforce Object Search Language (SOSL) for Apex and the API

For data to be searched, it must first be indexed. The indexes are created using the search indexing servers, which also generate and asynchronously process queue entries of newly created or modified data. After a searchable object's record is created or updated, it could take about 15 minutes or more for the updated text to become searchable.
Salesforce performs indexed searches by first searching the indexes for appropriate records, then narrowing down the results based on access permissions, search limits, and other filters. This process creates a result set, which typically contains the most relevant results. After the result set reaches a predetermined size, the remaining records are discarded. The result set is then used to query the records from the database to retrieve the fields that a user sees.

Tip: Search can also be accessed with SOSL, which in turn can be invoked using the API or Apex.

Infrastructure for Systems with Large Data Volumes
This section outlines
Salesforce components and capabilities that directly support the performance of systems with large data volumes
Situations in which Salesforce uses those components and capabilities
Methods of maximizing the benefits you get from the Salesforce infrastructure


Lightning Platform Query Optimizer
Database Statistics
Skinny Tables


Lightning Platform Query Optimizer
Because Salesforce's multitenant architecture uses the underlying database in unusual ways, the database system's optimizer cannot effectively optimize Salesforce queries unaided. The Lightning Platform query optimizer helps the database system's optimizer produce effective execution plans for Salesforce queries, and it is a major factor in providing efficient data access in Salesforce.
The Lightning Platform query optimizer works on the queries that are automatically generated to handle reports, list views, and both SOQL queries and the other queries that piggyback on them.

The Lightning Platform query optimizer:
Determines the best index from which to drive the query, if possible, based on filters in the query
Determines the best table to drive the query from if no good index is available
Determines how to order the remaining tables to minimize cost
Injects custom foreign key value tables as needed to create efficient join paths
Influences the execution plan for the remaining joins, including sharing joins, to minimize database input/output (I/O)
Updates statistics

Database Statistics
Modern databases gather statistics on the amount and types of data stored inside of them, and they use this information to execute queries efficiently. Because of Salesforce's multitenant approach to software architecture, the platform must keep its own set of statistical information to help the database understand the best way to access the data. As a result, when large amounts of data are created, updated, or deleted using the API, the database must gather statistics before the application can efficiently access data. Currently, this statistics-gathering process runs on a nightly basis.

Skinny Tables
Salesforce can create skinny tables to contain frequently used fields and to avoid joins. This can improve the performance of certain read-only operations. Skinny tables are kept in sync with their source tables when the source tables are modified.
If you want to use skinny tables, contact Salesforce Customer Support. When enabled, skinny tables are created and used automatically where appropriate. You can't create, access, or modify skinny tables yourself. If the report, list view, or query you're optimizing changes (for example, to add new fields), you'll need to contact Salesforce to update your skinny table definition.

How Skinny Tables Can Improve Performance
For each object table that's visible to you, Salesforce maintains other, separate tables at the database level for standard and custom fields. This separation, which is invisible to customers, ordinarily requires a join when a query contains both kinds of fields. A skinny table contains both kinds of fields and also omits soft-deleted records.
This table shows an Account view, a corresponding database table, and a skinny table that can speed up Account queries.


Read-only operations that reference only fields in a skinny table don't require an extra join and can consequently perform better. Skinny tables are most useful with tables containing millions of records, to improve the performance of read-only operations such as reports.
Important: Skinny tables aren't a magic wand to wave at performance problems. There's overhead in maintaining separate tables that hold copies of live data. Using them in an inappropriate context can lead to performance degradation instead of improvement.
Skinny tables can be created on custom objects and on Account, Contact, Opportunity, Lead, and Case objects. They can enhance performance for reports, list views, and SOQL.

Skinny tables can contain the following types of fields:
Date and time
Picklist (multi-select)
Text area
Text area (long)


Skinny tables and skinny indexes can also contain encrypted data.
Here is an example of how a skinny table can speed up queries. Instead of using a date range in a query, which entails an expensive, repeated computation to create an annual or year-to-date report, you can use a skinny table to include a Year field and to filter on Year.



Skinny tables can contain a maximum of 100 columns.
Skinny tables can't contain fields from other objects.
For Full sandboxes: skinny tables are copied to your Full sandbox orgs.
For other types of sandboxes: skinny tables aren't copied to your sandbox organizations. To have production skinny tables activated for sandbox types other than Full sandboxes, contact Salesforce Customer Support.

Indexes
Salesforce supports custom indexes to speed up queries, and you can create custom indexes by contacting Salesforce Customer Support.
Note: The custom indexes that Salesforce Customer Support creates in your production environment are copied to all sandboxes that you create from that production environment.
The platform maintains indexes on the following fields for most objects:
SystemModstamp (LastModifiedDate)
Email (for contacts and leads)
Foreign key relationships (lookups and master-detail)

The unique Salesforce record ID, which is the primary key for each object

Salesforce also supports custom indexes on custom fields, except for multi-select picklists, text areas (long), text areas (rich), non-deterministic formula fields, and encrypted text fields.
External IDs cause an index to be created on that field. The query optimizer then considers those fields.
You can create External IDs only on the following fields:
Auto Number
Email
Number
Text
To create custom indexes for other field types, including standard fields, contact Salesforce Customer Support.

Index Tables
The Salesforce multitenant architecture makes the underlying data table for custom fields unsuitable for indexing. To overcome this limitation, the platform creates an index table that contains a copy of the data, along with information about the data types.



The platform builds a standard database index on this index table. The index table places upper limits on the number of records that an indexed search can effectively return.
By default, the index tables do not include records that are null (records with empty values). You can work with Salesforce Customer Support to create custom indexes that include null rows. Even if you already have custom indexes on your custom fields, you must explicitly enable and rebuild them to get the empty-value rows indexed.

Standard and Custom Indexed Fields
The query optimizer maintains a table containing statistics about the distribution of data in each index. It uses this table to perform pre-queries to determine whether using the index can speed up the query.
For example, assume that the Account object has a custom field called Account_Type__c, which can take the value Large, Medium, or Small, and that the field has a custom index.
Salesforce might generate a query like:

SELECT Name
FROM Account
WHERE Account_Type__c = 'Large'

The query optimizer performs a pre-query to its internal statistics table to determine the number of records with Large in the Account_Type__c field. If this number exceeds 10% of the object's total records or 333,333 records, the query does not use the custom index.
The query optimizer determines when an index is used, as follows.
Standard Indexed Fields
Used if the filter matches less than 30% of the first million records and less than 15% of additional records, up to a maximum of one million records.
For example, a standard index is used if:

A query is executed against a table with 2 million records, and the filter matches 450,000 or fewer records.

A query is executed against a table with 5 million records, and the filter matches 900,000 or fewer records.

Custom Indexed Fields
Used if the filter matches less than 10% of the total records, up to a maximum of 333,333 records.
For example, a custom index is used if:

A query is executed against a table with 500,000 records, and the filter matches 50,000 or fewer records.

A query is executed against a table with 5 million records, and the filter matches 333,333 or fewer records.
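These selectivity thresholds are simple arithmetic, so they can be sketched as a small helper. The following is an illustrative Python sketch, not Salesforce code; it assumes the published threshold values (for standard indexes, 30% of the first million records plus 15% of the remainder, capped at 1,000,000 records; for custom indexes, 10% of total records, capped at 333,333 records):

```python
def standard_index_threshold(total_records: int) -> int:
    """Largest number of matching records for which a standard index is
    still used: 30% of the first million records plus 15% of the records
    beyond the first million, capped at 1,000,000."""
    first = min(total_records, 1_000_000)
    rest = max(total_records - 1_000_000, 0)
    return min(int(first * 0.30 + rest * 0.15), 1_000_000)

def custom_index_threshold(total_records: int) -> int:
    """Largest number of matching records for which a custom index is
    still used: 10% of total records, capped at 333,333."""
    return min(int(total_records * 0.10), 333_333)

# Worked examples matching the text:
# standard_index_threshold(2_000_000) -> 450_000
# custom_index_threshold(500_000)     -> 50_000
# custom_index_threshold(5_000_000)   -> 333_333
```

Running the helpers against the examples in the text reproduces the stated cutoffs, which is a quick way to sanity-check whether a planned filter is likely to stay selective as an object grows.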



If the criteria for an indexed field are not met, only that index is excluded from the query. Other indexes that appear in the WHERE clause and meet the record thresholds are sometimes still used.

The query optimizer uses similar considerations to determine whether to use indexes when the WHERE clause contains AND, OR, or LIKE.

For AND, the query optimizer uses the indexes unless one of them returns more than 20% of the object's records or 666,666 total records.

For OR, the query optimizer uses the indexes unless they all return more than 10% of the object's records or 333,333 total records.

Note: All fields in the OR clause must be indexed for any index to be used.

For LIKE, the query optimizer does not use its internal statistics table. Instead, it samples up to 100,000 records of actual data to decide whether to use the custom index.
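Taken literally, the AND and OR rules can be sketched the same way. Again this is an illustrative Python sketch, not Salesforce code; it assumes the published 20%/666,666 (AND) and 10%/333,333 (OR) thresholds and implements the rules exactly as stated:

```python
def and_filters_use_indexes(match_counts, total_records):
    """AND: indexes are used unless any single filter returns more than
    20% of the object's records or more than 666,666 records."""
    limit = min(int(total_records * 0.20), 666_666)
    return all(count <= limit for count in match_counts)

def or_filters_use_indexes(match_counts, total_records, all_fields_indexed=True):
    """OR: every field in the clause must be indexed; the indexes are
    used unless all filters exceed 10% of the object's records or
    333,333 records."""
    if not all_fields_indexed:
        return False
    limit = min(int(total_records * 0.10), 333_333)
    return any(count <= limit for count in match_counts)

# On a 1,000,000-record object, the AND limit is 200,000 matches per
# filter and the OR limit is 100,000 matches per filter.
```

The practical takeaway is the same as in the text: keep each filter in a compound WHERE clause selective, because one unselective branch can disqualify the whole clause from index use.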
Custom indexes can be created on deterministic formula fields. Because some values vary over time or change when a transaction updates a related entity, the platform cannot index non-deterministic formulas.
For example, non-deterministic formula fields can:
Reference other entities, like fields accessible through lookup fields
Include other formula fields that span over other entities

Use dynamic date and time functions, for example, TODAY() and NOW()

These formula fields are also considered non-deterministic:

Owner, autonumber, divisions, or audit fields (except for CreatedDate and CreatedByID fields)

References to fields that Lightning Platform cannot index:
Multi-select picklists
Currency fields in a multicurrency organization
Long text area fields
Binary fields (blob, file, or encrypted text)
Standard fields with special functionalities:
Opportunity: Amount, TotalOpportunityQuantity, ExpectedRevenue, IsClosed, IsWon
Case: ClosedDate, IsClosed
Product: ProductFamily, IsActive, IsArchived
Solution: Status
Lead: Status
Activity: Subject, TaskStatus, TaskPriority
Note: If the formula is modified after the index is created, the index is disabled. To re-enable an index, contact Salesforce Customer Support.
Cross-object indexes are typically used if specified using the cross-object notation, as they are in the following example:

SELECT Id
FROM Score__c
WHERE CrossObject__r.CrossObject__r.IndexedField__c = '...'

You can use this approach to replace formula fields that cannot be custom-indexed because they reference other objects. As long as the referenced field is indexed, the cross-object notation can have multiple levels.



Two-Column Custom Indexes
Two-column custom indexes are a specialized feature of the Salesforce platform. They are useful for list views and situations where you want to use one field to select records to display and another field to sort them. For example, an Account list view that selects by State and sorts by City can use a two-column index with State in the first column and City in the second.
When a combination of two fields is a common filter in the query string, two-column indexes typically help you sort and display records.
For example, for the following SOQL, which appears in pseudo code, a two-column index on f1__c,f2__c is more efficient than single indexes on f1__c and f2__c:

SELECT Name
FROM Account
WHERE f1__c = 'foo'
AND f2__c = 'bar'

Note: Two-column indexes are subject to the same restrictions as single-column indexes, with one exception: two-column indexes can have nulls in the second column, whereas single-column indexes can't, unless Salesforce Customer Support explicitly enabled the option to include nulls.

Divisions
Divisions are a means of partitioning the data of large deployments to reduce the number of records returned by queries and reports. For example, a deployment with many customer records might create divisions called US, EMEA, and APAC to separate the customers into smaller groups that are likely to have few interrelationships.
Salesforce provides special support for partitioning data by divisions, which you can enable by contacting Salesforce Customer Support.

Techniques for Optimizing Performance
This section outlines
Techniques for optimizing Salesforce performance
The arrangements features mechanisms and options underpinning those techniques
Circumstances in which you should use those techniques and tailor them to your needs


Using Mashups
Defer Sharing Calculation

Using SOQL and SOSL

Deleting Data

Using Mashups
One approach to reducing the amount of data in Salesforce is to maintain large data sets in a different application, and then make that application available to Salesforce as needed. Salesforce refers to such an arrangement as a mashup, because it provides a quick, loosely coupled integration of the two applications. Mashups use Salesforce presentation to display Salesforce-hosted data and externally hosted data.


Salesforce supports the following mashup designs:
External Website

The Salesforce UI displays an external website and passes information and requests to it. With this design, you can make the website look like part of the Salesforce UI.

Callouts
Apex code allows Salesforce to use Web services to exchange information with external systems in real time.
Because of their real-time restrictions, mashups are limited to short interactions and small amounts of data.
See the Apex Code Developer's Guide.

Advantages of Using Mashups
Data is never stale
No proprietary method needs to be developed to integrate the two systems

Disadvantages of Using Mashups
Accessing data takes more time
Functionality is reduced. For example, reporting and workflow do not work on the external data.

Defer Sharing Calculation
In some circumstances, it might be appropriate to use a feature called defer sharing calculation, which allows users to defer the processing of sharing rules until after new users, rules, and other content have been loaded.
An organization's administrator can use a defer sharing calculation permission to suspend and resume sharing calculations, and to manage two processes: group membership calculation and sharing rule calculation. The administrator can suspend these calculations when performing a large number of configuration changes, which might lead to very long sharing rule evaluations or timeouts, and resume calculations during an organization's maintenance period. This deferral can help users process a large number of sharing-related configuration changes quickly during working hours, and then let the recalculation process run overnight between business days or over a weekend.

Using SOQL and SOSL

A SOQL query is the equivalent of a SELECT SQL statement, and a SOSL query is a programmatic way of performing a text-based search.

SOQL executes with the database and uses the query() call.
SOSL executes with the search indexes and uses the search() call.

Use SOQL when

You know in which objects or fields the data resides
You want to
Retrieve data from a single object or from multiple objects that are related to one another


Count the number of records that meet specified criteria
Sort results as part of the query
Retrieve data from number, date, or checkbox fields

Use SOSL when

You don't know in which object or field the data resides, and you want to find it in the most efficient way possible
You want to:
Retrieve multiple objects and fields efficiently, and the objects might or might not be related to one another
Retrieve data for a particular division in an organization using the divisions feature, and you want to find it in the most efficient way possible

Consider the following when using SOQL or SOSL.

Both SOQL WHERE filters and SOSL search queries can specify text to look for. When a given search can use either language, SOSL is generally faster than SOQL if the search expression uses a CONTAINS term.

SOSL can tokenize multiple terms within a field (for example, multiple words separated by spaces) and builds a search index off this. If you're searching for a specific distinct term that you know exists within a field, you might find SOSL faster than SOQL for these searches. For example, you might use SOSL if you were searching for "John" against fields that contained values like "Paul and John".

In some cases, when multiple WHERE filters are being used in SOQL, indexes cannot be used, even though the fields in the WHERE clause can be indexed. In this situation, decompose the single query into multiple queries, each of which should have one WHERE filter, and then combine the results.
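The decomposition advice can be sketched in a few lines. This is a hypothetical Python illustration, not Salesforce code: `run_query` is a stand-in for whatever call executes a single-filter query and returns record IDs, and the sketch assumes the original filters were AND-ed together, so the client combines the per-filter results by intersection:

```python
def combine_single_filter_queries(run_query, filters):
    """Run one query per WHERE filter and intersect the returned record
    IDs, emulating the original multi-filter AND query client-side."""
    result_ids = None
    for f in filters:
        ids = set(run_query(f))  # each call uses exactly one indexed filter
        result_ids = ids if result_ids is None else result_ids & ids
    return result_ids or set()

# Example with a stubbed-out query function (hypothetical filters and IDs):
fake_data = {
    "Status = 'Open'": ["001A", "001B", "001C"],
    "Region = 'EMEA'": ["001B", "001C", "001D"],
}
ids = combine_single_filter_queries(fake_data.get, list(fake_data))
# ids == {"001B", "001C"}
```

The trade-off is extra round trips and client-side memory for the ID sets, which is why this pattern is worth it only when the combined WHERE clause defeats index use.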

Executing a query with a WHERE filter that has null values for picklists or foreign key fields doesn't use the index and should be avoided.

For example, the following customer query performs poorly:

SELECT Contact__c, Max_Score__c, Category_Name__c, Category_Team_Name__c
FROM Interest__c
WHERE Contact__c != null
AND Contact__c IN :contacts
AND override__c != 0
AND ((override__c != null AND override__c > 0)
OR (score__c != null AND score__c > 0))
AND Category__c != null
AND (Category_Team_IsActive__c = true OR Category_Name__c IN :selectvalues)
AND (Category_Team_Name__c != null AND Category_Team_Name__c IN :selectvalues)

Nulls in the criteria prevented the use of indexes, and some of the criteria was redundant and extended execution time. Design the data model so that it does not rely on nulls as valid field values.

The query can be rewritten as:

SELECT Contact__c, Max_Score__c, Category_Name__c, Category_Team_Name__c
FROM Interest__c
WHERE Contact__c IN :contacts
AND (override__c > 0 OR score__c > 0)
AND Category__c != 'Default'
AND (Category_Team_Name__c IN :selectvalues AND Category_Team_IsActive__c = true)