SQL Server Query Optimizer – when joins are ignored

The query optimizer in SQL Server is very powerful and smart – provided you give it enough information and clues to do its job.  One recent discovery caught me by surprise – the fact that SQL Server can skip performing joins altogether in certain scenarios.

I came upon this while reviewing the definition of views in Dynamics CRM, most of which look like this:

from [AccountBase]
left join [CustomerAddressBase] XXaddress1 on ([AccountBase].[AccountId] = XXaddress1.ParentId and XXaddress1.AddressNumber = 1)
left join [CustomerAddressBase] XXaddress2 on ([AccountBase].[AccountId] = XXaddress2.ParentId and XXaddress2.AddressNumber = 2)
left join [AccountBase] [account_master_account] on ([AccountBase].[MasterId] = [account_master_account].[AccountId])
left join [LeadBase] [account_originating_lead] on ([AccountBase].[OriginatingLeadId] = [account_originating_lead].[LeadId])
left join [AccountBase] [account_parent_account] on ([AccountBase].[ParentAccountId] = [account_parent_account].[AccountId])
left join [ContactBase] [account_primary_contact] on ([AccountBase].[PrimaryContactId] = [account_primary_contact].[ContactId])
left join [EquipmentBase] [equipment_accounts] on ([AccountBase].[PreferredEquipmentId] = [equipment_accounts].[EquipmentId])
left join ...
left join ...

and so forth. Surprisingly, when viewing the execution plan, entire tables were skipped if no columns from those tables were referenced. Now, how is that possible? The result of a join affects the cardinality of the result set; therefore, the criteria for skipping a join cannot be just not selecting columns from a certain table. The key, it turns out, is that the Query Optimizer must be able to guarantee the a one-to-one relationship between the two join tables.

One way SQL Server can guarantee this is if the right hand table has a unique key on the join column(s). Take for example, a table with this unique index:
uniqueIndex

Running this query and specifying a left join on both columns of the unique key, you can see the query optimizer realizes there’s no need to actually perform the left join:
join1

But if the join condition on the 2nd column is removed, the query optimizer realizes this is now a many-to-many relationship, and not performing the join would be semantically incorrect:
join2

Not surprisingly, having a foreign key relationship between the join tables also satisfies SQL Server’s criteria of not performing the join. In fact, further optimizations can be made when there’s a trusted foreign key relationship. Playing with the AdventureworksDW database, inner joins can be skipped:

sql_innerjoin

As can where exists conditions:
sql_whereexists

Of course, this is because SQL Server checks the existence of value against the target table when the record is inserted; effectively, this is improving query performance by transferring the workload to the insert/update/delete phase.

Curiously, if I select the join column from the joined table, SQL still thinks it has to perform the join, even though there is no difference between FIS.OrderDateKey and dd.DateKey in this scenario. I suppose the Query Optimizer isn’t omniscient.

sql_boo

The best use case for this technique is probably improving performance of large views that join to a lot of tables; this is often the case when abstracting complicated schemas from end users. For joins to complicated subqueries or CTEs expressions in the view, I’ve also successfully used this technique by creating a materialized view with a unique index, thereby guaranteeing to the query optimizer that the result set will be a one-to-one relationship.

A better way to write MDX Period over Period calculations

Period over Period calculations, which compare the results of one time period against a previous time period, are extremely useful in the financial world. Everyone’s interested in growth since last month/quarter/year.  Most MDX implementations that I’ve seen use the ParallelPeriod function, which

Returns a member from a prior period in the same relative position as a specified member

The typical approach would be to create separate Year over Year, Quarter over Quarter, and Month over Month calculations. For example, to create a Year over Year calculation for the previous year:

PeriodOverPeriod1

This script first calculates the amount for the current member of the hierarchy:
([Date].[Calendar].CurrentMember, [Measures].[Amount])

Then finds the previous member of this of a specific level in this hierarchy using the ParallelPeriod() function, and subtract that from the current member:
(ParallelPeriod(
[Date].[Calendar].[Calendar Year],
1,
[Date].[Calendar].CurrentMember
),
[Measures].[Amount])

Here’s how that looks in excel, along with a demonstration of what happens if you try to use this calculations with other levels in the date dimension:

POP2

In essence, using this approach means we’d have to create a calculation for every level of the date hierarchy; this is tedious and a ton of maintenance for what amounts the same logic. Besides, it would more user-friendly to have one calculation that works for ALL levels in a natural date hierarchy. But how can this be done if the first parameter for ParallelPeriod() must accept a specific level argument? The trick here is to use .LEVEL property to dynamically return the level of the current member in the hierarchy:

([Date].[Calendar].CurrentMember, [Measures].[Amount])
-
(ParallelPeriod(
[Date].[Calendar].CurrentMember.Level,
1,
[Date].[Calendar].CurrentMember
),
[Measures].[Amount])

Here’s how that looks in excel:
POP3

Lastly, since this calculation exclusively looks at the previous member of the hierarchy (previous year, previous month, etc), we can write the MDX expression even more elegantly using the PrevMember property:

([Date].[Calendar].CURRENTMEMBER,[Measures].[Amount])-
([Date].[Calendar].CURRENTMEMBER.PREVMEMBER,[Measures].[Amount])

SSAS Dynamic Security invalidates Aggregations

Analysis Services role-based security is great –  it gives us the flexibility to define sets of data that users should or shouldn’t see.  And when the number of different possible set combinations get too unwieldy to handle, we can use data driven dynamic security to meet all of our data security requirements in a single role.  To top it all off, in most cases there is very little perceived performance impact: upon connecting to the cube, SSAS creates a subcube of the data that you’re supposed to see.

That is, no performance impact until you check this little box:

visualtotals

Visual Totals means the fact measure aggregates only on members that the user is allowed to see; this is the expected behavior in almost all cases.  However, this also means any aggregations made at a higher granularity level than the attribute used for security is no longer valid.  Here’s a contrived example : we have aggregations designed at the Year level, which pre-calculates the sum for January….December.  Now, if we use the Month attribute to secure our cube, and disallow users from seeing September, that pre-calculated sum is no longer valid; SSAS now has to calculate the Year level from at least the month level.

The moral of the story is, be very careful about defining dynamic security using low granularity attributes.  It WILL invalidate most aggregations.  I don’t see the performance aspect of dynamic security mentioned whatsoever in most how-to articles.  In most cases, the performance hit is an acceptable compromise when it means less maintenance for the developer.  However, if performance considerations are critical, then less elegant solutions (creating separate cubes) could come into play.

As an aside, I would love to see SSAS be smarter about using aggregations.  In the previous example, SASS has to throw away the pre-calculated aggregation at the Year level due to just one member missing at the month level; it now has to calculate the Year total from the 11 Months that the user can see.  It would be a lot more elegant to just subtract September from the pre-calculated Year aggregation.  This becomes especially important in large Parent-Child hierarchies that can only aggregate at the lowest and highest levels.

TL;DR: Dynamic security with visual totals invalidates all aggregations designed at higher granularities.

Ragged Hierarchies and HideMemberIf: an in-depth look

One of the most poorly-documented features I’ve come across in Analysis Services is the handling of ragged/unbalanced hierarchies. I’m not going to write about parent-child hierarchies, which may be more intuitive to model but are terrible for performance. Instead, I’ll provide an analysis of the various options of the HideMemberIf functionality, and for which data modeling scenarios each option will work when clients set MDX COMPATIBILITY=1. The end-goal is not having to drill through repeating/null intermediate levels in client tools which almost always set MDX COMPATIBILITY=1.

Here’s the test query/data we’ll be working with:

SELECT 'Scenario 1' AS Scenario, 'Repeat lowest level' AS [Scenario Description], 'Joe CEO' AS Level1, 'Secretary' AS Level2, 'Secretary' AS Level3,  'Secretary' AS Level4
UNION ALL
SELECT 'Scenario 2' AS Scenario, 'Repeat lowest parent level' AS [Scenario Description], 'Joe CEO' AS Level1, 'Joe CEO' AS Level2, 'Joe CEO' AS Level3, 'Secretary' AS Level4
UNION ALL
SELECT 'Scenario 3' AS Scenario, 'Replace lowest levels with NULL' AS [Scenario Description], 'Joe CEO' AS Level1, 'Secretary' AS Level2, NULL AS Level3, NULL AS Level4
UNION ALL
SELECT 'Scenario 4' AS Scenario, 'Replace intermediate levels NULL' AS [Scenario Description], 'Joe CEO' AS Level1, NULL AS Level2, NULL AS Level3, 'Secretary' AS Level4

raggedhierarchydata

Now, we’ll set up the attribute relationships in our test dimension like this:

RaggedHierarchyAttributeRelationship

and create 4 separate hierarchies, each with a different HideMemberIf option:
RaggedHierarchies

And here are the results in Excel (I’ve highlighted the scenarios that work):
raggedhierarchyExcel

Couple of things stand out here:

  • In this simple dataset, {ParentName & OnlyChildWithParentName} produce the same results, as does {NoName & OnlyChildWithNoName}
  • We can sense a general pattern from the combination of options that work: we need to shift all levels “up”, and then either duplicate the lowest levels  or replace them with NULLs.

One disadvantage of modeling your data this way is that the levels in your hierarchies may no longer make sense, e.g.  a Country -> State -> City hierarchy will have “Washington DC” at the State level.  I’ll argue that in hierarchies where each level has a distinct semantic meaning, we should not hide levels; after all, Washington DC is a city, not a state.  In these cases, users may very well expect a NULL for the state level.  However, in hierarchies that DO represent a generic structure such as organizational report, where level names are more abstract e.g. Level 1 > Level 2 > Level 3, hiding levels make sense.  It alleviates the nuisance of users drilling down 10 levels of “Suzy Secretary”.

Note that using HideMemberIf also incurs a performance hit, though nowhere as much as parent-child hierarchies.  I may explore the query performance implications in a future post.

SSRS Tip: Use variables to store formatting expressions

SSRS provides a rich set of formatting options that allows you to control virtually every aspect of a report. Often times you’ll find that these formatting expressions are repeated throughout the report, and having to manually code this numerous times for each affected column.

As a good practice, if I find that the same formatting expression is used more than once, I’ll define a report variable for the format expression, and use that to drive the formatting of multiple columns. Let’s say we want to round a percentage figure to a whole number. Instead of  hard coding it like this:

ssrsformatbefore

You can create a report variable:

reportproperties

 

reportvariable

 

and then set the format like this:

setformat

 

 

Besides not repeating the same expression in multiple places, we can now also change the formatting for all affected columns by altering one variable.   More innovative uses might entail populating formatting variables with run-time parameters.

Retreiving Sql Server 2012 Product Key from the registry

If you’ve misplaced your Sql Server installation key, you can retrieve it from the Windows Registry using a powershell script posted here.  However, that script works only for Sql 2008; in 2012, the location of the key in the Registry has changed, as well as the format. So I’ve posted my modification of the powershell script below:

function Get-SQLserverKey {
## function to retrieve the license key of a SQL 2012 Server.
## by Jakob Bindslet (jakob@bindslet.dk)
## 2012 Modification by Xian Wang (daanno2@gmail.com)
param ($targets = ".")
$hklm = 2147483650
$regPath = "SOFTWARE\Microsoft\Microsoft SQL Server\110\Tools\Setup"
$regValue1 = "DigitalProductId"
$regValue2 = "PatchLevel"
$regValue3 = "Edition"
Foreach ($target in $targets) {
$productKey = $null
$win32os = $null
$wmi = [WMIClass]"\\$target\root\default:stdRegProv"
$data = $wmi.GetBinaryValue($hklm,$regPath,$regValue1)
[string]$SQLver = $wmi.GetstringValue($hklm,$regPath,$regValue2).svalue
[string]$SQLedition = $wmi.GetstringValue($hklm,$regPath,$regValue3).svalue
$binArray = ($data.uValue)[0..16]
$charsArray = "B","C","D","F","G","H","J","K","M","P","Q","R","T","V","W","X","Y","2","3","4","6","7","8","9"
## decrypt base24 encoded binary data
For ($i = 24; $i -ge 0; $i--) {
$k = 0
For ($j = 14; $j -ge 0; $j--) {
$k = $k * 256 -bxor $binArray[$j]
$binArray[$j] = [math]::truncate($k / 24)
$k = $k % 24
}
$productKey = $charsArray[$k] + $productKey
If (($i % 5 -eq 0) -and ($i -ne 0)) {
$productKey = "-" + $productKey
}
}
$win32os = Get-WmiObject Win32_OperatingSystem -computer $target
$obj = New-Object Object
$obj | Add-Member Noteproperty Computer -value $target
$obj | Add-Member Noteproperty OSCaption -value $win32os.Caption
$obj | Add-Member Noteproperty OSArch -value $win32os.OSArchitecture
$obj | Add-Member Noteproperty SQLver -value $SQLver
$obj | Add-Member Noteproperty SQLedition -value $SQLedition
$obj | Add-Member Noteproperty ProductKey -value $productkey
$obj
}
}

Output:

output

Oracle OLE DB Performance with fetchsize

I’ve been working on some SSIS packages that move data from an Oracle source to a Sql Server destination. Unfortunately, the source and destination servers are in data centers located geographically far apart, so the throughput was never going to be optimal. However, I’m seeing some exceptionally bad throughput – something like 5 minutes to transfer ~500k rows of not very wide data using the OLE DB source.

Doing some googling revealed that using the Microsoft Oracle connectors by Attunity can give a 20 to 30 percent boost in performance when using Oracle as a source. To my great surprise, after installing and rewriting the packages using the Attunity connectors, I was experiencing more along the lines of 300% boost in performance!

However, there are downsides to using the Attunity connectors:

1) They are not included in an out-of-box SSIS installation; they require a seperate download and installation. As such, they may or may not be easy to deploy in your server environment.

2) Attunity connectors use a different data source connection that is incompatible with OLE DB source connectors. Additionally, you can’t Data Sources defined for the entire SSIS project, and must instead create a new connection for each SSIS package. This unnecessarily complicates management and deployment of your SSIS solution.

What’s interesting about the boost in performance with the Attunity connectors was the corresponding increase in network saturation. With the OLE DB source, the network utilization was around ~1%, but the Attunity connectors was getting 3%. How could there be such disparity in network throughput when the same query was issued to Oracle?

One clue I got was that when using the the Attunity Connector source, a new property called the Batch Size is exposed:
batchsize

Batch size controls how many rows SSIS retrieves with each “batch”.  After retrieving all the rows in a batch, SSIS signals to Oracle that it is ready for another batch.   Of course, network latency is introduced during this signaling process.  Thus, decreasing the number of total batches also decreases the number of round-trips SSIS has to make.  The default batch size is only 100;  increasing this to a reasonable number should boost performance.  In our environment, a value of 5000 seemed to be optimal.  But there was no equivalent setting in the OLE DB source.

After more googling and trying out performance ‘tricks’ from bloggers of various repute, I came upon an obscure blog which suggested altering the fetchsize property of the connection string.  The author was originally diagnosing poor  processing performance of an SSAS cube going against an Oracle data source, but I decided it was worth a shot doing the same for SSIS.

The default fetchsize is set at 100 rows, but anything from 100 to 1,000,000 is feasible, depending on your network and server configuration.  Adjusting this setting proved to be the exactly what was needed, boosting the performance by about 270% when I set it to 5000 rows.  Unfortunately, the fetchszie property is not exposed anywhere in the GUI when creating an OLE DB source.  Instead, you’ll have to edit the connection string manually.  In our environment, we use DTEXEC to call packages and point them to a run-time dtsConfig  file containing the connection string (which can be edited manually). YMMV.

I’m frankly surprised more people are not experiencing poor performance while using Oracle OLE DB source in SSIS, and especially how googling yielded very few helpful results.  Hopefully this post can be of some help to anyone who is having this issue.

 

TL;DR: Increase fetchsize property for OLE DB connections to Oracle.