Data Warehouse Surrogate Key Generation

 admin

Apr 20, 2006  For this reason, database designers sometimes introduce a surrogate key, which uniquely identifies every row in the table and is “more minimal” than the inherently unique aspect of the data. The usual choice is a monotonically increasing integer, which is small and easy to use in foreign keys. Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is generated by the system, not by the user. In some databases, the main difference between a primary key (PK) and a surrogate key (SK) is that a PK uniquely identifies a record, while an SK uniquely identifies an entity. A surrogate key is a sequentially generated unique number attached to every record in a dimension table in a data warehouse. This article concentrates on different approaches to generating surrogate keys for different types of ETL process.

And depending on what your data source is (such as a clickstream), it can happen a lot. To answer the question, use a dynamic lookup for the dimension table. If you send it a new natural key, it returns the next available surrogate key in sequence. It also stores the natural/surrogate key pair in the cache (it does not update the.
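The dynamic lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the behaviour of any particular ETL tool: the class name `DynamicLookup`, its method `get_key`, and the sample natural keys are all hypothetical.

```python
class DynamicLookup:
    """Hypothetical in-memory dynamic lookup cache for one dimension."""

    def __init__(self, start=1):
        self._next_key = start
        self._cache = {}  # natural key -> surrogate key

    def get_key(self, natural_key):
        """Return (surrogate_key, is_new) for the given natural key."""
        if natural_key in self._cache:
            return self._cache[natural_key], False
        # New natural key: hand out the next value and remember the pair.
        surrogate_key = self._next_key
        self._next_key += 1
        self._cache[natural_key] = surrogate_key
        return surrogate_key, True


lookup = DynamicLookup()
first = lookup.get_key("CUST-A")   # new key
second = lookup.get_key("CUST-B")  # new key
repeat = lookup.get_key("CUST-A")  # already cached
```

The `is_new` flag lets the caller decide whether a dimension row still has to be inserted for that key.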

The Multidimensional Warehouse is the third data structure in EPM.

Image: Multidimensional Warehouse (MDW)

The following graphic illustrates the MDW component of the EPM architecture and the target tables that are present in the MDW.

The MDW stores dimensionalized data that is grouped into one or more business processes, better known as a dimensional schema, used for business intelligence and ad hoc reporting. The data is stored in a star schema (a fact table associated with a series of dimension tables) and generally contains data loaded from the OWS.

The star schema arrangement depends entirely on primary key and foreign key relationships. A primary key is a column (or columns) in a dimension table whose values uniquely identify each row in the table. Primary keys enforce entity integrity by uniquely identifying entity instances. A foreign key is a column or columns in a fact table whose values match the primary key values of a given dimension table. This way references can be made between a fact and dimension table. Foreign keys enforce referential integrity by completing an association between two entities.
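As a minimal sketch of these two kinds of integrity, the following uses Python's built-in sqlite3 module with one dimension and one fact table. The table and column names (`d_product`, `f_sales`, `product_sid`) are invented for the example and are not taken from the MDW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute(
    "CREATE TABLE d_product (product_sid INTEGER PRIMARY KEY, product_name TEXT)"
)
conn.execute(
    "CREATE TABLE f_sales ("
    " product_sid INTEGER NOT NULL REFERENCES d_product (product_sid),"
    " quantity INTEGER)"
)
conn.execute("INSERT INTO d_product VALUES (1, 'Widget')")
conn.execute("INSERT INTO f_sales VALUES (1, 10)")  # matches a dimension row

# Referential integrity: a fact row must point at an existing dimension row.
try:
    conn.execute("INSERT INTO f_sales VALUES (99, 5)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Entity integrity: two dimension rows cannot share a primary key value.
try:
    conn.execute("INSERT INTO d_product VALUES (1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```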

Note: MDW dimensions use a surrogate key, a unique key generated from production keys by the ETL process. The surrogate key is not derived from any data in the EPM database and acts as the primary key in an MDW dimension. See the next topic for more information on surrogate keys in the MDW.

Image: Dimensional Model Example

The following graphic provides an example of a star schema and its primary and foreign key relationships:

Although data loaded into the MDW is primarily derived from the OWS, there are exceptions to this rule. Profitability and Global Consolidations data for the Financial Management Solutions (FMS) Warehouse is loaded into the MDW from the OWE.

External survey data for the HCM Warehouse is loaded into the MDW from the OWE.

Online Marketing data is loaded into the MDW directly from the source system, and bypasses the Operational Warehouse entirely.

Surrogate Keys

Surrogate keys provide a means of defining unique keys whose values, with the exception of the Time and Calendar dimensions, are anonymous; that is, the value of a surrogate key has no significance to the application using it and is strictly an artificial value. The system uses surrogate keys specifically as a means of joining structures. To speed up query access, the MDW resolves PeopleSoft-specific programming constructs, such as SetIDs and effective dates, and replaces them with surrogate IDs as key columns. Surrogate keys have no relationship to the business or production key. Surrogate keys are present in dimension tables as the primary key and in fact tables as foreign keys to dimensions. However, the dimension record retains the business key as an alternate-key attribute. Surrogate keys are four-byte integers and their size does not change even when the production key changes in size.

Although surrogate keys usually do not have any 'intelligence,' that is, their value has no meaning, in certain situations, such as the Gregorian Calendar and Time dimensions, intelligent surrogate keys are used. These intelligent keys enable the ETL process to run more quickly by providing the option of avoiding a lookup on corresponding dimensions.
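One common way to build an intelligent date key is to encode the date as a YYYYMMDD integer, so the key can be computed from the date without a dimension lookup. The text does not say which encoding the MDW uses, so the format below is an assumption, and the function names `date_sid` and `sid_to_date` are hypothetical.

```python
from datetime import date


def date_sid(d: date) -> int:
    """Encode a date as a YYYYMMDD integer (assumed format, for illustration)."""
    return d.year * 10000 + d.month * 100 + d.day


def sid_to_date(sid: int) -> date:
    """Decode a YYYYMMDD integer back into a date."""
    return date(sid // 10000, sid // 100 % 100, sid % 100)


key = date_sid(date(2006, 4, 20))
round_trip = sid_to_date(key)
```

Because the key is derived arithmetically, a fact load can compute it directly instead of querying the calendar dimension for every row.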

Surrogate key fields usually have the suffix _SID (Surrogate ID).

Surrogate Keys and the ETL Process

Surrogate keys are generated from production keys using the DataStage routine KeyMgtNextValueConcurent(), which receives an input parameter and a name identifying the sequence. The surrogate key can be unique per single dimension target (D) or unique across the whole (W) multidimensional warehouse. This process is enabled by the environment parameter named SID_UNIQUENESS. The value for this parameter is provided at run time. If the value is D, then this routine is called with a dimension job name for which a surrogate key must be assigned and it returns the next available number. If not, the routine is called with EPM as the sequence identifier.
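The effect of the two SID_UNIQUENESS settings can be illustrated with a small named-sequence generator. This is a sketch of the concept only and does not reproduce the DataStage routine: `KeyManager`, `sequence_name_for`, and the job names are hypothetical, and the sketch ignores concurrency.

```python
from collections import defaultdict


class KeyManager:
    """Hypothetical stand-in for a routine that hands out values per named sequence."""

    def __init__(self):
        self._sequences = defaultdict(int)

    def next_value(self, sequence_name):
        self._sequences[sequence_name] += 1
        return self._sequences[sequence_name]


def sequence_name_for(job_name, sid_uniqueness):
    """D: one sequence per dimension job. Otherwise: one shared 'EPM' sequence."""
    return job_name if sid_uniqueness == "D" else "EPM"


jobs = ["J_DIM_CUSTOMER", "J_DIM_PRODUCT", "J_DIM_CUSTOMER"]  # made-up job names

per_dimension = KeyManager()
d_keys = [per_dimension.next_value(sequence_name_for(j, "D")) for j in jobs]

warehouse_wide = KeyManager()
w_keys = [warehouse_wide.next_value(sequence_name_for(j, "W")) for j in jobs]
```

With D, each dimension counts from its own sequence, so the same number can appear in two dimensions. With W, every key drawn is unique across the whole warehouse.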

You do not have to take any action to create surrogate keys; they are generated during the ETL process within the aforementioned DataStage routine. The DataStage routine retrieves the next surrogate key value and assigns it to the surrogate key that it is currently creating. When the ETL process copies a dimension row from the source system into the MDW, the ETL process performs a lookup on the dimension table. If the dimension row (with same business keys) does not exist in the dimension table, the process inserts a row with a new surrogate key value. If the dimension row already exists in the dimension table, the process updates the existing row with the incoming row value. When the ETL process copies a fact row from the source system into the MDW, for each dimension key in the fact row, the system performs a lookup on the dimension table and retrieves the corresponding surrogate key value. This surrogate key is the foreign key value in the fact row in the MDW. If the system does not locate a dimension value in the fact row in the dimension table, that is a data exception and an error results.
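The dimension and fact load logic described above might look roughly like this in Python. It is a simplified in-memory sketch rather than the delivered ETL jobs: the `Dimension` class, the `load_fact` function, and the field names `cust_id`, `cust_sid`, and `amount` are assumptions made for the example.

```python
import itertools


class Dimension:
    """Hypothetical in-memory dimension keyed by business key."""

    def __init__(self):
        self.rows = {}  # business key -> row dict, including the surrogate key
        self._sids = itertools.count(1)

    def upsert(self, business_key, **attributes):
        """Insert with a new surrogate key, or update the existing row in place."""
        row = self.rows.get(business_key)
        if row is None:
            row = {"sid": next(self._sids)}
            self.rows[business_key] = row
        row.update(attributes)
        return row["sid"]

    def lookup_sid(self, business_key):
        """Return the surrogate key, or raise if the dimension row is missing."""
        try:
            return self.rows[business_key]["sid"]
        except KeyError:
            raise LookupError(f"no dimension row for {business_key!r}") from None


def load_fact(source_row, customer_dim):
    """Swap the business key in a source fact row for the dimension surrogate key."""
    return {
        "cust_sid": customer_dim.lookup_sid(source_row["cust_id"]),
        "amount": source_row["amount"],
    }


customers = Dimension()
first_sid = customers.upsert("C001", name="Acme")        # new row, new key
second_sid = customers.upsert("C001", name="Acme Corp")  # same key, row updated
fact = load_fact({"cust_id": "C001", "amount": 250}, customers)
```

Loading a fact row whose `cust_id` is not in the dimension raises `LookupError`, which corresponds to the data exception mentioned in the text.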

Surrogate Key Benefits

Surrogate keys provide benefits such as:


  • The ability to easily and structurally conform a dimension when being sourced from multiple systems.

  • Disassociation from operational system changes.

    Because surrogate key generation is controlled by the warehouse, it is not influenced by operational system changes.

  • The ability to handle unspecified or missing key values.

  • A graceful mechanism to handle changes in history.

    Multiple versions of a dimension can be maintained with different surrogate (primary) keys, yet with the same business (identifying) key.

  • Performance enhancement of queries, because a surrogate key is a single column numeric key, thus the joins using surrogate keys are faster than ones using multi-column business keys.
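The point about handling changes in history can be made concrete with a short sketch: each new version of a customer gets its own surrogate key while the business key stays the same. The `current` flag, the `add_version` function, and the sample data are assumptions for illustration and are not described in the text above.

```python
import itertools

_sids = itertools.count(1)
d_customer = []  # hypothetical dimension table, one dict per row


def add_version(cust_id, city):
    """Add a new version of a customer under a fresh surrogate key."""
    for row in d_customer:
        if row["cust_id"] == cust_id and row["current"]:
            row["current"] = False  # retire the previous version
    row = {"cust_sid": next(_sids), "cust_id": cust_id, "city": city, "current": True}
    d_customer.append(row)
    return row["cust_sid"]


old_sid = add_version("C001", "Boston")
new_sid = add_version("C001", "Denver")  # same business key, new surrogate key

versions = [r for r in d_customer if r["cust_id"] == "C001"]
current = [r for r in versions if r["current"]]
```

Facts loaded before the change keep pointing at `old_sid`, so historical reports still show the customer in the earlier city.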

Audit Fields

Audit fields track extract, transform, and load (ETL) loading information, such as when the row was loaded or last modified or the batch in which the row was loaded. This information is included in a subrecord. The subrecord added to MDW tables is called LOAD_MDW_SBR. Subrecords are always added at the end of a record; no fields exist after this subrecord in any table.

Image: LOAD_MDW_SBR record example

The following example shows a typical LOAD_MDW_SBR subrecord.

Data Aggregation

Tables in the MDW contain source data at the same granularity as the source system. Required data aggregation is carried out at run time by the business intelligence tool. This allows for better control of aggregation strategies by the business intelligence tool, because aggregation requirements vary from customer to customer.

MDW Dimension Tables

Dimensions are sets of related attributes that you use to group or constrain detailed information that you measure in your data mart. Dimensions are usually text (in character data type), relatively static, and often hierarchical.

Dimension tables use a surrogate key as the primary key; it is a single-column key containing only the surrogate key column. Surrogate keys usually have _SID (surrogate ID) appended to the field name. Dimension tables retain source system business key fields as non-key attribute columns in the dimension table. However, these are not used for joins with fact tables. For example, in the Customer dimension, the original business key field CUST_ID is retained, if it exists in the source table, but is no longer included in the key. The SetID is also retained, if it exists in the source table, as a non-key attribute; the value contained in the SetID is the same as in the source system.

If a dimension is SetID-based, the MDW table contains the source SetID and the performance (PF) SetID, which is named SETID.

If a dimension contains a description text, a related language table is often defined for this dimension. The ETL process populates this table if a customer requires multilanguage processing. The key for this table is the surrogate key ID, plus the language code field, LANGUAGE_CD, which contains the code for the additional language.

Note: You can find more information about multilanguage processing for the multidimensional warehouse in your EPM Warehouse specific documentation (for example, the PeopleSoft EPM: Campus Solutions Warehouse).

Shared Dimensions

Dimensions such as Account, Customer, Department, or Person are examples of shared dimensions. Shared dimensions are either exactly the same (including key structure) or an exact subset of another dimension; that is, shared dimensions are structurally identical every place in which they are used. Shared dimensions are used across all EPM warehouse products, such as the Campus Solutions Warehouse and the Financial Management Solutions Warehouse.

When using a shared dimension, the system consistently interprets attributes; hence rollups across data marts are possible and consistent. When a warehouse is provided data from multiple sources, a shared dimension is typically (but not always) built from multiple source structures.

Image: EPM conformed dimension

The following is a sample MDW shared dimension shown in Application Designer.

MDW Dimension Table Naming Convention

MDW dimension tables use the following naming convention: D_[table name].

MDW Fact Tables

MDW fact tables (F_*) contain numeric performance measurement data (such as quantity, sales, and revenue) that is used to build a data warehouse and its related reports. Facts help to quantify a company's activities. A fact is typically an additive business performance measurement. That is, you can usually perform arithmetic functions on facts.

In a star schema, a fact table is the central table, each element of which is a foreign key derived from a dimension table. Dimension tables have a surrogate ID column that is the primary key of that dimension. A fact table may use these dimension surrogate IDs as foreign keys to the dimension table. In the dimensional model example graphic presented previously, the Sales fact table contains six foreign keys, each one matching a dimension surrounding the fact table.

Periodic Snapshot Fact Tables


Periodic Snapshots provide a view of the cumulative performance of the business at regular, predictable time intervals. Unlike a transaction fact table that loads a row of data for each event occurrence, the periodic snapshot fact table captures the event at the interval of a day, week, or month, and another capture at the interval of the next period, and so on. These periodic snapshots are stacked consecutively into the fact table. The periodic snapshot fact table often is the only place to easily retrieve a regular, predictable, trend view of the key business performance metrics.

Accumulating Fact Tables

Accumulating snapshots represent an indeterminate time span, covering the complete life of a transaction or discrete product. Accumulating snapshots almost always have multiple date stamps, representing the predictable major events or phases that take place during the course of a lifetime. Since many of these dates are not known when the fact row is first loaded, we must use surrogate date keys to handle undefined dates.
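A small sketch shows how a surrogate date key can stand in for a date that is not yet known. The placeholder value 0, the YYYYMMDD encoding, and the field names are assumptions chosen for the example; the text does not specify them.

```python
from datetime import date
from typing import Optional

UNKNOWN_DATE_SID = 0  # assumed key of a "not yet known" row in the date dimension


def date_key(d: Optional[date]) -> int:
    """Return a date surrogate key, or the placeholder when the date is undefined."""
    if d is None:
        return UNKNOWN_DATE_SID
    return d.year * 10000 + d.month * 100 + d.day  # assumed YYYYMMDD encoding


# The fact row is first loaded when the order is placed.
order_fact = {
    "order_id": 1001,
    "ordered_sid": date_key(date(2017, 3, 1)),
    "shipped_sid": date_key(None),
    "delivered_sid": date_key(None),
}
initial_shipped = order_fact["shipped_sid"]

# The same row is updated later, when the shipment milestone is reached.
order_fact["shipped_sid"] = date_key(date(2017, 3, 4))
```

Every date column therefore always holds a valid foreign key into the date dimension, even before the event has happened.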

MDW Fact Table Naming Convention

MDW fact tables use the following naming convention: F_[table name].

As those of you who watched my recent webinar Data Modeling Fundamentals With Sisense ElastiCube might recall, a primary key is a unique identifier given to a record in our database, which we can use when querying the database or in order to join multiple sources. This article will discuss the concept of surrogate keys and show some examples of when and how to apply them using simple SQL.

General Guidelines for Selecting Primary Keys

Before we dive into natural vs. surrogate keys, let’s recall four important rules to follow when selecting a primary key for your data model:

  1. The primary key must be unique for each record. A primary key with duplicates will lead to inaccurate queries with duplicated counts and totals. If two customers are assigned the same primary key, their sales activity will be unintentionally blended together. If the customer is accidentally duplicated, their sales activity will also be duplicated. Database architects refer to this as a loss of entity integrity.
  2. The primary key must apply uniform rules for all records. Whether your key is strictly numeric, alphanumeric, or a random system-generated value, each record must be programmed in a consistent format. This format must exist despite whatever complexities there are in the business requirements. An inconsistent format can lead to difficult data analysis, especially in parent/child data relationships.
  3. The primary key must stand the test of time. A key based on contextual data at the present time may not have the same contextual meaning later. For example, if a customer ID key is based on customer name, what happens when a customer is acquired or reorganized? Changing key formats should be avoided at all costs. Changing keys will require changing all stored procedures that reference the key in any JOINs or WHERE clauses, as well as UPDATEs to all existing references to the old key in all of your database tables.
  4. The primary key must be read-only. In order to stand the test of time, primary keys should never be edited. Edited primary keys can have typos (123123 vs 132123), varying formats based on the user’s preference (1 vs 000001), and allow for overwriting a previously deleted record. Never allow anyone to edit the value of primary keys.
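The first two rules are easy to check mechanically. The sketch below audits a list of candidate key values for duplicates and for values that break a single agreed format. The six-digit format and the function name `audit_keys` are assumptions made for the example.

```python
import re
from collections import Counter


def audit_keys(keys, pattern=r"\d{6}"):
    """Return (duplicates, malformed) for a list of candidate key strings."""
    counts = Counter(keys)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    malformed = sorted(k for k in counts if not re.fullmatch(pattern, k))
    return duplicates, malformed


# "000002" appears twice; "3" does not follow the six-digit format.
duplicates, malformed = audit_keys(["000001", "000002", "000002", "3", "132123"])
```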

Selecting a Primary Key: Surrogate vs. Natural Keys

First, let’s go over the difference between these two forms of primary keys:

A natural key is a key that has contextual or business meaning (for example, in a table containing STORE, SALES, and DATE, we might use the DATE field as a natural key when joining with another table detailing inventory).

A natural key can be system-generated, but natural keys are at least partially determined by a manual process. Some natural keys are totally manually generated. One of the most widely recognized uses of a natural key is a stock ticker symbol, e.g. MSFT, AAPL, and GOOGL. Natural keys serve as a great primary key when contextual meaning is important.

A surrogate key is a key which does not have any contextual or business meaning. It is manufactured “artificially” and only for the purposes of data analysis. The most frequently used version of a surrogate key is an increasing sequential integer or “counter” value (e.g. 1, 2, 3). Surrogate keys can also include the current system date/time stamp, or a random alphanumeric string.


When should you stick to natural keys in your data model?

The main advantage of natural keys is in their simplicity and in the fact that the data maintains its original context. They will often be (relatively) easy to recognize to people viewing the data, and relying on natural keys reduces the need to enrich the data using custom SQL. Additionally:

  • Natural keys are great for multiple data types in the database. Natural keys allow the user to easily identify the data type from the key, even when multiple data types use similar key formats. Financial databases frequently format their keys using a natural and sequential key together.

    Even though all three records contain a sequential ID of 123, the natural key prefix allows the user to immediately identify different data types.

  • Natural keys work well when connecting two systems with two different primary key formats. Thus for example, we can use

    To create

  • Natural keys make for a more easy-to-understand GUI. A customer ID such as GOOGL is easy for a user to recognize (for instance, you likely knew this stock ticker symbol is for Google). Easier recognition also allows for easier search.


Drawbacks of using natural keys

While it might be tempting and initially easier to rely on existing natural keys, this could prove problematic when scaling the data model, or in a more complex environment, which we will demonstrate using an example of stock tickers:

  • Natural keys do not apply uniform rules for each record. Designators or variables in the natural key make the key difficult to query and understand after the fact. For example, stock ticker symbols of preferred shares have a multitude of designators, including P, PR, and /PR. Trying to query for the designator P (SELECT * FROM stock_quotes WHERE stock_ticker_symbol LIKE '%P') would return all results where the stock ticker symbol ends in P, regardless of whether the symbol is actually preferred stock.
  • Natural keys do not stand the test of time. Symbols that once had business meaning could become meaningless, or bear a different meaning in the future. For example, the symbols GOOG and GOOGL do not reflect the reorganization of the company from Google to Alphabet.
  • Natural keys can be easily confused with each other. Sticking with the previous example – when Twitter was ready to launch their IPO under the ticker TWTR, many investors bought from a defunct electronics company named Tweeter, trading under the ticker TWTRQ. Because TWTR and TWTRQ contain the same first four letters, many investors unintentionally invested in the wrong stock. Tweeter later changed their ticker symbol to THEGQ, which could also be misconstrued with GQ Magazine (a privately held company under Conde Nast).
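The LIKE '%P' problem from the first bullet can be reproduced with Python's sqlite3 module. The ticker symbols and the `is_preferred` column are invented for the illustration and do not describe real securities.

```python
import sqlite3

rows = [
    ("QRSP", 1),    # preferred, designator P
    ("XYZ/PR", 1),  # preferred, designator /PR
    ("JUMP", 0),    # common stock whose symbol happens to end in P
    ("ABC", 0),     # common stock
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_quotes (stock_ticker_symbol TEXT, is_preferred INTEGER)")
conn.executemany("INSERT INTO stock_quotes VALUES (?, ?)", rows)

matches = conn.execute(
    "SELECT stock_ticker_symbol, is_preferred FROM stock_quotes "
    "WHERE stock_ticker_symbol LIKE '%P' ORDER BY stock_ticker_symbol"
).fetchall()

# Symbols the pattern returned that are not preferred stock.
false_positives = [symbol for symbol, preferred in matches if not preferred]

# Preferred stock the pattern failed to return.
matched_symbols = {symbol for symbol, _ in matches}
missed = [symbol for symbol, preferred in rows if preferred and symbol not in matched_symbols]
```

The pattern both over-matches (JUMP) and under-matches (XYZ/PR), which is why a designator embedded in a natural key is unreliable to query.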

Advantages of using surrogate keys

As mentioned, a surrogate key sacrifices some of the original context of the data. However, it can be extremely useful for analytical purposes for the following reasons:

  • Surrogate keys are unique. Because surrogate keys are system-generated, it is impossible for the system to create and store a duplicate value.
  • Surrogate keys apply uniform rules to all records. The surrogate key value is the result of a program, which creates the system-generated value. Any key created as a result of a program will apply uniform rules for each record.
  • Surrogate keys stand the test of time. Because surrogate keys lack any context or business meaning, there will be no need to change the key in the future.
  • Surrogate keys allow for unlimited values. Sequential, timestamp, and random keys have no practical limits to unique combinations.

Combining Natural and Surrogate Keys

Certain business scenarios might require keeping the natural key intact as a means for users to interact with the database. In these cases …

  • If a natural key is recommended, use a surrogate key field as the primary key, and a natural key as a foreign key. While users may interact with the natural key, the database can still have surrogate keys outside of the users’ view, with no interruption to user experience.
  • If a natural key must be used without an additional surrogate key, be sure to combine it with a surrogate key element. In our financial database example, Expense Reports (ER-123) have a natural key prefix used in conjunction with a surrogate sequential key. This format prevents many of the natural key side effects listed above.
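The first option can be sketched with sqlite3: the surrogate key is the primary key and the natural key is kept as a separate, user-facing column. Declaring the natural key UNIQUE is an assumption added here, and the table, column, and ticker values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE company ("
    " company_sid INTEGER PRIMARY KEY,"  # surrogate key, assigned by the database
    " ticker TEXT NOT NULL UNIQUE,"      # natural key that users see and search by
    " name TEXT)"
)

cur = conn.execute("INSERT INTO company (ticker, name) VALUES ('OLDCO', 'Old Company')")
sid_before = cur.lastrowid

# The business renames itself: only the natural key and the name change.
conn.execute(
    "UPDATE company SET ticker = 'NEWCO', name = 'New Company' WHERE company_sid = ?",
    (sid_before,),
)
sid_after = conn.execute(
    "SELECT company_sid FROM company WHERE ticker = 'NEWCO'"
).fetchone()[0]
```

Any table that references `company_sid` is unaffected by the rename, while users continue to work with the ticker.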

An Example of Adding a Surrogate Key Using Custom SQL

In the following example, we will look at a table containing historical data about product prices. By using a custom SQL expression in the Sisense ElastiCube Manager, we create the surrogate key Prod_Date_Key, which in this case is created by combining the other fields into a single, unique identifier that can easily be queried later.

Original:


SQL used to add surrogate key:


SELECT DISTINCT
tostring(PH.ProductID)+'_'+tostring(getyear(Date))+'-'+tostring(getmonth(Date))+'-'+tostring(getday(Date)) AS Prod_Date_Key,
Date,
PH.ProductID,
PH.ListPrice
FROM [ProductListPriceHistory] PH
JOIN [AllDates] ON Date BETWEEN PH.StartDate AND PH.EndDate

Result:
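As a rough illustration of the shape of that result, here is a Python equivalent of the query run on one made-up price history row. It assumes that AllDates holds one row per calendar day, that BETWEEN is inclusive, and that the date parts are not zero-padded. The sample values and the function name `expand` are hypothetical.

```python
from datetime import date, timedelta

# One invented row standing in for ProductListPriceHistory.
price_history = [
    {
        "ProductID": 707,
        "StartDate": date(2013, 1, 30),
        "EndDate": date(2013, 2, 1),
        "ListPrice": 33.64,
    },
]


def expand(history):
    """Emit one row per product per day in [StartDate, EndDate], with a combined key."""
    rows = []
    for ph in history:
        day = ph["StartDate"]
        while day <= ph["EndDate"]:
            key = f"{ph['ProductID']}_{day.year}-{day.month}-{day.day}"
            rows.append(
                {
                    "Prod_Date_Key": key,
                    "Date": day,
                    "ProductID": ph["ProductID"],
                    "ListPrice": ph["ListPrice"],
                }
            )
            day += timedelta(days=1)
    return rows


result = expand(price_history)
keys = [row["Prod_Date_Key"] for row in result]
```

Each product and day combination yields exactly one key, so the key can be used to join the price history to other daily tables.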
