Thursday 20 October 2011

Explain Skew Factor

Skew factor describes how a table's rows are distributed among the AMPs. If the data is highly skewed, some AMPs hold many more rows than others, i.e. the data is not evenly distributed. Skew hurts Teradata's performance, because a query finishes only when the busiest AMP finishes. Data distribution, and therefore skewness, is controlled by the choice of index.


There is no fixed acceptable range for skew factor, but it is generally suggested to keep it between 5 and 10.

Factors that affect the skew factor:

- Number of AMPS

- PI of a table

- Number of records in a table

- Size of row in a table


Query to calculate skew factor:

SELECT databasename,
       tablename,
       100 * (1 - (AVG(currentperm) / MAX(currentperm))) AS skew_factor
FROM dbc.tablesize
GROUP BY 1, 2
ORDER BY 1, 2;
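The skew query above can be complemented by looking at the raw per-AMP figures for a single table. A sketch, in which 'MyDB' and 'MyTable' are placeholder names:

```sql
-- Per-AMP space usage for one table; very uneven CurrentPerm values indicate skew.
-- 'MyDB' and 'MyTable' are placeholder names for illustration.
SELECT vproc,
       currentperm
FROM   dbc.tablesize
WHERE  databasename = 'MyDB'
  AND  tablename    = 'MyTable'
ORDER BY vproc;
```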

What is EXPLAIN?

The EXPLAIN command in Teradata provides an "English" translation of the steps chosen by the optimizer to execute an SQL statement. It is used to analyze joins and complex queries.


The following is an example:

EXPLAIN
SELECT last_name, first_name FROM employees;


EXPLAIN parses the SQL statement but does not execute it.

This provides the designer with an "execution strategy".

The execution strategy shows what the optimizer will do, but not why it chose those steps.


When you run a query with the EXPLAIN command, it does NOT return any result set; as mentioned, it only returns an explanation of how the results would be produced if the query were run.
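EXPLAIN is most useful on joins. You might prefix a two-table query with it; note that the departments table and the join column below are hypothetical, used only for illustration:

```sql
-- "departments" and "department_no" are assumed names for this sketch.
EXPLAIN
SELECT e.last_name, d.department_name
FROM   employees e,
       departments d
WHERE  e.department_no = d.department_no;
```

The resulting plan describes the join strategy (e.g. whether a table is duplicated or redistributed across AMPs) without touching any data.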

Explain Architecture of Teradata

The biggest strength of Teradata is parallelism, and its architecture is designed to preserve that strength. Teradata is unique among databases because of this architecture.

The main components of the Teradata architecture are the PE (Parsing Engine), the AMPs (Access Module Processors) and the BYNET. We look into these components in detail after looking at the logical view of the architecture.


The logical view of the Teradata architecture is described below -

Parsing Engine (PE) – Whenever a user logs in to Teradata, they actually connect to a Parsing Engine (PE). When the user submits a query, the PE takes action: it creates a plan and instructs the AMPs what to do in order to produce the result. The PE knows how many AMPs are connected to the system, how many rows are in each table, and what the best possible plan to execute the query is.

This is why the PE is also called the 'OPTIMIZER'.

Besides building the execution plan, the PE also checks the user's access rights, i.e. whether the user has the privilege to execute the query or not.
In this way the PE also enforces security.

Access Module Processor (AMP) – Each AMP attached to the Teradata system listens to the PE via the BYNET for instructions. Each AMP is connected to its own disk and has the privilege to read and write data on that disk ONLY; this is known as the 'SHARED NOTHING ARCHITECTURE'. An AMP can be thought of as a processor with its own disk attached. Whenever it receives instructions from the PE, it fetches the data from its disk and sends it back to the PE through the BYNET. Teradata spreads the rows of a table evenly across all the AMPs; when the PE asks for data, all AMPs work simultaneously, each reading records from its own disk. This is parallelism, and it is also why a query is only as fast as the slowest AMP in the system.

BYNET – The BYNET is the communication channel between the PEs and AMPs. It ensures that the communication between PE and AMP is correct and on the right track.
In a Teradata system there are always two BYNETs.
They are called 'BYNET 0' and 'BYNET 1', but we usually refer to them as a single BYNET system. Two BYNETs exist on a Teradata system because –

1)      If one BYNET fails, the second BYNET takes over its work.
2)      Two BYNETs improve the performance of the system: the PEs and AMPs can talk to each other over both BYNETs at once, which speeds up communication.

Short summary –
    * The PE checks the syntax of the query and the user's security rights.
    * The PE then comes up with the best optimized plan for executing the query.
    * The PE passes this plan through the BYNET to the AMPs.
    * The AMPs follow the plan and retrieve the data from their disks.
    * The AMPs then pass the data back to the PE through the BYNET.
    * The PE passes the data to the user.
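The row distribution described above is driven by hashing the primary index value. Teradata's built-in hash functions let you observe it directly; in this sketch, 'MyTable' and 'pi_col' are placeholder names:

```sql
-- Number of AMPs in the system (HASHAMP() with no argument returns the
-- highest AMP number, so add 1).
SELECT HASHAMP() + 1 AS number_of_amps;

-- Which AMP each row of a table lands on, based on its primary index value.
-- 'MyTable' and 'pi_col' are placeholder names.
SELECT HASHAMP(HASHBUCKET(HASHROW(pi_col))) AS amp_no,
       COUNT(*) AS row_count
FROM   MyTable
GROUP BY 1
ORDER BY 1;
```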

Tuesday 18 October 2011

Explain Secondary Index

Up to 32 unique and non-unique secondary indexes can be defined for a table. Compared to primary indexes, secondary indexes allow access to information in a table by alternate, less frequently used paths. A secondary index is a subtable that is stored on all AMPs, but separately from the primary table. The subtables, which are built and maintained by the system, contain the following:
  • RowIDs of the subtable rows
  • Base table index column values
  • RowIDs of the base table rows (pointers)
As shown in the following figure, the secondary index subtable on each AMP is associated with the base table by the rowID.

Secondary indexes are optional. Unlike the primary index, a secondary index can be added or dropped without recreating the table. One or more secondary indexes can be defined in the CREATE TABLE statement, or added to an existing table using the CREATE INDEX statement or ALTER TABLE statement. DROP INDEX can be used to drop a named or unnamed secondary index. Because secondary indexes require subtables, they consume additional disk space and may require additional I/Os for INSERTs, DELETEs, and UPDATEs.
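The DDL just mentioned can be sketched as follows; the table, column, and index names are purely illustrative:

```sql
-- A unique secondary index (USI) defined inside CREATE TABLE.
-- All names here are placeholders for illustration.
create table employees (
    employee_id   integer not null,
    badge_number  integer,
    last_name     char(20))
unique primary index (employee_id)
unique index (badge_number);

-- Adding, then dropping, a named non-unique secondary index (NUSI)
-- on the existing table.
create index idx_last_name (last_name) on employees;
drop index idx_last_name on employees;
```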

If a Teradata SQL request uses secondary index values in a WHERE constraint, the optimizer may use the rowID in a secondary index subtable to access the qualifying rows in the data table. If a secondary index is used only periodically by certain applications and is not routinely used by most applications, disk space can be saved by creating the index when it is needed and dropping it immediately after use.

A unique secondary index (USI) is very efficient: it typically involves only two AMPs, requires no spool file, and has one row per value. When a USI is used to access a row, two AMPs are involved, so unique secondary indexes can improve performance by avoiding the overhead of scanning all AMPs. For example, if a unique secondary index is defined on the department_name column of the customer_service.department table (assuming that no two departments have the same name), then the following query is processed using two AMPs:

SELECT department_number FROM customer_service.department
        WHERE department_name = 'Education';

In this example, the request is sent to AMP n, which contains the rowID for the secondary index value 'Education'; this AMP in turn sends the request to AMP m, where the data row containing that value is stored. Note that the rowID and the data row may reside on the same AMP, in which case only one AMP is involved.

A non-unique secondary index (NUSI) may have multiple rows per value. As a general rule, a NUSI should not be defined if the maximum number of rows per value exceeds the number of data blocks in the table. A NUSI is efficient only if the number of rows accessed is a small percentage of the total number of data rows in the table. It can be useful for complex conditional expressions or for processing aggregates. For example, if the contact_name column is defined as a secondary index for the customer_service.contact table, the following statement can be processed using the secondary index:

    SELECT * FROM customer_service.contact
        WHERE contact_name = 'Mike';

After the request is submitted, the optimizer first determines whether it is faster to do a full-table scan of the base table or a full-table scan of the secondary index subtable to get the rowIDs of the qualifying base table rows; it then places those rowIDs into a spool file, and finally uses them to access the base table rows.
Non-unique secondary index access is used for request processing only when it is less costly than a complete table scan.

Teradata allows the use of Primary and Secondary Indexes; however, there are a few differences between the two. In Teradata, Primary Indexes are required and Secondary Indexes are optional.

Each table must have a Primary Index for distributing the rows among the VPROCs.  Each table can have only one Primary Index, but it can have up to 32 Secondary Indexes.  Both Primary and Secondary Indexes may contain up to 16 columns, and they both may be unique or non-unique. 

Row distribution among VPROCs is only affected by the Primary Index not the Secondary Index.  Secondary Indexes may be created and dropped at any time, but if the Primary Index must be changed or dropped, the table must be dropped and recreated.   Again, both the Primary and Secondary Indexes will affect the table performance, so poorly chosen indexes can cause some VPROCs to work harder than others.

Thursday 29 September 2011

Explain Primary Index

The primary index is used for data distribution among the AMPs. Every table has exactly one primary index, and it is mandatory (even if you don't define one explicitly, Teradata assigns one).

A primary index should be defined to provide a nearly uniform distribution of rows among the AMPs: the more unique the index, the more even the distribution of rows and the better the space utilization.
The index should be defined on as few columns as possible.
A primary index can be either unique or non-unique. A unique primary index must have a unique value in the corresponding field(s) of every row, and it is the more efficient choice.

A non-unique primary index permits the insertion of duplicate field values. Once created, the primary index cannot be dropped or modified; to change it, the table must be recreated.


Creating an index with the CREATE TABLE command:
create table organic (serial_No integer,
                      organic_name char(15),
                      Carbon_number smallint,
                      amount smallint)
unique primary index (serial_No);

create table inorganic (serial_No integer,
                        inorganic_name char(15),
                        anion char(5),
                        cation char(6),
                        amount smallint)
unique primary index (serial_No);

Non-unique primary index:
create table student (student_ID char(15),
                      student_name char(20),
                      Department char(10))
primary index (student_ID);
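Because the primary index cannot be altered in place, a common pattern for "changing" it is to copy the table and swap names. A sketch using the student table above (the new primary index choice here is only illustrative):

```sql
-- Recreate the table with a different primary index, then swap names.
-- Choosing Department as the new PI is illustrative only.
create table student_new as (select * from student) with data
primary index (Department);

drop table student;
rename table student_new to student;
```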

Thursday 15 September 2011

Teradata Utilities

Teradata loading utilities
This article compares the main features of the data loading tools provided by Teradata: MultiLoad, FastLoad and TPump (Parallel Data Pump), and provides sample real-life uses of those tools.

Fast Load

  • Main use: to load empty tables at high speed.
  • The target table must be empty in order to use FastLoad.
  • It supports inserts only (it is not possible to perform updates or deletes in FastLoad).
  • Although FastLoad uses multiple sessions to load the data, only one target table can be processed at a time.
  • Teradata FastLoad does not support join indexes, foreign key references in target tables, or tables with a secondary index defined.
  • Any of the constraints listed must be dropped before loading and recreated afterwards.
  • The maximum number of concurrent Teradata FastLoad tasks can be adjusted by a system administrator.
  • FastLoad runs in two operating modes: interactive and batch.
  • Duplicate rows will not be loaded.

Multi Load

  • Main use: to load, update and delete large tables in Teradata in bulk mode.
  • Efficient at loading very large tables.
  • Multiple tables can be loaded at a time.
  • Updates data in block mode (one physical write can update multiple rows).
  • Uses table-level locks.
  • Resource consumption: loads at the highest possible throughput.
  • Duplicate rows are allowed.

TPump

  • Main use: to load or update a small number of target table rows.
  • Sends data to the database statement by statement, which is much slower than bulk mode.
  • TPump uses row-level hash locks.
  • Resource consumption: loading speed can be adjusted using a built-in resource consumption management utility; throughput can be turned down during peak periods.
  • TPump does not support MULTISET tables.

  FastLoad (script)

    ===========================


    The following script loads a sample fixed-length columns extract into a Teradata database using FastLoad.
    
    
    SESSIONS 4;
    ERRLIMIT 25;
    logon tdpid/username,password;
    
    create table gg_cli (
    wh_cust_no integer not null,
    cust_name varchar(200),
    bal_amt decimal(15,3) format 'ZZZ,ZZ9.999'
    )
    unique primary index( wh_cust_no ) ;
    
    SET RECORD UNFORMATTED;
    
    define
                    wh_cust_no(char(10)), delim1(char(1)),
                    cust_name(char(200)), delim2(char(1)),
                    bal_amt(char(18)), delim3(char(1)),
                    newlinechar(char(1))
    file=insert.input;
    
    SHOW;
    
    BEGIN LOADING gg_cli errorfiles error_1, error_2;
    
    insert into gg_cli values (
                    :wh_cust_no,
                    :cust_name,
                    :bal_amt
    );
    
    END LOADING;
    
    logoff;
    

    MultiLoad (script)

    ===========================

    The following script loads a sample fixed-length columns extract into a Teradata database using MultiLoad.

    .logtable inslogtable;
    .logon tdpid/username,password;
    
    create table gg_cli (
    wh_cust_no integer not null,
    cust_name varchar(200),
    bal_amt decimal(15,3) format 'ZZZ,ZZ9.999'
    )
    unique primary index( wh_cust_no ) ;
    
    .BEGIN IMPORT MLOAD tables gg_cli;
    
    .layout ggclilayout;
                    .field wh_cust_no 1 char(10);
                    .field cust_name 12 char(200);
                    .field bal_amt 213 char(18);
    .dml label insertclidml;
    
    insert into gg_cli.*;
    
    .import infile insert.input
                    format text
                    layout ggclilayout
                    apply insertclidml;
    .END MLOAD;
    
    .logoff;
    
    

    TPump (script)

    ===========================

    The sample script attached below loads a sample fixed-length columns extract into a Teradata database using Parallel Data Pump - Teradata TPump. 

    .logtable tpumplogtable;
    .logon tdpid/username,password;
    
    .BEGIN LOAD SESSIONS 4;
    
    .layout ggclilayout;
                    .field wh_cust_no 1 char(10);
                    .field cust_name 12 char(200);
                    .field bal_amt 213 char(18);
    .dml label insertclidml;
    
    insert into gg_cli.*;
    
    .IMPORT INFILE insert.input
                    layout ggclilayout
                    apply insertclidml;
    .END LOAD;
    
    .logoff;




Monday 5 September 2011

Stats in Teradata

COLLECT STATISTICS is a mechanism that helps the optimizer find the least-cost execution plan for each requested query.

Collect statistics on a table (this refreshes all statistics already defined on it):
collect stats on TABLE_NAME;

Collect statistics on a column of a table:
collect stats on TABLE_NAME column(COL1);
collect stats on TABLE_NAME column(COL2);

To get details about the statistics of a table:
help statistics DBName.TableName;
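A couple of further COLLECT STATISTICS variants; the table and column names below are placeholders:

```sql
-- Multi-column (combined) statistics (placeholder names).
collect stats on MyTable column (col1, col2);

-- Statistics on an index.
collect stats on MyTable index (col1);
```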