DataStage Interview Questions & Answers

1. What is DataStage?

DataStage is an application on the server, which connects to data Sources and Targets and processes ("transforms") the data as they move through the application. DataStage is classed as an "ETL tool", the initials standing for Extract, Transform and Load respectively.

2. What is a DataStage job?

DataStage is a client-server technology with integrated tools for designing, running, monitoring, and administering data acquisition applications; such an application is known as a DataStage job.

3. What are key things required to build a DataStage job?

A job is a graphical representation of the data flow from source to target systems: an ordered series of individual stages linked together to describe that flow. To build a DataStage job you need:

1) Source Metadata
2) Business Logic - Data Transformation
3) Target Metadata

4. What are the DataStage Client Components?

DataStage has three client components:

1) DataStage Administrator

The Administrator is a user interface used to perform administration tasks such as setting up DataStage users, creating and deleting projects, and setting up purging criteria.

2) DataStage Designer

The Designer is an interface used to create DataStage applications (known as jobs). Each job specifies the data sources, the transforms required, and the destination of the data. Jobs are compiled to create executables that are scheduled by the Director and run by the Server. The Designer also handles job export and import.

3) DataStage Director

The Director is a user interface used to validate, schedule, run, and monitor DataStage server jobs, parallel jobs, and job sequences.

5. What are the DataStage Server Components?

DataStage Repository: a central store that contains all the information required to build a data warehouse.
DataStage Server: runs executable jobs that extract, transform, and load data into a data warehouse.
DataStage Package Installer: a user interface used to install packaged DataStage jobs and plug-ins.

6. What is the DataStage architecture?

DataStage follows a client-server architecture: the client components (Administrator, Designer, Director) connect to the server components (Repository, Server, Package Installer), where jobs are stored, compiled, and executed.

7. Types of jobs in DataStage

There are three types of DataStage jobs:

Server jobs: These do not use parallelism because they run on a single node; new server jobs are rarely created these days. Data retrieval is slow and the design is a bit more complicated.

Parallel jobs: These run on multiple nodes based on the partition techniques specified. Data retrieval is faster and the design is easier.

Sequence job: Sequence jobs contain activities, which are special stages that indicate the actions that occur when the sequence job runs. You interact with activities in the same way that you interact with stages in parallel jobs and server jobs.

8. Types of Parallelism

Parallelism means the job executes in parallel mode across partitions.

There are two types of parallelism:

Pipeline parallelism: downstream stages start processing rows as soon as upstream stages produce them, so extract, transform, and load run concurrently (e.g. 1 2 3 4 5 ----> transformer ----> load). A small sketch follows below.

Partition parallelism: the data is split across nodes based on the partition technique and node configuration, and each partition is processed in parallel.
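As a rough illustration of the pipeline idea above, here is a sketch using plain Python generators (not DataStage code): each stage consumes rows as soon as the previous one yields them.

# Sketch of pipeline parallelism: each stage consumes rows as soon as the
# previous stage produces them, instead of waiting for the full dataset.
def source():
    for row in [1, 2, 3, 4, 5]:
        yield row                    # rows stream out one at a time

def transformer(rows):
    for row in rows:
        yield row * 10               # transformation applied per row as it arrives

def load(rows):
    for row in rows:
        print("loaded", row)         # target receives rows while the source is still producing

load(transformer(source()))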

1st check: how many nodes there are (e.g. 2).

2nd check: which partition technique is used.

Example input data: 1, 3, 2, 4, 5, 7, 8, 4

Partition techniques fall into two groups: keyless (Round robin, Random, Entire, Same) and key-based (Hash, Modulus, Range, DB2); both are described in the next question.

9. Various Partition Methods?

1. Round robin (keyless): the records are distributed equally, being sent to the nodes one after another.
node1: 1, 2, 5, 8
node2: 3, 4, 7, 4

2. Random (keyless): the records are distributed roughly equally, but sent to the nodes at random.
Node1: 1, 3, 2, 4
Node2: 5, 7, 8, 4

3. Entire (keyless): every node receives a complete copy of the data; typically used on the reference link of a Lookup stage.
node1: 1, 3, 2, 4, 5, 7, 8, 4
node2: 1, 3, 2, 4, 5, 7, 8, 4

4. Same (keyless): the data keeps whatever partitioning was applied upstream, e.g. seq ---> sort (round robin) ---> transformer (same) ---> target.

Key-based partition techniques (example sorted input: 1, 2, 3, 4, 4, 5, 7, 8, 8):

1. Hash: rows with the same key value always go to the same node (so duplicates stay together); the key column can be of any type, number or varchar.
node1: 1, 3, 5, 8, 8
node2: 2, 4, 4, 7

2. Modulus: the node is chosen as key mod number-of-nodes (result 0 ---> node1, 1 ---> node2, 2 ---> node3, and so on). Modulus supports only numeric key columns; compared to Hash it is faster when the key column is a number/integer. For example, with 2 nodes: mod(2, 2) = 0, so the row goes to node1; mod(3, 2) = 1, so the row goes to node2.
3. Range: rows are distributed according to ranges of the key values, so each node receives a band of keys.
4. DB2: matches the partitioning of an existing DB2 table.

Default partition method: Auto.

Collecting methods (used when parallel partitions are gathered back into a single stream): Ordered, Round robin, Sort merge; default: Auto.
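The following sketch (plain Python, not DataStage internals) illustrates how the keyless Round robin method and the key-based Hash and Modulus methods distribute the example rows above across a 2-node configuration.

# Illustrative sketch of partitioning logic, assuming the example input above
# and a 2-node configuration. This is not how DataStage is implemented.
from itertools import cycle

rows = [1, 3, 2, 4, 5, 7, 8, 4]
num_nodes = 2

# Round robin (keyless): rows go to the nodes one after another.
round_robin = {n: [] for n in range(num_nodes)}
for node, row in zip(cycle(range(num_nodes)), rows):
    round_robin[node].append(row)
print(round_robin)          # {0: [1, 2, 5, 8], 1: [3, 4, 7, 4]}

# Hash (key-based): the same key value always lands on the same node.
hash_part = {n: [] for n in range(num_nodes)}
for row in rows:
    hash_part[hash(row) % num_nodes].append(row)

# Modulus (key-based, numeric keys only): node = key mod number of nodes.
modulus = {n: [] for n in range(num_nodes)}
for row in rows:
    modulus[row % num_nodes].append(row)
print(modulus)              # {0: [2, 4, 8, 4], 1: [1, 3, 5, 7]}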

10. What is a configuration file and what is its importance?

The APT configuration file defines the degree of parallelism; it is the heart and soul of DataStage parallel jobs.
Default: default.apt (typically set up with 2 or 4 nodes).
Structure of a configuration file (paths are placeholders):
{
  node "node1"
  {
    fastname "server_name/hostname"
    pools ""
    resource disk "/path/to/datasets" {pools ""}
    resource scratchdisk "/path/to/scratch" {pools ""}
  }
}
Information in the configuration file:
Node information: how many nodes the project has.
Pool information: which pool(s) a node belongs to, used to control where processing runs.
Resource disk: permanent storage location for datasets.
Resource scratchdisk: temporary storage for sorts and other operations; it is emptied automatically.

11. Can we create our own configuration file?

Yes: Tools ---> Configurations ---> New.

12) What is meant by SMP and MPP? How will you find out that your job is running on SMP or MPP?

SMP: Symmetric Multiprocessing; a shared-everything architecture. If the fastname is the same for all the nodes in the configuration file, the system is SMP.
MPP: Massively Parallel Processing; a shared-nothing architecture. If the fastname differs between nodes, the system is MPP.

13. Parameters

What is meant by parameters, what are they used for, and what are the types of parameters?
A parameter stores a default or frequently repeated value so it does not have to be hard-coded in every job, which saves time and makes jobs reusable.
There are two types of parameters:
Job-level parameters: useful only for a particular job.
Project-level parameters (parameter sets): can be used by all the jobs in the project.

14. Environment Variables

Default environment variables: the project's default environment variables.
User-defined environment variables: created by us.
Q) Can we create our own environment variable?
A: Yes. Job properties ---> Parameters ---> Add Environment Variable (at the bottom) ---> New.

15. Importing Metadata [Table definition]:

Metadata: data about the data.
How to import table definitions: click Load, or right-click Table Definitions in the repository and choose the appropriate import option:
ODBC/plug-in table definitions for databases,
sequential file definitions for flat files, and
orchestrate schema definitions for file sets and datasets.

16. About Sequential File Stage

Sequential File Stage

1. It supports 1 input link, 1 output link, and an optional reject link.
2. It is used to read/write text files and CSV files (flat files).
3. The default execution mode of the Sequential File stage is 'Sequential', which means by default it runs on a single node.
4. How can we make it execute in parallel mode?
Ans)
a) Read method = File pattern.
b) Read from multiple nodes = No (default); if changed to Yes, the stage runs in parallel.
c) Number of readers per node = 1 (default); if changed to more than 1, the stage runs in parallel.
d) If multiple files are read, the stage automatically switches to parallel.
5. By default the sequential file does not allow null values; to allow them use Field defaults ---> Null field value ' '.
6. We can run unix commands in the Sequential File stage via Options ---> Filter (unix commands).
7. It handles data properly only up to 2 GB (it writes a maximum of 2 GB of data).

17. About Dataset

1. It supports 1 input link and 1 output link; no reject link.
2. This is DataStage's native format; it saves the data in binary format.
3. Reading and writing a dataset is faster because it runs in parallel mode.
4. We use datasets for intermediate staging: source [file] with a lot of data ---> dataset load [intermediate target] ---> that dataset as source ---> final target.
5. To view a dataset: Tools ---> Data Set Management, or look at the resource disk.
6. We cannot delete a dataset directly because creating it produces four files:
a) Data file: contains only the data.
b) Descriptor file: contains the schema of the file.
c) Control file: related to the dataset control functions.
d) Header file: contains only the column names.
7. To delete a dataset permanently there are two approaches:
1) Tools ---> Data Set Management ---> select the dataset ---> delete.
2) From the command prompt/unix: orchadmin rm <dataset name>.
8. We cannot view dataset data directly since it is in binary format; to view it use orchadmin dump <dataset name> (optionally redirecting the output to a text file), or Tools ---> Data Set Management ---> view the dataset.
9. A dataset depends on the node configuration; for example, with 2 nodes the dataset creates 2 data files.
10. The extension of a dataset is .ds.

18. Database (RDBMS/Cloud) Stages

Oracle connector stage:
1. It can be used as a source (output link), as a target (input link), or as the reference link of a Lookup stage.
2. It is used to read data from a table or to write data to a table.
3. It requires connection details: server name, username, and password.

19. Processing Stages

Copy Stage

1. It supports 1 input link, any number of output links, and no reject links.
2. It is mainly used to drop columns, rename columns, modify column datatypes, and add new columns if required.
3. Force = False (default): with a single straight-through output the copy may be optimized out, which is faster. Force = True: the copy is always performed, so performance with a single output may not improve.
4. If we need to run a number of transformations that cannot all be fed from a single link, we can use a Copy stage and connect any number of output links from it.

There are two ways to sort: the Sort stage and link sort, covered next.

20. Sort Stage:

1. It supports 1 input link and 1 output link; there is no reject link.
2. To sort the data we need at least one key column.
3. Sorted data helps downstream stages produce results faster.
4. Options ---> Create key change column = No (default): no new column is created. = Yes: a new KeyChange column is created (1 for the first record of each key group, 0 for the rest).
5. Sort key mode = Sort / Previously sorted / Previously grouped. Create key change column can be used with Sort or Previously sorted; when the sort key mode is Previously grouped, use Create cluster key change column instead.
6. Allow duplicates = True (default). If set to False, only one record per key is sent to the target; the record retained is the first one.
7. Stable sort = True means records that are already in the required order are not rearranged.

21. Link Sort:

1. When we have less data we can use link sort.
2. Link sort offers the same basic functionality as the Sort stage but uses less memory.
3. On the processing stage: Input ---> Partitioning (Hash) ---> select the key column ---> Perform sort ---> Stable / Unique.
4. When the sort is Unique, only unique records are sent to the target.
5. When the sort is Stable, data that is already in sorted order is not rearranged.

22) What is the difference b/w link sort and sort stage?

Ans: Link sort and the Sort stage provide the same functionality, but the Sort stage has a few extra options such as Restrict memory usage = 20 MB (default) and Create key change column.

23) Remove Duplicate Stage

1. It supports 1 input link and 1 output link; no reject link.
2. At least one key column is mandatory.
3. Removing duplicates is mainly used when implementing SCDs.
4. Options ---> Duplicate to retain = First / Last. When Last is chosen, the last record is kept and the earlier records are dropped.

24) How many ways can we remove the duplicates in Datastage?

Answers:
1. Sort stage ---> Allow duplicates = False.
2. Link sort ---> Perform sort ---> Unique.
3. Remove Duplicates stage.
4. At the source (file ---> filter command / database ---> DISTINCT).
5. Transformer stage (using 2 stage variables), as sketched below.
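A rough sketch of the transformer approach with two stage variables (assuming the input is already hash-partitioned and sorted on the key; the names below are illustrative, not DataStage syntax):

# Sketch of the two-stage-variable dedup pattern on key-sorted input.
# sv_prev_key / sv_is_new_key mimic the two transformer stage variables.
def dedup_sorted(rows, key):
    sv_prev_key = None                                 # stage variable 1: previous row's key
    for row in rows:
        sv_is_new_key = row[key] != sv_prev_key        # stage variable 2: key-change flag
        sv_prev_key = row[key]
        if sv_is_new_key:                              # constraint: only the first row per key passes
            yield row

rows = [{"eid": 101}, {"eid": 101}, {"eid": 102}]
print(list(dedup_sorted(rows, "eid")))                 # [{'eid': 101}, {'eid': 102}]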

25. Filter Stage

1. It supports 1 input link, any number of output links, and one optional reject link.
2. The Filter stage works like a WHERE clause condition.
3. The number of output links equals the number of where-clause conditions.
4. The where-clause conditions can be written on any number of columns.
5. 'Output row only once' means a record that has already been sent to one output is not sent to any other output.

26. Switch Stage:

1. It supports 1 input link, up to 128 output links, and an optional reject link.
2. The Switch stage works like a CASE statement.
3. It can work on only one column, with a number of case values.
4. The number of case values should equal the number of output links.

27. Surrogate Key Generator Stage

1. It supports 1 input link and 1 output link.
2. The Surrogate Key Generator stage generates sequence numbers, so there are no duplicates.
3. It is mainly used to implement SCD2: instead of using the business key for the calculation, we generate a surrogate key to carry into the fact tables.

28. Funnel Stage

SQL set operators: UNION, UNION ALL, INTERSECT and MINUS. The Funnel stage works like the UNION ALL operator.

1. It supports any number of input links and 1 output link; there are no reject links.
2. The Funnel stage can only be used when the column names and the number of columns are the same in all the input sources.
3. It works only on inputs with the same structure.
4. The data is sent to the target with UNION ALL semantics.
5. There are three types of funnels:
A) Continuous funnel: no particular order; whichever data reaches the funnel from the sources first is sent to the target first.
B) Sequence funnel: the data is processed to the target based on the input link ordering.
C) Sort funnel: the data is processed to the target in order of the key column, e.g. file1: 2 3 6, file2: 1 4 5 ---> target: 1 2 3 4 5 6 (see the sketch below).
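A minimal sketch of the sort-funnel behaviour on the example above, assuming each input is already sorted on the key (an illustration only, not DataStage code):

# Sort funnel behaviour: merge already-sorted inputs by key.
import heapq

file1 = [2, 3, 6]
file2 = [1, 4, 5]

target = list(heapq.merge(file1, file2))   # streams the inputs in key order
print(target)                              # [1, 2, 3, 4, 5, 6]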

Join/Lookup/Merge: these are used when the inputs have dissimilar structures (the column names and the number of columns are not the same), but a common key column is mandatory.

Join Stage

1. It supports any number of input links and 1 output link; no reject link.
2. To join 2 or more tables/files we need a common key column.
3. To join the tables/files, the data must be in sorted order.
4. The first link is called the left link and the last link the right link; the links in between are called intermediate links.
5. The Join stage supports 4 types of joins.

Ex: Table A:
eid,ename,sal
101,Siva,2000
103,Madhav,3000

Table B:

Eid,Designation
101,Consultant
104,Software Engineer

A) Inner join: matching records from all the source tables/files.
B) Right outer join: matching records from all the tables/files plus unmatched records from the right table.
C) Left outer join: matching records from all the tables/files plus unmatched records from the left table.
D) Full outer join: matching records plus unmatched records from all the tables/files.
6. When we have a huge volume of data, it is better to use the Join stage.
7. The data processed by the Join stage is buffered on the hard disk.
8. We apply the partition technique on the key column; the default is Hash. Hash and Modulus are supported: Hash when the key column is int/string, Modulus when the key column is an integer. (An illustration of the four join types follows below.)
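For illustration only, the four join types on Table A and Table B above can be reproduced with pandas (a sketch, not DataStage output):

# Reproducing the four join types on Table A and Table B above (illustration only).
import pandas as pd

a = pd.DataFrame({"eid": [101, 103], "ename": ["Siva", "Madhav"], "sal": [2000, 3000]})
b = pd.DataFrame({"eid": [101, 104], "designation": ["Consultant", "Software Engineer"]})

inner = a.merge(b, on="eid", how="inner")   # 101 only
left  = a.merge(b, on="eid", how="left")    # 101 and 103 (designation null for 103)
right = a.merge(b, on="eid", how="right")   # 101 and 104 (ename/sal null for 104)
full  = a.merge(b, on="eid", how="outer")   # 101, 103 and 104
print(inner, left, right, full, sep="\n\n")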

Q) A very important question

Table A:
No: 1, 1, 1, 1

Table B:
No: 1, 1, 1, 3, 1, 1, 2

O/P:
Inner join: every 1 in Table A matches every 1 in Table B, so the result is 4 x 5 = 20 records, all with No = 1 (a Cartesian product of the matching key values).

Lookup Stage: to join data with dissimilar structures.

1. It supports any number of input links, 1 output link, and 1 optional reject link.
2. The first link is called the primary link; the remaining links are called reference (lookup) links.
3. It is not mandatory for the data to be in sorted order.
4. A key column is mandatory.
5. Partition techniques: primary link: Hash/Modulus; reference link: Entire.
6. The data on the primary link is checked against the data on the reference link. If Entire partitioning is not used, the reference data is split across the nodes, so comparing the primary-link data with the reference-link data becomes difficult and processing is slow; with Entire partitioning every node holds all the reference data, so processing is fast.
7. The Lookup stage is better when the volume of data is small, since the data is processed in RAM.
8. There are 3 types of lookups we can perform in the Lookup stage.
A) Normal lookup:
1. When the primary link has more data than the reference link, it is better to use a normal lookup.
2. It supports a number of input links.
3. Any type of source can be used for the reference link in a normal lookup.
B) Sparse lookup:
1. Used when the reference link has more data than the primary link.
2. It supports only 2 input links.
3. The sparse-lookup reference link supports only databases such as Oracle/DB2.
C) Range lookup: used when you want to apply a range condition while processing the data to the target.
9. The Lookup stage has a property called 'Multiple rows returned from a link', which lists the reference links. If it is left empty, each matching record from the primary link is processed to the target with a single reference row; selecting a reference link allows all matching reference rows to be returned.
10. Lookup supports only 2 types of joins (sketched below):
A) Left outer join: in the Lookup stage properties, Lookup failure = Continue.
B) Inner join: Lookup failure = Drop / Reject (or Fail to abort the job).
11. The reject link carries the rejected primary-link data, i.e. primary-link records that do not match any reference-link record are sent to the reject link.
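A rough sketch of the lookup-failure behaviour (Continue / Drop / Reject) using the earlier employee example; plain Python for illustration, not DataStage code:

# Sketch of lookup-failure handling on the primary link (illustration only).
reference = {101: "Consultant", 104: "Software Engineer"}          # reference link keyed on eid
primary = [{"eid": 101, "ename": "Siva"}, {"eid": 103, "ename": "Madhav"}]

def lookup(rows, ref, on_failure="continue"):
    output, rejects = [], []
    for row in rows:
        match = ref.get(row["eid"])
        if match is not None:
            output.append({**row, "designation": match})
        elif on_failure == "continue":        # left outer join: keep the row, designation is null
            output.append({**row, "designation": None})
        elif on_failure == "reject":          # unmatched primary rows go to the reject link
            rejects.append(row)
        # on_failure == "drop": inner join, unmatched primary rows are discarded
    return output, rejects

print(lookup(primary, reference, "continue"))
print(lookup(primary, reference, "reject"))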

Merge Stage: for handling data with dissimilar structures.

1. It supports any number of input links, 1 output link, and N-1 reject links.
2. The first link is called the master link; the remaining links are called update links.
3. N-1 reject links means the number of reject links depends on the number of update links (one per update link).
4. Partition technique: Hash/Modulus.
5. A key column is mandatory.
6. The data must be in sorted order.
7. Merge is better when we have a large volume of data, since it uses the hard disk.
8. It supports only 2 types of joins:
when Unmatched master mode = Keep ---> left outer join;
when Unmatched master mode = Drop ---> inner join.
9. The reject links carry the rejected data from the update links.

Difference between Join, Lookup and Merge

Joins supported:
Join: all types of joins (left, right, inner, full outer).
Lookup: 2 types; Lookup failure = Continue gives a left join, Lookup failure = Drop/Reject gives an inner join.
Merge: 2 types; Unmatched master mode = Keep gives a left join, Unmatched master mode = Drop gives an inner join.

Key column:
Join: key column is mandatory.
Lookup: key column is not mandatory; the mapping between columns must be done.
Merge: key column is mandatory.

Sorted input:
Join: the data should be in sorted order.
Lookup: it is not mandatory that the data is in sorted order.
Merge: the data should be in sorted order.

Memory usage:
Join: stores the data on disk.
Lookup: stores the data in RAM.
Merge: stores the data on disk.

Data volume:
Join: good to use when we have a huge volume of data.
Lookup: good to use when the volume of data is small.
Merge: good to use when the volume of data is huge.

Partition technique:
Join: Hash/Modulus.
Lookup: Hash/Modulus on the primary link, Entire on the lookup (reference) link.
Merge: Hash/Modulus.

Reject links:
Join: does not support reject links.
Lookup: supports 1 reject link.
Merge: supports N-1 reject links.

Links:
Join: any number of input links and 1 output link.
Lookup: any number of input links; the first is the primary link and the rest are reference links.
Merge: any number of input links; the first is the master link and the remaining links are update links.

CDC Stage (Change Data Capture): used when implementing slowly changing dimensions: SCD1, SCD2 and SCD3.

SCD1: only the current data is maintained; no historical data is stored.
source Ex: eid,ename,eaddress
101,siva,nellore
102,madhav,prakasam
103,bhargav,guntur
Target:
eid,ename,eaddress
101,siva,bangalore
102,madhav,prakasam
103,bhargav,guntur

SCD2: we maintain current + historical data.
To implement SCD2 we need a few prerequisite columns (a small sketch of this bookkeeping follows the example below):
1) Surrogate key ---> generates a sequence number (a unique number).
2) Start date ---> the date the record became current (e.g. 2017).
3) End date ---> initially set to a default high date (31-12-9999).
4) Current flag ---> Y or N.
5) Version number ---> previous record 1, current record 2.

Source Ex:
eid,ename,eaddress,startdate,enddate,currentflag,version number
101,siva,nellore,01-01-2017,31-12-9999,Y,1
102,madhav,prakasam,02-02-2019,31-12-9999,Y,1
103,bhargav,guntur,03-03-2020,31-12-9999,Y,1

Target:
Ex:
skey,eid,ename,eaddress,startdate,enddate,currentflag,version number
1,101,siva,nellore,01-01-2017,03-01-2018,N,1
2,101,siva,bangalore,04-01-2018,31-12-2018,N,2
3,101,siva,pune,01-01-2019,31-12-9999,Y,3
4,102,madhav,prakasam,02-02-2019,31-12-9999,Y,1
5,103,bhargav,guntur,03-03-2020,31-12-9999,Y,1
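The following is a minimal sketch of the SCD2 bookkeeping shown above (plain Python; the column and function names are illustrative assumptions, not DataStage syntax): the current version of a changed record is expired and a new current version is inserted with the next surrogate key.

# Minimal sketch of SCD2 bookkeeping: expire the current version, insert a new one.
from datetime import date

HIGH_DATE = date(9999, 12, 31)       # default end date, as in the example above

def apply_scd2(dim_rows, change, next_skey, today):
    """Expire the current version for the changed key and add a new current version."""
    new_version = 1
    for row in dim_rows:
        if row["eid"] == change["eid"] and row["currentflag"] == "Y":
            row["enddate"] = today                   # close the old version (the example above uses the previous day)
            row["currentflag"] = "N"
            new_version = row["version"] + 1
    dim_rows.append({
        "skey": next_skey,                           # surrogate key, not the business key
        "eid": change["eid"],
        "eaddress": change["eaddress"],
        "startdate": today,
        "enddate": HIGH_DATE,
        "currentflag": "Y",
        "version": new_version,
    })
    return dim_rows

dim = [{"skey": 1, "eid": 101, "eaddress": "nellore", "startdate": date(2017, 1, 1),
        "enddate": HIGH_DATE, "currentflag": "Y", "version": 1}]
print(apply_scd2(dim, {"eid": 101, "eaddress": "bangalore"}, next_skey=2, today=date(2018, 1, 4)))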

SCD3: maintains the current data plus one previous value; the history is kept column-wise.

ex: eid,ename,eaddress
101,siva,nellore
102,madhav,prakasam
103,bhargav,guntur

Target:

eid,ename,current address,previous address
101,siva,bangalore,nellore
101,siva,pune,bangalore
2. It supports 2 input links and 1 output link, but no reject link.
3. The first link is called the before link, the second link the after link.
4. Options ---> Change mode ---> 3 types:
A) All keys & explicit values: the key columns are taken care of by DataStage; we need to specify the change-value columns.
B) Explicit keys & explicit values: we need to provide both the key columns and the change-value columns.
C) Explicit keys & all values: we provide the key columns, and the value columns that change are taken care of by the system.
5. The change code values are: 0 ---> direct copy from before to after, 1 ---> insert (new data), 2 ---> delete, 3 ---> update/edit. We can change these values through the options Copy code, Delete code, Insert code and Edit code (see the sketch below).
6. We can also drop outputs, e.g. Drop output for copy / Drop output for delete = False by default; change it to True to drop those records.
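As a rough illustration of how the change code values above can be derived by comparing a before and an after dataset on the key column (plain Python with hypothetical sample values, not the CDC stage itself):

# Deriving change codes (0 copy, 1 insert, 2 delete, 3 edit) by comparing datasets.
before = {101: "nellore", 102: "prakasam", 103: "guntur"}      # eid -> eaddress (before link)
after  = {101: "bangalore", 102: "prakasam", 104: "hyderabad"} # eid -> eaddress (after link)

COPY, INSERT, DELETE, EDIT = 0, 1, 2, 3

changes = []
for eid in sorted(before.keys() | after.keys()):
    if eid not in before:
        changes.append((eid, INSERT))          # present only in the after data
    elif eid not in after:
        changes.append((eid, DELETE))          # present only in the before data
    elif before[eid] != after[eid]:
        changes.append((eid, EDIT))            # key matches but a value column changed
    else:
        changes.append((eid, COPY))            # identical in before and after
print(changes)                                 # [(101, 3), (102, 0), (103, 2), (104, 1)]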

Transformer Stage:

1. The heart and soul of DataStage transformations.
2. It is compiled with a C++ compiler.
3. It supports only 1 input link, any number of output links, and 1 optional reject link.
4. It executes in parallel mode.
5. The Transformer stage provides a lot of functionality:
A) Stage variables: intermediate processing variables, useful for calculations but not sent to the target columns.
B) Loop variables: used for iteration over a row.
C) Constraints: table/file-level filtering of the data, like the where clause of a Filter stage.
D) Derivations: column-level expressions applied to the output columns (similar in spirit to a Switch stage).
Q) What is the execution flow of the Transformer stage?
Ans: Stage variables ---> Loop variables ---> Constraints ---> Derivations.
6. We can generate surrogate key values from the Transformer stage (see the sketch below).
7. We have system variables that provide runtime information. Example: a 2-node configuration file and 10 input records:
A) @INROWNUM: generates sequence numbers per node for the source data; output: 1 to 5 on one node and 1 to 5 on the other.
B) @OUTROWNUM: generates sequence numbers per node for the target data.
C) @ITERATION: the current loop iteration.
D) @NUMPARTITIONS: 2.
E) @PARTITIONNUM: 0 and 1.
F) @TRUE: 1.
G) @FALSE: 0.
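Building on points 6 and 7 above, one commonly used transformer expression for generating unique surrogate keys across partitions is @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1. The sketch below (plain Python, illustration only) shows why it never produces a clash:

# Why @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1 gives unique keys.
num_partitions = 2                       # @NUMPARTITIONS

def surrogate_key(partition_num, in_row_num):
    """partition_num = @PARTITIONNUM (0-based), in_row_num = @INROWNUM (1-based)."""
    return partition_num + num_partitions * (in_row_num - 1) + 1

# 10 records split 5/5 across the 2 nodes -> keys 1..10 with no duplicates.
keys = sorted(surrogate_key(p, r) for p in range(num_partitions) for r in range(1, 6))
print(keys)                              # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]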

Development and Debugging Stage

Row Generator Stage:
1. It supports 1 output link, no input link, and no reject link.
2. It generates mock data for testing purposes.

Column Generator Stage:
1. It supports 1 input link, 1 output link, and no reject link.
2. The Column Generator stage adds columns to the incoming data and generates mock data for those columns on each row, for testing purposes.

Peek Stage:
1. It supports 1 input link and any number of output links.
2. It writes the data to the job log or to a file, e.g. eid:101 ename:pravachan sal:2000.

Head Stage:
1. It supports 1 input link and 1 output link; no reject link.
2. If you want to fetch the top N records for testing purposes, use the Head stage.

Tail Stage:
1. It supports 1 input link and 1 output link; no reject link.
2. If you want the bottom N records for testing, use the Tail stage.

Sample Stage:
1. It supports 1 input link; any number of output links in percent mode, and 1 output link in period mode.
2. It makes it easy to debug a sample of the data as per the requirement.

What is the use of a Sequence Job?
To run multiple jobs (parallel jobs/server jobs) in a defined order, with conditions or trigger conditions between them.

Each component (stage) in a sequence is called an activity; the links are called triggers.

Different types of activities:
1. Job activity: calls and runs our parallel jobs or server jobs.
2. Execute Command activity: executes unix commands or shell scripts.
3. User Variables activity: defines parameters as per user requirements so they can be used anywhere in the job sequence.
4. Wait For File activity: waits until the file arrives at the path you have given.
5. Exception Handler: whenever a job fails, an exception is raised automatically and this branch runs, e.g. job activity1 (fail) ---> job activity2 ---> job activity3 ---> main activity.
6. Terminator activity: once an exception is raised, it immediately stops the job execution.
7. Sequencer activity: synchronizes the control flow of multiple job activities in a job sequence. It accepts multiple input links and multiple output links; the sequencer mode is All or Any. Example: job1, job2, job3 ---> Sequencer ---> Execute Command.
8. Routine activity: passes parameters to and executes routines/scripts.
9. Nested Condition activity: branches depending on a condition.
10. Start Loop activity: list loop or numeric loop.
11. End Loop activity: ends the loop started by the Start Loop activity.
12. Mail (Notification) activity: sends an email notification.