What is Informatica and Enterprise Data Warehouse?
Informatica is an ETL tool which is used to extract, transform and load data. It plays crucial role in building Enterprise Data Warehouse.
EDW provides user a global view on historical data stored from various department in an organization like finance, sales, labor, marketing etc based on which critical business decisions could be made.
What is repository service, integration service, reporting service, domain and node?
There are basically three types of services which you will find under application services:
Repository service - It is responsible for storing Informatica metadata (source definition, target definition, mappings, mapplets and transformations in the form of tables) and providing access to these repositories for other services. It is also responsible for maintaining connection between client development tool and Informatica repository which keeps metadata up-to-date by updating, inserting and deleting the metadata.
Integration Service - It is responsible for execution of workflow and end-to-end data movement from source to target. It pulls the workflow information from repository, starts the execution of tasks, gathers/combines data from different sources (for example relational database and flat file) and loads into single or multiple targets.
Reporting Service - It enables report generation.
Domain - It is administrative part of Informatica which is used by admins to operate nodes and services. You can define multiple nodes in a domain with a gateway node which is responsible for receiving request and distributing it to worker nodes. Further, nodes are responsible for running services and other Informatica processes. There are basically two types of services in a domain - Service manager and Application services.
Service manager is responsible for logging and login operations. Like authentication, authorization, managing people and groups.
Application services are responsible for managing integration services, repository services and reporting services.
Key properties of domain are as follows:
Database properties - You can define database instance name and port responsible for holding domain. It consist of database type (like Oracle, SQL server etc), host, port, database name and user name.
General Properties - You can define resilience timeout, restart period, restart attempts, dispatch mode etc. For example if services goes down how many seconds application services wait to again connect to respective service depends upon resilience timeout, how long Informatica can try restarting those services depends upon restart period and attempts. How tasks will be distributed to worker nodes from gateway will depend upon dispatch mode like round robin.
What is PowerCenter repository?
It is like a relational database which stores Informatica metadata in the form of tables (underlying database could be Oracle database or SQL server or similar) and it is managed by repository services.
What is Informatica client tool?
Informatica client tool is basically developer tool installed on client machine and it consist of four parts:
Designer (Source, target, mapping, mapplet and transformations designer)
Workflow manager (Task, worklet and workflow designer)
Basic terminology consist of:
Workflow, Worklet, Sessions & Tasks: Workflow consists of one or more session, worklet and task (includes timer, decision, command, event wait, mail, link, assignment, control etc) connected in parallel or sequence. You can run these sessions by using session manager or pmcmd command. Further, you can write pre-post-session commands.
Mappings, Mapplets & Transformations: Mappings are collection of source, target and transformations. Mapplets (designed in mapplet designer) are like re-usable mappings which contains various transformations but no source/target. You can run the mapping in debug mode without creating a session. There are mapping parameters and variables as well. Mapping parameters represent constant values that are defined before running a session while mapping variables can change values during sessions.
What could be the various states of object in Informatica?
Valid - fully syntactically correct object.
Invalid - where syntax and properties are invalid according to Informatica standards. Informatica marks those objects invalid. It could be mapping, mapplet, task, session, workflow or any transformation.
Impacted - where underlying object is invalid. For instance in a mapping suppose underlying transformation has become invalid due to some change.
What happens when user executes the workflow?
User executes workflow
Informatica invokes integration service to pull workflow details from repository
Integration service starts execution of workflow after gathering workflow metadata
Integration service runs all child tasks
Reads and combine data from sources and loads into target
After execution, it updates the status of task as succeeded, failed, unknown or aborted
Workflow and session log is generated
What is the difference between ABORT and STOP?
Abort will kill process after 60 seconds even if data processing hasn't finished. Session will be forcefully terminated and DTM (Data Transformation Manager) process will be killed.
Stop will end the process once DTM processing has finished processing. Although it stops reading data from source as soon as stop command is received.
What are the various types of transformation available in Informatica?
There are basically two categories of transformation in Informatica.
Active - It can change the number of rows that pass through transformation for example filter, sorter transformations. It can also change the row type for example update strategy transformation which can mark row for update, insert, reject or delete. Further it can also change the transaction boundary for example transaction control transformation which can allow commit and rollback for each row based on certain expression evaluation.
Passive - Maintains same number of rows, no change in row type and transaction boundary. Number of output rows will remain always same as number of input rows.
Active transformation includes:
Source Qualifier: Connected to relational source or flat file, converts source data type to Informatica native data types. Performs joins, filter, sort, distinct and you can write custom SQL. You can have multiple source qualifier in single session with multiple targets, in this case you can decide target load order as well.
Joiner: It can join two heterogeneous sources unlike source qualifier which needs common source. It performs normal join, master outer join, detail outer join and full outer join.
Filter: It has single condition - drops records based on filter criteria like SQL where clause, single input and single output.
Router: It has input, output and default group - acts like filter but you can apply multiple conditions for each group like SQL case statement, single input multiple output, more easier and efficient to work as compared to filter.
Aggregator: It performs calculations such as sums, averages etc. It is unlike expression transformation in which one can do calculations in groups. It also provides extra cache files to store transformation values if required.
Sorter: Sorts the data. It has distinct option which can filter duplicate rows.
Lookup: It has input, output, lookup and return port. Explained in next question.
Union: It works like union all SQL.
Update Strategy: Treats source row as "data driven" - DD_UPDATE, DD_DELETE, DD_INSERT and DD_REJECT.
Normalizer: Takes multiple columns and returns few based on normalization.
Passive transformation includes:
Expression: It is used to calculate in single row before writing on the target, basically non-aggregate calculations.
Sequence Generator: For generating primary keys, surrogate keys (NEXTVAL & CURRVAL).
Explain Lookup transformation in detail?
Lookup transformation is used to lookup a flat file, relational database table or view. It has basically four ports - input (I), output (O), lookup (L) and return port (R). Lookup transformations can be connected or unconnected and could act as active/passive transformation.
Connected lookup: It can return multiple output values.
It can have both dynamic and static cache.
It can return more than one column value in output port.
It caches all lookup columns.
Unconnected lookup: It can take multiple input parameters like column1, column2, column3 .. for lookup but output will be just one value.
It has static cache.
It can return only one column value in output port.
It caches lookup condition and lookup output port in return port.
Lookup cache can be made cached or no cache. Cached lookup can be static or dynamic. Dynamic cache can basically change during the execution of process, for example if lookup data itself changes during transaction (NewLookupRow).
Further, these cache can be made persistent or non-persistent i.e. it tells Informatica to keep lookup cache data or delete it after completion of session. Here is types of cache:
Static - static in nature.
Dynamic - dynamic in nature.
Persistent - keep or delete the lookup cache.
Re-cache - makes sure cache is refreshed if underlying table data changes.
Shared cache - can be used by other mappings.
What are the various types of files created during Informatica session run?
Session log file - depending upon tracing level (none, normal, terse, verbose, verbose data) it records SQL commands, reader-writer thread, errors, load summary etc. Note: You can change the number of session log files saved (default is zero) for historic runs.
Workflow log file - includes load statistics like number of rows for source/target, table names etc.
Informatica server log - created at home directory on Unix box with all status and error messages.
Cache file - index or data cache files; for example aggregate cache, lookup cache etc.
Output file - based on session if it's creating output data file.
Bad/Reject file - contains rejected rows which were not written to target.
None: Applicable only at session level. The Integration Service uses the tracing levels configured in the mapping.
Terse: logs initialization information, error messages, and notification of rejected data in the session log file.
Normal: Integration Service logs initialization and status information, errors encountered and skipped rows due to transformation row errors. Summarizes session results, but not at the level of individual rows.
Verbose Initialization: In addition to normal tracing, the Integration Service logs additional initialization details; names of index and data files used, and detailed transformation statistics.
Verbose Data: In addition to verbose initialization tracing, the Integration Service logs each row that passes into the mapping. Also notes where the Integration Service truncates string data to fit the precision of a column and provides detailed transformation statistics. When you configure the tracing level to verbose data, the Integration Service writes row data for all rows in a block when it processes a transformation.
What is SQL override in Informatica?
SQL override can be implemented at Source Qualifier, Lookups and Target. It allows you to override the existing SQL in mentioned transformations to
Limit the incoming rows
Escape un-necessary sorting of data (order by multiple columns) to improve performance
Use parameters and variables
Add WHERE clause
What is parallel processing in Informatica?
You can implement parallel processing in Informatica by Partitioning sessions. Partitioning allows you to split large amount of data into smaller sets which can be processed in parallel. It uses hardware to its maximum efficiency. These are the types of Partitioning in Informatica:
Database Partitioning: Integration service checks if source table has partitions for parallel read.
Round-robin Partitioning: Integration service evenly distributes the rows into partitions for parallel processing. Performs well when you don't want to group data.
Hash Auto-keys Partitioning: Distributes data based on hash function among partitions. It uses all sorted ports to create partition key. Perform well before sorter, rank and unsorted aggregator.
Hash User-keys Partitioning: Distributes data based on hash function among partitions, but here user can choose the port which will act as partition key.
Key Range Partitioning: User can choose multiple ports which will act as compound partition key.
Pass-through Partitioning: Rows are passed to next partition without redistribution like what we saw in round robin.
What can be done to improve performance in Informatica?
See the bottlenecks by running different tracing levels (verbose data).
Tune the transformations like lookups (caching), source qualifiers, aggregator (use sorted input), filtering un-necessary data, degree of parallelism in workflow etc.
Dropping the index before data load and re-indexing (for example using command task at session level) after data load is complete.
Change session properties like commit interval, buffer block size, session level Partitioning (explained above), reduce logging details, DTM buffer cache size and auto-memory attributes.
OS level tuning like disk I/O & CPU consumption.