Introduction
To bring data into the DvSum Catalog, the data source must be scanned from the database. Scanning is divided into two steps: cataloging and profiling. This article shows how a source can be cataloged and profiled. Before continuing, make sure you know how to add a data source and how a source is authenticated; a separate article shows how to add an Oracle data source.
Cataloging
During authentication, databases and schemas are added to the source, and those schemas contain the tables that hold the data. Cataloging the data source brings these schemas and their tables into the catalog. Once the source is authenticated, the "Scan Now" button appears. Clicking "Scan Now" shows two options:
- Catalog & Profile
- Catalog
When a source is scanned with the scan type "Catalog", a job with the scan type "Catalog" is scheduled.
After a few minutes the catalog step completes and the status changes from "Running" to "Completed".
Once the scan is completed, its details can be viewed by clicking the scan name. The Scan Results page shows the total number of tables and columns scanned, along with the scan type.
On the Data Dictionary tab, the newly scanned data sources can be seen. Because the source was only cataloged, the tables appear but contain no records; this can be verified from the record count, which is empty.
Profiling
Columns alone are of little use without the data behind them, so to bring a table's data into the catalog, the table must be profiled. When a table is selected, the "Run Profiling" button appears above.
Clicking "Run Profiling" profiles the selected tables (all the data present in the tables is fetched into the catalog) and schedules a job in the "Scan History" tab of the respective source.
When the scan is completed, all the selected tables have been profiled. On the Data Dictionary tab, the profiled tables now show a record count, indicating that their data is present in the catalog.
Note: The Scan Results tab does not exist for the scan type "Profile".
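The difference between cataloging and profiling can be pictured with a small Python model. This is only an illustrative sketch, not DvSum code: the `Table` class and the `catalog` and `profile` functions are hypothetical names, and the data source is stood in for by a plain dictionary with made-up tables.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Table:
    """A catalog entry for one table (hypothetical model, not a DvSum type)."""
    name: str
    columns: List[str]
    record_count: Optional[int] = None  # stays empty until the table is profiled


def catalog(source: Dict[str, dict]) -> Dict[str, Table]:
    """Cataloging: register tables and their columns, but no data."""
    return {name: Table(name, list(info["columns"])) for name, info in source.items()}


def profile(entries: Dict[str, Table], source: Dict[str, dict], selected: List[str]) -> None:
    """Profiling: fill in the record count for the selected tables only."""
    for name in selected:
        entries[name].record_count = len(source[name]["rows"])


# A made-up source with two tables.
source = {
    "CUSTOMERS": {"columns": ["ID", "NAME"], "rows": [(1, "Ann"), (2, "Bob")]},
    "ORDERS": {"columns": ["ID", "TOTAL"], "rows": [(10, 99.5)]},
}

entries = catalog(source)                 # every table appears, record counts empty
profile(entries, source, ["CUSTOMERS"])   # only CUSTOMERS gets a record count
```

In this model, `ORDERS` stays in the catalog with an empty record count until it is profiled, which mirrors what the Data Dictionary tab shows after a "Catalog" scan.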
Cataloging & Profiling
Once the source is authenticated, the "Scan Now" button offers the option "Catalog & Profile", which combines cataloging and profiling (a full scan). It brings the schemas and tables into the catalog along with the data inside them. Once the "Scan" button is clicked, a scan job of the type "Catalog & Profile" is created.
Once the scan is completed, its details can be viewed by clicking the scan name. This opens the Scan Results page, which contains all the information related to the scan, including the scan type.
On the Data Dictionary tab, all the tables are present in the catalog along with their data. Separate profiling is not required here because it was done during the scan.
Note that the scan type "Catalog & Profile" brings in all the schemas and tables of the source along with their data. If data is required for only some tables, the source must be cataloged first and profiling then applied to those specific tables.
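The choice between the two approaches can be summarized as a small decision helper. The function below is a hypothetical sketch for illustration only; it does not call DvSum and simply returns the UI steps described in this article.

```python
from typing import List, Optional


def plan_scan(tables_needing_data: Optional[List[str]]) -> List[str]:
    """Return the steps to follow.

    None means data is needed for every table in the source;
    a list names the specific tables whose data is needed.
    """
    if tables_needing_data is None:
        # Full scan: schemas, tables, and data in one job.
        return ["Scan Now > Catalog & Profile"]
    # Partial case: catalog first, then profile only what is needed.
    steps = ["Scan Now > Catalog"]
    if tables_needing_data:
        steps.append("Run Profiling on: " + ", ".join(tables_needing_data))
    return steps
```

For example, `plan_scan(["ORDERS"])` yields a "Catalog" scan followed by profiling of `ORDERS` only, where `ORDERS` is a made-up table name.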
Scheduling a Job
The scans described above are on-demand: the moment the "Scan" button is clicked, a job is created and starts running. If scans need to run daily or start from a specific date, they can be scheduled instead. On the "Settings" tab of the data source, the "General" tab contains the scanning information; clicking "Edit" allows jobs to be scheduled and the scan type to be chosen.
Note that only the scan types "Catalog & Profile" and "Catalog" can be scheduled. The scan type "Profile" cannot be scheduled for a particular time. After selecting the scan type, the scan frequency, start time, and end time can be set as required. Once the information is saved, a job is scheduled with the selected scan type.
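The scheduling rule can be expressed as a simple validation. The sketch below is hypothetical and not part of DvSum: `ScanSchedule` and its field names are invented for illustration, the frequency value is an assumed example, and the check that the end time follows the start time is an added assumption rather than documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime

# Only these scan types can be scheduled; "Profile" is on-demand only.
SCHEDULABLE_SCAN_TYPES = {"Catalog & Profile", "Catalog"}


@dataclass
class ScanSchedule:
    """Hypothetical model of the scheduling form on the "General" tab."""
    scan_type: str
    frequency: str      # e.g. "Daily"; the available values are an assumption
    start: datetime
    end: datetime

    def __post_init__(self) -> None:
        if self.scan_type not in SCHEDULABLE_SCAN_TYPES:
            raise ValueError(f'Scan type "{self.scan_type}" cannot be scheduled')
        if self.end <= self.start:
            # Assumed sanity check, not documented DvSum behavior.
            raise ValueError("End time must be after start time")


schedule = ScanSchedule("Catalog", "Daily", datetime(2024, 1, 1), datetime(2024, 2, 1))
```

Constructing `ScanSchedule("Profile", ...)` raises a `ValueError` in this model, reflecting that profiling can only be run on demand.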
Some sources in the application do not offer separate "Catalog" and "Profile" options. For these sources, the "Scan Now" button on the Data Sources page does not show a drop-down for selecting the scan type; the scan type is "Catalog & Profile" by default. These 4 sources are:
- Azure Data Lake Storage (ADLS)
- Power BI
- Tableau
- File Upload
For these data sources, the "General" tab under the "Settings" tab does not offer a choice of scan types as it does for the rest of the sources, because the scan type is always "Catalog & Profile".
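The behavior of these four sources can be summarized in a short lookup. This is a hypothetical sketch, not DvSum code: the function name is invented, and the source type strings are taken from the list above.

```python
from typing import List

# Sources that always run a full scan and show no scan type drop-down.
COMBINED_ONLY_SOURCES = {
    "Azure Data Lake Storage (ADLS)",
    "Power BI",
    "Tableau",
    "File Upload",
}


def scan_type_options(source_type: str) -> List[str]:
    """Return the scan types offered on "Scan Now" for a given source type."""
    if source_type in COMBINED_ONLY_SOURCES:
        return ["Catalog & Profile"]
    return ["Catalog & Profile", "Catalog"]
```

For example, `scan_type_options("Tableau")` returns only "Catalog & Profile", while an Oracle source would get both options.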