Users can add an S3 bucket as a source in DvSum and gain insight into their data by cataloging it, profiling it, and executing rules. Let's get started.
To add S3 as a source, the following conditions must be met:
a) SAWS version must be 2.4.0 or above
b) S3 must be enabled as a source type (explained below)
Enabling S3 Source in DvSum
Step 1 Open the DvSum application, select the Administration tab, click the 'Manage sources' option, then click 'Add Source' and select 'S3 source'. If the S3 source is not enabled, the following error messages will be displayed for the 'Owner' and 'Admin' users.
Note: Only the owner is authorized to add a source.
Step 1.2 The owner clicks the 'Manage account' link and is redirected to the page from which the source can be enabled. An admin, on the other hand, must request the owner to enable the source for the account. Click the 'Saws' tab, select the SAWS, and click the 'Enable source' button;
Step 1.3 From the list of available sources, select S3 and click the 'Upgrade' button as shown below;
Step 1.4 On returning to the SAWS tab, processing will take some time; afterwards, the S3 icon will appear in the 'Enabled sources' column, which means the source has been enabled successfully, as shown below.
Scenario 1 : SAWS Error
On upgrading, if there is any issue with SAWS, the error message "Please check if your SAWS is working correctly" will be displayed.
Scenario 2: Pending State
On upgrading, if any jobs are running, the upgrade will go into a 'Pending' state.
Adding S3 source in DvSum tool
Step 2.1 Open the DvSum application, select 'Administration', click the 'Manage sources' option, then click 'Add Source' and select the S3 source as shown below;
Step 2.2 In the Basic information section, provide the source name and description, and select the web service on which the S3 source is enabled; the other fields are optional, as shown below;
S3 source configuration on AWS
In order to obtain the AWS secret and access keys, the user needs to perform some configuration. For step-by-step details, click here.
Step 2.4 If the user already exists, or S3 has been configured for a new user profile and the access/secret keys are available, enter them and click the 'Authenticate' button as shown below;
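Before hitting 'Authenticate', it can be useful to sanity-check that the pasted keys at least have the expected shape. The sketch below is a heuristic only (it is not part of DvSum, and the patterns reflect the formats AWS has historically used, not a guaranteed contract); passing it does not mean the keys are valid — the 'Authenticate' step performs the real check against AWS.

```python
import re

def looks_like_access_key_id(key_id: str) -> bool:
    # Long-term access key IDs historically start with AKIA (ASIA for
    # temporary credentials) and are 20 uppercase alphanumeric characters.
    return re.fullmatch(r"(AKIA|ASIA)[A-Z0-9]{16}", key_id) is not None

def looks_like_secret_key(secret: str) -> bool:
    # Secret access keys are typically 40 characters of base64-style text.
    return re.fullmatch(r"[A-Za-z0-9/+=]{40}", secret) is not None
```

A failed shape check most often means a truncated or partially copied key.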
Scenario 1: Selecting from already existing Buckets, Database, Crawler data
Step 2.5 Once the keys are authenticated, the S3 configuration fields start appearing as below, with their drop-downs already populated. Select the region, source bucket, staging bucket, Glue database, and crawler from the drop-downs. The user can also view status logs as each field is filled in. Hit the 'Save' button and S3 will be added as a source successfully.
Scenario 2: Creating Buckets, Database, Crawler from scratch
Step 2.6 Source Bucket
Click the 'Create' button. AWS has naming standards that must be followed; in our application, the user is shown a placeholder suggesting a bucket name that meets those standards.
Step 2.6.1 The user can provide any name; there is no hard restriction, but a message will notify the user if the standard is not being followed.
Step 2.6.2 The user will be notified if the bucket name already exists.
Step 2.6.3 Click the 'Save' button. The created bucket will also start appearing in the drop-down options for the staging bucket. However, since the staging bucket cannot be the same as the source bucket, it will appear there as disabled.
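The main AWS bucket-naming rules behind the notification above can be sketched as a small validator. This is an illustrative subset of the rules (length, allowed characters, no adjacent dots, not IP-shaped), not DvSum's actual implementation:

```python
import re

def valid_bucket_name(name: str) -> bool:
    """Check the core AWS S3 bucket-naming rules (a subset)."""
    # 3-63 characters long
    if not 3 <= len(name) <= 63:
        return False
    # Only lowercase letters, digits, dots, and hyphens;
    # must begin and end with a letter or digit
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name):
        return False
    # No two adjacent periods
    if ".." in name:
        return False
    # Must not be formatted like an IP address
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    return True
```

For example, `my-dvsum-source-bucket` passes, while `My_Bucket` and `192.168.1.1` do not.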
Step 2.7 Staging bucket
Click the 'Create' button. As above, give a name following AWS naming standards and hit the 'Save' button.
Step 2.8 Folder in Source bucket
To restrict access to a particular folder in the source bucket, the user can provide its name. If no folder name is specified, all folders in the source bucket will be accessible.
Step 2.9 Database
Click the 'Create' button. As above, give a name following AWS naming standards, provide a description, and hit the 'Save' button.
Step 3.0 Crawler
Since the database is newly created, the user must create a crawler for it as well. As above, click 'Create', provide a name and description, add any patterns you wish to exclude, and hit 'Save'.
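Exclude patterns tell the crawler which objects to skip. The sketch below approximates the effect with Python's shell-style globs; note that AWS Glue's actual matcher uses its own glob dialect (including `**`), so this is only an illustration of the idea, not the exact Glue behavior:

```python
from fnmatch import fnmatch

def excluded(key, patterns):
    """True if the object key matches any exclude pattern (glob-style)."""
    return any(fnmatch(key, p) for p in patterns)

def keys_to_crawl(keys, exclude_patterns):
    """Keep only the keys the crawler should actually process."""
    return [k for k in keys if not excluded(k, exclude_patterns)]
```

For example, excluding `*.tmp` and `*.bak` leaves only the CSV data files for the crawler to catalog.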
Step 4 Confirm the fields and click the 'Save' button. S3 will appear as an enabled source on the Manage sources page. The Glue database name will be displayed in the schema name field. The user will be able to catalog it, profile it, execute rules on it, and export it.
Before cataloging, make sure a CSV-format file has been uploaded to the S3 source bucket on AWS.
Step 5 If it is not uploaded, the user can create a folder, or upload a file directly into the source bucket. Open the created folder and upload the file.
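Note that S3 has no real folders; a "folder" is just a key prefix ending in `/`, so creating a folder and uploading into it simply determines the object's key. A small sketch of that key construction (the bucket and file names are hypothetical, and the commented boto3 call shows how the upload itself would typically be done outside DvSum):

```python
def object_key(filename, folder=None):
    """Build the S3 object key for an upload.

    S3 is a flat key space: a "folder" is just a prefix ending in "/".
    """
    if folder:
        return folder.rstrip("/") + "/" + filename
    return filename

# With boto3 (assumed to be installed and configured), the upload would be:
#   s3 = boto3.client("s3")
#   s3.upload_file("report.csv", "my-source-bucket",
#                  object_key("report.csv", folder="sales"))
```

So uploading `report.csv` into the `sales` folder stores it under the key `sales/report.csv`.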
Step 6 Schedule Cataloging
Unlike other data sources, S3 cannot be cataloged directly. Instead, the user schedules a catalog, following exactly the same existing scheduling flow. Once the scheduled catalog has run, the source will show a cataloged status on the Manage sources page.
Step 7 Schedule Profiling
Unlike other data sources, S3 cannot be profiled directly. Instead, the user schedules a profiling job, following exactly the same existing scheduling flow. Once the scheduled profiling has run, the source will appear as profiled on the Profiling page.
Step 8 Writeback on S3 source
For S3, write-back to the source is not allowed. The user can only view the records as exceptions and export them to Excel. The following UI changes apply when write-back is not allowed:
- The Write-back parameters section in Edit configuration will be disabled.
- The Write-back column checkbox in the Field configuration tab will be disabled.
- Rules created on an S3 source table cannot be cleansed; the 'Cleanse' button will be disabled. However, the user can export these rules.