Amazon S3 as a source

 

Users can add an S3 bucket as a source in DvSum and gain insight into their data by cataloging, profiling and executing rules. Let's get started.

Note:
To add S3 as a source, the following conditions must be met.
a) The SAWS version must be 2.4.0 or above
b) The 'S3' source type must be enabled (explained below)
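
The version condition in the note above can be checked with a small helper. This is only a minimal sketch: the function names are hypothetical, and it assumes the SAWS version is available as a dotted string such as "2.4.0". It does not reflect how DvSum itself performs the check.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '2.4.0' into a tuple of ints."""
    parts = [int(part) for part in text.strip().split(".")]
    # Pad short versions ('2.4' -> 2.4.0) so tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


# Minimum SAWS version stated in the note above.
MIN_SAWS_VERSION = parse_version("2.4.0")


def saws_supports_s3(installed: str) -> bool:
    """Return True when the installed SAWS version is 2.4.0 or above."""
    return parse_version(installed) >= MIN_SAWS_VERSION
```

Comparing integer tuples rather than raw strings avoids the mistake of treating "2.10.0" as older than "2.4.0".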

 

Enabling S3 Source in DvSum 

Step 1  Open the DvSum application, select the 'Administration' tab and click the 'Manage sources' option. Click 'Add Source' and select 'S3 source'. If the S3 source is not enabled, the following error messages are displayed for the 'Owner' and 'Admin' users.

Note: Only the owner is authorized to add a source. 

mceclip0.png

Owner:
mceclip1.png


Admin:

mceclip2.png
Step 1.2  The owner clicks the 'Manage account' link and is redirected to the page where the source can be enabled. An admin, on the other hand, must ask the owner to enable the source for the account. Click the 'SAWS' tab, select the SAWS and click the 'Enable source' button.

3.2.jpg

Step 1.3  From the list of available sources, select S3 and click the 'Upgrade' button as shown below.

mceclip4.png

 

Step 1.4  After you return to the SAWS tab, processing takes some time. The S3 icon then appears in the enabled sources column, which means the source has been enabled successfully, as shown below.

mceclip5.png

 

Scenario 1: SAWS Error

If there is an issue with SAWS during the upgrade, the error message "Please check if your SAWS is working correctly" is displayed.

Scenario 2: Pending State

If any jobs are running during the upgrade, the upgrade goes into the 'pending' state.

Adding S3 source in DvSum tool 

Step 2.1  Open the DvSum application, select 'Administration' and click the 'Manage sources' option. Click 'Add Source' and select 'S3 source' as shown below.

mceclip6.png

Step 2.2  In the Basic information section, provide the source name and description, and select the web service on which the S3 source is enabled. The other fields are optional, as shown below.

2.2.jpg

S3 source configuration on AWS

Step 2.3  
To obtain the AWS access key and secret key, the user needs to perform some configuration on AWS. For step-by-step details, click here

Step 2.4 If the user already exists, or has configured S3 for a new user profile, and has the access and secret keys, enter them and click the 'Authenticate' button as shown below.

2.4.jpg

Scenario 1: Selecting from already existing Buckets, Database, Crawler data

Step 2.5 Once the keys are authenticated, the S3 configuration fields appear as below. The drop-downs are already populated with existing data. Select the region, source bucket, staging bucket, Glue database and crawler from the drop-downs. Status logs are shown as each field is filled in. Click the 'Save' button and S3 will be added as a source.

2.5.jpg

Scenario 2: Creating Buckets, Database, Crawler from scratch 

Step 2.6 Source Bucket 
Click the 'Create' button. AWS has naming standards that must be followed. In the application, a placeholder suggests a bucket name that meets the AWS standards.

2.6.jpg

Step 2.6.1 The user can enter any name. The application does not restrict it, but it displays a message if the name does not follow the standard.

2.6.1.jpg
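
As an illustration of the kind of check behind that message, the sketch below tests a name against a subset of the general-purpose S3 bucket naming rules: length of 3 to 63 characters, only lowercase letters, digits, dots and hyphens, a letter or digit at both ends, no adjacent dots, and not formatted like an IP address. The function name and message wording are hypothetical, and the rule set DvSum actually applies is not documented here.

```python
import re
from typing import List

# Lowercase letters, digits, dots and hyphens; a letter or digit at both ends.
_ALLOWED = re.compile(r"[a-z0-9][a-z0-9.-]*[a-z0-9]")
# Four dot-separated groups of digits, e.g. 192.168.1.1.
_IP_LIKE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def bucket_name_problems(name: str) -> List[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append("length must be between 3 and 63 characters")
    if not _ALLOWED.fullmatch(name):
        problems.append(
            "use only lowercase letters, digits, dots and hyphens, "
            "starting and ending with a letter or digit"
        )
    if ".." in name:
        problems.append("must not contain two adjacent dots")
    if _IP_LIKE.fullmatch(name):
        problems.append("must not be formatted like an IP address")
    return problems
```

Returning messages instead of raising an error mirrors the behavior described in this step: the name is not blocked, and the user is simply told what does not follow the standard.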

Step 2.6.2 The user is notified if the bucket name already exists.

2.6.2.jpg

Step 2.6.3 Click the 'Save' button. The created bucket also appears in the drop-down options for the staging bucket. However, because the staging bucket cannot be the same as the source bucket, it appears as disabled.

2.6.3.jpg

Step 2.7 Staging bucket 
Click the 'Create' button. As above, enter a name that follows the AWS naming standards and click the 'Save' button.

2.7.jpg

Step 2.8 Folder in Source bucket 
To access a particular folder in the source bucket, the user can provide its name. If no folder name is specified, all folders in the source bucket are accessible.
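
In S3, a folder is a key prefix, so the folder rule above can be pictured as a prefix filter over object keys. The sketch below is illustrative only: the function name and sample keys are hypothetical, and it does not describe how DvSum applies the folder setting internally.

```python
from typing import Iterable, List


def keys_in_scope(keys: Iterable[str], folder: str = "") -> List[str]:
    """Return the object keys that fall under the given folder.

    An empty folder name means every key in the bucket is in scope.
    """
    folder = folder.strip("/")
    if not folder:
        return list(keys)
    # The trailing slash stops 'sales' from also matching 'sales-archive/...'.
    prefix = folder + "/"
    return [key for key in keys if key.startswith(prefix)]
```

For example, with the keys "sales/2023.csv", "sales-archive/old.csv" and "hr/staff.csv", the folder "sales" keeps only "sales/2023.csv", while an empty folder name keeps all three.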

Step 2.9 Database 
Click the 'Create' button. As above, enter a name that follows the AWS naming standards, provide a description and click the 'Save' button.

2.9.jpg

Step 3.0 Crawler 
Because the database is newly created, the user has to create a crawler for it as well. As above, click 'Create', provide a name and description, add any patterns you wish to exclude and click 'Save'.

3.jpg

3.1.jpg
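
To give a feel for what an exclude pattern does, the sketch below filters a list of object keys with glob-style patterns using Python's fnmatch module. This is only an approximation under stated assumptions: AWS Glue uses its own glob syntax (for example, a single * in Glue does not cross folder boundaries, whereas in fnmatch it does), and the function name and sample patterns are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def apply_excludes(keys: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the keys that match none of the exclude patterns.

    Approximation only: fnmatch's '*' also matches '/', unlike Glue's '*'.
    """
    patterns = list(patterns)
    return [
        key
        for key in keys
        if not any(fnmatchcase(key, pattern) for pattern in patterns)
    ]
```

With the keys "data/a.csv", "data/a.tmp" and "logs/x.log" and the patterns "*.tmp" and "logs/*", only "data/a.csv" remains in scope.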

Step 4  Confirm the fields and click the 'Save' button. S3 appears as an enabled source on the Manage sources page. The Glue database name is displayed in the schema name field. The user can now catalog the source, profile it, execute rules on it and export it.

4.jpg
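
The fields confirmed in this step can be thought of as a single configuration record. The sketch below models that record and checks the one constraint this article states explicitly, that the staging bucket cannot be the same as the source bucket. The class and field names are hypothetical, and treating every field except the folder as required is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class S3SourceConfig:
    region: str
    source_bucket: str
    staging_bucket: str
    glue_database: str
    crawler: str
    folder: str = ""  # optional; empty means every folder in the source bucket


def config_problems(cfg: S3SourceConfig) -> List[str]:
    """Return a list of problems; an empty list means the record looks complete."""
    problems = []
    # Assumption: all fields other than 'folder' must be filled in.
    required = ("region", "source_bucket", "staging_bucket", "glue_database", "crawler")
    for field_name in required:
        if not getattr(cfg, field_name).strip():
            problems.append(f"{field_name} is required")
    # Stated in this article: staging and source buckets must differ.
    if cfg.source_bucket and cfg.source_bucket == cfg.staging_bucket:
        problems.append("staging bucket must differ from the source bucket")
    return problems
```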

Note: 
Before cataloging, make sure that a CSV file has been uploaded to the S3 source bucket on AWS.
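
As a rough illustration of this precondition, the helper below checks whether a listing of object keys contains at least one CSV file. The function name is hypothetical, it assumes the bucket listing has already been obtained as a list of key strings, and it judges files by extension only.

```python
from typing import Iterable


def has_csv_file(keys: Iterable[str]) -> bool:
    """Return True when at least one key ends with a .csv extension."""
    return any(key.lower().endswith(".csv") for key in keys)
```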

Step 5   If no file has been uploaded yet, the user can create a folder or upload the file directly to the source bucket. 
mceclip7.png

mceclip8.png

Open the created folder and upload the file.

mceclip9.png

mceclip10.png

Step 6  Schedule Cataloging 
Unlike other data sources, S3 cannot be cataloged directly. Instead, the user schedules a catalog job using the same existing scheduling flow. Once the scheduled catalog job has run, the source shows a cataloged status on the Manage sources page.

mceclip0.png

Step 7  Schedule Profiling
Unlike other data sources, S3 cannot be profiled directly. Instead, the user schedules a profiling job using the same existing scheduling flow. Once the scheduled profiling job has run, the source appears as profiled on the profiling page.

mceclip1.png

Step 8  Writeback on S3 source 

For S3, write-back to the source is not allowed. The user can only view the records as exceptions and export them to Excel. The following UI changes apply because write-back is not allowed:

  • The Write-back parameters section in Edit configuration is disabled. 
    mceclip2.png
  • The Write-back column check box in the Field configuration tab is disabled. 

    mceclip3.png

  • Rules created on an S3 source table cannot be cleansed. The Cleanse button is disabled; however, the user can export these rules.
    mceclip4.png

 

 

 

 
