Configure Amazon S3 as a Source

Overview

This article describes the steps needed to configure Amazon Simple Storage Service (S3) as a source in DvSum Data Quality (DQ). The same steps apply to configure a source in DvSum Data Catalog (DC) with only a slight variation. 

Supported file types:

  1. CSV (Comma Separated Values)
  2. Parquet
  3. Avro
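Since only these three formats are cataloged, it can be useful to screen object keys by extension before uploading. A minimal sketch (the helper and mapping are illustrative, not part of the DvSum API):

```python
from typing import Optional

# Map file extensions to the source formats this article lists as supported.
SUPPORTED_EXTENSIONS = {
    ".csv": "CSV",
    ".parquet": "Parquet",
    ".avro": "Avro",
}

def detect_file_type(key: str) -> Optional[str]:
    """Return the supported format for an S3 object key, or None if unsupported."""
    for ext, fmt in SUPPORTED_EXTENSIONS.items():
        if key.lower().endswith(ext):
            return fmt
    return None
```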

Detailed Steps

Add an S3 Source

Navigate to Administration → Manage sources.

Click the 'Add Source' button and select S3.

Configure Source

 Basic Information

In the Basic Information section, provide the source name, description, and web service. The other fields are optional.


Credentials

  • AWS Access Key: enter your AWS Access Key ID
  • AWS Secret Key: enter your AWS Secret Access Key

Refer to S3 Source Configuration on AWS for details on the permissions required by this AWS User.
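Before submitting, a quick client-side sanity check of the credential formats can catch copy-paste mistakes. Long-term AWS access key IDs are typically 20 characters beginning with "AKIA", and secret keys are 40 base64-style characters; the sketch below is a pre-flight check only, not validation against AWS itself:

```python
import re

# Rough format checks for long-term AWS credentials. These patterns reflect
# common conventions only; they do not verify the keys with AWS.
ACCESS_KEY_RE = re.compile(r"^AKIA[0-9A-Z]{16}$")
SECRET_KEY_RE = re.compile(r"^[A-Za-z0-9/+=]{40}$")

def looks_like_credentials(access_key: str, secret_key: str) -> bool:
    """Return True if both values match the typical long-term key formats."""
    return bool(ACCESS_KEY_RE.match(access_key) and SECRET_KEY_RE.match(secret_key))
```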


 

Scenario 1: Select existing S3 Buckets, Glue Database, and Glue Crawler

After authenticating, the "S3 Configurations" fields are displayed. Select the appropriate values from the drop-down lists, then click the "Save" button.


 

Scenario 2: Create S3 Buckets, Glue Database, and Glue Crawler

S3 Source Bucket
Click the S3 Source Bucket "+ Create" button. DvSum proposes "dvsum-s3-source-" as an optional prefix to help make the new S3 bucket name unique.

Define any valid bucket name.


Click the 'Save' button.

The bucket will be created and appear in the drop-down list.
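Before clicking Save, the proposed name can be checked against the core S3 bucket-naming rules: 3-63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit; and not formatted like an IP address. A condensed sketch of those rules (not the full AWS rule set):

```python
import re

# Core S3 bucket-naming rules, condensed: 3-63 chars of lowercase letters,
# digits, dots, and hyphens; must begin and end with a letter or digit;
# must not look like an IP address.
NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")
IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the condensed S3 naming checks."""
    return bool(NAME_RE.match(name)) and not IP_RE.match(name)
```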

 

S3 Staging Bucket

Click the S3 Staging Bucket "+ Create" button. DvSum proposes "dvsum-s3-source-" as an optional prefix to help make the new S3 bucket name unique.

Define any valid bucket name. The Source Bucket and Staging Bucket must be distinct.


Click the 'Save' button.

The bucket will be created and appear in the drop-down list.

 

Folders

Folders may optionally be set. If a folder is set for the S3 Source Bucket, then only files in that folder will be cataloged. If a folder is set for the S3 Staging Bucket, then temporary staging files will be created in this folder.
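The folder setting behaves like an S3 key prefix: only objects whose keys start with the folder path are considered. A small illustration of that filtering (the helper name is hypothetical):

```python
def keys_in_folder(keys, folder):
    """Return only the object keys under the given folder prefix.

    Mirrors how an S3 "folder" acts as a key prefix: an empty folder
    setting matches every key in the bucket.
    """
    prefix = folder.rstrip("/") + "/" if folder else ""
    return [k for k in keys if k.startswith(prefix)]
```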

 

Glue Database

Click the Glue Database "+ Create" button.

Define a database name following AWS Glue Database naming requirements, and provide a description.

Click the 'Save' button.
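AWS stores Glue database names in lowercase, and downstream tools such as Athena are simplest to work with when names use only lowercase letters, digits, and underscores (up to 255 characters). The sketch below enforces that stricter-than-required convention; it is not the full AWS rule set:

```python
import re

# A conservative check for Glue database names: lowercase letters, digits,
# and underscores, 1-255 characters. Stricter than AWS requires, but safe
# for Athena compatibility.
GLUE_NAME_RE = re.compile(r"^[a-z0-9_]{1,255}$")

def is_safe_glue_name(name: str) -> bool:
    """Return True if the name follows the conservative Glue naming convention."""
    return bool(GLUE_NAME_RE.match(name))
```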


 

Glue Crawler

Click the Glue Crawler "+ Create" button.

Define a crawler name following the AWS Glue Crawler naming requirements, and provide a description.

Click the 'Save' button.

 

Save

Review all of the values. Click the 'Save' button.


 

NOTE: If the action fails, verify the following:

  • Certain special characters are not supported in column names. Examples of unsupported characters:
      1. Commas in the column name (e.g. A,B)
      2. Control characters (e.g. U+0000)
      3. Text in non-UTF-8 encodings (e.g. UTF-32)
      4. Emoji and other supplementary-plane characters (e.g. U+1F937)
  • Column data must not contain embedded carriage returns; each value must appear on a single line.
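The checks above can be sketched as a simple pre-flight scan over a file's header row and data values. The helper below is illustrative, not part of the DvSum product; it flags commas and control characters in column names, and embedded line breaks in data:

```python
def find_column_issues(columns, rows):
    """Scan column names and data for characters known to break cataloging.

    Flags commas and control characters (e.g. U+0000) in column names,
    and carriage returns / newlines embedded in data values.
    """
    issues = []
    for name in columns:
        if "," in name:
            issues.append(f"column name contains a comma: {name!r}")
        if any(ord(ch) < 0x20 for ch in name):
            issues.append(f"column name contains a control character: {name!r}")
    for i, row in enumerate(rows):
        for name, value in zip(columns, row):
            if "\r" in value or "\n" in value:
                issues.append(f"row {i}, column {name!r}: embedded line break")
    return issues
```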

Use the New Source

On-demand cataloging

Your new source will be visible on the "Manage Sources" page.

If you created a new S3 bucket, then your next step should be to upload data files to the bucket.

If you selected an existing S3 bucket containing data files, select your source and choose Run Cataloging → Run Offline.


 

Scheduled Cataloging

Click "Schedule Cataloging" to create a regular schedule for the crawler to search for new files.
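Under the hood, Glue crawler schedules are six-field cron expressions (minutes, hours, day-of-month, month, day-of-week, year), where "?" means "no specific value". A small helper for building a once-a-day schedule (the function itself is hypothetical, not a DvSum or AWS API):

```python
def daily_crawl_schedule(hour_utc: int, minute: int = 0) -> str:
    """Build a Glue-style cron expression for a once-a-day crawler run.

    Glue schedules use six fields: minutes, hours, day-of-month, month,
    day-of-week, year; "?" stands for "no specific value".
    """
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"cron({minute} {hour_utc} * * ? *)"
```

For example, a daily 2:00 AM UTC run would be expressed as `cron(0 2 * * ? *)`.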


 

Scheduled Profiling

Navigate to Profile → Profiling.

Select your data source, select "ALL" tables, then click the "Select" button to list the tables.


Select tables then click "Schedule Profiler" to set a profiling schedule.

 

Next Steps

After your S3 data source is cataloged and profiled, you're ready to begin creating Data Quality rules.
