Introduction to AWS DynamoDB
- Amazon DynamoDB is a fully managed, serverless NoSQL key-value database designed to run high-performance applications at any scale.
- DynamoDB delivers consistent single-digit-millisecond response times at any scale, offers virtually unlimited storage, and provides reliable performance however large the workload grows.
- DynamoDB comes with built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data export tools.
Introduction to Redshift
- Amazon Redshift is a fast, powerful, fully managed, petabyte-scale data warehouse service. You can start with a few hundred gigabytes of data and later scale out to petabytes.
- Redshift is an OLAP (Online Analytical Processing) system: it supports complex analytical queries, is geared toward decision support, and returns results that are easy to interpret.
Resource Preparation
VPC
- vpc
  - CIDR block: 10.10.0.0/16
- internet gateway
- elastic ip address
- nat gateway: uses the elastic ip address as its public IP
- public subnet
  - three Availability Zones
- private subnet
  - three Availability Zones
- public route table: the route table associated with the public subnets
  - destination: 0.0.0.0/0, target: internet-gateway-id (allows communication with the internet)
  - destination: 10.10.0.0/16, target: local (internal communication)
- private route table: the route table associated with the private subnets
  - destination: 10.10.0.0/16, target: local (internal communication)
  - destination: 0.0.0.0/0, target: nat-gateway-id (allows internal resources to reach the internet)
- web server security group
  - allows any IP to access port 443
  - allows your own IP to access port 22, so you can SSH into the server and insert data into the database
- glue redshift connection security group
  - contains only a single self-referencing rule that allows the same security group to access all TCP ports
  - this security group is required when creating the Glue connection:
  - Reference: the Glue connection security group must have a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule of Type All TCP, Protocol TCP, Port Range including all ports, and whose Source is the same security group name as the Group ID.
- private redshift security group
  - allows access to port 5439 from inside the VPC (10.10.0.0/24)
  - allows the glue connection security group to access port 5439
- public redshift security group
  - allows access to port 5439 from inside the VPC (10.10.0.0/24)
  - allows the public IPs of Kinesis Data Firehose in the cluster's Region to access port 5439. The Firehose CIDR blocks per Region are:
- 13.58.135.96/27 for US East (Ohio)
- 52.70.63.192/27 for US East (N. Virginia)
- 13.57.135.192/27 for US West (N. California)
- 52.89.255.224/27 for US West (Oregon)
- 18.253.138.96/27 for AWS GovCloud (US-East)
- 52.61.204.160/27 for AWS GovCloud (US-West)
- 35.183.92.128/27 for Canada (Central)
- 18.162.221.32/27 for Asia Pacific (Hong Kong)
- 13.232.67.32/27 for Asia Pacific (Mumbai)
- 13.209.1.64/27 for Asia Pacific (Seoul)
- 13.228.64.192/27 for Asia Pacific (Singapore)
- 13.210.67.224/27 for Asia Pacific (Sydney)
- 13.113.196.224/27 for Asia Pacific (Tokyo)
- 52.81.151.32/27 for China (Beijing)
- 161.189.23.64/27 for China (Ningxia)
- 35.158.127.160/27 for Europe (Frankfurt)
- 52.19.239.192/27 for Europe (Ireland)
- 18.130.1.96/27 for Europe (London)
- 35.180.1.96/27 for Europe (Paris)
- 13.53.63.224/27 for Europe (Stockholm)
- 15.185.91.0/27 for Middle East (Bahrain)
- 18.228.1.128/27 for South America (São Paulo)
- 15.161.135.128/27 for Europe (Milan)
- 13.244.121.224/27 for Africa (Cape Town)
- 13.208.177.192/27 for Asia Pacific (Osaka)
- 108.136.221.64/27 for Asia Pacific (Jakarta)
- 3.28.159.32/27 for Middle East (UAE)
- 18.100.71.96/27 for Europe (Spain)
- 16.62.183.32/27 for Europe (Zurich)
- 18.60.192.128/27 for Asia Pacific (Hyderabad)
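For example, with the clusters deployed in ap-southeast-1 (Singapore), the matching ingress rule can also be added after deployment with boto3. This is a minimal sketch; the security group ID is a hypothetical placeholder for the public Redshift security group described above:

```
import boto3

# Authorize the Kinesis Data Firehose CIDR block for ap-southeast-1 (Singapore)
# to reach Redshift on port 5439.
ec2 = boto3.client("ec2", region_name="ap-southeast-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical: the public Redshift security group id
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            "IpRanges": [
                {
                    "CidrIp": "13.228.64.192/27",
                    "Description": "Kinesis Data Firehose, Asia Pacific (Singapore)",
                }
            ],
        }
    ],
)
```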
Serverless file for all VPC resources:
- Replace custom.bucketNamePrefix with the prefix of the bucket you created yourself
```
service: dynamodb-to-redshift-vpc
custom:
  bucketNamePrefix: "jessica"
provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256
resources:
  Parameters:
    VpcName:
      Type: String
      Default: "test-vpc"
  Resources:
    VPC:
      Type: "AWS::EC2::VPC"
      Properties:
        CidrBlock: "10.10.0.0/16"
        EnableDnsSupport: true
        EnableDnsHostnames: true
        InstanceTenancy: default
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}"
    # Internet Gateway
    InternetGateway:
      Type: "AWS::EC2::InternetGateway"
      Properties:
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_InternetGateway"
    VPCGatewayAttachment:
      Type: "AWS::EC2::VPCGatewayAttachment"
      Properties:
        VpcId: !Ref VPC
        InternetGatewayId: !Ref InternetGateway
    # web server security group
    WebServerSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from public
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 443
            ToPort: 443
            CidrIp: "0.0.0.0/0"
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_WebServerSecurityGroup"
    # public route table
    RouteTablePublic:
      Type: "AWS::EC2::RouteTable"
      Properties:
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_RouteTablePublic"
    RouteTablePublicInternetRoute:
      Type: "AWS::EC2::Route"
      DependsOn: VPCGatewayAttachment
      Properties:
        RouteTableId: !Ref RouteTablePublic
        DestinationCidrBlock: "0.0.0.0/0"
        GatewayId: !Ref InternetGateway
    # public subnet
    SubnetAPublic:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [0, !GetAZs ""]
        CidrBlock: "10.10.0.0/24"
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetAPublic"
    RouteTableAssociationAPublic:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetAPublic
        RouteTableId: !Ref RouteTablePublic
    SubnetBPublic:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [1, !GetAZs ""]
        CidrBlock: "10.10.32.0/24"
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetBPublic"
    RouteTableAssociationBPublic:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetBPublic
        RouteTableId: !Ref RouteTablePublic
    SubnetCPublic:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [2, !GetAZs ""]
        CidrBlock: "10.10.64.0/24"
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetCPublic"
    RouteTableAssociationCPublic:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetCPublic
        RouteTableId: !Ref RouteTablePublic
    # redshift security group
    PrivateRedshiftSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from inside vpc
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            CidrIp: 10.10.0.0/24
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            SourceSecurityGroupId: !GetAtt GlueRedshiftConnectionSecurityGroup.GroupId
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_PrivateRedshiftSecurityGroup"
    # redshift security group
    PublicRedshiftSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from inside vpc and Kinesis Data Firehose CIDR block
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            CidrIp: 10.10.0.0/24
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            CidrIp: 13.228.64.192/27
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_PublicRedshiftSecurityGroup"
    GlueRedshiftConnectionSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow self referring for all tcp ports
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_GlueRedshiftConnectionSecurityGroup"
    GlueRedshiftConnectionSecurityGroupSelfReferringInboundRule:
      Type: "AWS::EC2::SecurityGroupIngress"
      Properties:
        GroupId: !GetAtt GlueRedshiftConnectionSecurityGroup.GroupId
        IpProtocol: tcp
        FromPort: 0
        ToPort: 65535
        SourceSecurityGroupId: !GetAtt GlueRedshiftConnectionSecurityGroup.GroupId
        SourceSecurityGroupOwnerId: !Sub "${aws:accountId}"
    # nat gateway
    EIP:
      Type: "AWS::EC2::EIP"
      Properties:
        Domain: vpc
    NatGateway:
      Type: "AWS::EC2::NatGateway"
      Properties:
        AllocationId: !GetAtt "EIP.AllocationId"
        SubnetId: !Ref SubnetAPublic
    # private route table
    RouteTablePrivate:
      Type: "AWS::EC2::RouteTable"
      Properties:
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_RouteTablePrivate"
    RouteTablePrivateRoute:
      Type: "AWS::EC2::Route"
      Properties:
        RouteTableId: !Ref RouteTablePrivate
        DestinationCidrBlock: "0.0.0.0/0"
        NatGatewayId: !Ref NatGateway
    # private subnet
    SubnetAPrivate:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [0, !GetAZs ""]
        CidrBlock: "10.10.16.0/24"
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetAPrivate"
    RouteTableAssociationAPrivate:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetAPrivate
        RouteTableId: !Ref RouteTablePrivate
    SubnetBPrivate:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [1, !GetAZs ""]
        CidrBlock: "10.10.48.0/24"
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetBPrivate"
    RouteTableAssociationBPrivate:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetBPrivate
        RouteTableId: !Ref RouteTablePrivate
    SubnetCPrivate:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [2, !GetAZs ""]
        CidrBlock: "10.10.80.0/24"
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetCPrivate"
    RouteTableAssociationCPrivate:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetCPrivate
        RouteTableId: !Ref RouteTablePrivate
  Outputs:
    VPC:
      Description: "VPC."
      Value: !Ref VPC
      Export:
        Name: !Sub "${self:provider.stackName}"
    SubnetsPublic:
      Description: "Subnets public."
      Value: !Join [",", [!Ref SubnetAPublic, !Ref SubnetBPublic, !Ref SubnetCPublic]]
      Export:
        Name: !Sub "${self:provider.stackName}-PublicSubnets"
    SubnetsPrivate:
      Description: "Subnets private."
      Value: !Join [",", [!Ref SubnetAPrivate, !Ref SubnetBPrivate, !Ref SubnetCPrivate]]
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateSubnets"
    DefaultSecurityGroup:
      Description: "VPC Default Security Group"
      Value: !GetAtt VPC.DefaultSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-DefaultSecurityGroup"
    WebServerSecurityGroup:
      Description: "VPC Web Server Security Group"
      Value: !Ref WebServerSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-WebServerSecurityGroup"
    PrivateRedshiftSecurityGroup:
      Description: "The id of the RedshiftSecurityGroup"
      Value: !Ref PrivateRedshiftSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateRedshiftSecurityGroup"
    PublicRedshiftSecurityGroup:
      Description: "The id of the RedshiftSecurityGroup"
      Value: !Ref PublicRedshiftSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-PublicRedshiftSecurityGroup"
    GlueRedshiftConnectionSecurityGroup:
      Description: "The id of the self referring security group"
      Value: !Ref GlueRedshiftConnectionSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-GlueSelfRefringSecurityGroup"
```
Redshift Cluster
- Private cluster subnet group
  - Create a private subnet group containing the private subnets
- Private cluster: used to test the Glue job syncing data to Redshift. PubliclyAccessible must be set to false, otherwise the Glue job cannot connect.
  - ClusterSubnetGroupName: use the private subnet group
  - VpcSecurityGroupIds: use the private redshift security group
  - NodeType: dc2.large
  - ClusterType: single-node
  - PubliclyAccessible: false
- Public cluster subnet group
  - Create a public subnet group containing the public subnets
- Public cluster: used to test Kinesis Data Firehose syncing data to Redshift. PubliclyAccessible must be set to true, and the security group must allow the Kinesis Firehose public IPs to access port 5439, otherwise Firehose cannot connect to Redshift.
  - ClusterSubnetGroupName: use the public subnet group
  - VpcSecurityGroupIds: use the public redshift security group
  - NodeType: dc2.large
  - ClusterType: single-node
  - PubliclyAccessible: true
Serverless file for all Redshift resources:
- Replace custom.bucketNamePrefix with the prefix of the bucket you created yourself
```
service: dynamodb-to-redshift-redshift
custom:
  bucketNamePrefix: "jessica"
provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256
resources:
  Parameters:
    ServiceName:
      Type: String
      Default: dynamodb-to-redshift
  Resources:
    PrivateClusterSubnetGroup:
      Type: "AWS::Redshift::ClusterSubnetGroup"
      Properties:
        Description: Private Cluster Subnet Group
        SubnetIds:
          Fn::Split:
            - ","
            - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateSubnets
        Tags:
          - Key: Name
            Value: private-subnet
    PrivateCluster:
      Type: "AWS::Redshift::Cluster"
      Properties:
        ClusterIdentifier: test-data-sync-redshift
        ClusterSubnetGroupName: !Ref PrivateClusterSubnetGroup
        VpcSecurityGroupIds:
          - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateRedshiftSecurityGroup
        DBName: dev
        MasterUsername: admin
        MasterUserPassword: Redshift_admin_2022
        NodeType: dc2.large
        ClusterType: single-node
        PubliclyAccessible: false
    PublicClusterSubnetGroup:
      Type: "AWS::Redshift::ClusterSubnetGroup"
      Properties:
        Description: Public Cluster Subnet Group
        SubnetIds:
          Fn::Split:
            - ","
            - Fn::ImportValue: !Sub ${ServiceName}-vpc-PublicSubnets
        Tags:
          - Key: Name
            Value: public-subnet
    PublicCluster:
      Type: "AWS::Redshift::Cluster"
      Properties:
        ClusterIdentifier: test-data-sync-redshift-public
        ClusterSubnetGroupName: !Ref PublicClusterSubnetGroup
        VpcSecurityGroupIds:
          - Fn::ImportValue: !Sub ${ServiceName}-vpc-PublicRedshiftSecurityGroup
        DBName: dev
        MasterUsername: admin
        MasterUserPassword: Redshift_admin_2022
        NodeType: dc2.large
        ClusterType: single-node
        PubliclyAccessible: true
  Outputs:
    PrivateRedshiftEndpoint:
      Description: "Redshift endpoint"
      Value: !GetAtt PrivateCluster.Endpoint.Address
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateRedshiftEndpoint"
    PrivateRedshiftPort:
      Description: "Redshift port"
      Value: !GetAtt PrivateCluster.Endpoint.Port
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateRedshiftPort"
    PublicRedshiftEndpoint:
      Description: "Public Redshift endpoint"
      Value: !GetAtt PublicCluster.Endpoint.Address
      Export:
        Name: !Sub "${self:provider.stackName}-PublicRedshiftEndpoint"
    PublicRedshiftPort:
      Description: "Public Redshift port"
      Value: !GetAtt PublicCluster.Endpoint.Port
      Export:
        Name: !Sub "${self:provider.stackName}-PublicRedshiftPort"
```
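Once the stack is deployed, a quick way to verify the public cluster is reachable is to connect from your own machine. A minimal sketch, assuming psycopg2 is installed (Redshift speaks the PostgreSQL wire protocol) and the placeholder endpoint is replaced with the PublicRedshiftEndpoint stack output:

```
import psycopg2

# Connect with the master credentials defined in the serverless file above.
conn = psycopg2.connect(
    host="<PublicRedshiftEndpoint>",  # placeholder: take the value from the stack outputs
    port=5439,
    dbname="dev",
    user="admin",
    password="Redshift_admin_2022",
)
with conn.cursor() as cur:
    cur.execute("SELECT current_database(), version();")
    print(cur.fetchone())
conn.close()
```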
Syncing with an AWS Glue ETL Job
Applicable Scenarios
- One-off full-table sync
- For typical time-series data (current data is written and read frequently, and the older the data the less it is read or written), a common approach is to create one table per time period (e.g. per day) so that WCU and RCU can be allocated sensibly.
- If, after a time period ends, you need to run complex analytics over all the data of that period, you sync the entire DynamoDB table for that period into Redshift.
Architecture
Advantages
- AWS Glue Crawlers manage the schemas of the source and target tables automatically, so the mapping step can be skipped in the Glue job script, which keeps the script easy to maintain.
Resources to Deploy
- DynamoDB table: the source data table
- IAM role for the Glue crawlers: the crawlers need permissions to connect to DynamoDB and Redshift in order to read the table schemas
- DynamoDB Glue catalog database: stores the DynamoDB table schema generated by the crawler
- Redshift Glue catalog database: stores the Redshift table schema generated by the crawler
- DynamoDB Glue crawler: reads the DynamoDB table and generates the corresponding table schema
- Redshift Glue crawler: reads the Redshift table and generates the corresponding table schema
- Glue connection: the connection the Glue job uses to reach Redshift
- IAM role for the Glue job: grants the Glue job the permissions it needs (in the file below: the AWSGlueServiceRole service policy plus DynamoDB, Redshift, S3, and CloudWatch Logs access)
- S3 bucket for the Glue job: holds the job script and temporary data
- Glue job
How to deploy:

```
sls deploy -c glue-etl.yml  # replace ${bucketNamePrefix} with your own bucket name prefix created in glue-etl.yml
aws s3 cp dynamodb-to-redshift.py s3://com.${bucketNamePrefix}.glue-etl-temp-bucket/script/
```
Deployment file: glue-etl.yml

```
service: dynamodb-to-redshift-glue-etl
custom:
  bucketNamePrefix: "jessica"
provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256
resources:
  Parameters:
    DynamodbTableName:
      Type: String
      Default: "TestSyncToRedshift"
    ServiceName:
      Type: String
      Default: dynamodb-to-redshift
    GlueBucketName:
      Type: String
      Default: com.${self:custom.bucketNamePrefix}.glue-etl-temp-bucket
  Resources:
    TestTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: !Sub ${DynamodbTableName}
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: pk
            AttributeType: S
          - AttributeName: sk
            AttributeType: S
        KeySchema:
          - AttributeName: pk
            KeyType: HASH
          - AttributeName: sk
            KeyType: RANGE
    CrawlerRole:
      Type: AWS::IAM::Role
      Properties:
        RoleName: CrawlerRole
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: "Allow"
              Principal:
                Service:
                  - "glue.amazonaws.com"
              Action:
                - "sts:AssumeRole"
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
          - arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
          - arn:aws:iam::aws:policy/AmazonRedshiftFullAccess
          - arn:aws:iam::aws:policy/AmazonS3FullAccess
    DynamodbDatabase:
      Type: AWS::Glue::Database
      Properties:
        CatalogId: !Ref AWS::AccountId
        DatabaseInput:
          Name: "dynamodb-database"
    DynamodbCrawler:
      Type: AWS::Glue::Crawler
      Properties:
        Name: "dynamodb-crawler"
        Role: !GetAtt CrawlerRole.Arn
        DatabaseName: !Ref DynamodbDatabase
        Targets:
          DynamoDBTargets:
            - Path: !Sub ${DynamodbTableName}
        SchemaChangePolicy:
          UpdateBehavior: "UPDATE_IN_DATABASE"
          DeleteBehavior: "LOG"
        Schedule:
          ScheduleExpression: cron(0/10 * * * ? *) # run every 10 minutes
    GlueRedshiftConnection:
      Type: AWS::Glue::Connection
      Properties:
        CatalogId: !Sub "${aws:accountId}"
        ConnectionInput:
          Name: ${self:service}-redshift-connection
          ConnectionType: JDBC
          MatchCriteria: []
          PhysicalConnectionRequirements:
            SecurityGroupIdList:
              - Fn::ImportValue: !Sub ${ServiceName}-vpc-GlueSelfRefringSecurityGroup
            SubnetId:
              Fn::Select:
                - 1
                - Fn::Split:
                    - ","
                    - Fn::ImportValue: !Sub "${ServiceName}-vpc-PrivateSubnets"
          ConnectionProperties:
            JDBC_CONNECTION_URL:
              Fn::Join:
                - ""
                - - "jdbc:redshift://"
                  - Fn::ImportValue: !Sub ${ServiceName}-redshift-PrivateRedshiftEndpoint
                  - ":"
                  - Fn::ImportValue: !Sub ${ServiceName}-redshift-PrivateRedshiftPort
                  - "/dev"
            JDBC_ENFORCE_SSL: false
            USERNAME: admin
            PASSWORD: Redshift_admin_2022
    RedshiftDatabase:
      Type: AWS::Glue::Database
      Properties:
        CatalogId: !Ref AWS::AccountId
        DatabaseInput:
          Name: "redshift-database"
    RedshiftCrawler:
      Type: AWS::Glue::Crawler
      Properties:
        Name: "redshift-crawler"
        Role: !GetAtt CrawlerRole.Arn
        DatabaseName: !Ref RedshiftDatabase
        Targets:
          JdbcTargets:
            - ConnectionName: !Ref GlueRedshiftConnection
              Path: dev/public/test_sync_to_redshift
        SchemaChangePolicy:
          UpdateBehavior: "UPDATE_IN_DATABASE"
          DeleteBehavior: "LOG"
    RedshiftGlueJobRole:
      Type: AWS::IAM::Role
      Properties:
        RoleName: RedshiftGlueJobRole
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Principal:
                Service:
                  - glue.amazonaws.com
              Action: sts:AssumeRole
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
          - arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
          - arn:aws:iam::aws:policy/AmazonRedshiftFullAccess
          - arn:aws:iam::aws:policy/AmazonS3FullAccess
          - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
    GlueTempBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: !Sub ${GlueBucketName}
    GlueJob:
      Type: AWS::Glue::Job
      Properties:
        Name: dynamodb-to-redshift-glue-etl-job
        Role: !GetAtt RedshiftGlueJobRole.Arn
        Command:
          Name: glueetl
          ScriptLocation: !Sub "s3://${GlueBucketName}/script/dynamodb-to-redshift.py"
          PythonVersion: 3
        DefaultArguments:
          --TempDir: !Sub "s3://${GlueBucketName}/tmp/dynamodb-to-redshift/"
        WorkerType: G.1X
        NumberOfWorkers: 2
        GlueVersion: "3.0"
        Connections:
          Connections:
            - !Ref GlueRedshiftConnection
```
Glue job script: dynamodb-to-redshift.py
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
params = [
'JOB_NAME',
'TempDir',
]
args = getResolvedOptions(sys.argv, params)
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
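# Read the source table using the schema that dynamodb-crawler stored in the Glue Data Catalog.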
DynamoDBtable_node1 = glueContext.create_dynamic_frame.from_catalog(
database="dynamodb-database",
table_name="testsynctoredshift",
transformation_ctx="DynamoDBtable_node1",
)
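# Write to Redshift through the catalog table created by redshift-crawler; Glue stages
# the data in TempDir on S3 and loads it into the cluster with a COPY behind the scenes.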
RedshiftCluster_node2 = glueContext.write_dynamic_frame.from_catalog(
frame=DynamoDBtable_node1,
database="redshift-database",
table_name="dev_public_test_sync_to_redshift",
redshift_tmp_dir=args["TempDir"],
transformation_ctx="RedshiftCluster_node2",
)
job.commit()
Testing
1. Insert some data into the DynamoDB table with the AWS web console first; otherwise the crawler cannot detect the table schema.
2. Run dynamodb-crawler. After it succeeds, you can see the database and table in the [glue console](https://ap-southeast-1.console.aws.amazon.com/glue/home?region=ap-southeast-1#catalog:tab=tables).
3. Create the Redshift table with the [Redshift query editor v2](https://ap-southeast-1.console.aws.amazon.com/sqlworkbench/home?region=ap-southeast-1#/client):
```
CREATE TABLE "public"."test_sync_to_redshift"(pk varchar(200) not null, sk varchar(200) NOT NULL, primary key(pk, sk));
```
4. Run redshift-crawler. If you encounter a "no valid connection" error, update the password in the redshift-connection manually in the AWS console; it is unclear why the password is wrong when deployed with CloudFormation. After the crawler succeeds, you can see the database and table in the [glue console](https://ap-southeast-1.console.aws.amazon.com/glue/home?region=ap-southeast-1#catalog:tab=tables).
5. Run the Glue ETL job. After it succeeds, you can check the data in the Redshift table with the [Redshift query editor v2](https://ap-southeast-1.console.aws.amazon.com/sqlworkbench/home?region=ap-southeast-1#/client).
This Glue ETL job will `insert all data in the dynamodb table` into the Redshift table directly. In Redshift, [primary key and foreign key constraints are informational only; they are not enforced by Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html), so if you run the job several times you will see duplicate rows with the same primary key in the Redshift table. The crawlers and the job can also be started programmatically, as sketched below.
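A minimal sketch with boto3, using the crawler and job names from glue-etl.yml; the polling interval is an arbitrary choice:

```
import time
import boto3

glue = boto3.client("glue", region_name="ap-southeast-1")

def run_crawler(name: str) -> None:
    # Start a crawler and poll until it returns to the READY state.
    glue.start_crawler(Name=name)
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(15)

run_crawler("dynamodb-crawler")
run_crawler("redshift-crawler")

# Kick off the ETL job once both schemas are in the catalog.
run_id = glue.start_job_run(JobName="dynamodb-to-redshift-glue-etl-job")["JobRunId"]
print("Started job run:", run_id)
```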
Syncing with an AWS Glue Streaming Job
Applicable Scenarios
- Continuous incremental sync
- Insert, update, and delete operations on the table are all supported
Architecture Diagram
Resources to Deploy
- DynamoDB table
- VPC
- Redshift Cluster
- Glue Crawler
- Glue Job
Advantages
- Insert, update, and delete operations on the table can all be synced
Disadvantages
Syncing with AWS Kinesis Data Firehose
Applicable Scenarios
- Continuous incremental sync
- Only inserts are fully supported; updates are partially supported and deletes are not supported. Typical workloads: per-second sensor readings, or user behavior events on a website.
- Kinesis Firehose integrates with Redshift through the Redshift COPY command, and Redshift does not guarantee primary key uniqueness: to Redshift a primary key is informational only. If the COPY source data contains multiple records with the same primary key (for example, a record that was modified several times), the Redshift table ends up with multiple rows sharing that primary key.
- "Partially supported updates" means that if multiple rows with the same primary key do not affect your business logic, you can still sync with Kinesis Firehose; if they do affect it, you cannot (a duplicate-count sketch follows this list).
- Because Kinesis Firehose integrates with Redshift through the COPY command, deletes are not supported.
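To check whether duplicate primary keys have actually accumulated after several COPY loads, a simple aggregate query is enough. A sketch against the test table created earlier, using the same psycopg2 approach as above with a placeholder endpoint:

```
import psycopg2

conn = psycopg2.connect(
    host="<PublicRedshiftEndpoint>",  # placeholder: the public cluster endpoint
    port=5439, dbname="dev", user="admin", password="Redshift_admin_2022",
)
# Count rows that share the same (pk, sk) primary key.
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT pk, sk, COUNT(*) AS copies "
        "FROM public.test_sync_to_redshift "
        "GROUP BY pk, sk HAVING COUNT(*) > 1;"
    )
    for pk, sk, copies in cur.fetchall():
        print(pk, sk, copies)
```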
Architecture Diagram
Resources to Deploy
Reference
Setting up networking for development for AWS Glue - AWS Glue