Doris X Polaris: Building Unified Data Lakehouse with Iceberg REST Catalog - A Practical Guide
With the continuous evolution of data lake technologies, efficiently and securely managing massive datasets stored on object storage (such as AWS S3) while providing unified access endpoints for upstream analytics engines (like Apache Doris) has become a core challenge in modern data architectures. Apache Polaris, as an open and standardized REST Catalog service for Iceberg, provides an ideal solution to this challenge. It not only handles centralized metadata management but also significantly enhances data lake security and manageability through fine-grained access control and flexible credential management mechanisms.
This document provides a detailed guide to integrating Apache Doris with Polaris so that Iceberg data on S3 can be queried and managed efficiently. We will walk through the complete process step by step, from environment preparation to the final data query.
Through this documentation, you will quickly learn:
AWS Environment Setup: How to create and configure S3 buckets in AWS, and prepare the necessary IAM roles and policies for both Polaris and Doris, enabling Polaris to access S3 and vend temporary credentials for Doris.
Polaris Deployment and Configuration: How to download and start the Polaris service, and create Iceberg Catalog, Namespace, and corresponding Principal/Role/permissions in Polaris to provide secure metadata access endpoints for Doris.
Doris-Polaris Integration: Explains how Doris obtains metadata access tokens from Polaris via OAuth2, and demonstrates two core underlying storage access methods:
Temporary AK/SK distribution by Polaris (Credential Vending mechanism)
Doris directly using static AK/SK to access S3
About Apache Doris
Apache Doris is the fastest analytical and search database for the AI era.
It provides high-performance hybrid search capabilities across structured data, semi-structured data (such as JSON), and vector data. It excels at delivering high-concurrency, low-latency queries, while also offering advanced optimization for complex join operations. In addition, Doris can serve as a unified query engine, delivering high-performance analytical services not only on its self-managed internal table format but also on open lakehouse formats such as Iceberg.
With Doris, users can easily build a real-time lakehouse data platform.
About Apache Polaris
Apache Polaris (Incubating) is a catalog implementation for Apache Iceberg™ tables and is built on the open source Apache Iceberg™ REST protocol.
With Polaris, you can provide centralized, secure read and write access to your Iceberg tables across different REST-compatible query engines.
Hands-on Guide
1. AWS Environment Setup
Before we begin, we need to prepare S3 buckets and corresponding IAM roles on AWS, which form the foundation for Polaris to manage data and Doris to access data.
1.1 Create S3 Bucket
First, we create an S3 bucket named polaris-doris-test to store the Iceberg table data that will be created later.
# Create an S3 bucket
aws s3 mb s3://polaris-doris-test --region us-west-2
# Verify that the bucket was created successfully
aws s3 ls | grep polaris-doris-test
1.2 Create IAM Role for Object Storage Access
To implement secure credential management, we need to create an IAM role for Polaris to use through the STS AssumeRole mechanism. This design follows the security best practices of least privilege and separation of duties.
Create a trust policy file
Create the polaris-trust-policy.json file.
Note: Replace YOUR_ACCOUNT_ID with your actual AWS account ID, which can be obtained using aws sts get-caller-identity --query Account --output text.
cat > polaris-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::YOUR_ACCOUNT_ID:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "polaris-doris-demo"
        }
      }
    }
  ]
}
EOF
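The account ID substitution described in the note can also be scripted. The following is a minimal sketch rather than part of the original setup: render_trust_policy is a hypothetical helper, and the fallback account ID 123456789012 is a placeholder that only keeps the example runnable without AWS access.

```shell
set -e

# Hypothetical helper: print the trust policy for the given 12-digit account ID.
render_trust_policy() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "expected a 12-digit AWS account ID, got: $1" >&2; return 1 ;;
  esac
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${1}:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "polaris-doris-demo" }
      }
    }
  ]
}
EOF
}

# With the AWS CLI configured, the real ID can be supplied like this:
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"   # placeholder fallback, replace with your own ID

render_trust_policy "$ACCOUNT_ID" > polaris-trust-policy.json
```

Rejecting anything that is not exactly 12 digits catches the common mistake of leaving YOUR_ACCOUNT_ID in the file before it reaches IAM.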
Create an IAM Role
aws iam create-role \
  --role-name polaris-doris-demo \
  --assume-role-policy-document file:///path/to/polaris-trust-policy.json \
  --description "IAM Role for Polaris to access S3 storage"
Attach S3 access permission policy
# Attach the AmazonS3FullAccess managed policy (for testing only, use fine-grained permissions for production environments)
aws iam attach-role-policy \
  --role-name polaris-doris-demo \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
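For production, the managed policy can be replaced with one scoped to the demo bucket. The sketch below only writes such a policy file; it is an illustration, not the guide's tested configuration. The file name, the policy name in the comment, and the exact action list are assumptions, and the polaris1 prefix is assumed to match the catalog base location used later in this guide.

```shell
set -e

BUCKET="polaris-doris-test"                   # bucket created in section 1.1
PREFIX="polaris1"                             # assumption: prefix used as the catalog base location
POLICY_FILE="polaris-s3-scoped-policy.json"   # hypothetical file name

# Assumed minimal permissions: list the bucket, read/write/delete objects under the prefix.
cat > "$POLICY_FILE" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::${BUCKET}"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::${BUCKET}/${PREFIX}/*"
    }
  ]
}
EOF

# To apply it as an inline policy (requires AWS credentials, so it is not run here):
#   aws iam put-role-policy \
#     --role-name polaris-doris-demo \
#     --policy-name polaris-s3-scoped \
#     --policy-document "file://$POLICY_FILE"
```

If you adopt a scoped policy like this, skip attaching AmazonS3FullAccess to the same role, since the broader policy would otherwise still apply.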
1.3 Bind IAM Role to EC2 Instance (Optional)
If you do not perform this step, you need to export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before starting Polaris.
If your Polaris service will run on an EC2 instance, it is best to bind an IAM role to the EC2 instance instead of using access keys. This avoids hard-coding credentials in the code and improves security.
Create a trust policy for the EC2 instance role
First, create the trust policy file that allows the EC2 service to assume this role:
cat > ec2-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
Create EC2 Instance Role
aws iam create-role \
  --role-name polaris-ec2-role \
  --assume-role-policy-document file:///path/to/ec2-trust-policy.json \
  --description "IAM Role for EC2 instance running Polaris service"
Attach S3 access permission policy
# Attach the AmazonS3FullAccess managed policy
aws iam attach-role-policy \
  --role-name polaris-ec2-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
Create an instance profile
# Create an instance profile
aws iam create-instance-profile \
  --instance-profile-name polaris-ec2-instance-profile

# Add the role to the instance profile
aws iam add-role-to-instance-profile \
  --instance-profile-name polaris-ec2-instance-profile \
  --role-name polaris-ec2-role
Attach the instance profile to the EC2 instance
# If it is a newly created EC2 instance, specify it at startup
aws ec2 run-instances \
  --image-id ami-xxxxxxxxx \
  --instance-type t3.medium \
  --iam-instance-profile Name=polaris-ec2-instance-profile \
  --other-parameters...

# If it is an existing EC2 instance, you need to associate the instance profile
aws ec2 associate-iam-instance-profile \
  --instance-id i-xxxxxxxxx \
  --iam-instance-profile Name=polaris-ec2-instance-profile
2. Polaris Deployment and Catalog Creation
With the environment ready, we’ll now deploy the Polaris service and configure the Iceberg Catalog.
This document uses the source code quick start method. For more deployment methods, please refer to: https://polaris.apache.org/releases/1.0.1/getting-started/deploying-polaris/
2.1 Clone Source Code and Start Polaris
Configure AWS Credentials (Optional)
If you're not running Polaris on EC2, or if the EC2 instance doesn't have the appropriate IAM Role attached, you need to provide Polaris with an AK/SK that has permission to assume the polaris-doris-demo role through environment variables.
export AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY
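Because a missing key usually only surfaces later as an S3 or STS error, it can help to check the environment before starting Polaris. This is a small sketch of such a pre-flight check; require_env is a hypothetical helper, not something Polaris provides.

```shell
# Hypothetical pre-flight helper: succeed only if every named variable
# is exported and non-empty.
require_env() {
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Typical use when Polaris runs without an EC2 instance role:
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
#     && ./gradlew run -Dpolaris.bootstrap.credentials=POLARIS,root,secret
```

The helper uses printenv rather than reading the shell variable directly, so a variable that was assigned but never exported is also reported as missing.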
Clone Polaris Repository and Switch to Specific Version
git clone https://github.com/apache/polaris.git
cd polaris
# Recommend using a released stable version
git checkout apache-polaris-1.0.1-incubating
Run Polaris
Ensure you have Java 21+ and Docker 27+ installed.
./gradlew run -Dpolaris.bootstrap.credentials=POLARIS,root,secret
POLARIS is the realm
root is the CLIENT_ID
secret is the CLIENT_SECRET
If credentials are not set, the preset credentials POLARIS,root,s3cr3t are used.
This command will compile and start the Polaris service, which listens on port 8181 by default.
You can also use the binary distribution, see: https://github.com/apache/polaris/tree/main/runtime/distribution
2.2 Create Catalog and Namespace in Polaris
Export ROOT Credentials
The CLIENT_ID and CLIENT_SECRET here are the same as those we set when we started Polaris.
export CLIENT_ID=root
export CLIENT_SECRET=secret
Create Catalog (Pointing to S3 Storage)
./polaris catalogs create \
  --storage-type s3 \
  --default-base-location s3://polaris-doris-test/polaris1 \
  --role-arn arn:aws:iam::<account_id>:role/polaris-doris-demo \
  --external-id polaris-doris-demo \
  doris_catalog
--storage-type: Specifies the underlying storage as S3.
--default-base-location: Default root path for Iceberg table data.
--role-arn: IAM role that the Polaris service assumes for S3 access.
--external-id: External ID used when assuming the role; it must match the configuration in the IAM role trust policy.
Create Namespace
./polaris namespaces create --catalog doris_catalog doris_demo
This creates a namespace (database) named doris_demo under doris_catalog.
2.3 Polaris Security Roles and Permission Configuration
To allow Doris to access Polaris as a non-root user, we need to create a new user and role with appropriate permissions.
Create Principal Role and Catalog Role
# Create a Principal Role for aggregating permissions
./polaris principal-roles create doris_pr_role

# Create a Catalog Role under doris_catalog
./polaris catalog-roles create --catalog doris_catalog doris_catalog_role
Grant Permissions to Catalog Role
# Grant doris_catalog_role permission to manage content within the Catalog
./polaris privileges catalog grant \
  --catalog doris_catalog \
  --catalog-role doris_catalog_role \
  CATALOG_MANAGE_CONTENT
Associate Principal Role and Catalog Role
# Assign doris_catalog_role to doris_pr_role
./polaris catalog-roles grant \
  --catalog doris_catalog \
  --principal-role doris_pr_role \
  doris_catalog_role
Create New Principal (User) and Bind Role
# Create a new user (Principal) named doris_user
./polaris principals create doris_user
# Example output: {"clientId": "6e155b128dc06c13", "clientSecret": "ce9fbb4cc91c43ff2955f2c6545239d7"}
# Please note down this new client_id and client_secret pair, as Doris will use them for connection.

# Bind doris_user to doris_pr_role
./polaris principal-roles grant \
  doris_pr_role \
  --principal doris_user
With this, all Polaris-side configuration is complete. We've created a user named doris_user that obtains permission to manage doris_catalog through doris_pr_role.
3. Doris-Polaris Integration
Now, we’ll create an Iceberg Catalog in Doris that connects to the newly configured Polaris service. Doris supports multiple flexible authentication combinations.
Note: In this example, we use OAuth2 client credentials to connect to the Polaris REST service. Doris also supports using iceberg.rest.oauth2.token to provide a pre-obtained Bearer Token directly.
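To check the principal's credentials independently of Doris, the OAuth2 client-credentials exchange can be reproduced by hand against the token endpoint used in the catalog statements in this guide. The sketch below is illustrative: token_request_body and fetch_token are hypothetical helpers, and the values are inserted without URL encoding, which is only safe for simple identifiers such as the hexadecimal client ID and secret shown earlier.

```shell
# Hypothetical helper: build the form body for an OAuth2 client-credentials request.
# Arguments: client_id client_secret principal_role
token_request_body() {
  printf 'grant_type=client_credentials&client_id=%s&client_secret=%s&scope=PRINCIPAL_ROLE:%s' \
    "$1" "$2" "$3"
}

# Hypothetical helper: request a token from Polaris.
# It needs a running Polaris service, so it is defined here but not called.
# Arguments: polaris_host client_id client_secret principal_role
fetch_token() {
  curl -s -X POST "http://$1:8181/api/catalog/v1/oauth/tokens" \
    -d "$(token_request_body "$2" "$3" "$4")"
}

# Print the request body for the sample credentials shown for doris_user.
token_request_body 6e155b128dc06c13 ce9fbb4cc91c43ff2955f2c6545239d7 doris_pr_role
```

Against a running service, fetch_token YOUR_POLARIS_HOST followed by the client ID, client secret, and doris_pr_role should return a JSON document containing an access token, which is the kind of value that iceberg.rest.oauth2.token expects.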
Method 1: OAuth2 + Temporary Storage Credentials (Credential Vending)
This is the recommended approach. Doris uses OAuth2 credentials to authenticate with Polaris and obtain metadata. When it needs to read or write data files on S3, Doris requests temporary, minimally privileged S3 credentials from Polaris.
Doris Catalog Creation Statement:
Use the clientId and clientSecret generated for doris_user.
CREATE CATALOG polaris_vended PROPERTIES (
'type' = 'iceberg',
-- Catalog name in Polaris
'warehouse' = 'doris_catalog',
'iceberg.catalog.type' = 'rest',
-- Polaris service address
'iceberg.rest.uri' = 'http://YOUR_POLARIS_HOST:8181/api/catalog',
-- Metadata authentication method
'iceberg.rest.security.type' = 'oauth2',
-- Replace with doris_user's client_id:client_secret
'iceberg.rest.oauth2.credential' = 'client_id:client_secret',
'iceberg.rest.oauth2.server-uri' = 'http://YOUR_POLARIS_HOST:8181/api/catalog/v1/oauth/tokens',
'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:doris_pr_role',
-- Enable credential vending
'iceberg.rest.vended-credentials-enabled' = 'true',
-- S3 basic configuration (no keys required)
's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
's3.region' = 'us-west-2'
);
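Under the hood, credential vending is requested per table: in the Iceberg REST protocol, the client asks for delegated storage access with the X-Iceberg-Access-Delegation header when it loads a table. The sketch below shows roughly what such a request looks like. The helper names are hypothetical, the URL layout assumes that Polaris uses the catalog name as the REST path prefix, and my_iceberg_table is only an example table name.

```shell
# Hypothetical helper: URL of the Iceberg REST "load table" endpoint.
# Arguments: polaris_host catalog namespace table
load_table_url() {
  printf 'http://%s:8181/api/catalog/v1/%s/namespaces/%s/tables/%s' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: load table metadata and ask for vended credentials.
# It needs a running Polaris and a valid bearer token, so it is defined but not called.
# Arguments: bearer_token polaris_host catalog namespace table
load_table_with_vended_credentials() {
  curl -s "$(load_table_url "$2" "$3" "$4" "$5")" \
    -H "Authorization: Bearer $1" \
    -H "X-Iceberg-Access-Delegation: vended-credentials"
}

# Print the URL that would be requested for the demo catalog and namespace.
load_table_url YOUR_POLARIS_HOST doris_catalog doris_demo my_iceberg_table
```

When vending works, the response is expected to carry temporary S3 credentials in its configuration section alongside the table metadata, which is why the Doris catalog above needs no static keys.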
Method 2: OAuth2 + Static Storage Credentials (AK/SK)
In this approach, Doris still uses OAuth2 to access Polaris metadata, but when accessing S3 data, it uses a static AK/SK hardcoded in the Doris Catalog configuration. This method is simple to configure and suitable for quick testing, but it is less secure.
Doris Catalog Creation Statement:
CREATE CATALOG polaris_aksk PROPERTIES (
'type' = 'iceberg',
'warehouse' = 'doris_catalog',
'iceberg.catalog.type' = 'rest',
'iceberg.rest.uri' = 'http://YOUR_POLARIS_HOST:8181/api/catalog',
'iceberg.rest.security.type' = 'oauth2',
'iceberg.rest.oauth2.credential' = 'client_id:client_secret',
'iceberg.rest.oauth2.server-uri' = 'http://YOUR_POLARIS_HOST:8181/api/catalog/v1/oauth/tokens',
'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:doris_pr_role',
-- Directly provide S3 access keys
's3.access_key' = 'YOUR_S3_ACCESS_KEY',
's3.secret_key' = 'YOUR_S3_SECRET_KEY',
's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
's3.region' = 'us-west-2'
);
4. Managing Iceberg Tables in Doris with Polaris
Regardless of which method you use to create the Catalog, you can manage Iceberg tables with the following SQL statements.
-- Switch to the Catalog you created and the Namespace configured in Polaris
USE polaris_vended.doris_demo;
-- Create an Iceberg table
CREATE TABLE my_iceberg_table (
id INT,
name STRING
)
PROPERTIES (
'write-format'='parquet'
);
-- Insert data
INSERT INTO my_iceberg_table VALUES (1, 'Doris'), (2, 'Polaris');
-- Query data
SELECT * FROM my_iceberg_table;
-- Expected result:
-- +------+---------+
-- | id | name |
-- +------+---------+
-- | 1 | Doris |
-- | 2 | Polaris |
-- +------+---------+
If all the above operations succeed, congratulations! You have successfully established the complete data lake pipeline from Doris -> Polaris -> Iceberg on S3.
For more information about managing Iceberg tables with Doris, please visit:
https://doris.apache.org/docs/lakehouse/catalogs/iceberg-catalog