Introduction
Recently, I had a task to refresh our UAT environment using a PDB from the production database hosted in Oracle Cloud Infrastructure (OCI).
At first, the activity looked straightforward. The plan was to perform a remote PDB clone from Production to UAT. However, there was one challenge from the beginning: the source and target databases were located in different OCI VCNs.
After establishing connectivity between the environments, I started the cloning process. The clone started successfully, but after approximately one hour, it appeared to stall, and no significant progress was observed.
After some investigation and testing, I identified two important factors that affected the cloning process:
- The target DBCS was running with only 1 OCPU.
- The production database had an hourly Commvault job that backed up the archive logs and then deleted them.
In this article, I will share the architecture, troubleshooting process, and lessons learned from this refresh activity.
1. Environment Overview
Source Environment
- Production Oracle Database Cloud Service (DBCS)
- Production DBCS is hosted in the Production VCN
Target Environment
- UAT Oracle Database Cloud Service (DBCS)
- Target DBCS is hosted in a separate Non-Production VCN
Refresh Method
- Remote PDB Clone using Database Link
2. Challenge #1 – Different OCI VCNs
The first challenge was that the Production and UAT databases were deployed in different OCI VCNs.
Since remote PDB cloning requires communication between the source and target databases, I first needed to establish network connectivity.
To achieve this, I configured Local Peering Gateways (LPGs) between the two VCNs.
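Peering with LPGs only works when the two VCNs have non-overlapping CIDR blocks, so it can help to check that before creating anything. The sketch below is not part of the original procedure; it is a minimal bash check, and the CIDR values in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check whether two IPv4 CIDR blocks overlap before peering VCNs.
# Inputs are expected as a.b.c.d/len; the values used are up to the caller.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  local a b c d
  read -r a b c d <<< "$1"
  echo "$(( (a << 24) | (b << 16) | (c << 8) | d ))"
}

# Return 0 (true) if the two CIDR blocks overlap, 1 otherwise.
# Two blocks overlap exactly when they agree on the first
# min(prefix1, prefix2) bits.
cidr_overlap() {
  local net1=${1%/*} len1=${1#*/}
  local net2=${2%/*} len2=${2#*/}
  local n1 n2 len mask
  n1=$(ip_to_int "$net1")
  n2=$(ip_to_int "$net2")
  len=$(( len1 < len2 ? len1 : len2 ))
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  (( (n1 & mask) == (n2 & mask) ))
}

# Usage (hypothetical Prod and Non-Prod VCN CIDRs):
#   cidr_overlap 10.0.0.0/16 10.1.0.0/16 && echo "CIDRs overlap: LPG peering will not work"
```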
The implementation included:
2.1. Creating LPGs in both VCNs
2.1.1. Add LPG to the Non-Production (NPRD) VCN.
Log in to your tenancy in OCI, then go to the navigation menu → Networking → Virtual Cloud Network.
Select the correct compartment and click on the Non-Prod VCN.
Go to Gateways and click Create Local Peering Gateway.
Now, our LPG has been created in the Non-Prod VCN, and its peering status is New - Not connected to a peer.
2.1.2. Add LPG to the Production (PRD) VCN.
Go to Navigation Menu → Networking → Virtual Cloud Network. Select the Prod VCN, then go to Gateways and click Create Local Peering Gateway.
Repeat the same steps as in the Non-Prod VCN.
Now, our LPG has been created in the Prod VCN, and its peering status is New - Not connected to a peer.
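For reference, the same gateways can also be created with the OCI CLI. This is a sketch that assumes a configured CLI; the function name, shell variables and display names are hypothetical, and only the oci network local-peering-gateway create command and its options come from the CLI itself.

```shell
#!/usr/bin/env bash
# Sketch: CLI equivalent of "Create Local Peering Gateway" in the console.
# Requires a configured OCI CLI. OCIDs and display names are placeholders.

# create_lpg <compartment_ocid> <vcn_ocid> <display_name>
# Prints the OCID of the new LPG.
create_lpg() {
  local compartment_id=$1 vcn_id=$2 name=$3
  oci network local-peering-gateway create \
    --compartment-id "$compartment_id" \
    --vcn-id "$vcn_id" \
    --display-name "$name" \
    --query 'data.id' --raw-output
}

# Usage (hypothetical variables):
#   NPRD_LPG_ID=$(create_lpg "$COMPARTMENT_OCID" "$NPRD_VCN_OCID" "nprd-to-prd-lpg")
#   PRD_LPG_ID=$(create_lpg "$COMPARTMENT_OCID" "$PRD_VCN_OCID" "prd-to-nprd-lpg")
```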
2.2. Establishing peering between them
On the Prod LPG, click Establish Peering Connection.
Then select the Non-Prod VCN and choose the Non-Prod LPG as the Unpeered Peer Gateway.
The peering status of both LPGs then changes to Peered - Connected to a peer.
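The peering step has a CLI equivalent too. A minimal sketch, assuming the OCIDs of the two LPGs are stored in hypothetical shell variables: connect is run once from the Prod LPG, and the peering status can then be read back until it shows PEERED.

```shell
#!/usr/bin/env bash
# Sketch: CLI equivalent of "Establish Peering Connection", run against the
# Prod LPG with the Non-Prod LPG as the peer. OCIDs are placeholders.

# peer_lpgs <requesting_lpg_ocid> <peer_lpg_ocid>
peer_lpgs() {
  oci network local-peering-gateway connect \
    --local-peering-gateway-id "$1" \
    --peer-id "$2"
}

# lpg_status <lpg_ocid>: prints the peering status (for example NEW or PEERED).
lpg_status() {
  oci network local-peering-gateway get \
    --local-peering-gateway-id "$1" \
    --query 'data."peering-status"' --raw-output
}

# Usage (hypothetical variables):
#   peer_lpgs "$PRD_LPG_ID" "$NPRD_LPG_ID"
#   lpg_status "$PRD_LPG_ID"    # expect PEERED once the connection is up
```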
2.3. Updating route tables
2.3.1. Modify the route tables on Prod VCN
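Go to Navigation Menu → Networking → Virtual Cloud Network. Select the Prod VCN, then go to the Routing tab and open the route table used by the database subnet. Click Add Route Rules, set the target type to Local Peering Gateway, enter the CIDR block of the Non-Prod VCN as the destination, and select the Prod LPG as the target.
2.3.2. Modify the route tables on the Non-Prod VCN
Do the same steps as in 2.3.1 on the Non-Prod VCN. However, this time enter the CIDR block of the Prod VCN as the destination and select the Non-Prod LPG (the LPG in this VCN) as the target, then click Add Route Rules.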
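If you prefer the CLI for the route rules, the sketch below builds one rule that sends the peer VCN's CIDR to the local LPG. It is an illustration with placeholder OCIDs and CIDRs and hypothetical function names. Be aware that route-table update replaces the table's entire rule list, so existing rules must be included in the JSON array you pass.

```shell
#!/usr/bin/env bash
# Sketch: add the peering route with the OCI CLI instead of the console.
# NOTE: "route-table update" REPLACES the whole rule list, so the JSON array
# must also contain every existing rule you want to keep.

# lpg_route_rule <destination_cidr> <local_lpg_ocid>
# Prints one route rule as JSON. The target must be the LPG in the SAME VCN
# as the route table.
lpg_route_rule() {
  printf '{"destination":"%s","destinationType":"CIDR_BLOCK","networkEntityId":"%s"}' "$1" "$2"
}

# set_route_rules <route_table_ocid> <json_array_of_rules>
set_route_rules() {
  oci network route-table update --rt-id "$1" --route-rules "$2"
}

# Usage on the Prod VCN (hypothetical values; merge with existing rules first):
#   rule=$(lpg_route_rule "10.1.0.0/16" "$PRD_LPG_ID")
#   set_route_rules "$PRD_RT_ID" "[$rule]"
```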
2.4. Verifying security rules
Establishing a Local Peering Gateway (LPG) between the production and non-production VCNs creates the network path, but traffic is still blocked by default by the VCN security rules (security lists or network security groups). To allow the UAT database to initiate the clone, I needed to add an ingress rule in the Prod VCN that permits traffic from the Non-Prod VCN to the production database.
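As a hedged example of such a rule, a typical stateful ingress rule allows TCP traffic to the listener port (1521, as used by the database link in section 3) from the Non-Prod VCN CIDR. The sketch assumes the database subnet uses a security list (with a network security group, the rule belongs there instead); the CIDR, OCIDs and function names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch of a stateful ingress rule for the Prod database subnet's security
# list: allow TCP (protocol "6") to the listener port from the Non-Prod CIDR.

# listener_ingress_rule <source_cidr> <port>
listener_ingress_rule() {
  printf '{"source":"%s","sourceType":"CIDR_BLOCK","protocol":"6","isStateless":false,"tcpOptions":{"destinationPortRange":{"min":%d,"max":%d}}}' \
    "$1" "$2" "$2"
}

# set_ingress_rules <security_list_ocid> <json_array_of_rules>
# NOTE: this REPLACES all ingress rules on the security list, so include the
# existing rules in the array as well.
set_ingress_rules() {
  oci network security-list update \
    --security-list-id "$1" \
    --ingress-security-rules "$2"
}

# Usage (hypothetical values):
#   rule=$(listener_ingress_rule "10.1.0.0/16" 1521)
#   set_ingress_rules "$PRD_SECLIST_ID" "[$rule]"
```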
3. Creating the Database Link
Once the network connectivity was in place, I created a database link from the target CDB to the production database.
3.1. Create the Common User in the Source CDB (and, Optionally, in the Destination CDB)
-- Run this on Source (Prod) CDB$ROOT
-- Replace PASSWORD with a strong password; the database link in 3.2 must use the same one.
CREATE USER C##CLONE_USER IDENTIFIED BY "PASSWORD" CONTAINER=ALL;
GRANT CREATE SESSION, SELECT ANY DICTIONARY TO C##CLONE_USER CONTAINER=ALL;
GRANT SYSOPER TO C##CLONE_USER CONTAINER=ALL;
3.2. Create the Database Link in the Destination CDB
-- Run this on Destination (UAT) CDB$ROOT
-- The password must match the one set for C##CLONE_USER on the source.
CREATE DATABASE LINK CLONE_LINK
CONNECT TO C##CLONE_USER IDENTIFIED BY "PASSWORD"
USING '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=*****.oraclevcn.com)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME= *****.oraclevcn.com)))';
3.3. Check if the database link is working correctly
select name from v$database@clone_link;
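Before starting a hot clone (source PDB open read write), it is also worth checking two prerequisites on the source CDB through the same link: ARCHIVELOG mode and local undo. This is a sketch run as the oracle user on the UAT host, assuming OS authentication to the UAT CDB root; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check hot-clone prerequisites on the source CDB via CLONE_LINK.
# Runs on the UAT host as oracle; the quoted heredoc keeps $ literal.

check_source_prereqs() {
  sqlplus -s / as sysdba <<'EOF'
SET FEEDBACK OFF PAGESIZE 50
-- Expect ARCHIVELOG
SELECT log_mode FROM v$database@clone_link;
-- Expect TRUE
SELECT property_value AS local_undo_enabled
  FROM database_properties@clone_link
 WHERE property_name = 'LOCAL_UNDO_ENABLED';
EXIT
EOF
}

# Usage:
#   check_source_prereqs
```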
4. First Clone Attempt
The UAT DBCS was configured with only 1 OCPU because it was mainly used for testing purposes.
I started the remote PDB clone operation. Initially, everything looked normal. The clone started successfully, and data transfer began.
However, after approximately one hour, the process appeared to stop making noticeable progress. The session remained active, but the clone was taking much longer than expected.
At this stage, I started investigating possible bottlenecks.
4.1. Clone command
The command used for the clone operation was:
-- Run this on Destination (UAT) CDB$ROOT
[oracle@uat script]$ cat clone_pdb.sql
-- clone_pdb.sql
SET ECHO ON
SET TIME ON
SET TIMING ON
SET PAGESIZE 0
SET LINESIZE 200
PROMPT *** Starting PDB Clone at:
SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD HH24:MI:SS') FROM DUAL;
CREATE PLUGGABLE DATABASE UATPDB FROM PRODPDB@clone_link
REFRESH MODE NONE
PARALLEL 2
KEYSTORE IDENTIFIED BY "PasswordForWallet";
PROMPT *** Opening PDB UATPDB...
ALTER PLUGGABLE DATABASE UATPDB OPEN;
PROMPT *** Saving PDB State...
ALTER PLUGGABLE DATABASE UATPDB SAVE STATE;
PROMPT *** Clone Completed at:
SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD HH24:MI:SS') FROM DUAL;
EXIT;
To ensure the clone process continued even after disconnecting from the session, I executed the script in the background using nohup.
[oracle@uat ~]$ nohup sqlplus / as sysdba @/home/oracle/script/clone_pdb.sql > /home/oracle/script/pdb_clone_full.log 2>&1 &
Explanation of key parameters in the script:
- CREATE PLUGGABLE DATABASE ... FROM ...@clone_link: executes the remote PDB clone over the database link.
- REFRESH MODE NONE: indicates a one-time clone (not refreshable).
- PARALLEL: used to speed up the clone. In the first attempt, since the target had only 1 OCPU, I used PARALLEL 2.
- KEYSTORE IDENTIFIED BY "PasswordForWallet": required in OCI environments, where TDE is enabled, to supply the keystore (wallet) password during the clone.
After creation, the script opens the PDB, saves its state so that it reopens automatically after a CDB restart, and logs start and end timestamps to track execution time.
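To tell a slow clone from a stalled one while the script runs in the background, the log file and V$SESSION_LONGOPS can be checked. A minimal sketch: the default log path matches the nohup command above, the function names are hypothetical, and whether the clone's data copy is reported in V$SESSION_LONGOPS may depend on the Oracle version.

```shell
#!/usr/bin/env bash
# Sketch: check the clone log and in-progress long operations on the UAT CDB.

# clone_log_status [logfile]: print any ORA- errors and report progress.
# Returns 0 = completed, 1 = still running, 2 = errors found.
clone_log_status() {
  local log=${1:-/home/oracle/script/pdb_clone_full.log}
  if grep -n 'ORA-' "$log"; then
    echo "errors found in $log"
    return 2
  fi
  if grep -q 'Clone Completed' "$log"; then
    echo "clone completed"
    return 0
  fi
  echo "clone still running (no errors yet)"
  return 1
}

# show_longops: long-running operations that have not finished yet.
show_longops() {
  sqlplus -s / as sysdba <<'EOF'
SET LINESIZE 200 PAGESIZE 50
SELECT sid, opname, sofar, totalwork, units,
       ROUND(sofar / totalwork * 100, 1) AS pct_done,
       time_remaining
  FROM v$session_longops
 WHERE totalwork > 0 AND sofar <> totalwork;
EXIT
EOF
}
```

Running show_longops a few minutes apart and comparing SOFAR is a simple way to see whether the work is still advancing.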
5. Investigating the Target Database Resources
One of the first things I reviewed was the compute configuration of the target database.
The UAT database was running on:
- Shape: VM.Standard3.Flex
- OCPUs: 1
- Memory: 16 GB
- Network Bandwidth: 1 Gbps
The OCI console clearly showed that the VM was limited to approximately 1 Gbps of network bandwidth.
Since a remote PDB clone involves transferring database blocks over the network and writing them to storage on the target system, this immediately became a potential bottleneck.
To test this theory, I temporarily increased the OCPUs from 1 to 8.
Note that changing the shape (increasing OCPUs) will cause the DBCS to reboot.
To do this, I went to the navigation menu → Oracle AI Database → Oracle Base Database Service.
In the DB System page, I selected the UAT database. Then I went to the Nodes tab, clicked the Actions button, and clicked Change Shape.
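The same change can be scripted with the OCI CLI. A sketch, assuming the DB system OCID is known and that for this flex shape the OCPU count is set through the --cpu-core-count option of oci db system update; the function name is hypothetical. As with the console, expect the node to restart.

```shell
#!/usr/bin/env bash
# Sketch: change the OCPU count of a DB system from the OCI CLI.
# Assumption: --cpu-core-count carries the OCPU count for this flex shape.

# set_db_ocpus <db_system_ocid> <ocpu_count>
set_db_ocpus() {
  local db_system_id=$1 ocpus=$2
  if ! [[ $ocpus =~ ^[1-9][0-9]*$ ]]; then
    echo "invalid OCPU count: $ocpus" >&2
    return 1
  fi
  oci db system update \
    --db-system-id "$db_system_id" \
    --cpu-core-count "$ocpus"
}

# Usage (hypothetical OCID variable):
#   set_db_ocpus "$UAT_DB_SYSTEM_ID" 8   # before the clone
#   set_db_ocpus "$UAT_DB_SYSTEM_ID" 1   # after the refresh (section 8)
```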
As the OCPUs increased, the shape resources scaled accordingly.
The clone operation was significantly faster, and the overall database performance improved. This was a good reminder that in OCI, increasing OCPUs affects more than CPU capacity. It also increases available network bandwidth and storage performance, which can have a direct impact on large database operations.
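A rough way to see why bandwidth matters is to compute the minimum time needed just to move the data: the size in GB multiplied by 8, divided by the bandwidth in Gbps. The sketch below assumes a hypothetical 500 GB PDB and assumes bandwidth scales at 1 Gbps per OCPU (matching the 1 Gbps shown for 1 OCPU). It ignores protocol overhead, storage limits and redo apply, so real clones take longer.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope lower bound for the data transfer of a remote clone.

# transfer_seconds <size_gb> <bandwidth_gbps>
transfer_seconds() {
  echo $(( $1 * 8 / $2 ))
}

# Assumption: 1 Gbps per OCPU; 500 GB is a hypothetical PDB size.
for ocpus in 1 8; do
  secs=$(transfer_seconds 500 "$ocpus")
  echo "$ocpus OCPU(s): at least ${secs}s (~$(( secs / 60 )) min) for 500 GB"
done
```

Under these assumptions the loop reports 4000 s (~66 min, integer division) at 1 OCPU and 500 s (~8 min) at 8 OCPUs.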
6. Investigating Archive Log Management
While reviewing the production environment, I identified a potential risk related to archive log handling.
The production database was protected by Commvault with an hourly archive log backup job that also deleted the archive logs after backing them up, similar to the RMAN command BACKUP ARCHIVELOG ALL DELETE INPUT;
Oracle needs the redo generated on the source while the copy is running in order to finalize a hot remote PDB clone. If Commvault deletes the archive logs containing that redo before the clone has read them, the clone may hang or fail with ORA- errors, because the redo required to make the new PDB consistent is no longer available.
Since the clone runs for an extended period and ends with a final sync phase, this behavior was identified as a potential risk. Increasing the OCPUs helped confirm this: the clone progressed further and no longer appeared stuck, so the remaining risk was the final sync phase, which depends on the archive logs still being available.
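To see whether this is actually happening, the archive logs generated during the clone window can be listed on the production CDB, including whether they have already been deleted (DELETED = YES). A sketch, assuming OS authentication on the production host; the function name and the default two-hour window are hypothetical choices.

```shell
#!/usr/bin/env bash
# Sketch: list recent archive logs on the source and whether they still exist.

# recent_archivelogs [hours]: logs whose first change is within the window.
recent_archivelogs() {
  local hours=${1:-2}
  [[ $hours =~ ^[0-9]+$ ]] || { echo "hours must be an integer" >&2; return 1; }
  # Unquoted heredoc so ${hours} expands; \$ keeps the view name literal.
  sqlplus -s / as sysdba <<EOF
SET LINESIZE 200 PAGESIZE 100
SELECT thread#, sequence#, first_time, deleted, status
  FROM v\$archived_log
 WHERE first_time > SYSDATE - ${hours}/24
 ORDER BY thread#, sequence#;
EXIT
EOF
}

# Usage:
#   recent_archivelogs 3
```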
To eliminate this risk, I temporarily stopped Commvault on production using:
commvault stop
and verified it using:
commvault status
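Stopping the Commvault services pauses all Commvault activity on that client, not only the archive log job, and archive logs keep accumulating in the recovery area while it is stopped. The sketch below (hypothetical function names, assuming the commvault command is on the PATH and a sysdba login on production) pairs the stop with a matching commvault start for after the clone, plus a quick check of recovery area usage.

```shell
#!/usr/bin/env bash
# Sketch: pause and resume Commvault around the clone window on production.

pause_commvault() {
  commvault stop && commvault status
}

resume_commvault() {
  commvault start && commvault status
}

# While backups are paused, archive logs pile up in the recovery area;
# this shows how full it is (run on the production CDB).
fra_usage() {
  sqlplus -s / as sysdba <<'EOF'
SELECT name,
       ROUND(space_used / space_limit * 100, 1) AS pct_used
  FROM v$recovery_file_dest;
EXIT
EOF
}

# Usage:
#   pause_commvault      # before the clone
#   fra_usage            # periodically during the clone
#   resume_commvault     # once the clone has completed
```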
7. Second Clone Attempt
I made the second clone attempt after the following changes compared with the first attempt:
- Scaling the UAT DBCS from 1 OCPU to 8 OCPUs
- Temporarily disabling the archive log backup-and-delete job
- Increasing PARALLEL from 2 to 8 in the clone script
I executed the clone operation again. This time, the refresh completed successfully without any issues. The overall performance was significantly better compared to the original attempt.
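After a clone like this, a quick check that the new PDB is open, that its saved state is recorded, and that no plug-in violations are pending can be run on the UAT CDB root. A sketch with a hypothetical function name; UATPDB is the PDB name from the clone script.

```shell
#!/usr/bin/env bash
# Sketch: post-clone checks on the UAT CDB root.

verify_uat_pdb() {
  sqlplus -s / as sysdba <<'EOF'
SET LINESIZE 200 PAGESIZE 50
-- Expect READ WRITE
SELECT name, open_mode FROM v$pdbs WHERE name = 'UATPDB';
-- Expect a saved state of OPEN
SELECT con_name, instance_name, state
  FROM dba_pdb_saved_states
 WHERE con_name = 'UATPDB';
-- Violations that are not yet resolved
SELECT name, type, message
  FROM pdb_plug_in_violations
 WHERE name = 'UATPDB' AND status <> 'RESOLVED';
EXIT
EOF
}
```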
8. Decreasing the Number of OCPUs to 1
After the refresh was completed successfully, the additional resources were no longer required. To avoid unnecessary costs, I scaled the UAT DBCS back from 8 OCPUs to 1 OCPU.
Conclusion
Although this started as a routine PDB refresh, it became a valuable troubleshooting exercise involving OCI networking, infrastructure sizing, and operational processes.
By configuring Local Peering Gateways, scaling the database from 1 to 8 OCPUs, and reviewing archive log management, I completed the refresh successfully.
This experience highlighted that PDB clone performance issues are not always database-related: network, compute, storage, and operational factors can all have an impact.