dbms_parallel_execute data consistency problem


Question and Answer

Chris Saxon

Thanks for the question, Mstfa.

Asked: October 18, 2023 - 12:08 pm UTC

Last updated: October 27, 2023 - 12:48 pm UTC

Version: 11g - 19c

Viewed 1000+ times

You Asked

Hello experts,

I have two databases: one is 19c and the other is 11g R2. In the 11g database I have a table with approximately 9 billion records, which I am attempting to move using the DBMS_PARALLEL_EXECUTE package.

My code is roughly as follows:

declare

  l_sql_stmt  constant varchar2(500) := 'insert into target_table
               select * from source_table@db_link
               where  rowid between :start_id and :end_id';

  l_chunk_sql constant varchar2(1000) := 'select min(r) start_id, max(r) end_id
               from (
                 select ntile(100) over (order by date_column) grp, rowid r
                 from   source_table@db_link
               )
               group by grp';

  l_task_name constant varchar2(20) := 'my_task4';

begin

  dbms_parallel_execute.create_task(l_task_name);

  dbms_parallel_execute.create_chunks_by_sql(l_task_name, l_chunk_sql, true);

  dbms_parallel_execute.run_task(
    task_name      => l_task_name,
    sql_stmt       => l_sql_stmt,
    language_flag  => dbms_sql.native,
    parallel_level => 16
  );

  dbms_parallel_execute.drop_task(l_task_name);

end;



Every time I run this code, I encounter a data duplication issue in the target database, which is the 19c one. Approximately 1,000 records are duplicated, but each run produces a different number: 1,000 in one run, 2,000 in another. I have executed this code on both databases, i.e. on both 11g and 19c, reversing the direction of the db_link, but the result remains the same.

Furthermore, I also attempted the migration using dbms_parallel_execute.create_chunks_by_rowid from the 11g version to the 19c version, with the same result. In fact, with create_chunks_by_rowid the transfer can only run from the 11g database towards the 19c database, because the chunks are built from the local source table's ROWIDs.

Finally, I am experiencing this issue not just with a single table but with multiple tables as well. Sometimes, it even works correctly on tables that previously had issues. I haven't been able to identify a consistent pattern.


Is there a bug in the DBMS_PARALLEL_EXECUTE package? Is there a bug in the 11g version? Is there a bug in the 19c version? Or am I doing something wrong?

Thank you.

and Chris said...

If you want to split the chunks by ROWID, you should use the DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID procedure. This creates the groups for you without having to write a SQL statement.

The issue with the query above is that rows with consecutive ROWIDs almost certainly have non-consecutive values for date_column.

For example, if you create this table:

create table t as 
  select level c1, mod ( level, 2 ) c2 from dual
  connect by level <= 100
  order  by c1; 


The rows for each value of C2 are spread throughout the whole table. Use NTILE to split these into four groups ordered by C2 and each generated ROWID range will also include rows for the other value of C2.

with grps as (
  select c1, c2, ntile(4) over ( order by c2 ) grp from t
)
  select grp, min ( c1 ), max ( c1 ), 
         min ( rowid ) st_rid, max ( rowid ) en_rid
  from   grps
  group  by grp
  order  by grp;
/* 
       GRP    MIN(C1)    MAX(C1) ST_RID             EN_RID            
---------- ---------- ---------- ------------------ ------------------
         1          2        100 AAAxaRAAOAADH4rAAB AAAxaRAAOAADH4rABj
         2          4         98 AAAxaRAAOAADH4rAAD AAAxaRAAOAADH4rABh
         3          1         49 AAAxaRAAOAADH4rAAA AAAxaRAAOAADH4rAAw
         4         51         99 AAAxaRAAOAADH4rAAy AAAxaRAAOAADH4rABi
*/
select count(*) 
from   t
where  rowid between 'AAAxaRAAOAADH4rAAB' and 'AAAxaRAAOAADH4rABj';
/*
  COUNT(*)
----------
        99
*/


The range generated for the first group covers 99 rows - almost all the data!

There's a high chance this query:

SELECT * FROM source_table@db_link 
WHERE rowid between :start_id AND :end_id


reads the same rows for different chunks. You could avoid this by changing the NTILE query to ORDER BY ROWID, as sketched below. I would just use CREATE_CHUNKS_BY_ROWID instead though.
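A minimal sketch of that fix against the chunk query from the question (ordering by ROWID makes the min/max of each NTILE bucket a contiguous, non-overlapping range):

  l_chunk_sql constant varchar2(1000) := '
    select min(r) start_id, max(r) end_id
    from (
      select ntile(100) over (order by rowid) grp, rowid r
      from   source_table@db_link
    )
    group by grp';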

Rating

  (8 ratings)


Comments

A reader, October 19, 2023 - 6:19 pm UTC

I'm experiencing the same issue when using create_chunks_by_rowid. The ORDER BY date_column code was just an example I wrote; when I split using ORDER BY ROWID instead, I still encounter the same problem.

I'm encountering the issue not only with create_chunks_by_rowid but also with other tables. Let's say I'm running a data transfer on one table and it inserts incorrect records. When I simply restart the code without making any changes, it then inserts the correct records. I'm not sure what's missing. What could be the problem?

A reader, October 19, 2023 - 6:22 pm UTC

When I first asked the question, I mentioned that I get the same result with create_chunks_by_rowid. Nothing changes. Each time the code runs it inserts either the correct data or the wrong data; it's not consistent. You can try this test in your own environment with a table containing over 100 million rows. I'm confident you will see the same issue.
Chris Saxon
October 20, 2023 - 5:04 pm UTC

Ah yes you did. Sorry for missing that.

A reader, October 19, 2023 - 6:47 pm UTC

Could you suggest an alternative to the code shown below? This one is behaving exactly as I described: sometimes it works correctly and sometimes it doesn't, producing duplicate records. I can't understand the reason. Setting the by_row parameter to FALSE doesn't change the result either.

declare
  cursor c1 is select table_name from table_list;
  ic1   c1%rowtype;
  v_sql varchar2(4000);  -- declaration was missing in the original posting
begin

  execute immediate 'alter session enable parallel dml';

  open c1;

  loop
    fetch c1 into ic1;
    exit when c1%notfound;

    dbms_parallel_execute.create_task('mytask1');

    dbms_parallel_execute.create_chunks_by_rowid(
      task_name   => 'mytask1',
      table_owner => 'table_owner',
      table_name  => ic1.table_name,
      by_row      => TRUE,
      chunk_size  => 20000
    );

    v_sql := 'insert into tbl_owner.' || ic1.table_name || '@db_link
              select * from table_owner.' || ic1.table_name || '
              where rowid between :start_id and :end_id';

    dbms_parallel_execute.run_task('mytask1', v_sql, dbms_sql.native, parallel_level => 16);

    dbms_parallel_execute.drop_task('mytask1');

  end loop;

  --commit;

end;

Chris Saxon
October 20, 2023 - 5:01 pm UTC

Why are you enabling parallel DML?

The point of using DBMS_PARALLEL_EXECUTE is to do a DIY parallel operation rather than use the database's built-in parallel functionality. Does disabling this have any effect?
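For instance, you could revert the session setting before the next run:

alter session disable parallel dml;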

To "reader": a few suggestions

Stew Ashton, October 20, 2023 - 9:56 am UTC

In your latest code, you are opening an explicit cursor but not closing it: bad!

I suggest using an implicit cursor:
for rec in (select table_name from table_list) loop
  -- use rec.table_name
end loop;

Also, when creating chunks by rowid, I suggest using by_row => FALSE and a smaller chunk size.

Are you getting duplicate rows, or missing rows, or both?

Is the source table being modified during this processing? Can you lock the source table before copying and unlock it afterwards, at least for testing purposes?

Here's what you could do as a sanity check (a sketch follows this list):
- Create the rowid chunks
- Run select count(*) from <source table> where rowid between :start_id and :end_id for each chunk
- Add the numbers and compare to select count(*) from the whole table
- If the numbers don't make sense, try to find any gaps or overlaps between ranges. That would identify the problem area for further diagnosis.
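As a minimal sketch of that check, assuming a task named 'mytask1' chunking a hypothetical source_table (written to work on 11g):

-- Total rows covered by all chunk ranges; compare with the full table count
select sum(
         ( select count(*)
           from   source_table t
           where  t.rowid between c.start_rowid and c.end_rowid )
       ) total_covered
from   user_parallel_execute_chunks c
where  c.task_name = 'mytask1';

select count(*) from source_table;  -- should match total_covered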

Best regards,
Stew
Chris Saxon
October 20, 2023 - 4:59 pm UTC

Good suggestions and questions

Mstfa, October 20, 2023 - 11:52 am UTC

Hello,

Thank you for the response. I had forgotten to close the cursor, you're absolutely right. I also set the by_row parameter to FALSE and tried a wide range of values between 1000 and 100,000, but the result still doesn't change. Both duplicate records and records that haven't been transferred at all can occur. This varies with each run.

The tables I mentioned are not transactional tables; they're tables that change data once a day. However, I will still attempt to apply locks and perform the operation.

After splitting into chunks, I individually checked the record counts, and they match exactly with the total record count of the table. However, after the code runs and I see it marked as FINISHED in the STATUS, when I check the record count in the target environment, it's often incorrect. Of course, sometimes I also see the correct record count.
Chris Saxon
October 20, 2023 - 5:03 pm UTC

So what exactly happens? How are you detecting the duplicate and missing rows? How certain are you there's no one else changing the data?

Please provide a complete test case we can run to reproduce the problem.

Can you log the number of rows inserted per chunk?

Stew Ashton, October 21, 2023 - 8:59 am UTC

In the code that is executed for each chunk, after the insert can you log the number of rows inserted (SQL%ROWCOUNT) into a little log table? Be sure to add something (such as the value of :start_id) that will identify the chunk, as well as the table name (if more than one table).

Then compare those numbers with the count(*) numbers you got before, and with the total number of rows in the target table. That will narrow down the problem area even more.

Finally, in a test environment you might consider adding a column to the target table, which you could use to identify which subtask actually inserted the row. This would tell you which chunks worked together to produce duplicate rows.
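As an illustration of the SQL%ROWCOUNT logging above, a minimal sketch assuming a hypothetical chunk_log table; RUN_TASK also accepts an anonymous PL/SQL block with the :start_id and :end_id binds, which lets you capture the row count:

-- Hypothetical log table
create table chunk_log (
  table_name    varchar2(128),
  start_id      rowid,
  rows_inserted number,
  logged_at     date default sysdate
);

-- Chunk statement: wrap the insert in a block so SQL%ROWCOUNT can be captured
v_sql := q'[
  declare
    l_rows pls_integer;
  begin
    insert into target_table
    select * from source_table
    where  rowid between :start_id and :end_id;

    l_rows := sql%rowcount;

    insert into chunk_log (table_name, start_id, rows_inserted)
    values ('SOURCE_TABLE', :start_id, l_rows);
  end;
]';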

One thing I seem to remember from previous work on ROWID ranges:
- in SQL, ROWID is a function, not data
- in PL/SQL, ROWID is a VARCHAR2 representation of what the ROWID function returns
- this is why there are CHARTOROWID and ROWIDTOCHAR functions.

Why is this significant? Because SQL ROWIDs and their character representations don't always sort the same way! This led to bugs in my code that dealt with ROWID ranges. I don't know whether DBMS_PARALLEL_EXECUTE has known bugs, but I would be extra careful with any homemade code.
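A rough illustration of that sorting difference (the character form of a ROWID is base-64 encoded, so character order need not match ROWID order):

-- Compares values using ROWID semantics
select r from (select rowid r from source_table) order by r;

-- Compares the VARCHAR2 representation, which can give a different order
select r from (select rowidtochar(rowid) r from source_table) order by r;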

Best regards,
Stew
Connor McDonald
October 22, 2023 - 11:28 pm UTC

Thanks for the input, Stew.

Test scenario

Mstfa, October 24, 2023 - 4:44 am UTC

Hi Again,

I have prepared a test scenario as follows. Can you interpret it?


/*
The source database is 11g.
The target database is 19c.

All code was executed on the 11g database.
*/

alter session enable parallel dml;

create table sch_11g.trns_test_table nologging
as
select * from all_tables;

begin
  for r in 1..15 loop
    insert /*+ append */ into sch_11g.trns_test_table
    select * from sch_11g.trns_test_table;
    -- commit each pass: after a direct-path insert the table can't be
    -- queried again in the same transaction (ORA-12838)
    commit;
  end loop;

  dbms_stats.gather_table_stats(
    ownname          => 'SCH_11G',
    tabname          => 'TRNS_TEST_TABLE',
    estimate_percent => dbms_stats.auto_sample_size,
    granularity      => 'ALL',
    degree           => 8
  );
end;

 
declare
  v_sql varchar2(20000);
begin

  dbms_parallel_execute.create_task('mytask1');

  dbms_parallel_execute.create_chunks_by_rowid(
    task_name   => 'mytask1',
    table_owner => 'SCH_11G',
    table_name  => 'TRNS_TEST_TABLE',
    by_row      => FALSE,
    chunk_size  => 1000
  );

   v_sql := 'insert  into SCH_19C.TRNS_TEST_TABLE@db_link1
                         (OWNER, TABLE_NAME, TABLESPACE_NAME, CLUSTER_NAME, IOT_NAME, STATUS, PCT_FREE, PCT_USED, INI_TRANS, MAX_TRANS,
                          INITIAL_EXTENT, NEXT_EXTENT, MIN_EXTENTS, MAX_EXTENTS, PCT_INCREASE, FREELISTS, FREELIST_GROUPS, LOGGING, BACKED_UP, NUM_ROWS,
                          BLOCKS, EMPTY_BLOCKS, AVG_SPACE, CHAIN_CNT, AVG_ROW_LEN, AVG_SPACE_FREELIST_BLOCKS, NUM_FREELIST_BLOCKS, DEGREE, INSTANCES,
                          CACHE, TABLE_LOCK, SAMPLE_SIZE, LAST_ANALYZED, PARTITIONED, IOT_TYPE, TEMPORARY, SECONDARY, NESTED, BUFFER_POOL, FLASH_CACHE, CELL_FLASH_CACHE,
                          ROW_MOVEMENT, GLOBAL_STATS, USER_STATS, DURATION, SKIP_CORRUPT, MONITORING, CLUSTER_OWNER, DEPENDENCIES, COMPRESSION, COMPRESS_FOR, DROPPED,
                          READ_ONLY, SEGMENT_CREATED, RESULT_CACHE)
                  select  OWNER, TABLE_NAME, TABLESPACE_NAME, CLUSTER_NAME, IOT_NAME, STATUS, PCT_FREE, PCT_USED, INI_TRANS, MAX_TRANS,
                          INITIAL_EXTENT, NEXT_EXTENT, MIN_EXTENTS, MAX_EXTENTS, PCT_INCREASE, FREELISTS, FREELIST_GROUPS, LOGGING, BACKED_UP, NUM_ROWS,
                          BLOCKS, EMPTY_BLOCKS, AVG_SPACE, CHAIN_CNT, AVG_ROW_LEN, AVG_SPACE_FREELIST_BLOCKS, NUM_FREELIST_BLOCKS, DEGREE, INSTANCES,
                          CACHE, TABLE_LOCK, SAMPLE_SIZE, LAST_ANALYZED, PARTITIONED, IOT_TYPE, TEMPORARY, SECONDARY, NESTED, BUFFER_POOL, FLASH_CACHE, CELL_FLASH_CACHE,
                          ROW_MOVEMENT, GLOBAL_STATS, USER_STATS, DURATION, SKIP_CORRUPT, MONITORING, CLUSTER_OWNER, DEPENDENCIES, COMPRESSION, COMPRESS_FOR, DROPPED,
                          READ_ONLY, SEGMENT_CREATED, RESULT_CACHE
                  from  SCH_11G.TRNS_TEST_TABLE
                  where rowid between :start_id and :end_id
                  ';

  dbms_parallel_execute.run_task('mytask1', v_sql, dbms_sql.native, parallel_level => 32);

  --dbms_parallel_execute.drop_task('mytask1');
end;


select * from user_parallel_execute_tasks;

TASK_NAME   CHUNK_TYPE     STATUS
mytask1     ROWID_RANGE    FINISHED

-- Results:

select count(1) from sch_11g.trns_test_table             -- 130,198,960

select count(1) from sch_19c.trns_test_table@db_link1    -- 130,200,150



As you can see, there is duplication of certain rows in the target database. What is your recommendation?
Chris Saxon
October 26, 2023 - 1:44 pm UTC

Try running this to see if a ROWID is in two chunks and report back what you find:

select t.rowid, count(*) 
from   user_parallel_execute_chunks
join   sch_11g.trns_test_table t
on     t.rowid between start_rowid and end_rowid
group  by t.rowid
having count(*) > 1;
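It may also be worth checking the per-chunk status and any recorded errors for the task:

select status, count(*)
from   user_parallel_execute_chunks
where  task_name = 'mytask1'
group  by status;

select chunk_id, error_code, error_message
from   user_parallel_execute_chunks
where  task_name = 'mytask1'
and    status = 'PROCESSED_WITH_ERROR';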

It wasn't a duplicate, it turns out it was corrupt data.

A reader, October 27, 2023 - 12:22 am UTC

I ran the same queries again on different data. The results are as follows. I thought there were duplicate records, so I performed a MINUS between the two tables. It turns out the data it inserted was corrupt. There is normally no such corruption in the source table, so why is it inserting data like this? And why does it only sometimes happen on the same table? Running the same code, it inserts either the correct data or corrupt data like this.

select count(1) from sch_11g.trns_test_table             -- 176,551,767

select count(1) from sch_19c.trns_test_table@db_link1    -- 176,553,227

select t.rowid, count(*)
from   user_parallel_execute_chunks
join   sch_11g.trns_test_table t
on     t.rowid between start_rowid and end_rowid
group  by t.rowid
having count(*) > 1;   -- No rows returned.

 
create table sch_11g.result as
select * from sch_19c.trns_test_table@db_link1
minus
select * from sch_11g.trns_test_table;


select first_column from sch_11g.result
/* 1460 rows returned. I can't retrieve all the columns, because doing so
   throws "ORA-01455: converting column overflows integer datatype". */

first_column
--------------
"xz­xy
5!xs
*xo17xs
xu[1]
-7xt

2xm
4wÆ              ‑xm,(xi
*/xn9
xg[1]
xw
xx7x{
‑xx
",xz
&xz
[1]9x{[1]
*4xj
‑xj
­xw
)x{[1

xj­‑xl
xp
ÿÿ
ÿ
8000007000440

Chris Saxon
October 27, 2023 - 12:48 pm UTC

Yikes! I don't know what's going on here; with data corruption like this, speak to Oracle Support.

In the meantime, you may want to explore other options for migrating the rows, such as Data Pump.
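For reference, a rough Data Pump sketch pulling the table directly over a network link (run from the 19c side; the credentials, directory, and link names here are placeholders, and version compatibility between 11g and 19c should be verified first):

# Import straight from the 11g source over the database link - no dump file needed
impdp sch_19c/password \
  network_link=db_link1 \
  tables=SCH_11G.TRNS_TEST_TABLE \
  remap_schema=SCH_11G:SCH_19C \
  directory=DATA_PUMP_DIR \
  logfile=trns_test_table_imp.log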
