Challenges of Large Scale DML using T-SQL
Using T-SQL to insert, update, or delete large amounts of data from a table will result in some unexpected difficulties if you've never done it before.
Let's say you have a table from which you want to delete millions of records. If the goal were to remove every record, we could simply use TRUNCATE. However, if we want to remove only the records that meet certain criteria, then executing something like this will cause more trouble than it is worth.
Why not run the following in a production environment?
DELETE FROM myTable
WHERE columnA = 'whatever';
- Transaction log growth – expect a large transaction like this to cause the transaction log to expand. Growing the database files is one of the most expensive requests SQL Server makes to the OS, so hopefully the initial sizing was good and the growth settings are reasonable. (A quick monitoring sketch follows this list.)
- Blocking – a delete of this size is likely to escalate to a table lock, and under concurrent activity that means excessive blocking for other sessions.
- Rollbacks – if the SPID is killed, the transaction enters a rollback state, and for very large transactions that can take a long time. Don't even think about killing the SQL Server process or stopping the service during a rollback unless you want a corrupt database!
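If you want to watch the first two problems happen while a statement like this runs, a couple of standard commands from a separate session are enough. This is just a monitoring sketch using stock SQL Server commands; nothing here is specific to the example table:

--Transaction log size and percent used, per database
DBCC SQLPERF(LOGSPACE);

--Current locks; watch for OBJECT-level exclusive locks once escalation kicks in
select request_session_id, resource_type, request_mode, request_status
from sys.dm_tran_locks
where resource_type = 'OBJECT';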
What can we do to improve this?
We follow an important maxim of computer science – break large problems down into smaller problems and solve them.
We break the transaction up into smaller batches to get the job done. Each of the problems above can be relieved in this manner. Let's take a look at some examples.
How to Use T-SQL to Delete Millions of Rows
Combine the TOP operator with a WHILE loop to delete rows in batches. Tracking progress is easier if we count iterations and either print them or load them into a tracking table.
Let's set up an example and work from simple to more complex.
--Make table with millions of records
CREATE TABLE Colors
(
    colA varchar(10)
);
GO --separate batch so GO 1000000 below repeats only the INSERT

INSERT INTO Colors (colA) VALUES('Red');
GO 1000000

select * from Colors; --1,000,000

--Delete in batches
declare @rc int; --record count
set @rc = 1;

while @rc > 0
begin
    --deleting by sets of 10,000
    delete top (10000) from Colors;
    set @rc = @@ROWCOUNT;
end
Now let’s print some indication of progress.
--Delete in batches with progress printed
declare @rc int;
set @rc = 1;

while @rc > 0
begin
    delete top (10000) from Colors;
    set @rc = @@ROWCOUNT;
    raiserror('deleted 10,000 rows',10,1) with nowait;
end
A better way is to store progress in a table instead of printing to the screen.
/*
Delete in batches with progress put into table
*/

--make table for holding progress
create table progress
(
    tableName varchar(128)
    ,totalRecords int
    ,recordsDeleted int
    ,datetimestamp datetime
);

insert into progress (tableName, recordsDeleted) values('Colors',0);

declare @totalRecords int;
select @totalRecords = count(*) from Colors;

update progress set totalRecords = @totalRecords;

select * from progress;

--begin to delete
declare @rc int;
set @rc = 1;

while @rc > 0
begin
    delete top (10000) from Colors;
    set @rc = @@ROWCOUNT;

    update progress
    set recordsDeleted += @rc
        ,datetimestamp = getdate();
end

--View in another query thread
select * from progress with(nolock);
Finally let’s incorporate dynamic SQL.
/*
Delete in batches (parameter) with progress put into table
Keep same progress table logic as above
*/

declare @sql nvarchar(max);

set @sql = '
declare @batchSize int;
declare @rc int; --record count

set @batchSize = 10000;
set @rc = 1;

while @rc > 0
begin
    delete top (@batchSize) from Colors;
    set @rc = @@ROWCOUNT;

    update progress
    set recordsDeleted += @rc
        ,datetimestamp = getdate();
end';

exec sp_executesql @sql;
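As written, the dynamic SQL does not buy us much, since the table name is still hard-coded. Its value shows up once the target table itself becomes a parameter. Here is a minimal sketch of that idea; the @tableName variable and the QUOTENAME wrapping are my additions, not part of the original example:

--Hypothetical extension: batch-delete from whatever table name is passed in
declare @tableName sysname = N'Colors'; --assumed input
declare @sql nvarchar(max);

set @sql = N'
declare @rc int = 1;
while @rc > 0
begin
    delete top (10000) from ' + QUOTENAME(@tableName) + N';
    set @rc = @@ROWCOUNT;
end';

exec sp_executesql @sql;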
How to Use T-SQL to Insert Millions of Rows
Similar principles can be applied to inserting large amounts of data from one table to another. We cannot simply take the code above and replace the DELETE statement with an INSERT statement.
What do you see wrong here?
--Source table
CREATE TABLE sourceTable
(
    colA varchar(10)
);

--Target table
CREATE TABLE targetTable
(
    colA varchar(10)
);
GO --separate batch so GO 1000000 below repeats only the INSERT

INSERT INTO sourceTable(colA) VALUES('Red');
GO 1000000

declare @rc int;
set @rc = 1;

--WRONG!
while @rc > 0
begin
    insert into targetTable
    select top 10000 * from sourceTable;

    set @rc = @@ROWCOUNT;
end
The problem is a logic error – we’ve created an infinite loop!
To avoid that we will need to keep track of what we are inserting. Here are a few examples:
--Keeping track of inserts
declare @rc int;
declare @tableRows int;
declare @batchSize int;
declare @start int;
declare @end int;

set @rc = 1;
select @tableRows = count(*) from sourceTable; --initialize to total number of rows in table
set @batchSize = 10000;
set @start = 1;
set @end = @start + @batchSize - 1;

while @rc < @tableRows
begin
    with cte(colA, RowNbr) as
    (
        select ColA, ROW_NUMBER() over(order by colA) as 'RowNbr'
        from sourceTable
    )
    insert into targetTable(colA)
    select colA
    from cte
    where RowNbr between @start and @end;

    set @rc += @batchSize;
    set @start = @end + 1;
    set @end = @start + @batchSize - 1;
end
Assume here we want to migrate a table – move the records in batches and delete from the source as we go.
create table #tmp
(
    rownbr int
    ,colA varchar(10)
);

declare @rc int;
declare @tableRows int;
declare @batchSize int;

set @rc = 1;
select @tableRows = count(*) from sourceTable; --initialize to total number of rows in table
set @batchSize = 10000;

while @rc < @tableRows
begin
    --Load records into temp table
    with cteInsertingRows as
    (
        select ColA, ROW_NUMBER() over(order by colA) as 'RowNbr'
        from sourceTable
    )
    insert into #tmp (rownbr, colA)
    select RowNbr, ColA
    from cteInsertingRows
    where RowNbr between 1 and @batchSize;

    --insert into target table
    insert into targetTable
    select colA from #tmp;

    --delete from source table
    delete st
    from (select colA, ROW_NUMBER() over(order by colA) as 'rownbr' from sourceTable) st
    inner join #tmp t
        on st.colA = t.colA
        and st.rownbr = t.rownbr;

    --clear temp table for next iteration
    truncate table #tmp;

    set @rc += @batchSize;
end
How to Use T-SQL to Update Millions of Rows
We can employ similar logic to update large tables.
--Updates in batches by row numbers
declare @rc int;
declare @tableRows int;
declare @batchSize int;
declare @start int;
declare @end int;

set @rc = 1;
select @tableRows = count(*) from Colors;
set @batchSize = 10000;
set @start = 1;
set @end = @start + @batchSize - 1;

while @rc < @tableRows
begin
    with cte(colA, RowNbr) as
    (
        select ColA, ROW_NUMBER() over(order by colA) as 'RowNbr'
        from Colors
    )
    update cte
    set colA = 'Blue'
    from cte
    where RowNbr between @start and @end;

    set @rc += @batchSize;
    set @start = @end + 1;
    set @end = @start + @batchSize - 1;
end
Concluding Thoughts
Breaking down one large transaction into many smaller ones can get the job done with less impact to concurrency and the transaction log.
This is just a start. Consider what we left out:
- T-SQL is not the only way to move data. We can also consider bcp, SSIS, C#, etc.
- When going across linked servers, network waits may be the biggest cost. Keep that in mind as you consider the right method.
- Joins play a role – whether local or remote. Be mindful of foreign keys and referential integrity.
- Indexes can make a difference in performance.
- Dynamic SQL and cursors can be useful if you need to iterate through sys.tables to perform operations on many tables (see the sketch after this list).
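To illustrate that last point, here is a minimal sketch that walks sys.tables with a cursor and builds a batched delete for each matching table. The name filter and the choice to PRINT rather than execute are my own assumptions, not from the post:

declare @tableName sysname;
declare @sql nvarchar(max);

declare tableCursor cursor local fast_forward for
    select name from sys.tables
    where name like 'Stage%'; --assumed naming convention; adjust to your own

open tableCursor;
fetch next from tableCursor into @tableName;

while @@FETCH_STATUS = 0
begin
    set @sql = N'
    declare @rc int = 1;
    while @rc > 0
    begin
        delete top (10000) from ' + QUOTENAME(@tableName) + N';
        set @rc = @@ROWCOUNT;
    end';

    print @sql; --review first; swap for exec sp_executesql @sql when ready

    fetch next from tableCursor into @tableName;
end

close tableCursor;
deallocate tableCursor;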
UPDATE – 11/3/18
I want to clarify some things about this post. I got good feedback from random internet strangers and want to make sure everyone understands this. But first…
WARNING! If you are not careful when batching you can actually make things worse!
WARNING! These are contrived examples meant to demonstrate a methodology. Please do NOT copy them and run in PROD!
WARNING! There is no “one size fits all” way to do this. Please test this and measure performance before running in PROD! Don’t just take code blindly from the web without understanding it.
Pointing out a couple things:
- Be mindful of your indexes. Sometimes it can be better to drop indexes before large scale DML operations.
- The INSERT piece is really a migration. You can use an OUTPUT clause on the DELETE to do the delete and the insert in one statement (see the sketch after this list). You can also move the row number calculation out of the loop so it only executes once.
- Here is a good example you should read to illustrate testing the performance of batches.
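To make the OUTPUT point concrete, here is a minimal sketch of the migration rewritten around DELETE ... OUTPUT, reusing the sourceTable and targetTable from the examples above. It avoids the ROW_NUMBER calculation and the temp table entirely (assuming targetTable has no triggers or foreign keys, which OUTPUT ... INTO does not allow):

--Migrate in batches: each deleted row is captured straight into the target table
declare @rc int = 1;

while @rc > 0
begin
    delete top (10000) from sourceTable
    output deleted.colA into targetTable (colA);

    set @rc = @@ROWCOUNT;
end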
Like what you are reading? Tell your friends. Tell your foes. Please subscribe!