Break Up Large Table Query Into Results of N Rows

pbd22

unread,

Jan 31, 2012, 9:57:42 AM1/31/12

to

Hi.

I am working in SQL Server 2005 and want to break up a table of 1M
rows into distinct results of 80,0000.

I feel like this is basically

use [database]

select * from [table]
where email_address like '%gmail.com'
group by ????

I am messing up with the group by (if this is the right way to go).
How do achieve this?

Thanks.

pbd22

unread,

Jan 31, 2012, 3:50:27 PM1/31/12

to

Anybody? To people use this group any more?

Bob Barrows

unread,

Jan 31, 2012, 4:49:46 PM1/31/12

to

We sure do, but I, for one, don't understand your question. maybe show us:

CREATE TABLE statement
INSERT STATEMENT with sample data
desired results from supplied sample data

and we'll have a better chance of understanding your request (first off,
what's 80,0000 mean?)

Erland Sommarskog

unread,

Jan 31, 2012, 5:03:05 PM1/31/12

to

pbd22 (dus...@gmail.com) writes:
> On Jan 31, 9:57 am, pbd22 <dush...@gmail.com> wrote:
>> I am working in SQL Server 2005 and want to break up a table of 1M
>> rows into distinct results of 80,0000.
>>
>> I feel like this is basically
>>
>> use [database]
>>
>> select * from [table]
>> where email_address like '%gmail.com'
>> group by ????
>>
>> I am messing up with the group by (if this is the right way to go).
>> How do achieve this?
>>
>

> Anybody? To people use this group any more?

I am here! However, I only look in here about once a day. The traffic
does not really warrant anymore. If you want speedy answers to your
questioms, this is a busy place:
http://social.msdn.microsoft.com/Forums/en-US/transactsql/threads
Although, a bit too many people answers questions that goes outside
what they really know.

Anyway, your problem is a little vague. What is the purpose with
this division? And must the batches be exactly 80000 rows in size
or can the number vary?

--
Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx

pbd22

unread,

Feb 1, 2012, 10:34:49 AM2/1/12

to

On Jan 31, 5:03 pm, Erland Sommarskog <esq...@sommarskog.se> wrote:

Hi Erland and Bob,

Thanks both a bundle for your responses. Looks like there is life on
planet
Google Groups after all, much appreciated!

The specifics is that we are doing email deployments but google is
moving
all of the email sent to gmail users to their spam boxes. As a result,
we have
to "chunk" the gmail users out of the total amount and send in
manageable
batches. We have figured that 80,000 per batch out of the total gmail
users
in the table is possible.

And, Erland, to answer your question, I would prefer to turn this into
a stored
procedure we can use for the purpose of "chunking" email addresses.
Accordingly,
the number per batch should be a variable parameter.

Bob, the table we are querying against is pretty simple. Essentially,
it has one
one column - "email_address" which is a varchar. Its data is about 1
million email
addresses (but that number changes often). The result set table(s)
should only have
two columns, the count (INT) and the email_address (varchar). Please
see below.

The query I am trying to write is supposed to dump each result set
(batch)
to a text file in some folder on the hard drive. Each result set
should have
a count and the email addresses as columns. Something like this:

RESULT 1:

[COUNT] [EMAIL_ADDRESS]
1 na...@gmail.com
2 na...@gmail.com
3 na...@gmail.com
4 na...@gmail.com
5 na...@gamil.com
...
80,000 na...@gmail.com

RESULT 2:

[COUNT] [EMAIL_ADDRESS]
80,001 na...@gmail.com
80,002 na...@gmail.com
80,003 na...@gmail.com
80,004 na...@gmail.com
80,005 na...@gamil.com
...
16,000 na...@gmail.com

And so on up to the to total amount of the gmail addresses out of the
original table.

So, the statement should read something like this (pseudo code):

select all distinct users
from the master table
where email_address like '%gmail.com'
return in batches of N (such as 80,000)
and write each batch to a text file on the
hard drive.

I hope I have explained myself well. Let me know if anything is
unclear.

Thanks a bundle for your help.

Bob Barrows

unread,

Feb 1, 2012, 3:36:07 PM2/1/12

to

pbd22 wrote:
> On Jan 31, 5:03 pm, Erland Sommarskog <esq...@sommarskog.se> wrote:
>> pbd22 (dush...@gmail.com) writes:
>>> On Jan 31, 9:57 am, pbd22 <dush...@gmail.com> wrote:
>>>> I am working in SQL Server 2005 and want to break up a table of 1M
>>>> rows into distinct results of 80,0000.
>>

<snip>

> And so on up to the to total amount of the gmail addresses out of the
> original table.
>
> So, the statement should read something like this (pseudo code):
>
> select all distinct users

You keep making a point of saying "distinct". Does that imply that there are
duplicate email addresses in that 1-million row table?

> from the master table
> where email_address like '%gmail.com'
> return in batches of N (such as 80,000)
> and write each batch to a text file on the
> hard drive.

I assume you can handle this part, correct? It's only batching the data that
you need help with?

>
> I hope I have explained myself well. Let me know if anything is
> unclear.
>
> Thanks a bundle for your help.

SQL 2008 has some paging functionality builtin but you're using SQL 2005.
I can think of a couple approaches. Here's one:
1. Create a temp table (#batches) with an identity column (indentcol) and an
email column (email). Insert the distinct email addresses into it:
insert #batches (email)
select distinct email_address from master_table where email_address like
'%gmail.com'

Then use a WHILE loop to retrieve the batches, using a variable to keep
track of them.
declare @batchsize int --convert this to a parameter for your sproc
set @batchsize=80000

declare @lastrec int, @endrec int
set @lastrec=(select max(identcol) from #batches)
set @endrec=@batchsize
WHILE @endrec-@batchsize<=@lastrec
BEGIN
select email from #batches where identcol >=@lastrec
--process the batch
delete #batches where identcol >=@lastrec
set @endrec=@endrec + @batchsize
END

Erland Sommarskog

unread,

Feb 1, 2012, 5:54:36 PM2/1/12

to

pbd22 (dus...@gmail.com) writes:
> The specifics is that we are doing email deployments but google is
> moving all of the email sent to gmail users to their spam boxes. As a
> result, we have to "chunk" the gmail users out of the total amount and
> send in manageable batches. We have figured that 80,000 per batch out of
> the total gmail users in the table is possible.

>...

> Bob, the table we are querying against is pretty simple. Essentially,
> it has one one column - "email_address" which is a varchar. Its data is
> about 1 million email addresses (but that number changes often). The
> result set table(s) should only have two columns, the count (INT) and
> the email_address (varchar). Please see below.

The simple-minded solution would be:

WITH numbered AS (
SELECT email, row_number() OVER(ORDER BY email) AS rowno
FROM addresses
WHERE email LIKE '%@gmail.com'
)
SELECT email
FROM numbered
WHERE rowno > (@batchno - 1) * @batchsize
AND rowno <= (@batchno - 1) * @batchsize

But it would be far more efficient to do:

CREATE TABLE gmail_addresses (rowno int NOT NULL,
gmail nvarchar(255) NOT NULL,
CONSTRAINT pk_gmail PRIMARY KEY CLUSTERED (rowno),
CONSTRAINT pk_unique UNIQUE NONCLUSTERED(gmail))

INSERT gmail_addresses(rowno, gmail)
SELECT row_number() OVER(ORDER BY (SELECT 1)), gmail
FROM (SELECT DISTINCT email
FROM addresses
WHERE email like '%@gmail.com') AS x

Then you have materialised the row number once for all, and a selection
of 80000 accounts will be quick.