OutboxRecord Table - NHibernate and Table Partitioning


Corith Malin

Jan 12, 2016, 3:11:30 PM
to Particular Software
Helpful information to include
Product name: NServiceBus
Version: 5.x
Description: We have an endpoint that needs the Outbox feature enabled because we need to guarantee at-most-once delivery.  This service is putting about 7,500 messages per minute into an Azure Service Bus queue.  We're seeing a large slowdown in endpoint processing because of the way Outbox works, which is fine and expected, but because of the many records it has to remove on each delete (we keep the messages for 5 minutes and clean up every couple of minutes), the whole table gets locked.  Add to that the fact that we have 10 instances of this endpoint spun up in Azure worker roles, and we get enough contention and deadlocks that the delete fails and the table fills up...

What can be done about this? My thought would be to enable partitioning on that SQL server based on the DispatchedAt column, but how can I get NServiceBus to create a table partition on that table, since the host controls the schema for it?  Any other suggestions?

Thanks!

william...@particular.net

Jan 14, 2016, 6:15:02 AM
to Particular Software
Hey Corith

You can try changing the indexes and see if that helps. If you come across a change that fixes your performance, please let us know so we can add it to our implementation!

Regarding the deadlocks, can I ask you to confirm:
  1. That there is indeed a clustered index on the primary key?
  2. Are you running SQL Server inside an Azure VM, or a SQL Azure instance?
  3. If SQL Azure, what pricing tier are you using?
You could try setting the cleanup to a relatively short time period, say 0.5 seconds, and see if that helps by making the delete chunks smaller. If you're running SQL Azure, your pricing tier could also have an impact on performance.
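For reference, with NHibernate persistence both the retention period and the cleanup frequency are controlled through appSettings. The keys below are from memory for the 5.x NHibernate persistence, so double-check them against the documentation for your exact version:

```xml
<appSettings>
  <!-- How long dispatched outbox records are kept before cleanup deletes them. -->
  <add key="NServiceBus/Outbox/NHibernate/TimeToKeepDeduplicationData"
       value="00:05:00" />
  <!-- How often the cleanup runs; a shorter interval means smaller delete chunks. -->
  <add key="NServiceBus/Outbox/NHibernate/FrequencyToRunDeduplicationDataCleanup"
       value="00:00:30" />
</appSettings>
```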

We have a couple ongoing initiatives aimed at improving the Outbox table performance, so you should keep an eye out for our release notices too!

William

Corith Malin

Jan 20, 2016, 8:52:02 AM
to Particular Software
Hey William,
  There is clustering on the primary key.
  We are running SQL Azure, not SQL inside an Azure VM.
  We're at the P4 pricing tier right now.

  What we've done to mitigate this...

  1. SQL Azure has no SQL Agent jobs, nor does it re-index your tables automatically.  Instead we created an Azure Automation job that uses a PowerShell workflow to re-index all indexes in a given database.  Currently we run this every hour, as fragmentation on OutboxRecord soars past 90% after only 30 minutes.
  2. The other issue is that since we have 7 instances of this service running in Azure worker roles, we cannot set the cleanup interval to a short time frame, because then each instance tries to issue delete commands against the database every 5 minutes.  Instead we set them to one day, knowing that the delete issued by the NServiceBus host will fail.
  3. To purge old records from the OutboxRecord table, we've created an Azure Automation job that uses a PowerShell workflow.  It looks at the OutboxRecord table and deletes anything with DispatchedAt <= DATEADD(HOUR, -1, GETUTCDATE()).
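The core of that purge (table and column names assume the default NHibernate outbox schema; the attached workflow is the authoritative version) boils down to something like:

```sql
-- Delete dispatched outbox records older than one hour.
-- Consolidated into a single scheduled job so only one
-- process issues the delete, instead of one per service.
DELETE FROM OutboxRecord
WHERE DispatchedAt <= DATEADD(HOUR, -1, GETUTCDATE());
```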

I've attached both workflows for other people who might need the same thing.  #3 allows the deletes to be consolidated and run only once (as opposed to once per service).  This seems to have stopped the table from filling up and fixed the slow performance.

The next thing we're trying is to recreate that table with partitions, and then roll off old partitions. This will hopefully let us keep our data a bit longer without needing to be so aggressive in our cleanup.  It should also reduce the amount of fragmentation, which should in turn improve INSERT performance.
Remove-ExpiredSQLTableRecords.ps1
Update-SQLIndexRunbook.ps1

Jérémie Desautels

Jan 20, 2016, 7:13:47 PM
to Particular Software
I have a suggestion based on my experience with a similar situation.  I faced a similar problem where I needed to delete a lot of records from a large table in Azure SQL while avoiding table locks, deadlocks, etc. (my scenario was not related to NServiceBus at all).  In my case the table contained several million records, and deleting all records matching my criteria caused table locks for several minutes, which was unacceptable.  I ended up solving the issue by limiting my DELETE statement to a more 'reasonable' number of records and repeating the DELETE in a loop until there were no more records to delete.  My T-SQL query looked something like this:

WHILE 1 = 1
BEGIN
    -- Delete at most 1000 matching rows per iteration, keeping
    -- each transaction (and the locks it holds) small.
    ;WITH t AS
    (
        SELECT TOP 1000 *
          FROM MyTable
         WHERE MyField = 'abc123'
    )
    DELETE t;

    -- Fewer than 1000 rows deleted means nothing is left to do.
    IF @@ROWCOUNT < 1000 BREAK;
END


I think your idea of partitioning the table is even better, though. It would be much more efficient to drop a partition than to delete a bunch of records. However, this wasn't possible until recently because Azure SQL did not support partitioning. Microsoft added this capability relatively recently and I haven't revisited it. Besides, I could not figure out how to partition on a date field so that records move to a new partition as they age.
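For what it's worth, date-based aging is usually handled with a sliding window: a RANGE RIGHT partition function on the date column, with a new boundary split in ahead of time and the oldest one switched out and merged away. A rough sketch (all names here are illustrative, not taken from the outbox schema):

```sql
-- One partition per day on a datetime column such as DispatchedAt.
CREATE PARTITION FUNCTION pfByDay (datetime)
AS RANGE RIGHT FOR VALUES ('2016-01-24', '2016-01-25', '2016-01-26');

CREATE PARTITION SCHEME psByDay
AS PARTITION pfByDay ALL TO ([PRIMARY]);

-- The table and its clustered index must be created on psByDay,
-- with the partitioning column part of the clustered key.

-- Rolling the window forward each day: add tomorrow's boundary,
-- then (after switching or truncating the oldest partition)
-- merge the oldest boundary away.
ALTER PARTITION FUNCTION pfByDay() SPLIT RANGE ('2016-01-27');
ALTER PARTITION FUNCTION pfByDay() MERGE RANGE ('2016-01-24');
```

Rows don't physically "move" between partitions as they age; instead the window of partitions slides forward over time, so the oldest partition naturally contains the oldest rows.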

Hope this helps

Corith Malin

Jan 20, 2016, 9:57:18 PM
to particula...@googlegroups.com
Hey Jérémie,
  I looked at doing something similar too.  Does that really limit table locks if the table is severely fragmented?
My theory was that it really wouldn't, but I'm not a huge SQL guy either, so my theory could be wrong.

Thanks!

william...@particular.net

Jan 25, 2016, 2:21:20 AM
to Particular Software
@Corith: If you've got the spare time, could you give the batch delete a chance and report back on the results?  It would be really helpful feedback for us.

We're probably going to split up the table in the future. At the moment it handles both de-duplication and processing, which isn't ideal and, as your case shows, doesn't scale.

William

Corith Malin

Jan 25, 2016, 9:37:40 AM
to Particular Software
Hey William,
  I had the batch delete set to 2 minutes and to only store 5 minutes' worth of data, and that's when we were getting 100% utilization and deadlocks.  I think part of the problem is that we currently have this scaled out to 5 instances, so with such a short delete interval we have 5 machines all trying to run a costly delete job on a highly fragmented table.  When I set the delete job to once a day (and it ends up timing out, which is fine), our utilization hovers around 5% (which means we could probably lower our pricing tier, as we raised it before we figured out this batching-delete stuff) and the inserts happen just fine.

william...@particular.net

Jan 25, 2016, 1:43:44 PM
to Particular Software
When you say "had the batch delete set to 2 minutes", do you mean the built-in cleanup delete, or the batch that Jérémie suggested?

Another option that could be promising: set the cleanup interval to something really short on ONLY one of your services, and leave it long on the remaining services. This would reduce the number of services competing to delete and hopefully decrease the deadlocks.

William

Corith Malin

Jan 25, 2016, 2:48:38 PM
to Particular Software
William,
  I mean the built-in cleanup delete.

  I had thought of that too.  Unfortunately, given how Azure deployments work, that would mean I would need two deployments: one containing a single server with its own settings, and another for all the other instances that get scaled up and down automatically.

william...@particular.net

Jan 26, 2016, 4:30:37 AM
to Particular Software
Yeah, that's a good point.

We have an issue logged to track this; can I ask you to keep an eye on it from time to time and perhaps jump into the discussion, so we make sure we don't miss your feedback and needs?

William