Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried about the fact that one table in particular will be getting very big. Each customer has approximately 20,000,000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be run on this table. I am already noticing performance issues and users being temporarily locked out.
Update: It has now been about half a year since we first created the tables. Following the advice below, I created a handful of huge tables. Since then, I have been experimenting with indexes and decided on a clustered index on the first two columns (Hospital code and Department code), on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently; now, as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indexes on the temp table and run the query there. This is not how it should be. Therefore, we are considering buying Enterprise Edition so that we can use partitioned tables. If the purchase cannot go through, I plan to use a workaround to accomplish partitioning in Standard Edition.
The thing to bear in mind is that such large tables have a different physics, like black holes, so tuning them takes a different set of techniques. The other thing is that users of a data warehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second (or indeed sub-minute) response for every query.
Partitioning can be useful, especially if you have a clear demarcation such as, in your case, CUSTOMER. Be aware, though, that partitioning can degrade the performance of queries which cut across the grain of the partitioning key. So it is not a silver bullet.
Splitting a table's rows across several tables or servers for performance reasons is often called sharding. Also, a database schema can be more or less normalized. A normalized schema has separate tables with relations between them, and data is not duplicated.
I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.
Since you've tagged your question as 'datawarehouse' as well, I assume you know some things about the subject. Depending on your goals, you could go for a star schema (a multidimensional model with a fact table and dimension tables). Store all fast-changing data in one table (per subject) and the slow-changing data in separate dimension ('snowflake') tables.
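To make the star-schema suggestion concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module only so that it runs anywhere; it is not SQL Server syntax, and the table and column names (dim_hospital, fact_measurement, score and so on) are hypothetical rather than taken from the question.

```python
import sqlite3

# Hypothetical star schema: one fact table plus one dimension table.
# SQLite stands in for SQL Server here purely so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_hospital (
        hospital_id   INTEGER PRIMARY KEY,
        hospital_code TEXT NOT NULL,
        region        TEXT NOT NULL      -- slow-changing descriptive data
    );
    CREATE TABLE fact_measurement (
        hospital_id INTEGER NOT NULL REFERENCES dim_hospital(hospital_id),
        measured_on TEXT NOT NULL,
        score       REAL NOT NULL        -- fast-changing numeric data
    );
    CREATE INDEX ix_fact_hospital ON fact_measurement (hospital_id);
""")

conn.executemany("INSERT INTO dim_hospital VALUES (?, ?, ?)",
                 [(1, "H1", "North"), (2, "H2", "North"), (3, "H3", "South")])
conn.executemany("INSERT INTO fact_measurement VALUES (?, ?, ?)",
                 [(1, "2009-01-01", 10.0), (2, "2009-01-01", 20.0),
                  (3, "2009-01-01", 40.0), (3, "2009-01-02", 60.0)])

# Benchmark-style aggregate: join the fact table to the dimension and group.
benchmark = conn.execute("""
    SELECT d.region, AVG(f.score) AS avg_score, COUNT(*) AS n
    FROM fact_measurement AS f
    JOIN dim_hospital AS d ON d.hospital_id = f.hospital_id
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(benchmark)
```

The fact table stays narrow (keys plus measures), while descriptive attributes live once per hospital in the dimension table, which is what keeps aggregate queries over many customers manageable.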
Also examine your current queries if you are experiencing performance issues. If you don't have proper indexing (did you, for instance, index the foreign key fields?), queries will be slow. If you don't have sargable queries, they will be slow. If you used correlated subqueries or cursors, they will be slow. Are you returning more data than is strictly needed? If you have select * anywhere in your production code, get rid of it and only return the fields you need. If you used views that call views that call views, or if you used an EAV table, you will have performance issues at this level. If you allowed a framework to autogenerate SQL code, you may well have badly performing queries. Remember, Profiler is your friend. Of course, you could also have a hardware issue; you need a pretty good-sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
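As a rough illustration of what "sargable" means, the sketch below compares the query plan for a predicate that can use an index with one that cannot. It uses Python's built-in sqlite3 module as a stand-in so it is runnable; SQL Server's optimizer and plan output differ, and the table, column and index names are hypothetical.

```python
import sqlite3

# SQLite stands in for SQL Server so the sketch runs anywhere; the table,
# column and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurement (
        hospital_code   TEXT NOT NULL,
        department_code TEXT NOT NULL,
        score           REAL NOT NULL
    );
    CREATE INDEX ix_measurement_hospital ON measurement (hospital_code);
""")

def plan(sql):
    """Return SQLite's query-plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Sargable: the indexed column is compared directly, so the index can be used.
sargable = plan("SELECT score FROM measurement WHERE hospital_code = 'H1'")

# Not sargable: wrapping the column in a function hides it from the index.
non_sargable = plan(
    "SELECT score FROM measurement WHERE upper(hospital_code) = 'H1'")

print(sargable)
print(non_sargable)
```

The exact plan wording varies between SQLite versions, but the first plan should mention the index and the second should not. On SQL Server the analogous check is to look at the execution plan for an index seek versus a scan.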
I suggest you hire a professional DBA with performance tuning experience. It is quite complex stuff. Databases designed by application programmers are often bad performers once they get a real number of users and records. Databases MUST be designed with data integrity, performance and security in mind. If you didn't do that, the chances of having them are slim indeed.
Partitioning is definitely something to look into. I had a database that had 2 tables sharded. Each table contained around 30-35 million records. I have since merged these into one large table and assigned some good indexes. So far, I've not had to partition this table as it's working a treat, but I'm keeping partitioning in mind. One thing that I have noticed, compared to when the data was sharded, is the data import. It is now slower, but I can live with that as the import tool can be rewritten ;o)
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also play a part in getting queries to perform over very large data sets (although partitioning is usually less about performance than about maintenance: being able to add and drop partitions quickly). It really depends on the order of aggregation and the types of queries.
One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.
Keep one table - 20M rows isn't huge, and customers aren't exactly the kind of data that you can easily 'archive off', and the aggravation of searching multiple tables to find a customer isn't worth the effort (SQL Server is likely to be much more efficient at B-tree searching than your own invention is).
A properly designed database provides you with access to up-to-date, accurate information. Because a correct design is essential to achieving your goals in working with a database, investing the time required to learn the principles of good design makes sense. In the end, you are much more likely to end up with a database that meets your needs and can easily accommodate change.
This article provides guidelines for planning a desktop database. You will learn how to decide what information you need, how to divide that information into the appropriate tables and columns, and how those tables relate to each other. You should read this article before you create your first desktop database.
Certain principles guide the database design process. The first principle is that duplicate information (also called redundant data) is bad, because it wastes space and increases the likelihood of errors and inconsistencies. The second principle is that the correctness and completeness of information is important. If your database contains incorrect information, any reports that pull information from the database will also contain incorrect information. As a result, any decisions you make that are based on those reports will then be misinformed.
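The risk of inconsistency from redundant data can be shown with a small sketch. It uses Python's built-in sqlite3 module; the product_flat table, its columns and the sample values are hypothetical and only serve to illustrate the principle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat table: the supplier's phone number is repeated per product.
conn.executescript("""
    CREATE TABLE product_flat (
        product_name   TEXT PRIMARY KEY,
        supplier_name  TEXT NOT NULL,
        supplier_phone TEXT NOT NULL
    );
    INSERT INTO product_flat VALUES ('Chai',  'Acme Foods', '555-0100');
    INSERT INTO product_flat VALUES ('Syrup', 'Acme Foods', '555-0100');
""")

# The phone number changes, but only one of the duplicated rows gets updated.
conn.execute(
    "UPDATE product_flat SET supplier_phone = '555-0199' "
    "WHERE product_name = 'Chai'")

# The same supplier now has two different phone numbers on record.
phones = conn.execute(
    "SELECT DISTINCT supplier_phone FROM product_flat "
    "WHERE supplier_name = 'Acme Foods' ORDER BY supplier_phone"
).fetchall()
print(phones)
```

Storing each supplier once in its own table and referencing it from the products table would leave only one place to update, so this kind of mismatch could not arise.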
Decide what information you want to store in each table. Each item becomes a field, and is displayed as a column in the table. For example, an Employees table might include fields such as Last Name and Hire Date.
To find and organize the information required, start with your existing information. For example, you might record purchase orders in a ledger or keep customer information on paper forms in a file cabinet. Gather those documents and list each type of information shown (for example, each box that you fill in on a form). If you don't have any existing forms, imagine instead that you have to design a form to record the customer information. What information would you put on the form? What fill-in boxes would you create? Identify and list each of these items. For example, suppose you currently keep the customer list on index cards. Examining these cards might show that each card holds a customer's name, address, city, state, postal code and telephone number. Each of these items represents a potential column in a table.
Next, consider the types of reports or mailings you might want to produce from the database. For instance, you might want a product sales report to show sales by region, or an inventory summary report that shows product inventory levels. You might also want to generate form letters to send to customers that announce a sale event or offer a premium. Design the report in your mind, and imagine what it would look like. What information would you place on the report? List each item. Do the same for the form letter and for any other report you anticipate creating.
The requirement to send e-mail messages to customers suggests another item to record. Once you know that a customer wants to receive e-mail messages, you will also need to know the e-mail address to which to send them. Therefore you need to record an e-mail address for each customer.
Think about the questions you might want the database to answer. For instance, how many sales of your featured product did you close last month? Where do your best customers live? Who is the supplier for your best-selling product? Anticipating these questions helps you zero in on additional items to record.
To divide the information into tables, choose the major entities, or subjects. For example, after finding and organizing information for a product sales database, the preliminary list might look like this:
When you first review the preliminary list of items, you might be tempted to place them all in a single table, instead of the four shown in the preceding illustration. You will learn here why that is a bad idea. Consider for a moment, the table shown here: