reading xml into database

2407 views
Skip to first unread message

mark_dre...@yahoo.co.uk

unread,
Oct 16, 2007, 5:14:42 AM10/16/07
to
I've an XML document which is in the region of 20meg which comes in
weekly, the contents of the document need to be read and some fields
tweaked and then the content imported into a SQL Server database.
Anybody got any recommendations on the best way of doing this? Any
online articles on the best approach would be really useful.

MTIA

Marc Gravell

unread,
Oct 16, 2007, 5:35:11 AM10/16/07
to
What database version? In 2005, I wouldn't rule out (without testing)
using the xml data-type; unlike XmlDocument, this shreds the file into
a relational database model, meaning you can query and modify portions
of it (with xpath etc) without having to load the entirety into
memory. You can also query it quiet efficiently if your next step is
to extract it to a relational set of tables.

The tricky bit would be getting the data over to the database; perhaps
SqlBulkCopy would be the best option. The other question is: are there
lots of first-level elements (directly under the document-element)? If
so, I do something similar, but I treat each child of the document
element as a new row in to database, and I have a cunning custom
IDataReader implementation (since SqlBulkCopy can consume IDataReader
very efficiently) that wraps an XmlReader for this purpose (i.e.
IDataReader.Read() advances XmlReader to the next sibling, getting the
entire inner[or outer, can't remember] xml in the process). This
allows me to upload my uber-xml in a number of chunks, say 200 at a
time. Makes it maneagable, and keeps the size of each record in the
database acceptable. Let me know if you want more detail, or perhaps
the custom IDataReader implementation.

I haven't read any performance details for the performance of very
large xml fragments as a single row - but I have a setup that would
allow me to investigate if you really want.

Marc


mark_dre...@yahoo.co.uk

unread,
Oct 16, 2007, 6:46:02 AM10/16/07
to

Hi Marc

Thanks for the reply. The database would be 2005.

The xml is records for offices so I'm probably only looking at one
table for the time being.

xml format is something like the following:-

<office>
<OfficeName>officename</OfficeName>
<Address>
<Street>street</Street>
<Town>town</Town>
<PostCode>postcode</PostCode>
</Address>
<field1>xxxx</field1>
<field10>xxxx</field10>
</office>

There may be in the region of 5000 offices which are likely to need
updated regularly (don't ask ;). At the time of reading the XML I
need to do some manipulation on the content in fields 1 - 10 before it
goes to the database. I've looked into sqlBulkCopy but thought there
was no way to manipulate the data being copied? Is this incorrect?

Thanks

Marc Gravell

unread,
Oct 16, 2007, 7:47:58 AM10/16/07
to
> The xml is records for offices so I'm probably only looking at one
> table for the time being.
I agree; one table, but that isn't what I am suggesting.

Suppose your xml for 5000 offices is:

<xml><office ... >...</office><!-- lots and lots in here --><office
...>...</office></xml>

What I am saying is that rather than dealing with this as a single
bulk, you bulk copy into a table where each row is a different
office - rather than a single CLOB of all the offices in one row. The
advantage of this is that each record describes a single entity, and
is easy to manage.

> I've looked into sqlBulkCopy but thought there
> was no way to manipulate the data being copied?

If you are the data-source, you can do whatever you like. I've checked
my code, and my current IDataReader implementation contains too much
project-specific junk to be easily postable - so I've tried to
illustrate from your xml taking your requirement to edit the xml on
the fly into account (although you'd normally start the XmlReader from
a Stream; the StringReader is just for quick demonstration); the only
bit you'd really need to change is the XmlDataReader.DoRead() method;
the point being that you can hand this IDataReader to SqlBulkCopy, and
it will use it as the data source. Note that at any point in time we
only have the data for a single office in memory.

The SqlBulkCopy class can be used to copy this into a staging table
(the ctor sets it up for a single column, but that can be changed);
once at the database, you can use all the SQL2005 methods to move that
into a relational model, or just leave as xml. Note that you can do
all sorts of fun things like persisted calculated columns on xml data,
so it isn't always (depending on scenario) necessary to make it
relational at all.

The other approach you can take is to use the same SqlBulkCopy model,
but to forget about xml at the database, and simple use XmlReader /
XmlDocument to parse each office into columns (using SetValues()
inside DoRead() to update the current row each time). This only works
if the schema stays simple; if you start getting multi-valued, it gets
very tricky to do this efficiently. My assumption (when working with
xml) is always worst-case (i.e. assume the data might get more
complex), which is why I have suggested working with xml first, and
suggested parsing to columns second. But either would work.

Marc

(long code post)

using System;
using System.Data;
using System.Xml;
using System.IO;
using System.Diagnostics;


static class Program {
const string xml = @"<offices>
<office>
<OfficeName>officename1</OfficeName>


<Address>
<Street>street</Street>
<Town>town</Town>
<PostCode>postcode</PostCode>
</Address>
<field1>xxxx</field1>
<field10>xxxx</field10>

</office><office>
<OfficeName>officename2</OfficeName>


<Address>
<Street>street</Street>
<Town>town</Town>
<PostCode>postcode</PostCode>
</Address>
<field1>xxxx</field1>
<field10>xxxx</field10>

</office><office>
<OfficeName>officename3</OfficeName>


<Address>
<Street>street</Street>
<Town>town</Town>
<PostCode>postcode</PostCode>
</Address>
<field1>xxxx</field1>
<field10>xxxx</field10>

</office><!-- comment test--><office>
<OfficeName>officename4</OfficeName>


<Address>
<Street>street</Street>
<Town>town</Town>
<PostCode>postcode</PostCode>
</Address>
<field1>xxxx</field1>
<field10>xxxx</field10>

</office></offices>";
static void Main() {
using (IDataReader reader = GetReader()) {
while (reader.Read()) {
Trace.WriteLine(reader.GetValue(0));
}
}
}
static IDataReader GetReader() {
return new XmlDataReader(XmlReader.Create(new
StringReader(xml)));
}
}

class XmlDataReader : SimpleDataReader {
private XmlReader reader;
readonly int targetDepth;
public XmlDataReader(XmlReader reader) : base(
new string[] {"Xml"},
new Type[] {typeof(string)}) {

if (reader == null) throw new ArgumentNullException("reader");
this.reader = reader;
reader.MoveToContent();
targetDepth = reader.Depth + 1;

}
protected override void DoClose() {
if(reader!=null) reader.Close();
reader = null;
}
protected override bool DoRead() {
while (reader.Read() && reader.Depth == targetDepth) {
if (reader.NodeType == XmlNodeType.Element) {
string xml;
using (XmlReader subReader = reader.ReadSubtree()) {
XmlDocument doc = new XmlDocument();
doc.Load(subReader);
doc.DocumentElement.SelectSingleNode("field1").InnerText
+= " *";
xml = doc.OuterXml;
}
SetValues(xml);
return true;
}
}
return false;
}


}


public abstract class SimpleDataReader : IDataReader {
readonly string[] columnNames;
readonly Type[] columnTypes;
readonly object[] values;

protected SimpleDataReader(string[] names, Type[] types) {
if (names == null) throw new
ArgumentNullException("names");
if (types == null) throw new
ArgumentNullException("types");
if (names.Length != types.Length) throw new
ArgumentException("Names / types lengths must match");

columnNames = (string[])names.Clone();
columnTypes = (Type[])types.Clone();
values = new object[names.Length];
}
protected void SetValues(params object[] values) {
if (values == null) throw new
ArgumentNullException("values");
if (values.Length != this.values.Length) throw new
ArgumentException("Values length is invalid");
values.CopyTo(this.values, 0);
}
private bool isClosed = false;
void IDisposable.Dispose() { Close(); }
abstract protected void DoClose();
public void Close() {
try {
DoClose();
} finally {
isClosed = true;
}
}
public bool IsClosed { get { return isClosed; } }
public int RecordsAffected { get { return -1; } }


#region IDataReader Members


public int Depth {
get { return 0; }
}

public DataTable GetSchemaTable() {
return null;
}

public bool NextResult() {
Close();
return false;
}
public bool Read() {
return DoRead();
}
protected abstract bool DoRead();

#endregion

#region IDataRecord Members

public int FieldCount {
get { return columnNames.Length; }
}

public bool GetBoolean(int i) {
return Convert.ToBoolean(GetValue(i));
}

public byte GetByte(int i) {
return Convert.ToByte(GetValue(i));
}

public long GetBytes(int i, long fieldOffset, byte[] buffer,
int bufferoffset, int length) {
byte[] data = (byte[])GetValue(i);
int max = data.Length + 1 - (int)fieldOffset;
if (length > max) length = max;
if (length > 0) {
Buffer.BlockCopy(data, (int)fieldOffset, buffer,
bufferoffset, length);
return length;
}
return 0;
}

public char GetChar(int i) {
return Convert.ToChar(GetValue(i));
}

public long GetChars(int i, long fieldoffset, char[] buffer,
int bufferoffset, int length) {
string data = GetString(i);
int max = data.Length + 1 - (int)fieldoffset;
if (length > max) length = max;
if (length > 0) {
data.CopyTo((int)fieldoffset, buffer, bufferoffset,
length);
return length;
}
return 0;
}

public IDataReader GetData(int i) {
throw new NotSupportedException();
}

public string GetDataTypeName(int i) {
return GetFieldType(i).Name;
}

public DateTime GetDateTime(int i) {
return Convert.ToDateTime(GetValue(i));
}

public decimal GetDecimal(int i) {
return Convert.ToDecimal(GetValue(i));
}

public double GetDouble(int i) {
return Convert.ToDouble(GetValue(i));
}

public Type GetFieldType(int i) {
return columnTypes[i];
}

public float GetFloat(int i) {
return Convert.ToSingle(GetValue(i));
}

public Guid GetGuid(int i) {
object obj = GetValue(i);
byte[] binary = obj as byte[];
if (binary != null) {
return new Guid(binary);
} else {
return new Guid(Convert.ToString(obj));
}
}

public short GetInt16(int i) {
return Convert.ToInt16(GetValue(i));
}

public int GetInt32(int i) {
return Convert.ToInt32(GetValue(i));
}

public long GetInt64(int i) {
return Convert.ToInt64(GetValue(i));
}

public string GetName(int i) {
return columnNames[i];
}

public int GetOrdinal(string name) {
for (int i = 0; i < columnNames.Length; i++) {
if
(StringComparer.InvariantCultureIgnoreCase.Equals(name,
columnNames[i]))
return i;
}
return -1;
}

public string GetString(int i) {
return Convert.ToString(GetValue(i));
}

public object GetValue(int i) {
return values[i];
}

public int GetValues(object[] values) {
if (values == null) throw new
ArgumentNullException("values");
int fields = values.Length < FieldCount ? values.Length :
FieldCount;
Array.Copy(this.values, values, fields);
return fields;
}

public bool IsDBNull(int i) {
object obj = GetValue(i);
return obj == null || obj is DBNull;
}

public object this[string name] {
get { return this[GetOrdinal(name)]; }
}

public object this[int i] {
get { return GetValue(i); }
}

#endregion
}

Marc Gravell

unread,
Oct 16, 2007, 8:00:39 AM10/16/07
to
Reply all
Reply to author
Forward
0 new messages