Gal Ratner
Gal Ratner is a Techie who lives and works in Los Angeles CA and Austin TX. Follow galratner on Twitter Google
Reading an Excel spreadsheet into dynamic objects

Excel represents documents using Office Open XML (OOXML), a zipped XML based format developed for spreadsheet and chart representation.  In this article I am going to demonstrate parsing an Excel spreadsheet and loading the data into a generic List. OOXML was incorporated into Office in 2007 therefore the following code will not work with older versions of Excel.
This is our spreadsheet:


In order to read and parse the Spreadsheet we are going to use the Open XML SDK 2.0 for Microsoft Office available for download from Microsoft. The complete SDK includes the DocumentFormat.OpenXml library and some productivity tools that will help you examine the structure of Office documents. Open XML SDK is built on top of System.IO.Packaging so in order to use the DocumentFormat.OpenXml library you also need to import WindowsBase.dll.


Let’s begin by loading the correct worksheet into a SpreadsheetDocument

private static List<ExpandoObject> GetSpreadsheetData(string workSheet, string filePath)
        {
            List<ExpandoObject> data = new List<ExpandoObject>();

            using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
            {
                // Get the worksheet we are working with
                IEnumerable<Sheet> sheets = spreadsheetDocument.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == workSheet);
                WorksheetPart worksheetPart = (WorksheetPart)spreadsheetDocument.WorkbookPart.GetPartById(sheets.First().Id);
                Worksheet worksheet = worksheetPart.Worksheet;
                SharedStringTablePart sstPart = spreadsheetDocument.WorkbookPart.GetPartsOfType<SharedStringTablePart>().First();
                SharedStringTable ssTable = sstPart.SharedStringTable;
                // Get the CellFormats for cells without defined data types
                WorkbookStylesPart workbookStylesPart = spreadsheetDocument.WorkbookPart.GetPartsOfType<WorkbookStylesPart>().First();
                CellFormats cellFormats = (CellFormats)workbookStylesPart.Stylesheet.CellFormats;

                ExtractRowsData(data, worksheet, ssTable, cellFormats);
            }

            return data;
        }


As you can tell we loaded the worksheet and a styles used for cell formatting. This will come in handy later on.
Now that we have the worksheet we use the first row to set the field names of our objects. We use ExpandoObject since it allows dynamically adding of members.

/// <summary>
        /// Get the data using the first row as columns and the rest of the rows as data
        /// </summary>
        /// <param name="data"></param>
        /// <param name="worksheet"></param>
        /// <param name="ssTable"></param>
        /// <param name="cellFormats"></param>
        private static void ExtractRowsData(List<ExpandoObject> data, Worksheet worksheet, SharedStringTable ssTable, CellFormats cellFormats)
        {
            var columnHeaders = worksheet.Descendants<Row>().First().Descendants<Cell>().Select(c => Convert.ToString(ProcessCellValue(c, ssTable, cellFormats))).ToArray();
            var columnHeadersCellReference = worksheet.Descendants<Row>().First().Descendants<Cell>().Select(c => c.CellReference.InnerText.Replace("1"string.Empty)).ToArray();
            var spreadsheetData = from row in worksheet.Descendants<Row>()
                                  where row.RowIndex > 1
                                  select row;

            foreach (var dataRow in spreadsheetData)
            {
                dynamic row = new ExpandoObject();
                Cell[] rowCells = dataRow.Descendants<Cell>().ToArray();
                for (int i = 0; i < columnHeaders.Length; i++)
                {
                    // Find and add the correct cell to the row object
                    Cell cell = dataRow.Descendants<Cell>().Where(c => c.CellReference == columnHeadersCellReference[i] + dataRow.RowIndex).FirstOrDefault();
                    if (cell != null)
                        ((IDictionary<StringObject>)row).Add(new KeyValuePair<StringObject>(columnHeaders[i], ProcessCellValue(cell, ssTable, cellFormats)));
                }
                data.Add(row);
            }
        }

Its important to mention that fields that are empty cells are not a part of the row collection and do not show up in the underlying XML structure, therefore we need to explicitly search for the correct cell in the current row by matching its CellReference to the CellReference of the header. For example:  Cell A7 will be matched against header A1. If a Cell exists in the XML we will add it as an object member.
The last step is to process the cell’s value. Values in Excel can be a little misleading. Some values will be held in the actual cell, some will be held in the SharedStringTable. Some cells data types will be set and some wont forcing us to rely on cell formatting. I have not found a defined set of guidelines to extracting cell values and all of the code examples I found online did not fully work so I had to simply rely on reading the underlying XML source and coming up with my own set of rules.

/// <summary>
        /// Process the valus of a cell and return a .NET value
        /// </summary>
        static Func<CellSharedStringTableCellFormatsObject> ProcessCellValue = (c, ssTable, cellFormats) =>
        {
            // If there is no data type, this must be a string that has been formatted as a number
            if (c.DataType == null)
            {
                CellFormat cf = cellFormats.Descendants<CellFormat>().ElementAt<CellFormat>(Convert.ToInt32(c.StyleIndex.Value));
                if (cf.NumberFormatId >= 0 && cf.NumberFormatId <= 13) // This is a number
                    return Convert.ToDecimal(c.CellValue.Text);
                else if (cf.NumberFormatId >= 14 && cf.NumberFormatId <= 22) // This is a date
                    return DateTime.FromOADate(Convert.ToDouble(c.CellValue.Text));
                else
                    return c.CellValue.Text;
            }

            switch (c.DataType.Value)
            {
                case CellValues.SharedString:
                    return ssTable.ChildElements[Convert.ToInt32(c.CellValue.Text)].InnerText;
                case CellValues.Boolean:
                    return c.CellValue.Text == "1" ? true : false;
                case CellValues.Date:
                    return DateTime.FromOADate(Convert.ToDouble(c.CellValue.Text));
                case CellValues.Number:
                    return Convert.ToDecimal(c.CellValue.Text);
                default:
                    if (c.CellValue != null)
                        return c.CellValue.Text;
                    return string.Empty;
            }
        };

NumberFormatId is defined in a table and took a while to find online.
After running the code the result is a generic list holding dynamic objects containing  strongly typed members matching the types in your spreadsheet..

Shout it


Posted 18 Mar 2011 3:36 AM by Gal Ratner
Filed under: , ,

Powered by Community Server (Non-Commercial Edition), by Telligent Systems