Navigation

Thursday 2 September 2010

How to: Achieve Count(*) on a large SharePoint list

This has been a mission of mine for a while now (before I went on holiday and took a 2 week hiatus from all things SharePoint :)).

One of the clients I've been working with has been trying to replicate a pretty simple operation (by normal development standards). They have a SharePoint list with a LOT of items in it (we are talking 200,000 list items and above) and includes some Choice fields.

They want to return a count of how often each choice value is being used. Now, if you were using SQL Server you would simply do the following pseudo-SQL:
select count(*) from myList group by myChoiceField
At first look in SharePoint this is not possible:
  • There is no "count" operation in CAML, nor any other kind of aggregation function
  • SharePoint Search "full text query" does not support the count(*) operator (or anything similar)
  • The only reference to aggregations is in the SPView.Aggregations property .. this is only used by the rendered HTML and the values are not returned in the result set.
Now .. I know that you can get count values on a list, if you create a View with a Group By then it shows you the number of items in each group, so it MUST be possible! So my mission started

List view with groups
We want to replicate this behaviour,
but programmatically!

First.. we need a test environment
The first thing I did was create a really big list. We are talking about 200,000 list items, so you can't just pull all the items out in an SPQuery (as it would be far too slow!).

I generated a simple custom list. I add a choice field (with optional values of 1-20) and then generated 200,000 list items with a randomly assigned choice value (and a bunch of them without any choice value at all .. just for laughs).

Now I could play with my code

Attempt Number 1 - Retrieve all list and programmatically calculate the counts (fail)
I kinda knew this wouldn't work .. but I needed a sounding board to know HOW bad it really was. There are 200,000 items after all, so this was never going to be fast.
  • Use SPQuery to retrieve 2 fields (the ID, and my "choice" field).
  • Retrieve the result set, and iterate through them, incremementing an integer value to get each "group" count value
This was a definite #fail.To retrieve all 200,000 list items in a single SPQuery took about 25 seconds to execute ... FAR too slow.

Attempt Number 2 - Execute separate query for each "group" (fail)
I was a little more positive with this one ... smaller queries execute much faster so this had some legs (and this is certainly a viable option if you only want the count for a SINGLE group).
  • Create an SPQuery for each of the "choice" values we want to group by (there are 20 of them!)
  • Execute each query, and use SPListItemCollection.Count to get the value
Unfortunately this was another spectacular #fail. Each query executed in around 2 seconds .. which would be fine if we didn't have to do it 20 times! :( (i.e. 40 second page load!!)

Attempt Number 3 - Use the SPView object (success!)
Ok .. so I know that the SPView can render extremely fast. With my sample list, and creating a streamlined "group by" view it was rendering in about 2 seconds (and thats on my laptop VM! I'm sure a production box would be much much quicker).

The main problem is ... how do you get these values programmatically?

The SPView class contains a "RenderAsHtml" method which returns the full HTML output of the List View (including all of the group values, javascript functions, the lot). My main question was how did it actually work? (and how on earth did it get those values so quickly!)

I started off poking into the SPView object using Reflector (tsk tsk). The chain I ended up following was this:
  • SPView.RenderAsHtml() -->

    • SPList.RenderAsHtml() (obfuscated ... arghhhh)
So that was a dead end .. I did some more poking around and found out that SPContext also has a view render method ...
  • SPContext.RenderViewAsHtml() -->

    • SPContextInternalClass.RenderViewAsHtml() -->

      • COM object ! (arghhhh)
Now .. the fact that we just hit a COM object suggests that we are starting to wander towards the SQL queries that get executed to retrieve the view data .. I didn't want to go anywhere NEAR that one, so I decided to leave it there and perhaps try using the output HTML instead (nasty .. but not much of a choice left!).
using (SPSite site = new SPSite(http://myspsite))
{
SPList list = site.RootWeb.Lists["TestList"];
string strViewHtml = list.Views["GroupedView"].RenderAsHtml();
}
Having done this we now have the HTML output of our view (and this code takes about 2-3 seconds to execute ... fast enough for my laptop .. we can always cache the value if needed).
 
Looking through the DOM output in the browser, it was possible to identify the "group" element by their attributes. It is a TBody node with both an ID attribute and a "groupString" attribute (the GroupString is the important one, as it tells us the view is configured to "Group By").
 
What I needed next was a way of getting the actual values out of the HTML. For this I used the extremely awesome "HTML Agility Pack" from Codeplex. This is a set of libraries that allow you to parse DOM elements, including both plain "poorly formed" HTML as well as XHTML, and then use XPath queries to extract any values you want (much in the same way you would normally use the XML namespace for XHTML).
 
This gave me the TBODY node, and from there I could use string manipulation on the "InnerText" to pull out the group name and the count value :)
// Using HTML Agility Pack - Codeplex
// load the HTML into the HtmlDocument object

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(strViewHtml);

// retrieve all TBODY elements which have both
// an ID and groupString attribute
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tbody[@id][@groupstring]");
if (nodes != null)
{
foreach (HtmlNode node in nodes)
{
// extract the Group Name
string strGroupName = node.InnerText.Substring(node.InnerText.LastIndexOf(" ")+6);
strGroupName = strGroupName.Substring(0, strGroupName.IndexOf("&#")-1);
Console.Write ("Group: " + strGroupName + ", ");

// extract the number of items
string strValueText = node.InnerText.Substring(node.InnerText.LastIndexOf("(") + 1);
Console.WriteLine("Number of Items: " + strValueText.Substring(0, strValueText.Length - 1));
}
}
As you can see I'm doing some rather nasty SubString statements.. there may well be a quicker and cleaner way to do this using Regex .. this was more a proof of concept than anything else :)

Result!
Console output, showing group names and counts.
3 seconds isn't bad, running on a "single server" laptop VM image :)

The end result was 2-3 second bit of code, retreiving Group By, Count values for a list with 200,000 list items.

Not bad for an afternoons work :)

Attempt 4 - Do the same thing in JQuery (kudos to Jaap Vossers)
This was actually the original solution, I asked Jaap if he could look at this if he had spare time, as I knew he had a lot of JQuery experience (and he blew me away by having it all working in under 30 minutes!).

Basically it uses pretty standard JQuery to go off and retrieve the HTML content from another page, scraping the HTML and pulling back the values. Same as the C# it grabs the group TBody, then walks down the DOM to retrieve the text value that it outputs.

The speed is roughly the same as the actual view itself. I'm sure some more JQuery could be employed to pull out the specific values and do more with them, but the concept appears to be sound:
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>

<script type="text/javascript">
$(document).ready(function(){

// you will need to change this URL
var url = http://myspsite/Lists/MyList/GroupedView.aspx;

var groupings = [];

$.get(url, function(data) {
$(data).find("tbody[id^=titl][groupString] > tr > td").each(

function(index, value){
groupings.push($(this).text());
}
);

$("#placeholder").append("<ul></ul>");

$.each(groupings, function(index, value){

$("#placeholder ul").append("<li>" + value + "</li>")
});
});
});

</script>
<div id="placeholder"></div>
tada .. (thanks Jaap!)


Result from JQuery output,
dropped into a Content Editor Web Part

Summary
Well .. I know doing HTML scraping isn't pretty, but seeing as the code is MUCH faster than anything else I've seen (and is stuck in the middle of a COM object) there didn't seem to be much choice.

By all means, feel free to let me know if you have any alternatives to this.

10 comments:

  1. Haah I was working on top contributors web part. thought of doing it using datatable because getting group count using object model appeared almost impossible.
    You made my day. We survive because of people like you.
    Thanks a lot

    ReplyDelete
  2. Any suggestion of what to try if this doesn't work for me right away? Everything seems great, I verified the paths to the grouped view and jquery, as well as verified that the view was set up in groups, but this script doesn't output anything. Any pointers would be greatly appreciated.

    ReplyDelete
  3. hey Martin,

    As you mentioned, you had 200k items in the list. Did you create items in different folders. I have a similar requirement to have 100k items in a task list and I am not sure of the best way to do this. Can you please show me the way? I would reaaly really aprpeciate if you can please take out time to answer me.

    Thanks.
    Nisha
    nisha.xclusiv@gmail.com

    ReplyDelete
  4. Hi Martin,

    AS you mentioned you 200k items in a List. I have similar required to put 100k items in a task list. I am not sure of the best way to approach it. Can you please advise?
    I would really appreciate if you can please take out some time to answer me.

    Thanks

    ReplyDelete
  5. I've done this a little bit different and HTML Scrapped the 'All site content' page.

    /_layouts/viewlsts.aspx

    This page shows the item count and doesn't require a new page to be specially created.

    Adam Larner

    ReplyDelete
  6. Adam,

    Yes, although that page actually calls the SPList.ItemCount property which gives the same information.

    My example above was providing counts for filtered items within a list (i.e. a Group By operator)

    ReplyDelete
  7. Like Skeene said, I did not get an output. I replaced the var url with mines but nothing was displayed... any ideas? anything else that needs to be replaced?

    ReplyDelete
  8. the url may need quotes

    // you will need to change this URL
    var url = "http://myspsite/Lists/MyList/GroupedView.aspx";



    blah. bot checks... some day only robots will be able to prove they're not robots

    ReplyDelete
  9. Dude. This is so unbelievably awesome. Thank you!

    ReplyDelete
  10. Hi, from your screenshots you were using SharePoint 2007.
    Have you tried that with SharePoint 2010 and its famous list view threshold?
    As your list is above the 5,000 items threshold (200,000 items) your grouped view shouldn't have been able to return all items... I'm interested to know as I need to do something similar with SharePoint 2010. Thanks in advance for your help.

    ReplyDelete

This blog has been moved to www.martinhatch.com

Note: only a member of this blog may post a comment.