[O365] Parallel processing and CSOM; friends or foes?
In the project I’ve been working on for the last few months, we had to import list items and files into SharePoint from a legacy application that is being replaced by SharePoint Online. There is plenty of guidance available on how to do this; it boils down to using the CSOM APIs to upload documents, create list items, set metadata, et cetera. But when volumes start growing, so does the pain, because these actions are usually not that fast. So whilst your process is waiting on a file to be uploaded, why not do something else in the meantime? Well, you can with some parallel programming magic, and it isn’t even that hard, but there are some caveats you need to take into account.
Basically this post revolves around one important property of the CSOM objects, specifically ClientContext. And that is: CSOM objects are not thread safe!
Let’s start with some background information on thread safety. When a program uses threading (any kind of it), this means there is code running simultaneously in multiple threads. This is especially useful in today’s world, where most PCs are powerful anyway and usually have multiple CPU cores. Threading allows your program to do things whilst waiting for something else. The fact that an object is labeled as “not thread safe” means that you will start running into problems when more than one thread tries to use that object. There are all kinds of reasons for this which I’m not going to cover in this post. The important thing to understand is that the object (and thus your code) will get into trouble if multiple threads access it, change things at the same time and maybe try to update SharePoint as well.
So the first lesson of this post: make sure you use your ClientContext objects in one thread only. And the second lesson: this goes for almost everything you create using it, too. For instance, you cannot get a ListItem from context A and use it with context B; this will fail.
Here’s a simple code example of how you could implement this:
    using System;
    using Microsoft.SharePoint.Client;

    class WorkerClass : IDisposable
    {
        // each worker owns its own context, so a context is never shared between threads
        ClientContext _context;

        public WorkerClass()
        {
            _context = ClientContextFactory.GetContext();
        }

        public void DoWork(WorkPackage package)
        {
            // this method will perform time consuming work
        }

        public void Dispose()
        {
            if (_context != null)
            {
                _context.Dispose();
                _context = null;
            }
        }
    }

The class holds its own instance of ClientContext, so every WorkerClass instance you create can safely run on its own thread: it has its own context object to interact with. To avoid memory leaks, you need to implement IDisposable as soon as you’re using objects that do the same (and ClientContext does).
By the way, you can see that I’m using a ClientContextFactory to get my ClientContext object. This is a simple class following the factory pattern (sort of) that creates a new ClientContext object with the correct authentication and URL, based on config settings. Not required, but I like doing it that way.
Throttling
So now that we’ve managed to create something that will work in parallel scenarios, the next problem that comes up is throttling. There is another nice MSDN page on that which you can find here. Basically, throttling ensures that no one takes down the SharePoint servers by firing idiotic amounts of requests at them, kind of like a DDoS attack. I found that when you use CSOM in a single-threaded application it’s pretty hard to run into the throttling limits. But when you go parallel, your chances of doing so just doubled. Or tripled, quadrupled, etc., depending on the number of threads you’re using. Throttling limits requests per user, so using multiple ClientContext objects won’t get you around this limitation. By the way, you might perceive it as annoying (I do too), but I guess it’s for the best that your little console app doesn’t tear down your SharePoint environment because you made a little boo boo, right?
Lesson 3: throttling prevents you from firing too many requests at the same time.
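One common way to live with throttling is to pause and retry when a request gets rejected. Here is a minimal sketch of that idea, assuming a throttled call surfaces as an exception you can recognise (SharePoint Online typically answers throttled requests with HTTP 429 or 503). The ThrottleRetry class, its Execute method and the isThrottled predicate are hypothetical names for illustration, not part of CSOM:

```csharp
using System;
using System.Threading;

public static class ThrottleRetry
{
    // Runs the action. If it throws and isThrottled(exception) returns true,
    // waits and tries again; the wait doubles after every failed attempt.
    // Any other exception, or a throttled failure on the last attempt, is rethrown.
    public static T Execute<T>(Func<T> action, Func<Exception, bool> isThrottled,
                               int maxAttempts = 5, int initialDelayMs = 1000)
    {
        int delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception ex) when (attempt < maxAttempts && isThrottled(ex))
            {
                Thread.Sleep(delayMs); // back off before the next attempt
                delayMs *= 2;
            }
        }
    }
}
```

In a worker you could wrap the call that actually talks to SharePoint (for example ExecuteQuery() on the context) in ThrottleRetry.Execute. How you detect throttling in the predicate depends on how your CSOM version reports it, so treat that part as something to verify yourself.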
Dataflow (Task Parallel Library)
This next little gem will help us deal with the throttling issue. It is also the basis of what we will be using to make our app really multi-threaded.
There’s quite some threading stuff available in the .NET framework to choose from. And there are lots of articles on those, so I’m not going to detail them all. The one I like best is the Dataflow library, which is part of the Task Parallel Library (TPL). As the MSDN page says: “The Task Parallel Library (TPL) provides dataflow components to help increase the robustness of concurrency-enabled applications.”
The library is built around the basic threading stuff in .NET, but adds some functionality which makes it easier to control. The best feature in this case is the ability to queue stuff but use only a limited number of threads to process that queue. Super handy when uploading files, for instance, so you do not have too many uploads running at once (which will only hurt performance instead of boosting it).
You can add the library via NuGet by searching for ‘dataflow’. Note that you need to be running at least .NET 4.5 in order to use it.
ActionBlocks
With the TPL Dataflow library installed, you can now start using ActionBlocks. An ActionBlock is a block of code which is executed for every item you post to it, with several items being processed in parallel. Options can be set to influence how the processes are handled; the obvious one to set is MaxDegreeOfParallelism, which lets you set the number of processes to run simultaneously.

    // set the options to use when executing the parallel stuff
    // MaxDegreeOfParallelism sets the maximum number of parallel processes
    ExecutionDataflowBlockOptions options = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 3
    };

    // Define the actionblock. This is the block of code which is executed.
    // The WorkPackage generic parameter allows you to pass in any object you like
    // as input for the process.
    ActionBlock<WorkPackage> block = new ActionBlock<WorkPackage>(package =>
    {
        using (WorkerClass worker = new WorkerClass())
        {
            worker.DoWork(package);
        }
    }, options);

Notice how I create a WorkerClass inside of a using statement in the ActionBlock. This will now happen for every item that needs to be processed, ensuring that there is no sharing between threads. You can put some code in the ActionBlock itself, but I would recommend creating a method or class for it so your code stays clean.
Now that you have your ActionBlock set up, you can start queuing work for it. This is done using the Post() method:

    // Now we define a list of workpackages, just some dummies for show
    List<WorkPackage> workPackages = new List<WorkPackage>()
    {
        new WorkPackage("Package1"),
        new WorkPackage("Package2"),
        new WorkPackage("Package3")
    };

    // Post each item to the block, which queues the package for processing
    foreach (WorkPackage package in workPackages)
    {
        block.Post(package);
    }

This is an easy example, but you can also have more complex code creating work packages based on other actions and posting them to your ActionBlock(s) for processing. Lots of possibilities there. Also, I used a pretty lame and simple class as work package. Because the ActionBlock uses generics, you can pass in any class you want to use. Usually these will contain the information required as input for your process.
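The WorkPackage class itself isn’t shown in this post. A minimal sketch of what it could look like, assuming all it needs to carry is the name passed to the constructor (the Name property is an assumption for illustration; in a real import you would add whatever your process needs as input):

```csharp
// Hypothetical work package: a plain data holder that is handed to the ActionBlock.
public class WorkPackage
{
    // Name of the package, for example "Package1". Read-only after construction.
    public string Name { get; private set; }

    public WorkPackage(string name)
    {
        Name = name;
    }
}
```

Since a package like this cannot change after it is created, nothing about it needs locking, even if several threads happen to read it.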
Once you start posting, the ActionBlock will start processing. Now the rest is simple:
    // block.Complete() signals the dataflow that all of the work has been sent
    block.Complete();

    // block.Completion.Wait() tells the dataflow to wait until all of the items
    // have been processed
    block.Completion.Wait();

This tells the block that you’re done posting (no more work is coming) and then makes the calling thread wait until all of the work has been done.
And there you go. Your code is now processing up to three packages simultaneously. Back to the throttling issue: how many threads can you run simultaneously? Well, that depends 100% on what your code is doing. Request-heavy stuff will run into throttling sooner than code which only makes occasional calls to the SharePoint APIs. So that’s not a question I can answer here. My advice would be to test and measure performance whilst keeping an eye out for throttling. That way you’ll find out which value works best for you without running into trouble.
I shared the sample project on my OneDrive here. It doesn’t do anything useful, but it should get you up and running with the TPL. And should you have any questions or comments, of course you’re more than welcome to use the comment section below!
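To try the whole flow without a SharePoint tenant, here is a self-contained sketch that replaces the worker with a short sleep and records how many packages were being processed at once. The ActionBlockDemo class, its Run method and the 50 ms delay are made up for illustration; only ActionBlock, ExecutionDataflowBlockOptions, Post(), Complete() and Completion come from the Dataflow library:

```csharp
using System.Threading;
using System.Threading.Tasks.Dataflow;

public static class ActionBlockDemo
{
    // Posts itemCount dummy packages to an ActionBlock limited to maxParallel
    // simultaneous items. Returns the highest number of packages that were being
    // processed at the same moment; 'processed' receives the total handled.
    public static int Run(int itemCount, int maxParallel, out int processed)
    {
        int running = 0, peak = 0, done = 0;
        object gate = new object(); // guards the three counters above

        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = maxParallel
        };

        var block = new ActionBlock<string>(package =>
        {
            lock (gate) { running++; if (running > peak) peak = running; }
            Thread.Sleep(50); // stand-in for a slow upload
            lock (gate) { running--; done++; }
        }, options);

        for (int i = 1; i <= itemCount; i++)
        {
            block.Post("Package" + i);
        }

        block.Complete();        // no more work is coming
        block.Completion.Wait(); // block until every posted item is processed

        processed = done;
        return peak;
    }
}
```

Calling ActionBlockDemo.Run(10, 3, out processed) should report all 10 packages as processed and a peak that never exceeds 3, which is exactly the limit MaxDegreeOfParallelism is there to enforce.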
November 26, 2015 at 11:07 am
Hi,
Nice post about a complex topic!
I can imagine some scenarios where the customer just wants to “update some data in O365” 😉
Regards
Daniel
July 8, 2016 at 1:26 pm
Hey,
Thanks for this guide! I tried to download your sample code but the link seems broken. Can you update it, please? 🙂
July 8, 2016 at 2:43 pm
I just checked it, working fine for me. What error are you getting?
July 12, 2016 at 8:18 am
I get this error: http://imgur.com/y4sT3qc
I tried with Chrome, IE, Edge and Firefox with no luck :/
July 12, 2016 at 8:51 am
That’s odd, it keeps working for me. Try this one: https://1drv.ms/u/s!AnP8b1qgAMQmlKBC1iVvGaLA1vGKrw
July 19, 2016 at 8:13 am
(sorry for the delay)
This one works, thanks! 🙂
July 19, 2016 at 11:26 am
Thanks again, I tried your code and it works fine. I now understand how TPL works (I even read something about TransformBlock, which is something I may use later), but I’m wondering about one thing: why didn’t you use a simple Parallel.For/ForEach?
Is it for code separation or another reason?
July 19, 2016 at 2:39 pm
Good question and to be honest I’m not quite sure why as this already was a while back. I think it had something to do with the .NET framework version we were using that didn’t support that. I believe a lot of the parallels stuff that is now in the .NET framework natively actually came from the TPL library.
July 27, 2016 at 3:13 pm
“I believe a lot of the parallels stuff that is now in the .NET framework natively actually came from the TPL library.”
After some searches, it seems to be the case 🙂
That’s awesome how fast and good the .NET Framework is evolving!