
Commented Unassigned: http_listener (Windows) - server performance deteriorates completely after enough file uploads [341]

We use Casablanca in our product, where the main use case is to upload files via HTTP to the server for further processing. The HTTP clients which submit the files use polling to keep up with the current state of the server.
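To make the setup concrete, here is a stripped-down sketch of the server side (the endpoint, port, and target file name are placeholders for illustration, not our actual code):

    #include <cpprest/http_listener.h>
    #include <cpprest/filestream.h>
    #include <iostream>
    #include <string>

    using namespace web::http;
    using namespace web::http::experimental::listener;
    namespace streams = concurrency::streams;

    int main()
    {
        http_listener listener(U("http://localhost:8080/upload"));

        // Stream each POSTed request body into a file for further processing.
        listener.support(methods::POST, [](http_request request)
        {
            streams::fstream::open_ostream(U("received.dat"))
                .then([request](streams::ostream file) mutable
                {
                    // Drain the request body into the file stream, then reply.
                    request.body().read_to_end(file.streambuf())
                        .then([request, file](size_t) mutable
                        {
                            file.close().wait();
                            request.reply(status_codes::OK);
                        });
                });
        });

        listener.open().wait();
        std::string line;
        std::getline(std::cin, line); // run until Enter is pressed
        listener.close().wait();
    }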

However, after enough file uploads, the polling performance deteriorates significantly: from 1-2% CPU time before any uploads, to 7-10% after 200 files (7 MB each) were uploaded, to 9-13% when 50 more files are uploaded - see the attached "Polling Perf. after Uploads.jpg" file. Uploading a further 100 files degrades the polling performance even to 30-33% (not shown, as it wouldn't fit into the JPG)!

I attached the simple client-server project I used to reproduce the issue. It's a VS 2013 project using Casablanca 2.4.0 on Windows. Polling is conducted every 0.25 s so as to make the performance drop more visible. Note: the project uses nonstandard include & library paths for Casablanca, as I built the SDK locally - our build system requires it when building cross-platform. The file used for the upload is included too; it lies in the Debug directory.
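For reference, the polling side of the repro looks roughly like this (the /status endpoint name is a placeholder):

    #include <cpprest/http_client.h>
    #include <chrono>
    #include <thread>

    using namespace web::http;
    using namespace web::http::client;

    // Poll the server every 0.25 s, as in the repro project.
    void poll_forever()
    {
        http_client client(U("http://localhost:8080"));
        for (;;)
        {
            client.request(methods::GET, U("/status"))
                  .then([](http_response response)
                  {
                      return response.extract_string();
                  })
                  .then([](utility::string_t state)
                  {
                      // ... react to the current server state ...
                  })
                  .wait();
            std::this_thread::sleep_for(std::chrono::milliseconds(250));
        }
    }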

From my investigation, the problem seems to lie deep in the Windows Concurrency Runtime. The profiler says that the program spends most of its time in the following call chain (the numbers come from the experiment shown on the attached JPG):

FreeThreadProxy::Dispatch - 98%
  ...
  WorkSearchContext::GetUnrealizedChore - 96.49%
    WorkSearchContext::StealUnrealizedChore - 96.48%
      ScheduleGroupSegmentBase::SafelyDeleteDetachedWorkQueue - 35.96%
        -> ListArray<...>::Remove - 31%
      ListArray<...>::operator[] - 21.29%
      WorkQueue::IsEmpty - 13.57%
      WorkQueue::IsDetached - 9.26%

Thus the bulk of the work is done traversing the ListArray<...> structure. Another observation is that the ListArray member of ScheduleGroupSegmentBase named m_workQueues is constantly growing; it reached a size of 2685 in the experiment depicted above (the case where polling ate up over 30% of CPU).

As the Concurrency::ListArray implementation never shrinks (not necessary per se, as it should mirror the maximum possible concurrency level) and is built up from chunks, traversing m_workQueues on each dispatch turn only to find out that there isn't any work to be done seems to eat up performance considerably.

As m_workQueues contains the work queues of the contexts which are not yet ready at the given moment, maybe there's a problem with external contexts stemming from I/O-completion callbacks? The queues in m_workQueues seem to be ready, but aren't marked for detachment.

This problem would preclude using Casablanca in our project, so it's rather a serious one. The server is supposed to run for months without stopping, so no performance degradation, not even a very small one, is acceptable.

I could help in resolving this, but please first try to reproduce the issue and give me some feedback. Maybe my usage of the SDK is plain wrong? Note: synchronizing on the file-receiving task by calling wait() on it didn't help, as sketched below.
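To be precise, by synchronizing on the file-receiving task I mean a synchronous variant of the handler roughly like this (a sketch, with the same placeholder file name as above):

    #include <cpprest/http_listener.h>
    #include <cpprest/filestream.h>

    using namespace web::http;
    using namespace web::http::experimental::listener;

    // Synchronous variant: block until the body is fully on disk before replying.
    void register_sync_upload_handler(http_listener& listener)
    {
        listener.support(methods::POST, [](http_request request)
        {
            auto file = concurrency::streams::fstream::open_ostream(U("received.dat")).get();
            request.body().read_to_end(file.streambuf()).wait(); // wait() on the receiving task
            file.close().wait();
            request.reply(status_codes::OK).wait();
        });
    }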

Regards,
Marek

PS: once or twice, the server ran into a mysterious assert; see the attached "ASSERT _M_numberOfWriters.jpg" file. It looks like an issue with RW locking, but I cannot reproduce it.

Comments: Hi Marek, the CPPREST_FORCE_PPLX switch will go into the master branch with our next release, 2.5.0. The date is not yet determined, perhaps sometime in March. Regarding the Concurrency Runtime, I think it might depend on your workload. The reason you are seeing better performance is that the Concurrency Runtime was, over time, creating too many threads. The Windows thread pool, being part of the operating system, has a better view of the overall system. This is one of the reasons that, for Visual Studio 2015, we've made PPL tasks run on the Windows thread pool by default. Steve
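For anyone reading along: the switch is a compile-time definition, and my assumption is that it has to be applied consistently when building both the SDK and the application. A sketch of the intended usage (the exact spelling could still change before 2.5.0):

    // Select the PPLX task implementation instead of the Concurrency Runtime
    // by defining CPPREST_FORCE_PPLX for every translation unit, e.g. on the
    // MSVC command line:
    //
    //   cl /DCPPREST_FORCE_PPLX=1 /EHsc server.cpp ...
    //
    // or, equivalently, before any cpprest header is included:
    #define CPPREST_FORCE_PPLX 1
    #include <cpprest/http_listener.h>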
