Concurrent data structures are often considerably slower than sequential ones, for a number of reasons. Often the algorithm itself is more complex, so, even without concurrency, each thread has to do more work. Threads may conflict on operations, and have to restart them. And threads will suffer more cache misses, and the memory bus will suffer congestion.
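To illustrate the conflict-and-restart point: a lock-free update typically sits in a compare-and-set loop, and every failed attempt is wasted work that gets redone. A minimal Java sketch (just a generic illustration, not any particular implementation from your practical):

    import java.util.concurrent.atomic.AtomicInteger;

    public class CasRetry {
        private final AtomicInteger count = new AtomicInteger(0);

        // Each failed compareAndSet means another thread got in first:
        // the work spent computing 'next' is thrown away and redone.
        // (AtomicInteger.incrementAndGet does this same loop internally.)
        public int increment() {
            while (true) {
                int current = count.get();
                int next = current + 1;
                if (count.compareAndSet(current, next)) return next;
            }
        }
    }

Under heavy contention, most iterations of that loop fail, so throughput can drop well below that of a single uncontended thread.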
Concurrent priority queues scale particularly badly. The head of the queue is inevitably a bottleneck for removeMin, as all threads will be accessing the same part of the datatype. This explains your observation that the priority queue works better when all threads simply enqueue.
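You can see the effect directly with the standard library's PriorityBlockingQueue, which guards its heap with a single lock, so every operation serializes on the same hot spot. This isn't the implementation you were testing, but the sketch below (thread count and operation count are arbitrary choices of mine) is easy to time with different mixes of enqueues and removeMins:

    import java.util.concurrent.PriorityBlockingQueue;

    public class PQContention {
        public static void main(String[] args) throws InterruptedException {
            final int NUM_THREADS = 8;
            final int OPS = 100_000;
            PriorityBlockingQueue<Integer> pq = new PriorityBlockingQueue<>();
            for (int i = 0; i < OPS; i++) pq.add(i); // pre-fill so poll() never blocks
            Thread[] workers = new Thread[NUM_THREADS];
            long start = System.nanoTime();
            for (int t = 0; t < NUM_THREADS; t++) {
                workers[t] = new Thread(() -> {
                    for (int i = 0; i < OPS; i++) {
                        pq.add(i);   // enqueue
                        pq.poll();   // removeMin: all threads fight over the head
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            System.out.printf("%d threads: %d ms%n", NUM_THREADS,
                              (System.nanoTime() - start) / 1_000_000);
        }
    }

Try varying NUM_THREADS: adding threads typically makes this slower, not faster, which is the scaling problem in miniature.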
The speed-ups you get from concurrent programs come mainly from the parts where threads act independently, outside shared datatypes. The shared datatypes are a necessary evil, to allow threads to interact. Do you need a pure priority queue in your application, or could you manage with an approximation, where each thread maintains its own queue, and threads sometimes exchange data?
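Here is a rough sketch of what I mean (the class, the names, and the exchange policy are all my own invention for illustration, not a library design): each thread keeps a private heap, and every so often publishes its current minimum into a small shared buffer, so removeMin returns some small element but not necessarily the global minimum.

    import java.util.PriorityQueue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Approximate priority queue: each thread owns a local heap and only
    // occasionally touches the shared exchange buffer, so contention is low,
    // but removeMin may miss a smaller element held in another thread's heap.
    public class ApproxPriorityQueue {
        private final ConcurrentLinkedQueue<Integer> exchange = new ConcurrentLinkedQueue<>();
        private final ThreadLocal<PriorityQueue<Integer>> local =
            ThreadLocal.withInitial(PriorityQueue::new);
        private final ThreadLocal<Integer> opCount = ThreadLocal.withInitial(() -> 0);
        private static final int EXCHANGE_PERIOD = 64; // tune to taste

        public void add(int x) {
            local.get().add(x);
            maybeExchange();
        }

        // Returns some small element, not necessarily the global minimum;
        // returns null if this thread currently has nothing available.
        public Integer removeMin() {
            PriorityQueue<Integer> mine = local.get();
            Integer shared = exchange.poll(); // pick up another thread's offering
            if (shared != null) mine.add(shared);
            maybeExchange();
            return mine.poll();
        }

        // Every EXCHANGE_PERIOD operations, publish our current minimum so
        // other threads get a chance to see the small elements we are holding.
        private void maybeExchange() {
            int n = opCount.get() + 1;
            opCount.set(n);
            if (n % EXCHANGE_PERIOD == 0) {
                Integer min = local.get().poll();
                if (min != null) exchange.add(min);
            }
        }
    }

The trade-off is explicit: you give up the strict "always the minimum" guarantee in exchange for threads mostly working on their own data. Whether that is acceptable depends on your application; many uses of priority queues tolerate this kind of relaxation well.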
A Google search will find quite a lot of papers about different designs for concurrent priority queues.
Gavin