It's true that spawning multiple threads on a multi-core CPU does not guarantee that they will run on multiple cores; the hardware and/or OS decide that for you. Generally this happens automatically without you needing to worry about it, but if you do some multi-threaded processing for a short duration and expect it to always use all cores, you can get confusing results. This often happens when you time a single iteration of single-threaded vs multi-threaded code and are surprised that the multi-threaded version is not faster. If you run the same test for a longer duration (eg: 1 or 2 seconds), it is reasonably safe to assume that intensive multi-threaded code will be spread across all 4 cores. You might be lucky and find that your 20ms of code runs on multiple cores (eg: if they were already powered up because a heavy multi-core app such as the camera or web browser was running at the same time), but for measuring performance you should measure over a long interval (this is recommended for normal performance testing anyway, including single-core code).
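To make the "measure over a long interval" advice concrete, here is a minimal sketch of a timing helper that keeps repeating a workload until a minimum wall-clock time has passed, then reports the average time per iteration. The names `BenchResult` and `run_for_at_least` are just made up for this illustration, and the workload you pass in is whatever code you are trying to measure.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

// Result of a timed run: how many iterations ran and the mean time per iteration.
// (Hypothetical helper type, not from any library.)
struct BenchResult {
    std::uint64_t iterations;
    double seconds_per_iteration;
};

// Calls `work` repeatedly until at least `min_seconds` of wall-clock time has
// elapsed, so that start-up effects are averaged over many iterations.
BenchResult run_for_at_least(const std::function<void()>& work, double min_seconds) {
    assert(min_seconds > 0.0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to interval timing
    const auto start = clock::now();
    std::uint64_t iterations = 0;
    double elapsed = 0.0;
    do {
        work();
        ++iterations;
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (elapsed < min_seconds);
    return {iterations, elapsed / static_cast<double>(iterations)};
}
```

For example, `run_for_at_least(myWork, 2.0)` would run `myWork` for about 2 seconds in total, which should give a much more stable per-iteration figure than timing one 20ms call.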
To put it into perspective, let's say for simplicity that your OS scheduler runs just once every 10 milliseconds (a common interval). If you create new threads, they probably won't even get a chance to start for roughly that long. Then both the OS & CPU hardware have to detect, based on recent history, that it is worth powering up more cores rather than just increasing the clock frequency of the cores that are already running (more cores will not be powered up unless it really looks worthwhile, since that results in higher power draw). If more cores do get powered up, there is a delay until they are ready, and only then will the OS start moving the threads you created onto them. So if each of these steps happens at roughly 10 millisecond intervals, it's not surprising that it can take hundreds of milliseconds for your code to be fully spread across 4 cores.
Like I said, running a test for at least 1 or 2 seconds should be a safe bet (either by repeating your test many times or by using bigger data). Depending on how parallel the code is, you can definitely get very close to a 4x speedup by using 4 cores, for example with camera image processing.
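If you want to check the speedup yourself, something along these lines could work. This is only a rough sketch using `std::thread`: the workload in `process_range` is a made-up CPU-bound loop standing in for your real processing, and `process_parallel` and `time_runs` are hypothetical helper names. The actual speedup you see will depend on your device and on how parallel your real code is.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Made-up CPU-bound workload: accumulates a simple integer mix over [begin, end).
std::uint64_t process_range(std::uint64_t begin, std::uint64_t end) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = begin; i < end; ++i) {
        acc += (i * 2654435761u) ^ (i >> 3);
    }
    return acc;
}

// Splits [0, n) into contiguous chunks, runs one chunk per thread and
// returns the combined result (identical to process_range(0, n)).
std::uint64_t process_parallel(std::uint64_t n, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::uint64_t chunk = n / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::uint64_t begin = t * chunk;
        // The last thread also takes any remainder left by the integer division.
        const std::uint64_t end = (t + 1 == num_threads) ? n : begin + chunk;
        workers.emplace_back([&partial, t, begin, end] {
            partial[t] = process_range(begin, end);
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}

// Times `repeats` back-to-back runs and returns the total wall-clock seconds.
// Choose n and repeats so that the total is at least 1 or 2 seconds.
double time_runs(std::uint64_t n, unsigned num_threads, int repeats) {
    using clock = std::chrono::steady_clock;
    volatile std::uint64_t sink = 0;  // stops the compiler discarding the work
    const auto start = clock::now();
    for (int r = 0; r < repeats; ++r) {
        sink = process_parallel(n, num_threads);
    }
    return std::chrono::duration<double>(clock::now() - start).count();
}
```

The speedup estimate is then `time_runs(n, 1, repeats) / time_runs(n, 4, repeats)`. With a long enough run on an otherwise idle 4-core device I would expect it to approach 4 for a workload like this, but treat whatever number you measure as the real answer.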
Cheers,
Shervin.