Scraping GMC data


pharry...@googlemail.com

Nov 2, 2016, 6:41:17 AM
to nhshackday

Hi,

 

Is anyone interested in helping to scrape doctors' details from the GMC website?

If you are interested, please let me know and I’ll assign you a range of numbers to scan (to avoid duplicating efforts).

Or feel free to scan outside the ranges below, though I doubt you'll find anything, except maybe in the 7600000+ range.

As soon as my scans are complete I’ll

 

GMC numbers contain 7 digits giving 10M possibilities, but a brute force scan of all possibly existing GMC numbers on the GMC website would require a scan of only about 1610k numbers.

From experience I’d expect that about 20-25% of these numbers actually (still) exist.

Running two PowerShell processes at once I can do about 5k numbers an hour, so that would be 15 days, without interruption.

Two sessions occasionally peak at 10 Mbps of network traffic and about 150 MB of memory, so there is more than enough bandwidth and memory left for other stuff.

 

Scan extracts (for existing numbers):

Extraction datetime

Full Doctor details

Full Doctor history

 

Example Powershell script available at:

https://github.com/DutchHarry/GMC/tree/master

 

Numbers to be scanned:

Old check digit numbers:

000000-499999 (plus check digit) :  500k

Former LR numbers:

5900000-5999999                  :  100k

5000000-5209999                  :  210k

Former FPR numbers:

6000000-6179999                  :  180k

New numbers:

  (since abandoning check digit)

7000000-7600000                  :  600k

Total                            : 1610k
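
For anyone wanting to check the arithmetic, here is a minimal Python sketch that counts the candidates in the ranges listed above and estimates scan time at a given rate. The labels and function names are illustrative only, and the check digit itself is not modelled (only the six-digit base numbers are counted). Note that the ranges exactly as listed add up to about 1590k rather than the 1610k quoted, so one of the ranges may have been slightly wider in practice. At 5k numbers an hour, 1610k works out to roughly 13.4 days.

```python
# Candidate ranges copied from the list above (inclusive bounds).
# Labels are illustrative; the check digit is NOT modelled here.
RANGES = [
    ("old check-digit numbers (base part only)", 0, 499_999),
    ("former LR numbers", 5_900_000, 5_999_999),
    ("former LR numbers", 5_000_000, 5_209_999),
    ("former FPR numbers", 6_000_000, 6_179_999),
    ("new numbers", 7_000_000, 7_600_000),
]


def range_size(first, last):
    """Number of integers in the inclusive range first..last."""
    return last - first + 1


def total_candidates(ranges=RANGES):
    """Total count of candidate numbers across all ranges."""
    return sum(range_size(first, last) for _, first, last in ranges)


def days_to_scan(count, per_hour):
    """Days of uninterrupted scanning needed for `count` numbers."""
    return count / per_hour / 24


if __name__ == "__main__":
    for label, first, last in RANGES:
        print(f"{label:45s} {range_size(first, last):>9,}")
    print(f"{'total':45s} {total_candidates():>9,}")
    print(f"days at 5k/hour for 1610k: {days_to_scan(1_610_000, 5_000):.1f}")
```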

 

 

I have already done the first 500k and am making headway with the remainder.

 

 

Purpose:

First of all, the quarterly ODS consultants files (econcur+wconcur) are not that up to date, and the same applies to the weekly egmcmem with the GMC numbers for GPs.

So for certain analytics you may want to know the existing GMC numbers, so you can identify 'made-up' codes which the hospital may use to signal different activity (they are not supposed to do it that way, but alas).

And if you're really paranoid or cynical and have access to clear data you might want to check if there are hospitals who are still using codes of long retired (even deceased) consultants, who are still treating long removed (emigrated or deceased) patients. After all it's not only Virgin Care that tries to screw its commissioners, in the NHS this type of activity, apparently considered fraud elsewhere (even in Nigeria), is 'business as usual'.
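
A rough sketch of the kind of check described above, assuming you already hold the scraped register as a mapping from GMC number to a status string. The mapping layout, the status value "registered" and the function name are assumptions for illustration, not the actual fields on the GMC website.

```python
def classify_codes(used_codes, register):
    """Split consultant codes found in activity data into three groups.

    used_codes: iterable of GMC numbers (as strings) seen in the data.
    register:   dict of GMC number -> status string (hypothetical layout;
                "registered" is an assumed value for a current doctor).
    """
    result = {
        "current": set(),
        "no_longer_registered": set(),
        "not_on_register": set(),  # candidates for 'made-up' codes
    }
    for code in used_codes:
        status = register.get(code)
        if status is None:
            result["not_on_register"].add(code)
        elif status == "registered":
            result["current"].add(code)
        else:
            result["no_longer_registered"].add(code)
    return result


if __name__ == "__main__":
    # Invented sample data, purely to show the shape of the result.
    register = {"7000001": "registered", "6000002": "deceased"}
    used = ["7000001", "6000002", "9999999"]
    for group, codes in classify_codes(used, register).items():
        print(group, sorted(codes))
```

Codes landing in "not_on_register" are only candidates for made-up codes; they could equally be numbers the scan missed.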

 

Also interesting is the percentage of foreign-trained persons on the specialist register and of foreign-trained GPs, just to see whether a couple of £100M for additional medical school places to reduce ‘dependency on foreigners’ is more than just 'gesture politics'.

 

Cheers

 

 

Neville Dastur

Nov 2, 2016, 6:51:57 AM
to nhsha...@googlegroups.com
Happy to help.

BTW will there be a public API with the data?

Also what are the plans to keep the data up to date after the initial scrape.

Neville
--
Clinical Software Solutions: http://www.clinsoftsolutions.com
Find our free and paid apps on the iTunes Apple store and Android Google Play store
hospify.com - secure healthcare communication



--
You received this message because you are subscribed to the Google Groups "nhshackday" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nhshackday+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Helen Jackson

Nov 2, 2016, 6:53:49 AM
to nhsha...@googlegroups.com
+1 (10000000) to API
Ping @pacharanero and @thatdavidmiller..

Marcus Baw

Nov 2, 2016, 6:54:41 AM
to nhsha...@googlegroups.com
Apparently (I was told yesterday) GMC numbers and other information can be obtained from the NHS Spine Demographic Service, which is an API (of sorts)

Anyone know if this is true?

@harry if true would SDS help you in your project?


Marcus Baw

Nov 2, 2016, 6:56:07 AM
to nhsha...@googlegroups.com
@helen - thanks

also pinging @jakobmathiszig-lee

M

pharry...@googlemail.com

Nov 2, 2016, 7:45:16 AM
to nhshackday
It's a moving target, unfortunately.
What I have so far is in an Excel file here:
https://drive.google.com/open?id=0BxQ_GTAUuHlCfkIyWGtmckgyRDhDZElSUmFYNXFOSDFkLU1zYzJWdlEwTTlMd0RxMjRWY0U

What this exercise does is scan the potentially relevant numbers: out of 10,000k possible numbers, that's only 1610k at the moment.
After that, a rescan would only need to cover the roughly 300k existing (non-deceased) numbers plus the range of potential new numbers (currently near the 76nnnnn range), so that's relatively few.
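
As a sketch of that rescan plan (not the logic of the actual PowerShell script): take every known number that is not deceased, then add a window of not-yet-seen numbers just above the highest one found so far. The status value "deceased", the parameter names and the size of the look-ahead window are all assumptions.

```python
def rescan_targets(known_status, new_range_start, lookahead):
    """Build the list of GMC numbers to visit on a rescan.

    known_status:    dict of int GMC number -> status string (assumed layout).
    new_range_start: first number of the range where new numbers are issued.
    lookahead:       how many unseen numbers above the highest known one to
                     probe (an arbitrary choice made by the caller).
    """
    # Existing numbers worth revisiting: everything not marked deceased.
    live = sorted(n for n, status in known_status.items() if status != "deceased")
    # Probe just above the highest number seen so far.
    highest = max(known_status, default=new_range_start - 1)
    start = max(new_range_start, highest + 1)
    frontier = list(range(start, start + lookahead))
    return live + frontier
```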

API would be nice, but GMC would have to do it, as they 'own' the data. Apparently even NHS has to pay for it.
AFAIK spine only has the equivalent of the ODS (SDS) data, that's econcur+wconcur, but I'll check

I'll wait till tomorrow morning, and then divide the remaining numbers to scan into a few batches for you to have a go.

Cheers

Kevin Mayfield

Nov 2, 2016, 9:16:44 AM
to nhshackday
Yes, most of the ODS/SDS is available via the Spine Directory Service. You would use LDAP queries to interrogate it from software. We used LDAPAdmin (http://www.ldapadmin.org/) to browse the data and help construct our queries.

You need correct certificates to access (don't have details, we gained access via another project)
 
 

pharry...@googlemail.com

Nov 2, 2016, 9:46:24 AM
to nhshackday
response from NHS Digital:

My immediate question is what you intend to use the GMC data for, and what information is it you require (over and above the GMC identifier, obviously)?

 

The files we publish come from a variety of sources and use the GMC to identify an individual within a particular context. Econcur and Wconcur are hospital consultants, taken from the NHS Electronic Staff Record system. Egmcmem is a mapping of GMC to GP Prescriber codes. Spine does indeed hold some GMC codes but only those that we match to users that already hold a smart card (and therefore have a Spine record).

 

So – none of our listings that contain GMC somewhere will be ‘complete’ i.e. not everyone registered with the GMC goes on to become a hospital consultant, or a GP prescriber, or be allocated a smart card. In addition, despite the fact we are provided a copy of the register by the GMC, this is under the terms of a license agreement that restricts what we are able to share (essentially the code and name only) – so I can’t give you a dump of the GMC data in full anyway.

 

You are able to obtain a full extract of the GMC register direct from GMC, although you will have to pay…


pharry...@googlemail.com

Nov 2, 2016, 11:59:47 AM
to nhshackday
Also have asked GMC this:

 

Can you provide a monthly extract of the full GMC register free of charge for (NHS clients related) analytical purposes.

(Or even better, create an API interface that anyone can use to automate data requests)

 

This would save us scraping this data periodically from your website, which is a bit of a pain as obviously we don’t know which numbers exist, and it takes a lot of valuable time.

 

 

Kindest Regards


pharry...@googlemail.com

Nov 2, 2016, 12:14:49 PM
to nhshackday

I'm afraid I don't need your help now. Only about 280k numbers left to scan. As overnight I can run 7 processes at a bit more than 2.5k/hour it should be (nearly) finished by tomorrow morning.

By way of excuse, a short explanation of how this came about:
A bit of an 'advantage of a disadvantage' (to paraphrase the philosopher Johan Cruijff)
I was a bit ill over the weekend and the last few days, so I set the machine running 8 processes at once over 4 days. Without other processes intervening this rarely broke, so when I made up the balance a few minutes ago, only 280k were left.

Will post (new) link to full set tomorrow

Thanks for the offer Neville

Cheers

Harry

Gavin Jamie

Nov 2, 2016, 12:17:11 PM
to nhsha...@googlegroups.com
At the risk of being a bit of a wet blanket, I suspect that you could be asked to desist by the GMC or even prosecuted for breach of copyright. They are likely to point towards their £600 a year service for supplying information. Not saying that I approve, just be careful.

--
Gavin Jamie
QOF Database - www.gpcontract.co.uk

pharry...@googlemail.com

Nov 2, 2016, 12:39:58 PM
to nhshackday, q...@gpcontract.co.uk
Thanks Gavin, I was aware that that's an option they have, and that they may be inclined to go that route, especially now that I have explicitly made them aware by asking.
Would not be their smartest option, I'd guess.

Gavin Jamie

Nov 2, 2016, 1:26:39 PM
to nhsha...@googlegroups.com
They stand to lose £66,000 which may influence their decision...

--
Gavin Jamie
QOF Database - www.gpcontract.co.uk


JCE H

Nov 2, 2016, 1:58:52 PM
to nhshackday, q...@gpcontract.co.uk

:)

I can remember, about 10 years ago and several times since, asking the NACS/ODS/TRUD people for a full list of consultant codes with start and end dates, because all sorts of weird codes were turning up in CDS/SUS/HES and they only provide snapshots, which aren't much use for historical trending. Eventually the GMC offered the extract service, but still the NHS cannot get decent data without paying. Given the financial environment, people will find ways around it to understand the data in front of them, by hook or by crook. I suspect that, like everyone else, I have no interest in making money, but I do object to having my job made more difficult by someone else making money and putting pressure on an already stretched public purse.

pharry...@googlemail.com

Nov 2, 2016, 3:26:54 PM
to nhshackday, q...@gpcontract.co.uk
That's why part of my email to GMC was in bold typeface.
No intention to sell the data, just make it available for analytics. So I don't see the revenue losses you mention.

We'll see; I prefer to tell them what I'm doing upfront.



pharry...@googlemail.com

Nov 2, 2016, 3:31:57 PM
to nhshackday
For a complete record, my reply to NHS Digital:

Hi Mike,

 

Thanks for the swift response, that answers my question.

 

To answer your question:

I use the GMC numbers to identify consultant codes providers use in their data for contracting, commissioning and pathway analytics for my clients.

Providers use more numbers than the ones available in econcur and wconcur, and occasionally even no-longer existing numbers (retired or deceased consultants)

Don’t think even NHS Digital in SUS actually enforces that much (if any) data quality on consultant codes and referrer codes, and many other codes for that matter; even the data dictionary got watered down a bit, when ‘mandatory’ was changed to ‘mandatory, where available’ :>(

 

The odd provider uses ‘made up’ codes to flag certain activity for their commissioners, unfortunately the ‘business rules/data dictionaries’ of these ‘flags’ are not always shared by the so called ‘lead-commissioner’, so sometimes we have to figure this out by some guesswork.

 

Cheers

 

Harry

JCEH

Nov 3, 2016, 5:23:16 AM
to nhsha...@googlegroups.com
 
Think you sent that to my Optum account. BCC'ed me in?
 



pharry...@googlemail.com

Nov 3, 2016, 7:00:29 AM
to nhshackday
Slightly modified script and results on github:
https://github.com/DutchHarry/GMC

Re API:
The code can easily be adapted to operate 'API-like', e.g. as a function that takes one number as an argument and returns all GMC details if the number is available through the web interface.
But that's too slow for processing larger datasets. For those you need the data in your own database to optimise for speed.
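
To illustrate the trade-off, here is a minimal Python sketch (not the PowerShell script in the repository) of an 'API-like' lookup backed by a local SQLite table, so repeated lookups never touch the website. The table layout and function names are assumptions, and `fetch` is a hypothetical caller-supplied function that performs the actual web lookup and returns a dict of details, or None if the number does not exist.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the local store; the schema is an assumed layout."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doctor ("
        "gmc_number TEXT PRIMARY KEY, details TEXT NOT NULL)"
    )
    return conn


def lookup(conn, gmc_number, fetch):
    """Return details for one GMC number, using the local copy when present.

    fetch: hypothetical callable doing the slow web lookup; it is only
    called when the number is not yet in the local database.
    """
    row = conn.execute(
        "SELECT details FROM doctor WHERE gmc_number = ?", (gmc_number,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    details = fetch(gmc_number)
    if details is None:
        return None  # number not available through the web interface
    conn.execute(
        "INSERT INTO doctor (gmc_number, details) VALUES (?, ?)",
        (gmc_number, json.dumps(details)),
    )
    conn.commit()
    return details
```

For bulk analytics you would load the whole scrape into such a table once and join against it, rather than calling `lookup` row by row.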

Cheers

Harry

Eirinaios Theodorou

Nov 25, 2018, 7:55:02 AM
to nhshackday
Hi, I know this is 2 years after your last commit on GitHub. I am trying to make it work with python and RoboBrowser. I posted a question on stack overflow: stackoverflow question
Can you please help me out? I will really appreciate it because I have been stuck here for ages now and can't get the status of any GMC number!