I loaded DB dump to my dev environment and analyzed data. This is my finding and ideas:
The database copy of Feb 13, 2016 had 49,881 records in the aabuddy_meeting table. All of them had internal_type "submitted" and the is_active value = t
SELECT count (id) as "cnt", internal_type, is_active FROM public.aabuddy_meeting
group by internal_type, is_active
(It is convenient to use these fields to mark records which need to be edited and removed)
46867 entries had data_source value equal to "KEEP", 2975 had "user_submitted" and 39 had "ohio45208.psv"
SELECT count (id) as "cnt", data_source FROM public.aabuddy_meeting group by data_source
Most active users submitting meeting:
|
# of entries |
Data source |
ID |
Username |
|
46867 |
KEEP |
1 |
admin |
|
1009 |
User submitted |
1 |
admin |
|
231 |
User submitted |
10 |
|
|
69 |
User submitted |
807 |
|
|
65 |
User submitted |
9 |
|
|
62 |
User submitted |
3506 |
|
|
59 |
User submitted |
5239 |
|
|
53 |
User submitted |
170 |
|
|
44 |
User submitted |
4432 |
|
|
41 |
User submitted |
3629 |
|
|
39 |
ohio45208.psv |
1 |
admin |
|
39 |
User submitted |
5123 |
|
|
35 |
User submitted |
1019 |
|
|
35 |
User submitted |
2418 |
|
|
32 |
User submitted |
11 |
|
|
31 |
User submitted |
4269 |
|
|
31 |
User submitted |
5598 |
|
|
30 |
User submitted |
1627 |
SELECT count (public.aabuddy_meeting.id) as "cnt", data_source, creator_id , username
FROM public.aabuddy_meeting
Left Join public.auth_user on creator_id=public.auth_user.id
group by data_source, creator_id, username
order by count (public.aabuddy_meeting.id) DESC, creator_id
Cases of Simple Full Duplicates
There are 48 records with identical values for day_of_week, start_time, end_time, name, description, address
SELECT count (id) as "cnt", day_of_week, start_time, end_time, name, description, address
FROM public.aabuddy_meeting
where is_active='t'
group by internal_type, geo_location,
day_of_week, start_time, end_time, name, description, address
having count (id)>1
order by name, description, address
Every record had exactly one twin copy, so the following query corner them by getting the Max(id):
update public.aabuddy_meeting set is_active='f', internal_type='dpl simpl'
where id in ( SELECT max(id) FROM public.aabuddy_meeting
where is_active='t'
group by geo_location, day_of_week, start_time, end_time, name, description, address
having count (id)>1
);
(To simplify processing I removed a special character in the address field of one record:
update public.aabuddy_meeting set address='20 E 91st St, New York, NY 10128' where id=265481;
)
For some reason majority of the addresses for New York, NY does not have Zip code, to be exact out of 1160 “New York, NY” addresses only 15 has Zip codes. The 1145 New York addresses without Zip Codes do have multiple derivative duplicates with State Code NY replaced with some random state code, like
|
id |
name |
address |
|
217010 |
BRONXVILLE (:II) |
1 Mead Way, 2nd Floor - Room #22 , New York, MA |
|
243067 |
BRONXVILLE (:II) |
1 Mead Way, 2nd Floor - Room #22 , New York, ME |
|
185813 |
BRONXVILLE (:II) |
1 Mead Way, 2nd Floor - Room #22 , New York, NY |
|
213519 |
BRONXVILLE (:II) |
1 Mead Way, 2nd Floor - Room #22 , New York, PA |
I marked these entries as “NY not NY” using the update query:
update public.aabuddy_meeting set is_active='f', internal_type='NY not NY'
where id in (
SELECT id
FROM public.aabuddy_meeting
where is_active='t'
and (right (trim(address), 2) ~ '^[0-9]+$' = 'f')
and address like '% New York, %'
and right (trim(address), 2) <> 'NY'
);
This method gave 89 entries in 27 groups that didn’t have a single original New York, NY entry. I marked these 89 records as “All not NY”:
update public.aabuddy_meeting set is_active='f', internal_type='All not NY'
where left( trim(address), length(trim(address)) - 2) in (
SELECT left( trim(address), length(trim(address)) - 2) as BeforeStateCode
FROM public.aabuddy_meeting
where (right (trim(address), 2) ~ '^[0-9]+$' = 'f')
and address like '% New York, %'
group by left( trim(address), length(trim(address)) - 2)
having max(Cast(is_active as integer)) = 0
)
and internal_type='NY not NY';
3. Cases of Multistate Addresses
It was easy to find out that the problem with random state code seeding was not unique to New York, NY. The following query shows all entries with identical meeting names, days of week and the part of the address field before the state value:
SELECT count(id) as "cnt", name, day_of_week, left( trim(address), length(trim(address)) - 2) as BeforeStateCode,
min(address) as MinAddress, max(address) as MaxAddress
FROM public.aabuddy_meeting
where (right (trim(address), 2) ~ '^[0-9]+$' = 'f') and is_active = 't'
group by name, day_of_week, left( trim(address), length(trim(address)) - 2)
having min(address) <> max(address) and count(id)>1
It retrieved 2,232 rows (when NY is deactivated). Some of this entries are valid meetings, but I can’t find a way to figure out which of them, so for a while I just mark these records “multistate” and keep them active:
update public.aabuddy_meeting set internal_type='multistate'
where left( trim(address), length(trim(address)) - 2) in (
SELECT left( trim(address), length(trim(address)) - 2) as BeforeStateCode
FROM public.aabuddy_meeting
where (right (trim(address), 2) ~ '^[0-9]+$' = 'f') and is_active = 't'
group by name, day_of_week, left( trim(address), length(trim(address)) - 2)
having count(id)>1 and min(address) <> max(address)
);
This query affected 9,617 rows. You can review the records marked as “multistate” with the query
SELECT count(id) as "cnt",
left( trim(address), length(trim(address)) - 2) as BeforeStateCode,
name, description, day_of_week, start_time, end_time,
min(right (trim(address), 2)) as MinAddress, max(right (trim(address), 2)) as MaxAddress
FROM public.aabuddy_meeting
where internal_type='multistate' and is_active = 't'
group by left( trim(address), length(trim(address)) - 2), name, description, day_of_week, start_time, end_time
having count(id) > 1
order by left( trim(address), length(trim(address)) - 2), name, description, day_of_week, start_time, end_time, count(id) DESC;
It groups 9617 records into 2233 rows; we need to go through all of them and identify the correct address in every group. As the result we are going to remove about 7300 bogus meetings in the process. The output of this query is attached; please, share your ideas where all these entries came from and how to clear them out.
My very last query allows to identify the “double addressing” of geo location, ie returns rows with identical coordinates and different address values:
select "NumOfAddresses", "SumOfMeetings", "MinAddress", "MaxAddress",
ST_Y(geo_location::geometry) as "latitude",
ST_X(geo_location::geometry) as "longitude", geo_location
from (
select grouped_addresses.geo_location,
min(address) as "MinAddress",max(address) as "MaxAddress",
count("NumOfMeetings") as "NumOfAddresses",
sum("NumOfMeetings") as "SumOfMeetings"
from (
SELECT geo_location, address, count(id) as "NumOfMeetings"
FROM public.aabuddy_meeting
where is_active = 't' and internal_type='submitted'
group by geo_location, address
having count(id)>1
) as grouped_addresses
group by geo_location
having count("NumOfMeetings") > 2
) as frequent_locations
order by "NumOfAddresses" DESC, "SumOfMeetings" DESC;
Lo and Behold! After the clearing made it returned only 28 rows, all of them caused by different spellings of the same addresses. At the beginning of the process it returned about five thousand entries.
Happy Presidents' Day,
Michael
The recent goings on have left me in a rather unserene frame of mind. I am going to take a month off and try to get back into it with a clear head afterwards. Tentative resumption date is Saturday March 19th.
--
You received this message because you are subscribed to the Google Groups "mcjug-android" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mcjug-androi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
I think your conclusions about the Global Meeting Scraper data having some bad data are probably correct. But please make sure that you're viewing the correct output data (there are about 3 or 4 directories) because they have been filtered many times.
We originally knew (and accepted the fact) that the meetings found by Global Meeting Scraper included bogus meetings. But we were willing to accept that fact in order to get any meetings in the database. You can go to the original Web site if you want to see how really bad/inaccurate the data was.
Also keep in mind, we may have gone from 15000 meetings per state down to 10k, down to 5k meetings after much automated filtering if bad meetings was done. We had discussed manually going through each state, but there wasn't enough support for that at the time. Victor can fill you in about how AA stores their meetings online (hint... not centralized, and not in a common format), and he can also explain to you why things were done (or not done) for the past two years on the project.
Yep, I would agree that the Meeting Scraper data (not Global Meeting Scraper) is correct, and at one time, that metro area data was in the database, including all the DC meetings. At some point in time (perhaps during a database recovery), those particular meetings were not reloaded.
If you have some suggestions (I would need to go back and read your posting) about how to clean up existing meetings that are bogus, without deleting valid meetings, then that would be good. But I'd clear it with Victor first since a) he knows the history of what was done and why things were done, and b) I think he mentioned that (for right now, at least) he didn't want to mess with anything server-side, because it is actively being worked upon.
Talk to you later.
...
--
--
Maybe possible to do it using the MapPoint on my laptop☺
I will try.
| DEBUG 2016-02-18 23:28:56,962 views 2949 140676865280320 About to register youknowyouwa...@gmail.com |
| ERROR 2016-02-18 23:28:57,066 base 2949 140676865280320 Internal Server Error: /meetingfinder/create_user |
| Traceback (most recent call last): |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/core/handlers/base.py", line 113, in get_response |
| response = callback(request, *callback_args, **callback_kwargs) |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view |
| return view_func(*args, **kwargs) |
| File "/var/www/aabuddy/releases/current/aabuddy/environments/../aabuddy/views.py", line 227, in create_user |
| user.save() |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/base.py", line 546, in save |
| force_update=force_update, update_fields=update_fields) |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/base.py", line 650, in save_base |
| result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw) |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/manager.py", line 215, in _insert |
| return insert_query(self.model, objs, fields, **kwargs) |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/query.py", line 1675, in insert_query |
| return query.get_compiler(using=using).execute_sql(return_id) |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 943, in execute_sql |
| cursor.execute(sql, params) |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/backends/util.py", line 41, in execute |
| return self.cursor.execute(sql, params) |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 58, in execute |
| six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2]) |
| File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 54, in execute |
| return self.cursor.execute(query, args) |
| DatabaseError: value too long for type character varying(30) |
|
Total Records |
1,464 |
|
|
Not Found States |
8 |
0.55% |
|
2 States |
60 |
4.10% |
|
3 States |
9 |
0.61% |
|
Wrong State |
8 |
0.55% |
|
Correct Answers |
1404 |
95.90% |
It had some weird ideas when asked to get something smaller than state.
| internal_type | Count | Meaning |
| multistate | 9,199 | Bad entries, mostly the ones with randomly assigned multiple states |
| submitted | 9,654 | Latest 3k+ entries and the ones which I could not reconcile |
| verCityZip | 20,314 | Fair, Google address had matched Zip and City, but not street address |
| verified | 8,635 | Best, Google addr Street, City and Zip Match |
| ver-spel | 770 | Good, same geo_location as verified, but the address spelled differently. |