taking a month off

3 views
Skip to first unread message

victor semenov

unread,
Feb 13, 2016, 6:23:45 PM2/13/16
to mcjug-...@googlegroups.com
The recent goings on have left me in a rather unserene frame of mind. I am going to take a month off and try to get back into it with a clear head afterwards. Tentative resumption date is Saturday March 19th.

Michael Semenov

unread,
Feb 14, 2016, 11:58:25 PM2/14/16
to mcjug-...@googlegroups.com

I loaded DB dump to my dev environment and analyzed data. This is my finding and ideas:

A. Some Statistic

The database copy of Feb 13, 2016 had 49,881 records in the aabuddy_meeting table. All of them had internal_type  "submitted" and the is_active value = t

SELECT count (id) as "cnt", internal_type, is_active FROM public.aabuddy_meeting

group by internal_type, is_active


(It is convenient to use these fields to mark records which need to be edited and removed)


46867 entries had data_source value equal to "KEEP", 2975 had "user_submitted" and 39 had "ohio45208.psv"

SELECT count (id) as "cnt", data_source FROM public.aabuddy_meeting group by data_source


Most active users submitting meeting:

# of entries

Data source

ID

Username

46867

KEEP

1

admin

1009

User submitted

1

admin

231

User submitted

10

maria.k...@gmail.com

69

User submitted

807

igor...@gmail.com

65

User submitted

9

victor....@gmail.com

62

User submitted

3506

jana...@gmail.com

59

User submitted

5239

animo...@gmail.com

53

User submitted

170

lcuthber...@yahoo.com

44

User submitted

4432

Eric...@gmail.com

41

User submitted

3629

the2...@aol.com

39

ohio45208.psv

1

admin

39

User submitted

5123

peacewo...@gmail.com

35

User submitted

1019

texever...@gmail.com

35

User submitted

2418

Sirh...@gmail.com

32

User submitted

11

david.dit...@gmail.com

31

User submitted

4269

elj...@gmail.com

31

User submitted

5598

ahers...@gmail.com

30

User submitted

1627

shayn...@gmail.com

 

SELECT count (public.aabuddy_meeting.id) as "cnt", data_source, creator_id , username

FROM public.aabuddy_meeting

Left Join public.auth_user on creator_id=public.auth_user.id

group by data_source, creator_id, username

order by count (public.aabuddy_meeting.id) DESC, creator_id

 

B. Data Inconsistencies and Fishy entries

 

Cases of Simple Full Duplicates

There are 48 records with identical values for day_of_week, start_time, end_time, name, description, address

SELECT count (id) as "cnt", day_of_week, start_time, end_time, name, description, address      

FROM public.aabuddy_meeting

where is_active='t'

group by internal_type, geo_location,

day_of_week, start_time, end_time, name, description, address

having count (id)>1

order by name, description, address


Every record had exactly one twin copy, so the following query corner them by getting the Max(id):

update public.aabuddy_meeting set is_active='f', internal_type='dpl simpl'

where id in ( SELECT max(id) FROM public.aabuddy_meeting

                where is_active='t'

                group by geo_location, day_of_week, start_time, end_time, name, description, address

                having count (id)>1

);


Cases of New York not NY

(To simplify processing I removed a special character in the address field of one record:

update public.aabuddy_meeting set address='20 E 91st St, New York, NY 10128' where id=265481;

)

For some reason majority of the addresses for New York, NY does not have Zip code, to be exact out of 1160 “New York, NY” addresses only 15 has Zip codes. The 1145 New York addresses without Zip Codes do have multiple derivative duplicates with State Code NY replaced with some random state code, like

id

name

address

217010

BRONXVILLE (:II)

1 Mead Way,  2nd Floor - Room #22 , New York, MA

243067

BRONXVILLE (:II)

1 Mead Way,  2nd Floor - Room #22 , New York, ME

185813

BRONXVILLE (:II)

1 Mead Way,  2nd Floor - Room #22 , New York, NY

213519

BRONXVILLE (:II)

1 Mead Way,  2nd Floor - Room #22 , New York, PA

 

I marked these entries as “NY not NY” using the update query:

 

update public.aabuddy_meeting set is_active='f', internal_type='NY not NY'

where id in (

                SELECT id

                FROM public.aabuddy_meeting 

                where is_active='t'

                                and (right (trim(address), 2) ~ '^[0-9]+$' = 'f')

                                and address like '% New York, %'

                                and right (trim(address), 2) <> 'NY'

);

This method gave 89 entries in 27 groups that didn’t have a single original New York, NY entry. I marked these 89 records as “All not NY”:

update public.aabuddy_meeting set is_active='f', internal_type='All not NY'

where left( trim(address), length(trim(address)) - 2) in (

                SELECT left( trim(address), length(trim(address)) - 2) as BeforeStateCode

                FROM public.aabuddy_meeting 

                where (right (trim(address), 2) ~ '^[0-9]+$' = 'f')

                                and address like '% New York, %'

                group by left( trim(address), length(trim(address)) - 2)

                having max(Cast(is_active as integer)) = 0

)

and internal_type='NY not NY';

 

3.       Cases of Multistate Addresses

It was easy to find out that the problem with random state code seeding was not unique to New York, NY. The following query shows all entries with identical meeting names, days of week and the part of the address field before the state value:


SELECT count(id) as "cnt", name, day_of_week, left( trim(address), length(trim(address)) - 2) as BeforeStateCode,

                min(address) as MinAddress, max(address) as MaxAddress

FROM public.aabuddy_meeting

where (right (trim(address), 2) ~ '^[0-9]+$' = 'f')  and is_active = 't'

group by name, day_of_week, left( trim(address), length(trim(address)) - 2)

having min(address) <> max(address) and count(id)>1


It retrieved 2,232 rows (when NY is deactivated). Some of this entries are valid meetings, but I can’t find a way to figure out which of them, so for a while I just mark these records “multistate” and keep them active:


update public.aabuddy_meeting set internal_type='multistate'

where left( trim(address), length(trim(address)) - 2) in (

                SELECT left( trim(address), length(trim(address)) - 2) as BeforeStateCode

                FROM public.aabuddy_meeting

                where (right (trim(address), 2) ~ '^[0-9]+$' = 'f')  and is_active = 't'

                group by name, day_of_week, left( trim(address), length(trim(address)) - 2)

                having count(id)>1 and min(address) <> max(address)

);


This query affected 9,617 rows. You can review the records marked as “multistate” with the query


SELECT count(id) as "cnt",

                left( trim(address), length(trim(address)) - 2) as BeforeStateCode,

                name, description, day_of_week, start_time, end_time,

                min(right (trim(address), 2)) as MinAddress, max(right (trim(address), 2)) as MaxAddress

FROM public.aabuddy_meeting

where internal_type='multistate' and is_active = 't'

group by left( trim(address), length(trim(address)) - 2), name, description, day_of_week, start_time, end_time

having count(id) > 1

order by left( trim(address), length(trim(address)) - 2), name, description, day_of_week, start_time, end_time, count(id) DESC;


It groups 9617 records into 2233 rows; we need to go through all of them and identify the correct address in every group. As the result we are going to remove about 7300 bogus meetings in the process. The output of this query is attached; please, share your ideas where all these entries came from and how to clear them out.

 

C. And the Final Strike for Reconciliation of the Job done

My very last query allows to identify the “double addressing” of geo location, ie returns rows with identical coordinates and different address values:

select "NumOfAddresses", "SumOfMeetings", "MinAddress", "MaxAddress",

                ST_Y(geo_location::geometry) as "latitude",

                ST_X(geo_location::geometry) as "longitude", geo_location

from (

                select grouped_addresses.geo_location,

                min(address) as "MinAddress",max(address) as "MaxAddress",

                count("NumOfMeetings") as "NumOfAddresses",

                sum("NumOfMeetings") as "SumOfMeetings"

                from (

                                SELECT geo_location, address, count(id) as "NumOfMeetings"

                                FROM public.aabuddy_meeting

                                where is_active = 't' and internal_type='submitted'

                                group by geo_location, address

                                having count(id)>1

                ) as grouped_addresses

                group by geo_location

                having count("NumOfMeetings") > 2

) as frequent_locations

order by "NumOfAddresses" DESC, "SumOfMeetings" DESC;

 

Lo and Behold! After the clearing made it returned only 28 rows, all of them caused by different spellings of the same addresses. At the beginning of the process it returned about five thousand entries.


Happy Presidents' Day,

Michael



On Sat, Feb 13, 2016 at 6:23 PM, victor semenov <victor....@gmail.com> wrote:
The recent goings on have left me in a rather unserene frame of mind. I am going to take a month off and try to get back into it with a clear head afterwards. Tentative resumption date is Saturday March 19th.

--
You received this message because you are subscribed to the Google Groups "mcjug-android" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mcjug-androi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

MultistateRows.xls

Maria Klimchak

unread,
Feb 15, 2016, 1:33:32 PM2/15/16
to mcjug-...@googlegroups.com
That data is in the global files Dave gave you the link to. Some data is not correct in those files, that is why your are seeing same address in different states. It came from the website we used to get the data,

If an address had a zipcode we were able to tell if it was in correct state and we remove them. But without zipcode you cannot tell if address is in correct state.

victor semenov

unread,
Feb 15, 2016, 2:54:02 PM2/15/16
to mcjug-...@googlegroups.com
We can talk more once I get back into it in march, but as a direction it might be helpful to assemble a list of meetings that definitely, without doubt have an incorrect lat and long. Those meetings are second biggest cause of bad ratings(after lack of meetings in an area). It would probably be good to delete them. Once they are gone it should be simpler to find areas with lots of people but few meetings, since areas filled with wrong lat/long meetings won't be confusing us with false negatives anymore.

Michael Semenov

unread,
Feb 15, 2016, 4:08:03 PM2/15/16
to mcjug-...@googlegroups.com
Honestly, more I look at it, more I tend to agree with Victor and understand his frustration. We have 9,617 suspicious records, 20% of the entire DB; we know that once they pointed to up to 2,233 actual meetings (not actual)... So, with this particular dataset our APP returns a correct answer about 23% of times.

I cannot find any practical way to use Global Meeting Scraper info. The problem is that these files were uploaded 2+ years ago and that time it had from 20% to 25% of incorrect data. (As an example, search for the term "Metropolis+Club" in GitHub; it returns 27 entries, use the address 928 5th St as your reconciliation point; there are multiple entries in Global Meeting Scraper files like the ND_GeoMeetings.psv , all of them were wrong. All entries in the Meeting Scrapper were correct, but they were never uploaded to DB).

I vote to ax all the "multistate" meetings and ask Galina to recover what she can by looking through the table of 2233 records. She found her way to get Geo locations from google really fast and accurate.

Maria Klimchak

unread,
Feb 15, 2016, 4:20:55 PM2/15/16
to mcjug-...@googlegroups.com

Not sure what frustration you are referring to. But as Victor just said, we can talk about it when we get back together in March.

Maria

David Ditzenberger

unread,
Feb 15, 2016, 5:04:55 PM2/15/16
to mcjug-...@googlegroups.com

I think your conclusions about the Global Meeting Scraper data having some bad data are probably correct.   But please make sure that you're viewing the correct output data (there are about 3 or 4 directories) because they have been filtered many times.

We originally knew (and accepted the fact) that the meetings found by Global Meeting Scraper included bogus meetings.   But we were willing to accept that fact in order to get any meetings in the database.   You can go to the original Web site if you want to see how really bad/inaccurate the data was.

Also keep in mind,  we may have gone from 15000 meetings per state down to 10k, down to 5k meetings after much automated filtering if bad meetings was done.   We had discussed manually going through each state,  but there wasn't enough support for that at the time.   Victor can fill you in about how AA stores their meetings online (hint... not centralized, and not in a common format), and he can also explain to you why things were done (or not done) for the past two years on the project.

Yep, I would agree that the Meeting Scraper data (not Global Meeting Scraper)    is correct, and at one time, that metro area data was in the database,  including all the DC meetings.  At some point in time (perhaps during a database recovery), those particular meetings were not reloaded.

If you have some suggestions (I would need to go back and read your posting) about how to clean up existing meetings that are bogus,  without deleting valid meetings, then that would be good.   But I'd clear it with Victor first since a) he knows the history of what was done and why things were done,  and b) I think he mentioned that (for right now,  at least) he didn't want to mess with anything server-side, because it is actively being worked upon.

Talk to you later.

Michael Semenov

unread,
Feb 15, 2016, 6:24:02 PM2/15/16
to mcjug-...@googlegroups.com
The Global Meeting Scraper data maybe incorrect or outdated even when we have the state and the city. The very Metropolis Club has 4 dozens AA meetings at 938 Rhode Island Ave NE (the address from the Meeting Scraper), and nothing at the 928 5th St., Northwest (former Metropolis address,from the Global Meeting Scraper). 

I gave Washington DC data to my wife to merge into her data entry process and got back from her that, let me make it gentle and politically correct way, the social life in Washington DC is spinning a bit faster than our implementation procedures. Also she added her own rules and principles into it, for example she does not enter a meeting if the only cross-reference for it is on zillow.com website, considering that DC has some really bad parts and we should not direct our users there.

DavidD

unread,
Feb 15, 2016, 7:22:58 PM2/15/16
to mcjug-android
We agree Global Meeting Scraper data (as well as many other web sites, including official AA sites) may possibly have outdated or incorrect data.  We knew that when we decided to go ahead and use the Global Meeting Scraper data because the benefits far outweighed the extra bogus or incorrect meetings (because there was no other web site or data source with centralized, comprehensive data).

Please keep in mind that even official sources have incorrect addresses and times, or meetings that are no longer there (as I'm sure your wife has already found out).  That's why we included a "not there" button on the GUI -- not a perfect solution, but probably the most realistic way of maintaining the data.

There is no doubt that your wife (or any of us given time) could enter as valid, correct, or up-to-date data -- far better than automated processes parsing error prone data.  And, for that effort, we are certainly appreciative.  In fact, you should get her an extra nice Valentine's Day gift if you haven't already. ;-) However, I'm sure not how to scale that effort short of working your poor wife until she is exhausted, or, hiring a bunch of folks to enter meeting data for the rest of the United States.

Something to think about (I'm not criticizing, but):
I suspect that some people that need AA most might live in those "bad neighborhoods" and have no means of transportation to meetings outside of their neighborhoods.  It seems as if anybody (even in the bad neighborhoods) can afford a $9.99 Tracfone that will run our application.  So, I think we should direct people to those meetings.  Many of those meetings are held in Churches or community centers (I volunteered at one in Benton Harbor, Michigan) and they are some of the safest locations where people who live in those area can congregate.
...

victor semenov

unread,
Feb 15, 2016, 7:43:57 PM2/15/16
to mcjug-...@googlegroups.com
I for one am not frustrated about the state of our meeting list. What we have is good enough to get downloads and have people use our app. I do not see a reason for us to launch a fault-finding mission into why some of the meetings are incorrect.

Mike could you focus on finding meetings that definitely have the wrong location? How they ended up this way is irrelevant. We can decide if we want to remove or "relocate" them once I get back.

--

Michael Semenov

unread,
Feb 15, 2016, 7:44:09 PM2/15/16
to mcjug-...@googlegroups.com
Yes, she enters all meetings in churches and community centers, she just avoid private residences. It doesn't look like a permanent arrangement.

On Mon, Feb 15, 2016 at 7:22 PM, DavidD <david.dit...@gmail.com> wrote:

--

Michael Semenov

unread,
Feb 15, 2016, 7:53:37 PM2/15/16
to mcjug-...@googlegroups.com
The analysis was done yesterday, found "Definitely not there" and "Most probably not there" ones. Do you want queries to update the production DB or a spreadsheet with the addresses?

victor semenov

unread,
Feb 15, 2016, 8:12:31 PM2/15/16
to mcjug-...@googlegroups.com
Not gonna do anything with them till march 19th. For now just see if you can enhance your queries somehow and increase the size of the Definitely Not There list.

victor semenov

unread,
Feb 15, 2016, 10:33:49 PM2/15/16
to mcjug-...@googlegroups.com
Mike I think its been long enough since we last looked into it and maybe the available libraries have gotten better. If you reach a point where db queries can not be improved anymore, try finding/writing a method that takes a lat/long and spits out the state that it is in(MD, VA...). If it turns out that is possible, try going one step further, and write a thingy that calls some api with a lat/long and spits out the address, its ok if this one only does a few a day for free.

It may not be possible to write tools that accomplish this, but if it turns out that either one is doable without paying google or mapquest a bunch of money, our meeting curation process can be improved significantly.

Michael Semenov

unread,
Feb 16, 2016, 12:09:37 AM2/16/16
to mcjug-...@googlegroups.com

Maybe possible to do it using the MapPoint on my laptop☺
I will try.

Michael Semenov

unread,
Mar 12, 2016, 7:50:00 PM3/12/16
to mcjug-...@googlegroups.com
If somebody fancy to do some programming, we need a small but urgent patch. Strange people out of there, somebody figured out that long email addresses are crashing our system and keeps bombarding it with registration requests causing 500 Internal Server Error. This is one example, I found dozens of them in the log file:

DEBUG 2016-02-18 23:28:56,962 views 2949 140676865280320 About to register youknowyouwa...@gmail.com
ERROR 2016-02-18 23:28:57,066 base 2949 140676865280320 Internal Server Error: /meetingfinder/create_user
Traceback (most recent call last):
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/core/handlers/base.py", line 113, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/var/www/aabuddy/releases/current/aabuddy/environments/../aabuddy/views.py", line 227, in create_user
    user.save()
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/base.py", line 546, in save
    force_update=force_update, update_fields=update_fields)
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/base.py", line 650, in save_base
    result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/manager.py", line 215, in _insert
    return insert_query(self.model, objs, fields, **kwargs)
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/query.py", line 1675, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 943, in execute_sql
    cursor.execute(sql, params)
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/backends/util.py", line 41, in execute
    return self.cursor.execute(sql, params)
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 58, in execute
    six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
  File "/var/www/aabuddy/local-python/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 54, in execute
    return self.cursor.execute(query, args)
DatabaseError: value too long for type character varying(30)

victor semenov

unread,
Mar 13, 2016, 3:16:46 AM3/13/16
to mcjug-...@googlegroups.com
Its not a problem. Django sanitizes use input before issuing database commands that use it, they won't be able to sqlinject anything.

Michael Semenov

unread,
Mar 13, 2016, 4:00:56 AM3/13/16
to mcjug-...@googlegroups.com
Back to the "DB Curation Process" I took 1,464 entries with "Main" in the address field (just because that chunk was large enough and small enough) and processed geolocations using MapPoint Control from Excel.

The attachment shows what MapPoint could do to get states out of locations. Overall gave me 95% of successful hits:

 

Total Records

         1,464

 

Not Found States

8

0.55%

2 States

60

4.10%

3 States

9

0.61%

Wrong State

8

0.55%

Correct Answers

1404

95.90%

 

It had some weird ideas when asked to get something smaller than state.

 

MainStreetAddresses.xlsm

Michael Semenov

unread,
Apr 11, 2016, 12:08:47 AM4/11/16
to mcjug-...@googlegroups.com
I am done with Google Address reverse search. Google kindly provided 14,239 unique addresses for our geolocations; I uploaded them back to DB as a separate table (public.verified_address), compared them with our addresses in the aabuddy_meeting table and saved the outcome for every address in the internal_type column:

internal_type Count Meaning
multistate 9,199  Bad entries, mostly the ones with randomly assigned multiple states
submitted 9,654  Latest 3k+ entries and the ones which I could not reconcile
verCityZip 20,314  Fair, Google address had matched Zip and City, but not street address
verified 8,635  Best, Google addr Street, City and Zip Match
ver-spel  770  Good, same geo_location as verified, but the address spelled differently.


My suggestion is to remove all multistate entries and do some manual clean-up to normalize other categories. Any other ideas?


victor semenov

unread,
Apr 11, 2016, 9:18:12 AM4/11/16
to mcjug-...@googlegroups.com
Lets chat about them when we get together on Saturday.
Reply all
Reply to author
Forward
0 new messages