Human Services Data Specification v0.8

2 views
Skip to first unread message

Sophia Parafina

unread,
Dec 1, 2014, 6:03:51 PM12/1/14
to openreferra...@googlegroups.com, openre...@googlegroups.com, Michal Migurski

Most of the major changes have occurred because of developer input. As a result, the specification is 85% in third normal form. This has led to more tables and has added to the complexity of the database schema. However, it does make it possible to query the data more precisely. Case in point, languages are often listed as comma separated values which are often retrieved as one long string of text. By separating the languages into a separate table, the data can be queried for a specific language only. 

The downside of increased complexity in the schema means that data input also becomes increasingly complex. To test the schema, I imported a record from a Bay Area iCarol database into the HSDS specification. Approximately, 90% of the iCarol data cleanly transferred, the remaining 10% requires post processing to import into HSDS. This is especially true of schedules because of multiple ways of formatting the data in the same table. Also languages will need to be normalized to ISO 639-1 codes.

The alternative to the more complex schema is to not to parse the data so finely and allow list of strings. Of course, queries will be less specific.

The Open Eligibility taxonomy was implemented in PostgreSQL as a materialized path model which is implemented in all the major relational databases. This provides a taxonomy for querying the data.

Example HSDS data was exported from test of import from iCarol database. Test scripts provided not for production use.

Changes:
  • Refined and made changes based on commentary from v0.4 and onward.
  • Tested HSDS by importing iCarol data
  • Added service_taxonomy table for taxonomy overlay
  • Tested taxonomy by creating taxonomy table using Open Eligibility and querying data
Additional files:

SQL examples to create tables and populate/export data into and from HSDS in Postgres. Includes Open Eligibility taxonomy.
HSDS example from Postgres export

-- 
Sophia Parafina
Ohana API

Moncef Belyamani

unread,
Dec 1, 2014, 9:54:14 PM12/1/14
to Sophia Parafina, openreferra...@googlegroups.com, openre...@googlegroups.com, Michal Migurski
Thanks for the update!

One thing I disagree with is your assumption that the only way to allow languages (or other similar fields) to be queried precisely is to store that data in a separate table. That is simply not true. All you need to do is store those strings in an array, and Postgres provides a native array type, as well as various efficient ways to query data in an array column.

In the Ohana API database, languages, funding_sources, accreditations, and the other “multiple” type of fields that are just a list of strings, are all stored in a column that is either of the native Posgtres array type, or in a text column that is serialized by Rails. It was a design decision not to create separate tables for those fields because it doesn’t make sense to me to have a table that only has one column.

Ohana comes out of the box with the ability to filter locations by language and other array fields, and it does so without any issues. The ability to query on a specific language does not and should not depend on languages being stored in a separate table, whether it’s Ohana or any other app that uses Postgres.

In Ohana API, languages are exposed as an array in the JSON output, not as one long string of text. If languages are entered as comma-separated strings in a spreadsheet, that doesn’t mean that they will be retrieved as such because thought was put into Ohana to make sure those strings were stored properly.

In my opinion, the spec’s priority should be to make it as easy as possible for data to be entered in a spreadsheet, since many people will be entering data manually in a spreadsheet. How the data is then processed and entered into a DB is up to the entity doing that import. I don’t think the spec should concern itself with how the data is going to be saved in a DB and queried.

The user experience of being able to enter multiple attributes, such as languages, in one field is exponentially greater than having to enter that data in a separate spreadsheet, which will end up containing just a bunch of repeated languages and foreign key ids. Here are the step-by-step differences between the two methods:

Method 1: All in one field
1. Open the locations.csv spreadsheet
2. Under the “languages” column, enter the languages as so: English, French, Spanish

Method 2: Separate spreadsheet
1. Open the languages.csv spreadsheet
2. Open the locations.csv spreadsheet to make sure you are entering the language for the right location, or create a macro that will allow you to choose the location from a dropdown
3. Create a new row
4. Enter the location_id
5. Enter “English”
6. Create a new row
7. Enter the same location_id as in step 4
5. Enter “French”
6. Create a new row
7. Enter the same location_id as in step 4
8. Enter “Spanish"

My 2 cents,
Moncef

--
Moncef Belyamani
2013 Code for America Fellow
ohanapi.org: The API for social services
@monfresh | about.me/moncef




-- 
You received this message because you are subscribed to the Google Groups "OpenReferral" group.
To unsubscribe from this group and stop receiving emails from it, send an email to OpenReferral...@googlegroups.com.
To post to this group, send email to OpenRe...@googlegroups.com.
Visit this group at http://groups.google.com/group/OpenReferral.
For more options, visit https://groups.google.com/d/optout.

Fitch, Dale K.

unread,
Dec 1, 2014, 10:36:04 PM12/1/14
to Moncef Belyamani, Sophia Parafina, openreferra...@googlegroups.com, openre...@googlegroups.com, Michal Migurski

Moncef and Sophia,

Would it be possible to test a use case against the two designs and see how each one performs? The divergence between these two approaches is significant.

Regards,

Dale

 


From: OpenRe...@googlegroups.com [OpenRe...@googlegroups.com] on behalf of Moncef Belyamani [mon...@codeforamerica.org]
Sent: Monday, December 01, 2014 8:54 PM
To: Sophia Parafina
Cc: openreferra...@googlegroups.com; openre...@googlegroups.com; Michal Migurski
Subject: Re: [openreferral] Human Services Data Specification v0.8

Moncef Belyamani

unread,
Dec 1, 2014, 11:51:03 PM12/1/14
to Fitch, Dale K., Sophia Parafina, openreferra...@googlegroups.com, openre...@googlegroups.com, Michal Migurski
Dale,

The performance difference between querying an array column on the same table versus a separate table depends on many things, but I maintain that it should have no bearing whatsoever on how the spec is defined. If you’re building an app that takes data from CSV files and populates a database, it’s up to you to store the data how you see fit. The spec should not mandate how data is stored in a database.

How data is stored and queried is going to differ based on each use case. In the case of Ohana API, storing languages and other similar fields in an array is what made sense for us, and we’ve seen no performance issues. As an example, querying 1704 locations that have the language ‘Spanish’ returns 819 results in 163ms. In production, only 30 records are returned at a time, and that same query with a limit executes in 6ms.

Every situation can be different, but you can find several examples of people reporting much faster queries when using the Postgres native array type versus a join with a separate table. Here are a couple of them:

https://www.bthlabs.pl/post/fun_with_postgresql_tagging_blog_posts.html
https://gist.github.com/joevandyk/031cf5812bd656887623

Cheers,
Moncef

--
Moncef Belyamani
2013 Code for America Fellow
ohanapi.org: The API for social services
@monfresh | about.me/moncef





On Dec 1, 2014, at 10:36 PM, Fitch, Dale K. <fit...@missouri.edu> wrote:

> Moncef and Sophia,
> Would it be possible to test a use case against the two designs and see how each one performs? The divergence between these two approaches is significant.
> Regards,
> Dale
>
> To unsubscribe from this group and stop receiving emails from it, send an email toOpenReferra...@googlegroups.com.

Fitch, Dale K.

unread,
Dec 2, 2014, 9:42:55 AM12/2/14
to Moncef Belyamani, Sophia Parafina, openreferra...@googlegroups.com, openre...@googlegroups.com, Michal Migurski
Moncef,
Thanks for this information. However, what I had in mind was not so much a test in terms of speed, but more so validity and reliability. Beyond the ability to store data and query strings, I was under the impression that an aspect of Open Referral was the capacity for the front end interface to gather aspects of the user's request, e.g., need, location, schedule, and combine those parameters with other aspects of service eligibility, e.g., crisis situation, homelessness, etc., such that the user is directed to the most appropriate service provider.

I was under the impression that in order for a front end interface to infer and reason about a user's request was somewhat dependent upon how the data was stored and indexed in the back end.

Is my assumption incorrect?

Thanks,
Dale
>> * Refined and made changes based on commentary from v0.4 and onward.
>> * Tested HSDS by importing iCarol data * Added service_taxonomy table
>> for taxonomy overlay * Tested taxonomy by creating taxonomy table

Moncef Belyamani

unread,
Dec 2, 2014, 9:56:55 AM12/2/14
to Fitch, Dale K., Sophia Parafina, openreferra...@googlegroups.com, openre...@googlegroups.com, Michal Migurski
I think you are confusing the Open Referral spec with the Ohana project. The spec is merely a set of rules that defines how data is to be entered in a spreadsheet and formatted. How the data is then used in order to build a front end interface is a completely separate issue.

How the data is stored and indexed in the back end, and how a front end interface makes use of that data is up to the people building such an app, and will be based on the needs of the intended audience for that site.

The Ohana Web Search front end website is just one example of a front end that could be built using this type of data.

Open Referral is not a back end or a database, or a website, or an API. Whether the spec says that languages need to be entered in a separate spreadsheet, or that they can be entered as comma-separated strings in one field has absolutely no effect on the ability to properly store and query the data in some back-end. Because it makes no difference, the rule that should be adopted is the one that makes it easiest to enter data, which is as comma-separated strings in one field. That is my argument.

Fitch, Dale K.

unread,
Dec 2, 2014, 2:32:21 PM12/2/14
to Moncef Belyamani, Sophia Parafina, openreferra...@googlegroups.com, openre...@googlegroups.com, Michal Migurski

Moncef,

Thanks for the clarification as this is not my area of expertise. I just wanted to make sure that the structure of the data had no bearing on future use.

Regards,

Dale

Charles Koppelman-Milstein

unread,
Dec 2, 2014, 3:17:20 PM12/2/14
to Fitch, Dale K., Moncef Belyamani, Sophia Parafina, openreferra...@googlegroups.com, openre...@googlegroups.com, Michal Migurski
I agree with Moncef that that assumption is not entirely correct.  In addition to Postgres's array type, document-based storage systems (e.g., ElasticSearch or CouchDB) can certainly query based on more complex requirements.  Even with a traditional SQL db, especially if the delimiter is set, folks who are ingesting the list of languages can easily create their own separate tables if they want to help searching.

I realize that I'm probably one of the people who's pushed for more normalization, but I see that this can be dealt with on the consumer end and not become a production burden.

Charles



--

Charles S. Koppelman-Milstein | Developer

Polaris − Freedom Happens Now
PO Box 65323 | Washington, DC 20035

Tel: 202-745-1001x156

ckopp...@polarisproject.org | www.polarisproject.org


This electronic mail message may contain confidential and proprietary information. If the reader of this message is not the intended recipient, any disclosure, copying or distribution of this communication is strictly prohibited. If you have received this communication in error, please immediately delete this message and any attachments.

Sophia Parafina

unread,
Dec 2, 2014, 3:46:06 PM12/2/14
to Charles Koppelman-Milstein, Fitch, Dale K., openre...@googlegroups.com, Michal Migurski, openreferra...@googlegroups.com, Moncef Belyamani
On the data production end, I think increased normalization decreases the amount of errors in the data. For example, entering a string of languages with a delimiter is more error prone than separating each language into a row. Of course the argument can always be made that you can have verification when entering the data, however, I think that exporting existing data would be the more likely case. The iCarol model is a combination of lots of repeating fields, e.g. phone1, …, phoneN, and text strings that are arrays of languages. The core data is a single table with agency, site, and programs tied together with ids. The application logic maintains data integrity instead of using relational database design. It works but it would be technically cleaner to extract the data and provide it in 3rd normal form for ingestion into other databases.

The model for HSDS has been the Google Transit Feed Specification. It's important to keep in mind that this and HSDS are meant for developers/data people and not end users. Like GTFS, HSDS is a container for exchanging data and not how it's stored or structured in an application.

-- 
Sophia Parafina
Ohana API

Reply all
Reply to author
Forward
0 new messages