Querying Checkbook API - Issues & Best practices


Oleksandra (Sasha) Filippova (DCP)

Nov 20, 2023, 10:10:55 AM
to checkb...@googlegroups.com, DataEngineering_DL
Hello,

My name is Sasha Filippova, and I am a Data Engineer at NYC DCP. My team would like to incorporate Checkbook NYC data into one of our data products. When trying to scrape records from your API, I receive two types of errors for some of my requests:
  • 500 Server Error: Internal Server Error
  • 200 response but the body of the response has this message: SQLSTATE[08006] [7] timeout expired
The API requests are sent sequentially, one every 5-10 seconds.

Are there limits on how many requests can be sent in a given time period? And could you recommend general best practices for querying your API?

Thank you in advance!
Sasha

Sasha Filippova

Data Engineer, Geographic Data and Engineering

(She/her)

 

NYC Department of City Planning 

Kavitha Gopalakrishnan

Jan 3, 2024, 8:47:06 AM
to Checkbook NYC
Hi Sasha,
 
The team tried validating a few of the API queries and was able to get results without any errors. Below are the possible reasons why you may have run into these errors.
 
403 response error
Possible Reason 1:
The requests which resulted in a 403 response were due to a block which had been put in place to prevent automated scripts from scraping the graphical part of the site. Some of them were Python scripts so they were using the default python/pycurl user-agent.
As the user discovered, changing the user-agent to one of a web browser would prevent this from being blocked.
We have also changed the restriction on our server so it does not block requests to /api based on user-agent.
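For scripts that still want to send an explicit user-agent rather than the library default, a minimal sketch follows. The URL, the request body, and the user-agent string are hypothetical placeholders, not values confirmed in this thread, and the sketch only builds the request without sending it.

```python
import urllib.request

# Hypothetical identifier for the calling script; not a value the site requires.
USER_AGENT = "example-checkbook-client/0.1"


def build_api_request(url: str, body: bytes) -> urllib.request.Request:
    """Build a POST request that carries an explicit User-Agent header.

    Nothing is sent here; pass the result to urllib.request.urlopen to send it.
    """
    return urllib.request.Request(
        url,
        data=body,
        headers={"User-Agent": USER_AGENT},
        method="POST",
    )


# Placeholder URL and body, for illustration only.
request = build_api_request("https://example.invalid/api", b"<request/>")
print(request.get_method(), request.get_header("User-agent"))
```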
 
Possible Reason 2:
The other reason for receiving a 403 or "connection reset" error would be due to the limits we've put in place on the API endpoint. Requests are limited to 1 concurrent session per IP and at a rate of 1 per second.
If you are automating multiple requests at a time, they should be scripted to wait for each request to complete before sending another. This helps reduce the load on our server as each API request can take up to 2 minutes to return results, depending on the size and complexity.
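The limits described above (one request in flight, at most one per second) could be respected with a loop like the sketch below. The `send` callable is a hypothetical stand-in for whatever function performs a single API call and blocks until it returns; the clock and sleep parameters exist only so the pacing can be tested without waiting.

```python
import time
from typing import Callable, Iterable, List, TypeVar

P = TypeVar("P")
R = TypeVar("R")

MIN_INTERVAL_SECONDS = 1.0  # from the stated limit of 1 request per second


def run_sequentially(
    payloads: Iterable[P],
    send: Callable[[P], R],
    min_interval: float = MIN_INTERVAL_SECONDS,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Send payloads one at a time, starting at most one per min_interval."""
    results: List[R] = []
    last_start = None
    for payload in payloads:
        if last_start is not None:
            # Wait out whatever remains of the interval since the last start.
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        # send() blocks until the response arrives, so no two requests overlap.
        results.append(send(payload))
    return results
```

Because a single request can take up to 2 minutes, whatever `send` does internally should use a client timeout of at least 120 seconds.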
 
500/503 response error
Possible Reason 1:
The 500/503 errors are due to timeouts in the web server or database processes, which we tested and were unable to reproduce. As we discussed, trying again during the evening when there is less traffic would make the requests more likely to succeed.
Possible Reason 2:
Another possibility is that the request hits the NYC Comptroller’s office outbound proxy connection timeout before our site's 120-second timeout.
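Since these failures are timeouts that may succeed on a later attempt, a client could retry them with a growing delay. The sketch below treats a 500 or 503 status, or a 200 response whose body contains the `SQLSTATE[08006]` message reported earlier in this thread, as retryable. The `send` callable, the attempt count, and the delays are assumptions chosen for illustration, not values recommended by the Checkbook team.

```python
import time
from typing import Callable, Tuple

RETRYABLE_STATUSES = {500, 503}
TIMEOUT_MARKER = "SQLSTATE[08006]"  # text seen in the body of some 200 responses


def is_retryable(status: int, body: str) -> bool:
    """Return True for server timeouts, including ones reported inside a 200."""
    if status in RETRYABLE_STATUSES:
        return True
    return status == 200 and TIMEOUT_MARKER in body


def fetch_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,      # assumed value
    base_delay: float = 5.0,    # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable response or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if not is_retryable(status, body):
            return status, body
        if attempt < max_attempts:
            # Double the wait after each failed attempt: 5 s, 10 s, 20 s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"request still failing after {max_attempts} attempts")
```

Other statuses, such as 403, are returned to the caller unchanged, since retrying immediately would not help with a block or rate limit.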
 
Best Practice
Avoid sending more than one concurrent request; otherwise you may hit the API rate limit.
The best time to run the script is around 5:00 to 8:00 pm EST.
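A scheduled script could check the suggested evening window before starting. This sketch takes "EST" literally as a fixed UTC-5 offset, which is an assumption; if New York local time (including daylight saving) was meant, a named time zone would be needed instead.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: "EST" means a fixed UTC-5 offset, with no daylight saving.
EST = timezone(timedelta(hours=-5), "EST")
WINDOW_START = time(17, 0)  # 5:00 pm
WINDOW_END = time(20, 0)    # 8:00 pm


def in_recommended_window(moment: datetime) -> bool:
    """Return True if a timezone-aware moment falls between 5 and 8 pm EST."""
    local_time = moment.astimezone(EST).time()
    return WINDOW_START <= local_time < WINDOW_END


if __name__ == "__main__":
    print(in_recommended_window(datetime.now(timezone.utc)))
```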
 
Please let us know if you have any questions/comments.

Thanks
Kavitha Gopalakrishnan (REI Systems)