Watch out for Copilot


Martin Blais

unread,
Nov 15, 2024, 9:35:46 AM11/15/24
to Beancount
Dear Beancount users,

This is a PSA -- TL;DR: don't enable Github Copilot completion on your ledgers.

If you installed Github Copilot in your personal code editor/computer, be aware that it uploads "snippets" of your input files to Github, and possibly to third-party APIs (e.g., OpenAI). I think people are only beginning to become aware of the implications, mostly because their employers are now crafting policies around which LLMs they may use. It's still early days and it's easy to accidentally screw up, so here are some thoughts.

I think it's really easy to install Github Copilot to get code completions in, say, Emacs, and then to open up your ledger with Copilot minor-mode active everywhere (for example, if you enabled it via `(add-hook 'prog-mode-hook 'copilot-mode)` or similar so it turns on in every programming mode; "it's amazing, right?"), which means you get completions on the ledger's contents. AFAICT it's impossible to know how much context is sent up to the models for queries. GH claims general "context" is sent:

[screenshot: Github documentation on the context sent with completion requests]

Elsewhere I've seen it mentioned that Copilot sends "a few lines of context before and after the code you're editing". AFAIK there's no way to know how large this context is, and I've seen mention of the selection somewhere. For example, if you select your entire ledger file, does it upload the whole thing as context for your completion prompt?

Github's retention policy mentions that prompts aren't retained, but what about context?
I see "Prompts and Suggestions" in the FAQ:
[screenshot: "Prompts and Suggestions" section of the Copilot FAQ]

And some of your transaction data may end up getting used to train new models?
[screenshot: Copilot FAQ on data used to train models]

Please correct me if I'm wrong:
- I don't believe there is a local log (on your computer) of what was actually sent. 
(If you just accidentally once opened up your ledger with the entire history of your financial life, it's not impossible that the whole thing was uploaded to Copilot.)
- I don't believe Github lets you view the content you've uploaded and sent from their site either.
- I don't believe Github lets you delete the content as a matter of normal usage (like Google Dashboard does, e.g., https://myaccount.google.com/dashboard)

There's some mention in the FAQ:
[screenshot: Copilot FAQ excerpt on data deletion]

This takes you to this page:
[screenshot: Github privacy / data-deletion request page]
Okay, so maybe. This looks good in theory, but what if your data has also been sent to a third-party service?
AFAIK Copilot uses OpenAI's Codex model. Does Github have a setup to host and run it themselves, or is all the data sent to a service run by OpenAI?

I think it's appropriate to be really cautious about this.

Chary Chary

unread,
Nov 15, 2024, 12:21:28 PM11/15/24
to Beancount
Martin,

thanks for bringing up this issue. Just thinking aloud:

1) It is possible to disable Copilot for certain file types, and Copilot apparently claims it will not access those file types then.

Do you think it is sufficient to disable Copilot just for .bean files?

2) Suppose someone keeps financial data in Google Sheets (which I generally don't, but suppose). Is there any reason to be more concerned about Copilot accessing my financial data than about Google doing so?

Martin Blais

unread,
Nov 15, 2024, 12:42:57 PM11/15/24
to bean...@googlegroups.com
On Fri, Nov 15, 2024 at 12:21 PM Chary Chary <char...@gmail.com> wrote:
Martin,

thanks for bringing up this issue. Just thinking aloud:

1) It is possible to disable Copilot for certain file types, and Copilot apparently claims it will not access those file types then.

Do you think it is sufficient to disable Copilot just for .bean files?

I don't know. 
The answer would be found in the source code for the copilot editor support for your editor.
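FWIW, on the Emacs side a more conservative setup than a blanket prog-mode-hook is an explicit allowlist. This is only a sketch assuming the community copilot.el package; the `my/` names are made up, and the allowlisted modes are just examples:

```elisp
;; Sketch for the community copilot.el package: enable completions only in
;; an explicit allowlist of major modes, so ledger buffers never get them.
;; (my/copilot-allowed-modes and my/maybe-enable-copilot are made-up names.)
(defvar my/copilot-allowed-modes '(python-mode c-mode sh-mode)
  "Major modes where Copilot completions are acceptable.")

(defun my/maybe-enable-copilot ()
  (when (apply #'derived-mode-p my/copilot-allowed-modes)
    (copilot-mode 1)))

(add-hook 'prog-mode-hook #'my/maybe-enable-copilot)
```

The default is then "off", and a mode only gets completions if you deliberately added it to the list.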

 

2) Suppose someone keeps financial data in Google Sheets (which I generally don't, but suppose). Is there any reason to be more concerned about Copilot accessing my financial data than about Google doing so?

They're different entities.
Github is Github, Google is Google; Github is also Microsoft. Github also ferries your data to models, which surely means some well-provisioned model APIs, which I suspect are hosted at OpenAI.
Google is one entity with internal protocols for safety and privacy, including access-restriction mechanisms that prevent even employees from accessing data (except in narrowly defined cases based only on business need).

FWIW I used to work there and as a result of what I've seen during my time I trust my data there as much as files on my own personal computers.
But a lot of people like to hate on large companies these days... that's up to you... I'm the wrong person to ask, I'm a fanboy.

 
--
You received this message because you are subscribed to the Google Groups "Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beancount+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/beancount/fe428c54-9a6b-4b28-be75-84b1a5daa805n%40googlegroups.com.

Red S

unread,
Nov 16, 2024, 11:19:56 PM11/16/24
to Beancount
If you installed Github Copilot in your personal code editor/computer, be aware that it uploads "snippets" of your input files to Github, and possibly to third-party APIs (e.g., OpenAI). I think people are only beginning to become aware of the implications, mostly because their employers are now crafting policies around which LLMs they may use. It's still early days and it's easy to accidentally screw up, so here are some thoughts.

I think it's really easy to install Github Copilot to get code completions in, say, Emacs, and then to open up your ledger with Copilot minor-mode active everywhere (for example, if you enabled it via `(add-hook 'prog-mode-hook 'copilot-mode)` or similar so it turns on in every programming mode; "it's amazing, right?"), which means you get completions on the ledger's contents. AFAICT it's impossible to know how much context is sent up to the models for queries. GH claims general "context" is sent:

Glad you brought this up. The first thing I did before installing Copilot, long ago, was to address exactly this. I personally use both Copilot and Codeium with Neovim. In short, here are some options I found; they work well for folks who use terminal-based editors (vim/emacs, mostly):

  1. configure Copilot/Codeium/AI in your editor to be disabled for certain file types
  2. configure your editor to disable the Copilot/Codeium/AI plugin for certain file types
  3. entirely disable network access from your editor

(1) involves trusting the plugin in question, which isn't a great idea.

(2) is better, but I found it easy to mess up and get wrong. Editor configurations for power users span many files and directories, and it's easy to overlook something when updating your config.
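As a concrete example of the filetype approach, copilot.vim documents a `g:copilot_filetypes` dictionary, so a default-deny setup looks roughly like this (note this is really option (1) territory: you're still trusting the plugin to honor it, and the opted-in filetypes below are just examples):

```vim
" Default-deny Copilot, then opt in per filetype, via copilot.vim's
" documented g:copilot_filetypes option.
let g:copilot_filetypes = {
      \ '*': v:false,
      \ 'python': v:true,
      \ 'sh': v:true,
      \ }
```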

(3) is the most secure, and I use it for the things I need the most security for (files with account numbers, passwords, cloud API keys, and other sensitive data). My setup runs a separate instance of Neovim via Flatpak. Under the hood this is essentially containerized execution of Neovim, which means all one has to do is disable the network interface on that container, like so:

my_editor_secure () {
    # my editor uses a gpg plugin, for which it needs access to the gpg-agent
    flatpak run --user --unshare=network --socket=gpg-agent io.neovim.nvim "$@"
}

This guarantees nothing leaves your computer from that editor instance. You could simply make this your default editor command, and occasionally run it with network access enabled when you need to update plugins and such.

Marvin Ritter

unread,
Nov 23, 2024, 2:47:05 PM11/23/24
to bean...@googlegroups.com
If you have Copilot enabled, I would recommend disabling it by default and enabling it only for specific file types/languages. It's easy to forget a file type with sensitive content, and you can always enable a language later if you forgot it.
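In VS Code, for instance, that default-deny shape can be expressed with the `github.copilot.enable` setting (the opted-in languages below are just examples):

```jsonc
// settings.json sketch: disable Copilot everywhere, then opt in per language.
{
  "github.copilot.enable": {
    "*": false,
    "python": true,
    "go": true
  }
}
```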


Martin Blais

unread,
Nov 23, 2024, 3:01:27 PM11/23/24
to bean...@googlegroups.com
Hey Marvin,
Do you know if there's a Google service for code completion similar to Copilot?
Do you know if people are realistically running CodeGemma locally?


Martin Blais

unread,
Nov 23, 2024, 3:02:29 PM11/23/24
to Martin Blais, bean...@googlegroups.com
On Sat, Nov 23, 2024 at 3:01 PM Martin Blais <bl...@furius.ca> wrote:
Hey Marvin,
Do you know if there's a Google service for code completion similar to Copilot?
Do you know if people are realistically running CodeGemma locally?

Hmm, I see it's supported by Ollama.
I wonder if it's easy to set up in Emacs.


Martin Blais

unread,
Nov 23, 2024, 3:04:47 PM11/23/24
to Martin Blais, bean...@googlegroups.com
On Sat, Nov 23, 2024 at 3:02 PM Martin Blais <bl...@furius.ca> wrote:
On Sat, Nov 23, 2024 at 3:01 PM Martin Blais <bl...@furius.ca> wrote:
Hey Marvin,
Do you know if there's a Google service for code completion similar to Copilot?
Do you know if people are realistically running CodeGemma locally?

Hmm, I see it's supported by Ollama.
I wonder if it's easy to set up in Emacs.

Oh my... Ellama. 
(Sorry, I'm still catching up with the universe.)
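For anyone wanting to try: getting CodeGemma running locally is roughly two Ollama commands. This is a setup sketch; the model name is as listed in the Ollama library, and once the model is downloaded everything runs on your own machine:

```
# Setup sketch: pull and run CodeGemma locally with Ollama.
# After the initial download, nothing here leaves your machine.
ollama pull codegemma
ollama run codegemma "Write a one-line summary of the Beancount file format"
```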

Gary Roach

unread,
Nov 28, 2024, 4:00:06 PM11/28/24
to Beancount
I was actually thinking about making an importer that sends transaction statements to ChatGPT and extracts the information in Beancount format. It's amazing at parsing PDFs and CSV files, and unlike institution-specific importers you'd never have to worry about the institution making format changes that break your importers.

Convenience typically sacrifices some amount of security, but why are we concerned about our banking transactions being made accessible to other companies? Aren't they already public in the sense that every transaction involves multiple parties other than yourself (banking institution/broker, employer, merchant, credit card processor, etc.)?

Is there something I'm missing that could be exploited if an organization or even an individual accessed my entire ledger? 

Martin Blais

unread,
Nov 28, 2024, 4:09:17 PM11/28/24
to bean...@googlegroups.com
On Thu, Nov 28, 2024 at 4:00 PM Gary Roach <groa...@gmail.com> wrote:
I was actually thinking about making an importer that sends transaction statements to ChatGPT and extracts the information in Beancount format. It's amazing at parsing PDFs and CSV files, and unlike institution-specific importers you'd never have to worry about the institution making format changes that break your importers.

If you have a semi-decent GPU (I have an old RTX 3060) you can run free models on your own computer and do some extraction. 
I kicked the tires on "Llama 3.2 Vision" a few days ago this way and could run some OCR tasks.
No need to send things up to an API if the free models are good enough for your particular task.
Convenience typically sacrifices some amount of security, but why are we concerned about our banking transactions being made accessible to other companies? Aren't they already public in the sense that every transaction involves multiple parties other than yourself (banking institution/broker, employer, merchant, credit card processor, etc.)?

Is there something I'm missing that could be exploited if an organization or even an individual accessed my entire ledger? 

Every person has a different threshold, but personally I'm not comfortable with my personal data going out to an API.
Got nothing to hide, but I don't walk around naked either.



 

Gary Roach

unread,
Nov 28, 2024, 4:35:09 PM11/28/24
to Beancount
I see, it's more of a comfort-level thing. I can understand why that would make someone a little apprehensive.

I haven't played around with local models before, but I should check them out. Thanks for the suggestion. I've got everything virtualized on an HP ProLiant DL360p Gen8, so no good GPU to speak of at the moment, but I may pick something up soon.

Red S

unread,
Nov 28, 2024, 10:33:08 PM11/28/24
to Beancount
It's more than just a comfort thing IMHO: I'd be concerned about putting my account numbers, transactions, and current positions and balances out there, and about opening myself up to phishing attacks. I'd personally highly caution anyone against feeding their statements to GPT online for the same reason. My two cents.
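If you do decide to feed statements to a hosted model anyway, redacting obvious identifiers first at least limits the damage. An illustrative sketch (not from the thread; the regexes are assumptions about typical statement formats, not something robust enough to rely on blindly):

```python
import re

# Assumed patterns: long digit runs look like account numbers,
# 16-digit groups look like card numbers. These are illustrative only.
CARD_RE = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")  # 16-digit card numbers
ACCOUNT_RE = re.compile(r"\b\d{8,17}\b")             # long digit runs

def redact(text: str) -> str:
    """Replace likely card/account numbers with placeholders."""
    text = CARD_RE.sub("[CARD]", text)    # cards first: they contain digit runs
    text = ACCOUNT_RE.sub("[ACCT]", text)
    return text

statement = "Acct 123456789012 payment 45.00 card 4111-1111-1111-1111"
print(redact(statement))
# → Acct [ACCT] payment 45.00 card [CARD]
```

Small amounts like 45.00 pass through untouched, so the redacted text is still usable for extraction.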

Gary Roach

unread,
Nov 29, 2024, 1:26:29 AM11/29/24
to Beancount
That's a good callout. Account numbers, transactions, and current balances could also be used to prove identity when calling the institution, which I didn't consider initially. Thanks again.

Simon Michael

unread,
Dec 12, 2024, 4:45:30 PM12/12/24
to bean...@googlegroups.com