Hi folks, my subject stems from having recently done a deep dive into the pi_vision implementation. The original face detection and tracking had bit-rotted, so I revamped it. In doing so I added a hook for eventually augmenting the "new_face" message with some face recognition. I was informed that rather than splicing a face recognition algorithm in at the pi_vision level, the "vision" would be to have the image elements reach the AtomSpace, and thus allow recognition to occur at a more basic level.
Therefore, pursuant to the above, I'm asking for a high-level description of how AGI vision could be accomplished. Perhaps we can also address the question of why face detection and tracking are "ok" but face recognition is not? Maybe all processing should be done at a lower level?
--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CA%2Ba9A7AYNxawVTjbn5sQXp7AjToj1xteyCnCibrBO7TZwDDsSQ%40mail.gmail.com.
Thanks Linas, I do have the vision part working now (sending ROS messages for FACES, LOST_FACE and NEW_FACE). However, I don't have the Eva/Sophia head working yet; I'm working towards that, and any help is welcome.
Not sure if this is cogent since my application is autonomous robots in actual hardware, but maybe useful…
I used OpenCV with a carrier board ("StereoPi") for the Raspberry Pi Compute Module that breaks out both camera ports on the Pi. I automated face recognition with code leveraging OpenCV that I found from Adrian Rosebrock (pyimagesearch.com), which employed Haar Cascades to determine whether a face was present. Once a face is detected, it sends the center-of-face data to another Pi (the robots have three Pis in them – "cores" – a vision acquisition "core", a language "core" and a vision processing "core"). The vision processing core (depending on the state the robot is in) takes this face positioning data, chews on it and sends the corresponding servo signals to the motor core that controls the head and eyes, and the robot follows you with its gaze and head movements. So in theory, face *detection* and tracking are always functionally available, but may be overridden/ignored by other behavioral commands/statuses.
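The face-center-to-servo step described above can be sketched as a simple proportional controller. To be clear, this is just an illustration of the idea, not Dave's actual code – the frame size, servo range and gain below are assumed values:

```python
# Sketch of the "center-of-face data -> servo signal" step. The frame
# size, servo travel and gain are illustrative assumptions, not values
# from the actual robots.

FRAME_W, FRAME_H = 640, 480      # assumed camera resolution
PAN_MIN, PAN_MAX = 0.0, 180.0    # assumed pan-servo travel in degrees
GAIN = 0.05                      # proportional gain (degrees per pixel of error)

def pan_update(current_deg, face_cx):
    """Nudge the pan servo toward the horizontal center of the face."""
    error = face_cx - FRAME_W / 2               # pixels off-center
    target = current_deg + GAIN * error         # proportional correction
    return max(PAN_MIN, min(PAN_MAX, target))   # clamp to servo limits

print(pan_update(90.0, 320))  # face dead-center -> 90.0 (no movement)
print(pan_update(90.0, 520))  # face to the right -> 100.0 (pan right)
```

A real loop would run this per frame and also drive a tilt servo from the vertical offset in the same way.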
The language processing side of things is always listening (I use python speech recognition with PocketSphinx as the recognizer which works surprisingly well) and now has several hundred routines it can engage depending on what it hears, and some conflict resolution and buffering code in case responses to one phrase would interfere with ongoing responses playing out).
The system is set up so that if I use a phrase like "my name is" or "I'd like to introduce you to" (and several similar phrases that are recognized by a fuzzy-logic kind of similarity finder I wrote), *AND* it can tell a face is present, it can filter out the name given, if any. Then a few things happen – first, the language processor confirms the name by speaking "Hello <name> – did I get that right?" and listens for a variety of words that are either affirming or denying.
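Dave's similarity finder is custom and not shown here, but the idea can be sketched with nothing more than the standard library – a minimal stand-in using difflib, with an assumed 0.8 match threshold:

```python
import difflib

# Minimal stdlib stand-in for the "fuzzy-logic kind of similarity
# finder" mentioned above. The phrase list and threshold are
# illustrative; the real implementation is Dave's own code.

INTRO_PHRASES = ["my name is", "i'd like to introduce you to"]

def matches_intro(heard, threshold=0.8):
    """Return True if the heard text is close to any known intro phrase."""
    heard = heard.lower()
    for phrase in INTRO_PHRASES:
        ratio = difflib.SequenceMatcher(None, heard, phrase).ratio()
        if ratio >= threshold:
            return True
    return False

print(matches_intro("my name is"))       # True  (exact match)
print(matches_intro("what time is it"))  # False (no intro phrase)
```

Matching against a small fixed phrase list like this is forgiving of speech-recognizer misfires ("mike name is" still scores close to "my name is") without needing any ML.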
On affirmation, the system immediately begins taking snapshots every 10 frames and stores them in a folder (the new-faces dataset) named with the person's name plus the date and time as a numeric string (Dave-202202251623, for example). Once either the person exits the view for more than 100 frames (which would have been 10 snapshots) or the system gains 100 actual face snapshots, it hands those images off to another of the scripts from Adrian Rosebrock (encode_faces.py) that encodes the faces and turns the whole bunch into a pickle, which is then appended to the bigger pickle that all the other known faces are in… The name and data are also written to the database of "people known", where additional data is written over time as interactions with that person accrue.
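The bookkeeping side of that pipeline (the folder naming and the pickle-append) can be sketched as below. The encoding step itself lives in Rosebrock's encode_faces.py and isn't reproduced; the file path and dict layout here are assumptions for illustration:

```python
import datetime
import os
import pickle

# Sketch of the dataset naming and pickle-append bookkeeping described
# above. The db_path and the {"encodings", "names"} layout are assumed
# for illustration; the actual face encoding is done by encode_faces.py.

def dataset_name(person, when=None):
    """Folder name: person's name plus date-time as a numeric string."""
    when = when or datetime.datetime.now()
    return f"{person}-{when.strftime('%Y%m%d%H%M')}"

def append_encodings(new_encodings, new_names, db_path="encodings.pickle"):
    """Append freshly encoded faces to the big pickle of known faces."""
    data = {"encodings": [], "names": []}
    if os.path.exists(db_path):
        with open(db_path, "rb") as f:
            data = pickle.load(f)
    data["encodings"].extend(new_encodings)
    data["names"].extend(new_names)
    with open(db_path, "wb") as f:
        pickle.dump(data, f)

print(dataset_name("Dave", datetime.datetime(2022, 2, 25, 16, 23)))
# Dave-202202251623
```

Read-modify-write on one growing pickle is simple and works at this scale, though it's also exactly why the "aging encodings" problem mentioned later needs a background refresh routine rather than just more appends.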
So I'm not sure if this answers your question about integrating it into the speech subsystem – I basically have the audio input and processing, audio output and visual input and processing all running in parallel on separate physical SBCs, which all talk to each other via ZeroMQ (or PyZMQ specifically).
It works very well, is reasonably fast (especially given it only runs on Pi 4/8 GB SBCs), and gives people interacting with it the unmistakable feeling that the robot sees them, responds to their movements and speech, etc., and remembers them.
The drawback I haven't done anything about in the past year or so – but which has a relatively easy fix – is that the pickle data for a given person ages (my grandkids are no longer reliably recognized, since they were 3 and 5 when I first implemented that build and are 6 and 8 now) – so I need to add a routine that occasionally and silently updates the images in the recognition pickle in the background to keep up with changes… but I've not had the time I wanted to do these things…
If any of this gives you anything useful to pick from, I can get you code, original source and my custom stuff. It's all Python, so I'm guessing you should be good with that.
Dave
I feel your pain lol. Fortunately, I'm the only engineer:
Moore's law.
So, ahh, one person who should have known better ordered the best, highest-resolution webcams they could find. 1280x1024 or something. You could only plug two of them into a USB hub before the USB hub was overwhelmed. And the CPU attached to that could barely keep up with the frame rate. Despite this obvious hardware-fail, there was tremendous resistance to down-scaling to a far more practical 640x480. Add to that a power, heat and cooling budget. Ugh.
Managing engineers is like herding cats. Or pushing rope. Something like that.
I de-resed my camera down to something even less than 640x480 and worked with Jeff Bass, the creator of imageZMQ (which builds on PyZMQ), to come up with a fully custom frame buffering/overflow, use-last-in-if-delayed scheme. Works like a champ – no hangs, no delays – and it broadcasts via imageZMQ to all my Pis so they can all see what the eyeballs see and operate on that. I'll never understand why ROS didn't adopt ZMQ and imageZMQ – it's incredibly versatile, fast and efficient.
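The actual buffering scheme Dave and Jeff Bass built is custom and not shown here, but the core "use-last-in-if-delayed" idea can be sketched in a few lines of stdlib Python: a one-slot buffer where the producer never blocks and a slow consumer always gets the newest frame, with stale frames silently dropped:

```python
import threading

# Toy sketch of the "use-last-in-if-delayed" idea described above:
# a one-slot, overwrite-on-put buffer. Not the actual imageZMQ code.

class LatestFrame:
    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None

    def put(self, frame):
        """Producer side: overwrite whatever is waiting -- never blocks."""
        with self._lock:
            self._frame = frame

    def get(self):
        """Consumer side: take the newest frame (or None if nothing new)."""
        with self._lock:
            frame, self._frame = self._frame, None
            return frame

buf = LatestFrame()
for i in range(5):        # the camera races ahead of the consumer...
    buf.put(f"frame-{i}")
print(buf.get())          # frame-4 (only the newest frame survives)
print(buf.get())          # None    (nothing new since the last read)
```

This is why the pipeline never hangs: a consumer that falls behind skips straight to the current view of the world instead of chewing through a backlog of stale frames.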
I have a frame rate of around 5 frames/second, which is more than enough for what I want it to do, since we're not driving cars on highways or anything.
Regarding folks turning up their noses at local-only recognition, that was a decision I made because I wanted my robots to be independent of an internet connection. They have to be fully functional off-line. But – as you state – background noise can be an issue. My "folks" don't have to perform in a tech-convention environment. They just have to work roaming around on my property, where there's virtually zero background noise, unless it's really windy lol
As for the face recognition with Haar cascades, I admit to cheating a little. Usually, when I introduce someone new and get the initial photo dataset, I make a point of surreptitiously triggering the face-learning scripts in different environments, not just my office. I have the robot's "known people" interact with them up on my porch during daylight, evenings with artificial lighting, etc., and it adds to their database, and the recognitions become more robust… when I do that.
But again, I haven't really done much in about a year, partially because I am seeing tech SOAR past what I can do on my own. It's a little disheartening. I see robots capable of *INCREDIBLE* facial expression (Ameca, https://www.cnn.com/videos/business/2021/12/08/humanoid-robot-ameca-lon-orig-tp.cnn), linguistic awesomeness (I have GPT-2 running on a Pi, but I'll NEVER get GPT-3 running on a Pi lol), and unimaginable dexterity (Boston Dynamics Atlas). When I started in this, I was competitively doing OK. The world rocketed past me lol. Now my robots are, well, toys that I play with sometimes. But – that said – the tech in some parts of them is still pretty damn good… I've never seen ANYTHING better at identifying questions vs. statements than the code I have running in my guys.
Anyway – wish I could help. I understand the issues, I have overcome several as best as I can – but alas, funding and real-world demands make it so I can't follow up on what I want to do, and could easily if I had the same budgets as some of these other outfits lol But good chance there's a dozen other folks out there who could make the same claim. Good thing I have gainful employment elsewhere 😊
If you ever think I can help – let me know…
I also have some errors coming from the Sophia.blend head on startup. In particular it complains of a missing NOSE bone, and a number of cyclic dependencies between the bones. I haven't got any insight into using Blender yet, so I'm totally on the outside of that one.
Hi Mark,
I am in full agreement with you on genuine AGI. An architecture needs to be created that has no "if-thens" as I call them – it has to start from scratch and build its world model in a way that allows it to record "successful" interactions with the world (such as receiving food, aka power recharges), and it needs to start from the "servos just randomly twitching" phase.
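The "start from randomly twitching servos, reinforce what earns food" idea above can be illustrated with a deliberately tiny toy – random actions whose weights grow when they happen to earn a reward. Nothing here comes from any actual OpenCog or robot codebase; the action names and reward scheme are made up purely for illustration:

```python
import random

# Toy illustration of the "servos just randomly twitching" phase:
# random actions, with "successful" ones (those earning a reward,
# e.g. reaching the charger) reinforced so they're chosen more often.

random.seed(42)
ACTIONS = ["twitch_left", "twitch_right", "reach_forward"]
weights = {a: 1.0 for a in ACTIONS}   # no innate preferences

def pick_action():
    """Sample an action in proportion to its learned weight."""
    total = sum(weights.values())
    r = random.uniform(0, total)
    for action, w in weights.items():
        r -= w
        if r <= 0:
            return action
    return action  # float-rounding fallback

def reinforce(action, reward):
    """Record a successful interaction by boosting that action's weight."""
    weights[action] += reward

# Pretend "reach_forward" is the twitch that happens to reach the charger:
for _ in range(100):
    a = pick_action()
    if a == "reach_forward":
        reinforce(a, 0.5)

print(max(weights, key=weights.get))  # the most-reinforced action
```

A real "infant AGI" would of course need a far richer state and world model – the point of the sketch is only that the preference emerges from recorded successes, not from any hand-coded "if-then".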
That said, my expertise has always been in tying many different functions together in python. I've managed to tie together the visual, auditory and movement functions on my robots successfully, but definitely NOT in the "neonatal" AGI framework I would have liked. I was after immediate results and short-term successes, which I achieved, but I regret not having put forth the effort to produce an "infant AGI".
That said, I have created the frameworks for such an AGI to express itself within. The visual acquisition system is robust, with a few hardwired functions pertaining to recognizing faces and objects; the auditory processing functions are robust and in some ways still excel, despite my lagging in SO many other areas, beyond much else out there (with the exception perhaps of GPT-3 – but my model is about 1% of the size of GPT-3 and works well at very specific parts of identifying what is being presented to it). The auditory output is quite robust, although devoid of nonverbal utterances; for those I have created code to allow a sort of analog, movement-based expression of unverbalized "statements". And I have a rudimentary facial/head/body movement function running, but nothing anywhere near close to the amazing stuff I'm seeing with Ameca (https://www.cnn.com/videos/business/2021/12/08/humanoid-robot-ameca-lon-orig-tp.cnn).
I consider having this framework in place a perfect "cradle" in which to embed the type of AGI we are both interested in, but I am still myself in "AGI Infancy". My goal was to create robots that I could interact with in a useful and significant manner (for laughs, one of my goals is to have robots that can stack my annual firewood deliveries lol)
I have other systems functioning in test-bed environments that are non-human in structure, but with very sophisticated visual systems that are – apologies – intended to drive intelligent wagons with grippers that will autonomously pick up pinecones on my property every spring. We literally get hundreds of thousands from the pine groves that line our property lol.
So no, Mark, I don't believe you're being impractical. What I believe is being impractical is the trend towards creating AGI as an "instant adult". That is what I attempted to do and as a result I have systems that are very impressive in a very narrow range of functions. So long as they are here, on my property, in the environments I have trained them on, on the tasks I have trained them on, they're pretty cool. But they'd fail horribly outside of this environment. Your vision is what will create robust AGI.
I just hope we survive as a species long enough to see these goals realized…
Let me know if any of this is useful to you,
So, with regard to the foundational thinking of how the AtomSpace can truly unite the sensory data with the motors and the background knowledge – what theory is governing that? I see that the AtomSpace allows linkages to be made, and it allows a common data representation/language. But what is the "glue" that causes these commonly held but algorithmically distinct islands of AtomSpace to coalesce?