A technology overview of how a humanoid robot is built

Let's use the OpenAI app as a mini case study, and once we understand the principle, we will move on to the robot.

The first part is probably ordinary software that detects when the user stops talking, simply by detecting a significant drop in the volume of the sound coming from the device's microphone over a certain period of time, say 2 seconds.
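As a rough illustration, end-of-speech detection can be as simple as tracking the loudness of incoming audio chunks and waiting for a stretch of quiet. This is only a minimal sketch with made-up threshold and chunk values; a real app would use proper voice activity detection.

```python
import numpy as np

# Minimal sketch of end-of-speech detection, assuming audio arrives as a
# stream of short chunks (e.g. 100 ms of samples scaled to [-1, 1]).
# The threshold and durations below are invented values for illustration.

SILENCE_THRESHOLD = 0.01   # RMS level below which a chunk counts as "silence"
SILENCE_SECONDS = 2.0      # how long the user must stay quiet
CHUNK_SECONDS = 0.1        # duration of each incoming chunk

def user_stopped_talking(chunks):
    """Return True once roughly 2 seconds of consecutive quiet chunks are seen."""
    quiet_time = 0.0
    for chunk in chunks:                      # chunk: np.ndarray of samples
        rms = np.sqrt(np.mean(chunk ** 2))    # rough loudness of this chunk
        if rms < SILENCE_THRESHOLD:
            quiet_time += CHUNK_SECONDS
            if quiet_time >= SILENCE_SECONDS:
                return True
        else:
            quiet_time = 0.0                  # speech resumed, reset the timer
    return False
```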

The second step is running a speech-to-text model on the incoming sound.

The third step is sending this text as a prompt to the transformer, in this case GPT-4.
The transformer returns text. The next step is text-to-speech.
Then we hear GPT-4's answer in the voice of a British woman, or an American man.
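To make the pipeline concrete, here is a minimal sketch of how these steps could be wired together with the OpenAI Python client. The model names (whisper-1, gpt-4, tts-1), the voice, and the use of files rather than live audio streams are my assumptions for illustration; the actual app is certainly implemented differently.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Speech to text: transcribe the recorded question.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2. Send the text as a prompt to the transformer.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer_text = reply.choices[0].message.content

# 3. Text to speech: synthesize the answer in a chosen voice.
speech = client.audio.speech.create(
    model="tts-1", voice="alloy", input=answer_text
)
speech.write_to_file("answer.mp3")  # play this file back to the user
```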

An end user may think they are chatting seamlessly with GPT-4, but in fact it is many parts connected together that create this illusion.

Now let's talk about a humanoid robot. We will make the process as abstract as possible and ignore a ton of details just to get the big picture of the robot's activity.

Just to clarify: this is an attempt to understand how the robot works based on my research. I'll update this article as I learn more, and part of the purpose of this post is to get feedback from engineers and learn from it.

So the beginning is that the robot listens to the user through a microphone and runs a speech-to-text model on the sound. Once we have the text that came from the user, we pass it to a model similar to GPT-4 that was probably specially trained. From now on we will call it the "Cortex".

Another input that the Cortex receives is probably an image of the environment, from which an image-to-text model extracts a detailed description.
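As a stand-in for that stage, here is a sketch using an off-the-shelf image captioning model from the Hugging Face transformers library. The specific model (BLIP) and the single-frame setup are my assumptions; whatever the robot actually uses is almost certainly more specialized.

```python
from transformers import pipeline

# A stand-in for the robot's "image to text" stage: an off-the-shelf
# captioning model. The real robot's vision stack is unknown; this only
# illustrates the idea of turning a camera frame into a text description.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

description = captioner("camera_frame.jpg")[0]["generated_text"]
print(description)  # e.g. "a wooden table with a red apple on it"
```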

The Cortex receives the user's text plus a text describing the environment.

The Cortex can return speech output, which of course goes through text-to-speech, or an instruction to act, which goes to the hand model.

If the Cortex decides on an instruction to act - say because the text it received from the user is "Bring me the apple" and the text it received from the general description of the environment is "In front of you is a table with an apple on it" - then the action reaches the model that moves the hands.
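One way to picture the speak-or-act decision is to have the Cortex answer in a fixed format that the surrounding software can parse. The "SPEAK:"/"ACT:" convention below is purely my own assumption, sketched for illustration.

```python
from dataclasses import dataclass

# Hypothetical convention: the Cortex is prompted to answer either with
# "SPEAK: <sentence>" or "ACT: <instruction>". The prefix scheme and the
# prompt layout are my own inventions for illustration.

@dataclass
class CortexDecision:
    kind: str   # "speak" or "act"
    text: str   # what to say, or what action to perform

def build_prompt(user_text, environment_text):
    return (
        f"User said: {user_text}\n"
        f"Environment: {environment_text}\n"
        "Reply with either 'SPEAK: ...' or 'ACT: ...'"
    )

def parse_cortex_output(output):
    if output.startswith("ACT:"):
        return CortexDecision(kind="act", text=output[len("ACT:"):].strip())
    return CortexDecision(kind="speak", text=output.removeprefix("SPEAK:").strip())

# Example with the apple scenario from the text:
prompt = build_prompt("Bring me the apple",
                      "In front of you is a table with an apple on it")
decision = parse_cortex_output("ACT: Hand the apple to the person in front of you")
```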

This model is the really big deal. It must receive the text of the action: "Hand the apple to the person in front of you". It must also receive an image as input, and this time not through image-to-text, because it needs more precise information to know where to move the hands. It must also have some description of the image that includes labeling - otherwise how will it know that this object in front of it is the apple in question?

Roughly 5 times a second (I just picked 5 as an example), the model receives inputs and moves the hand a little toward the apple.

If so, then the inputs of the hand model are at least:

1. Text that contains the action from the cortex.

2. A processed image - perhaps reduced in size, perhaps with labels added for objects in the environment, perhaps converted into a three-dimensional description of the scene - one that can be fed as input to the robot's hand model.

The hand model will then move the hands a bit toward the apple.
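Putting those inputs together, a control loop along the lines described above might look like the sketch below. Every name here (HandModelInput, hand_model, camera, detector) and the 5 Hz rate are placeholders I invented for illustration; real manipulation policies typically run much faster and output joint targets or torques.

```python
import time
from dataclasses import dataclass

import numpy as np

@dataclass
class HandModelInput:
    action_text: str     # e.g. "Hand the apple to the person in front of you"
    image: np.ndarray    # raw camera frame, not a text description
    labels: dict         # e.g. {"apple": (x, y, z)} from an object detector

def control_loop(hand_model, camera, detector, action_text, hz=5):
    """Run the hand model roughly `hz` times per second until it reports done."""
    period = 1.0 / hz
    while not hand_model.done():             # hand_model, camera, detector are placeholders
        frame = camera.capture()
        inputs = HandModelInput(
            action_text=action_text,
            image=frame,
            labels=detector.detect(frame),    # labeled objects so "apple" is grounded
        )
        hand_model.step(inputs)               # move the hands a little toward the goal
        time.sleep(period)
```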

The Cortex, the one that gave the command for the action, will receive the original action plus a fresh text description of the environment produced by image-to-text, and it decides whether to give the hand model a new command or tell it to continue. It may decide that the operation has ended successfully, in which case it orders the hand model to calmly place the hands on the table and the text-to-speech model to say: "I hope you enjoy your apple. How else can I help you?"

This loop continues until the Cortex stops telling any part of the robot to perform an action.

The listening and speech-to-text part of course keeps working, and if new information arrives, the Cortex may trigger text-to-speech to talk to the user, or activate the hand model to perform an action.
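Tying the whole thing together, the robot's top-level behavior could be sketched as a supervisory loop like the one below. All of the function names are placeholders I made up; the point is only the flow of information between the parts.

```python
# High-level sketch of the supervisory loop described above.
# listen, describe_environment, cortex, speak, and hand_model_execute are
# placeholder names invented for illustration.

def robot_loop():
    while True:
        user_text = listen()                      # microphone -> speech-to-text
        env_text = describe_environment()         # camera -> image-to-text
        decision = cortex(user_text, env_text)    # the "Cortex" decides what to do

        if decision.kind == "speak":
            speak(decision.text)                  # text-to-speech
        elif decision.kind == "act":
            hand_model_execute(decision.text)     # hand model works toward the goal
            # The Cortex keeps receiving fresh environment descriptions and can
            # issue a new command, tell the hand model to continue, or wrap up.
```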

Please, if you have any comments, additional information, refinements, references to articles or sources, or your own knowledge on the subject - please share. We are here to learn.
