We trained a neural network to play Minecraft using Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, with only a small amount of labeled contractor data. With fine-tuning, our model can learn to craft diamond tools, a task that usually takes proficient humans over 20 minutes (24,000 actions). Because our model uses the native human interface of keypresses and mouse movements, the approach is quite general and represents a step towards general computer-using agents.