In this paper, we present a framework for parsing video events with stochastic Temporal And-Or Graph (T-AOG) and unsupervised learning of the T-AOG from video. This T-AOG represents a stochastic event grammar. The alphabet of the T-AOG consists of a set of grounded spatial relations including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions which are specified ...