A Brown University Research Group


The Breakfast Actions Dataset




A common problem in computer vision is transferring algorithms developed on meticulously controlled datasets to real-world problems, such as unscripted, uncontrolled videos with natural lighting, viewpoints, and environments. With advances in feature descriptors and generative methods for action recognition, a need has emerged for comprehensive datasets that reflect the variability of real-world recognition scenarios.

This dataset comprises 10 actions related to breakfast preparation, performed by 52 different individuals in 18 different kitchens. It is to date one of the largest fully annotated datasets available. A main motivation for the proposed recording setup "in the wild," as opposed to a single controlled lab environment, is for the dataset to more closely reflect real-world conditions in the monitoring and analysis of daily activities.

The number of cameras used varied from location to location (n = 3−5). The cameras were uncalibrated, and their positions changed with the location. Overall we recorded ∼77 hours of video (>4 million frames). The cameras used were webcams, standard industrial cameras (Prosilica GE680C), and a stereo camera (BumbleBee, Point Grey, Inc.). To balance out viewpoints, we also mirrored videos recorded from laterally positioned cameras. To reduce the overall amount of data, all videos were down-sampled to a resolution of 320×240 pixels at a frame rate of 15 fps.

Cooking activities included the preparation of:

  1. coffee (n=200)
  2. orange juice (n=187)
  3. chocolate milk (n=224)
  4. tea (n=223)
  5. bowl of cereals (n=214)
  6. fried eggs (n=198)
  7. pancakes (n=173)
  8. fruit salad (n=185)
  9. sandwich (n=197)
  10. scrambled eggs (n=188).
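The per-activity counts listed above can be tallied to give the overall size of the dataset. A quick check in Python (the dictionary simply restates the numbers from the list):

```python
# Per-activity recording counts, as listed above.
counts = {
    "coffee": 200, "orange juice": 187, "chocolate milk": 224,
    "tea": 223, "cereals": 214, "fried eggs": 198,
    "pancakes": 173, "fruit salad": 185, "sandwich": 197,
    "scrambled eggs": 188,
}

total = sum(counts.values())
print(f"{len(counts)} activities, {total} recordings in total")
# → 10 activities, 1989 recordings in total
```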

Illustration of the actions

Two sample pictures for the activities ‘juice’ and ‘cereals’ with coarse and fine annotations:




The benchmark and database are described in the following articles. We request that authors cite these papers in publications describing work carried out with this system and/or the video database.

H. Kuehne, A. B. Arslan and T. Serre. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. CVPR, 2014.  PDF Bibtex

H. Kuehne, J. Gall and T. Serre. An end-to-end generative framework for video segmentation and recognition. WACV, 2016.  PDF Bibtex  Project Website


Videos: BreakfastII_15fps_qvga_sync.rar (3.6 GB)

Videos as multipart files: part 1, part 2, part 3, part 4, part 5, part 6

To extract the files, change to the download directory on the command line and run:

cat Breakfast_Final.tar* | tar -zxvf -
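The same concatenate-then-extract pattern can be tried end to end on a toy archive. The file names below are made up purely for the demonstration; the multipart files glob-sort in the right order, so `cat` reassembles the original archive for `tar` to read from stdin:

```shell
# Build a small gzipped tarball, split it into parts, then
# reassemble and extract with the cat | tar pattern shown above.
mkdir -p demo/src demo/out
printf 'hello\n' > demo/src/file.txt
tar -czf demo/archive.tar.gz -C demo src
split -b 100 demo/archive.tar.gz demo/archive.tar.gz.part_
cat demo/archive.tar.gz.part_* | tar -zxf - -C demo/out
cat demo/out/src/file.txt   # prints "hello"
```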

Pre-computed STIP features are available here: BreakfastII_15fps_qvga_sync_stips.rar (~14GB)

Pre-computed dense trajectories:
One large file: dense_traj_all.rar (~220GB)
Split into four parts: dense_traj_all_s1.tar.gz (~37GB) dense_traj_all_s2.tar.gz (~57GB) dense_traj_all_s3.tar.gz (~42GB) dense_traj_all_s4.tar.gz (~75GB)

Frame-based pre-computed reduced FV (64 dim): breakfast_data.tar.gz (~1GB)

Frame-based pre-computed BOW representation (30 dim): hist_h3d_c30.rar (~180 MB)

Coarse segmentation information: segmentation_coarse.tar.gz

Fine segmentation information:  segmentation_fine.tar.gz
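Once downloaded, the segmentation files can be parsed with a few lines of Python. The layout assumed below (one segment per line in the form `start-end label`, with `SIL` marking background) is our assumption about the distribution format, as are the sample labels; verify against the actual files before relying on it:

```python
# Hypothetical parser for a coarse segmentation file, assuming the
# one-segment-per-line layout "start-end label" (e.g. "1-30 SIL").
# Check the downloaded files: this format is an assumption.
def parse_segments(text):
    segments = []
    for line in text.strip().splitlines():
        frames, label = line.split()
        start, end = (int(x) for x in frames.split("-"))
        segments.append((start, end, label))
    return segments

sample = """\
1-30 SIL
31-150 pour_cereals
151-428 pour_milk
429-575 SIL
"""
for start, end, label in parse_segments(sample):
    print(f"frames {start:4d}-{end:4d}: {label}")
```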

The splits for testing and training are:

s1: P03 – P15
s2: P16 – P28
s3: P29 – P41
s4: P42 – P54
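For leave-one-split-out evaluation, each participant ID maps to exactly one split. A small helper reflecting the ranges above (the `Pxx` ID format follows the listing; the function name is ours):

```python
# Map a participant ID (P03..P54) to its evaluation split,
# following the s1-s4 ranges listed above.
def split_of(participant):
    num = int(participant.lstrip("P"))
    if 3 <= num <= 15:
        return "s1"
    elif 16 <= num <= 28:
        return "s2"
    elif 29 <= num <= 41:
        return "s3"
    elif 42 <= num <= 54:
        return "s4"
    raise ValueError(f"unknown participant: {participant}")

# e.g. hold out s1 for testing and train on s2-s4:
test_ids = [f"P{n:02d}" for n in range(3, 55)
            if split_of(f"P{n:02d}") == "s1"]
print(test_ids)  # ['P03', 'P04', ..., 'P15']
```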

For further information, please contact: kuehne [@] iai . uni-bonn . de


Current version of the system is available on GitHub: https://github.com/hildekuehne/HTK_actionRecognition

The previous MATLAB demo for action recognition with HTK is still available here: demo_bundle. To run the example, please follow the instructions in the README file.



Sequence recognition (10 activities, accuracy in %)

DT + R-FV 73.3 [2]
HOG/HOF + R-FV 62.3 [2]
HOG/HOF + BOW 40.5 [1]

Unit recognition (48 units, accuracy in %)

HOG/HOF + BOW 31.8 [1]

Frame-based recognition (48 units, accuracy in %)

DT + R-FV 56.3 [2]
HOG/HOF + BOW 28.8 [1]


  1. H. Kuehne, A. B. Arslan and T. Serre. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. CVPR, 2014.
  2. H. Kuehne, J. Gall and T. Serre. An end-to-end generative framework for video segmentation and recognition. WACV, 2016.


About the page


For questions about the datasets and benchmarks, please contact Hilde Kuehne ( kuehne  [@] iai . uni-bonn . de ).


This work was supported by ONR grant (N000141110743) and NSF early career award (IIS- 1252951) to TS. Additional support was provided by the Robert J. and Nancy D. Carney Fund for Scientific Innovation and the Center for Computation and Visualization (CCV). HK was funded by the Quaero Programme and OSEO, the French State agency for innovation.


  • 04/10/2014 first version of the web page
  • 03/22/2016 second version of the web page