While I was in college, I did an experiment for my cognitive sciences professor. It worked on the same principle as this system, except that the images were displayed with a machine called a tachistoscope. Nowadays it's just a computer program that does the same thing.
I sat in a room and watched hundreds of faces appear one by one on a screen. I held a Jeopardy-style button, and when I thought I saw a face for the second time, I pushed it. There were no distractors: I was alone in the room, it was climate-controlled, and the lighting was dim except for the TV screen. An optimal visual environment, you could say.
The length of time each image appeared was controlled, but it didn't speed up or slow down. It was pretty quick, as I recall, though not as quick as this. The duration of each image is just one of the variables that could affect your cognitive performance, meaning how many faces you correctly identified as repeats vs. how many you got wrong. I did it a couple of times, several weeks apart.
The professor was studying eyewitness testimony, which was his field of expertise.
What would be interesting is to ask the people who participate in this to describe what they saw, without suggesting they saw "monsters." I think you would have to be very careful not to "lead the witness," as the objection goes during the examination or cross-examination of a witness at trial.
Since the brain wants a "complete picture" to better understand the environment it is experiencing, you "fill in the gaps." Though you don't directly "see" the peripheral images (if you are following directions and focusing on the cross), your mind "completes" them as they pass by. Some are perceived as more disturbing than others. If you look directly at the images after following the instructions the first time, you see there are no "monsters." It's just a variety of faces, displayed under uniform lighting conditions and with a plain background. So the brain works to "explain" or "complete" what is actually an incomplete retinal input, because the center of focus is away from the face.
The macula is the portion of the retina that receives focused central images, which are then transmitted by the optic nerves to the visual cortex, the brain's visual-processing area. You've probably heard of macular degeneration, which makes it hard to see and accurately process the images you would normally focus on. This is a serious loss of visual acuity, as the person likely will be unable to read or to safely operate machinery, including cars. They do retain their peripheral vision, though, so they still have a degree of function better than a person who becomes blind, for instance.
The rest of the retina, apart from the macula, handles the peripheral visual input. It is this area that is being stimulated by the faces in the display we have seen, and where the visual data is picked up and, again, passed to the visual cortex to be processed and interpreted.
"Monsters?" It may turn out to be bias against people who have different physical characteristics than the viewer. It may also be that the physical appearance of someone who caused negative experiences earlier in the viewer's life is influencing what the viewer sees, so that any similar-looking person produces an unpleasant cognitive experience during the display of faces.
There are so many hypotheses that could explain what one perceives. For example, how closely a face resembles the viewer's own in race, ethnicity, shape, hair or eye color, etc. are all potential independent variables, with faces unlike the viewer's perhaps more often perceived as "disturbing," "scary," or "unpleasant."
There is a career's worth of work in this experiment, and years' worth of analysis and additional experimentation.