Personal head-up displays are a technology whose time ought by now to have come, but which, notwithstanding attempts such as Google Glass, has steadfastly refused to catch on. There’s an intriguing possibility in [Basel Saleh]’s CaptionIt project though: a head-up display that provides captions for everyday situations.
The hardware is a tiny I²C OLED screen with a reflector and a 3D-printed mount attached to a pair of glasses, and it’s claimed to work with almost any ARMv7 SBC, including more recent Raspberry Pi boards. It uses the Vosk speech recognition toolkit to read audio from a USB audio device, with the resulting text displayed on the screen.
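We haven’t pored over [Basel Saleh]’s code, but the core recognition loop with Vosk’s Python bindings is pleasingly short. Here’s a minimal sketch, assuming a Vosk model unpacked into a local model directory and the sounddevice library capturing from the USB microphone; a real build would draw the text on the OLED instead of printing it:

```python
import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000
audio_q = queue.Queue()

def callback(indata, frames, time, status):
    # Push raw 16-bit PCM blocks from the microphone into the queue.
    audio_q.put(bytes(indata))

model = Model("model")  # path to an unpacked Vosk model directory
rec = KaldiRecognizer(model, SAMPLE_RATE)

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000,
                       dtype="int16", channels=1, callback=callback):
    while True:
        data = audio_q.get()
        if rec.AcceptWaveform(data):
            text = json.loads(rec.Result()).get("text", "")
        else:
            text = json.loads(rec.PartialResult()).get("partial", "")
        if text:
            print(text)  # stand-in for pushing the caption to the OLED
```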
The device is shown in action in the video below the break, and without trying it ourselves we can’t comment on its utility, but aside from the novelty we can see it could have a significant impact as an accessibility aid. But it’s as an electronic Babel fish coupled with translation software that we’d like to see it develop, so that inadvertent but hilarious international misunderstandings can be shared by all.
Regular readers will know that we’ve brought you plenty of HUD tomfoolery in the past.
Continue reading “Live Subtitles For Your Life”
Closed captioning on television and subtitles on DVD, Blu-ray, and streaming media are taken for granted today. But it wasn’t always so. In fact, it was quite a struggle for captioning to become commonplace. Back in the early 2000s, I unexpectedly found myself involved in a variety of closed captioning projects, both designing hardware and consulting with engineering teams at various consumer electronics manufacturers. I may have been the last engineer working with analog captioning as everyone else moved on to digital.
But before digging in, let’s establish some definitions, because there’s a lot of confusing and imprecise language floating around on this topic. I often use the word captioning, which encompasses both closed captions and subtitles:
- Closed Captions: Transmitted in a non-visible manner as textual data. Usually they can be enabled or disabled by the user. In the NTSC system, this is often referred to as Line 21, since the data was transmitted on video line 21 in the Vertical Blanking Interval (VBI); see the decoding sketch after these definitions.
- Subtitles: Rendered in a graphical format and overlaid onto the video / film. Usually they cannot be turned off. Also called open or hard captions.
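To make the Line 21 mention concrete: under the EIA-608 standard, each video field carries two caption bytes, each made of seven data bits plus an odd-parity bit, in a character set that is mostly (but not exactly) ASCII. Here’s a minimal Python sketch of that byte-level decode, assuming the two bytes have already been recovered from the video line; a real decoder also has to interpret control-code pairs:

```python
def has_odd_parity(byte: int) -> bool:
    # All eight bits, parity bit included, must contain an odd number of ones.
    return bin(byte & 0xFF).count("1") % 2 == 1

def decode_pair(b1: int, b2: int) -> str:
    chars = []
    for b in (b1, b2):
        if not has_odd_parity(b):
            chars.append("\ufffd")   # parity error: emit a replacement character
            continue
        data = b & 0x7F              # strip the parity bit
        if data >= 0x20:             # 0x00-0x1F begin control codes, skipped here
            chars.append(chr(data))  # approximation: EIA-608 text is mostly ASCII
    return "".join(chars)

print(decode_pair(0xC8, 0xE9))  # -> "Hi" ('H' and 'i' with odd parity added)
```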
The text contained in captions generally falls into one of three categories. Pure dialogue (nothing more) is often the style of captioning you see in subtitles on a DVD or Blu-ray. Ordinary captioning includes the dialogue, but with the addition of occasional cues for music or a non-visible event (a doorbell ringing, for example). Finally, “Subtitles for the Deaf or Hard-of-hearing” (SDH) is a more verbose style that adds even more descriptive information about the program, including the speaker’s name, off-camera events, etc.
Roughly speaking, closed captions target deaf and hard-of-hearing viewers. Subtitles target an audience who can hear the program but want to see the dialogue for some reason, like following a foreign movie or learning a new language.
Continue reading “History Of Closed Captions: The Analog Era”
A research paper from Dalian University of Technology in China and City University of Hong Kong (direct PDF link) outlines a system that automatically generates comic books from videos. But how can an algorithm boil a video scene down to a single still image that appropriately reflects its gravity? This impressive feat is accomplished by saving two still images per second, then segmenting the frames into scenes through region-of-interest analysis and importance ranking.
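To give a flavor of that first step, here’s a minimal OpenCV sketch of the two-frames-per-second sampling; the segmentation and importance ranking described in the paper would then run over these candidates. The helper and filename are ours for illustration, not the authors’:

```python
import cv2

def sample_candidate_frames(path: str, per_second: int = 2):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0      # fall back if FPS is unreported
    step = max(int(round(fps / per_second)), 1)  # keep every Nth frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

candidates = sample_candidate_frames("input.mp4")
print(f"kept {len(candidates)} candidate frames")
```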
For its next trick, speech for each scene is processed by combining subtitle information with the audio track of the video. The audio is analyzed for emotion to determine the appropriate speech bubble type and size of the subtitle text. Frames are even analyzed to establish which person is speaking for proper placement of the bubbles. It can then create layouts of the keyframes, determining panel sizes for each page based on the region-of-interest analysis.
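The paper’s emotion model isn’t something we can reproduce here, but the mapping stage might look like this purely hypothetical sketch, where an emotion label and intensity, assumed to come from some upstream audio classifier, select a bubble outline and a font scale:

```python
# Hypothetical mapping: none of these style names or thresholds come from the paper.
BUBBLE_STYLES = {
    "neutral": "oval",    # ordinary speech bubble
    "angry":   "spiky",   # jagged shout bubble
    "sad":     "wavy",    # trembling outline
    "thought": "cloud",   # thought balloon
}

def choose_bubble(emotion: str, intensity: float):
    style = BUBBLE_STYLES.get(emotion, "oval")
    intensity = max(0.0, min(intensity, 1.0))  # clamp to [0, 1]
    font_scale = 1.0 + 0.5 * intensity         # louder speech gets bigger text
    return style, font_scale

print(choose_bubble("angry", 0.8))  # -> ('spiky', 1.4)
```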
The process is completed by stylizing the keyframes with flat color through quantization, for that classic cel shading look, and then populating the layouts with each frame and word balloon.
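The paper says only “quantization”, so k-means is our assumption, but k-means color quantization in OpenCV is one common way to get that flat, cel-shaded look:

```python
import cv2
import numpy as np

def quantize_colors(image, k: int = 8):
    pixels = image.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 3,
                                    cv2.KMEANS_PP_CENTERS)
    flat = centers[labels.flatten()].astype(np.uint8)  # snap each pixel to its cluster color
    return flat.reshape(image.shape)

frame = cv2.imread("keyframe.png")  # placeholder filename
cv2.imwrite("keyframe_flat.png", quantize_colors(frame, k=8))
```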
The team conducted a study with 40 users, pitting their results against previous techniques that require more human intervention, and still besting them in every measure. Like any great superhero, though, the team sees room for improvement. In the future they would like to improve the accuracy of keyframe selection, and they propose using a neural network to do so.
Thanks to [Qes] for the tip!