Dynamic Vocabulary Support for Videos

I was watching a YouTube video this afternoon (on Advanced Password Recovery, if you’re interested), and it used a number of technical terms that I wasn’t familiar with.

I was only half watching this particular video, so I wasn’t inclined to pause and take notes to reinforce and research later, as I often do if I’m more actively watching something.

But I thought it’d be great if some of those technical terms showed up on-screen, because I’m a pretty visual learner, and seeing new words helps me retain them a lot better than just hearing them does. (Of course, this is also just generally good learning design: it helps learners to be exposed to target content in different ways.)

However, that would take a lot of commitment from show creators in post-production: they’d have to decide which words they wanted to highlight, and add the text in at the appropriate moment in the video.


So, what I propose is an app/website that scans the caption track for low-frequency words in the given language. Then it superimposes the text of each of those words over the video at the time the word is spoken, and leaves it on the screen for, say, 3-4 seconds.
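
Here’s a rough sketch of that core step in Python, just to show the idea is small. Everything in it is made up for illustration: the Cue and Overlay types, the find_overlays function, the "uses per million words" frequency table, and the cutoff of 1.0. It also assumes captions only carry a start and end time per line (not per word), so it guesses each word’s timing by spreading the words evenly across the line, and it treats words missing from the table as rare.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds into the video
    end: float
    text: str


@dataclass
class Overlay:
    word: str
    show_at: float
    hide_at: float


def find_overlays(cues, freq_per_million, threshold=1.0, hold=3.0):
    """Return one Overlay per caption word rarer than `threshold` uses per million."""
    overlays = []
    for cue in cues:
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", cue.text.lower())
        for i, word in enumerate(words):
            # Assumption: a word missing from the table counts as rare (frequency 0).
            if freq_per_million.get(word, 0.0) < threshold:
                # Estimate when the word is spoken from its position in the line.
                show_at = cue.start + (cue.end - cue.start) * i / len(words)
                overlays.append(Overlay(word, show_at, show_at + hold))
    return overlays


# Hypothetical data: one caption line and a tiny frequency table.
cues = [Cue(10.0, 14.0, "We use a rainbow table against the hash")]
freq = {"we": 5000.0, "use": 900.0, "a": 20000.0, "rainbow": 8.0,
        "table": 150.0, "against": 400.0, "the": 50000.0, "hash": 0.6}

overlays = find_overlays(cues, freq)
print(overlays)
```

With this toy data, only "hash" falls under the cutoff, so it would be shown from 13.5 to 16.5 seconds. Note that the two-word term "rainbow table" slips through, which is the single words vs. phrases problem.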


The technology for this certainly exists.

YouTube has automatic captioning now (and obviously, natural language processing technology in general is improving daily), and the API allows retrieval of the captions.
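
I’m not going to guess at the API calls here, but once you have a caption track in hand, reading it is simple. As a sketch, assuming the captions have been saved as SRT-style text (a number, a timing line like 00:00:01,000 --> 00:00:04,000, then the caption text), something like this would turn it into timed lines. The function names parse_srt and to_seconds are my own.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(srt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    # Caption blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                parts = match.groups()
                # Everything after the timing line is caption text.
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((to_seconds(*parts[:4]), to_seconds(*parts[4:]), text))
                break
    return cues


sample = """1
00:00:01,000 --> 00:00:04,000
Today we look at
password recovery.

2
00:00:04,500 --> 00:00:07,250
First, a rainbow table.
"""

parsed = parse_srt(sample)
print(parsed)
```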

Then you’d need a database of word frequencies, something like the frequency lists from the Corpus of Contemporary American English.

Ideally, this would be a dynamic database, so that it could adapt to shifts in language (e.g., new slang). And I’m sure Google (if anyone’s out there listening) has their own interesting, robust, dynamic language data that, if they took this on themselves, would be incredibly helpful in developing this tool.



Originally (…an hour ago), I’d imagined this as a tool for video creators, but as I was thinking about it more, it’d be a great tool for viewers instead/in addition. Then the viewer doesn’t need to be dependent upon a creator making their video more learning-oriented in order to reap the benefits of it. The viewer could choose to watch a video via this tool, to help them learn the content better.


Additional features (incomplete):

  • Vocabulary text size, color and font are adjustable
  • Length of time word stays on the screen is adjustable (default: ~3 seconds)
  • Words by default are displayed in the upper left corner of the video screen, and stack below each other if multiple uncommon words are spoken in a row
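
The stacking behavior could work something like the sketch below: each word takes the topmost row that is empty at the moment it appears, so a word spoken while another is still on screen lands one row lower. The assign_rows function and the (word, show_at, hide_at) tuples are assumptions of mine, not a finished design.

```python
def assign_rows(overlays):
    """overlays: list of (word, show_at, hide_at) tuples.

    Returns (word, show_at, hide_at, row) tuples, where row 0 is the top slot
    in the corner and higher rows stack below it.
    """
    row_free_at = []  # row_free_at[r] is the time at which row r becomes empty
    placed = []
    for word, show_at, hide_at in sorted(overlays, key=lambda o: o[1]):
        for row, free_at in enumerate(row_free_at):
            if free_at <= show_at:
                # Reuse the topmost row whose previous word has already gone.
                row_free_at[row] = hide_at
                break
        else:
            # Every existing row is still occupied, so open a new one below.
            row = len(row_free_at)
            row_free_at.append(hide_at)
        placed.append((word, show_at, hide_at, row))
    return placed


# Hypothetical timings, using the default of about 3 seconds on screen.
placed = assign_rows([
    ("hash", 10.0, 13.0),
    ("salt", 11.0, 14.0),
    ("bcrypt", 13.5, 16.5),
])
print(placed)
```

Here "salt" appears while "hash" is still showing, so it goes on row 1; by the time "bcrypt" appears, row 0 is free again.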



Potential users (incomplete):

  • People like me, who are watching a video to learn something from it and/or are obsessive note takers
  • People learning a new language (e.g., watching videos in the new language)
  • People with language difficulties in their native language
  • People with language processing disorders
  • ?


Addressing potential issues:

  • Obscuring video content: vocabulary text may obscure visual content in video
    • could make this “smarter” and use facial recognition to avoid placing text over faces in videos, and OCR-like technology (“optical character recognition,” which visually identifies letters/text) to avoid obscuring existing text in the video; but you’d still want it to default to the same position, for consistency for the learner
      • even avoiding text and faces, of course, the vocabulary text could still obscure important visual content
  • Accuracy of captions: this design counts on video creators providing captions, which many do not; and even *if* they’ve added automatic captioning, it counts on someone double-checking that the captions are correct
    • This doesn’t fix the issue, but in general, anyone posting a video with spoken words should include captions, at least for accessibility’s sake. And YouTube (where many videos are posted) makes it incredibly easy to at least have rough captions for your video. As a society, we need to work on designing content and tools to be accessible for everyone.
  • Word frequency: it’d likely be an ongoing process to determine how rare a word should be in order to be highlighted
    • maybe the viewers could adjust the level of obscurity before viewing a video
    • there could be a way for viewers to flag words that they didn’t need highlighted (because they already knew them), building up a personal database that could inform the highlighting of future videos they view via this app
  • Single words vs. phrases: this design is built upon frequency of single words, but some technical jargon uses relatively common words in uncommon combinations, or uses common words with a less common meaning that’s specific to the field
    • this might require a phrase frequency database, as well, which may or may not be as available (I’m not going to research it right now)
    • tags on videos might help narrow the scope of word frequency, by helping to identify the general topic/field of the video content
  • Vulgarities: highlighting vulgar words
    • these could be blocked from being highlighted when the tool analyzes the captions
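
A few of these fixes (the adjustable level of obscurity, the personal list of already-known words, and the blocked vulgarities) boil down to one small filter. Here is a sketch; ViewerProfile, should_highlight, the frequency numbers, and the stand-in blocked word are all hypothetical, and it does nothing about phrases.

```python
from dataclasses import dataclass, field


@dataclass
class ViewerProfile:
    # Highlight words used fewer than this many times per million words.
    threshold: float = 1.0
    # Words this viewer has flagged as already known.
    known_words: set = field(default_factory=set)

    def flag_known(self, word):
        self.known_words.add(word.lower())


def should_highlight(word, freq_per_million, profile, blocked_words=frozenset()):
    word = word.lower()
    if word in blocked_words or word in profile.known_words:
        return False
    # Assumption: a word missing from the table counts as rare (frequency 0).
    return freq_per_million.get(word, 0.0) < profile.threshold


# Hypothetical frequency table and blocklist ("darn" stands in for a vulgarity).
freq = {"hash": 0.6, "salt": 12.0, "keyspace": 0.1, "darn": 0.3}
blocked = {"darn"}
profile = ViewerProfile()

before_flag = should_highlight("hash", freq, profile, blocked)
profile.flag_known("Hash")  # the viewer says they already know this one
after_flag = should_highlight("hash", freq, profile, blocked)
print(before_flag, after_flag)
```

Raising profile.threshold makes the tool highlight more (less obscure) words; lowering it keeps only the rarest ones.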
