A pair of undergraduate students, neither with extensive AI expertise, claim to have developed an openly available AI model capable of generating podcast-style clips similar to those produced by Google’s NotebookLM.
The Market for Synthetic Speech Tools and Nari Labs’ Model
The market for synthetic speech tools is vast and growing, with ElevenLabs among the largest players alongside challengers such as PlayAI and Sesame. Investors see immense potential in the space: voice AI startups raised over $398 million in VC funding last year, according to PitchBook.
Inspired by NotebookLM, Toby Kim and his co-founder at Korea-based Nari Labs began studying speech AI three months ago. They used Google’s TPU Research Cloud program to train their model, Dia, which boasts 1.6 billion parameters. Dia lets users customize speakers’ tones and insert nonverbal cues such as disfluencies, coughs, and laughs into a script.
Testing and Features of Dia
Dia, available on the AI development platform Hugging Face and on GitHub, can run on most modern PCs with at least 10GB of VRAM. It can generate a random voice or clone a specific person’s voice based on a user’s prompt. In TechCrunch’s testing through Nari’s web demo, Dia delivered competitive voice quality, and its voice cloning function was easy to use.
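For readers curious about what running the model locally involves, the snippet below is a minimal sketch based on the usage pattern documented in Nari’s Dia repository. The package import, the Dia.from_pretrained entry point, the speaker-tag script format, and the output handling are assumptions drawn from that documentation and may differ between releases.

```python
# Minimal sketch of local generation with Dia.
# Assumes the interface documented in the nari-labs/dia repository;
# class and argument names may change between versions.
import soundfile as sf          # writes the generated waveform to disk
from dia.model import Dia       # assumed entry point exposed by the Dia package

# Download the 1.6-billion-parameter weights from Hugging Face.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Dia scripts tag alternating speakers and can embed nonverbal cues
# (the coughs and laughs mentioned above) in parentheses.
script = (
    "[S1] Welcome back to the show. Today we're talking about open speech models. "
    "[S2] Honestly, I didn't expect the demo to sound this natural. (laughs) "
    "[S1] Right? (coughs) Let's get into it."
)

# Generate audio for the script and save it as a 44.1 kHz WAV file.
audio = model.generate(script)
sf.write("podcast_clip.wav", audio, 44100)
```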
Concerns and Future Plans of Nari Labs
While Dia offers powerful voice generation capabilities, it lacks safeguards against potential misuse. Nari warns against using the model for impersonation, deception, or other illicit activities, but says it is not responsible for misuse. Nari also has not disclosed the origin of the data used to train Dia, raising concerns about potential copyright violations.
Despite these issues, Kim envisions building a synthetic voice platform with a “social aspect” on top of Dia and future models. Nari also plans to release a technical report for Dia and to expand language support beyond English.
