Create Viral YouTube & Instagram Shorts with AI: Step-by-Step Code Tutorial

Hands-on programming tutorial using OpenAI’s ChatGPT API, Segmind’s SDXL API, and ElevenLabs’ text-to-speech API

Ramsri Goutham
13 min read · Oct 15, 2023
Programmatically create YouTube shorts with AI

Introduction

In today’s fast-paced digital landscape, capturing your audience’s attention in just a few seconds can make all the difference. Short-form videos have emerged as a powerhouse in the world of online content, with platforms like YouTube Shorts and Instagram Reels providing fertile ground for creators and marketers to thrive.

Imagine having the ability to effortlessly generate captivating short-form videos that are visually striking and narratively engaging. Thanks to the incredible advancements in Generative Artificial Intelligence (AI), this is now not only possible but surprisingly simple.

In this step-by-step code tutorial, we’ll embark on a journey to explore how AI can be harnessed to create viral YouTube and Instagram Shorts that have the potential to resonate with audiences and elevate your content to new heights.

At the end of this tutorial, you’ll have a simple user interface (UI) built with Gradio that you can run in Google Colab. With this UI, you can select a genre, click “generate,” and in just a minute or two, you’ll have a short-form video with visuals, audio, and script — all generated by AI. You can even add subtitles to make your video even more appealing. Let’s dive into the step-by-step process.

Gradio App UI

Video Tutorial link:

https://www.youtube.com/watch?v=3XlbswJm7Yg

Step 1: The Theory

We will generate our script and the corresponding descriptions for visuals using OpenAI’s ChatGPT, given a genre as input.

Text and Corresponding Description using ChatGPT

Then we will take the image descriptions and generate images with Stable Diffusion XL (SDXL) 1.0 using Segmind’s API.

Image Description to AI image using Segmind

Next, we will take the script and generate human-like speech with emotions using ElevenLabs text-to-speech API.

Script to spoken words using Elevenlabs API

We will piece together the audio and visuals using MoviePy. We will also add a blurred background, because the SDXL-generated images are square (1024 by 1024, resized to 1080 by 1080) while we want our short-form videos in the vertical 1080 by 1920 resolution used by YouTube Shorts and Instagram Reels.

Then we’ll see how we can add captions to the video to show the words as they are spoken using OpenAI’s Whisper speech-to-text model.

Stitch Audio and visuals and add captions
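
To make the flow concrete, here is a rough sketch of how the pieces fit together. The helper functions named here (fetch_imagedescription_and_script, generate_images, generate_and_save_audio, create_combined_video_audio) are the ones we will build in Step 3, so treat this as a roadmap rather than code you can run yet.

import uuid

# Hypothetical top-level driver showing how the Step 3 functions connect
def build_short(prompt):
    # One folder per run keeps the generated images, audio and video together
    folder = str(uuid.uuid4())
    # 1. ChatGPT: script lines plus matching image descriptions
    image_prompts, texts = fetch_imagedescription_and_script(prompt, chatgpt_url, chatgpt_headers)
    # 2. Segmind SDXL: one image per description, saved as 1.jpg, 2.jpg, ...
    generate_images(image_prompts, folder)
    # 3. ElevenLabs: one narration clip per script line, saved as 1.mp3, 2.mp3, ...
    for i, text in enumerate(texts):
        generate_and_save_audio(text, folder, str(i + 1), voice_id, elevenlabsapi)
    # 4. MoviePy: stitch image + audio pairs into a vertical 1080x1920 video
    create_combined_video_audio(folder, "combined_video.mp4")
    return folder + "/combined_video.mp4"  # captions are added afterwards with Whisper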

Step 2: Sign up for services and get the API Keys

To get started, you’ll need to sign up for three services: OpenAI, Segmind, and ElevenLabs.

Get OpenAI’s API Key

Go to platform.openai.com
Sign up or log in.
Click on the top right to go to “View API Keys”.
Create a new secret key if you don’t have one already and save it securely.
If necessary, add your card so you don’t run out of credits.

Get Segmind’s API Key

Go to Segmind.com and Log in/Sign up
Click on the top right to go to the console.
Once in the console, click on the “API keys” tab and “Create New API Key”.
You get a few free credits daily, but for uninterrupted usage you can go to “billing” and “add credits” by paying with your card.

If you want to know how much each Segmind API call costs, go to the corresponding model’s pricing tab. An example of SDXL pricing is shown here.

Get ElevenLabs’ API Key

Go to Elevenlabs.io and Log in/Sign up
Click on the top right and navigate to “Profile”. You can find your API key there.
If necessary, upgrade to a paid plan to get enough API credits to run this tutorial.

Step 3: The Code

The Google Colab notebook containing the full code can be found here.

Initialize OpenAI’s ChatGPT and enter your OpenAI API key when prompted!

from getpass import getpass
openaikey = getpass('Enter the openai key: ')

chatgpt_url = "https://api.openai.com/v1/chat/completions"
chatgpt_headers = {
    "content-type": "application/json",
    "Authorization": "Bearer {}".format(openaikey)}

Let’s define an important function, fetch_imagedescription_and_script. Given a prompt built from a topic (e.g. “Success and Achievement”) and a goal (e.g. “inspire people to overcome challenges, achieve success and celebrate their victories”), it generates a script for a short-form video along with descriptions of the accompanying visuals.

import requests
import json
from pprint import pprint

def fetch_imagedescription_and_script(prompt, url, headers):

    # Define the payload for the chat model
    messages = [
        {"role": "system", "content": "You are an expert short form video script writer for Instagram Reels and Youtube shorts."},
        {"role": "user", "content": prompt}
    ]

    chatgpt_payload = {
        "model": "gpt-3.5-turbo-16k",
        "messages": messages,
        "temperature": 1.3,
        "max_tokens": 2000,
        "top_p": 1,
        "stop": ["###"]
    }

    # Make the request to OpenAI's API
    response = requests.post(url, json=chatgpt_payload, headers=headers)
    response_json = response.json()

    # Extract data from the API's response
    output = json.loads(response_json['choices'][0]['message']['content'].strip())
    pprint(output)
    image_prompts = [k['image_description'] for k in output]
    texts = [k['text'] for k in output]

    return image_prompts, texts

I have defined 11 topic and goal pairs below that you can use to generate short-form video scripts. So you can create short-form videos in any of these genres, such as “Success and Achievement”, “Gratitude and Positivity”, “Mindfulness and Presence”, “Time Management and Productivity”, etc.

# Daily motivation, personal growth and positivity

topic = "Success and Achievement"
goal = "inspire people to overcome challenges, achieve success and celebrate their victories"

# topic = "Morning afformations"
# goal = "Encourage viewers to start their day with a positive mindset."

# topic = "Self-Care and Wellness"
# goal = "offer tips and reminders for self-care practices, stress reduction, and maintaining overall well-being"

# topic = "Gratitude and Positivity"
# goal = "emphasize gratitude and positive thinking"

# topic = "Boost Confidence"
# goal = "help build self-confidence and self-esteem"

# topic = "Happiness and Joy"
# goal = "help people find happiness in simple moments and enjoy life's journey"

# topic = "Resilience and Adversity"
# goal = "help build resilience in the face of adversity"

# topic = "Relationships and Connections"
# goal = "help build meaningful relationships, foster connections, and spread love"

# topic = "Mindfulness and Presence"
# goal = "encourage mindfulness and being present in the moment"

# topic = "Empowerment"
# goal = "empower viewers to take control of their lives, make positive choices, and pursue their dreams"

# topic = "Time Management and Productivity"
# goal = "provide tips about managing time effectively, staying organized, and being productive"

prompt_prefix = """You are tasked with creating a script for a {} video that is about 30 seconds.
Your goal is to {}.
Please follow these instructions to create an engaging and impactful video:
1. Begin by setting the scene and capturing the viewer's attention with a captivating visual.
2. Each scene cut should occur every 5-10 seconds, ensuring a smooth flow and transition throughout the video.
3. For each scene cut, provide a detailed description of the stock image being shown.
4. Along with each image description, include a corresponding text that complements and enhances the visual. The text should be concise and powerful.
5. Ensure that the sequence of images and text builds excitement and encourages viewers to take action.
6. Strictly output your response in a JSON list format, adhering to the following sample structure:""".format(topic,goal)

sample_output="""
[
{ "image_description": "Description of the first image here.", "text": "Text accompanying the first scene cut." },
{ "image_description": "Description of the second image here.", "text": "Text accompanying the second scene cut." },
...
]"""

prompt_postinstruction="""By following these instructions, you will create an impactful {} short-form video.
Output:""".format(topic)

prompt = prompt_prefix + sample_output + prompt_postinstruction

image_prompts, texts = fetch_imagedescription_and_script(prompt,chatgpt_url,chatgpt_headers)
print("image_prompts: ", image_prompts)
print("texts: ", texts)
print (len(texts))

The output from the above code will be something like this, a list of text and image_description pairs.

[{'image_description': 'A person standing at the edge of a cliff, arms wide '
'open, looking out into a mesmerizing sunset.',
'text': 'Embrace the challenges. They are stepping stones towards '
'greatness.'},
{'image_description': 'People helping each other climb a steep mountain, '
'holding hands and supporting each other.',
'text': 'Success is never a solitary journey. Surround yourself with a '
'network of support.'},
{'image_description': 'A student getting their exam results, smiling brightly '
'as they read their outstanding scores.',
'text': "Celebrate your wins, no matter how big or small. You've worked "
'hard, and you deserve it.'},
{'image_description': 'An individual proudly holding up a trophy as confetti '
'showers down around them.',
'text': 'Capture the joy of victory. Your hard work and determination have '
'paid off.'},
{'image_description': 'A light bulb illuminating a room with a bright idea '
'written on a chalkboard.',
'text': 'Leave your comfort zone. Innovation and success await on the other '
'side.'},
{'image_description': 'An entrepreneur surrounded by piles of money, '
'symbolizing financial success.',
'text': 'Achieve your financial goals. Take smart risks, and the rewards '
'will follow.'},
{'image_description': 'A group of diverse professionals in a boardroom, '
'brainstorming and exchanging ideas.',
'text': 'Success thrives in collaboration. Embrace different perspectives to '
'unlock your full potential.'},
{'image_description': 'An athlete crossing the finish line with arms raised '
'in triumph.',
'text': 'Persist. Keep pushing forward, and victory will be within your '
'grasp.'}]
image_prompts: ['A person standing at the edge of a cliff, arms wide open, looking out into a mesmerizing sunset.', 'People helping each other climb a steep mountain, holding hands and supporting each other.', 'A student getting their exam results, smiling brightly as they read their outstanding scores.', 'An individual proudly holding up a trophy as confetti showers down around them.', 'A light bulb illuminating a room with a bright idea written on a chalkboard.', 'An entrepreneur surrounded by piles of money, symbolizing financial success.', 'A group of diverse professionals in a boardroom, brainstorming and exchanging ideas.', 'An athlete crossing the finish line with arms raised in triumph.']
texts: ['Embrace the challenges. They are stepping stones towards greatness.', 'Success is never a solitary journey. Surround yourself with a network of support.', "Celebrate your wins, no matter how big or small. You've worked hard, and you deserve it.", 'Capture the joy of victory. Your hard work and determination have paid off.', 'Leave your comfort zone. Innovation and success await on the other side.', 'Achieve your financial goals. Take smart risks, and the rewards will follow.', 'Success thrives in collaboration. Embrace different perspectives to unlock your full potential.', 'Persist. Keep pushing forward, and victory will be within your grasp.']

Now that we have this, we can use the text value as the script and image_description as the corresponding visual that accompanies the text!
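
One practical note: ChatGPT occasionally returns a response that is not strictly valid JSON, in which case the json.loads call inside fetch_imagedescription_and_script will raise an exception. A simple retry wrapper (a sketch, not part of the original notebook) keeps the pipeline from crashing on such runs:

import json

def fetch_with_retries(prompt, url, headers, max_attempts=3):
    # Hypothetical helper: re-ask ChatGPT when its output can't be parsed as JSON
    for attempt in range(max_attempts):
        try:
            return fetch_imagedescription_and_script(prompt, url, headers)
        except (json.JSONDecodeError, KeyError, TypeError) as e:
            print(f"Attempt {attempt + 1} failed ({e}); retrying...")
    raise RuntimeError("Could not get valid JSON output from ChatGPT")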

Now we will use Segmind’s Stable Diffusion XL model to take image_prompts and generate relevant images.

segmind_apikey = getpass('Enter the Segmind API key: ')

import uuid

current_uuid = uuid.uuid4()
current_foldername = str(current_uuid)
print (current_foldername)

import os
import io
import requests
from PIL import Image
import random

def generate_images(prompts, fname):
    url = "https://api.segmind.com/v1/sdxl1.0-txt2img"

    headers = {'x-api-key': segmind_apikey}

    # Create a folder for the UUID if it doesn't exist
    if not os.path.exists(fname):
        os.makedirs(fname)

    num_images = len(prompts)

    currentseed = random.randint(1, 1000000)
    print("seed ", currentseed)

    for i, prompt in enumerate(prompts):

        final_prompt = "((perfect quality)), ((cinematic photo:1.3)), ((raw candid)), 4k, {}, no occlusion, Fujifilm XT3, highly detailed, bokeh, cinemascope".format(prompt.strip('.'))
        data = {
            "prompt": final_prompt,
            "negative_prompt": "((deformed)), ((limbs cut off)), ((quotes)), ((extra fingers)), ((deformed hands)), extra limbs, disfigured, blurry, bad anatomy, absent limbs, blurred, watermark, disproportionate, grainy, signature, cut off, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, worst face, three crus, extra crus, fused crus, worst feet, three feet, fused feet, fused thigh, three thigh, fused thigh, extra thigh, worst thigh, missing fingers, extra fingers, ugly fingers, long fingers, horn, extra eyes, amputation, disconnected limbs",
            "style": "hdr",
            "samples": 1,
            "scheduler": "UniPC",
            "num_inference_steps": 30,
            "guidance_scale": 8,
            "strength": 1,
            "seed": currentseed,
            "img_width": 1024,
            "img_height": 1024,
            "refiner": "yes",
            "base64": False
        }

        response = requests.post(url, json=data, headers=headers)

        if response.status_code == 200 and response.headers.get('content-type') == 'image/jpeg':
            image_data = response.content
            image = Image.open(io.BytesIO(image_data))

            image_filename = os.path.join(fname, f"{i + 1}.jpg")
            image.save(image_filename)

            print(f"Image {i + 1}/{num_images} saved as '{image_filename}'")
        else:
            print(response.text)
            print(f"Error: Failed to retrieve or save image {i + 1}")

generate_images(image_prompts, current_foldername)

We create a unique folder name for every run and generate all the necessary images under that folder. A sample of images generated and their corresponding descriptions are shown below:

Images generated from descriptions using Segmind’s API
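
If you want to quickly eyeball the results inside the Colab notebook itself, a small preview loop works (a sketch, assuming the numbered 1.jpg, 2.jpg, … files written by generate_images above):

from IPython.display import Image as IPyImage, display

# Show each generated image next to the prompt that produced it
for i, prompt in enumerate(image_prompts):
    print(f"{i + 1}. {prompt}")
    display(IPyImage(filename=os.path.join(current_foldername, f"{i + 1}.jpg"), width=300))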

Now let’s use the ElevenLabs text-to-speech API to convert the texts from the above step into speech.

elevenlabsapi = getpass('Enter the ElevenLabs API key: ')

import requests

def generate_and_save_audio(text, foldername, filename, voice_id, elevenlabs_apikey, model_id="eleven_multilingual_v2", stability=0.4, similarity_boost=0.80):
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {
        "Accept": "audio/mpeg",
        "Content-Type": "application/json",
        "xi-api-key": elevenlabs_apikey
    }

    data = {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost
        }
    }

    response = requests.post(url, json=data, headers=headers)

    if response.status_code != 200:
        print(response.text)
    else:
        file_path = f"{foldername}/{filename}.mp3"
        with open(file_path, 'wb') as f:
            f.write(response.content)


voice_id = "pNInz6obpgDQGcFmaJgB"
for i, text in enumerate(texts):
    output_filename = str(i + 1)
    print(output_filename)
    generate_and_save_audio(text, current_foldername, output_filename, voice_id, elevenlabsapi)

Stitch everything together with MoviePy.

from moviepy.editor import AudioFileClip, concatenate_audioclips, concatenate_videoclips, ImageClip
import os
import cv2
import numpy as np

def create_combined_video_audio(mp3_folder, output_filename, output_resolution=(1080, 1920), fps=24):
    mp3_files = sorted([file for file in os.listdir(mp3_folder) if file.endswith(".mp3")])
    mp3_files = sorted(mp3_files, key=lambda x: int(x.split('.')[0]))

    audio_clips = []
    video_clips = []

    for mp3_file in mp3_files:
        audio_clip = AudioFileClip(os.path.join(mp3_folder, mp3_file))
        audio_clips.append(audio_clip)

        # Load the corresponding image for each mp3 and set its duration to match the mp3's duration
        img_path = os.path.join(mp3_folder, f"{mp3_file.split('.')[0]}.jpg")
        image = cv2.imread(img_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB format

        # Resize the original image to 1080x1080
        image_resized = cv2.resize(image, (1080, 1080))

        # Blur the image and stretch it to fill the 1080x1920 frame
        blurred_img = cv2.GaussianBlur(image, (0, 0), 30)
        blurred_img = cv2.resize(blurred_img, output_resolution)

        # Overlay the original image on the blurred one, centered vertically
        y_offset = (output_resolution[1] - 1080) // 2
        blurred_img[y_offset:y_offset + 1080, :] = image_resized

        video_clip = ImageClip(np.array(blurred_img), duration=audio_clip.duration)
        video_clips.append(video_clip)

    final_audio = concatenate_audioclips(audio_clips)
    final_video = concatenate_videoclips(video_clips, method="compose")
    final_video = final_video.set_audio(final_audio)
    finalpath = mp3_folder + "/" + output_filename

    final_video.write_videofile(finalpath, fps=fps, codec='libx264', audio_codec="aac")

output_filename = "combined_video.mp4"
create_combined_video_audio(current_foldername, output_filename)

The above code stitches the audio of each text piece with the corresponding image and sequentially combines them to create a full video!

Since the image is 1080 x 1080 but the vertical video frame is 1080 x 1920, we also create a blurred background from the same image to fill the frame. A sample is shown below:

Image with blurred background overlay for vertical portrait mode

Now, to make this even more appealing, we can add word-by-word subtitles on top of the video as it plays, using MoviePy and OpenAI’s Whisper (to get word-level timestamps for the speech).

import ffmpeg

def extract_audio_from_video(outvideo):
    """
    Extract audio from a video file and save it as an MP3 file.

    :param outvideo: Path to the video file.
    :return: Path to the generated audio file.
    """

    audiofilename = outvideo.replace(".mp4", '.mp3')

    # Create the ffmpeg input stream
    input_stream = ffmpeg.input(outvideo)

    # Extract the audio stream from the input stream
    audio = input_stream.audio

    # Save the audio stream as an MP3 file
    output_stream = ffmpeg.output(audio, audiofilename)

    # Overwrite output file if it already exists
    output_stream = ffmpeg.overwrite_output(output_stream)

    ffmpeg.run(output_stream)

    return audiofilename


# Path to the combined video created in the previous step
output_video_file = current_foldername + "/" + output_filename

audiofilename = extract_audio_from_video(output_video_file)
print(audiofilename)

from faster_whisper import WhisperModel

model_size = "medium"
model = WhisperModel(model_size)

segments, info = model.transcribe(audiofilename, word_timestamps=True)
segments = list(segments) # The transcription will actually run here.
for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))

wordlevel_info = []

for segment in segments:
    for word in segment.words:
        wordlevel_info.append({'word': word.word, 'start': word.start, 'end': word.end})

from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip

# Load the video file
video = VideoFileClip(output_video_file)

# Function to generate text clips
def generate_text_clip(word, start, end, video):
    txt_clip = (TextClip(word, fontsize=80, color='white', font="Nimbus-Sans-Bold", stroke_width=3, stroke_color='black')
                .set_position('center')
                .set_duration(end - start))

    return txt_clip.set_start(start)

# Generate a list of text clips based on timestamps
clips = [generate_text_clip(item['word'], item['start'], item['end'], video) for item in wordlevel_info]

# Overlay the text clips on the video
final_video = CompositeVideoClip([video] + clips)

finalvideoname = current_foldername+"/"+"final.mp4"
# Write the result to a file
final_video.write_videofile(finalvideoname, codec="libx264",audio_codec="aac")

With this, we get word-level subtitles overlaid on the video and displayed as they are spoken!

Adding word-level subtitles to the video

Finally, we can wrap everything in a Gradio UI that generates a video automatically in one click!

topics = [
    "Success and Achievement",
    "Morning Affirmations",
    "Self-Care and Wellness",
    "Gratitude and Positivity",
    "Boost Confidence",
    "Happiness and Joy",
    "Resilience and Adversity",
    "Relationships and Connections",
    "Mindfulness and Presence",
    "Empowerment",
    "Time Management and Productivity"
]

topics_goals = {
    "Success and Achievement": "Inspire people to overcome challenges, achieve success, and celebrate their victories",
    "Morning Affirmations": "Encourage viewers to start their day with a positive mindset.",
    "Self-Care and Wellness": "Offer tips and reminders for self-care practices, stress reduction, and maintaining overall well-being",
    "Gratitude and Positivity": "Emphasize gratitude and positive thinking",
    "Boost Confidence": "Help build self-confidence and self-esteem",
    "Happiness and Joy": "Help people find happiness in simple moments and enjoy life's journey",
    "Resilience and Adversity": "Help build resilience in the face of adversity",
    "Relationships and Connections": "Help build meaningful relationships, foster connections, and spread love",
    "Mindfulness and Presence": "Encourage mindfulness and being present in the moment",
    "Empowerment": "Empower viewers to take control of their lives, make positive choices, and pursue their dreams",
    "Time Management and Productivity": "Provide tips about managing time effectively, staying organized, and being productive"
}
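
The create_video and add_captions callbacks wired up below are defined in the full Colab notebook. Conceptually, create_video builds the prompt for the selected genre (using topics_goals) and runs the generation pipeline from the earlier steps, returning the path of combined_video.mp4. add_captions could look roughly like the sketch below, which simply reuses the captioning steps above (the notebook’s exact signatures may differ):

def add_captions(videopath):
    # Hypothetical wrapper around the Whisper + MoviePy captioning steps shown earlier
    audiofile = extract_audio_from_video(videopath)
    segments, info = model.transcribe(audiofile, word_timestamps=True)
    words = [{'word': w.word, 'start': w.start, 'end': w.end}
             for segment in segments for w in segment.words]
    video = VideoFileClip(videopath)
    clips = [generate_text_clip(w['word'], w['start'], w['end'], video) for w in words]
    outpath = videopath.replace(".mp4", "_captioned.mp4")
    CompositeVideoClip([video] + clips).write_videofile(outpath, codec="libx264", audio_codec="aac")
    return outpath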

import gradio as gr


with gr.Blocks() as demo:
    gr.Markdown("# Generate shortform videos for Youtube Shorts or Instagram Reels")
    genre = gr.Dropdown(topics, value="Success and Achievement")
    btn_create_video = gr.Button('Generate Video')
    with gr.Row():
        with gr.Column():
            video = gr.Video(format='mp4', height=720, width=405)
        with gr.Column():
            btn_add_captions = gr.Button('Add Captions')
        with gr.Column():
            finalvideo = gr.Video(format='mp4', height=720, width=405)
    # create_video and add_captions are wrapper functions (defined in the full Colab notebook)
    # that run the generation and captioning steps shown earlier in this tutorial.
    btn_create_video.click(fn=create_video, inputs=[genre], outputs=[video])
    btn_add_captions.click(fn=add_captions, inputs=[video], outputs=[finalvideo])

demo.launch(debug=True, enable_queue=True)
Gradio Final UI

A sample video created with the above code can be seen here at timestamp 17 seconds: https://youtu.be/3XlbswJm7Yg?t=17

In this tutorial, we’ve explored the exciting world of AI-powered short-form video creation. By following these steps, you can quickly generate captivating short-form videos for platforms like YouTube Shorts and Instagram Reels.

Happy AI exploration! If you loved the content, feel free to follow me on Twitter for daily AI content!
