We Removed 335 LOC with one Go package

TLDR: Our new service (Podio) puts a Fluent API on top of audio management with features we couldn't find in FFmpeg.

At Prayershub, we were facing significant challenges with FFmpeg. The process of optimizing it for efficiency, along with managing complex audio pipelines for various audio processing tasks, was becoming increasingly tedious and hard to maintain.

As the codebase grew, the overhead of managing these processes was taking a toll on our productivity.

To solve this, we decided to abstract the difficult parts into a separate platform — Podio (as a service under Poxate).

The problem

Prayershub is a worship service helper that lets users create worship (practically, playlists) with songs, bible verses, and prayers. As the complexity grew, so did the need for efficient audio processing—looping tracks, applying fades, and managing background music.

However, we were running into walls with FFmpeg. Here's an illustration:

Preview of our compilation setup

We'll have repeating background music, with 2 seconds before and after the speech/voice.

Click here to view a sample of what that sounds like.

That's not the end of it, however. Sometimes, there will be multiple "Long speeches", such as "a reading of a bible chapter" followed by a prayer.


So we ended up with one concat operation and one stream_loop, with the results of both piped into amerge. The result looked like this (much has been stripped from this code for the sake of clarity; don't do most of this in production):

func writeVoiceFragment(ctx context.Context, resultFilePath string, group []models.WorshipContent, backgroundMusic string) error {
	tmpSilentFilename := fmt.Sprintf("silent-%s.mp3", uuid.NewString())
	{
		// Step 1: Download 1-second-of-silence file
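		resp, err := http.Get("https://s3.us-east-2.amazonaws.com/cdn.prayershub.com/assets/1-second-of-silence.mp3")
		if err != nil {
			return fmt.Errorf("failed to download silent file: %w", err)
		}
		// Close the body so the underlying connection is released.
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("failed to download silent file: unexpected status %s", resp.Status)
		}
		contents, err := io.ReadAll(resp.Body)
		if err != nil {
			return fmt.Errorf("failed to read silent file: %w", err)
		}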
		if err := os.WriteFile("tmp/"+tmpSilentFilename, contents, 0777); err != nil {
			return fmt.Errorf("os.WriteFile failed for silent file: %w", err)
		}
		defer os.Remove("tmp/" + tmpSilentFilename)
	}

	filenames := make([]string, len(group))
	{
		// Step 2: Reencode soundtracks
		g, gctx := errgroup.WithContext(ctx)
		for i, content := range group {
			i := i
			content := content
			g.Go(func() error {
				tmpFilename := fmt.Sprintf("voice-%s.mp3", uuid.NewString())
				remoteUrl := models.SafelyGetSoundtrack(content.Type, content.Data, map[int]models.Song{})
				errBuf := bytes.NewBuffer(nil)
				// Use the group's context so one failed encode cancels the others.
				cmd := exec.CommandContext(gctx, "ffmpeg",
					"-i", remoteUrl, "-ar", "44100", "-ac", "2", "-b:a", "192k", "tmp/"+tmpFilename,
				)
				cmd.Stderr = errBuf
				if err := cmd.Run(); err != nil {
					return fmt.Errorf("failed to reencode: %w || %s", err, errBuf.String())
				}
				filenames[i] = tmpFilename
				return nil
			})
		}

		if err := g.Wait(); err != nil {
			return fmt.Errorf("first g.Wait failed: %w", err)
		}

		for _, filename := range filenames {
			defer os.Remove("tmp/" + filename)
		}
	}

	tmpConcatTextFilepath := fmt.Sprintf("tmp/concat-%s.txt", uuid.NewString())
	{
		// Step 3: Create concat file
		concatText := strings.Repeat(fmt.Sprintf("file '%s'\n", tmpSilentFilename), 2)
		for _, filename := range filenames {
			// Inside a single-quoted concat entry, a literal ' must be written as '\''.
			concatText += "file '" + strings.ReplaceAll(filename, "'", "'\\''") + "'\n"
			concatText += strings.Repeat(fmt.Sprintf("file '%s'\n", tmpSilentFilename), 2)
		}
		if err := os.WriteFile(tmpConcatTextFilepath, []byte(concatText), 0777); err != nil {
			return fmt.Errorf("os.WriteFile failed to create tmpConcatTextFilepath: %w", err)
		}
		defer os.Remove(tmpConcatTextFilepath)
	}

	concattedBuf := bytes.NewBuffer(nil)
	{
		// Step 4: Concat demuxer
		errBuf := bytes.NewBuffer(nil)
		cmd := exec.CommandContext(ctx, "ffmpeg", "-f", "concat", "-i", tmpConcatTextFilepath, "-c", "copy", "-f", "mp3", "pipe:")
		cmd.Stdout = concattedBuf
		cmd.Stderr = errBuf
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("failed to concat: %w || %s", err, errBuf.String())
		}
	}

	{
		// Step 5: Overlay background audio
		errBuf := bytes.NewBuffer(nil)
		cmd := exec.CommandContext(ctx, "ffmpeg",
			"-stream_loop", "-1", "-i", backgroundMusic,
			"-i", "pipe:",
			"-shortest", "-filter_complex", "[0:a]volume=0.25[a0];[a0][1:a]amerge=inputs=2[out]",
			"-ar", "44100", "-ac", "2", "-b:a", "192k", "-map", "[out]", resultFilePath)
		cmd.Stdin = concattedBuf
		cmd.Stderr = errBuf
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("failed to overlay background music: %w || %s", err, errBuf.String())
		}
	}

	return nil
}

You'll notice a few problems immediately:

  1. Each step is done in blocking fashion; parallelizing would require even more work.
  2. There are many intermediate steps.
  3. The result is held in an in-memory buffer (this is not the fault of Go or FFmpeg, but a decision made to get around restrictions that will not be discussed in this article).
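To give a sense of how much care even one intermediate step needs, here is a minimal sketch of building the concat list in isolation. The helper names `quoteConcatPath` and `buildConcatList` are hypothetical. The quoting follows FFmpeg's rule that a literal single quote inside a single-quoted string is written as `'\''`.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteConcatPath wraps a path in single quotes for an ffmpeg concat list,
// writing each embedded single quote as '\'' (close quote, escaped quote, reopen).
func quoteConcatPath(p string) string {
	return "'" + strings.ReplaceAll(p, "'", `'\''`) + "'"
}

// buildConcatList writes `gap` copies of the silence file before the first
// voice and after every voice, producing the text of a concat list file.
func buildConcatList(silence string, voices []string, gap int) string {
	var b strings.Builder
	writeSilence := func() {
		for i := 0; i < gap; i++ {
			fmt.Fprintf(&b, "file %s\n", quoteConcatPath(silence))
		}
	}
	writeSilence()
	for _, v := range voices {
		fmt.Fprintf(&b, "file %s\n", quoteConcatPath(v))
		writeSilence()
	}
	return b.String()
}

func main() {
	// Two copies of a 1-second silence clip give the 2-second gaps.
	fmt.Print(buildConcatList("silent.mp3", []string{"psalm 23.mp3", "pastor's prayer.mp3"}, 2))
}
```

Even this small piece has to know about file naming, quoting rules, and gap lengths, and it is only one of five steps.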

There is one feature still missing in this implementation: fading. With FFmpeg, fading in the first 2 seconds is simple enough.

Fading out proved to be more complicated, as FFmpeg requires a start time for the fade-out, which we don't know in advance. There's a whole Stack Overflow post about this.

Workaround #1:

  1. Compile the audio
  2. Fade the first 2 seconds
  3. Reverse the entire audio with areverse
  4. Fade the first 2 seconds
  5. Reverse again with areverse

This was obviously an inefficient workaround, and proved infeasible in the time constraints of our application.

Workaround #2:

  1. Compile the audio
  2. Find duration through ffprobe
  3. Subtract 2 seconds from the duration, then fade out from that point with ffmpeg

Although much better than workaround #1, it still violated the time constraints of our application. In the end, we decided to drop fading entirely, which significantly degraded the user experience.

The solution

What if there was an easy way to work with audio in Golang, without the hassle of dealing with command-line tools, FFmpeg, or struggling with clunky filters and missing features?

We wanted something a bit more native to Go than Fluent FFmpeg, with the added benefit of offloading work to a distributed but transparent edge network to avoid overloading our server.

That's why we built Podio – a simple, powerful audio processing library for Go. Podio abstracts the complexity of audio transformations and offloads the heavy lifting to the cloud, allowing you to easily concatenate, loop, fade, and manipulate audio. With an intuitive API and no server management required, you can focus on building your app while Podio handles the audio processing.

The above code rewritten with Podio:

func writeVoiceFragment(ctx context.Context, filepath string, group []models.WorshipContent, backgroundMusic string) error {
	voices := []podio.AudioBuilder{}
	for _, content := range group {
		remoteUrl := models.GetSoundtrack(content.Type, content.Data)
		voices = append(voices, podio.Remote(remoteUrl).PadLeft(2*time.Second))
	}

	ab := podio.Concat(voices...).PadRight(2 * time.Second).WithBackground(
		podio.Remote(backgroundMusic).Volume(0.25),
	)

	// os.Create opens the file for writing; os.Open would be read-only.
	outFile, err := os.Create(filepath)
	if err != nil {
		return err
	}
	defer outFile.Close()

	if err := podio.NewClient(os.Getenv("PODIO_API_KEY")).Compile(ctx, podio.MP3, ab, outFile); err != nil {
		return err
	}

	return nil
}

Carrying this across our entire codebase deleted about 335 lines of code, while improving our audio compilation speed by more than 2.4x.

That's not all: we get additional features that would've been a headache with FFmpeg. Let's say, for example, we wanted to get the duration of the final result.

With Podio, it's simple: pass a *time.Duration to SaveDuration, and Podio will update it upon compiling.

func writeVoiceFragment(ctx context.Context, filepath string, group []models.WorshipContent, backgroundMusic string) error {
	...

	var finalDuration time.Duration

	ab := podio.Concat(voices...).PadRight(2 * time.Second).WithBackground(
		podio.Remote(backgroundMusic).Volume(0.25),
	).SaveDuration(&finalDuration)

	...

	if err := podio.NewClient(os.Getenv("PODIO_API_KEY")).Compile(ctx, podio.MP3, ab, outFile); err != nil {
		return err
	}

	fmt.Println("The final duration is:", finalDuration)

	return nil
}

What about fading in/out? Just as simple:

ab := podio.
	Concat(voices...).PadRight(2 * time.Second).
	WithBackground(podio.Remote(backgroundMusic).Volume(0.25)).
	FadeIn(2 * time.Second).
	FadeOut(2 * time.Second).
	SaveDuration(&finalDuration)

What if, for some reason, we wanted the duration of each voice as well?

func writeVoiceFragment(ctx context.Context, filepath string, group []models.WorshipContent, backgroundMusic string) error {
	voiceDurations := make([]time.Duration, len(group))
	voices := []podio.AudioBuilder{}

	for i, content := range group {
		remoteUrl := models.GetSoundtrack(content.Type, content.Data)
		voices = append(voices,
			podio.
				Remote(remoteUrl).
				SaveDuration(&voiceDurations[i]).
				PadLeft(2*time.Second),
		)
	}

	var finalDuration time.Duration

	ab := podio.
		Concat(voices...).PadRight(2 * time.Second).
		WithBackground(podio.Remote(backgroundMusic).Volume(0.25)).
		FadeIn(2 * time.Second).
		FadeOut(2 * time.Second).
		SaveDuration(&finalDuration)

	// os.Create opens the file for writing; os.Open would be read-only.
	outFile, err := os.Create(filepath)
	if err != nil {
		return err
	}
	defer outFile.Close()

	if err := podio.NewClient(os.Getenv("PODIO_API_KEY")).Compile(ctx, podio.MP3, ab, outFile); err != nil {
		return err
	}

	fmt.Println("The final duration is:", finalDuration)

	return nil
}

Conclusion

Switching to Podio has significantly streamlined our audio processing at Prayershub. What once took complex FFmpeg pipelines and 335 lines of code is now simplified into just a few method calls using Podio’s intuitive API.

By offloading heavy lifting to a distributed network, we avoided server strain while scaling effortlessly. This not only saved us time but also improved performance and the user experience.

If you're tired of dealing with FFmpeg complexities and want a more efficient, Go-native(ish) solution, Podio is the answer.