The stock GPT model, because it uses dense attention, which works best at sequence lengths in the hundreds or low thousands, isn't suitable for any kind of raw audio, which involves extremely long sequences: millions of samples at sub-millisecond resolution. (A WAV file may be scores of megabytes; even a highly optimized lossy encoding like MP3 or Vorbis is still megabytes for most music.) If you tried, it would fail, because 1,024 or 2,048 tokens would cover only a few dozen milliseconds of raw audio at best, and it's impossible to predict meaningfully based on a few dozen milliseconds; most sounds, phonemes, or musical notes are far longer than that! You can use it for very high-level encodings like ABC notation or, if you brute-force it a bit, you can generate MIDI via ABC; see https://www.gwern.net/GPT-2-music . This will let you generate folk or instrumental-style music with a few instruments in a simple style. (Note the hack that iGPT resorts to, with pixel encoding, to make even tiny 64px images workable with enormous compute: a 64×64 RGB image is a 'sequence' of length 64×64×3 = 12,288, which is well into painful territory for dense GPT.)
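To put concrete numbers on the length mismatch, here's some back-of-the-envelope arithmetic in Python (the 44.1kHz sampling rate is just standard CD quality, nothing GPT-specific):

```python
# Rough sequence-length arithmetic for a dense-attention GPT vs. raw audio.

CONTEXT = 2048            # dense GPT context window (tokens)
SAMPLE_RATE = 44_100      # CD-quality audio, samples per second per channel

# How much raw audio fits in one GPT context window?
seconds_per_window = CONTEXT / SAMPLE_RATE
print(f"{seconds_per_window * 1000:.0f} ms of mono audio per {CONTEXT}-token window")
# -> ~46 ms: far shorter than most phonemes or musical notes.

# How long is a 3-minute song as a raw sample sequence?
song_samples = 3 * 60 * SAMPLE_RATE
print(f"{song_samples:,} samples for a 3-minute mono song")
# -> ~7,938,000: millions of tokens, hopeless for dense attention.

# For comparison, iGPT's tiny 64x64 RGB images are already painful:
igpt_len = 64 * 64 * 3
print(f"{igpt_len:,} tokens for a 64x64 RGB image")
# -> 12,288
```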
MuseNet goes one level below ABC by operating on a MIDI encoding of music. This requires shifting from dense attention to a more scalable attention mechanism, in its case Sparse Transformers, which can handle sequence lengths in the tens of thousands with acceptable compute requirements & quality. MuseNet was better but still fairly limited: not raw audio, only a few instruments, and definitely no voices.
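For intuition on why that buys you so much, here's a minimal NumPy sketch of the 'strided' attention pattern from the Sparse Transformers paper; it's a simplified illustration, not MuseNet's exact configuration:

```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean (n, n) mask: True where query i may attend to key j.

    Each position attends only to (1) the previous `stride` positions and
    (2) every stride-th position before it (the 'strided' pattern).
    """
    i = np.arange(n)[:, None]          # query index
    j = np.arange(n)[None, :]          # key index
    d = i - j                          # how far back the key is
    causal = d >= 0                    # no attending to the future
    local = d < stride                 # recent neighbourhood
    strided = d % stride == 0          # periodic long-range links
    return causal & (local | strided)

n, stride = 4096, 64                   # stride ~ sqrt(n)
mask = strided_sparse_mask(n, stride)
dense = n * (n + 1) // 2               # entries in a full causal attention mask
print(f"dense: {dense:,} entries, sparse: {int(mask.sum()):,} entries")
# The sparse mask keeps only a few percent of the positions, which is what
# makes sequence lengths in the tens of thousands computationally tractable.
```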
Jukebox operates at the raw audio level, and it does this by using much larger models (<10b parameters), conditioned on lyrics/artist metadata (from n~1m songs, IIRC), and a hybrid architecture: not just Sparse Transformers, but also VQ-VAE-style codebooks providing discrete embeddings of the music for more global consistency than a pure autoregressive token-by-token approach like GPT/MuseNet. Jukebox is extremely impressive: it generates raw audio, for most genres of music, in the style of specific artists, and it even learns to synthesize singing voices (!). It doesn't quite have the global coherency that GPT or MuseNet samples can achieve, like choruses, because I think its attention window is still de facto limited to something like 20 seconds, which limits learning & long-range coherency; but I think fixing that is just a matter of adding another layer to the hierarchy and maybe another order of magnitude of parameters, and that would close much of the remaining quality gap.
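The codebook part is easier to see in code than in prose. A toy NumPy sketch of the VQ-VAE-style quantization step (the real model learns the codebook jointly and stacks several levels of it):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a "codebook" of K vectors, and encoder outputs for T audio frames.
K, D, T = 512, 64, 8                      # codebook size, embedding dim, frames
codebook = rng.normal(size=(K, D))        # in the real model these are learned
encoded  = rng.normal(size=(T, D))        # continuous encoder output per frame

# Quantize: replace each frame's vector with its nearest codebook entry.
dists = ((encoded[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
codes = dists.argmin(axis=1)              # discrete tokens, one int per frame
quantized = codebook[codes]               # what the decoder actually sees

print(codes)   # e.g. [137 402 ...]: a short discrete sequence standing in for
               # a long stretch of raw audio; the prior (a Sparse Transformer)
               # is then trained autoregressively on these codes.
```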
Jukebox suggests that if you created a large enough model, you could probably dispense with the VAE part and just use pure Transformers.
Since the same transformer architecture works on images with basically no modification, I suspect it would do well on audio prediction too. Finding a really broad, representative dataset for speech might be difficult, but I guess audiobooks are a good start. The context window might cause problems, because the content of 2,000 byte-pair tokens of text takes up a lot more than 4,000 bytes in audio form (rough numbers below). But I bet it would be able to mimic voices pretty well even with a small window. (edit: Actually probably not, see Gwern's answer.)
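Roughly (plain Python; the speaking rate and tokens-per-word ratio are just ballpark assumptions):

```python
# How much spoken content fits in a 2,048-token window, text vs. raw audio?

TOKENS = 2048
WORDS_PER_TOKEN = 0.75        # ballpark: ~0.75 English words per BPE token
WORDS_PER_MINUTE = 150        # typical speaking rate

minutes_of_speech = TOKENS * WORDS_PER_TOKEN / WORDS_PER_MINUTE
print(f"~{minutes_of_speech:.0f} minutes of speech as text tokens")   # ~10 min

# The same ten minutes as raw 16 kHz, 16-bit mono audio:
audio_bytes = minutes_of_speech * 60 * 16_000 * 2
print(f"~{audio_bytes / 1e6:.0f} MB of raw audio")                    # ~20 MB

# So a window that covers minutes of dialogue as text covers well under a
# second of it as raw samples -- hence the pessimism about a small window.
```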
If your question is whether the trained GPT-3 model could be modified to work with audio, I suspect not. In principle there are layers of abstraction that a transformer should be able to take advantage of, so that word prediction is mostly uncoupled from audio processing, but there's not a perfect separation, and we wouldn't know how to interface them. Maybe you could train a separate transformer model that just transcribes audio into text, and stitch them together that way, but there's not much reason to think it would be a big improvement over existing speech recognition systems.
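Just to be concrete about what "stitching them together" would mean; every function below is a hypothetical placeholder (an ASR front-end, the text-only GPT-3, a TTS back-end), not a real API:

```python
# Hypothetical pipeline: audio -> text -> GPT-3 -> text -> audio.
# None of these functions exist as written; they stand in for separate,
# independently trained systems glued together at the text level.

def speech_to_text(audio: bytes) -> str:
    """Placeholder for a separately trained ASR transformer."""
    ...

def gpt3_complete(prompt: str) -> str:
    """Placeholder for a call to the (text-only) GPT-3 model."""
    ...

def text_to_speech(text: str) -> bytes:
    """Placeholder for an off-the-shelf TTS system."""
    ...

def respond_to_audio(audio: bytes) -> bytes:
    # The interface between the models is plain text, which is exactly why
    # this buys little over existing speech-recognition + language-model
    # pipelines: no acoustic information (tone, timing, speaker identity)
    # ever reaches GPT-3.
    transcript = speech_to_text(audio)
    reply = gpt3_complete(transcript)
    return text_to_speech(reply)
```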
GPT-3 currently only works on text. If OpenAI wanted to make it work with similar performance for audio, how much work would that likely be?