A couple weeks ago I wrote a short sequencing intro. Here's a bit more, starting with a minor puzzle. I'm working with California wastewater sequencing data (Rothman et al. 2021) and I found a read that was a partial match for HIV:
The 83 highlighted bases at the beginning of the read are an exact match for this section near the beginning of the HIV genome:
This dataset was sequenced with paired-end reads, which means there's more information we can get on this particular genetic snippet. This was the 'reverse' read, so let's look at the corresponding forward read:
Paired-end reads are sequenced in opposite directions, working towards each other:
Because they're reading in different directions you need to reverse one of the reads, and since they're reading complementary strands you need to take the genetic complement. Here's the reverse complement of the forward read, to match the reverse read we were already looking at:
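As a minimal sketch of that reverse-complement step, here's how it might look in Python. The function name and the example sequence are made up for illustration; this is not a read from the dataset.

```python
# Map each base to its complement; N (an unknown base) stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order of the result."""
    return seq.translate(COMPLEMENT)[::-1]

# Made-up example: the complement of AACGT is TTGCA, which reversed is ACGTT.
print(reverse_complement("AACGT"))  # ACGTT
```

Applying it twice gets you back the sequence you started with, which is a handy sanity check.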
Most commonly your paired-end reads go together like:
[read 1] [gap you didn't sequence] [read 2]
In this case, however, they overlap, allowing us to assemble a larger sequence. Sometimes you might have a read error, where the two don't perfectly match, but we're lucky here and there's no disagreement:
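Here's a rough sketch of how you might assemble two overlapping reads once they're in the same orientation. It assumes the overlap is an exact match (no read errors), and the names `find_overlap`, `merge_reads`, and `min_overlap` are mine, as are the toy sequences. Real read mergers tolerate mismatches and use the quality scores, which this does not.

```python
from typing import Optional

def find_overlap(left: str, right: str, min_overlap: int = 5) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    Assumes min_overlap >= 1.
    """
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0

def merge_reads(left: str, right: str, min_overlap: int = 5) -> Optional[str]:
    """Join two reads on their exact overlap, or return None if there is none."""
    n = find_overlap(left, right, min_overlap)
    if n == 0:
        return None
    # Keep all of `left`, then only the part of `right` past the overlap.
    return left + right[n:]

# Toy sequences: the last 8 bases of `left` are the first 8 bases of `right`.
left = "GGCATTACGTAGC"
right = "TACGTAGCTTAAG"
print(find_overlap(left, right))  # 8
print(merge_reads(left, right))   # GGCATTACGTAGCTTAAG
```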
Now here's the puzzle: the overlapping portion of the two happens to be exactly the sequence that matches HIV. This isn't something you'd expect to see by chance, right? How much overlap (or distance) you get between two sequences should be unpredictable. So, why is this happening?
In the sequencing process, your input DNA fragment gets more bits of DNA ("adapters") stuck on its ends, to allow the sequencer to manipulate it. At the beginning (5' end) of the target sequence this works well: sequencing uses the adapter to determine where to start the read, which then will nearly always start with the first base of your original fragment. If your initial fragment is shorter than the read length, however, the read will run past the end of the original sequence and into the adapter. Illumina has some documentation with figures explaining the process.
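To make the read-through concrete, here's a toy simulation. Everything in it is an assumption for illustration: the adapter sequences are made-up placeholders (not real Illumina adapters), the fragment is invented, and `simulate_pair` is a hypothetical helper that ignores read errors entirely.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def simulate_pair(fragment: str, adapter_fwd: str, adapter_rev: str, read_len: int):
    """Simulate an idealized read pair from one fragment.

    The forward read starts at the first base of the fragment; the reverse
    read starts at the other end, on the complementary strand. If the fragment
    is shorter than `read_len`, each read continues into the adapter that
    follows the fragment on its strand.
    """
    forward = (fragment + adapter_fwd)[:read_len]
    reverse = (reverse_complement(fragment) + adapter_rev)[:read_len]
    return forward, reverse

# Placeholder adapters and a 9-base fragment, read with 12-base reads.
fragment = "GATTACAGG"
forward, reverse = simulate_pair(fragment, "CTCTCTCT", "AGAGAGAG", read_len=12)
print(forward)  # GATTACAGGCTC  (fragment, then 3 bases of adapter)
print(reverse)  # CCTGTAATCAGA  (reverse complement of fragment, then adapter)
```

In this toy case the only thing the two reads have in common is the fragment itself, so the overlap between them is exactly the real input sequence and everything outside the overlap is adapter.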
Here are the original reads again with the portion immediately following the HIV match highlighted:
For the kit used in this paper, this sequence is the start of the adapter, and nothing from that point on is part of our input fragment.
With the adapters removed, we're left with just:
This is now an exact match for HIV, without any junk at the end. Most quality control pipelines contain a step where you remove adapters, just like they remove the poly-G sequences I described last time.
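Here's a minimal sketch of what an adapter-removal step does, assuming you already know the adapter sequence for your kit. The `trim_adapter` function, its `min_match` parameter, and the placeholder adapter are all hypothetical, and it only handles exact matches. Dedicated tools such as cutadapt or fastp also tolerate sequencing errors, so this is for intuition rather than real use.

```python
def trim_adapter(read: str, adapter: str, min_match: int = 3) -> str:
    """Cut `read` at the point where `adapter` begins (exact matches only).

    Handles two cases: the full adapter appears somewhere in the read, or the
    read ends partway through the adapter. `min_match` (assumed >= 1) is the
    shortest partial adapter at the end of the read that we are willing to
    trim; very short matches are likely to be coincidence.
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Look for the longest adapter prefix that the read ends with.
    for n in range(min(len(adapter) - 1, len(read)), min_match - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read

# Placeholder adapter; the read ends three bases into it.
print(trim_adapter("CCTGTAATCAGA", "AGAGAGAG"))    # CCTGTAATC
# Full adapter in the middle of the read: drop it and everything after.
print(trim_adapter("ACGTAGAGAGAGTT", "AGAGAGAG"))  # ACGT
```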