I'm seeing some exceptionally strange behavior, that I find somewhat concerning. I'm afraid that to explain the situation well I'll need to write a fair bit, but this should give you all the information I have.
I have a DataFrame clinvar_confident_pathogenic_snvs
which I process with two function, the first (get_exon
) is impure but only relies on a constant (it looks up a match) and the second is a pure function (rotate_attributes
) which merges the .attributes
Dicts of some data into a DataFrame.
I apply these functions with Transducers.jl,
genes_from_exons = clinvar_confident_pathogenic_snvs |> eachrow |> withprogress |>
Map(get_exon) |> Map(rotate_attributes) |> collect
This is all very well and good, but it gave me a strange result:
julia> genes_from_exons[79]
2×15 DataFrame
Row │ ID Parent exon_id exon_number gene_id gene_name gene_type havana_gene havana_transcript ⋯
│ String String String String String String String String String ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ exon:ENST00000400490.2:1 ENST00000400490.2 ENSE00001946236.1 1 ENSG00000158828.7 PINK1 protein_coding OTTHUMG00000002841.1 OTTHUMT00000007956. ⋯
2 │ exon:ENST00000451424.1:2 ENST00000451424.1 ENSE00001711022.1 2 ENSG00000117242.7 PINK1-AS protein_coding OTTHUMG00000002840.1 OTTHUMT00000007953.
7 columns omitted
PINK1-AS
is an antisense gene, not a protein coding gene.
We can verify that PINK1-AS
is reported as such by the data file I've processed,
❯ cat gencode.v26.annotation.gff3 | grep 'exon:ENST00000451424.1:2'
chr1 HAVANA exon 20645585 20645686 . - . ID=exon:ENST00000451424.1:2;Parent=ENST00000451424.1;gene_id=ENSG00000117242.7;transcript_id=ENST00000451424.1;gene_type=antisense;gene_name=PINK1-AS;transcript_type=antisense;transcript_name=PINK1-AS-001;exon_number=2;exon_id=ENSE00001711022.1;level=2;transcript_support_level=2;tag=basic;havana_gene=OTTHUMG00000002840.1;havana_transcript=OTTHUMT00000007953.1
Note "gene_type=antisense".
Investigating this correctness error, I looked at the result if I stop at Map(get_exon)
, and look at the exon in question.
julia> (clinvar_confident_pathogenic_snvs |> eachrow |> withprogress |>
Map(get_exon) |> collect)[79][3, :attributes]
Dict{String, Vector{String}} with 15 entries:
"havana_gene" => ["OTTHUMG00000002840.1"]
"exon_id" => ["ENSE00001711022.1"]
"exon_number" => ["2"]
"havana_transcript" => ["OTTHUMT00000007953.1"]
"transcript_support_level" => ["2"]
"gene_id" => ["ENSG00000117242.7"]
"transcript_id" => ["ENST00000451424.1"]
"Parent" => ["ENST00000451424.1"]
"gene_type" => ["antisense"]
"level" => ["2"]
"gene_name" => ["PINK1-AS"]
"transcript_type" => ["antisense"]
"tag" => ["basic"]
"transcript_name" => ["PINK1-AS-001"]
"ID" => ["exon:ENST00000451424.1:2"]
This correctly reports exon:ENST00000451424.1:2
as being from an antisense gene. No "protein_coding" in sight. Second guessing my sanity, I tried adding the Map(rotate_attributes)
step back in:
julia> (clinvar_confident_pathogenic_snvs |> eachrow |> withprogress |>
Map(get_exon) |> Map(rotate_attributes) |> collect)[79]
2×15 DataFrame
Row │ ID Parent exon_id exon_number gene_id gene_name gene_type havana_gene havana_transcript ⋯
│ String String String String String String String String String ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ exon:ENST00000400490.2:1 ENST00000400490.2 ENSE00001946236.1 1 ENSG00000158828.7 PINK1 protein_coding OTTHUMG00000002841.1 OTTHUMT00000007956. ⋯
2 │ exon:ENST00000451424.1:2 ENST00000451424.1 ENSE00001711022.1 2 ENSG00000117242.7 PINK1-AS antisense OTTHUMG00000002840.1 OTTHUMT00000007953.
7 columns omitted
... and it's antisense now.
This is now the exact same expression I used to set genes_from_exons
, but the result has changed!?! Checking my REPL history, I didn't do anything to genes_from_exons
other than inspect some of the values. I can't imagine any issue with get_exon
, as if that behaved any differently we'd see different exons entirely. Perhaps there's something fishy about rotate_attributes
, but looking at my definition again I can't see any hints of non-determinism.
function rotate_attributes(ginfo)
if !isempty(ginfo)
filter((k, v)::Pair -> length(v) == length(ginfo.attributes),
mergewith(vcat, ginfo.attributes...)) |> DataFrame |>
df -> unique(df, :gene_id)
else DataFrame() end
end
So, what on earth could be going on here??
Last updated: Oct 02 2023 at 04:34 UTC