Stream: helpdesk (published)

Topic: Calling pure function twice produces different results


view this post on Zulip Timothy (Jun 10 2022 at 03:21):

I'm seeing some exceptionally strange behavior, that I find somewhat concerning. I'm afraid that to explain the situation well I'll need to write a fair bit, but this should give you all the information I have.

I have a DataFrame clinvar_confident_pathogenic_snvs which I process with two function, the first (get_exon) is impure but only relies on a constant (it looks up a match) and the second is a pure function (rotate_attributes) which merges the .attributes Dicts of some data into a DataFrame.

I apply these functions with Transducers.jl,

genes_from_exons = clinvar_confident_pathogenic_snvs |> eachrow |> withprogress |>
    Map(get_exon) |> Map(rotate_attributes) |> collect

This is all very well and good, but it gave me a strange result:

julia> genes_from_exons[79]
2×15 DataFrame
 Row │ ID                        Parent             exon_id            exon_number  gene_id            gene_name  gene_type       havana_gene           havana_transcript   ⋯
     │ String                    String             String             String       String             String     String          String                String              ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ exon:ENST00000400490.2:1  ENST00000400490.2  ENSE00001946236.1  1            ENSG00000158828.7  PINK1      protein_coding  OTTHUMG00000002841.1  OTTHUMT00000007956. ⋯
   2 │ exon:ENST00000451424.1:2  ENST00000451424.1  ENSE00001711022.1  2            ENSG00000117242.7  PINK1-AS   protein_coding  OTTHUMG00000002840.1  OTTHUMT00000007953.
                                                                                                                                                            7 columns omitted

PINK1-AS is an antisense gene, not a protein coding gene.

We can verify that PINK1-AS is reported as such by the data file I've processed,

❯ cat gencode.v26.annotation.gff3 | grep 'exon:ENST00000451424.1:2'
chr1    HAVANA  exon    20645585    20645686    .   -   .   ID=exon:ENST00000451424.1:2;Parent=ENST00000451424.1;gene_id=ENSG00000117242.7;transcript_id=ENST00000451424.1;gene_type=antisense;gene_name=PINK1-AS;transcript_type=antisense;transcript_name=PINK1-AS-001;exon_number=2;exon_id=ENSE00001711022.1;level=2;transcript_support_level=2;tag=basic;havana_gene=OTTHUMG00000002840.1;havana_transcript=OTTHUMT00000007953.1

Note "gene_type=antisense".

Investigating this correctness error, I looked at the result if I stop at Map(get_exon), and look at the exon in question.

julia> (clinvar_confident_pathogenic_snvs |> eachrow |> withprogress |>
           Map(get_exon) |> collect)[79][3, :attributes]
Dict{String, Vector{String}} with 15 entries:
  "havana_gene"              => ["OTTHUMG00000002840.1"]
  "exon_id"                  => ["ENSE00001711022.1"]
  "exon_number"              => ["2"]
  "havana_transcript"        => ["OTTHUMT00000007953.1"]
  "transcript_support_level" => ["2"]
  "gene_id"                  => ["ENSG00000117242.7"]
  "transcript_id"            => ["ENST00000451424.1"]
  "Parent"                   => ["ENST00000451424.1"]
  "gene_type"                => ["antisense"]
  "level"                    => ["2"]
  "gene_name"                => ["PINK1-AS"]
  "transcript_type"          => ["antisense"]
  "tag"                      => ["basic"]
  "transcript_name"          => ["PINK1-AS-001"]
  "ID"                       => ["exon:ENST00000451424.1:2"]

This correctly reports exon:ENST00000451424.1:2 as being from an antisense gene. No "protein_coding" in sight. Second guessing my sanity, I tried adding the Map(rotate_attributes) step back in:

julia> (clinvar_confident_pathogenic_snvs |> eachrow |> withprogress |>
           Map(get_exon) |> Map(rotate_attributes) |> collect)[79]
2×15 DataFrame
 Row │ ID                        Parent             exon_id            exon_number  gene_id            gene_name  gene_type       havana_gene           havana_transcript   ⋯
     │ String                    String             String             String       String             String     String          String                String              ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ exon:ENST00000400490.2:1  ENST00000400490.2  ENSE00001946236.1  1            ENSG00000158828.7  PINK1      protein_coding  OTTHUMG00000002841.1  OTTHUMT00000007956. ⋯
   2 │ exon:ENST00000451424.1:2  ENST00000451424.1  ENSE00001711022.1  2            ENSG00000117242.7  PINK1-AS   antisense       OTTHUMG00000002840.1  OTTHUMT00000007953.
                                                                                                                                                            7 columns omitted

... and it's antisense now.

This is now the exact same expression I used to set genes_from_exons, but the result has changed!?! Checking my REPL history, I didn't do anything to genes_from_exons other than inspect some of the values. I can't imagine any issue with get_exon, as if that behaved any differently we'd see different exons entirely. Perhaps there's something fishy about rotate_attributes, but looking at my definition again I can't see any hints of non-determinism.

function rotate_attributes(ginfo)
    if !isempty(ginfo)
        filter((k, v)::Pair -> length(v) == length(ginfo.attributes),
               mergewith(vcat, ginfo.attributes...)) |> DataFrame |>
                   df -> unique(df, :gene_id)
    else DataFrame() end
end

So, what on earth could be going on here??


Last updated: Oct 02 2023 at 04:34 UTC