From 2ed06e0ce2b167ef230783238e1e1098e9bc0381 Mon Sep 17 00:00:00 2001
From: Charles Plessy <https://launchpad.net/~plessy>
Date: Sun, 21 Dec 2014 13:43:15 +0900
Subject: [PATCH] Premiers pas en Haskell.

---
 Haskell/refSeqIdSymbol.html | 143 ++++++++++++++++++++
 Haskell/refSeqIdSymbol.lhs  | 263 ++++++++++++++++++++++++++++++++++++
 2 files changed, 406 insertions(+)
 create mode 100644 Haskell/refSeqIdSymbol.html
 create mode 100644 Haskell/refSeqIdSymbol.lhs
diff --git a/Haskell/refSeqIdSymbol.html b/Haskell/refSeqIdSymbol.html
new file mode 100644
index 00000000..17897f44
--- /dev/null
+++ b/Haskell/refSeqIdSymbol.html
@@ -0,0 +1,143 @@
+<h1 id="a-refseq-parser-that-outputs-the-gene-symbol-for-each-id">A RefSeq parser that outputs the gene symbol for each ID</h1>
+<h2 id="the-goal">The goal</h2>
+<p><a href="https://www.ncbi.nlm.nih.gov/refseq/">RefSeq</a> is a database of <em>ref</em>erence biological (DNA, RNA or protein) <em>seq</em>ences made by the National Center for Biological Information (NCBI, USA). In a project at work we wanted to list for each entry in the RNA database the unique version number of the entry and its gene symbol (a short human-readable name).</p>
+<p>This looked like an excellent excuse to practice <a href="https://www.haskell.org">Haskell</a>... this is the first program I wrote in that language. I did not yet manage to make it run fast enough compared to a minimalistic strategy using shell commands.</p>
+<p>Here I describe the format that I want to parse, the Haskell program I wrote, and the quick-and-dirty processing with shell commands.</p>
+<h2 id="the-genbank-format">The GenBank format</h2>
+<p>Human sequences can be downloaded from NCBI's <a href="ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/">FTP</a> site, in a structured format called <a href="https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html">GenBank</a>. RefSeq entries typically start like the following one.</p>
+<pre><code>LOCUS       NM_131041               1399 bp    mRNA    linear   VRT 28-SEP-2014
+DEFINITION  Danio rerio neurogenin 1 (neurog1), mRNA.
+ACCESSION   NM_131041
+VERSION     NM_131041.1  GI:18859080
+KEYWORDS    RefSeq.
+SOURCE      Danio rerio (zebrafish)
+  ORGANISM  Danio rerio
+            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
+            Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
+            Cypriniformes; Cyprinidae; Danio.</code></pre>
+<p>The record is made of fields, which start with their name in upper case at the beginning of a line. Similarly to <a href="https://tools.ietf.org/html/rfc5322#section-2.2.3">e-mail headers</a>, a field in the GenBank format includes the next line if this line starts with a space. Fields can be nested by indentation, like for the ORGANISM field above.</p>
+<p>In the VERSION field, the version number is the first element (NM_131041.1 in the example above).</p>
+<p>The gene symbol is contained in the FEATURE field. In the record NM_131041.1, it starts like the following.</p>
+<pre><code>FEATURES             Location/Qualifiers
+     source          1..1399
+                     /organism=&quot;Danio rerio&quot;
+                     /mol_type=&quot;mRNA&quot;
+                     /db_xref=&quot;taxon:7955&quot;
+                     /chromosome=&quot;14&quot;
+                     /map=&quot;14&quot;
+     gene            1..1399
+                     /gene=&quot;neurog1&quot;
+                     /gene_synonym=&quot;cb260; chunp6899; neurod3; ngn1; ngr1;
+                     zNgn1&quot;
+                     /note=&quot;neurogenin 1&quot;
+                     /db_xref=&quot;GeneID:30239&quot;
+                     /db_xref=&quot;ZFIN:ZDB-GENE-990415-174&quot;</code></pre>
+<p>The FEATURES field also contains indented sub-fields, but their name is not restricted to upper-case characters. These subfields are structured: their value starts with sequence coordinates, followed by a list of keys and values, where each pair of keys and values is separated by spaces and a slash '/'.</p>
+<p>Lastly, GenBank records are terminated by a line that contains exactly two slashes (<code>//</code>) and nothing else.</p>
+<h2 id="the-program">The program</h2>
+<p>This whole file is written using <a href="https://www.haskell.org/haskellwiki/Literate_programming">literate Haskell</a>. It can actually be compiled! In literate Haskell, everything is comment by default and the code is prefixed by '&gt; '.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell"><span class="kw">import </span><span class="dt">Text.Parsec</span>
+<span class="kw">import </span><span class="dt">Text.Parsec.String</span>
+<span class="kw">import </span><span class="dt">Data.List</span> (intercalate)
+<span class="kw">import </span><span class="dt">Data.List.Split</span> (splitOn)</code></pre>
+<p>I used the <a href="http://hackage.haskell.org/package/parsec">Parsec</a> package for parsing, as well as two helper functions from packages related to list handling.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">main <span class="fu">=</span> <span class="kw">do</span>
+  r <span class="ot">&lt;-</span> getContents
+  <span class="kw">let</span> rs <span class="fu">=</span> splitOn <span class="st">&quot;//\n&quot;</span> r
+  mapM_ parseGbRecord rs</code></pre>
+<p>It does not seem possible to parse a stream of text: the whole parsing must be finished before a result is returned, which was not feasable with hundreds of megabytes of data. Therefore, in the main loop getting the contents from <em>stdin</em>, I split the records one after the other, and run the parser on each of them.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell"><span class="ot">parseGbRecord ::</span> <span class="dt">String</span> <span class="ot">-&gt;</span> <span class="dt">IO</span> ()
+parseGbRecord r <span class="fu">=</span> <span class="kw">case</span> parse gbRecord <span class="st">&quot;(stdin)&quot;</span> r <span class="kw">of</span>
+            <span class="dt">Left</span> e  <span class="ot">-&gt;</span> <span class="kw">do</span> putStrLn <span class="st">&quot;Error parsing input:&quot;</span>
+                          print e
+            <span class="dt">Right</span> r <span class="ot">-&gt;</span>  putStrLn r</code></pre>
+<p>Parsec returns either an error message or the result of the parsing.</p>
+<p>TODO: how about printing always ? What is the difference between pring and putStrLn ?</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">gbRecord <span class="fu">=</span> <span class="kw">do</span>
+  fs <span class="ot">&lt;-</span> many field
+  return <span class="fu">.</span> intercalate <span class="st">&quot;\t&quot;</span> <span class="fu">$</span> filter (<span class="fu">/=</span> <span class="st">&quot;&quot;</span>) fs</code></pre>
+<p>A GenBank record is made of many fields. For each field the parser reports a string, so the result will be a list of strings. Some strings will be empty and discarded. The resulting list is collapsed in a tab-separated string.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">field <span class="fu">=</span> <span class="kw">do</span>
+  f <span class="ot">&lt;-</span> fieldName
+  many1 space
+  <span class="kw">case</span> f <span class="kw">of</span>
+    <span class="st">&quot;VERSION&quot;</span>  <span class="ot">-&gt;</span> getVersionNumber
+    <span class="st">&quot;FEATURES&quot;</span> <span class="ot">-&gt;</span> getGeneSymbol
+    _          <span class="ot">-&gt;</span> endField <span class="fu">&gt;&gt;</span> return <span class="st">&quot;&quot;</span></code></pre>
+<p>A field starts with a field name, which is recorded. It is followed with multiple spaces. Since I am only intersted in version and gene symbol, if the field name was VERSION or FEATURES, more information is extracted, otherwise an empty string is returned.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">fieldName <span class="fu">=</span> many1 upper
+endField  <span class="fu">=</span> manyTill anyChar (try separator <span class="fu">&lt;|&gt;</span> try eof)
+separator <span class="fu">=</span> newline <span class="fu">&gt;&gt;</span> notFollowedBy (char <span class="ch">&#39; &#39;</span>)</code></pre>
+<p>A field name is upper case. Fields continue with any character until a separator or the end of the file is reached. A separator is a newline character not followed by a space.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">getVersionNumber <span class="fu">=</span> <span class="kw">do</span>
+  <span class="kw">let</span> versionNumberChar <span class="fu">=</span> oneOf <span class="fu">$</span> <span class="st">&quot;NXRM_&quot;</span> <span class="fu">++</span> [<span class="ch">&#39;0&#39;</span><span class="fu">..</span><span class="ch">&#39;9&#39;</span>] <span class="fu">++</span> <span class="st">&quot;.&quot;</span>
+  v <span class="ot">&lt;-</span> many versionNumberChar
+  endField
+  return v</code></pre>
+<p>A version number is a string containing the letters N, X, R or M, underscores, digits and dots. The precise definition is actually stricter: the version numbers start by NM_, NR_, XM_ or XR_, but I was worried of the performance hit if testing for this precisely.</p>
+<p>Once the version number is read, it is recorded, and the rest of the field is discarded.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">getGeneSymbol <span class="fu">=</span> <span class="kw">do</span>
+  <span class="kw">let</span> geneSymbolChar <span class="fu">=</span> oneOf <span class="fu">$</span> [<span class="ch">&#39;A&#39;</span><span class="fu">..</span><span class="ch">&#39;Z&#39;</span>] <span class="fu">++</span> <span class="st">&quot;orf&quot;</span> <span class="fu">++</span> <span class="st">&quot;p&quot;</span> <span class="fu">++</span> <span class="st">&quot;-&quot;</span> <span class="fu">++</span> [<span class="ch">&#39;0&#39;</span><span class="fu">..</span><span class="ch">&#39;9&#39;</span>]
+  manyTill anyChar (try (string <span class="st">&quot;/gene=\&quot;&quot;</span>))
+  g <span class="ot">&lt;-</span> many geneSymbolChar
+  char <span class="ch">&#39;&quot;&#39;</span>
+  endField
+  return g</code></pre>
+<p>Similarly, a gene symbol is made of uppercase letters, lower-case o, r, f or p, dashes and digits. It is recorded after finding the string <code>/gene=&quot;</code>. The parser then checks that the closing double quote is present, and discards it together with the rest of the field.</p>
+<h2 id="quick-and-dirty-parsing-with-unix-tools">Quick-and-dirty parsing with Unix tools</h2>
+<p>With the following command, I could produce a list of version numbers and gene symbols in a minute or so. However, what it does is not very obvious, nor flexible as it does not follow the format's syntax, and is probably very fragile.</p>
+<pre><code>zcat human.*.rna.gbff.gz |
+  grep -e VERSION -e gene= |
+  uniq |
+  sed &#39;s/=/ /&#39; |
+  awk &#39;{print $2}&#39; |
+  tr &#39;\n&#39; &#39;\t&#39; |
+  sed -e &#39;s/&quot;\t/\n/g&#39; -e &#39;s/&quot;//g&#39; &gt; human.rna.ID-Symbol.txt</code></pre>
+<p>In brief, it:</p>
+<ul class="incremental">
+<li><p>filters lines containing <code>VERSION</code> or <code>gene=</code> and discards the other ones;</p></li>
+<li><p>removes consecutive duplicates (there are multiple features per locus crosslinking to gene IDs), assuming that only one gene symbol is used per record;</p></li>
+<li><p>replaces <code>=</code> with a space so that the relevant information appears to be on the second column if one considers the flowing text to be a space-separated table;</p></li>
+<li><p>keeps only the second column (at that point, the version numbers and gene symbols alternate on successive lines;</p></li>
+<li><p>replaces newlines by tabulations, so that the data is now a gigantic lines where tab-separated fields alternate for version numbers and symbols;</p></li>
+<li><p>uses the double quotes around the gene symbol as a landmark to transform the flowing text into a tab-separated file with the two wanted fields;</p></li>
+<li><p>cleans up the remaining double quotes.</p></li>
+</ul>
+<h2 id="speed">Speed</h2>
+<p>Unfortunately, the Haskell version is way too slow to process the full RefSeq data. Here is a comparison using a test file of only 476 Kib.</p>
+<pre><code>$ ghc -O2 refSeqIdSymbol.lhs
+[1 of 1] Compiling Main             ( refSeqIdSymbol.lhs, refSeqIdSymbol.o )
+Linking refSeqIdSymbol ...
+choucaâ~â$ time ./refSeqIdSymbol &lt; hopla.gb 
+NM_001142483.1  NREP
+NM_001142481.1  NREP
+NM_001142480.1  NREP
+NM_001142477.1  NREP
+NM_001142475.1  NREP
+NM_001142474.1  NREP
+NM_004772.2 NREP
+NM_001142466.1  GPT2
+NM_133443.2 GPT2
+NM_173685.2 NSMCE2
+NM_007058.3 CAPN11
+
+
+real    0m0.701s
+user    0m0.664s
+sys 0m0.028s</code></pre>
+<pre><code>$ time cat hopla.gb | grep -e VERSION -e gene= |   uniq |   sed &#39;s/=/ /&#39; |   awk &#39;{print $2}&#39; |   tr &#39;\n&#39; &#39;\t&#39; |   sed -e &#39;s/&quot;\t/\n/g&#39; -e &#39;s/&quot;//g&#39;
+NM_001142483.1  NREP
+NM_001142481.1  NREP
+NM_001142480.1  NREP
+NM_001142477.1  NREP
+NM_001142475.1  NREP
+NM_001142474.1  NREP
+NM_004772.2 NREP
+NM_001142466.1  GPT2
+NM_133443.2 GPT2
+NM_173685.2 NSMCE2
+NM_007058.3 CAPN11
+
+real    0m0.015s
+user    0m0.004s
+sys 0m0.004s</code></pre>
diff --git a/Haskell/refSeqIdSymbol.lhs b/Haskell/refSeqIdSymbol.lhs
new file mode 100644
index 00000000..d4e33f9c
--- /dev/null
+++ b/Haskell/refSeqIdSymbol.lhs
@@ -0,0 +1,263 @@
+A RefSeq parser that outputs the gene symbol for each ID
+========================================================
+
+The goal
+--------
+
+[RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) is a database of _ref_erence
+biological (DNA, RNA or protein) _seq_ences made by the National Center for
+Biological Information (NCBI, USA).  In a project at work we wanted to list for
+each entry in the RNA database the unique version number of the entry and its
+gene symbol (a short human-readable name).
+
+This looked like an excellent excuse to practice [Haskell](https://www.haskell.org)...
+this is the first program I wrote in that language.  I did not yet manage to make
+it run fast enough compared to a minimalistic strategy using shell commands.
+
+Here I describe the format that I want to parse, the Haskell program I wrote,
+and the quick-and-dirty processing with shell commands.
+
+
+The GenBank format
+------------------
+
+Human sequences can be downloaded from NCBI's
+[FTP](ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/) site, in a
+structured format called
+[GenBank](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html).  RefSeq
+entries typically start like the following one. 
+
+~~~~
+LOCUS       NM_131041               1399 bp    mRNA    linear   VRT 28-SEP-2014
+DEFINITION  Danio rerio neurogenin 1 (neurog1), mRNA.
+ACCESSION   NM_131041
+VERSION     NM_131041.1  GI:18859080
+KEYWORDS    RefSeq.
+SOURCE      Danio rerio (zebrafish)
+  ORGANISM  Danio rerio
+            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
+            Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
+            Cypriniformes; Cyprinidae; Danio.
+~~~~
+
+The record is made of fields, which start with their name in upper case at the
+beginning of a line.  Similarly to [e-mail
+headers](https://tools.ietf.org/html/rfc5322#section-2.2.3), a field in the
+GenBank format includes the next line if this line starts with a space.  Fields
+can be nested by indentation, like for the ORGANISM field above.
+
+In the VERSION field, the version number is the first element (NM_131041.1 in
+the example above).
+
+The gene symbol is contained in the FEATURE field.  In the record NM_131041.1, it
+starts like the following.
+
+~~~~
+FEATURES             Location/Qualifiers
+     source          1..1399
+                     /organism="Danio rerio"
+                     /mol_type="mRNA"
+                     /db_xref="taxon:7955"
+                     /chromosome="14"
+                     /map="14"
+     gene            1..1399
+                     /gene="neurog1"
+                     /gene_synonym="cb260; chunp6899; neurod3; ngn1; ngr1;
+                     zNgn1"
+                     /note="neurogenin 1"
+                     /db_xref="GeneID:30239"
+                     /db_xref="ZFIN:ZDB-GENE-990415-174"
+~~~~
+
+The FEATURES field also contains indented sub-fields, but their name is not
+restricted to upper-case characters.  These subfields are structured: their
+value starts with sequence coordinates, followed by a list of keys and values,
+where each pair of keys and values is separated by spaces and a slash '/'.
+
+Lastly, GenBank records are terminated by a line that contains exactly two
+slashes (`//`) and nothing else.
+
+
+The program
+-----------
+
+This whole file is written using [literate
+Haskell](https://www.haskell.org/haskellwiki/Literate_programming).  It can
+actually be compiled!  In literate Haskell, everything is comment by default
+and the code is prefixed by '> '.
+
+> import Text.Parsec
+> import Text.Parsec.String
+> import Data.List (intercalate)
+> import Data.List.Split (splitOn)
+
+I used the [Parsec](http://hackage.haskell.org/package/parsec) package for
+parsing, as well as two helper functions from packages related to list
+handling.
+
+> main = do
+>   r <- getContents
+>   let rs = splitOn "//\n" r
+>   mapM_ parseGbRecord rs
+
+It does not seem possible to parse a stream of text: the whole parsing must be
+finished before a result is returned, which was not feasable with hundreds of
+megabytes of data.  Therefore, in the main loop getting the contents from
+_stdin_, I split the records one after the other, and run the parser on each of
+them.
+
+> parseGbRecord :: String -> IO ()
+> parseGbRecord r = case parse gbRecord "(stdin)" r of
+>             Left e  -> do putStrLn "Error parsing input:"
+>                           print e
+>             Right r ->  putStrLn r
+
+Parsec returns either an error message or the result of the parsing.
+
+TODO: how about printing always ?  What is the difference between pring and putStrLn ?
+
+> gbRecord = do
+>   fs <- many field
+>   return . intercalate "\t" $ filter (/= "") fs
+
+A GenBank record is made of many fields.  For each field the parser reports a
+string, so the result will be a list of strings.  Some strings will be empty
+and discarded.  The resulting list is collapsed in a tab-separated string.
+
+> field = do
+>   f <- fieldName
+>   many1 space
+>   case f of
+>     "VERSION"  -> getVersionNumber
+>     "FEATURES" -> getGeneSymbol
+>     _          -> endField >> return ""
+
+A field starts with a field name, which is recorded.  It is followed with
+multiple spaces.  Since I am only intersted in version and gene symbol, if the
+field name was VERSION or FEATURES, more information is extracted, otherwise
+an empty string is returned.
+
+> fieldName = many1 upper
+> endField  = manyTill anyChar (try separator <|> try eof)
+> separator = newline >> notFollowedBy (char ' ')
+
+A field name is upper case.  Fields continue with any character until a
+separator or the end of the file is reached.  A separator is a newline
+character not followed by a space.
+
+> getVersionNumber = do
+>   let versionNumberChar = oneOf $ "NXRM_" ++ ['0'..'9'] ++ "."
+>   v <- many versionNumberChar
+>   endField
+>   return v
+
+A version number is a string containing the letters N, X, R or M, underscores,
+digits and dots.  The precise definition is actually stricter: the version
+numbers start by NM_, NR_, XM_ or XR_, but I was worried of the performance hit
+if testing for this precisely.
+
+Once the version number is read, it is recorded, and the rest of the field is
+discarded.
+
+> getGeneSymbol = do
+>   let geneSymbolChar = oneOf $ ['A'..'Z'] ++ "orf" ++ "p" ++ "-" ++ ['0'..'9']
+>   manyTill anyChar (try (string "/gene=\""))
+>   g <- many geneSymbolChar
+>   char '"'
+>   endField
+>   return g
+
+Similarly, a gene symbol is made of uppercase letters, lower-case o, r, f or p,
+dashes and digits.  It is recorded after finding the string `/gene="`.  The
+parser then checks that the closing double quote is present, and discards it
+together with the rest of the field.
+
+
+Quick-and-dirty parsing with Unix tools
+---------------------------------------
+
+With the following command, I could produce a list of version numbers and gene
+symbols in a minute or so.  However, what it does is not very obvious, nor
+flexible as it does not follow the format's syntax, and is probably very
+fragile.
+
+~~~~
+zcat human.*.rna.gbff.gz |
+  grep -e VERSION -e gene= |
+  uniq |
+  sed 's/=/ /' |
+  awk '{print $2}' |
+  tr '\n' '\t' |
+  sed -e 's/"\t/\n/g' -e 's/"//g' > human.rna.ID-Symbol.txt
+~~~~
+
+In brief, it:
+
+ - filters lines containing `VERSION` or `gene=` and discards the other ones;
+
+ - removes consecutive duplicates (there are multiple features per locus
+   crosslinking to gene IDs), assuming that only one gene symbol is used per
+   record;
+
+ - replaces `=` with a space so that the relevant information appears to be on
+   the second column if one considers the flowing text to be a space-separated
+   table;
+
+ - keeps only the second column (at that point, the version numbers and gene symbols
+   alternate on successive lines;
+
+ - replaces newlines by tabulations, so that the data is now a gigantic lines
+   where tab-separated fields alternate for version numbers and symbols;
+
+ - uses the double quotes around the gene symbol as a landmark to transform the
+   flowing text into a tab-separated file with the two wanted fields;
+
+ - cleans up the remaining double quotes.
+
+Speed
+-----
+
+Unfortunately, the Haskell version is way too slow to process the full RefSeq
+data.  Here is a comparison using a test file of only 476 Kib.
+
+~~~~~
+$ ghc -O2 refSeqIdSymbol.lhs
+[1 of 1] Compiling Main             ( refSeqIdSymbol.lhs, refSeqIdSymbol.o )
+Linking refSeqIdSymbol ...
+choucaâ~â$ time ./refSeqIdSymbol < hopla.gb 
+NM_001142483.1	NREP
+NM_001142481.1	NREP
+NM_001142480.1	NREP
+NM_001142477.1	NREP
+NM_001142475.1	NREP
+NM_001142474.1	NREP
+NM_004772.2	NREP
+NM_001142466.1	GPT2
+NM_133443.2	GPT2
+NM_173685.2	NSMCE2
+NM_007058.3	CAPN11
+
+
+real	0m0.701s
+user	0m0.664s
+sys	0m0.028s
+~~~~~
+
+~~~~~
+$ time cat hopla.gb | grep -e VERSION -e gene= |   uniq |   sed 's/=/ /' |   awk '{print $2}' |   tr '\n' '\t' |   sed -e 's/"\t/\n/g' -e 's/"//g'
+NM_001142483.1	NREP
+NM_001142481.1	NREP
+NM_001142480.1	NREP
+NM_001142477.1	NREP
+NM_001142475.1	NREP
+NM_001142474.1	NREP
+NM_004772.2	NREP
+NM_001142466.1	GPT2
+NM_133443.2	GPT2
+NM_173685.2	NSMCE2
+NM_007058.3	CAPN11
+
+real	0m0.015s
+user	0m0.004s
+sys	0m0.004s
+~~~~~
-- 
2.47.3