--- /dev/null
+<h1 id="a-refseq-parser-that-outputs-the-gene-symbol-for-each-id">A RefSeq parser that outputs the gene symbol for each ID</h1>
+<h2 id="the-goal">The goal</h2>
+<p><a href="https://www.ncbi.nlm.nih.gov/refseq/">RefSeq</a> is a database of <em>ref</em>erence biological (DNA, RNA or protein) <em>seq</em>ences made by the National Center for Biological Information (NCBI, USA). In a project at work we wanted to list for each entry in the RNA database the unique version number of the entry and its gene symbol (a short human-readable name).</p>
+<p>This looked like an excellent excuse to practice <a href="https://www.haskell.org">Haskell</a>... this is the first program I wrote in that language. I did not yet manage to make it run fast enough compared to a minimalistic strategy using shell commands.</p>
+<p>Here I describe the format that I want to parse, the Haskell program I wrote, and the quick-and-dirty processing with shell commands.</p>
+<h2 id="the-genbank-format">The GenBank format</h2>
+<p>Human sequences can be downloaded from NCBI's <a href="ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/">FTP</a> site, in a structured format called <a href="https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html">GenBank</a>. RefSeq entries typically start like the following one.</p>
+<pre><code>LOCUS NM_131041 1399 bp mRNA linear VRT 28-SEP-2014
+DEFINITION Danio rerio neurogenin 1 (neurog1), mRNA.
+ACCESSION NM_131041
+VERSION NM_131041.1 GI:18859080
+KEYWORDS RefSeq.
+SOURCE Danio rerio (zebrafish)
+ ORGANISM Danio rerio
+ Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
+ Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
+ Cypriniformes; Cyprinidae; Danio.</code></pre>
+<p>The record is made of fields, which start with their name in upper case at the beginning of a line. Similarly to <a href="https://tools.ietf.org/html/rfc5322#section-2.2.3">e-mail headers</a>, a field in the GenBank format includes the next line if this line starts with a space. Fields can be nested by indentation, like for the ORGANISM field above.</p>
+<p>In the VERSION field, the version number is the first element (NM_131041.1 in the example above).</p>
+<p>The gene symbol is contained in the FEATURE field. In the record NM_131041.1, it starts like the following.</p>
+<pre><code>FEATURES Location/Qualifiers
+ source 1..1399
+ /organism="Danio rerio"
+ /mol_type="mRNA"
+ /db_xref="taxon:7955"
+ /chromosome="14"
+ /map="14"
+ gene 1..1399
+ /gene="neurog1"
+ /gene_synonym="cb260; chunp6899; neurod3; ngn1; ngr1;
+ zNgn1"
+ /note="neurogenin 1"
+ /db_xref="GeneID:30239"
+ /db_xref="ZFIN:ZDB-GENE-990415-174"</code></pre>
+<p>The FEATURES field also contains indented sub-fields, but their name is not restricted to upper-case characters. These subfields are structured: their value starts with sequence coordinates, followed by a list of keys and values, where each pair of keys and values is separated by spaces and a slash '/'.</p>
+<p>Lastly, GenBank records are terminated by a line that contains exactly two slashes (<code>//</code>) and nothing else.</p>
+<h2 id="the-program">The program</h2>
+<p>This whole file is written using <a href="https://www.haskell.org/haskellwiki/Literate_programming">literate Haskell</a>. It can actually be compiled! In literate Haskell, everything is comment by default and the code is prefixed by '> '.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell"><span class="kw">import </span><span class="dt">Text.Parsec</span>
+<span class="kw">import </span><span class="dt">Text.Parsec.String</span>
+<span class="kw">import </span><span class="dt">Data.List</span> (intercalate)
+<span class="kw">import </span><span class="dt">Data.List.Split</span> (splitOn)</code></pre>
+<p>I used the <a href="http://hackage.haskell.org/package/parsec">Parsec</a> package for parsing, as well as two helper functions from packages related to list handling.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">main <span class="fu">=</span> <span class="kw">do</span>
+ r <span class="ot"><-</span> getContents
+ <span class="kw">let</span> rs <span class="fu">=</span> splitOn <span class="st">"//\n"</span> r
+ mapM_ parseGbRecord rs</code></pre>
+<p>It does not seem possible to parse a stream of text: the whole parsing must be finished before a result is returned, which was not feasable with hundreds of megabytes of data. Therefore, in the main loop getting the contents from <em>stdin</em>, I split the records one after the other, and run the parser on each of them.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell"><span class="ot">parseGbRecord ::</span> <span class="dt">String</span> <span class="ot">-></span> <span class="dt">IO</span> ()
+parseGbRecord r <span class="fu">=</span> <span class="kw">case</span> parse gbRecord <span class="st">"(stdin)"</span> r <span class="kw">of</span>
+ <span class="dt">Left</span> e <span class="ot">-></span> <span class="kw">do</span> putStrLn <span class="st">"Error parsing input:"</span>
+ print e
+ <span class="dt">Right</span> r <span class="ot">-></span> putStrLn r</code></pre>
+<p>Parsec returns either an error message or the result of the parsing.</p>
+<p>TODO: how about printing always ? What is the difference between pring and putStrLn ?</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">gbRecord <span class="fu">=</span> <span class="kw">do</span>
+ fs <span class="ot"><-</span> many field
+ return <span class="fu">.</span> intercalate <span class="st">"\t"</span> <span class="fu">$</span> filter (<span class="fu">/=</span> <span class="st">""</span>) fs</code></pre>
+<p>A GenBank record is made of many fields. For each field the parser reports a string, so the result will be a list of strings. Some strings will be empty and discarded. The resulting list is collapsed in a tab-separated string.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">field <span class="fu">=</span> <span class="kw">do</span>
+ f <span class="ot"><-</span> fieldName
+ many1 space
+ <span class="kw">case</span> f <span class="kw">of</span>
+ <span class="st">"VERSION"</span> <span class="ot">-></span> getVersionNumber
+ <span class="st">"FEATURES"</span> <span class="ot">-></span> getGeneSymbol
+ _ <span class="ot">-></span> endField <span class="fu">>></span> return <span class="st">""</span></code></pre>
+<p>A field starts with a field name, which is recorded. It is followed with multiple spaces. Since I am only intersted in version and gene symbol, if the field name was VERSION or FEATURES, more information is extracted, otherwise an empty string is returned.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">fieldName <span class="fu">=</span> many1 upper
+endField <span class="fu">=</span> manyTill anyChar (try separator <span class="fu"><|></span> try eof)
+separator <span class="fu">=</span> newline <span class="fu">>></span> notFollowedBy (char <span class="ch">' '</span>)</code></pre>
+<p>A field name is upper case. Fields continue with any character until a separator or the end of the file is reached. A separator is a newline character not followed by a space.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">getVersionNumber <span class="fu">=</span> <span class="kw">do</span>
+ <span class="kw">let</span> versionNumberChar <span class="fu">=</span> oneOf <span class="fu">$</span> <span class="st">"NXRM_"</span> <span class="fu">++</span> [<span class="ch">'0'</span><span class="fu">..</span><span class="ch">'9'</span>] <span class="fu">++</span> <span class="st">"."</span>
+ v <span class="ot"><-</span> many versionNumberChar
+ endField
+ return v</code></pre>
+<p>A version number is a string containing the letters N, X, R or M, underscores, digits and dots. The precise definition is actually stricter: the version numbers start by NM_, NR_, XM_ or XR_, but I was worried of the performance hit if testing for this precisely.</p>
+<p>Once the version number is read, it is recorded, and the rest of the field is discarded.</p>
+<pre class="sourceCode literate haskell"><code class="sourceCode haskell">getGeneSymbol <span class="fu">=</span> <span class="kw">do</span>
+ <span class="kw">let</span> geneSymbolChar <span class="fu">=</span> oneOf <span class="fu">$</span> [<span class="ch">'A'</span><span class="fu">..</span><span class="ch">'Z'</span>] <span class="fu">++</span> <span class="st">"orf"</span> <span class="fu">++</span> <span class="st">"p"</span> <span class="fu">++</span> <span class="st">"-"</span> <span class="fu">++</span> [<span class="ch">'0'</span><span class="fu">..</span><span class="ch">'9'</span>]
+ manyTill anyChar (try (string <span class="st">"/gene=\""</span>))
+ g <span class="ot"><-</span> many geneSymbolChar
+ char <span class="ch">'"'</span>
+ endField
+ return g</code></pre>
+<p>Similarly, a gene symbol is made of uppercase letters, lower-case o, r, f or p, dashes and digits. It is recorded after finding the string <code>/gene="</code>. The parser then checks that the closing double quote is present, and discards it together with the rest of the field.</p>
+<h2 id="quick-and-dirty-parsing-with-unix-tools">Quick-and-dirty parsing with Unix tools</h2>
+<p>With the following command, I could produce a list of version numbers and gene symbols in a minute or so. However, what it does is not very obvious, nor flexible as it does not follow the format's syntax, and is probably very fragile.</p>
+<pre><code>zcat human.*.rna.gbff.gz |
+ grep -e VERSION -e gene= |
+ uniq |
+ sed 's/=/ /' |
+ awk '{print $2}' |
+ tr '\n' '\t' |
+ sed -e 's/"\t/\n/g' -e 's/"//g' > human.rna.ID-Symbol.txt</code></pre>
+<p>In brief, it:</p>
+<ul class="incremental">
+<li><p>filters lines containing <code>VERSION</code> or <code>gene=</code> and discards the other ones;</p></li>
+<li><p>removes consecutive duplicates (there are multiple features per locus crosslinking to gene IDs), assuming that only one gene symbol is used per record;</p></li>
+<li><p>replaces <code>=</code> with a space so that the relevant information appears to be on the second column if one considers the flowing text to be a space-separated table;</p></li>
+<li><p>keeps only the second column (at that point, the version numbers and gene symbols alternate on successive lines;</p></li>
+<li><p>replaces newlines by tabulations, so that the data is now a gigantic lines where tab-separated fields alternate for version numbers and symbols;</p></li>
+<li><p>uses the double quotes around the gene symbol as a landmark to transform the flowing text into a tab-separated file with the two wanted fields;</p></li>
+<li><p>cleans up the remaining double quotes.</p></li>
+</ul>
+<h2 id="speed">Speed</h2>
+<p>Unfortunately, the Haskell version is way too slow to process the full RefSeq data. Here is a comparison using a test file of only 476 Kib.</p>
+<pre><code>$ ghc -O2 refSeqIdSymbol.lhs
+[1 of 1] Compiling Main ( refSeqIdSymbol.lhs, refSeqIdSymbol.o )
+Linking refSeqIdSymbol ...
+chouca⁅~⁆$ time ./refSeqIdSymbol < hopla.gb
+NM_001142483.1 NREP
+NM_001142481.1 NREP
+NM_001142480.1 NREP
+NM_001142477.1 NREP
+NM_001142475.1 NREP
+NM_001142474.1 NREP
+NM_004772.2 NREP
+NM_001142466.1 GPT2
+NM_133443.2 GPT2
+NM_173685.2 NSMCE2
+NM_007058.3 CAPN11
+
+
+real 0m0.701s
+user 0m0.664s
+sys 0m0.028s</code></pre>
+<pre><code>$ time cat hopla.gb | grep -e VERSION -e gene= | uniq | sed 's/=/ /' | awk '{print $2}' | tr '\n' '\t' | sed -e 's/"\t/\n/g' -e 's/"//g'
+NM_001142483.1 NREP
+NM_001142481.1 NREP
+NM_001142480.1 NREP
+NM_001142477.1 NREP
+NM_001142475.1 NREP
+NM_001142474.1 NREP
+NM_004772.2 NREP
+NM_001142466.1 GPT2
+NM_133443.2 GPT2
+NM_173685.2 NSMCE2
+NM_007058.3 CAPN11
+
+real 0m0.015s
+user 0m0.004s
+sys 0m0.004s</code></pre>
--- /dev/null
+A RefSeq parser that outputs the gene symbol for each ID
+========================================================
+
+The goal
+--------
+
+[RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) is a database of _ref_erence
+biological (DNA, RNA or protein) _seq_ences made by the National Center for
+Biological Information (NCBI, USA). In a project at work we wanted to list for
+each entry in the RNA database the unique version number of the entry and its
+gene symbol (a short human-readable name).
+
+This looked like an excellent excuse to practice [Haskell](https://www.haskell.org)...
+this is the first program I wrote in that language. I did not yet manage to make
+it run fast enough compared to a minimalistic strategy using shell commands.
+
+Here I describe the format that I want to parse, the Haskell program I wrote,
+and the quick-and-dirty processing with shell commands.
+
+
+The GenBank format
+------------------
+
+Human sequences can be downloaded from NCBI's
+[FTP](ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/) site, in a
+structured format called
+[GenBank](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html). RefSeq
+entries typically start like the following one.
+
+~~~~
+LOCUS NM_131041 1399 bp mRNA linear VRT 28-SEP-2014
+DEFINITION Danio rerio neurogenin 1 (neurog1), mRNA.
+ACCESSION NM_131041
+VERSION NM_131041.1 GI:18859080
+KEYWORDS RefSeq.
+SOURCE Danio rerio (zebrafish)
+ ORGANISM Danio rerio
+ Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
+ Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
+ Cypriniformes; Cyprinidae; Danio.
+~~~~
+
+The record is made of fields, which start with their name in upper case at the
+beginning of a line. Similarly to [e-mail
+headers](https://tools.ietf.org/html/rfc5322#section-2.2.3), a field in the
+GenBank format includes the next line if this line starts with a space. Fields
+can be nested by indentation, like for the ORGANISM field above.
+
+In the VERSION field, the version number is the first element (NM_131041.1 in
+the example above).
+
+The gene symbol is contained in the FEATURE field. In the record NM_131041.1, it
+starts like the following.
+
+~~~~
+FEATURES Location/Qualifiers
+ source 1..1399
+ /organism="Danio rerio"
+ /mol_type="mRNA"
+ /db_xref="taxon:7955"
+ /chromosome="14"
+ /map="14"
+ gene 1..1399
+ /gene="neurog1"
+ /gene_synonym="cb260; chunp6899; neurod3; ngn1; ngr1;
+ zNgn1"
+ /note="neurogenin 1"
+ /db_xref="GeneID:30239"
+ /db_xref="ZFIN:ZDB-GENE-990415-174"
+~~~~
+
+The FEATURES field also contains indented sub-fields, but their name is not
+restricted to upper-case characters. These subfields are structured: their
+value starts with sequence coordinates, followed by a list of keys and values,
+where each pair of keys and values is separated by spaces and a slash '/'.
+
+Lastly, GenBank records are terminated by a line that contains exactly two
+slashes (`//`) and nothing else.
+
+
+The program
+-----------
+
+This whole file is written using [literate
+Haskell](https://www.haskell.org/haskellwiki/Literate_programming). It can
+actually be compiled! In literate Haskell, everything is comment by default
+and the code is prefixed by '> '.
+
+> import Text.Parsec
+> import Text.Parsec.String
+> import Data.List (intercalate)
+> import Data.List.Split (splitOn)
+
+I used the [Parsec](http://hackage.haskell.org/package/parsec) package for
+parsing, as well as two helper functions from packages related to list
+handling.
+
+> main = do
+> r <- getContents
+> let rs = splitOn "//\n" r
+> mapM_ parseGbRecord rs
+
+It does not seem possible to parse a stream of text: the whole parsing must be
+finished before a result is returned, which was not feasable with hundreds of
+megabytes of data. Therefore, in the main loop getting the contents from
+_stdin_, I split the records one after the other, and run the parser on each of
+them.
+
+> parseGbRecord :: String -> IO ()
+> parseGbRecord r = case parse gbRecord "(stdin)" r of
+> Left e -> do putStrLn "Error parsing input:"
+> print e
+> Right r -> putStrLn r
+
+Parsec returns either an error message or the result of the parsing.
+
+TODO: how about printing always ? What is the difference between pring and putStrLn ?
+
+> gbRecord = do
+> fs <- many field
+> return . intercalate "\t" $ filter (/= "") fs
+
+A GenBank record is made of many fields. For each field the parser reports a
+string, so the result will be a list of strings. Some strings will be empty
+and discarded. The resulting list is collapsed in a tab-separated string.
+
+> field = do
+> f <- fieldName
+> many1 space
+> case f of
+> "VERSION" -> getVersionNumber
+> "FEATURES" -> getGeneSymbol
+> _ -> endField >> return ""
+
+A field starts with a field name, which is recorded. It is followed with
+multiple spaces. Since I am only intersted in version and gene symbol, if the
+field name was VERSION or FEATURES, more information is extracted, otherwise
+an empty string is returned.
+
+> fieldName = many1 upper
+> endField = manyTill anyChar (try separator <|> try eof)
+> separator = newline >> notFollowedBy (char ' ')
+
+A field name is upper case. Fields continue with any character until a
+separator or the end of the file is reached. A separator is a newline
+character not followed by a space.
+
+> getVersionNumber = do
+> let versionNumberChar = oneOf $ "NXRM_" ++ ['0'..'9'] ++ "."
+> v <- many versionNumberChar
+> endField
+> return v
+
+A version number is a string containing the letters N, X, R or M, underscores,
+digits and dots. The precise definition is actually stricter: the version
+numbers start by NM_, NR_, XM_ or XR_, but I was worried of the performance hit
+if testing for this precisely.
+
+Once the version number is read, it is recorded, and the rest of the field is
+discarded.
+
+> getGeneSymbol = do
+> let geneSymbolChar = oneOf $ ['A'..'Z'] ++ "orf" ++ "p" ++ "-" ++ ['0'..'9']
+> manyTill anyChar (try (string "/gene=\""))
+> g <- many geneSymbolChar
+> char '"'
+> endField
+> return g
+
+Similarly, a gene symbol is made of uppercase letters, lower-case o, r, f or p,
+dashes and digits. It is recorded after finding the string `/gene="`. The
+parser then checks that the closing double quote is present, and discards it
+together with the rest of the field.
+
+
+Quick-and-dirty parsing with Unix tools
+---------------------------------------
+
+With the following command, I could produce a list of version numbers and gene
+symbols in a minute or so. However, what it does is not very obvious, nor
+flexible as it does not follow the format's syntax, and is probably very
+fragile.
+
+~~~~
+zcat human.*.rna.gbff.gz |
+ grep -e VERSION -e gene= |
+ uniq |
+ sed 's/=/ /' |
+ awk '{print $2}' |
+ tr '\n' '\t' |
+ sed -e 's/"\t/\n/g' -e 's/"//g' > human.rna.ID-Symbol.txt
+~~~~
+
+In brief, it:
+
+ - filters lines containing `VERSION` or `gene=` and discards the other ones;
+
+ - removes consecutive duplicates (there are multiple features per locus
+ crosslinking to gene IDs), assuming that only one gene symbol is used per
+ record;
+
+ - replaces `=` with a space so that the relevant information appears to be on
+ the second column if one considers the flowing text to be a space-separated
+ table;
+
+ - keeps only the second column (at that point, the version numbers and gene symbols
+ alternate on successive lines;
+
+ - replaces newlines by tabulations, so that the data is now a gigantic lines
+ where tab-separated fields alternate for version numbers and symbols;
+
+ - uses the double quotes around the gene symbol as a landmark to transform the
+ flowing text into a tab-separated file with the two wanted fields;
+
+ - cleans up the remaining double quotes.
+
+Speed
+-----
+
+Unfortunately, the Haskell version is way too slow to process the full RefSeq
+data. Here is a comparison using a test file of only 476 Kib.
+
+~~~~~
+$ ghc -O2 refSeqIdSymbol.lhs
+[1 of 1] Compiling Main ( refSeqIdSymbol.lhs, refSeqIdSymbol.o )
+Linking refSeqIdSymbol ...
+chouca⁅~⁆$ time ./refSeqIdSymbol < hopla.gb
+NM_001142483.1 NREP
+NM_001142481.1 NREP
+NM_001142480.1 NREP
+NM_001142477.1 NREP
+NM_001142475.1 NREP
+NM_001142474.1 NREP
+NM_004772.2 NREP
+NM_001142466.1 GPT2
+NM_133443.2 GPT2
+NM_173685.2 NSMCE2
+NM_007058.3 CAPN11
+
+
+real 0m0.701s
+user 0m0.664s
+sys 0m0.028s
+~~~~~
+
+~~~~~
+$ time cat hopla.gb | grep -e VERSION -e gene= | uniq | sed 's/=/ /' | awk '{print $2}' | tr '\n' '\t' | sed -e 's/"\t/\n/g' -e 's/"//g'
+NM_001142483.1 NREP
+NM_001142481.1 NREP
+NM_001142480.1 NREP
+NM_001142477.1 NREP
+NM_001142475.1 NREP
+NM_001142474.1 NREP
+NM_004772.2 NREP
+NM_001142466.1 GPT2
+NM_133443.2 GPT2
+NM_173685.2 NSMCE2
+NM_007058.3 CAPN11
+
+real 0m0.015s
+user 0m0.004s
+sys 0m0.004s
+~~~~~