+++ /dev/null
-# SOME DESCRIPTIVE TITLE
-# Copyright (C) YEAR Free Software Foundation, Inc.
-# This file is distributed under the same license as the PACKAGE package.
-# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: PACKAGE VERSION\n"
-"POT-Creation-Date: 2014-12-21 04:48+0000\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
-"Language-Team: LANGUAGE <LL@li.org>\n"
-"Language: \n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=UTF-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-
-#. type: Content of: <h1>
-msgid "A RefSeq parser that outputs the gene symbol for each ID"
-msgstr ""
-
-#. type: Content of: <h2>
-msgid "The goal"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"<a href=\"https://www.ncbi.nlm.nih.gov/refseq/\">RefSeq</a> is a database of "
-"<em>ref</em>erence biological (DNA, RNA or protein) <em>seq</em>uences made "
-"by the National Center for Biotechnology Information (NCBI, USA). In a "
-"project at work we wanted to list for each entry in the RNA database the "
-"unique version number of the entry and its gene symbol (a short "
-"human-readable name)."
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"This looked like an excellent excuse to practice <a "
-"href=\"https://www.haskell.org\">Haskell</a>... this is the first program I "
-"wrote in that language. I have not yet managed to make it run as fast as a "
-"minimalistic strategy using shell commands."
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"Here I describe the format that I want to parse, the Haskell program I "
-"wrote, and the quick-and-dirty processing with shell commands."
-msgstr ""
-
-#. type: Content of: <h2>
-msgid "The GenBank format"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"Human sequences can be downloaded from NCBI's <a "
-"href=\"ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/\">FTP</a> "
-"site, in a structured format called <a "
-"href=\"https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html\">GenBank</a>. "
-"RefSeq entries typically start like the following one."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code>LOCUS NM_131041 1399 bp mRNA linear VRT "
-"28-SEP-2014\n"
-"DEFINITION Danio rerio neurogenin 1 (neurog1), mRNA.\n"
-"ACCESSION NM_131041\n"
-"VERSION NM_131041.1 GI:18859080\n"
-"KEYWORDS RefSeq.\n"
-"SOURCE Danio rerio (zebrafish)\n"
-" ORGANISM Danio rerio\n"
-" Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; "
-"Euteleostomi;\n"
-" Actinopterygii; Neopterygii; Teleostei; Ostariophysi;\n"
-" Cypriniformes; Cyprinidae; Danio.</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"The record is made of fields, which start with their name in upper case at "
-"the beginning of a line. Similarly to <a "
-"href=\"https://tools.ietf.org/html/rfc5322#section-2.2.3\">e-mail "
-"headers</a>, a field in the GenBank format includes the next line if this "
-"line starts with a space. Fields can be nested by indentation, as with the "
-"ORGANISM field above."
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"In the VERSION field, the version number is the first element (NM_131041.1 "
-"in the example above)."
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"The gene symbol is contained in the FEATURES field. In the record "
-"NM_131041.1, it starts like the following."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code>FEATURES Location/Qualifiers\n"
-" source 1..1399\n"
-" /organism="Danio rerio"\n"
-" /mol_type="mRNA"\n"
-" /db_xref="taxon:7955"\n"
-" /chromosome="14"\n"
-" /map="14"\n"
-" gene 1..1399\n"
-" /gene="neurog1"\n"
-" /gene_synonym="cb260; chunp6899; neurod3; ngn1; "
-"ngr1;\n"
-" zNgn1"\n"
-" /note="neurogenin 1"\n"
-" /db_xref="GeneID:30239"\n"
-" /db_xref="ZFIN:ZDB-GENE-990415-174"</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"The FEATURES field also contains indented sub-fields, but their names are "
-"not restricted to upper-case characters. These sub-fields are structured: "
-"their value starts with sequence coordinates, followed by a list of "
-"key-value pairs, each introduced by spaces and a slash '/'."
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"Lastly, GenBank records are terminated by a line that contains exactly two "
-"slashes (<code>//</code>) and nothing else."
-msgstr ""
-
-#. type: Content of: <h2>
-msgid "The program"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"This whole file is written using <a "
-"href=\"https://www.haskell.org/haskellwiki/Literate_programming\">literate "
-"Haskell</a>. It can actually be compiled! In literate Haskell, everything is "
-"a comment by default and the code is prefixed by '> '."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code class=\"sourceCode haskell\"><span class=\"kw\">import </span><span "
-"class=\"dt\">Text.Parsec</span>\n"
-"<span class=\"kw\">import </span><span "
-"class=\"dt\">Text.Parsec.String</span>\n"
-"<span class=\"kw\">import </span><span class=\"dt\">Data.List</span> "
-"(intercalate)\n"
-"<span class=\"kw\">import </span><span class=\"dt\">Data.List.Split</span> "
-"(splitOn)</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"I used the <a href=\"http://hackage.haskell.org/package/parsec\">Parsec</a> "
-"package for parsing, as well as two helper functions from packages related "
-"to list handling."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code class=\"sourceCode haskell\">main <span class=\"fu\">=</span> <span "
-"class=\"kw\">do</span>\n"
-" r <span class=\"ot\"><-</span> getContents\n"
-" <span class=\"kw\">let</span> rs <span class=\"fu\">=</span> splitOn <span "
-"class=\"st\">"//\\n"</span> r\n"
-" mapM_ parseGbRecord rs</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"It does not seem possible to parse a stream of text: the whole parsing must "
-"be finished before a result is returned, which was not feasible with "
-"hundreds of megabytes of data. Therefore, in the main loop getting the "
-"contents from <em>stdin</em>, I split the input into records and run the "
-"parser on each of them, one after the other."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code class=\"sourceCode haskell\"><span class=\"ot\">parseGbRecord "
-"::</span> <span class=\"dt\">String</span> <span class=\"ot\">-></span> "
-"<span class=\"dt\">IO</span> ()\n"
-"parseGbRecord r <span class=\"fu\">=</span> <span class=\"kw\">case</span> "
-"parse gbRecord <span class=\"st\">"(stdin)"</span> r <span "
-"class=\"kw\">of</span>\n"
-" <span class=\"dt\">Left</span> e <span "
-"class=\"ot\">-></span> <span class=\"kw\">do</span> putStrLn <span "
-"class=\"st\">"Error parsing input:"</span>\n"
-" print e\n"
-" <span class=\"dt\">Right</span> r <span "
-"class=\"ot\">-></span> putStrLn r</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid "Parsec returns either an error message or the result of the parsing."
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"TODO: how about always printing? What is the difference between print and "
-"putStrLn?"
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code class=\"sourceCode haskell\">gbRecord <span class=\"fu\">=</span> "
-"<span class=\"kw\">do</span>\n"
-" fs <span class=\"ot\"><-</span> many field\n"
-" return <span class=\"fu\">.</span> intercalate <span "
-"class=\"st\">"\\t"</span> <span class=\"fu\">$</span> filter "
-"(<span class=\"fu\">/=</span> <span class=\"st\">""</span>) "
-"fs</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"A GenBank record is made of many fields. For each field the parser reports a "
-"string, so the result will be a list of strings. Some strings will be empty "
-"and discarded. The resulting list is collapsed into a tab-separated string."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code class=\"sourceCode haskell\">field <span class=\"fu\">=</span> <span "
-"class=\"kw\">do</span>\n"
-" f <span class=\"ot\"><-</span> fieldName\n"
-" many1 space\n"
-" <span class=\"kw\">case</span> f <span class=\"kw\">of</span>\n"
-" <span class=\"st\">"VERSION"</span> <span "
-"class=\"ot\">-></span> getVersionNumber\n"
-" <span class=\"st\">"FEATURES"</span> <span "
-"class=\"ot\">-></span> getGeneSymbol\n"
-" _ <span class=\"ot\">-></span> endField <span "
-"class=\"fu\">>></span> return <span "
-"class=\"st\">""</span></code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"A field starts with a field name, which is recorded. It is followed by one "
-"or more spaces. Since I am only interested in the version and the gene "
-"symbol, more information is extracted only if the field name is VERSION or "
-"FEATURES; otherwise an empty string is returned."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code class=\"sourceCode haskell\">fieldName <span class=\"fu\">=</span> "
-"many1 upper\n"
-"endField <span class=\"fu\">=</span> manyTill anyChar (try separator <span "
-"class=\"fu\"><|></span> try eof)\n"
-"separator <span class=\"fu\">=</span> newline <span "
-"class=\"fu\">>></span> notFollowedBy (char <span class=\"ch\">' "
-"'</span>)</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"A field name is upper case. Fields continue with any character until a "
-"separator or the end of the file is reached. A separator is a newline "
-"character not followed by a space."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code class=\"sourceCode haskell\">getVersionNumber <span "
-"class=\"fu\">=</span> <span class=\"kw\">do</span>\n"
-" <span class=\"kw\">let</span> versionNumberChar <span "
-"class=\"fu\">=</span> oneOf <span class=\"fu\">$</span> <span "
-"class=\"st\">"NXRM_"</span> <span class=\"fu\">++</span> [<span "
-"class=\"ch\">'0'</span><span class=\"fu\">..</span><span "
-"class=\"ch\">'9'</span>] <span class=\"fu\">++</span> <span "
-"class=\"st\">"."</span>\n"
-" v <span class=\"ot\"><-</span> many versionNumberChar\n"
-" endField\n"
-" return v</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"A version number is a string containing the letters N, X, R or M, "
-"underscores, digits and dots. The precise definition is actually stricter: "
-"the version numbers start with NM_, NR_, XM_ or XR_, but I was worried about "
-"the performance hit of testing for this precisely."
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"Once the version number is read, it is recorded, and the rest of the field "
-"is discarded."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code class=\"sourceCode haskell\">getGeneSymbol <span class=\"fu\">=</span> "
-"<span class=\"kw\">do</span>\n"
-" <span class=\"kw\">let</span> geneSymbolChar <span class=\"fu\">=</span> "
-"oneOf <span class=\"fu\">$</span> [<span "
-"class=\"ch\">'A'</span><span class=\"fu\">..</span><span "
-"class=\"ch\">'Z'</span>] <span class=\"fu\">++</span> <span "
-"class=\"st\">"orf"</span> <span class=\"fu\">++</span> <span "
-"class=\"st\">"p"</span> <span class=\"fu\">++</span> <span "
-"class=\"st\">"-"</span> <span class=\"fu\">++</span> [<span "
-"class=\"ch\">'0'</span><span class=\"fu\">..</span><span "
-"class=\"ch\">'9'</span>]\n"
-" manyTill anyChar (try (string <span "
-"class=\"st\">"/gene=\\""</span>))\n"
-" g <span class=\"ot\"><-</span> many geneSymbolChar\n"
-" char <span class=\"ch\">'"'</span>\n"
-" endField\n"
-" return g</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"Similarly, a gene symbol is made of uppercase letters, lower-case o, r, f or "
-"p, dashes and digits. It is recorded after finding the string "
-"<code>/gene="</code>. The parser then checks that the closing double "
-"quote is present, and discards it together with the rest of the field."
-msgstr ""
-
-#. type: Content of: <h2>
-msgid "Quick-and-dirty parsing with Unix tools"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"With the following command, I could produce a list of version numbers and "
-"gene symbols in a minute or so. However, what it does is not very obvious, "
-"it is not flexible since it does not follow the format's syntax, and it is "
-"probably very fragile."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code>zcat human.*.rna.gbff.gz |\n"
-" grep -e VERSION -e gene= |\n"
-" uniq |\n"
-" sed 's/=/ /' |\n"
-" awk '{print $2}' |\n"
-" tr '\\n' '\\t' |\n"
-" sed -e 's/"\\t/\\n/g' -e 's/"//g' > "
-"human.rna.ID-Symbol.txt</code>"
-msgstr ""
-
-#. type: Content of: <p>
-msgid "In brief, it:"
-msgstr ""
-
-#. type: Content of: <ul><li><p>
-msgid ""
-"keeps the lines containing <code>VERSION</code> or <code>gene=</code> and "
-"discards the other ones;"
-msgstr ""
-
-#. type: Content of: <ul><li><p>
-msgid ""
-"removes consecutive duplicates (there are multiple features per locus "
-"crosslinking to gene IDs), assuming that only one gene symbol is used per "
-"record;"
-msgstr ""
-
-#. type: Content of: <ul><li><p>
-msgid ""
-"replaces the first <code>=</code> on each line with a space so that the "
-"relevant information appears in the second column if one considers the "
-"flowing text to be a space-separated table;"
-msgstr ""
-
-#. type: Content of: <ul><li><p>
-msgid ""
-"keeps only the second column (at that point, the version numbers and gene "
-"symbols alternate on successive lines);"
-msgstr ""
-
-#. type: Content of: <ul><li><p>
-msgid ""
-"replaces newlines with tabs, so that the data is now one gigantic line in "
-"which tab-separated fields alternate between version numbers and symbols;"
-msgstr ""
-
-#. type: Content of: <ul><li><p>
-msgid ""
-"uses the double quotes around the gene symbol as a landmark to transform the "
-"flowing text into a tab-separated file with the two wanted fields;"
-msgstr ""
-
-#. type: Content of: <ul><li><p>
-msgid "cleans up the remaining double quotes."
-msgstr ""
-
-#. type: Content of: <h2>
-msgid "Speed"
-msgstr ""
-
-#. type: Content of: <p>
-msgid ""
-"Unfortunately, the Haskell version is way too slow to process the full "
-"RefSeq data. Here is a comparison using a test file of only 476 KiB."
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code>$ ghc -O2 refSeqIdSymbol.lhs\n"
-"[1 of 1] Compiling Main ( refSeqIdSymbol.lhs, refSeqIdSymbol.o "
-")\n"
-"Linking refSeqIdSymbol ...\n"
-"chouca⁅~⁆$ time ./refSeqIdSymbol < hopla.gb \n"
-"NM_001142483.1 NREP\n"
-"NM_001142481.1 NREP\n"
-"NM_001142480.1 NREP\n"
-"NM_001142477.1 NREP\n"
-"NM_001142475.1 NREP\n"
-"NM_001142474.1 NREP\n"
-"NM_004772.2 NREP\n"
-"NM_001142466.1 GPT2\n"
-"NM_133443.2 GPT2\n"
-"NM_173685.2 NSMCE2\n"
-"NM_007058.3 CAPN11\n"
-"\n"
-"\n"
-"real 0m0.701s\n"
-"user 0m0.664s\n"
-"sys 0m0.028s</code>"
-msgstr ""
-
-#. type: Content of: <pre>
-#, no-wrap
-msgid ""
-"<code>$ time cat hopla.gb | grep -e VERSION -e gene= | uniq | sed "
-"'s/=/ /' | awk '{print $2}' | tr '\\n' "
-"'\\t' | sed -e 's/"\\t/\\n/g' -e "
-"'s/"//g'\n"
-"NM_001142483.1 NREP\n"
-"NM_001142481.1 NREP\n"
-"NM_001142480.1 NREP\n"
-"NM_001142477.1 NREP\n"
-"NM_001142475.1 NREP\n"
-"NM_001142474.1 NREP\n"
-"NM_004772.2 NREP\n"
-"NM_001142466.1 GPT2\n"
-"NM_133443.2 GPT2\n"
-"NM_173685.2 NSMCE2\n"
-"NM_007058.3 CAPN11\n"
-"\n"
-"real 0m0.015s\n"
-"user 0m0.004s\n"
-"sys 0m0.004s</code>"
-msgstr ""