From: admin
Date: Sun, 21 Dec 2014 04:48:40 +0000 (+0000)
Subject: updated PO files
X-Git-Url: https://source.charles.plessy.org/?a=commitdiff_plain;h=a1ed565c0b6edb96ba914b3a81e54cca11fc5de2;p=source.git

updated PO files
---

diff --git a/Haskell/refSeqIdSymbol.en.po b/Haskell/refSeqIdSymbol.en.po
new file mode 100644
index 00000000..28ffb7a5
--- /dev/null
+++ b/Haskell/refSeqIdSymbol.en.po
@@ -0,0 +1,469 @@
+# SOME DESCRIPTIVE TITLE
+# Copyright (C) YEAR Free Software Foundation, Inc.
+# This file is distributed under the same license as the PACKAGE package.
+# FIRST AUTHOR , YEAR.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PACKAGE VERSION\n"
+"POT-Creation-Date: 2014-12-21 04:48+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"Language: \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=UTF-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+
+#. type: Content of:
+msgid "A RefSeq parser that outputs the gene symbol for each ID"
+msgstr ""
+
+#. type: Content of:
+msgid "The goal"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"RefSeq is a database of "
+"reference biological (DNA, RNA or protein) sequences made "
+"by the National Center for Biotechnology Information (NCBI, USA). In a "
+"project at work we wanted to list, for each entry in the RNA database, the "
+"unique version number of the entry and its gene symbol (a short "
+"human-readable name)."
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"This looked like an excellent excuse to practice Haskell: this is the first "
+"program I wrote in that language. I have not yet managed to make it run "
+"fast enough compared to a minimalistic strategy using shell commands."
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"Here I describe the format that I want to parse, the Haskell program I "
+"wrote, and the quick-and-dirty processing with shell commands."
+msgstr ""
+
+#. type: Content of:
+msgid "The GenBank format"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"Human sequences can be downloaded from NCBI's FTP "
+"site, in a structured format called GenBank. "
+"RefSeq entries typically start like the following one."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"LOCUS       NM_131041               1399 bp    mRNA    linear   VRT "
+"28-SEP-2014\n"
+"DEFINITION  Danio rerio neurogenin 1 (neurog1), mRNA.\n"
+"ACCESSION   NM_131041\n"
+"VERSION     NM_131041.1  GI:18859080\n"
+"KEYWORDS    RefSeq.\n"
+"SOURCE      Danio rerio (zebrafish)\n"
+"  ORGANISM  Danio rerio\n"
+"            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; "
+"Euteleostomi;\n"
+"            Actinopterygii; Neopterygii; Teleostei; Ostariophysi;\n"
+"            Cypriniformes; Cyprinidae; Danio."
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"The record is made of fields, which start with their name in upper case at "
+"the beginning of a line. As with e-mail "
+"headers, a field in the GenBank format includes the next line if this "
+"line starts with a space. Fields can be nested by indentation, as with the "
+"ORGANISM field above."
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"In the VERSION field, the version number is the first element (NM_131041.1 "
+"in the example above)."
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"The gene symbol is contained in the FEATURES field. In the record "
+"NM_131041.1, it starts like the following."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"FEATURES             Location/Qualifiers\n"
+"     source          1..1399\n"
+"                     /organism="Danio rerio"\n"
+"                     /mol_type="mRNA"\n"
+"                     /db_xref="taxon:7955"\n"
+"                     /chromosome="14"\n"
+"                     /map="14"\n"
+"     gene            1..1399\n"
+"                     /gene="neurog1"\n"
+"                     /gene_synonym="cb260; chunp6899; neurod3; ngn1; "
+"ngr1;\n"
+"                     zNgn1"\n"
+"                     /note="neurogenin 1"\n"
+"                     /db_xref="GeneID:30239"\n"
+"                     /db_xref="ZFIN:ZDB-GENE-990415-174""
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"The FEATURES field also contains indented sub-fields, but their names are "
+"not restricted to upper-case characters. These sub-fields are structured: "
+"their value starts with sequence coordinates, followed by a list of keys "
+"and values, where each key-value pair is introduced by spaces and a "
+"slash '/'."
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"Lastly, GenBank records are terminated by a line that contains exactly two "
+"slashes (//) and nothing else."
+msgstr ""
+
+#. type: Content of:
+msgid "The program"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"This whole file is written using literate "
+"Haskell. It can actually be compiled! In literate Haskell, everything is "
+"a comment by default and the code is prefixed by '> '."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"import Text.Parsec\n"
+"import Text.Parsec.String\n"
+"import Data.List "
+"(intercalate)\n"
+"import Data.List.Split "
+"(splitOn)"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"I used the Parsec "
+"package for parsing, as well as two helper functions from packages related "
+"to list handling."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"main = do\n"
+"  r <- getContents\n"
+"  let rs = splitOn "//\\n" r\n"
+"  mapM_ parseGbRecord rs"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"It does not seem possible to parse a stream of text: the whole parsing must "
+"be finished before a result is returned, which was not feasible with "
+"hundreds of megabytes of data. Therefore, in the main loop getting the "
+"contents from stdin, I split the records one after the other, and "
+"run the parser on each of them."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"parseGbRecord "
+":: String -> "
+"IO ()\n"
+"parseGbRecord r = case "
+"parse gbRecord "(stdin)" r of\n"
+"            Left e  -> do putStrLn "Error parsing input:"\n"
+"                          print e\n"
+"            Right r ->  putStrLn r"
+msgstr ""
+
+#. type: Content of:
+msgid "Parsec returns either an error message or the result of the parsing."
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"TODO: how about always printing? What is the difference between print and "
+"putStrLn?"
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"gbRecord = "
+"do\n"
+"  fs <- many field\n"
+"  return . intercalate "\\t" $ filter "
+"(/= "") "
+"fs"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"A GenBank record is made of many fields. For each field the parser reports a "
+"string, so the result will be a list of strings. Some strings will be empty "
+"and discarded. The resulting list is collapsed into a tab-separated string."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"field = do\n"
+"  f <- fieldName\n"
+"  many1 space\n"
+"  case f of\n"
+"    "VERSION"  -> getVersionNumber\n"
+"    "FEATURES" -> getGeneSymbol\n"
+"    _          -> endField >> return """
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"A field starts with a field name, which is recorded. It is followed by "
+"multiple spaces. Since I am only interested in the version and the gene "
+"symbol, more information is extracted if the field name is VERSION or "
+"FEATURES; otherwise an empty string is returned."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"fieldName = "
+"many1 upper\n"
+"endField  = manyTill anyChar (try separator <|> try eof)\n"
+"separator = newline >> notFollowedBy (char ' "
+"')"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"A field name is upper case. Fields continue with any character until a "
+"separator or the end of the file is reached. A separator is a newline "
+"character not followed by a space."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"getVersionNumber = do\n"
+"  let versionNumberChar = oneOf $ "NXRM_" ++ ['0'..'9'] ++ "."\n"
+"  v <- many versionNumberChar\n"
+"  endField\n"
+"  return v"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"A version number is a string containing the letters N, X, R or M, "
+"underscores, digits and dots. The precise definition is actually stricter: "
+"the version numbers start with NM_, NR_, XM_ or XR_, but I was worried "
+"about the performance hit of testing for this precisely."
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"Once the version number is read, it is recorded, and the rest of the field "
+"is discarded."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"getGeneSymbol = "
+"do\n"
+"  let geneSymbolChar = "
+"oneOf $ ['A'..'Z'] ++ "orf" ++ "p" ++ "-" ++ ['0'..'9']\n"
+"  manyTill anyChar (try (string "/gene=\\""))\n"
+"  g <- many geneSymbolChar\n"
+"  char '"'\n"
+"  endField\n"
+"  return g"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"Similarly, a gene symbol is made of uppercase letters, lower-case o, r, f or "
+"p, dashes and digits. It is recorded after finding the string "
+"/gene=". The parser then checks that the closing double "
+"quote is present, and discards it together with the rest of the field."
+msgstr ""
+
+#. type: Content of:
+msgid "Quick-and-dirty parsing with Unix tools"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"With the following command, I could produce a list of version numbers and "
+"gene symbols in a minute or so. However, what it does is not very obvious, "
+"it is not flexible as it does not follow the format's syntax, and it is "
+"probably very fragile."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"zcat human.*.rna.gbff.gz |\n"
+"  grep -e VERSION -e gene= |\n"
+"  uniq |\n"
+"  sed 's/=/ /' |\n"
+"  awk '{print $2}' |\n"
+"  tr '\\n' '\\t' |\n"
+"  sed -e 's/"\\t/\\n/g' -e 's/"//g' > "
+"human.rna.ID-Symbol.txt"
+msgstr ""
+
+#. type: Content of:
+msgid "In brief, it:"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"keeps only the lines containing VERSION or gene= and "
+"discards the others;"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"removes consecutive duplicates (there are multiple features per locus "
+"crosslinking to gene IDs), assuming that only one gene symbol is used per "
+"record;"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"replaces = with a space so that the relevant information "
+"appears in the second column if one considers the flowing text to be a "
+"space-separated table;"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"keeps only the second column (at that point, the version numbers and gene "
+"symbols alternate on successive lines);"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"replaces newlines with tabs, so that the data is now one gigantic line "
+"where tab-separated fields alternate between version numbers and symbols;"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"uses the double quotes around the gene symbol as a landmark to transform the "
+"flowing text into a tab-separated file with the two wanted fields;"
+msgstr ""
+
+#. type: Content of:
+msgid "cleans up the remaining double quotes."
+msgstr ""
+
+#. type: Content of:
+msgid "Speed"
+msgstr ""
+
+#. type: Content of:
+msgid ""
+"Unfortunately, the Haskell version is way too slow to process the full "
+"RefSeq data. Here is a comparison using a test file of only 476 KiB."
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"$ ghc -O2 refSeqIdSymbol.lhs\n"
+"[1 of 1] Compiling Main             ( refSeqIdSymbol.lhs, refSeqIdSymbol.o "
+")\n"
+"Linking refSeqIdSymbol ...\n"
+"chouca⁅~⁆$ time ./refSeqIdSymbol < hopla.gb \n"
+"NM_001142483.1  NREP\n"
+"NM_001142481.1  NREP\n"
+"NM_001142480.1  NREP\n"
+"NM_001142477.1  NREP\n"
+"NM_001142475.1  NREP\n"
+"NM_001142474.1  NREP\n"
+"NM_004772.2 NREP\n"
+"NM_001142466.1  GPT2\n"
+"NM_133443.2 GPT2\n"
+"NM_173685.2 NSMCE2\n"
+"NM_007058.3 CAPN11\n"
+"\n"
+"\n"
+"real    0m0.701s\n"
+"user    0m0.664s\n"
+"sys 0m0.028s"
+msgstr ""
+
+#. type: Content of:
+#, no-wrap
+msgid ""
+"$ time cat hopla.gb | grep -e VERSION -e gene= |   uniq |   sed "
+"'s/=/ /' |   awk '{print $2}' |   tr '\\n' "
+"'\\t' |   sed -e 's/"\\t/\\n/g' -e "
+"'s/"//g'\n"
+"NM_001142483.1  NREP\n"
+"NM_001142481.1  NREP\n"
+"NM_001142480.1  NREP\n"
+"NM_001142477.1  NREP\n"
+"NM_001142475.1  NREP\n"
+"NM_001142474.1  NREP\n"
+"NM_004772.2 NREP\n"
+"NM_001142466.1  GPT2\n"
+"NM_133443.2 GPT2\n"
+"NM_173685.2 NSMCE2\n"
+"NM_007058.3 CAPN11\n"
+"\n"
+"real    0m0.015s\n"
+"user    0m0.004s\n"
+"sys 0m0.004s"
+msgstr ""
diff --git a/kunpuu/sources.en.po b/kunpuu/sources.en.po
new file mode 100644
index 00000000..a62ac37e
--- /dev/null
+++ b/kunpuu/sources.en.po
@@ -0,0 +1,23 @@
+# SOME DESCRIPTIVE TITLE
+# Copyright (C) YEAR Free Software Foundation, Inc.
+# This file is distributed under the same license as the PACKAGE package.
+# FIRST AUTHOR , YEAR.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PACKAGE VERSION\n"
+"POT-Creation-Date: 2014-12-21 04:48+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"Language: \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=UTF-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+
+#. type: Content of: outside any tag (error?)
+msgid ""
+"http://photo.charles.plessy.org/kunpuu"
+msgstr ""