Author Topic:   DNA similarity between Chimpanzee and Human 70%
Telesto
Junior Member (Idle past 3636 days)
Posts: 10
From: Zlín
Joined: 02-03-2014


Message 1 of 32 (719960)
02-18-2014 4:12 PM


There is an article in Answers Research Journal with title: Comprehensive Analysis of Chimpanzee and Human Chromosomes Reveals Average DNA Similarity of 70%
What do you think about this article? Wouldn't it be great to reproduce the data from publicly available resources using the BLASTN algorithm? I am very interested in reproducing the results. If we get the same data, we can use the same methodology for two species inside one baramin. I think it would be very interesting.
All resources (the blastn program, DNA sequences) are freely available.

Replies to this message:
 Message 2 by AdminNosy, posted 02-18-2014 4:33 PM Telesto has replied

  
Telesto


Message 3 of 32 (719962)
02-19-2014 6:56 AM
Reply to: Message 2 by AdminNosy
02-18-2014 4:33 PM


Re: More Please
There are several papers about DNA comparison of Human and Chimpanzee by comparing nucleotide bases:
Chimpanzee DNA Sequences Queried Against Human Genome | Answers Research Journal
Comprehensive Analysis of Chimp & Human Chromosomes | Answers Research Journal
All of them indicate large differences 70%-89% between Human and Chimpanzee, which is in contrast to the generally accepted difference between 94%-98%. This is presented as proof that Humans and Chimpanzees are not as closely related as has been presented for decades.
I was very interested in these papers. I am a software developer, so I like "playing" with algorithms. And because all the resources are free on the Internet, I was wondering if I could reproduce the results.
I think these numbers were obtained by a different method than the earlier high-similarity results, so I think other reference data is needed. But I am having difficulty obtaining the results presented in the papers above, so I was wondering if there is anybody who can help.
I have several goals:
1) Opinions: I would like to know what others think about this research and the methodology used.
2) Verification: I would like to verify the data from the papers, to be sure my method is exactly the same as the one used in the papers.
3) Test: Because I am suspicious (my preliminary results are far from the numbers presented in the paper), I would like to do a method verification test, e.g. compare Human-Human DNA.
4) Further research: Use the same method to compare DNA between species inside one baramin (e.g. mouse and rat).
I think sharing information, advice, tips and hints in this discussion would be beneficial. We can find some interesting results together.
What do you think?

This message is a reply to:
 Message 2 by AdminNosy, posted 02-18-2014 4:33 PM AdminNosy has replied

Replies to this message:
 Message 4 by AdminNosy, posted 02-19-2014 11:50 AM Telesto has not replied
 Message 9 by Dr Adequate, posted 02-19-2014 1:55 PM Telesto has replied

  
Telesto


Message 13 of 32 (720007)
02-19-2014 4:03 PM
Reply to: Message 8 by Taq
02-19-2014 1:26 PM


Re: One word: Gaps
Hi Taq,
Gapped alignment
Species A: TATA-AGCGTAGGCAAT
Species B: CATAGAGCGTAGGCAAT
With this alignment, there is a one base indel and one substitution mutation at the very beginning for an overall identity of 15/17, or 88%. Now for the ungapped alignment.
Species A: TATAAGCGTAGGCAAT
Species B: CATAGAGCGTAGGCAAT
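For reference, the two percentages in the quoted example can be checked with a few lines of Python. This is only a sketch: the helper name `column_identity` is made up here, a gap column is assumed never to count as a match, and the denominator is assumed to be the length of the longer row.

```python
def column_identity(row_a, row_b):
    """Return (matching columns, total columns) for two aligned rows.

    Assumptions: '-' marks a gap and never counts as a match; the total is
    the length of the longer row, so unpaired trailing bases count against
    identity.
    """
    total = max(len(row_a), len(row_b))
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return matches, total


gapped = column_identity("TATA-AGCGTAGGCAAT", "CATAGAGCGTAGGCAAT")
ungapped = column_identity("TATAAGCGTAGGCAAT", "CATAGAGCGTAGGCAAT")

for label, (matches, total) in (("gapped", gapped), ("ungapped", ungapped)):
    print(f"{label}: {matches}/{total} = {100 * matches / total:.0f}%")
```

The gapped rows give 15/17 (88%); comparing the ungapped rows position by position gives 5/17 (29%).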
Thanks for the reply. I am not sure that the blastn algorithm computes the sequence as you described. I think the algorithm rather splits the sequence into two parts and compares them separately. No doubt neither case is very useful.
I tried to use the same arguments as they did in their research. I changed only word_size to 50; 11 was too small and took a lot of time and memory. I tried to compare the whole chromosomes Chimpanzee Y and Human Y (they cut the chimp chromosome into slices 100-450 bases long - I have no idea why). There should be only 43% similarity. However, after a few minutes I got completely confusing results.
First of all, there is no ONE result number. The algorithm compared about 650000 sequences with about 400 million bases in total (Human Y is 60 million bases long, Chimp Y is about 20 million bases long), so the compared sequences overlapped many times. I also got mismatched bases. Each sequence had (in my case) a percentage identity, a sequence length and a number of mismatched bases. For example:
97.3% 4552 105
The shortest sequence was 50 bases long (according to the word_size attribute). The lowest identity percentage was 69% and the highest 100%.
I really don't know how to get ONE number representing overall similarity from this set of data. The research doesn't say anything about how overall similarity is calculated from all of the data.
I made the only logical move: I summed the number of bases in all sequences (about 400 million) and compared it to the sum of all mismatched bases. With this method I got 93% similarity for chromosome Y.
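The pooling described above can be sketched in Python as follows. It assumes the hits were written as whitespace-separated rows of percent identity, alignment length and mismatch count (the three columns listed above); the function name `pooled_similarity` and the sample rows are made up for illustration and are not real BLAST output.

```python
def pooled_similarity(lines):
    """Pool all hits: 1 - (sum of mismatches) / (sum of aligned bases).

    Each line is assumed to hold: percent identity, alignment length,
    mismatch count. Overlapping hits are counted as often as they appear.
    """
    total_bases = 0
    total_mismatches = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        total_bases += int(fields[1])
        total_mismatches += int(fields[2])
    if total_bases == 0:
        raise ValueError("no hits to pool")
    return 1.0 - total_mismatches / total_bases


# Hypothetical rows, not real BLAST output.
sample = ["100.00 50 0", "98.00 50 1", "90.00 100 10"]
print(f"{100 * pooled_similarity(sample):.2f}%")
```

Because every reported hit is pooled, a region that matches in several places is counted several times, so the result depends on which hits the search reports.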
I didn't try this with gaps (indels) - I used the -ungapped parameter as they did. I think the number would be similar anyway. More interestingly, I used this method to compare the Human Y chromosome with the Human Y chromosome (yes - exactly the same chromosome) and the overall similarity was about 97%!! Not the real 100%. So... it looks like this method is completely useless.
However, I didn't get 43% similarity. Maybe this is caused by comparing the whole chromosome and not only slices 100-450 bases long. Anyway, I tried to compare Chimp slices 100 bases long and the overall similarity was around 80%-90%. But I didn't try it for the whole chromosome - only a few slices.
Please, do you have any idea how they got the overall similarity with the blastn algorithm? Do you know this program?

This message is a reply to:
 Message 8 by Taq, posted 02-19-2014 1:26 PM Taq has replied

Replies to this message:
 Message 16 by RAZD, posted 02-19-2014 4:35 PM Telesto has replied
 Message 22 by Taq, posted 02-19-2014 6:28 PM Telesto has replied

  
Telesto


Message 15 of 32 (720011)
02-19-2014 4:27 PM
Reply to: Message 9 by Dr Adequate
02-19-2014 1:55 PM


Re: More Please
Hi Dr Adequate,
Psst ... you mean similarities, not differences.
Oh... sure. Thanks
As Taq points out, the creationists are ignoring the possibility of indels, so they get a different and less meaningful figure.
In theory - yes. However, I think that the blastn program does something else. It does not compare base after base. It searches the reference sequence and tries to find similarity everywhere. The algorithm compares overlapping sequences many times and tries to find the best match, from the longest sequences to the shortest. The only difference between gapped and ungapped sequences would be in the total length of the longest sequences (that is my guess - I didn't try it).
Anyway, I tried comparing ungapped sequences and the result (useless in my opinion) is still far above 43%. So where the hell did they get this number?
An excellent idea.
After a few hours of playing with the blastn program I have a better one: compare two identical chromosomes - the expected result is 100%. But first of all I need to get the 43% similarity for chromosome Y. Once I verify the methodology I can go on with the other jobs.
From the data I can find, humans and chimps should be further apart than the two sequenced species of macaques, which belong to the same genus; maybe a little further apart than a domestic cat and a tiger; and closer than a rat and a mouse. If the latter is the case, the creationists would be hoist on their own petard.
Yes... rat and mouse are more different than human and chimp. And for both, the genome is already sequenced and available for free.
The paper speaks approvingly of this guy. The web page makes his perl scripts freely available, so it should be easy enough to re-use his techniques on other genomes.
I am not sure if this algorithm was used in this particular research. But I will look at the script and try to reproduce those numbers.

This message is a reply to:
 Message 9 by Dr Adequate, posted 02-19-2014 1:55 PM Dr Adequate has replied

Replies to this message:
 Message 17 by Taq, posted 02-19-2014 4:52 PM Telesto has replied
 Message 19 by Dr Adequate, posted 02-19-2014 5:20 PM Telesto has replied

  
Telesto


Message 20 of 32 (720025)
02-19-2014 6:21 PM
Reply to: Message 17 by Taq
02-19-2014 4:52 PM


Re: More Please
First of all... I think we don't understand each other. Probably it is caused by my English - as you have noticed, I am not a native speaker.
That isn't true. Go back to my post in message #8. Using those sequences, the best hit for the gapped alignment would be 88% similarity. The best hit for the ungapped alignment would be 29%.
I completely understand. But I think that this is not the case for the blastn algorithm used in the research. I made a similar experiment. I created two identical strings 50 bases long. Then I deleted one base in the second one at position 25, so that the second string has only 49 bases and is shifted by one base.
I understand what you have told me about overall differences. But let's try to use blastn with the parameters -ungapped and -word_size 11. The results are below (numbers: percent identical, sequence length, mismatched bases):
1) For identical strings - 1 hit
100.00 50 0
2) Second string shortened in the middle - 2 hits
100.00 25 0
100.00 24 0
3) One base changed in the middle - 1 hit
98.00 50 1
These are the results from blastn. What now? What are they saying?
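Case 2 can be imitated with a toy comparison in Python. This is not BLAST: the sketch below only reports maximal runs of exactly matching bases along each diagonal, and the function name `exact_runs`, the 50-base sequence and the 11-base minimum are all made up for illustration.

```python
def exact_runs(query, subject, min_len=11):
    """Maximal exact-match runs on every diagonal.

    Returns (run_length, query_start, subject_start) tuples. No gaps are
    allowed, so an indel moves the match onto a neighbouring diagonal.
    """
    runs = []
    for shift in range(-(len(query) - 1), len(subject)):
        q = max(0, -shift)  # start position in query on this diagonal
        s = max(0, shift)   # start position in subject on this diagonal
        run = 0
        while q < len(query) and s < len(subject):
            if query[q] == subject[s]:
                run += 1
            else:
                if run >= min_len:
                    runs.append((run, q - run, s - run))
                run = 0
            q += 1
            s += 1
        if run >= min_len:
            runs.append((run, q - run, s - run))
    return runs


# Made-up 50-base sequence, not real genomic data.
REF = "ACGTTGCAAG" "CTTAGGCTAA" "CGGATCGTGA" "ATTCGCATGC" "CTAGTCAGTA"
DELETED = REF[:25] + REF[26:]  # drop one base after the first 25

print(exact_runs(REF, REF))      # identical strings
print(exact_runs(DELETED, REF))  # one base deleted in the middle
```

For the shortened string the output includes a 25-base run and a 24-base run on neighbouring diagonals, mirroring the two hits in case 2. The toy would also split case 3 into two runs, because it tolerates no mismatch at all; as far as I understand, BLAST's ungapped extension does tolerate mismatches, which would explain why blastn reports case 3 as a single 50-base hit.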
No, it isn't. Different chromsomes have diverged at different rates. There is no expectation that the similarities will be the same for a comparison of any two chromosomes.
I was talking about exactly the same chromosomes (e.g. Human Y vs. Human Y).
The Y chromosome has 50 million bases, or just 1.6% of the total genome. You do know this, right?
Sure, I know. I chose this chromosome because of its length and because it had the smallest similarity in the research. I know it has little impact on the whole genome. But in the paper they also reported chromosome Y separately and their result was 43%. I tried to get this number too.
More importantly, chimp and gorilla are more different than chimp and human. Chimp and orangutan are more different than chimp and human. No species is closer to chimps than humans.
I meant the difference between rat vs. mouse is larger than between chimp vs. human.

This message is a reply to:
 Message 17 by Taq, posted 02-19-2014 4:52 PM Taq has not replied

  
Telesto


Message 21 of 32 (720026)
02-19-2014 6:25 PM
Reply to: Message 19 by Dr Adequate
02-19-2014 5:20 PM


Re: More Please
It wasn't, but they cite it approvingly and the code is there for you to use.
You're right. They didn't use it. However, I tried to use these scripts and it seems they calculate something (I hate Perl).
I tried to use them on some reference sequences but I failed. I am not sure what values I should set. Perl is quite difficult for me to read.

This message is a reply to:
 Message 19 by Dr Adequate, posted 02-19-2014 5:20 PM Dr Adequate has not replied

  
Telesto


Message 23 of 32 (720029)
02-19-2014 6:32 PM
Reply to: Message 16 by RAZD
02-19-2014 4:35 PM


Re: One word: Gaps
Hi RAZD,
So they take one as the baseline and then compare the second one starting with matching both at one end and then shifting the second one along the first one base at a time, recording the degree of matching for each step.
Yes, I understand. Do you think that it is possible to get overall genetic similarity with such a method (gapped or ungapped)? I think the blast algorithm was not created for this purpose. Anyway, I would like to get the numbers from the research (even if they are wrong).
The bad thing is that I don't know what to do with all the numbers I got. What is the algorithm to get one number that represents overall similarity? I always get thousands of numbers. How did they get 43% from these numbers? I have no idea...

This message is a reply to:
 Message 16 by RAZD, posted 02-19-2014 4:35 PM RAZD has seen this message but not replied

Replies to this message:
 Message 24 by Taq, posted 02-19-2014 7:01 PM Telesto has not replied

  
Telesto


Message 28 of 32 (720097)
02-20-2014 10:46 AM
Reply to: Message 22 by Taq
02-19-2014 6:28 PM


Re: One word: Gaps
Hi Taq,
It does. If you leave out gaps you will have a much lower score than if gaps are included.
Well, I made a simple application that works as follows:
1) The reference (subject) chromosome is the Human chromosome.
2) It takes 500 subsequences from the Chimp chromosome, each 300 bases long (as in your quoted comment).
3) Blastn uses these attributes (as used in the creationist research paper): -word_size 11 -evalue 10 -num_alignments 1 -dust no -soft_masking false
4) The parameter "-ungapped" is optional; I ran two experiments, one with this parameter and one without it.
5) And the calculation. I have no idea what I should calculate, but I made a few calculations:
a) First of all I check whether the Chimp subsequence matched. I am not sure what can be considered a MATCH. I guess a match means the whole subsequence was found. In this case a match is 300 bases long (or longer if I use gaps). If the matched subsequence was shorter I counted it as "not match". For example:
Best sequence is 298 bases long with 5 mismatches - NOT match
Best sequence is 300 bases long with 2 mismatches - MATCH
In the end I calculated the percentage of matched sequences according to the above logic. I think this number has nothing to do with whole genome comparison. It just says how many similar subsequences of the Chimp chromosome, 300 (or more) bases long, were found in the Human chromosome.
b) Then I tried to calculate some relevant similarity percentage. The first number was taken only from matched subsequences. Subsequences shorter than 300 bases were completely ignored. From these numbers I take the best match: the longest sequence with the lowest number of mismatches. Example:
300 - 5
298 - 1
300 - 2
the winner is 300 - 2
I summed all these bases and compared them with the summed mismatches.
I think this is not very useful. It ignores shorter sequences that were found. For example, if the best match in the result file is 289 - 2, it is ignored.
c) The next number also took shorter sequences into account, but padded to the full length: the missing bases were counted as mismatches. For example:
The best match from the result file, 289 - 2, was recalculated to 300 - 13
Not sure if this is right...
d) The next number was taken from the numbers as they were in the result file. Example:
The best match from the result file, 289 - 2, was not changed. In the end it was counted with exactly the same number of bases and mismatches. No changes...
e) The last number was also calculated from all steps in the experiment - matched (300 or longer) and not matched (shorter) sequences. However, if the sequence was marked as not matched (shorter), it was counted as completely wrong. Example:
A best match of 289 - 3 was marked as not aligned and counted in the sum as 300 - 300 (300 bases long with 300 mismatches = 0% similarity)
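The five calculations a) to e) can be sketched in Python as below. This is a reading of the description above, not the original application: the function name `summarize`, the dictionary keys and the four sample hits are made up, and it assumes every slice produced exactly one best hit given as (aligned length, mismatches).

```python
def summarize(best_hits, slice_len=300):
    """Compute calculations a) to e) from one best hit per query slice.

    best_hits: list of (aligned_length, mismatches) pairs (assumed input).
    A hit counts as a MATCH when it is at least slice_len bases long.
    """
    if not best_hits:
        raise ValueError("no hits given")

    def similarity(pairs):
        bases = sum(length for length, _ in pairs)
        mismatches = sum(mm for _, mm in pairs)
        return 1.0 - mismatches / bases if bases else float("nan")

    matched = [(n, mm) for n, mm in best_hits if n >= slice_len]
    # c) pad short hits to the full slice; missing bases become mismatches
    padded = [(max(n, slice_len), mm + max(0, slice_len - n))
              for n, mm in best_hits]
    # e) short hits are counted as a full slice of mismatches
    penalized = [(n, mm) if n >= slice_len else (slice_len, slice_len)
                 for n, mm in best_hits]
    return {
        "a_matched_fraction": len(matched) / len(best_hits),
        "b_matched_only": similarity(matched),
        "c_padded_to_slice": similarity(padded),
        "d_as_reported": similarity(best_hits),
        "e_full_penalty": similarity(penalized),
    }


# Hypothetical best hits, reusing the example pairs from the text.
hits = [(300, 2), (300, 5), (289, 2), (298, 5)]
for name, value in summarize(hits).items():
    print(f"{name}: {100 * value:.2f}%")
```

When all matched hits are exactly one slice long, e) works out to the matched fraction from a) multiplied by the matched-only similarity from b), so it mostly measures how many slices aligned over their full length rather than how similar the aligned bases are.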
And here are results for chromosome Y:
1) Ungapped!
a) Matched vs. all: 234/500 => 46.80%
b) Only matched similarity: 97.8%
c) All results, calculated as full 300 bases long: 81.43%
d) All results as they were, no recalculation: 96.03%
e) All results, with 100% penalty: 45.77%
2) Gapped
a) Matched vs. all: 359/500 => 71.80%
b) Only matched similarity: 97.12%
c) All results, calculated as full 300 bases long: 91.14%
d) All results as they were, no recalculation: 95.76%
e) All results, with 100% penalty: 69.81%
So... what is right and what is wrong? The only thing I can see is the number 45.77% similarity, which is very close to the 43% reported in the research paper. Of course this number is nonsense - but that is another story
I hope you understand my "methodology". Or is there a better approach?
As you can see, gapped was better, but not by much. I think the most representative number is d), calculated as it was, with no recalculation and no penalty. But there the results were better with the ungapped parameter (96.03%) than with gaps (95.76%). Both are very close, though.
I would like to do the same experiment for chromosome 1. But it will take much more time, as it is 250 MB large (in contrast to the 60 MB of human chromosome Y).

This message is a reply to:
 Message 22 by Taq, posted 02-19-2014 6:28 PM Taq has replied

Replies to this message:
 Message 29 by Taq, posted 02-20-2014 11:04 AM Telesto has replied

  
Telesto


Message 30 of 32 (720111)
02-20-2014 11:20 AM
Reply to: Message 29 by Taq
02-20-2014 11:04 AM


Re: One word: Gaps
Not much? You went from 47% to 72% for matches. I would call that a pretty massive jump, especially given that Tomkins is comparing a 70% match to 95% similarity.
Well, for matching, yes. But I hoped for almost 100% according to sfs's results. But I know that the human Y chromosome is the most diverse. So I will wait for the results for other chromosomes.
But I am really curious about these numbers. Do you really think that Tomkins compares the number of matches with similarity? Unbelievable... I hoped not, but from my preliminary results it really looks like he did.
I will try to contact sfs

This message is a reply to:
 Message 29 by Taq, posted 02-20-2014 11:04 AM Taq has replied

Replies to this message:
 Message 32 by Taq, posted 02-20-2014 11:31 AM Telesto has not replied

  
Copyright 2001-2023 by EvC Forum, All Rights Reserved