From skvidal@phy.duke.edu Thu Dec 4 23:22:52 2003
Return-Path:
Delivered-To: mstenner@phy.duke.edu
Received: from [192.168.0.12] (rdu57-255-028.nc.rr.com [66.57.255.28])
    by mail.phy.duke.edu (Postfix) with ESMTP id 559AAA77D2
    for ; Thu, 4 Dec 2003 23:22:52 -0500 (EST)
Subject: urlgrabber
From: seth vidal
To: Michael Stenner
Content-Type: text/plain
Message-Id: <1070597975.9147.12.camel@binkley>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.5 (1.4.5-7)
Date: Thu, 04 Dec 2003 23:19:36 -0500
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, hits=-0.5 required=4.0
    tests=AWL,SPAM_PHRASE_00_01
    version=2.44
X-Spam-Level:
Lines: 10
Status: RO
X-Status: A
X-Keywords:

Hey,

I hope you don't mind, but I think you didn't when we talked about it
before. Ryan Tomayko wanted to know if I wanted some help on things
and I told him to go ahead and start implementing your specification
for urlgrabber and email code or questions to both of us or the yum
list.

see you tomorrow.
-sv

From rtomayko@naeblis.cx Fri Dec 5 23:30:32 2003
Return-Path:
Delivered-To: mstenner@phy.duke.edu
Received: from ms-smtp-01-eri0.ohiordc.rr.com (ms-smtp-01-smtplb.ohiordc.rr.com [65.24.5.135])
    by mail.phy.duke.edu (Postfix) with ESMTP id 765A5A77CB;
    Fri, 5 Dec 2003 23:30:32 -0500 (EST)
Received: from asha.fade.naeblis.cx (dhcp065-024-052-214.columbus.rr.com [65.24.52.214])
    by ms-smtp-01-eri0.ohiordc.rr.com (8.12.10/8.12.7) with ESMTP id hB64UV5H021832;
    Fri, 5 Dec 2003 23:30:31 -0500 (EST)
Received: from [192.168.1.101] (daishar.fade.naeblis.cx [192.168.1.101])
    by asha.fade.naeblis.cx (8.12.8/8.12.8) with ESMTP id hB64UAHF032465;
    Fri, 5 Dec 2003 23:30:10 -0500
Subject: Re: urlgrabber
From: Ryan Tomayko
To: Michael Stenner
Cc: Seth Vidal
In-Reply-To: <20031205130814.GA435@phy.duke.edu>
References: <1070611124.9147.32.camel@binkley>
    <1070597975.9147.12.camel@binkley>
    <20031205130814.GA435@phy.duke.edu>
Content-Type: text/plain
Message-Id: <1070685010.28826.221.camel@daishar.fade.naeblis.cx>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.5 (1.4.5-7)
Date: Fri, 05 Dec 2003 23:30:10 -0500
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: Symantec AntiVirus Scan Engine
X-Spam-Status: No, hits=-1.2 required=4.0
    tests=AWL,IN_REP_TO,QUOTED_EMAIL_TEXT,REFERENCES,
    SPAM_PHRASE_01_02
    version=2.44
X-Spam-Level:
Lines: 65
Status: RO
X-Status: A
X-Keywords:

On Fri, 2003-12-05 at 08:08, Michael Stenner wrote:
> Mind? On the contrary, I've seen Ryan's code and seen his discourse
> on yum-list. I'm quite happy.

Good!

> urllib2 is designed to be subclassed in order to add
> on features. I never really went this route (except in a few very
> minor cases), but I think it's the right way to go. I STRONGLY
> encourage you (Ryan) to read urllib2.py and understand how it works,
> and then implement much of the low-level stuff (like reget) at that
> level.

Yep. I had a chance to look at this a little bit yesterday and I came
to the same conclusion.
I also noticed that subclassing urllib2 handlers was already an
established pattern in urlgrabber with keep-alive and auth being
implemented in this fashion. As a matter of fact, I didn't see much of
a choice for certain things like FTP range support. With HTTP, it looks
like range support could be as simple as creating a Request object to
pass into urlopen and adding a { 'range' : 'bytes=200-500' } to the
header hash. But, when you move over to FTP, there's just no way that I
could see to get this without adding in a handler. So, I'm rambling,
but the point is that your statement is a very good one and I'm in full
agreement.

> You can keep the interface the same.

One of the things I had on a short list was the possibility of moving to
the **kwargs interface for the urlXXX methods, as was first put forth by
your design document. With features like reget and range support
introducing more and more arguments into the mix, everything will start
to clutter up. Is this something you would rather I stayed away from
given the quotation? Or, were you simply pointing out--as seems
more likely from the next quote--that a majority of functionality
currently established is in the right place and shouldn't need to be
moved around too horribly if at all?

> Most of the stuff that's currently implemented in urlgrabber can stay
> at the same level it's at now, but I think some of the reget/byterange
> stuff will NEED to be at a lower level.
>
> The best example is this: let's say you need to do two slightly
> different things for ftp: and http (as is common). You have two
> choices: (1) they are handled by different handlers in urllib2, so you
> can subclass those handlers. Or (2), you can put a block of if's at
> the urlgrabber level. The latter is much much uglier and ultimately
> doomed. You get the idea.

Absolutely.
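A minimal sketch of the HTTP half of the range idea discussed above, not code from the thread. It assumes Python 3, where urllib2 became urllib.request, and `make_range_request` is a hypothetical helper name. The standard value form for the header is `bytes=first-last`, with both offsets inclusive.

```python
from urllib.request import Request


def make_range_request(url, first, last):
    """Build a Request asking for bytes first..last (inclusive) of url.

    Hypothetical helper for illustration; it is not part of urlgrabber.
    """
    if first < 0 or last < first:
        raise ValueError("need 0 <= first <= last")
    # The standard Range value is "bytes=<first>-<last>".
    return Request(url, headers={"Range": "bytes=%d-%d" % (first, last)})


# Building the request does not touch the network; urlopen(req) would.
req = make_range_request("http://example.com/file.iso", 200, 500)
print(req.get_header("Range"))
```

Passing `req` to urlopen would send the header. A server that honors it answers 206 Partial Content, while one that ignores it returns the whole body with 200, so a caller should check the status. FTP has no such header, which is why a handler subclass is needed there.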
I'm really thinking that one of the first things I should probably look
at is creating urllib2 handler subclasses of at least HTTP and FTP just
so I have a clear implementation path for the proposed features. i.e.
Once we have handlers plugged into urllib2, answers to such questions
as "Where do I implement such-and-such mechanism for FTP" are
automatic.

I'm still trying to get everything straight in my head so I can feel
more confident coming to you guys with proposals but expect another
message from me this weekend on places I'm thinking would be good
starting points and some high level implementation proposals.

Thanks!

- Ryan

From mstenner@phy.duke.edu Sat Dec 6 08:42:08 2003
Return-Path:
Delivered-To: mstenner@phy.duke.edu
Received: from elmo.phy.duke.edu (elmo.phy.duke.edu [152.3.182.49])
    by mail.phy.duke.edu (Postfix) with ESMTP id 2B9D2A77CB;
    Sat, 6 Dec 2003 08:42:08 -0500 (EST)
Received: by elmo.phy.duke.edu (Postfix, from userid 697)
    id D044CBB3CC; Sat, 6 Dec 2003 08:42:06 -0500 (EST)
Date: Sat, 6 Dec 2003 08:42:05 -0500
From: Michael Stenner
To: Ryan Tomayko
Cc: Seth Vidal
Subject: Re: urlgrabber
Message-ID: <20031206134205.GA2992@phy.duke.edu>
References: <1070611124.9147.32.camel@binkley>
    <1070597975.9147.12.camel@binkley>
    <20031205130814.GA435@phy.duke.edu>
    <1070685010.28826.221.camel@daishar.fade.naeblis.cx>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1070685010.28826.221.camel@daishar.fade.naeblis.cx>
User-Agent: Mutt/1.4.1i
X-Spam-Status: No, hits=-2.9 required=4.0
    tests=AWL,IN_REP_TO,QUOTED_EMAIL_TEXT,REFERENCES,
    SIGNATURE_SHORT_DENSE,SPAM_PHRASE_02_03,USER_AGENT,
    USER_AGENT_MUTT
    version=2.44
X-Spam-Level:
Lines: 77
Status: RO
X-Status:
X-Keywords:

On Fri, Dec 05, 2003 at 11:30:10PM -0500, Ryan Tomayko wrote:
> On Fri, 2003-12-05 at 08:08, Michael Stenner wrote:
> > You can keep the interface the same.
>
> One of the things I had on a short list was the possibility of moving to
> the **kwargs interface for the urlXXX methods, as was first put forth by
> your design document. With features like reget and range support
> introducing more and more arguments into the mix, everything will start
> to clutter up. Is this something you would rather I stayed away from
> given the quotation?

No. I was unclear. I just meant that the function names and arguments
can remain backwards compatible. I'm still a big fan of the **kwargs
approach. I don't see that as a major change. I've been advocating that
people only use keyword args to call the functions, and if they do
that, nothing will break.

> Or, were you simply pointing out--as seems
> more likely from the next quote--that a majority of functionality
> currently established is in the right place and shouldn't need to be
> moved around too horribly if at all?

Yes, that's precisely what I meant. But don't feel limited by that
either. I'd much rather have you DTRT rather than make things crufty in
some attempt to please me by changing very little :)

There are some other things that I talked about with seth (and some I
didn't) that I should mention to you now:

1) toplevel class

   it might be nice to really have a URLGrabber class whose methods
   are urlgrab, urlopen, etc. Like urllib2 (and urllib) you could
   have module-level functions that call an 'internal' instance of
   this for convenience. The nice thing about this is that you could
   have a couple of the objects lying around if you need different
   options. Maybe for different threads, etc.

2) mirror objects

   Create a class called MirrorGroup (or something) that takes a list
   of urls. This class has methods called urlgrab, etc. When called
   with a _relative_ path, these pick a base url from the list and
   try to download. Failover could be implemented at this level -- if
   one server doesn't work, move on to the next.
   The specific function for choosing a mirror (for the first try and
   for the case of failover) could be handled via some internal
   method. Therefore, different failover policies could simply amount
   to subclassing and overriding a couple of methods. This ties into
   the next idea.

3) Global statistics tracking.

   Urlgrabber, at some fairly low level could keep track of
   statistics on different servers. My recommendation would be to
   provide methods like

     loaddb(filename=...)
     savedb(filename=...)

   and possibly an option to disable it altogether. Anyway, the db
   (and the format could be anything; pickle, shelve, xml, etc) could
   store info like

     transfer rate (mean, median, min, max, etc)
     reliability (what fraction of the time we've tried to use this
       has something gone wrong), etc.

   I haven't thought much about the specific stats. If globally
   available, this info could be used by the mirror objects to make
   decisions about what mirrors to use.

OK, that ought to give you some things to think about :)

                                        -Michael

--
  Michael Stenner                       Office Phone: 919-660-2513
  Duke University, Dept. of Physics     mstenner@phy.duke.edu
  Box 90305, Durham N.C. 27708-0305

From skvidal@phy.duke.edu Sat Dec 6 14:45:34 2003
Return-Path:
Delivered-To: mstenner@phy.duke.edu
Received: from opus.phy.duke.edu (opus.phy.duke.edu [152.3.182.42])
    by mail.phy.duke.edu (Postfix) with ESMTP id E07C2A77FC;
    Sat, 6 Dec 2003 14:45:34 -0500 (EST)
Subject: Re: urlgrabber
From: seth vidal
To: Michael Stenner
Cc: Ryan Tomayko
In-Reply-To: <20031206134205.GA2992@phy.duke.edu>
References: <1070611124.9147.32.camel@binkley>
    <1070597975.9147.12.camel@binkley>
    <20031205130814.GA435@phy.duke.edu>
    <1070685010.28826.221.camel@daishar.fade.naeblis.cx>
    <20031206134205.GA2992@phy.duke.edu>
Content-Type: text/plain
Message-Id: <1070739934.14010.3.camel@opus>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.3 (1.4.3-1.duke.1)
Date: 06 Dec 2003 14:45:34 -0500
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, hits=-1.4 required=4.0
    tests=AWL,IN_REP_TO,QUOTED_EMAIL_TEXT,REFERENCES,
    SPAM_PHRASE_03_05
    version=2.44
X-Spam-Level:
Lines: 21
Status: RO
X-Status:
X-Keywords:

> 1) toplevel class
>
>    it might be nice to really have a URLGrabber class whose methods
>    are urlgrab, urlopen, etc. Like urllib2 (and urllib) you could
>    have module-level functions that call an 'internal' instance of
>    this for convenience. The nice thing about this is that you could
>    have a couple of the objects lying around if you need different
>    options. Maybe for different threads, etc.

Something related here that I think might be useful.

It would be nice if urlgrab could grab to an mkstemp file location and
return that location. I can definitely see how this would be useful,
though I can also see how it might not belong in urlgrabber either.

just something I thought of while I was reading through michael's
comments.
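A rough sketch of the mkstemp idea, not code from the thread: download into a file created by tempfile.mkstemp and return its path. It assumes Python 3 (urllib.request in place of urllib2); `grab_to_tempfile` and its parameters are hypothetical names.

```python
import os
import shutil
import tempfile
from urllib.request import urlopen


def grab_to_tempfile(url, suffix="", directory=None):
    """Fetch url into a freshly created temp file and return its path.

    Hypothetical helper; the caller is responsible for deleting the file.
    """
    # mkstemp creates the file securely and returns an open descriptor.
    fd, path = tempfile.mkstemp(suffix=suffix, dir=directory)
    try:
        with os.fdopen(fd, "wb") as out, urlopen(url) as response:
            shutil.copyfileobj(response, out)
    except Exception:
        # Do not leave a partial download behind.
        os.unlink(path)
        raise
    return path
```

Because mkstemp never deletes the file itself, the cleanup question raised above (does this belong in urlgrabber or in the caller?) stays open in this sketch: the caller owns the returned path.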
-sv

From rtomayko@naeblis.cx Sun Dec 7 00:38:39 2003
Return-Path:
Delivered-To: mstenner@phy.duke.edu
Received: from ms-smtp-02-eri0.ohiordc.rr.com (ms-smtp-02-smtplb.ohiordc.rr.com [65.24.5.136])
    by mail.phy.duke.edu (Postfix) with ESMTP id 3234AA786F;
    Sun, 7 Dec 2003 00:38:39 -0500 (EST)
Received: from asha.fade.naeblis.cx (dhcp065-024-052-214.columbus.rr.com [65.24.52.214])
    by ms-smtp-02-eri0.ohiordc.rr.com (8.12.10/8.12.7) with ESMTP id hB75cbDh013420;
    Sun, 7 Dec 2003 00:38:38 -0500 (EST)
Received: from [192.168.1.101] (daishar.fade.naeblis.cx [192.168.1.101])
    by asha.fade.naeblis.cx (8.12.8/8.12.8) with ESMTP id hB75caHF026462;
    Sun, 7 Dec 2003 00:38:37 -0500
Subject: Re: urlgrabber
From: Ryan Tomayko
To: Michael Stenner
Cc: Seth Vidal
In-Reply-To: <20031206134205.GA2992@phy.duke.edu>
References: <1070611124.9147.32.camel@binkley>
    <1070597975.9147.12.camel@binkley>
    <20031205130814.GA435@phy.duke.edu>
    <1070685010.28826.221.camel@daishar.fade.naeblis.cx>
    <20031206134205.GA2992@phy.duke.edu>
Content-Type: text/plain
Message-Id: <1070775516.28826.870.camel@daishar.fade.naeblis.cx>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.5 (1.4.5-7)
Date: Sun, 07 Dec 2003 00:38:36 -0500
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: Symantec AntiVirus Scan Engine
X-Spam-Status: No, hits=-0.9 required=4.0
    tests=AWL,IN_REP_TO,QUOTED_EMAIL_TEXT,REFERENCES,
    SPAM_PHRASE_03_05
    version=2.44
X-Spam-Level:
Lines: 73
Status: RO
X-Status: A
X-Keywords:

Great stuff!

On Sat, 2003-12-06 at 08:42, Michael Stenner wrote:
> 1) toplevel class
>
>    it might be nice to really have a URLGrabber class whose methods
>    are urlgrab, urlopen, etc. Like urllib2 (and urllib) you could
>    have module-level functions that call an 'internal' instance of
>    this for convenience. The nice thing about this is that you could
>    have a couple of the objects lying around if you need different
>    options. Maybe for different threads, etc.

Funny you mentioned it.
I had already toyed a bit with moving some
functionality into a class. Specifically, the _do_open and _do_grab and
some other things. It seems to be working out very nicely so far. I'll
be sending over some of the progress at some point this weekend once I
have things cleaned up a bit.

> 2) mirror objects
> 3) Global statistics tracking.
> OK, that ought to give you some things to think about :)

It sure does! I'll be keeping 2 and 3 in mind as I move along.

My nowish (as in this weekend) list looks like this. I won't go into
depth on these because they've been hashed out pretty thoroughly in the
past:

1. **kwargs
2. URLGrabber class
3. Bring retry functionality into normal urlXXX methods.
4. range support (this will force me to look into the urllib2 handler
   subclassing a bit more and build up some framework there)

I've made good progress on 1-3 but need to do some serious testing and
cleanup. I was hoping to look into 4 tonight and have something for you
guys to look at tomorrow night.

I start losing a sense of priority after those first four but here's a
list of things I plan on looking at a bit later (ignore order).

5. reget
Not much to say here. This should be straightforward once range is
available.

6. MirrorGroup

7. Add keepalive=[1|0] to kwargs
This is something I thought might be a little nicer way of handling
turning keepalive on/off as requested by Seth. I'm thinking that the
URLGrabber class can handle turning keepalive on/off (based on a kwarg)
on a per object instance. Not sure how easy/possible this will be but I
wanted to throw it out there.

8. Package
I don't know if this is still something you wanted to pursue but given
that you already have distutils worked in nicely this should be as easy
as adding a line to setup.py. I think the big decision to make is how
much you want to disrupt the interface since putting this in a package
would require a bit more work for yum and anything else currently using
grabber to migrate to the new design.

9. thread-safety
I've been keeping my eyes open for anything that might cause trouble if
multiple threads were to use urlgrabber simultaneously. I thought the
auth_handler stuff might cause problems but that looks okay. At some
point I will resort to experiment to flush out what I'm not seeing.

10. Site Statistics

Please rearrange/add/remove items from this list as you see fit.

- Ryan

From mstenner@phy.duke.edu Sun Dec 7 08:58:22 2003
Return-Path:
Delivered-To: mstenner@phy.duke.edu
Received: from elmo.phy.duke.edu (elmo.phy.duke.edu [152.3.182.49])
    by mail.phy.duke.edu (Postfix) with ESMTP id CB50AA7819;
    Sun, 7 Dec 2003 08:58:22 -0500 (EST)
Received: by elmo.phy.duke.edu (Postfix, from userid 697)
    id 31BD8BB3CC; Sun, 7 Dec 2003 08:58:21 -0500 (EST)
Date: Sun, 7 Dec 2003 08:58:21 -0500
From: Michael Stenner
To: Ryan Tomayko
Cc: Seth Vidal
Subject: Re: urlgrabber
Message-ID: <20031207135821.GA15743@phy.duke.edu>
References: <1070611124.9147.32.camel@binkley>
    <1070597975.9147.12.camel@binkley>
    <20031205130814.GA435@phy.duke.edu>
    <1070685010.28826.221.camel@daishar.fade.naeblis.cx>
    <20031206134205.GA2992@phy.duke.edu>
    <1070775516.28826.870.camel@daishar.fade.naeblis.cx>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1070775516.28826.870.camel@daishar.fade.naeblis.cx>
User-Agent: Mutt/1.4.1i
X-Spam-Status: No, hits=-2.7 required=4.0
    tests=AWL,IN_REP_TO,QUOTED_EMAIL_TEXT,REFERENCES,
    SIGNATURE_SHORT_DENSE,SPAM_PHRASE_03_05,USER_AGENT,
    USER_AGENT_MUTT
    version=2.44
X-Spam-Level:
Lines: 78
Status: RO
X-Status:
X-Keywords:

On Sun, Dec 07, 2003 at 12:38:36AM -0500, Ryan Tomayko wrote:
> On Sat, 2003-12-06 at 08:42, Michael Stenner wrote:
> > 1) toplevel class
> >
> >    it might be nice to really have a URLGrabber class whose methods
> >    are urlgrab, urlopen, etc. Like urllib2 (and urllib) you could
> >    have module-level functions that call an 'internal' instance of
> >    this for convenience. The nice thing about this is that you could
> >    have a couple of the objects lying around if you need different
> >    options. Maybe for different threads, etc.
>
> Funny you mentioned it. I had already toyed a bit with moving some
> functionality into a class. Specifically, the _do_open and _do_grab and
> some other things. It seems to be working out very nicely so far. I'll
> be sending over some of the progress at some point this weekend once I
> have things cleaned up a bit.

One note that's probably pretty obvious. If you do make a URLGrabber
class, then each instance of that should have its own instance of the
urllib2 thingy. I don't like cluttering up the urllib2 space as
urlgrabber currently does (by pushing things into the "global" urllib2
thingy).

> 7. Add keepalive=[1|0] to kwargs
> This is something I thought might be a little nicer way of handling
> turning keepalive on/off as requested by Seth. I'm thinking that the
> URLGrabber class can handle turning keepalive on/off (based on a kwarg)
> on a per object instance. Not sure how easy/possible this will be but I
> wanted to throw it out there.

In fact, that behavior would be nice for lots of things. I currently
have a way to set the throttle, bandwidth, progress_meter, etc. both
globally and locally. I think they're analogous. It would be nice to be
able to set them at the instance level, or on a file-by-file basis.

> 8. Package
> I don't know if this is still something you wanted to pursue but given
> that you already have distutils worked in nicely this should be as easy
> as adding a line to setup.py. I think the big decision to make is how
> much you want to disrupt the interface since putting this in a package
> would require a bit more work for yum and anything else currently using
> grabber to migrate to the new design.

Actually, the idea of splitting it out of yum was really pushed hardest
by seth and icon. I still think it's a good idea.
I think the best way to do it is probably to make it a package
something like:

  URLGrabber/grabber.py
  URLGrabber/keepalive.py
  URLGrabber/progress.py

I don't have a huge stake in the details of that, though. They'll all
be compat breaks, but you should still be able to simply do:

  # from urlgrabber import urlgrab
  from URLGrabber.grabber import urlgrab

or

  # import urlgrabber
  import URLGrabber.grabber as urlgrabber

The latter if you're really lazy.

> 9. thread-safety
> I've been keeping my eyes open for anything that might cause trouble if
> multiple threads were to use urlgrabber simultaneously. I thought the
> auth_handler stuff might cause problems but that looks okay. At some
> point I will resort to experiment to flush out what I'm not seeing.

I predict that keepalive will be your biggest headache there :)

                                        -Michael

--
  Michael Stenner                       Office Phone: 919-660-2513
  Duke University, Dept. of Physics     mstenner@phy.duke.edu
  Box 90305, Durham N.C. 27708-0305
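
One way to picture the points made in this last message, sketched under stated assumptions rather than taken from urlgrabber itself: each URLGrabber instance owns its own opener (built with build_opener, never install_opener, so nothing is pushed into module-global state), and options are resolved per call, then per instance, then from defaults. Python 3's urllib.request stands in for urllib2; the class body, the default values and the `effective_options` method are illustrative only.

```python
from urllib.request import build_opener

# Illustrative defaults: the option names echo the thread, the values are made up.
DEFAULT_OPTIONS = {"keepalive": 1, "throttle": 1.0, "bandwidth": 0,
                   "progress_meter": None}


class URLGrabber:
    """Sketch of a grabber whose options and opener are per instance."""

    def __init__(self, **kwargs):
        self._check(kwargs)
        self.options = dict(DEFAULT_OPTIONS, **kwargs)
        # A private opener: handlers added to it do not leak into
        # urllib.request's global opener (install_opener is never called).
        self.opener = build_opener()

    @staticmethod
    def _check(kwargs):
        unknown = set(kwargs) - set(DEFAULT_OPTIONS)
        if unknown:
            raise TypeError("unknown option(s): %s" % ", ".join(sorted(unknown)))

    def effective_options(self, **kwargs):
        """Merge per-call kwargs over the instance options."""
        self._check(kwargs)
        return dict(self.options, **kwargs)

    def urlopen(self, url, **kwargs):
        # A real implementation would act on the resolved options
        # (throttle, keepalive, ...); this sketch only validates them.
        self.effective_options(**kwargs)
        return self.opener.open(url)


# Module-level convenience function delegating to one internal instance.
_default_grabber = URLGrabber()


def urlopen(url, **kwargs):
    return _default_grabber.urlopen(url, **kwargs)
```

With this shape, `URLGrabber(throttle=0.5)` and `URLGrabber()` can coexist (for example one per thread), and a single call can still override an option with `grabber.urlopen(url, keepalive=0)` without changing the instance.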