While hacking on the search stuff, I found that MetaKit sometimes calls
abort() if you give it weird enough values. It looks like if you try to
pass a unicode object into view.append() or view.find(), it crashes. It
also dies if you try to give it an xmlrpclib.DateTime object, taking the
whole server process down with it.
Not sure if this is just on FreeBSD, but it looks like we'll have to put
more input validation into the XML-RPC calls...
Perhaps this is why pycs.net periodically just _stops_ without any warning.
Unicode stuff coming into weblogUpdates.ping or something ... hmm.
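As a first pass, something like this might be enough (just a rough sketch, and the helper name is made up, not actual PyCS code): coerce anything that isn't a plain byte string before it gets near a MetaKit view:

import xmlrpclib

def coerce_for_metakit(value):
    # unicode objects make MetaKit abort(), so encode them first
    if isinstance(value, unicode):
        return value.encode('utf-8')
    # xmlrpclib.DateTime kills it too - store its timestamp string instead
    if isinstance(value, xmlrpclib.DateTime):
        return str(value)
    return value

Each XML-RPC handler could run its incoming parameters through that before calling view.append() or view.find().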
Cheers,
Phil :-)
I just pushed a whole bunch of stuff into CVS. Some tidying / refactoring
(pycs.py and xmlStorageSystem.py are now a little shorter), some minor
changes (comments.py returns more helpful errors, referrers.py has a
different message) and one biggie: search.
I never managed to get enough time alone to finish the ht://Dig or Lucene
integration that I was working on, but I just hacked up a Python equivalent
to what I was doing with Lucene:
- added a new xmlStorageSystem.mirrorPosts(email, password, posts) method that
takes a list of posts (structs: {postid, date, url, guid, title, text}) and
dumps them into a new database table. If a post already exists (matched by
postid), the old copy is deleted before the new one goes in.
- added modules/system/search.py that searches through this table.
Examples:
http://www.pycs.net/system/search.py?u=0000049&q=georg+%2Bxml-rpc
http://www.pycs.net/system/search.py?u=0000049&q=georg+-bauer
http://www.pycs.net/system/search.py?u=0000049&q=phil+-pearson
I've added support for this to bzero ... it builds structs that look like
this:
{'postid': '200310151',
 'date': xmlrpclib.DateTime(...),
 'url': 'http://www.pycs.net/archive/2003/10/15/#200310151',
 'guid': 'http://www.pycs.net/archive/2003/10/15/#200310151',
 'title': 'post about something',
 'text': 'foo bar baz'}
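For illustration, the client-side call might look roughly like this (a sketch only: the /RPC2 endpoint, the credentials and the exact method path are placeholders, not necessarily what bzero really does):

import xmlrpclib

server = xmlrpclib.ServerProxy('http://www.pycs.net/RPC2')
posts = [{'postid': '200310151',
          'date': xmlrpclib.DateTime('20031015T00:00:00'),
          'url': 'http://www.pycs.net/archive/2003/10/15/#200310151',
          'guid': 'http://www.pycs.net/archive/2003/10/15/#200310151',
          'title': 'post about something',
          'text': 'foo bar baz'}]
server.xmlStorageSystem.mirrorPosts('someone@example.com', 'secret', posts)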
It converts all incoming text into UTF-8 for storage in the database, and
displays everything as UTF-8.
TODO:
- make it respect access rules
- check that it works for non-English text
- audit the XML-RPC methods to make sure nothing else tries to put weird
objects into the database
------
Here are the full details of the CVS commit:
- modules/system/comments.py: added some helpful links when someone requests
comments.py?u=foo without 'p=...'
- pycs_settings.py: changed code to use standard python style; added new
mirrored_posts database table
- xmlStorageSystem.py: added xmlStorageSystem.mirrorPosts() call to allow
users to push posts into the search mirror (not really an index...)
- modules/system/rankings.py: fixed display bug - now everything appears at
the top of the screen and is spaced horizontally properly
- modules/system/referrers.py: fixed spelling
- modules/system/search.py: moved the old search.py to htdig.py and made
search.py do a simple linear search through the mirrored_posts table (a
rough sketch of the matching logic is below). NB: it doesn't respect access
rules yet.
- modules/system/users.py: added indication as to whether the user is using
the search mirror (and if so, how many posts are in there)
- www/pycs.css: added some css rules for the search results
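The matching in search.py works roughly like this; the +/- semantics are guessed from the example URLs above (bare words are optional, '+word' must be present, '-word' must be absent), so treat it as a sketch rather than the real code:

def match_post(post, query):
    text = ('%s %s' % (post['title'], post['text'])).lower()
    terms = query.lower().split()
    required = [t[1:] for t in terms if t.startswith('+')]
    excluded = [t[1:] for t in terms if t.startswith('-')]
    plain    = [t for t in terms if t[0] not in '+-']
    if [t for t in required if text.find(t) == -1]:
        return 0    # a required '+term' is missing
    if [t for t in excluded if text.find(t) != -1]:
        return 0    # an excluded '-term' is present
    if plain and not [t for t in plain if text.find(t) != -1]:
        return 0    # none of the plain terms matched
    return 1

def search_posts(posts, query):
    # linear scan over the mirrored posts
    return [p for p in posts if match_post(p, query)]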
Cheers,
Phil :-)
Hi,
On Tue, 14 Oct 2003 01:45:47 +1300 you wrote:
> Looks all right. I've simplified the <pre> stuff a bit, got rid of the
> global var, and added in my test cases. Attached - what do you think of
> this one?
Looks better. Thank you.
> BTW I stopped writing Python code with spaces in strange places ("foo(
> bar )") ages ago and now the PyCS code looks really weird to me. Would it
> bother anyone here to just use the normal Python way ("foo()") instead for
> new modules?
The normal Python way looks good to me, too. I agree with your proposal.
Hi!
> Would it bother anyone here to just use the normal Python way ("foo()")
> instead for new modules?
No problem. I feel weird when looking into the PyCS source from time to
time, too. And I did write some of that code ;-)
Is there a Python source formatter out there, so we could change the sources
to a canonical format?
bye, Georg
> Hi, I wrote a strip-o-gram version of html_cleaner.py. How about this?
Looks all right. I've simplified the <pre> stuff a bit, got rid of the
global var, and added in my test cases. Attached - what do you think of
this one?
BTW I stopped writing Python code with spaces in strange places ("foo(
bar )") ages ago and now the PyCS code looks really weird to me. Would it
bother anyone here to just use the normal Python way ("foo()") instead for
new modules?
Cheers,
Phil
#!/usr/bin/python

import re
from stripogram import HTML2SafeHTML


class PyCSSafeHTML(HTML2SafeHTML):

    def __init__(self, valid_tags):
        HTML2SafeHTML.__init__(self, valid_tags)
        self.in_a = 0

    def start_tag(self, tag):
        if tag == 'a':
            self.in_a = 1

    def end_tag(self, tag):
        if tag == 'a':
            self.in_a = 0
        self.result = "%s</%s>" % (self.result, tag)

    def handle_data(self, data):
        if data:
            if not self.in_a:
                data = url2link(data)
            self.result += data


def url2link(text):
    return re.sub(r'(http://[^\r\n \"\<]+)',
                  r'<a href="\1" target="_blank">\1</a>',
                  text,
                  )


def html2safehtml(s, valid_tags):
    parser = PyCSSafeHTML(valid_tags)
    parser.feed(s)
    parser.close()
    parser.cleanup()
    return parser.result
def cleanHtml(text):
    # strip out anything dodgy
    text = html2safehtml(text, valid_tags=('a', 'b', 'i', 's', 'tt', 'pre'))

    # now run through and convert line feeds to <br />, but not inside <pre> blocks
    new_text = []
    add = new_text.append

    pre_count = 0
    for line in text.split('\n'):
        pre_count += line.count("<pre>") - line.count("</pre>")
        add(line)
        if not pre_count:
            add("<br />")
        add("\n")
    return "".join(new_text)
if __name__ == '__main__':
    text = [
        # general: make sure <b> and <a> tags aren't mangled
        """I'm writing my <b>Radio Klogging Kit for Managers</b> as an <a href="http://radio.weblogs.com/0100827/instantOutliner/klogging.opml">OPML file</a> with <a href="http://www.cadenhead.org/servlet/ViewInstantOutline?opmlFile=http://radio.weblogs.com/0100827/instantOutliner/klogging.opml">a link on my site using your servlet</a>. I have a pointer to the opml in my Instant Outline. Does the polling of my i/o cascade to xref'd outlines?""",
        # check urls turn into links
        """It looks like someone's subscribed to the rendered form of your outline. People should be subscribing to the raw OPML version - http://rcs.myelin.cjb.net/users/0000001/instantOutliner/rogersCadenhead.opml - but actually they're subscribing to the one that calls your servlet.
Your outline is currently the most popular file on this server, because you plus one or two others are downloading it every 10-60 seconds. I can't imagine the hammering radio.weblogs.com must be getting from all the I/O polling, but it must be pretty shocking.""",
        # make sure <br/> isn't inserted inside a <pre>
        """here's some text, with some code inside it.
<pre>import os
for foo in os.listdir('/'):
    print foo</pre>
now, there shouldn't have been any <br />s inserted in there ...""",
        """Script should be removed: <script>foo bar</script>""",
"""Entities & stuff should stay: <b> shouldn't make the text =
bold!""",
"""unclosed tags should be <i>closed""",
"""unopened tags </i> should be ignored""",
]
for post in text:
print "PARSING: ------------------------------------------------"
print post
print "---> ------------------------------------------------"
print cleanHtml( post )
Hi, I wrote a strip-o-gram version of html_cleaner.py. How about this?