While hacking on the search stuff, I found that MetaKit sometimes calls
abort() if you give it weird enough values. It looks like if you try to
pass a unicode object into view.append() or view.find(), it crashes. It
also dies if you try to give it an xmlrpclib.DateTime object, taking the
whole server process down with it.
Not sure if this is just on FreeBSD, but it looks like we'll have to put
more input validation into the XML-RPC calls...
Perhaps this is why pycs.net periodically just _stops_ without any warning.
Unicode stuff coming into weblogUpdates.ping or something ... hmm.
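As a first pass, something like this might be enough (just a rough sketch, and the helper name is made up, not actual PyCS code): coerce anything that isn't a plain byte string before it gets near a MetaKit view:

import xmlrpclib

def coerce_for_metakit(value):
    # unicode objects make MetaKit abort(), so encode them first
    if isinstance(value, unicode):
        return value.encode('utf-8')
    # xmlrpclib.DateTime kills it too - store its timestamp string instead
    if isinstance(value, xmlrpclib.DateTime):
        return str(value)
    return value

Each XML-RPC handler could run its incoming parameters through that before calling view.append() or view.find().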
Cheers,
Phil :-)
I just pushed a whole bunch of stuff into CVS. Some tidying / refactoring
(pycs.py and xmlStorageSystem.py are now a little shorter), some minor
changes (comments.py returns more helpful errors, referrers.py has a
different message) and one biggie: search.
I never managed to get enough time alone to finish the ht://Dig or Lucene
integration that I was working on, but I just hacked up a Python equivalent
to what I was doing with Lucene:
- added a new xmlStorageSystem.mirrorPosts(email, password, posts) method that
takes a list of posts (structs: {postid, date, url, guid, title, text}) and
dumps them into a new database table. If a post already exists (matched by
postid), the old copy is deleted before the new one goes in.
- added modules/system/search.py that searches through this table.
Examples:
http://www.pycs.net/system/search.py?u=0000049&q=georg+%2Bxml-rpc
http://www.pycs.net/system/search.py?u=0000049&q=georg+-bauer
http://www.pycs.net/system/search.py?u=0000049&q=phil+-pearson
I've added support for this to bzero ... it builds structs that look like
this:
{'postid': '200310151',
 'date': xmlrpclib.DateTime(...),
 'url': 'http://www.pycs.net/archive/2003/10/15/#200310151',
 'guid': 'http://www.pycs.net/archive/2003/10/15/#200310151',
 'title': 'post about something',
 'text': 'foo bar baz'}
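For illustration, the client-side call might look roughly like this (a sketch only: the /RPC2 endpoint, the credentials and the exact method path are placeholders, not necessarily what bzero really does):

import xmlrpclib

server = xmlrpclib.ServerProxy('http://www.pycs.net/RPC2')
posts = [{'postid': '200310151',
          'date': xmlrpclib.DateTime('20031015T00:00:00'),
          'url': 'http://www.pycs.net/archive/2003/10/15/#200310151',
          'guid': 'http://www.pycs.net/archive/2003/10/15/#200310151',
          'title': 'post about something',
          'text': 'foo bar baz'}]
server.xmlStorageSystem.mirrorPosts('someone@example.com', 'secret', posts)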
It converts all incoming text into UTF-8 for storage in the database, and
displays everything as UTF-8.
TODO:
- make it respect access rules
- check that it works for non-English text
- audit the XML-RPC methods to make sure nothing else tries to put weird
objects into the database
------
Here are the full details of the CVS commit:
- modules/system/comments.py: added some helpful links when someone requests
comments.py?u=foo without 'p=...'
- pycs_settings.py: changed code to use standard python style; added new
mirrored_posts database table
- xmlStorageSystem.py: added xmlStorageSystem.mirrorPosts() call to allow
users to push posts into the search mirror (not really an index...)
- modules/system/rankings.py: fixed display bug - now everything appears at
the top of the screen and is spaced horizontally properly
- modules/system/referrers.py: fixed spelling
- modules/system/search.py: moved the old search.py to htdig.py and made
search.py do a simple linear search through the mirrored_posts table (a
rough sketch of the matching logic is below). NB: it doesn't respect access
rules yet.
- modules/system/users.py: added indication as to whether the user is using
the search mirror (and if so, how many posts are in there)
- www/pycs.css: added some css rules for the search results
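The matching in search.py works roughly like this; the +/- semantics are guessed from the example URLs above (bare words are optional, '+word' must be present, '-word' must be absent), so treat it as a sketch rather than the real code:

def match_post(post, query):
    text = ('%s %s' % (post['title'], post['text'])).lower()
    terms = query.lower().split()
    required = [t[1:] for t in terms if t.startswith('+')]
    excluded = [t[1:] for t in terms if t.startswith('-')]
    plain    = [t for t in terms if t[0] not in '+-']
    if [t for t in required if text.find(t) == -1]:
        return 0    # a required '+term' is missing
    if [t for t in excluded if text.find(t) != -1]:
        return 0    # an excluded '-term' is present
    if plain and not [t for t in plain if text.find(t) != -1]:
        return 0    # none of the plain terms matched
    return 1

def search_posts(posts, query):
    # linear scan over the mirrored posts
    return [p for p in posts if match_post(p, query)]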
Cheers,
Phil :-)
Hi,
On Tue, 14 Oct 2003 01:45:47 +1300 you wrote:
> Looks all right. I've simplified the <pre> stuff a bit, got rid of the
> global var, and added in my test cases. Attached - what do you think of
> this one?
Looks better. Thank you.
> BTW I stopped writing Python code with spaces in strange places ("foo(
> bar )") ages ago and now the PyCS code looks really weird to me. Would it
> bother anyone here to just use the normal Python way ("foo()") instead for
> new modules?
The normal Python way looks good to me, too. I agree with your proposal.
Hi!
> Would it bother anyone here to just use the normal Python way ("foo()")
> instead for new modules?
No problem. I feel weird when looking into the PyCS source from time to
time, too. And I did write some of that code ;-)
Is there a Python source formatter out there, so we could change the sources
to a canonical format?
bye, Georg
> Hi, I wrote a strip-o-gram version of html_cleaner.py. How about this?
Looks all right. I've simplified the <pre> stuff a bit, got rid of the
global var, and added in my test cases. Attached - what do you think of
this one?
BTW I stopped writing Python code with spaces in strange places ("foo(
bar )") ages ago and now the PyCS code looks really weird to me. Would it
bother anyone here to just use the normal Python way ("foo()") instead for
new modules?
Cheers,
Phil
#!/usr/bin/python

import re
from stripogram import HTML2SafeHTML


class PyCSSafeHTML(HTML2SafeHTML):

    def __init__(self, valid_tags):
        HTML2SafeHTML.__init__(self, valid_tags)
        self.in_a = 0

    def start_tag(self, tag):
        if tag == 'a':
            self.in_a = 1

    def end_tag(self, tag):
        if tag == 'a':
            self.in_a = 0
        self.result = "%s</%s>" % (self.result, tag)

    def handle_data(self, data):
        if data:
            if not self.in_a:
                data = url2link(data)
            self.result += data


def url2link(text):
    return re.sub(r'(http://[^\r\n \"\<]+)',
                  r'<a href="\1" target="_blank">\1</a>',
                  text,
                  )


def html2safehtml(s, valid_tags):
    parser = PyCSSafeHTML(valid_tags)
    parser.feed(s)
    parser.close()
    parser.cleanup()
    return parser.result
def cleanHtml(text):
    # strip out anything dodgy
    text = html2safehtml(text, valid_tags=('a', 'b', 'i', 's', 'tt', 'pre'))

    # now run through and convert line feeds to <br />, but not inside <pre> blocks
    new_text = []
    add = new_text.append

    pre_count = 0
    for line in text.split('\n'):
        pre_count += line.count("<pre>") - line.count("</pre>")
        add(line)
        if not pre_count:
            add("<br />")
        add("\n")
    return "".join(new_text)
if __name__ == '__main__':
    text = [
        # general: make sure <b> and <a> tags aren't mangled
        """I'm writing my <b>Radio Klogging Kit for Managers</b> as an <a href="http://radio.weblogs.com/0100827/instantOutliner/klogging.opml">OPML file</a> with <a href="http://www.cadenhead.org/servlet/ViewInstantOutline?opmlFile=http://radio.weblogs.com/0100827/instantOutliner/klogging.opml">a link on my site using your servlet</a>. I have a pointer to the opml in my Instant Outline. Does the polling of my i/o cascade to xref'd outlines?""",
        # check urls turn into links
        """It looks like someone's subscribed to the rendered form of your outline. People should be subscribing to the raw OPML version - http://rcs.myelin.cjb.net/users/0000001/instantOutliner/rogersCadenhead.opml - but actually they're subscribing to the one that calls your servlet.
Your outline is currently the most popular file on this server, because you plus one or two others are downloading it every 10-60 seconds. I can't imagine the hammering radio.weblogs.com must be getting from all the I/O polling, but it must be pretty shocking.""",
        # make sure <br/> isn't inserted inside a <pre>
        """here's some text, with some code inside it.
<pre>import os
for foo in os.listdir('/'):
    print foo</pre>
now, there shouldn't have been any <br />s inserted in there ...""",
        """Script should be removed: <script>foo bar</script>""",
"""Entities & stuff should stay: <b> shouldn't make the text =
bold!""",
"""unclosed tags should be <i>closed""",
"""unopened tags </i> should be ignored""",
]
for post in text:
print "PARSING: ------------------------------------------------"
print post
print "---> ------------------------------------------------"
print cleanHtml( post )
Hi, I wrote a strip-o-gram version of html_cleaner.py. How about this?