I filed a bug with Apple a few weeks ago about Safari dropping # anchors on redirects. It is problem ID #7209106. It doesn't seem like you can link to their bugs, so I will describe it here:

When visiting a URL which causes a 301 redirect, the anchor at the end of the URL is discarded.

Steps to Reproduce:
Visit: http://wikipedia.org/wiki/Safari_(web_browser)#Safari_4

Expected Results:
The redirect to en.wikipedia.org should result in Safari visiting: http://en.wikipedia.org/wiki/Safari_(web_browser)#Safari_4

Actual Results:
Safari discards the anchor, and the resulting page is: http://en.wikipedia.org/wiki/Safari_(web_browser)

Regression:
I did not try this with Safari 3 or earlier. But Chrome and Firefox both preserve the anchor.

Notes:
Here is the transaction with the server that is the redirect:

GET /wiki/Safari_(web_browser) HTTP/1.1
Host: wikipedia.org
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: keep-alive

HTTP/1.0 301 Moved Permanently
Date: Wed, 09 Sep 2009 01:12:23 GMT
Server: Apache
Location: http://en.wikipedia.org/wiki/Safari_(web_browser)
Content-Length: 257
Content-Type: text/html; charset=iso-8859-1
Connection: keep-alive

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://en.wikipedia.org/wiki/Safari_(web_browser)">here</a>.</p>
</body></html>
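
The expected behavior can be sketched in a few lines of Java. This is only an illustration of what a client could do, not Safari's or any other browser's actual code: it assumes that a Location header without an anchor should inherit the anchor of the original request, which matches what Chrome and Firefox appear to do. The class and method names are made up for this example.

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
class RedirectFragment {
    // Resolve the Location value against the original URL and carry the
    // original anchor over when the redirect target has none of its own.
    static URI follow(URI original, String location) {
        URI target = original.resolve(location);
        if (target.getRawFragment() == null && original.getRawFragment() != null) {
            target = URI.create(target.toString() + "#" + original.getRawFragment());
        }
        return target;
    }

    public static void main(String[] args) {
        URI original = URI.create("http://wikipedia.org/wiki/Safari_(web_browser)#Safari_4");
        URI result = follow(original, "http://en.wikipedia.org/wiki/Safari_(web_browser)");
        System.out.println(result);
    }
}
```

If the Location header carries its own anchor, the sketch leaves it alone and the original anchor is dropped.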

That is, if you don't realize what happened.

I haven't found a reference as to why this was done, but after my upgrade to iPhoto 8.0.3, my rsync backup strategy decided to re-backup almost every picture I have in my library. That is until I did some poking around and found this in my iPhoto Library directory:

drwx---rwx  29 jared  staff        986 Jan 11 11:25 Data.noindex
lrw-rw-rw-   1 jared  staff         14 Jun  7 17:16 Data -> ./Data.noindex

Of course Jun 7 at 17:16 is about exactly when I upgraded to 8.0.3. Anyhow, this simple command from the appropriate point on my backup drive saved me a very large rsync:

$ mv Data Data.noindex

If you're using Time Machine this isn't possible or desirable. I've seen forum posts on the net that imply that 8.0.3 will cause Time Machine to re-backup everything, but it seems to me it should be smart enough to realize that those files just changed their path; the inode number probably didn't change.

The other day my iPhone ran into a catastrophe: it wouldn't charge, it wouldn't connect via USB, and the battery was dying. It's a first generation iPhone, and it looked like that was the end of my old iPhone.

I followed Basic iPhone troubleshooting, read various threads about similar problems, and tried a few of my own tricks:

  • Tried different cable
  • Tried different plug
  • Tried different outlet
  • Verified that those all worked with our other iPhone
  • Blew gunk out of the iPhone connector area (there was actually a lot in there)
  • Rebooted iPhone several times
  • Tried to clean the tiny iPhone connector with rubbing alcohol

None of that worked. So I went to the Apple Store, and the Genius did the same things, and determined that it must be the battery. Actually, it was not the battery: the connection sound never happened, and the iPhone never recognized a connection. But the Apple Store Genius was doing us a favor, because if the battery dies, we can buy an exact replacement (a first generation iPhone) for $85.

And by this time the battery in my iPhone was completely dead: I had failed to email my pictures off before the battery died. Another lesson learned...

I went home to consider my options, and try some other last resort actions, such as contact cleaner or disassembling the iPhone myself (or going to We Fix Macs). But I finally had an idea:

What if I tried to use a FireWire cable to charge?

It works!

I have an old JBL On Stage iPod speaker/dock that uses the FireWire pins instead of the USB pins to provide power, and that worked. It still doesn't make the "beep beep" sound when you connect the iPhone (and neither does the perfectly working iPhone). That must be reserved for a USB connection. But the iPhone immediately recognizes that it's getting power, and it charges perfectly.

Note that this trick won't work with a 3G iPhone or a second generation iPod Touch: the FireWire connection inside those has been removed.

Also, I haven't yet tried to sync with a FireWire cable. I believe it just might work, but I still have to find a FireWire cable (not the speaker dock) and try.

Today I created an Open Source project called TestCpp. It's a very simple JUnit-like C++ unit testing framework, and I'll be adding more to it in the near future.

I really wrote it because I'm working on another open source project, and I wanted to write some unit tests in C++; I quickly got frustrated with everything I had to do to get it going. With TestCpp you should be able to write your C++ unit tests and actually execute them in no time.

I just installed Ubuntu 8.10 Desktop and had a very interesting time trying to configure a static IP address. There are plenty of discussions on the forums about how this doesn't just work with the standard Network Manager. And you can't just edit /etc/network/interfaces because that is ignored when you have Network Manager installed.

To make it work, I followed this procedure:

First, remove the Network Manager packages:

sudo apt-get remove network-manager
sudo apt-get remove gnome-netstatus-applet

Now you'll have to manually set an IP so that you can connect to the Internet (modify this to be appropriate to your setup):

sudo ifconfig eth0 10.x.x.y netmask 255.255.255.0
sudo ip route add default via 10.x.x.1
sudo vi /etc/resolv.conf

Set nameserver 10.x.x.z appropriately. Next install the old gnome network admin tool:

sudo apt-get install gnome-network-admin

Finally use the old GUI to set networking configuration:

network-admin

This will store the network configuration in /etc/network/interfaces where it belongs. And it seems to work when you reboot. I'll keep it this way until Network Manager is fixed.

Object Oriented C

In my first job, the project we worked on was 100% C code. However, it was object oriented C. This was led by our colleague Chris Westin. As we were fond of pointing out, there is a difference between Object oriented languages and Object oriented programming. You can apply OOP concepts (given appropriate primitives) in any language. Here's a table where I'll record a few of these...

Runtime polymorphism: Achievable by using function pointers, and syntactic sugar is done with clever macros.

Link-time polymorphism: You probably do this already but don't define it as such; for instance, if you implement a function defined in a header differently on different platforms, you can consider this polymorphism. The function is different on Windows vs Linux. Another example might be a plugin for a browser. If you want it to run in Firefox and Opera, you might be able to get the core of it to call your own abstracted calls to the browser; the implementation thereof is determined at link-time.

Abstraction: This is more a matter of design than implementation, and thus is applicable to any language.

Information hiding: Again, this is a design issue. But the mechanism that is often used to achieve this is referred to as encapsulation.

Encapsulation: This can be done in different ways, but usually the most effective technique is opaque types. Again, with clever data structures and macros, this can be combined with the above Runtime polymorphism to construct objects that feel like C++, or can even co-exist with C++.

Exceptions: You can simulate some of this using setjmp/longjmp, but this can only go so far; the compiler doesn't know what's going on, so if you have a try/catch block that's really two macros doing housekeeping on the try/catch data structures, and then you return or break or continue out of the middle of that, there's nothing to stop you, and you've corrupted your try/catch data structures. A better method to use is to create an error data structure that can contain more information (like __FILE__ and __LINE__) than a simple int error code. This doesn't get you the magic stack unwinding, but at least it can be more informative than -1.

Now you might be asking "why not use C++?" There are lots of answers to this, but here are a few:

  • Fragile binary interface problem or Fragile base class problem
    This is a real problem for deployment of C++ code, and likely an important driver early on for Microsoft to develop COM. If you ship C++ objects in a shared library, you can't do so without being extremely careful about what's exposed in the header file.
  • Windows Debug Heap
    Similarly, on Windows you must be careful with memory management. You can't allocate memory in one module and free it in another, because each module may be using its own heap. This can easily happen in C++ if you do allocations in the header file, and then deallocation in the implementation file (or vice-versa). You might be mixing memory heaps, which will cause your app to crash.
  • Incompatible behaviors across compilers or even compiler versions.
    Certainly early on, different compilers, or even different versions of the same compiler, could generate code that was incompatible in terms of things like throwing/catching exceptions, or name mangling. This might not have been a problem for some time (I haven't checked), but is indicative of C++'s lack of an ABI.
  • C++ Standard library lacks ABI
    Similar to the above points, if you use a certain version of compiler/C++ Standard library in your shared object, you cannot share those data types with another shared object or application that uses a different version of compiler/C++ Standard library.
  • C++0x doesn't appear to address any of the ABI issues that are so well known in C++. If I'm wrong, please correct me. Bjarne's C++0x FAQ (or his C++ FAQ) doesn't even mention the word "binary"; the word "ABI" is used, but only in reference to the GC system.
  • Lack of a "platform".
    This is a common criticism, which of course C and other languages share. If you want to acquire a mutex, you have to do it differently depending on what platform you're on. Java and other more modern languages include ways to do this, and many many other things. C++0x and its standard library seem to address at least some of this...

Don't get me wrong, C++ is an extremely useful language that I use in lots of projects, but you have to know its limitations, in addition to mastering its use. I just wish that the most glaring deficiency, binary compatibility, was addressed in C++0x.

Exadel E7

Just attended a web conference demonstrating a new product from Exadel called E7, hosted by Brandon Blell, Charley Cowens and Max Katz.

It looks pretty cool, although I haven't played with it yet. The idea is to present to the business rules owner/author something that is removed from Java and UI code. The Java/UI author presents different types of services, such as Web services, POJO, Page services for JSF/Flex/JavaFX, etc.

Right now it only works on Seam (as Exadel is a JBoss partner).

I also asked the question "What if you had multiple UI types in your app: JSF/Flex/JavaFX. Could you set up a 'generic' type of page service so that the process can be shared across all 3 of the UI types?" The answer is that it's in development for the next release.

Also, someone asked if there's Drools integration. The answer is not yet--they're working on it for the next release.

My setup at home is Comcast for ISP, and Vonage for phone service. I haven't noticed any severe degradation in my Comcast service recently... But then again, I probably don't use my home phone often, and perhaps I'm not online when outages occur.

However, that all changed on Thursday, and according to my neighbors this wasn't a one-time thing. On Thursday Feb 19, 2009, after 2pm and 4pm or so, my connection was pretty unusable. I couldn't use Vonage in a reasonable manner, and had real problems transferring data. This wasn't a complete inability to transfer data; the connection was dropping packets periodically. See the graph that I generated from dslreports.com about that time.

So I called 1-800-COMCAST and I hit the right buttons to report trouble with my "high speed Internet." A recording informed me that there was trouble in my area, that technicians were working on it, and offered to call me back when it was better. It got better before 11pm that night, but I got my automated call back on Friday at 11am.

Some theories about what's happening:

Recently, about 2 or 3 weeks ago, Comcast made a change and blocked my incoming port 25. Ever since I've had Comcast I could never make an outgoing connection on port 25, which is "normal" for an ISP. But blocking incoming port 25 is deadly if you're attempting to run your own mail server.

The reason I mention the blocked port 25 is that I believe that these problems are related to Comcast's recent changes to control their bandwidth utilization. Comcast is notorious for sending forged TCP control packets to upset your P2P transfers: that's like blowing out the tires of cars on a congested highway because cars might be carrying illegal contraband. In any case, they have been rightfully remorseful and punished, at least in reputation, for this behavior.

Because of this "turnaround," they've come up with a new scheme to control their traffic. Of course, every ISP has a right to control their traffic so a single subscriber doesn't swamp all other users. But perhaps their current implementation is still a little green... And thus our current connectivity troubles.

And you can see here we (Bay Area) were scheduled for the switchover at the end of November 2008 or so, but it probably happened more recently, or there are more changes than a simple switch, and it takes time to convert all neighborhoods in the Bay Area. But apparently, according to this, the new system has been 100% online since a month and a half ago.

I just did another test, and at this time it's much better, but still not great. I should probably set up a monitoring schedule with dslreports.com; it costs a little bit of money, but no big deal.

Update
I've set up the monitoring tool. Here are my up-to-date line monitoring results:
East Coast Hourly Daily
West Coast Hourly Daily

About a year ago I patched my own log4j to fix the fact that it can swallow InterruptedException via InterruptedIOException. Then I filed a bug against log4j, and got into an email conversation with Curt Arnold, who rightly pointed out that there were other scenarios where InterruptedException could be easily ignored. Anything that wraps an InterruptedException or InterruptedIOException and rethrows something that doesn’t derive from them is effectively ignoring the intended effect of a thread interrupt. The most common examples of this are java.lang.reflect.Method.invoke and java.lang.reflect.Constructor.newInstance which both throw java.lang.reflect.InvocationTargetException, which can have this problem.
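
To make the reflection case concrete, here is a minimal sketch of a wrapper around Method.invoke that looks through the cause chain of an InvocationTargetException and restores the thread's interrupt status when it finds an interrupt underneath. It is not the log4j patch mentioned above; the class and method names are invented for the example. It also assumes that java.net.SocketTimeoutException, which extends InterruptedIOException but only signals a timeout, should not be treated as an interrupt.

```java
import java.io.InterruptedIOException;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.net.SocketTimeoutException;

// Hypothetical helper, for illustration only.
class InterruptAwareInvoke {
    // True if the throwable, or anything in its cause chain, signals a thread interrupt.
    static boolean isInterrupt(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof InterruptedException) {
                return true;
            }
            // SocketTimeoutException is an InterruptedIOException, but it is only a timeout.
            if (c instanceof InterruptedIOException && !(c instanceof SocketTimeoutException)) {
                return true;
            }
        }
        return false;
    }

    // Example target: pretends to be interrupted while blocked.
    public static void blockingWork() throws InterruptedException {
        throw new InterruptedException("interrupted while waiting");
    }

    // Invoke a static no-argument method reflectively. If the wrapped failure was
    // really an interrupt, restore the interrupt status before rethrowing.
    static void invoke(Method m) throws Exception {
        try {
            m.invoke(null);
        } catch (InvocationTargetException e) {
            if (isInterrupt(e)) {
                Thread.currentThread().interrupt();
            }
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        Method m = InterruptAwareInvoke.class.getMethod("blockingWork");
        try {
            invoke(m);
        } catch (InvocationTargetException e) {
            System.out.println("interrupt status restored: " + Thread.currentThread().isInterrupted());
        }
    }
}
```

Restoring the status rather than swallowing it lets code further up the stack still notice that the thread was asked to stop.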

The title of this post may be a bit misleading; java.io.InterruptedIOException doesn’t cause this problem, it just makes it at least twice as difficult, because you must always check for both InterruptedException and InterruptedIOException in wrapped exceptions. But it also means that any method that throws java.io.IOException must have special handlers for an interrupted thread.

I tried to find a bug in Sun’s database that warns about InterruptedIOException and these cases, but the closest I could find was Sun bug 4385444. That doesn’t really have anything to do with it.

When I first saw the changes made for java.nio for interruptible IO, I thought the use of InterruptedIOException was clever and elegant. But because of the very special nature of InterruptedException, I changed my mind—of course, there wasn’t much option, because java.nio integrates with existing java.io interfaces and methods which already do not throw InterruptedException, therefore they had to follow that path. It’s really a difficult situation; it isn’t the first case in Java of a hidden or wrapped InterruptedException, it just makes it more widespread. Now you have to handle an interrupted thread anywhere java.io.IOException is thrown.
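
As a sketch of what "handling an interrupted thread anywhere IOException is thrown" could look like, the following wrapper converts an InterruptedIOException back into an InterruptedException so callers only have one interrupt signal to deal with. The IoTask interface and runIo method are hypothetical names for this example, and passing SocketTimeoutException through as an ordinary IOException is my own assumption about the desired behavior.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

// Hypothetical helper, for illustration only.
class IoInterruptCheck {
    interface IoTask {
        void run() throws IOException;
    }

    // Run an I/O task; surface an interrupt hidden in an IOException as InterruptedException.
    static void runIo(IoTask task) throws IOException, InterruptedException {
        try {
            task.run();
        } catch (SocketTimeoutException e) {
            // A timeout, not a thread interrupt, despite its superclass.
            throw e;
        } catch (InterruptedIOException e) {
            InterruptedException ie = new InterruptedException(e.getMessage());
            ie.initCause(e);
            throw ie;
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            runIo(() -> { throw new InterruptedIOException("read interrupted"); });
        } catch (InterruptedException e) {
            System.out.println("caught as InterruptedException: " + e.getMessage());
        }
    }
}
```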

Note that on Solaris (x86 and Sparc) java.io methods can also throw InterruptedIOException. You don’t need to use java.nio to see this effect. On other platforms you only have to worry about this if you (or things you call) use java.nio.

I’ve come to the conclusion that thread interrupts are so special, and currently so difficult to deal with, that Java should treat InterruptedException as a third type of exception: one that is implicitly thrown from every method. Of course this opens up its own can of worms, not to mention that it’s about 15 years too late to make such a change.

This also speaks to the fact that you should be following a pattern where it’s rare that you catch exceptions that you don’t understand, and should be catching them as far up as possible, where you can centralize your exception handling. I think this is also a good case for frameworks like Spring and Seam which use AoP; each method invocation can have a carefully thought out exception handler, either via your own AoP, or directly handled in the framework.

Update I’ve found bug 4176863 which is related to this issue, but more importantly, the paper Java Thread Primitive Deprecation. I had read this long, long ago, and thought it important to link here.

I found a nice short description about why you get what you get in Xcode when you create a new source file:

http://symmetricdesigns.com/component/content/article/3-misc/12-changing-copyright-notice-for-new-xcode-projects-just-change-your-address-card.html

and related, how you should configure your git global settings and project settings to get the right information about you:

http://github.com/guides/tell-git-your-user-name-and-email-address

I’ve run my own mail server in my house for quite a long time now, with no problems, no downtime, and it just works. Not anymore… Comcast has finally gotten around to my account to block my incoming port 25. As far as I can tell this started at midnight Thursday morning.

Several years ago they blocked my outgoing port 25, unless I used the Comcast MTA. That’s OK… so that’s what I did, reconfigured my postfix to use their MTA. But now that doesn’t even work—until I change it to use port 587.

A call to customer support gives you the expected response: “Are you using XP or Vista?” “You can’t read email in Outlook?” Of course, none of this is relevant. When the tech support person carries the appropriate information to the supervisor, the expected response is received: this is the policy for Comcast subscribers and there’s no option around it.

But there are still options… Here’s my list that I’ve been considering:

  • Plead with Comcast Has anyone had success with this approach?

  • Switch ISP There really aren’t many options here in the Bay Area. I’ve tried AT&T, other medium sized and smaller DSL providers, and they all have their disadvantages, including blocking port 25. But I am forever hopeful that someday we’ll get FiOS and they’ll be good enough not to do port blocking or other evil ISP things.

  • Pobox.com This is the service I’ve been using for 12 years now. They forward my pobox.com email address to one that I specify. Until yesterday that was an address on a machine in my closet. Now I have it forwarded to gmail. I’ve asked them if they can forward to a port other than 25, but I haven’t gotten a response yet…

  • No-IP This is a little different than Pobox.com. You point your MX record at their servers and they “reflect” the email right into your server with whatever address and port you give them. This costs $40 a year… The benefit over pobox.com is that I can use this for whatever email address I like with my own domain. There are other vendors, such as AuthSMTP and DynDNS (which I use for DNS), and there’s a list that’s slightly out of date here.

  • GMail I can just stick with GMail and be done with it. You can find lots of discussions about using GMail, or any free email service. I just would have preferred to have some control over my own data… Update: I discovered that gmail is rewriting my outgoing email address with xxx@gmail.com (this is a problem because I want everyone to remember my “permanent” address at pobox.com which is forwarded to gmail); but, you can actually teach gmail your intended email address. I saw this tip in this lifehacker article.

We've discovered a bug in the Solaris JVM that we're using (1.5.0_08-b03). What happens is that we have a long running JVM that once a minute forks an executable via java.lang.Runtime.exec and reads its stdout. After a long time, one of the forks doesn't actually make it to the exec, and the thread that asked for the exec doesn't continue. That's why I think it's a JVM bug: the exec call is made and only completes half of its job.

I describe the problem on a Sun forum here.

I definitely have to upgrade the JVM to see if that helps...

This is the second time this has bitten me: I want to use Hibernate with standard JPA to persist my entities. And I want my Entities autodetected, as Hibernate is capable of, even outside of a JEE container. So in my persistence.xml I have this bit of code:

      <property name="hibernate.archive.autodetection" value="class,hbm"/>

But I do not compile this persistence.xml into my .jar file; instead I just make it part of my classpath for my unit tests, thinking this will make things more flexible. And of course, this doesn’t work. The autodetection only works if the persistence.xml is located in the META-INF directory of the jar file that contains the Entities to be detected. See my post in the Hibernate forum.

If you’re a JSF newbie (like me), and you’re using Seam, you might be tempted to take one of the examples and hack away at it. For instance, in the booking example, the first page is called home.xhtml. After you type up some tags, you want it to run, so you point your browser at:

http://localhost:8080/myapp/home.xhtml

What you’ll find is not a JSF rendered page, but your JSF tag source! Then you think that you’ve misconfigured something, so you look over everything. But what’s really going on is that the Seam example isn’t configured to render that page. Instead you should go to:

http://localhost:8080/myapp/home.seam

That will render the home.xhtml as JSF. Look in your war file’s WEB-INF/web.xml, and find this bit of code:

  <servlet-mapping>
    <servlet-name>Faces Servlet</servlet-name>
    <url-pattern>*.seam</url-pattern>
  </servlet-mapping>

This means that the Faces Servlet logic executes against URLs that end with .seam. However, being a newbie, I’m not quite sure what maps that against the .xhtml file, unless it’s this previous entry:

  <context-param>
    <param-name>javax.faces.DEFAULT_SUFFIX</param-name>
    <param-value>.xhtml</param-value>
  </context-param>

Also being a newbie, I’m not sure what I should do to configure things so that it’s impossible for a user to download my JSF code…

Anyhow, what makes this all work by default is the file index.html, which is what will load when you visit:

http://localhost:8080/myapp/

and that contains the following:

<html>
  <head>
    <meta http-equiv="Refresh" content="0; URL=home.seam">
  </head>
</html>

which naturally redirects you to the appropriate URL to start rendering your JSF/Seam application.

Thanks to anyone who points out the answers to the above mysteries (that are probably in the docs somewhere…)

Using JBoss AS and Seam

Things to do when using JBoss AS (4.2.3.GA) and Seam (2.1.0.SP1):

  • Download the latest (at least 1.8.0) commons-beanutils for your .ear. It fixes a memory leak you'd see after you redeploy your .ear.
  • If you're running on OS X, disable the unnecessary dock icon from JBoss by adding -Djava.awt.headless=true. This might also solve problems on Linux/Solaris boxes that you're ssh'd onto and you don't have a DISPLAY environment variable set for X Windows.
  • Use your own jboss-log4j.xml file, which by default is in ${jboss.home}/server/default/conf/jboss-log4j.xml. You probably don't need DEBUG output emitted to your console...

OS X and Java SE 6

It appears that you can run Java SE 6 on a Mac, but only if you're Intel 64-bit. No PowerPC, no 32-bit Intel.

But from my terminal on a 64-bit Intel machine with all updates:

$ type java
java is /usr/bin/java

$ java -version
java version "1.5.0_16"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_16-b06-284)
Java HotSpot(TM) Client VM (build 1.5.0_16-133, mixed mode, sharing)

In order to get Java SE 6 running on your Mac, open the Java Preferences (at /Applications/Utilities/Java/Java Preferences) and drag Java SE 6 to the top of the list for Java applet versions and/or Java application versions. Now I get:

$ java -version
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

Note that you may not be very happy with the result... Given the history of Java support on OS X, perhaps I should try SoyLatte.

I'm working on a new project and using git for the first time; I really like the distributed source control model, it makes a lot of sense. I come from using p4 for quite a long time, so git feels a lot like CVS/Subversion, but it is obviously a lot more powerful than CVS. Also, I want to make sure that I do not say that git is hard to understand and use.

The way I work with my projects is by submitting not only all my source code for my project, but also all tools that are used to generate my project. (Obviously I draw the line somewhere, I don't submit my Linux distribution that I use on the build machine.) That way I can guarantee reproducibility and no one on the team has to install their own tools before being productive.

A short time into my new project I already have more than 500,000 files in my git repository. Now every time I use any git command, it just sits there for many seconds... even when I open a file using Emacs, I have a severe delay because git is invoked to check its status. Using dtrace I've determined that it is doing an lstat on every single file in the repository, to make sure that it's not out of date.

Therefore beware: git is not good for large repositories. And when I say large, I don't mean "sorta large" like 20,000 files. I mean more on the order of 500,000 or 2.5 million files (the current size of our p4 repository). This is because git is just like CVS: it automatically determines for you which files you have edited/deleted/added, and which you have not, but it does this by doing an lstat on every single file in your tree. p4 does not make this assumption, and requires you to tell it what files you have edited/deleted/added, therefore it never does an lstat of the files in your tree (unless you invoke some special commands to ask it to look).
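
To get a feel for why this scales with the number of files, here is a rough Java sketch that walks a tree and reads the attributes of every file, which is roughly what a per-file lstat amounts to. It is not how git is implemented; it only illustrates that any tool that must look at every file pays a cost proportional to the file count. The class name is invented for the example.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, for illustration only.
class TreeStatCost {
    // Visit every non-directory entry under root. The walk reads each entry's
    // attributes before calling visitFile, so the work grows with the file count.
    static long countFiles(Path root) throws IOException {
        AtomicLong files = new AtomicLong();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                files.incrementAndGet();
                return FileVisitResult.CONTINUE;
            }
        });
        return files.get();
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long start = System.nanoTime();
        long n = countFiles(root);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " files visited in " + ms + " ms");
    }
}
```

Running it against a tree with hundreds of thousands of files gives a lower bound on what any "look at everything" status check will cost on that machine.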

It seems there are attempts to optimize the lstat operations, but it is part of the design of git, and would likely be unnatural to avoid it or suppress it.

I found the git benchmarks, which describe how efficient it is for the size of its repositories, which is great. However, they do not mention the number of files in a repository. And I couldn't find the current size of the Linux kernel repository, but I found a mention that the pull of an entire tree was 5,000 files, which included all of git's metadata (if I understood it correctly).

In order to work around git's slowness with large numbers of files, I decided to split my git repository into two halves; the half with the tools, and the other half with my project source. I found a tool called git split that seems to do the job (see http://people.freedesktop.org/~jamey/git-split), but it didn't work on my git 1.6.0 repository. It got stuck because it couldn't read my .git/info/grafts. So I gave up there and just deleted all the tools out of my tree, and added them into a separate git repository. That made things much better.

I have a project where I'm using Hibernate with JPA annotations. I got stuck where the annotations to one of my attributes wouldn't stick, but I finally figured it out:


You have to annotate either the member (in which case all attributes must be annotated via member), or the get method for your attribute. You can't annotate the set method.

Beware Windows exit codes

I have a Java process that is forked from a Windows service; the service process monitors the child Java process to make sure it's OK.

However, I was seeing in my logs that sometimes I would get the exit code 143. My code does not generate this exit code; further, there are no .dmp files or Java crash reports or exceptions logged or any other indication in my code that the process exited. The explanation is simple: my service and all other processes were receiving CTRL_SHUTDOWN_EVENT. My service process ignores this, but Java does not (see IBM's excellent document on Java Signal Handling).

The result is that my Java process would exit, my monitor would not know why, and restart it, at which time it would again get a shutdown event, etc, until the machine actually completed its shutdown.

Needless to say my service process now watches for those events...
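
For what it's worth, 143 is 128 + 15, and 15 is the number of SIGTERM, so the code fits the common "128 plus signal number" convention for a process that was asked to terminate. A monitor could use a check like the following sketch to tell such an exit from a crash; the class and method names are made up, and treating every code between 129 and 255 as signal-related is an assumption, not something the JVM guarantees.

```java
// Hypothetical helper, for illustration only.
class ExitCodes {
    // Returns the signal number if the code follows the 128 + N convention, else -1.
    static int signalNumber(int exitCode) {
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    static String describe(int exitCode) {
        int sig = signalNumber(exitCode);
        return sig < 0
            ? "exit code " + exitCode
            : "exit code " + exitCode + " (128 + signal " + sig + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(143));
    }
}
```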

In all our Windows executables we AddVectoredExceptionHandler so that we can get .dmp files when things crash. However, I recently discovered that doing so prevents Purify from working correctly on Windows. This should probably be expected, and I'm probably not chaining to the next handler correctly, but it's something to watch out for.

Subtopic #1: In all other operating systems, if you want a core dump file, it's very straight-forward; you ulimit -c unlimited, and you get core files. In Windows, you have to write your own Dump file from code that you write. You could depend on Dr. Watson, or Windows Error Reporting, but if you want your own dump file in a place where you or your customers can find it, you have to write it yourself.

Subtopic #2: Fifteen years ago, Purify was the most important development tool you could have. It's still the most important development tool you could have, except that it hasn't really changed in 15 years. I have several friends that are ex-Pure, and they are also dismayed at how it's just not keeping up. For instance:
  • It wasn't until 2004 (or was it 2006?) that you could really use Purify on Linux. Wow.
  • We still can't Purify JNI C code in a Java JVM on Linux. The JVM does magic things that Purify can't handle. Oh well.
  • Purify for Windows doesn't even work on x64.

I spent some time determining that the sample code for ndisprot that is included in the 6001.18001 WDK has a bug in the IRP cancel code. What's worse is that the bug only exists in the directory labeled 5x, not the directory labeled 60.

For some (at least 2) of the samples in the latest WDKs, they made a clean break from the NDIS 5.x and earlier sample code when they wrote the samples for NDIS 6.0. In the NDIS 6.0 sample they wrote the cancel code according to the pseudo code in MSDN. Unfortunately, they didn't back-port those changes/fixes to the sample code in the 5x directory (sample code intended for NDIS 5.x).

My full description/conversation is here:

http://www.osronline.com/showthread.cfm?link=142677

That sounds innocuous, right?

Wrong.

When you call InetAddress.getLocalHost(), a reverse DNS lookup for your hostname is done. In the worst case, you’ve specified a DNS server that isn’t reachable, and so you have to wait for the DNS timeout, which can be quite long, like 30 seconds or 2 minutes. The reason the crypto code in JCE is doing this is for a random seed generator. Seems you could find something else more random than your hostname…

Below I’ve replicated the sample code that I created for this fix, in case it’s of any use to anyone:

I’ve found what I believe is a workaround to this problem, that seems to work against Java6. It works by setting the system property impl.prefix, and using implementations derived from the following classes:

java.net.PlainDatagramSocketImpl
java.net.Inet4AddressImpl
java.net.Inet6AddressImpl

The override implementations of Inet4AddressImpl and Inet6AddressImpl are designed to make sure that InetAddress.getLocalHost() returns an answer without causing any network access. That means that SSL connections, when constructing their random seed that includes the local hostname, will not hang when DNS cannot be reached.

The reason PlainDatagramSocketImpl is overridden is that the system property impl.prefix is also used to construct it; if impl.prefix is not specified, then a prefix of “Plain” is assumed, and thus PlainDatagramSocketImpl is loaded. Therefore we must provide an implementation with our own matching prefix.

The main class, DefeatGetLocalHost sets the system property impl.prefix to “DefeatGetLocalHost”. This will cause the following classes to be loaded when they are needed:

java.net.DefeatGetLocalHostDatagramSocketImpl
java.net.DefeatGetLocalHostInet4AddressImpl
java.net.DefeatGetLocalHostInet6AddressImpl

The reason that these derived classes are placed in the same package, java.net, is that the constructors and methods of the base classes are package-private; therefore placing them in the same package provides the highest level of compatibility.

Also, in order to get our derived classes in package java.net to load in the Java runtime, we have to append to the boot classpath. This is done with: -Xbootclasspath/a: after which we specify the directory with our class files.

In the next comment are the source files that I wrote to demonstrate. Compile it and execute DefeatGetLocalHost using -Xbootclasspath/a: to include the overridden classes.

java/net/DefeatGetLocalHostDatagramSocketImpl.java:

package java.net;

class DefeatGetLocalHostDatagramSocketImpl extends PlainDatagramSocketImpl {
}

java/net/DefeatGetLocalHostInet4AddressImpl.java:

package java.net;
import java.io.IOException;

class DefeatGetLocalHostInet4AddressImpl extends Inet4AddressImpl {

    public String getLocalHostName() {
        System.out.println("Using implementation " +
                           this.getClass().getName() + ".getLocalHostName");
        return "localhost";
    }

    public InetAddress[] lookupAllHostAddr(String hostname)
        throws UnknownHostException {

        System.out.println("Using implementation " +
                           this.getClass().getName() + ".lookupAllHostAddr");

        if (hostname.equals("localhost")) {
            return new InetAddress[] {
                InetAddress.getByAddress(new byte[] {
                        (byte)127, (byte)0, (byte)0, (byte)1
                    })
            };
        }

        return super.lookupAllHostAddr(hostname);
    }
}

java/net/DefeatGetLocalHostInet6AddressImpl.java:

package java.net;
import java.io.IOException;

class DefeatGetLocalHostInet6AddressImpl extends Inet6AddressImpl {

    public String getLocalHostName() {
        System.out.println("Using implementation " +
                           this.getClass().getName() + ".getLocalHostName");
        return "localhost";
    }

    public InetAddress[] lookupAllHostAddr(String hostname)
        throws UnknownHostException {

        System.out.println("Using implementation " +
                           this.getClass().getName() + ".lookupAllHostAddr");

        if (hostname.equals("localhost")) {
            return new InetAddress[] {
                InetAddress.getByAddress(new byte[] {
                        (byte)127, (byte)0, (byte)0, (byte)1
                    })
            };
        }

        return super.lookupAllHostAddr(hostname);
    }
}

DefeatGetLocalHost.java:

public class DefeatGetLocalHost {

    public static void main(String[] args) {
        try {
            safeMain(args);
        } catch(Throwable e) {
            e.printStackTrace();
        }
    }

    private static void safeMain(String[] args)
        throws java.net.UnknownHostException, java.net.SocketException {

        System.setProperty("impl.prefix", "DefeatGetLocalHost");

        System.out.println("Getting localhost:");
        System.out.println(java.net.InetAddress.getLocalHost().getHostAddress());

        System.out.println("Creating DatagramSocket:");
        java.net.DatagramSocket dg = new java.net.DatagramSocket();
        dg.close();

        System.out.println("Success");
    }
}

I'm actually now a GNU Emacs user, not an XEmacs user, but in case this is helpful to anyone I've recorded how I figured out how to get the XEmacs source to compile on Ubuntu 7.10.

The reason I'm no longer an XEmacs user is because it's just not keeping up with GNU Emacs, but more importantly, there is an extremely annoying bug...

Update: It appears that the "Xlib: unexpected async reply" that I refer to above is fixed in Ubuntu 8.04. Too bad I'm now using GNU Emacs...

Most of the info about this I put in this thread: http://communities.vmware.com/thread/115466

This issue was affecting me quite a bit after upgrading to OS X Leopard with VMware Fusion. I helped by collecting the kernel stack traces for the VMware engineer Ben Gertzfield. From there he reproduced the problem and determined that this is a bug in the OS X kernel. Apple Radar bug 5679432 was filed, and fixed in 10.5.3.

Read this forum entry I wrote about what could be a common NSIS coding mistake with macros. For instance, this code:

IfErrors 0 +2
!insertmacro LogProgressMessage '"There was an error..."'

will likely cause the error "Installer corrupted: invalid opcode" at runtime. Instead of using +2, you should use a label.

Just filed a bug against log4j; it can swallow InterruptedException delivered to a thread that uses logging. This is because it catches IOException and ignores it, but IOException can also be InterruptedIOException, which can be generated on some platforms when a thread is interrupted.

Update: As of 8/14/2008, the log4j developer Curt Arnold has bug 44157 fixed, and the current plan is to ship the fix with log4j 1.2.16.
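
Here's a minimal sketch of the pattern that avoids swallowing the interrupt. It is not log4j's actual code or its fix; the class InterruptPreservingWriter and its write method are made up for illustration. The idea is to catch InterruptedIOException separately from the other IOExceptions and put the thread's interrupt flag back.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.Writer;

/** Hypothetical appender-like helper; not log4j's actual code. */
class InterruptPreservingWriter {

    private final Writer out;

    InterruptPreservingWriter(Writer out) {
        this.out = out;
    }

    /** Writes a message; returns false if the write failed. */
    boolean write(String message) {
        try {
            out.write(message);
            out.flush();
            return true;
        } catch (InterruptedIOException e) {
            // The I/O was cut short by an interrupt: restore the flag so
            // the calling thread can still notice it was interrupted.
            Thread.currentThread().interrupt();
            return false;
        } catch (IOException e) {
            // Any other I/O failure is only reported, as a logger might do.
            System.err.println("write failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in Writer that simulates a platform raising
        // InterruptedIOException during a write.
        Writer interrupted = new Writer() {
            @Override
            public void write(char[] buf, int off, int len) throws IOException {
                throw new InterruptedIOException("simulated interrupt");
            }
            @Override
            public void flush() { }
            @Override
            public void close() { }
        };
        boolean ok = new InterruptPreservingWriter(interrupted).write("hello");
        System.out.println("written=" + ok
                + " interrupted=" + Thread.currentThread().isInterrupted());
    }
}
```

With the second catch block alone, the caller would never find out that it had been asked to stop.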

The code I'm working on manipulates routing tables on three different platforms: Linux, Windows and Solaris. Each of them has a different behavior for different scenarios. Here I attempt to document those differences.

First some definitions:

  • Interface A device which can directly reach a subnet via ARP or other protocols. An example is eth0.
  • Direct route A route which indicates which Interface to use to reach a directly connected subnet.
  • Gateway route A route which indicates a gateway to use to reach a subnet which is connected via a router.
  • Default route A special case of a Gateway route in which the destination subnet is all possible addresses.
  • Host route A special case of either a Direct route or a Gateway route in which the destination is a single machine.
  • Multicast route A special case of a Direct route in which the destination subnet is the multicast address space, 224.0.0.0/4 or a subset thereof.

Note that the example route entries in this table are based on the format emitted by Linux's /sbin/ip route.

| Behavior | Linux 2.6 | Windows 2003 | Solaris 10 |
|---|---|---|---|
| The interface chosen to access a gateway via a route is determined by traversing the route table, not hardcoded into the route entry. For instance, default via 192.168.0.1 does not need the specification dev eth0, because that is determined by finding the direct route 192.168.0.0/24 dev eth0. | no | no | yes [1] |
| A Gateway route which is also a Host route can be added where the destination is an address that exists in a subnet of another Direct route. For instance, the route 192.168.0.20/32 via 192.168.0.1 dev eth0 exists while the direct route 192.168.0.0/24 dev eth0 also exists. | yes [2] | yes [2] | yes [2] |
| A Direct route which is also a Host route can be added where the destination is an address that exists in a subnet of another Direct route. For instance, the route 192.168.0.20/32 dev eth0 exists while the direct route 192.168.0.0/24 dev eth0 also exists. | yes | yes | yes |
| Direct routes can be deleted. | yes | yes [3] | yes |
| Route priority can be programmatically controlled. | yes | yes | no [4] |
| When an interface is administratively taken down, do the associated Direct route entries disappear? | yes | yes | yes |
| When an interface is administratively taken down, and associated Direct route entries disappear, do they return when the interface is brought up again? | yes | yes | yes |
| When an interface is administratively taken down, do the associated Gateway route entries disappear? | yes | yes | yes |
| When an interface is administratively taken down, and associated Gateway route entries disappear, do they return when the interface is brought up again? | no | yes | no |
| When an interface is administratively down, is it an error to add a route that references that interface? | yes | yes | yes |
| When an interface is unplugged, do the associated route entries disappear? | no | yes | no |
| When an interface is unplugged, and associated route entries disappear, do they return when the interface is plugged in again? | N/A | yes | N/A |
| When an interface is unplugged, is it an error to add a route that references that interface? | no | yes [5] | no |
| When an interface is unplugged, and associated route entries do not disappear, will an alternate route be chosen because the interface is unplugged? | no | N/A | no |
| If two interfaces are connected to the same subnet, will ARP respond on either interface for an address on one of the interfaces? | yes | no | no |
| Can routes be modified for all attributes including priority? An answer of no means they must be destroyed and recreated to modify attributes. | no | yes | no |
| Does the operating system create Multicast routes by default? | no | yes | no |
| Can multicast routes be deleted? | yes | yes [3] | yes |
| If Multicast routes do not exist, do multicast packets exit the machine? To where? | yes [6] | no | no |
| Can two routes be created with the same destination and priority, but a different interface? | no | yes | yes [7] |
| Can a Gateway route be specified with an interface, where that interface does not have a Direct route for the gateway in the Gateway route? | no [8] | yes [9] | no [8] |
| When you remove a Direct route which is required by a Gateway route to reach the gateway, does the Gateway route disappear? | no | no | no |
| If you specify a Gateway route with a gateway that is not reachable via a Direct route, is this allowed? | no | yes [10] | no |

1 This configuration is possible if the route entry is set for this behavior. The route entry can also be configured for a specific interface.

2 When a ping is performed on the destination address, it is sent to the gateway, not via the direct route; this is what should be expected by following the rules in reading a route table. However, the gateway in my test was a Linux 2.6 machine, and it rejected the ping with an ICMP of "unreachable." This means such a configuration is possible, but useless.

3 One cannot directly delete a default route or other "protected" routes, but there is a way to fool Windows into deleting it. I found this fascinating discussion

4 The metric attribute cannot be set for a route in Solaris.

5 The error that's returned is ERROR_INVALID_PARAMETER. That doesn't differentiate this condition from other problems.

6 It appears to choose the first available interface.

7 This question is partly irrelevant in Solaris; the priority or metric cannot be set for a route. However, you can create two routes with the same destination but different interfaces.

8 It works even if the Direct route is on another interface, but it must exist.

9 You can set the route, but it doesn't do anything.

10 This strange behavior is apparently allowed; the source address that's used is the address on the interface that is preferred for the Default gateway.
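
To make the definitions and the first row of the table concrete, here's a rough sketch of how an interface could be found by traversing a route table. It models only the idea (longest-prefix match, then resolving the gateway through a Direct route); it is not how any of the three kernels actually implements it, and the class RouteTableSketch and everything in it are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a route table; not any operating system's API. */
class RouteTableSketch {

    /** One route entry. A null gateway means this is a Direct route. */
    static final class Route {
        final int network;      // destination network as a 32-bit value
        final int prefixLen;    // 0 for a Default route, 32 for a Host route
        final String gateway;   // dotted-quad gateway, or null
        final String dev;       // interface name, or null if not pinned

        Route(String network, int prefixLen, String gateway, String dev) {
            this.network = toInt(network);
            this.prefixLen = prefixLen;
            this.gateway = gateway;
            this.dev = dev;
        }

        boolean matches(int addr) {
            // A shift by 32 is a no-op in Java, so /0 needs its own case.
            int mask = prefixLen == 0 ? 0 : -1 << (32 - prefixLen);
            return (addr & mask) == (network & mask);
        }
    }

    private final List<Route> routes = new ArrayList<Route>();

    void add(Route r) {
        routes.add(r);
    }

    /** Longest-prefix match; directOnly restricts the search to Direct routes. */
    private Route lookup(int addr, boolean directOnly) {
        Route best = null;
        for (Route r : routes) {
            if (directOnly && r.gateway != null) continue;
            if (r.matches(addr) && (best == null || r.prefixLen > best.prefixLen)) {
                best = r;
            }
        }
        return best;
    }

    Route lookup(String address) {
        return lookup(toInt(address), false);
    }

    /** Finds the outgoing interface, resolving a gateway via a Direct route. */
    String interfaceFor(String address) {
        Route r = lookup(address);
        if (r == null) return null;
        if (r.dev != null) return r.dev;        // interface pinned in the entry
        if (r.gateway == null) return null;
        Route direct = lookup(toInt(r.gateway), true);
        return direct == null ? null : direct.dev;
    }

    static int toInt(String dottedQuad) {
        int value = 0;
        for (String part : dottedQuad.split("\\.")) {
            value = (value << 8) | Integer.parseInt(part);
        }
        return value;
    }

    public static void main(String[] args) {
        RouteTableSketch table = new RouteTableSketch();
        table.add(new Route("192.168.0.0", 24, null, "eth0"));      // Direct route
        table.add(new Route("0.0.0.0", 0, "192.168.0.1", null));    // Default route, no dev
        System.out.println(table.interfaceFor("10.1.2.3"));         // eth0, via the gateway
        System.out.println(table.interfaceFor("192.168.0.20"));     // eth0, directly
    }
}
```

The Default route here carries no dev eth0; the interface falls out of the Direct route that covers the gateway, which is the behavior the first row asks about.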

Java5 includes a great metadata system called Annotations. One of them is really good for catching a common bug--a method that should override a base class method, but is misnamed or has the wrong argument types. It's called @Override.
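
A small sketch of the bug it catches; the classes Base, Careless and Careful are made up for illustration. Careless.describe takes the wrong argument type, so it silently becomes an overload instead of an override. Putting @Override on it would turn the mistake into a compile error.

```java
/** Hypothetical classes illustrating what @Override catches. */
class OverrideDemo {

    static class Base {
        String describe(Object o) {
            return "base";
        }
    }

    static class Careless extends Base {
        // Meant to override, but the parameter type differs, so this is
        // only an overload. With @Override here, javac would reject it.
        String describe(String s) {
            return "careless";
        }
    }

    static class Careful extends Base {
        @Override
        String describe(Object o) {
            return "careful";
        }
    }

    public static void main(String[] args) {
        Base a = new Careless();
        Base b = new Careful();
        System.out.println(a.describe("x")); // prints "base"
        System.out.println(b.describe("x")); // prints "careful"
    }
}
```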

But what would make this annotation really useful is an optional compiler warning that flags every overriding method that lacks the @Override annotation. Then you could annotate all of your methods and be sure that the ones that should override something really do, and the ones that should not really do not.

There's an outstanding filed bug for this feature, but it doesn't look like it's going to get implemented...



I was trying to figure out where a memory leak was coming from on Windows, and didn't have the luxury of using Purify, and this really helped:

http://blogs.msdn.com/greggm/archive/2004/02/12/72209.aspx

Essentially, VirtualAlloc is the equivalent of sbrk in other OS's, and allocates virtual pages to the process. If you can find out what's calling that all the time, you're likely to discover what's allocating memory.

Thanks Roy...

Roy West was nice enough to quote me in his blog: "You don't want to ship an experiment to a customer." Generally not a good idea...

Signals in Java

Revelations on Java Signal Handling is an important document to read if you want to know how to catch signals in Java.

You should also read this link for a recent discussion on how to interrupt threads that are listening on Sockets in Java.
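
One commonly used approach to that last problem, sketched here without claiming it is what the linked discussion recommends, is to close the socket from another thread: a thread blocked in ServerSocket.accept() then gets a SocketException. The class CloseToInterrupt and its demo method are made-up names.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.SocketException;

/** Hypothetical demo: unblock a listening thread by closing its socket. */
class CloseToInterrupt {

    /** Returns true if the blocked accept() ended with a SocketException. */
    static boolean demo() throws IOException, InterruptedException {
        // Port 0 asks for any free port; the literal address avoids DNS.
        final ServerSocket server =
                new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"));
        final boolean[] sawSocketException = { false };

        Thread listener = new Thread(new Runnable() {
            public void run() {
                try {
                    // Blocks here; Thread.interrupt() generally does not
                    // wake a thread blocked in a classic java.net accept().
                    server.accept();
                } catch (SocketException e) {
                    // Raised when another thread closes the socket.
                    sawSocketException[0] = true;
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });
        listener.start();

        Thread.sleep(200);    // give the listener time to block in accept()
        server.close();       // this is what unblocks the listener
        listener.join(5000);  // join() also makes the flag write visible here

        return !listener.isAlive() && sawSocketException[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("accept() unblocked by close: " + demo());
    }
}
```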

I suddenly seem to have all kinds of problems with the JVM crashing when I'm creating it in our monitor code. The way things work is that I have an executable that links java.so instead of using the shipped java executable. I call this the "driver." Here's what I've found:

The driver will often (but not most of the time) crash, only when -Xdebug is given, with the following stack trace:

gdb build/debug.linux.x86.rhel3/bin/scdriver_debug core.28224
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
Core was generated by `/home/jared.oberhaus/jared.oberhaus-linux3-all/shared/1.2/build/debug.linux.x86'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjava.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjava.so
Reading symbols from /lib/tls/libpthread.so.0...done.
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libverify.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libverify.so
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/native_threads/libhpi.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/native_threads/libhpi.so
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_ldap.so.2...done.
Loaded symbols for /lib/libnss_ldap.so.2
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
Reading symbols from /usr/lib/sasl/libanonymous.so...done.
Loaded symbols for /usr/lib/sasl/libanonymous.so
Reading symbols from /usr/lib/sasl/libcrammd5.so...done.
Loaded symbols for /usr/lib/sasl/libcrammd5.so
Reading symbols from /usr/lib/sasl/libdigestmd5.so...done.
Loaded symbols for /usr/lib/sasl/libdigestmd5.so
Reading symbols from /usr/kerberos/lib/libdes425.so.3...done.
Loaded symbols for /usr/kerberos/lib/libdes425.so.3
Reading symbols from /usr/kerberos/lib/libkrb5.so.3...done.
Loaded symbols for /usr/kerberos/lib/libkrb5.so.3
Reading symbols from /usr/kerberos/lib/libcom_err.so.3...done.
Loaded symbols for /usr/kerberos/lib/libcom_err.so.3
Reading symbols from /usr/kerberos/lib/libk5crypto.so.3...done.
Loaded symbols for /usr/kerberos/lib/libk5crypto.so.3
Reading symbols from /usr/lib/sasl/libgssapiv2.so...done.
Loaded symbols for /usr/lib/sasl/libgssapiv2.so
Reading symbols from /usr/kerberos/lib/libgssapi_krb5.so.2...done.
Loaded symbols for /usr/kerberos/lib/libgssapi_krb5.so.2
Reading symbols from /usr/lib/sasl/liblogin.so...done.
Loaded symbols for /usr/lib/sasl/liblogin.so
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libpam.so.0...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /usr/lib/sasl/libplain.so...done.
Loaded symbols for /usr/lib/sasl/libplain.so
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libzip.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libzip.so
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjdwp.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjdwp.so
#0  0x0066e6c1 in pthread_mutex_init () from /lib/tls/libpthread.so.0
(gdb) where
#0  0x0066e6c1 in pthread_mutex_init () from /lib/tls/libpthread.so.0
#1  0x01070e3c in ObjectMonitor::ObjectMonitor ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#2  0x01000517 in CreateRawMonitor ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#3  0x0039a872 in JVM_OnLoad ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjdwp.so
#4  0x00ff8a2e in JvmdiInternal::post_event ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#5  0x01002a0e in jvmdi::post_vm_initialized_event ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#6  0x010f109c in Threads::create_vm ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#7  0x00fb4388 in JNI_CreateJavaVM ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#8  0x08048e6b in exec_java (java_library_path=0x0, 
    jre_home=0xbfffcdda "/home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre", 
    java_class=0xbfffce27 "com/scalent/shared/tools/test/MonitorTest3", 
    classpath=0x0) at driver.c:300
#9  0x0804895d in main (argc=5, argv=0xbfffb054) at driver.c:81
  • I thought it was something I did because in the stack trace I can see that classpath and java_library_path, parameters to exec_java are null, and sometimes contain other bad values. Examining this with the debugger I've determined that this is just the optimizer. The compiler is passing in the right values for these when they're needed, but otherwise they reflect the value of the register $esi which can vary.
  • I tried to use Purify on this, but there is something seriously broken with Purify on the machine that I'm running on right now. It seems to work better as root, but when I try it as my own user, I get an MSE on almost every malloc and pthread operation, whether my code does it or not. Another red/green/blue herring.
  • I tried Valgrind on it to try to find something, but that didn't seem to discover anything either. Of course, Valgrind can't really execute the whole JVM, but that's not what I was looking for; I was just trying to get it to execute my non-JVM code and find some sort of memory corruption.
  • I also tried j2sdk1.4.2_06, which is newer than our j2sdk1.4.2_03. That didn't help at all. It still crashes at least one run in three.
  • Finally, I went into our code and turned off all the options. After 34 runs of com.scalent.shared.tools.test.TestMonitor it did not fail once. I believe the whole thing has something to do with the -Xdebug and related options, as I've never seen a crash in the non-debug version of the driver.
  • I think I really proved that it has something devious to do with -Xdebug and friends. I commented out just the -Xdebug and -Xrunjdwp:transport=dt_socket,address=9300,server=y,suspend=n options and ran the test com.scalent.shared.tools.test.TestMonitor 160 times and it didn't fail once.
  • I tried putting a 5 second delay between tests in com.scalent.shared.tools.test.TestMonitor, but that didn't help. It still failed on the third test.
  • I tried again with strict=y on the -Xrunjdwp:transport line, but that didn't help.
    I also tried using the dt_shmem transport for -Xrunjdwp, but that didn't help either.
  • I have resigned myself to the fact that this is a bug in the JVM, at least with the way that I'm calling it. Fortunately it only happens while we have -Xdebug turned on.

Find your .jar file

This should help you find the jarfile a class comes from...
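
A minimal sketch of one way to do it from inside Java, using the class's ProtectionDomain and CodeSource. The class name WhichJar is made up. Classes loaded by the bootstrap loader (java.lang.String, for example) usually have no CodeSource, so the helper returns null for them.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports where a class was loaded from. */
class WhichJar {

    /** Returns the jar or directory URL as a string, or null if unknown. */
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return null;    // typical for bootstrap classes
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, or default to this class.
        String name = args.length > 0 ? args[0] : WhichJar.class.getName();
        System.out.println(name + " -> " + locationOf(Class.forName(name)));
    }
}
```

Run it with the same classpath as the application you're investigating, for instance java -cp yourapp.jar:. WhichJar com.example.SomeClass, where the jar and class names are placeholders.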

We use RedHat Enterprise Linux 3.1 at work for a developer box, and I was seeing problems where the machine would get into 100% iowait lockups; it wouldn't completely lock, but it would get REALLY slow. The answer is here, and for me it worked instantly:

Let's say you wrote some code using Java JNI and you wanted to Purify that code so that you could find memory leaks and other bugs.

Short answer: you can't.

Here's the long description about what I went through to get to there.

These are the software versions I'm working with:

RedHat Enterprise Linux 3.1
Linux kernel 2.4.21-9EL
PurifyPlus.2003a.06.13.FixPack.0155
Java Runtime 1.4.2_03-b02

One of the most important pieces is the .purify file that I had constructed; it suppressed hundreds of thousands of warnings and allowed me to run things in a reasonable amount of time. Apparently I forgot to save it in a safe place and it's been destroyed, but it's easy to recreate if you follow these steps.

Anyhow, where I'm stuck is that when an attempt is made by Java to bind to a socket and start listening, it just sits there. There's no activity that I can see via strace, no CPU taken up by the process. But the rtslave is still responsive. It never goes past that step.

I can see this in two different ways; if I turn on Java debugging for my process using the appropriate flags, as soon as the JVM starts up it attempts to bind to that socket. The result is that it just hangs there before executing any Java code. However, if I turn off the Java debugging flag, much Java code is executed up to the point where my Java code attempts to bind to a socket and listen. Then it just sits there again.

In a previous exercise trying to debug Java and listening on a socket, I found that when Java opens a socket it apparently uses rtnetlink to turn off the multicast flag for that socket. I don't know if that has anything to do with it, but it might be interesting...

However, to get this far, here are the steps:

  • You generally have to build a purified executable on the same machine that you're executing on. If anything is different it will crash instantly.
  • The Purify rtslave process just eats tons of memory when it stores errors. If you suppress those errors, it will use much less (or no) memory for those suppressions. The reporting of those errors also takes a huge amount of time, so the purify process ran for a very long time, getting nowhere.
  • The JVM has lots and lots of things that look like MSE's and UMR's. Once you suppress those, the JVM can get somewhere under Purify.
  • You have to set DISPLAY, otherwise Purify will dump everything to stdout, which usually isn't very helpful.
  • I modified our startup environment to pass the environment variables DISPLAY, PUREOPTIONS and PURIFYOPTIONS so that they can affect the operation of Purify.
  • I'm running the JVM with -Xint so that the HotSpot compiler is not invoked, which probably would introduce lots and lots of interesting challenges to get things to work. Update: I got stuck and tried my luck with the HotSpot compiler, and now I'm getting farther. So you should not use -Xint.
  • I found out that IBM has a newer version of Purify that seems to work much better than the previous version against the JVM. It's PurifyPlus.2003a.06.13.FixPack.0155.
    There is an undocumented parameter when building with purify, called -handle-calls-to-java. I added this to my PUREOPTIONS environment variable.
  • Because of -handle-calls-to-java, Purify goes into its cache and sets up symbolic links to "help" the JVM find stuff. For instance, I have -cache-dir set to /var/purify/cache. In /var/purify/cache/opt/scalent/jre/lib/ there are lots of symbolic links back to /opt/scalent/jre/lib/. That is where our JRE is stored in the file system.
  • The JVM still needs at least one more (that I know about so far) symbolic link to find stuff. First you have to run the JVM and have it fail with the message: "Error occurred during initialization of VM java.lang.UnsatisfiedLinkError: no zip on java.library.path". This is because when java looks for a library to open called "zip", on Linux it's going to look for libzip.so on its java.library.path. But since the name has been Purify-mangled, it can't find it. Therefore, do the following:
cd /var/purify/cache/opt/scalent/jre/lib/i386/
ln -s /opt/scalent/jre/lib/i386/libawt.so
ln -s /opt/scalent/jre/lib/i386/libcmm.so
ln -s /opt/scalent/jre/lib/i386/libdcpr.so
ln -s /opt/scalent/jre/lib/i386/libdt_socket.so
ln -s /opt/scalent/jre/lib/i386/libfontmanager.so
ln -s /opt/scalent/jre/lib/i386/libhprof.so
ln -s /opt/scalent/jre/lib/i386/libioser12.so
ln -s /opt/scalent/jre/lib/i386/libjaas_unix.so
ln -s /opt/scalent/jre/lib/i386/libjavaplugin_jni.so
ln -s /opt/scalent/jre/lib/i386/libjawt.so
ln -s /opt/scalent/jre/lib/i386/libjcov.so
ln -s /opt/scalent/jre/lib/i386/libJdbc0dc.so
ln -s /opt/scalent/jre/lib/i386/libjdwp.so
ln -s /opt/scalent/jre/lib/i386/libjpeg.so
ln -s /opt/scalent/jre/lib/i386/libsig.so
ln -s /opt/scalent/jre/lib/i386/libjsoundalso.so
ln -s /opt/scalent/jre/lib/i386/libjsound.so
ln -s /opt/scalent/jre/lib/i386/libmlib_image.so
ln -s /opt/scalent/jre/lib/i386/libnative_chmod.so
ln -s /opt/scalent/jre/lib/i386/libnet.so
ln -s /opt/scalent/jre/lib/i386/libnio.so
ln -s /opt/scalent/jre/lib/i386/librmi.so
ln -s /opt/scalent/jre/lib/i386/libverify.so
ln -s /opt/scalent/jre/lib/i386/libzip.so
  • I found another directory that needs to be linked. I got the error "ZoneInfo: /var/purify/cache/opt/scalent/jre/lib/zi/ZoneInfoMappings (No such file or directory)". I also found lots of other directories in a similar state:
cd /var/purify/cache/opt/scalent/jre/lib
ln -s /opt/scalent/jre/lib/zi
ln -s /opt/scalent/jre/lib/locale
ln -s /opt/scalent/jre/lib/images
ln -s /opt/scalent/jre/lib/im
ln -s /opt/scalent/jre/lib/fonts
ln -s /opt/scalent/jre/lib/ext
ln -s /opt/scalent/jre/lib/cmm
ln -s /opt/scalent/jre/lib/audio
  • When the Java code starts up, it forks off processes that are written in C. The result is that Purify follows the fork with another Purify rtslave that immediately does an exec. Purify takes this as a process exit, and so immediately starts looking for leaks in that process. We don't care about leaks at this point; we'll find the leaks in the original JVM process when we want by clicking on the leak button. So until I fix process forking, I'm adding the options -inuse-at-exit=no -leaks-at-exit=no to my PURIFYOPTIONS environment variable.

In case you're wondering, Valgrind won't work either.

Java and MT

Java's memory model is very aggressive, and you have to be very careful when accessing memory from multiple threads. You of course have to synchronize access to memory locations, but you have to synchronize them even when it looks like you don't have to. There are several cases where you must use synchronized:

  • To provide a mutual exclusion barrier to prevent one thread from modifying a data structure while the other is reading it.
  • To provide a memory barrier to prevent memory operation reordering from doing something you didn't want to have happen.
  • To make the memory you're accessing volatile so that the runtime optimizer doesn't throw away your request to read a memory location.

Here's a good web page that discusses this.

A good rule to use is that when in doubt, synchronize.

Reordering can only hit you with a multiple-cpu machine, but the problems that I've been running into recently happen on my single CPU machine, with something like this:

(Note that everything after this is speculation based on behavior I've seen):

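Object m_x = new Object();   // lock object
int m_y = 0;

void Thread1() {
    synchronized (m_x) {
        m_y = 1;
    }
}

void Thread2() {
    // Unsynchronized read: nothing forces this thread to see Thread1's write.
    while (true)
        System.out.println(m_y);
}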

Even after the code in Thread1 has executed in its thread, the code in Thread2 will print 0; I believe this is because the runtime optimizer doesn't bother to look at the value of m_y after the first access. This is similar to a compile-time optimizer, which you'd fix with volatile. But a compile-time optimizer couldn't do anything in this situation.

But in Java the runtime optimizer will make it so that the first access gets the value, but it won't bother reading the value from memory anymore after that.

This strange behavior goes away by putting synchronized(m_x) around the access to m_y in Thread2 as well. I believe this tells the runtime optimizer that something is likely to have been changed by another thread.
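
Here's a minimal runnable sketch of that fix, assuming the reader synchronizes on the same lock object as the writer. The class VisibilityDemo and its methods are made up for illustration; the field names follow the fragment above. Declaring m_y volatile is another way to get the same visibility guarantee under the Java 5 memory model.

```java
/** Hypothetical demo: reader and writer synchronize on the same lock. */
class VisibilityDemo {

    private final Object m_x = new Object();   // lock object
    private int m_y = 0;                       // guarded by m_x

    void write() {
        synchronized (m_x) {
            m_y = 1;
        }
    }

    int read() {
        // Taking the same lock gives this read a happens-before edge
        // with the write, so the runtime cannot keep a stale value.
        synchronized (m_x) {
            return m_y;
        }
    }

    /** Spins until the writer's update is visible, then returns it. */
    static int demo() throws InterruptedException {
        final VisibilityDemo d = new VisibilityDemo();
        Thread writer = new Thread(new Runnable() {
            public void run() {
                d.write();
            }
        });
        writer.start();
        while (d.read() == 0) {
            Thread.yield();
        }
        writer.join();
        return d.read();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("reader saw m_y = " + demo());
    }
}
```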

The software that we're developing creates SSL connections when it starts up, and it does so at S13 (has to be after network, but before other services start). The result is that on an NFS booted Linux machine, it sits there forever, and never completes the connection.

Clue #1: if you move the mouse or type on the machine's keyboard, eventually the connection will complete.

Of course the reason for it hanging is that Java is using /dev/random to generate the keys for the SSL connection. And /dev/random gets all of its entropy from the physical environment, and refuses to return random values until it gets some input from the outside world.

We don't see this on a machine that boots from disk; I assume that /dev/random gets entropy from the interaction with the drive, via interrupts and so forth. For some reason the network activity doesn't yield the same entropy data, or at least not enough.

I found this article that discusses the usefulness of /dev/random given its current design.

In order to work around this, we decided to use /dev/urandom. We could do this by a link in the file system, but a much superior solution is to set the following system property in Java:

-Djava.security.egd=file:/dev/urandom

Now all you have to worry about is attacks against your SSL connection from those who know that you are using the pseudo-random number generator...
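If you want to check which entropy source a JVM was started with, and how long it takes to get random bytes, a small sketch like this may help. EntropySourceCheck and its methods are hypothetical names of mine; the only things taken from above are the java.security.egd property and the idea of timing the first draw. Whether a particular JVM version honors the property for every SecureRandom algorithm is something I'd verify rather than assume.

```java
import java.security.SecureRandom;

public class EntropySourceCheck {
    /** Returns the java.security.egd value the JVM was started with, or "(default)" if unset. */
    public static String configuredSource() {
        return System.getProperty("java.security.egd", "(default)");
    }

    /** Draws n bytes from a fresh SecureRandom; this is where a starved /dev/random would block. */
    public static byte[] draw(int n) {
        byte[] buf = new byte[n];
        new SecureRandom().nextBytes(buf);
        return buf;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        byte[] bytes = draw(16);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("java.security.egd = " + configuredSource());
        System.out.println("drew " + bytes.length + " bytes in " + elapsedMs + " ms");
    }
}
```

Run it once with and once without -Djava.security.egd=file:/dev/urandom on the affected machine and compare the elapsed times.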


Today I learned something very interesting. I learned that you can't setuid on a process in Linux; not when you have multiple threads. Please see #8 in this list.

What they refer to as interesting times probably includes the following:

  • When calling setuid, only the caller thread will actually get its uid changed. All other existing threads in the "process" retain their original uid.
  • I believe any sane person should recognize this as meaning Linux is broken when using threads and setuid.
  • This is a security hole, because root threads still exist in the process. If the non-root threads are hijacked by an attacker, they can stack stomp on the root threads and execute arbitrary code as root.
  • Because synchronization depends on the ability to deliver signals, and delivering signals depends on privileges, it's easy to see how synchronization between a thread running as root and another running as non-root can wedge the process.
  • Even if I did call setuid in the first bytecode instruction in a Java process, it's too late; Java has already forked threads to do things like garbage collection, and those threads present the security hole described above, and the synchronization problem described above.
  • I'm sure there's a long list of other reasons why this is bad, but I can't think of them now, and the above is sufficient.

In our project we have a Java process that uses forked processes written in C; the purpose of these forked processes is to run as root, or at least elevated privileges, while the Java process runs as some sort of nobody user. Unfortunately this doesn't work very well at all on Linux because we cannot downgrade the uid of the Java process after it starts.

This also means that if we want to listen on a port under 1024, we'll have to do that some other way; there's no way we could get the Java process to bind to that port as root and then downgrade to a nobody uid.

Also the processes I refer to have to be forked before the JVM starts. This means that we have to rendezvous with them in some manner that either means some sort of JNI code to hook up the file descriptors in the pipes, or use some other form of IPC.

Here's what we think is happening, thanks to Evan's suggestion to use gdb and Carol's assistance in recreating the loopback's IP address and route:

The first attempt by Java to open a socket is preceded by an initialization of its socket code.
The socket initialization code calls java.net.PlainDatagramSocketImpl.leave, as is indicated in this stack trace from gdb:

#0  0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0xb75e3bf8 in connect () from /lib/tls/libpthread.so.0
#2  0xaa4b7c24 in Java_java_net_PlainDatagramSocketImpl_leave ()
   from /opt/scalent/jre/lib/i386/libnet.so
#3  0xaa4b8029 in Java_java_net_PlainSocketImpl_initProto ()
   from /opt/scalent/jre/lib/i386/libnet.so
#4  0xb2fa6bf2 in ?? ()
#5  0xb2fa0ddb in ?? ()
#6  0xb2f9e104 in ?? ()
#7  0xb721bb44 in JavaCalls::call_helper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#8  0xb72cfa6d in os::os_exception_wrapper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#9  0xb721bd96 in JavaCalls::call ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#10 0xb7200f6f in instanceKlass::call_class_initializer_impl ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#11 0xb720569c in instanceKlass::call_class_initializer ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#12 0xb72001cb in instanceKlass::initialize_impl ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#13 0xb72059af in instanceKlass::initialize ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#14 0xb720d6d4 in InterpreterRuntime::_new ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#15 0xb2fad510 in ?? ()
#16 0xb2fa0ddb in ?? ()
#17 0xb2fa0ddb in ?? ()
#18 0xb2fa0ddb in ?? ()
#19 0xb2fa0ddb in ?? ()
#20 0xb2fa0ddb in ?? ()
#21 0xb2fa0d04 in ?? ()
#22 0xb2fa0ddb in ?? ()
#23 0xb2fa10e1 in ?? ()
#24 0xb2f9e104 in ?? ()
#25 0xb721bb44 in JavaCalls::call_helper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#26 0xb72cfa6d in os::os_exception_wrapper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#27 0xb721bd96 in JavaCalls::call ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#28 0xb721b666 in JavaCalls::call_virtual ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#29 0xb721c1df in JavaCalls::call_virtual ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#30 0xb7274f25 in thread_entry ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#31 0xb7319caa in JavaThread::thread_main_inner ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#32 0xb7315674 in JavaThread::run ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#33 0xb72d1083 in _start () from /opt/scalent/jre/lib/i386/client/libjvm.so
#34 0xb75dedac in start_thread () from /lib/tls/libpthread.so.0
  • The call in question seems to be to make the machine leave the multicast group. See here.
  • Leaving the multicast group must involve connecting to loopback on a random port (the port it chooses changes and is always above 32768), and then shoving some random bytes through. That's my theory. Update: This is almost certainly rtnetlink.
  • It just sits there trying to communicate with itself, and times out after ~3.5 minutes.
  • It appears that it has its problem because the loopback device is not configured with an IP address and is not in the route table.
  • After we issued the following commands, everything works just fine and the 3.5 minute delay turns into 18ms delay:
    • ip addr add 127.0.0.1/8 dev lo
    • /sbin/route add -net 127.0.0.0/8 dev lo
  • We found a most interesting thread on the Java forum that seems to mirror our problem. But the guy apparently never figured it out. Maybe I should post a solution there.
  • We went through the following other possible problems:
    • I always thought it was some kind of nfs file locking problem, but that's not the case at all.
    • We redirected the logging output to a local disk on the machine. That didn't help at all.
    • We saw the process was blocked reading /dev/random; /dev/random must use loopback to generate random numbers. To solve this we used /dev/urandom which is not as random, but removed the block. But the connection delay persisted. This might explain why it took 50 minutes to send the first message on the SSL connection. Once we removed the block on /dev/random the 50 minute message send delay seemed to go away.
    • We wrote some code to try to connect without SSL, but I'm not convinced that ever worked. It's still in the code and can be activated with a configuration setting, and I tested that configuration setting in my client.
    • We then thought it was a delay caused by doing a reverse DNS lookup on the peer's IP address--likely so it could do certificate validation/throw nice exceptions. We saw in gdb that the stack trace was deep in some Java code that was trying to do some kind of DNS operation. /etc/resolv.conf was empty, so we added our name server to it and rebooted the machine. That didn't help, but the stack trace changed.
    • Then the stack trace was stuck in Java_java_net_PlainDatagramSocketImpl_leave; I thought that might have still been some DNS hosage, so I changed /etc/hosts to include the addresses of the peers. That didn't help and didn't change the stack trace.
    • Finally we typed /sbin/ifconfig. That showed us that lo did not have an IP address.
    • Carol told us the correct magic commands to type.

Sometimes we would see on some of our Linux boxes a 3-minute delay between an attempt to open a socket and a successful connection. This did not make sense... but I eventually determined that this was caused by the loopback device not having an address, or not having a route in the route table.

You can probably fix this by typing the following:
/sbin/ip addr add 127.0.0.1/8 dev lo
/sbin/route add -net 127.0.0.0/8 dev lo
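
To catch this situation from inside the Java process instead of waiting out the delay, something along these lines could run at startup. This is only a sketch: LoopbackCheck and loopbackAddresses are names I made up, and it only looks for an address on a loopback interface; it does not look at the route table.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

public class LoopbackCheck {
    /** Collects every address bound to an interface that reports itself as loopback. */
    public static List<InetAddress> loopbackAddresses() throws SocketException {
        List<InetAddress> found = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return found; // no interfaces reported at all
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (nic.isLoopback()) {
                found.addAll(Collections.list(nic.getInetAddresses()));
            }
        }
        return found;
    }

    public static void main(String[] args) throws SocketException {
        List<InetAddress> addrs = loopbackAddresses();
        if (addrs.isEmpty()) {
            System.out.println("WARNING: no address on any loopback interface");
        } else {
            System.out.println("loopback addresses: " + addrs);
        }
    }
}
```

An empty list means either that lo has no address or that no loopback interface was reported at all; both match the broken state described above.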

Also, we had these other symptoms:
  • /dev/random would block for about 3 minutes, probably because it depends on loopback to get its results.
  • Java trying to do a reverse DNS lookup would block for about 3 minutes, probably because it was trying to get results from 0.0.0.0, because /etc/resolv.conf was empty, and 0.0.0.0 was being interpreted as 127.0.0.1... Update: see this post about a related fix...

Here is a thread where I replied with this information; the thread also suggests other solutions, perhaps to the same or similar problems.

Now if I could just figure out how that Linux box got into that situation...

I've been studying the C/C++ build and I think I learned some things:

  • glibc is inextricably linked with the Linux operating system. You can't run with a new glibc.
  • LD_LIBRARY_PATH can affect libc, but cannot affect ld-linux.so.2 (ld-2.3.2.so). It seems you can get around this with chroot, but then you have other problems.
  • glibc 2.3.2 has the symbol GLIBC_PRIVATE which is in ld-2.3.2.so, but not in ld-2.2.x.
  • libstdc++ 3.2 (it comes with g++ 3.2) requires glibc 2.3. Redhat 7 ships with 2.2 or earlier. See previous point. You cannot take a libstdc++ from Redhat 9 and run it on Redhat 7 unless you upgrade glibc and just about everything else in the OS, at which point it's not really Redhat 7.
  • libstdc++ is more than STL. It's the C++ runtime and STL. Therefore STLport can never replace libstdc++.
  • I can chroot with Redhat 7 (actually Mandrake 8) and get my Redhat 9 compiled binary and libstdc++ 3.2 shared object to run. However, once I do that I can't do things like read /proc or modify /etc, which is something we need to do.
  • Starting with g++ 3.2, libstdc++ is attempting to be forward/backwards compatible in its ABI where possible. At that release, compatibility with everything earlier was completely broken.
  • Redhat 9 ships with compat-libstdc++ which contains the C++ runtime libraries for gcc 2.96 as used in Redhat 7.3. This means C++ stuff compiled on Redhat 7 will work on Redhat 9, but only when this package is installed.
  • glibc works very well forward/backwards compatibility-wise, with the GLIBC_2.0, GLIBC_2.1, etc. symbols. If you build a binary that is C only, it's probably going to run anywhere, as long as it's glibc v2 or better, preferably glibc v2.1.
  • It is impossible to statically link libstdc++ into an executable when exceptions are thrown/caught. This is because symbols such as _Unwind_DeleteException exist in libgcc.so but do not exist in libgcc.a.

While setting up our development system and source control, I'm taking the philosophy that all tools are to be checked into source control, not installed on individual machines; in that way a developer's tools are never out of date. Unfortunately some tools don't like this approach; they like to hard-code or "relocate" their position during installation.

One of those is perl.

This link explains a bit how ActiveState relocates perl on install.

What happens is that the @INC path must be embedded in the perl executable on Unix platforms, or so they claim. When install.sh is run, it calls reloc_perl, which uses an ActiveState perl module Relocate which then uses this trick to replace things like
/tmp/.TheInstallScriptWasNotRunTheInstallScriptWasNotRunTheInstallScriptWasNotRun-perl/lib/5.8.0
with the appropriate path. Unfortunately, when I first tried this, the path just happened to be my home directory, where I had downloaded it.

By the way, there are only 0x80 (128) bytes of space to put the path in, so there is a limit to what location it can be relocated into.
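
To make the idea concrete, here is a rough sketch of that kind of in-place patching. It is not ActiveState's code: PathPatcher and relocate are hypothetical names, and the assumption that the leftover bytes are filled with NULs is mine. It swaps a placeholder prefix for a shorter path inside a NUL-terminated string and keeps the file the same size, which is why the new path can never be longer than the placeholder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PathPatcher {
    /**
     * Replaces each occurrence of placeholder with newPrefix, shifts the rest of that
     * NUL-terminated string left, and pads with NULs so the byte count stays the same.
     */
    public static byte[] relocate(byte[] data, String placeholder, String newPrefix) {
        byte[] from = placeholder.getBytes(StandardCharsets.ISO_8859_1);
        byte[] to = newPrefix.getBytes(StandardCharsets.ISO_8859_1);
        if (from.length == 0 || to.length > from.length) {
            throw new IllegalArgumentException("new path must fit inside a non-empty placeholder");
        }
        byte[] out = data.clone();
        int i = 0;
        while (i + from.length <= out.length) {
            if (!matchesAt(out, i, from)) {
                i++;
                continue;
            }
            int end = i + from.length;
            while (end < out.length && out[end] != 0) {
                end++; // find the end of the C string that holds the placeholder
            }
            int suffixLen = end - (i + from.length);
            System.arraycopy(out, i + from.length, out, i + to.length, suffixLen);
            System.arraycopy(to, 0, out, i, to.length);
            Arrays.fill(out, i + to.length + suffixLen, end, (byte) 0);
            i = end;
        }
        return out;
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        for (int k = 0; k < pattern.length; k++) {
            if (data[offset + k] != pattern[k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample bytes standing in for a slice of the perl binary.
        byte[] sample = "/tmp/PLACEHOLDER/lib/5.8.0\0".getBytes(StandardCharsets.ISO_8859_1);
        byte[] patched = relocate(sample, "/tmp/PLACEHOLDER", "/opt/perl");
        System.out.println(new String(patched, StandardCharsets.ISO_8859_1).trim());
    }
}
```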

So, the procedure I used to get an ActivePerl that works on anyone's machine no matter where their source directory is mapped to their file system:

  • Installed ActiveState Perl normally, into a place such as your home directory: in my case this was /home/jared.oberhaus/p4/tools/linux/ActivePerl-5.8.3.809
  • Found all instances of text and binary files under the installation directory that contain /home/jared.oberhaus and replaced them with the original files from the install tar. The original files still have encoded strings such as /tmp/.TheInstallScriptWasNotRunTheInstallScriptWasNotRunTheInstallScriptWasNotRun-perl/lib/5.8.0 inside them.
  • Submitted these files to source control as-is.
  • Modified ActiveState's install.sh by adding to it (not removing the original install procedures). First it links the magic /tmp path to the file location where the source control version is mapped. This is controlled by detecting where the install script exists and processing that. When reloc_perl executes it will copy everything into /home/user/p4/tools/linux/perl-5.8.3 and at the same time replace the magic /tmp string with the correct location.

You can prevent System.exit() from taking down the JVM by installing a SecurityManager whose checkExit() throws. Try something like this in your JUnit test:

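import java.security.Permission;

/** Status passed to the most recent System.exit() attempt, or -1 if none. */
private static int m_exitCode = -1;

public void setUp() {
    System.setSecurityManager(new CatchSystemExit());
}
public void tearDown() {
    System.setSecurityManager(null);
}
private static class CatchSystemExit extends SecurityManager {
    /** Records the exit status and vetoes the exit. */
    public void checkExit(int status) {
        m_exitCode = status;
        throw new SecurityException("System.exit() attempt caught");
    }
    /** Allows everything else, including removing this manager in tearDown(). */
    public void checkPermission(Permission perm, Object context) {
    }
    /** Allows everything else, including removing this manager in tearDown(). */
    public void checkPermission(Permission perm) {
    }
}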

cpio and rpm


I'm trying to submit tools and libraries into source control, and the tools (such as glibc) arrive as rpms. Instead of installing them (which is what I don't want to do), I'm ripping the contents out. Of course, rpm doesn't give you an easy way to do that.

But I have determined the correct syntax to do it, as the payload of an rpm is really a cpio archive (-i extracts, -d creates directories as needed):

rpm2cpio rpmfile.rpm | cpio -id

I noticed that every time checkstyle runs it contacts www.puppycrawl.com to get its DTD. This is annoying...

I thought it was a bug in the checkstyle code, but it turns out I put the wrong DTD identifier at the top of all the checkstyle config files. Once I fixed that, it stopped phoning home.
