Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use NIO for zip/jar access #9683

Open
scabug opened this issue Mar 2, 2016 · 11 comments
Open

Use NIO for zip/jar access #9683

scabug opened this issue Mar 2, 2016 · 11 comments

Comments

@scabug
Copy link

scabug commented Mar 2, 2016

#9632 and #9682 stand as examples of why the JarFile API is really really buggy (especially on Windows),

Java 7 introduced a new API for dealing with zips/jars http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html which is also much higher performance.

It would be good if scala 2.12 could make use of these NIO APIs, since the target platform has shifted to Java 8.

@scabug
Copy link
Author

scabug commented Mar 2, 2016

Imported From: https://issues.scala-lang.org/browse/SI-9683?orig=1
Reporter: Samuel Halliday (fommil)

@scabug
Copy link
Author

scabug commented Mar 3, 2016

@retronym said:
I took some baby steps in this direction in scala/scala@2.12.x...retronym:topic/jdk8ism

@scabug
Copy link
Author

scabug commented Mar 3, 2016

@soc said:
@retronym sorry, didn't know that. Feel free to assign it back to you!

@scabug
Copy link
Author

scabug commented Mar 3, 2016

@retronym said:
@soc I won't have time to pick this up prior to 2.12.0. The job will be somewhat smaller if we can remove -Yclasspath-impl:recursive. We should at least get the ball rolling by switching the default to :flat before 2.12.0.

@scabug
Copy link
Author

scabug commented Mar 10, 2016

@Ichoran said:
[~fommil] Do you have a reference for the performance difference? I couldn't immediately find a set of good benchmarks that indicate that the performance is in fact better.

@scabug
Copy link
Author

scabug commented Mar 11, 2016

@Ichoran said:
[~fommil] - I don't see any evidence that ZipFileSystem is faster than simply ZipFile, and it's only slightly faster than ZipInputStream. In comparing

def zipsumA(p: java.nio.file.Path) = {
  import java.nio.file._
  val fs = FileSystems.newFileSystem(p, null)
  val root = fs.getPath("/")
  var l = 0L
  var n = 0L
  Files.walkFileTree(root, new SimpleFileVisitor[Path]{
    override def visitFile(path: Path, attr: attribute.BasicFileAttributes) = {
      val bb = java.nio.ByteBuffer.wrap(Files.readAllBytes(path))
      n += bb.remaining
      var i = 0
      while (bb.remaining > 4) i ^= bb.getInt
      l += i
      FileVisitResult.CONTINUE
    }
  })
  fs.close
  (n, l)
}

with older-style

def zipsumB(p: java.nio.file.Path) = {
  import java.util.zip._
  val zf = new ZipFile(p.toFile)
  val zes = zf.entries
  var n, l = 0L
  while (zes.hasMoreElements) {
    val ze = zes.nextElement
    val b = new Array[Byte](ze.getSize.toInt)
    val zis = zf.getInputStream(ze)
    var i = 0
    while (i < b.length) i += zis.read(b, i, b.length - i)
    n += b.length
    val bb = java.nio.ByteBuffer.wrap(b)
    var j = 0
    while (bb.remaining > 4) j ^= bb.getInt
    l += j
    zis.close
  }
  zf.close
  (n, l)
}

and

def zipsumC(p: java.nio.file.Path) = {
  import java.util.zip._
  val zis = new ZipInputStream(new java.io.FileInputStream(p.toFile))
  var n, l = 0L
  var ze = zis.getNextEntry
  while (ze ne null) {
    if (ze.getSize > 0) {
      val b = new Array[Byte](ze.getSize.toInt)
      var i = 0
      while (i < b.length) i += zis.read(b, i, b.length - i)
      n += b.length
      val bb = java.nio.ByteBuffer.wrap(b)
      var j = 0
      while (bb.remaining > 4) j ^= bb.getInt
      l += j
    }
    zis.closeEntry
    ze = zis.getNextEntry
  }
  else if (ze.getSize < 0) throw new Exception("This only works for ZipEntries with known size annotation")
  zis.close
  (n, l)
}

I see that ZipFileSystem (method A) is about 6% slower than ZipFile (method B) and 5% faster than ZipInputStream (method C).

Thus, given the modest differences I wouldn't bother prioritizing this for speed, and if it is using ZipInputStream now, I'd tend to go for ZipFile before ZipFileSystem.

If you know differently, can you please provide some evidence?

(Granted, this is only reading, but generally there's more reading than writing.)

@scabug
Copy link
Author

scabug commented Mar 14, 2016

@retronym said:
Here's a good collection of migration tips to help move from the legacy IO to NIO: https://github.com/marschall/memoryfilesystem#guidelines-for-testable-file-code

That project itself that offers these guidelines is an in-memory filesystem implementation. We currently have the an implementation of the same concept (https://github.com/scala/scala/blob/v2.11.6/src/reflect/scala/reflect/io/VirtualDirectory.scala) built in top our our internal IO abstractions. I'd like to get rid of that code in favour and push the logic down into the NIO filesystem abstraction.

@scabug
Copy link
Author

scabug commented Mar 16, 2016

@retronym said:
I've received some more background / tips on NIO that I can share here:

The advantages that NIO and NIO.2 has over File APIS -

  - NIO provides platform specific APIS ( this is probably not useful for you)
  - NIO Paths are a bit like files, but maintain platform specific information, so down require lots of stat calls to get files attributes
  - File scanners for example (https://docs.oracle.com/javase/tutorial/essential/io/walk.html) in the windows implementation load all of the OS information as one stat call per directory (actually in a weak reference last time I looked), so you down need one per attribute per file as you do with File API ( file name, size, isDirectory …)
  - NIO is also faster on IO as you have ByteBuffer access, and can avoid the double buffering required to transfer to native buffers
  - The APIs are a lot cleaner and generally a view to performance has been taken
  - NIO is inherently parallelisable (not sure if this is useful given the single threaded compiler though), but perhaps we can async complete the IO
  - NIO supports WatchService. This is probably not inherently useful for an app that dies, but in the future we perhaps a resident compiler service could use this to limit input scanning
  - File deletes are faster and provide proper exceptions. Not sure of the perf reasons, but parallelism and the cache of the attributes seems likely


A few links:

http://mail.openjdk.java.net/mailman/listinfo/nio-dev
http://mail.openjdk.java.net/mailman/listinfo/nio-discuss
http://www.oracle.com/technetwork/articles/javase/nio-139333.html

I have witnessed poor performance of SBT and scalac on Windows file systems in the past which I believe was related to lots of lookups of file attributes of source and class files, so the ability to do this en-masse sounds enticing.

Our next step towards this is to standardize on -Yclasspath:flat, which is more memory efficient than the older, default -Yclasspath:recursive. This will reduce the surface area of the compiler that interacts with Zip files. We plan to flip the default to flat this week as part of 2.12.0-M4, then remove the old implementation by 2.12.0-M5.

@scabug
Copy link
Author

scabug commented Mar 16, 2016

@soc said:
@retronym I could do the classpath migration. Is there already a ticket for it?

@scabug
Copy link
Author

scabug commented Mar 23, 2016

@lrytz said:
comment by @dragos: "Regarding AbstractFile, there’s a few implementations out there. For sure Eclipse is using it to feed files into the presentation compiler, and probably Ensime as well."

@scabug
Copy link
Author

scabug commented Sep 7, 2016

@SethTisue said:
flat-classpath is now the default (and only!) implementation in 2.12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant