Massively parallel computer costs $99

Even though dual-, quad-, and octo-core CPUs have been around for a while, they’re a far cry from truly massively parallel computing platforms. The chip manufacturer Adapteva is looking to put dozens of CPUs in a small package with their Parallella project. As a bonus, they’re looking for funding on Kickstarter, and plan to open source their 16 and 64-core CPUs after funding is complete.

The Parallella computer is based on the ARM architecture, and will be able to run Ubuntu with 1 GB of RAM, a dual-core ARM A9 CPU, Ethernet, USB, and HDMI output. What makes the Parallella special is its Epiphany Multicore Accelerator – a coprocessor containing up to 64 parallel cores.

Adapteva is turning to Kickstarter for their Parallella computer to get the funding to take their Epiphany multicore daughterboard and shrink it down into a single chip. Once that’s complete, Adapteva will start shipping an ARM-powered Linux supercomputer that’s about the size of a credit card, or a Raspberry Pi under the new system of dev board measurements.

With any luck, the Parallella multicore computer will be available for $99, much less than a comparable x86 multicore computer. It’ll certainly be interesting to see what the Parallella can do in the future.

Comments

  1. xobmo says:

    Wonder if they’ll call it Vapor Parallella 64…

  2. Halexander9000 says:

    Um… wouldn’t an operating system and subsequently all applications running under it have to be optimized to use all 64 cores in order for them to really take advantage of the parallel processing power that thing has? I don’t want it ending up like a sort of Redundant Array of Inexpensive Processors instead of the Bad Ass Parallel Processor it’s supposed to be. Do you?

    • Anonymous says:

      Redundant Array of Inexpensive Processors…or RAIP for short. As in, it’ll RAIP your computer.

    • Jon says:

      Well, it’s a co-processor daughterboard chip. What I got from the article was that the OS runs on the main CPU [dual-core Arm A9], and you could code an application to run on the massively parallel co-processor. They will likely have some sort of C/C++/Python API and library to accomplish this. That’s just my take.

      • Mikey says:

        I’m not a Linux guy — but couldn’t someone just port pthreads to take advantage of this and then recompile any software they wanted and it would automatically take *some* advantage of the RAIP?

        • Frederik says:

          No. The coprocessor is -likely- not an ARM processor, or even if it is, it might not be compatible with the Cortex A9 API — hence, it will not be that simple. I suspect that it would be relatively easy to write small kernels for the coprocessors and use some simple MPI to signal between the main program and the kernel.
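The host-plus-kernel split described in this sub-thread can be sketched in a few lines. This is only an illustration of the pattern, not Adapteva's API: Python threads stand in for coprocessor cores, and the names `kernel`, `offload`, and `NUM_CORES` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 16  # assumption: mirrors the 16-core part, purely illustrative


def kernel(chunk):
    """Small compute kernel: sum of squares over one chunk of the input."""
    return sum(x * x for x in chunk)


def offload(data, num_cores=NUM_CORES):
    """Host side: split the data, hand one chunk to each worker, combine results."""
    chunk_size = max(1, -(-len(data) // num_cores))  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Threads are a stand-in for coprocessor cores in this sketch.
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        partial_sums = list(pool.map(kernel, chunks))
    return sum(partial_sums)


total = offload(list(range(1000)))
print(total)
```

The point of the pattern is that only `kernel` has to be written for the accelerator; the operating system and the rest of the program stay on the host CPU.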

    • hospadar says:

      I got the feel that that was just the sort of reason they want to build it – to provide a test bed for such applications.

      There was a system kinda like this before I think, can’t recall the name for the life of me. Some system with lots of CPUs that was supposed to be great, but never made it to the big leagues because of the difficulties of designing software for such a system.

      Itanium died for not entirely dissimilar reasons.

      A low intensity testbed like this sounds like fun.

    • adapteva says:

      I figured I would chime in and answer some of the questions. Really appreciate the skepticism, we obviously haven’t done a good enough job of putting out convincing data and stories. I can tell you one thing for sure. This is NOT vaporware!

      First of all, we are not trying to replace CPUs, we are trying to replace GPGPUs for certain tasks. At the end of the day, imagine a PC with two PCIe slots, one with a GPU and the other with a multicore “something, call it what you like”. Both would be accessed through the same programming framework (like OpenCL). The OpenCL code would need to be written in a parallel way, but that seems to be the chosen approach until we come up with something better (one of the goals of Parallella by the way).

      Andreas

      • JB says:

        Wasn’t this the goal of PhysX? (stares at PhysX board in PC) They got absorbed by nVidia and their tech is part of their 3D accelerators. Now it is only useful for graphics. I was hoping for many better applications.

        Unless that’s what you are shooting for? Quick sale to nVidia or ATI/AMD? ;)

  3. tulcod says:

    @Halexander9000: If I understood it correctly, there will be a host Cortex chip, and then the parallella as an external “accelerator” chip (kinda like CPU vs GPU, except this GPU is rather sophisticated).

    • Leithoa says:

      I’m architecture ignorant, but how exactly (aside from price) is this different than what many researchers are doing by using their GPU(s) to process models? There are low-end (<100 USD) graphics cards that have 300+ CUDA cores. What’s the advantage of this form factor?

      • Brian Benchoff says:

        GPUs are only useful for a limited number of computations (like pixel shaders, or really anything to do with graphics). Also, all GPU cores calculate the same thing at the same time.

        This is a more general purpose solution – it can calculate anything – and each core can be controlled separately.

      • Finger says:

        @Brian Benchoff

        That is just untrue of GPUs. Having programmed using CUDA on many of nVidia’s GPUs, I can say you can run different code on different cores and not necessarily at the same time. Many people use GPUs for general purpose computing (which is why the term GPGPU exists). Granted, it is a different programming style as to how you organize your code but it can be and *is* done.

      • tulcod says:

        You’re right in that the matrix multiplication demo they made a video of could be done just as fast, or presumably even faster, on GPUs.

      • rasz says:

        depends on precision, integer 512×512 matrix mul takes <1ms on old cuda cards :)

  4. Pete says:

    Interesting, they clearly know how much of a problem the difficulty of parallel computing is, and want the community to help them solve it so their chips are useful. As said above, requires quite a lot of software work.

  5. twdarkflame says:

    Would this be possible as a USB stick you plug into a regular PC?
    So you could code from your platform of choice, send it to the stick, and hit “go”. Could be great for specific tasks I think.

    • po says:

      I would assume you could effectively do that with this system if you wanted. It supports USB and would just require a little coding to accept parallel tasks over a wire. But you could just have it suck code to run from a git repo or similar and make things easier. Or just get your platform of choice running on the thing, which seems most reasonable.

    • rasz says:

      16 core chips have 512KB of ram. they compute quite fast. you want FAST connection to feed them data, usb is NOT fast
      usb 3.0 interface chips are expensive.
      Maybe something like Esata could work, but thats still retarded

      • Mikey says:

        You could give them more ram (DDR4?), so that they don’t need to send so much traffic back and forth — you end up with the same bottleneck issues that have always existed with outsourcing computations (bandwidth between master and slave) — USB 2.0 could be acceptable depending on the complexity of the application, but ultimately, I’d want PCIe (or whatever the latest, greatest, fastest bus you can grab at the time is).
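A rough way to put a number on the bandwidth concern: divide the chip's peak compute rate by the link speed to see how much work each transferred byte has to justify. The sketch below assumes the 100 GFLOPS figure quoted elsewhere in this thread for the 64-core part and the USB 2.0 signalling rate of 480 Mbit/s with no protocol overhead, so real throughput would be lower; the function name is made up for illustration.

```python
def required_flops_per_byte(peak_flops, link_bytes_per_s):
    """FLOPs the chip must perform per byte transferred to stay busy."""
    return peak_flops / link_bytes_per_s


PEAK_FLOPS = 100e9            # 100 GFLOPS, figure quoted in the thread
USB2_BYTES_PER_S = 480e6 / 8  # USB 2.0 signalling rate, overhead ignored

intensity = required_flops_per_byte(PEAK_FLOPS, USB2_BYTES_PER_S)
print(f"{intensity:.0f} FLOPs per byte")
```

Under these assumptions the link only keeps up if the application does well over a thousand floating-point operations for every byte it moves, which is why "depending on the complexity of the application" carries so much weight.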

  6. Barefoot says:

    Great, they’re going to Open Source it! I’ll just go down to my basement clean-room CPU manufacturing lab…. oh, wait…

    I used to think the Open Source movement was targeted at the maker/DIY community. Not so much anymore.

    • Richard says:

      “Open Source: The Parallella platform will be based on free open source development tools and libraries. All board design files will be provided as open source once the Parallella boards are released.”

      Sounds like they’re not opening up the chip itself, just the board it runs on and the software.

    • adapteva says:

      When we say open, we mean open datasheets, arch ref manuals, drivers, SDKs, board design files. This may seem trivial, but most semi vendors today won’t release their latest chip specs without an NDA. As a small startup (5 people), opening up like this exposes us to the big companies, so we feel like it’s a pretty big deal. Still, it’s the right thing to do for us and for everyone else. The only way to get any kind of long term traction is to publish the specs.

  7. Wretch says:

    Ask Samsung to invest in it and they can put it in the Galaxy S5 phone; the world’s first 64-core mobile phone, the first to feature holographic display and keyboard.

  8. jaybee says:

    I always thought the Open Source movement was targeted at the semi-professional/freelance. Nowadays, with some open source stuff getting so complicated, even professionals have a hard time customizing it.

  9. bothersaidpooh says:

    DIY neural net anyone?

    Something like this would be very handy for pattern recognition, such as speech translation.

    Only major problem is that the 64 cores would use a lot of power, you’d need to have them dynamically underclock when unused so that each core runs from an array clock.
    Clock skipping would also work..

  10. zing says:

    Slight correction to the title: the $99 version is only 16 core, the 64 core is $199 and will only be available if they significantly exceed the base funding.

    I’m still in for the 16 core version, but I wish the 64 core was more obtainable. Even at $199, I would still impulse buy it, but it probably has lower yield and substantially higher production costs.

  11. GaspingSpark says:

    You can buy the GreenArrays GA144 today. It has 144 simple cores for $20. But you have to program it in Forth.

  12. rasz says:

    $750,000 wont even get them a tapeout and a spot at TSMC. Are they for real?

    • rasz says:

      Ok, watched kickstarter video, looked at their site.

      They already have a chip ready and working. They want money for scaling up manufacturing. Makes sense.

      Still 64 core version is only 100 GFLOPS (at 2Watts)
      old Radeon 4850 is 1Tflops on paper (more like 300 GFLOPS real performance) at 150W and <$100

      • wallacoloo says:

        2W vs 150W is a HUGE difference. You’re not going to get good battery life on a laptop or tablet PC, etc with a 150W processor. At 2W for 100GFlops, that’s 20 micro-watts per Mflops vs 500 micro-watts for the Radeon 4850’s real performance. So even if it costs more, and doesn’t provide as much yield, it still has its places in low-power applications. But I’m not saying it’s the best solution out there by any means, and there’s a whole lot more to performance than just flops anyways…
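The per-MFLOPS figures above are easy to check. The snippet below just redoes the arithmetic with the numbers quoted in this sub-thread (2 W at 100 GFLOPS, and 150 W at roughly 300 GFLOPS real performance); the function name is made up for illustration.

```python
def microwatts_per_mflops(watts, gflops):
    """Power cost of compute, in microwatts per MFLOPS."""
    return (watts * 1e6) / (gflops * 1e3)


epiphany = microwatts_per_mflops(2, 100)     # 64-core figure quoted above
radeon = microwatts_per_mflops(150, 300)     # Radeon 4850 "real" figure quoted above

print(epiphany, radeon, radeon / epiphany)   # 20.0 500.0 25.0
```

So on the quoted numbers the gap in energy per operation is a factor of 25.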

  13. Dave M says:

    Remember the Connection Machine? I wonder what easily available information survives about the architecture, implementation and application software tools.

    I remember reading Danny Hillis’ book “The Connection Machine” in the mid 1980’s, and found it pretty interesting. I have a hard time now thinking about 64 cores as “massive parallelism”. The CM-1 was basically a coprocessor attachment to a Digital Eqpt Corp VAX.

  14. bty says:

    Guys, if this were feasible, don’t you think the giants would have already tried this or sourced funding for something like this? This is not just about slapping a bunch of cores on a die; there are some really difficult issues like interconnects etc.

    • SavannahLion says:

      While it is difficult, I think the real problem as to why we don’t see more of these designs is a chicken and egg problem. Consumers don’t want to spend extra dough for parallel hardware if there isn’t software that takes advantage of it, software developers need to code for the lowest common hardware that allows them to reach the largest consumer base, and chip manufacturers don’t want to produce massively parallel designs if there’s no demand.

      This is also on top of developer experience. I’ve dealt with multi-CPU devs in the past and they generally say the same thing: it’s got a much steeper learning curve, but once you’re over it, you won’t go back.

      We’ll all get there eventually.

  15. Jim Panse says:

    The problem today is NOT insufficient computing power or not enough CPUs. The problem is not fast enough I/O. What do you want with 64 cores if you don’t get the data to them fast enough?

  16. addidis says:

    bitcoins will never know what hit them .

  17. bluesteelbass says:

    anyone else think of the playstation 3 and xbox 360? those things have multiple cores, and they still don’t utilize them to their full extent. it ends up being not about the computing power, but the instruction set you give the developers to make their programs efficiently utilize the hardware to its full extent…
    and since efficient un-bloated code seems to have gone the way of the dinosaur with our increases in processing power…

    • rasz says:

      ps3 has about 100Gflops DP and IS utilized fully (by few suckers that bought into Sonys “lets make clusters with them .. oh wait no more linux go away” plot)

      x360 has no real processing power (3 in order execution cores + old ati gfx)

      • bluesteelbass says:

        people running linux still have two cores locked down on the ps3…
        even though the ps3 is ‘more powerful’, developers for the games don’t utilize them to the fullest extent… either due to the libraries not being there, they are too lazy to synchronize things running on multiple cores… or they could just not want to develop the same thing several completely different ways.

      • rasz says:

        at this point you can use all spus on few remaining real ps3s with linux

  18. krylenko says:

    Can we please, *please* treat projects in progress as in progress, not as done?

    This computer (which doesn’t exist yet) WILL cost $99 if and when it gets produced and goes on sale at that price point. It does NOT cost $99 right now. And there’s a lot that needs to happen between now and that point.

    I’m seeing this more and more, even outside HaD. Wired just had a piece about a guy who wants to fly to the edge of space in a balloon.

    What’s he done so far? Cobbled together a prototype pressure suit from parts he got on eBay and at Ace.

    What’s left, according to the article? “Just” testing the suit, rebuilding it with quality parts, building a balloon, getting a balloon pilot’s license, and getting FAA approval on his flight plan.

    That guy might well do all that, and this computer might well actually come out at $99.

    It’s great to talk about cool projects in progress and interesting future plans. But everyone and their mother seem to be assuming that if something appears on Kickstarter, it’s basically done.

    We all know different, and HaD should be accurate about it.

    • rasz says:

      to be fair to these guys they do have chips working
      they want money for retooling = to get yields that enable economies of scale
      From the description they dont want to design new chips, they will bundle OMAP or something chinese like Rockchip next to their 16 cores chip and call it a day

  19. polossatik says:

    I really wonder how they hope to compete against GPUs, which are fabricated en masse and are rather cheap.
    Why would their architecture be better/easier to work with?

    • adapteva says:

      We use fabs just like Nvidia does, so as soon as we do full mask set tapeouts our per-chip pricing is not far off from Nvidia. Let’s say we lose out a factor of 2X on price, but our dies are more than twice as efficient as Nvidia in terms of performance per mm^2, so we should be pretty close. Still, we are less than 1000th the size of Nvidia, so selling our mousetrap is a real challenge (as we have found in the last year).

  20. Chris C. says:

    “The Parallella project will make parallel computing accessible to everyone.”

    It already is:

    1) The GreenArrays GA144 (and its predecessors, the MuP21 and F21).
    2) XMOS’ XCore series.
    3) The Parallax Propeller (and Prop II if they ever finish it).
    4) Numerous GPUs.
    5) Relatively few tasks can be massively parallelized. Typically it’s just one or two fixed algorithms for any given project, so FPGAs are an option too.
    6) By no means a complete list.

    Still, I don’t see “everyone” flocking to parallel processing.

    XMOS has been running a monthly design competition for a while now. Not a SINGLE ONE of the winners I’ve seen does anything new, novel, or truly utilizes the multi-core architecture in more than a token way. They’re all just ports of existing projects that have already been done on a single-core MCU. The competition is a joke, merely an easy way to win a prize.

    Will the Parallella somehow change this? If so, why? Will it enable something specific, that existing options won’t? Ask yourself those questions.

    Adapteva should have asked, and been able to answer those questions, before embarking on this project; and especially before seeking Kickstarter funding. So what is their answer?

    “If we can pull this off, who knows what kind of breakthrough applications could arise.”

    Even they don’t know.

    So to be brutally honest: it’s a redundant solution, looking for a problem, looking for funding.

    • adapteva says:

      Chris,
      Thanks for the thoughtful comments! I hope you believe me when I tell you that I have been struggling with this question of parallel programming for 4 years now and I don’t have a great answer (but I don’t think anyone else does either). The idea (maybe it’s naive?) is that if we put this platform in a lot of different universities for close to nothing, then at least it could be used as a tool for quickly teaching all the current methods. We don’t see this happening without access to cheap and orthogonal hardware.

      I am familiar with the examples you mentioned, and I do think there are some differences:

      GreenArrays GA144–>people didn’t want forth

      XMOS–>great effort but not well known,
      no floating point, not high enough performance.
      (please correct me if I am wrong)

      Parallax–>not modern enough.

      GPUs–>not general purpose enough, not really
      ANSI-C programmable. Constrains
      programming model too much.

      FPGAs–>not really software programmable

      We do feel that the Epiphany would serve as a better experimentation platform and teaching platform for parallel programming. We already support C/C++/OpenCL and we have people interested in porting openMP and MPI(lite). Halmstad U in Sweden is even playing around with Occam.

      The future is parallel, and nobody has really figured out the parallel programming model. I must have heard the question “how are we going to program this thing” over 100 times in the last 4 years, and my answer was never good enough (we did well in places where nobody asked the question, like in HPC and military). Without broad parallel programming adoption our architecture will never survive, so we obviously have some self serving interest in trying to provide a platform for people to do parallel programming on.

      Andreas

      • Chris C. says:

        Thanks for the detailed and honest reply. I’m taking this project more seriously now.

        I also mostly agree with your comments on the alternatives; except the GA144, which I have some thoughts on.

        You’re right that no one likes programming exclusively in Forth, with the exception of a few that I’d say have a masochistic streak. But consider that few program exclusively in the native language of *any* CPU. Most use high-level languages, compiled either to the native language, or to p-code which is often executed by a stack-based virtual machine.

        And if C++ (or any other language) can be compiled to run on a stack-based virtual machine, it can also run on a stack-based real machine, with much better performance. Admittedly not quite as fast as on a machine with a C-optimized instruction set, but having massive parallelism at your disposal could potentially make up for it in practice.

        From a business perspective, GreenArrays should have already pursued this. I can’t say I’m too surprised they haven’t, as the CEO is the inventor of Forth, and he might be loath to make other languages accessible on his baby. But I’m *extremely* surprised that no third party has seen past the typical recoil response to “Forth”, and opened up the potential of this chip by making it more accessible.

        Being strictly an issue of compiler design, this is a possible example of a more direct approach to fulfilling your stated goal, of making parallel computing available to the masses. The path you’re taking involves significantly more expense and risk, the magnitude of which I doubt the average Kickstarter investor realizes. I’d really hate to see something like this fail, for their sake and yours. Still, I don’t mean to discourage trying outright, without which there is no possibility of success. So to that end, I do wish you luck, success, and ultimately proving myself and other doubters wrong!

      • Dave M says:

        Have you seriously looked at the Thinking Machines (aka Connection Machine) information? They were building massively parallel supercomputers in the 1980’s (CM-1 had 64K simple SIMD nodes). A few minutes with google got me these links:

        The Connection Machine (pdf)

        Data Parallel Algorithms (pdf)

        I also found online manuals for the Thinking Machines parallelized languages (e.g. CM-Lisp, CM Fortran, C*, etc.).

      • Alex Cole says:

        Loads of people have figured out the parallel programming model, they just get ignored by people banging on about how hard threads and locks are to use and thus they require more work.

        I would direct your attention to CSP – Communicating Sequential Processes, the underlying process calculus behind the Occam language that was used to program the Transputer, a massively parallel architecture in the 1980s. This calculus was developed and PROVEN (mathematically) by Tony Hoare, and can be used to develop highly parallel programs that can be proven to never deadlock or livelock, and never have race conditions.

        That sounds pretty “solved” to me and I can put you in touch with a number of experts in this field, who regularly program hugely parallel code every day with no issues what so ever. In fact, because it is SO easy, they often model what many people may write three functions for as three separate processes running in parallel. Not just people, but research groups and conferences dedicated to the subject. Heck, they’ve stuck parallel architectures on the lego mindstorms kit – that’s pretty mainstream!

        There are also other similar architectures, but I’m less familiar with them.
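As a rough illustration of the style described above, where what might be three functions becomes three processes joined by channels, here is a minimal sketch. It is an approximation only: Python threads and `queue.Queue` stand in for Occam processes and channels, the queues are buffered (size 1) rather than true synchronous rendezvous, and nothing here gives the formal deadlock-freedom guarantees of CSP. All names are made up for the example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def producer(out_ch, n):
    """First process: emit the integers 0..n-1, then signal completion."""
    for i in range(n):
        out_ch.put(i)
    out_ch.put(DONE)


def squarer(in_ch, out_ch):
    """Middle process: square each value it receives and pass it on."""
    while True:
        item = in_ch.get()
        if item is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(item * item)


def collector(in_ch, results):
    """Last process: gather values until the stream ends."""
    while True:
        item = in_ch.get()
        if item is DONE:
            return
        results.append(item)


def run_pipeline(n):
    """Wire the three processes together with two channels and run them."""
    a, b = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
    results = []
    procs = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=squarer, args=(a, b)),
        threading.Thread(target=collector, args=(b, results)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return results


print(run_pipeline(5))
```

No process touches another's state directly; the only shared objects are the channels, which is the property that makes this style easier to reason about than threads and locks.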

    • rasz says:

      >So to be brutally honest: it’s a redundant
      >solution, looking for a problem, looking for
      >funding.

      same old same old. I think “If you build it, he will come” will work just fine. Look what happened with GPUs. People didnt know how to use it for GPGPU and now every university has a cluster of those, and a ton of software has at least plugins (photoshop for example)

      One thing is for sure – Future is parallel.

      • Alex Cole says:

        I can tell you for a fact that the cause and effect here is entirely the opposite to what this project is talking about. People were starting to use GPUs for GPGPU long before CUDA/OpenCL/DirectCompute were developed, hacking low-level graphics libraries to get it done on the available hardware. That’s a case of the hardware following the market, not new hardware trying to invent a market.

  21. Kevin Keith says:

    I have no idea where they are in terms of producing this thing. Their website only mentions the two 16 and 64 core iterations of the device that are mentioned in the kickstarter. Both have pictures of some device, but I can’t tell whether the picture is real or not. Even if it were, it could very easily just be a demonstration model without anything inside.

    However, let’s look at the cost of bringing a chip to market. The first expense is the photomasks. These need to be very durable and precise. As far as I know, these are patterned onto quartz glass using e-beam lithography (very expensive). These can easily cost upwards of a million dollars, depending on the complexity, for the entire set (you need one for each layer, and something like this will definitely have a lot of layers). They could be using something like MOSIS, which brings the cost down by grouping together several different chips from other clients. The obvious problem here is throughput; it’s designed for prototyping. If they have any real silicon, it’s probably something like that.

    In terms of actual manufacture, you basically have two options: FPGA transfer, and cell-based. I’m not sure how the former works exactly, but what I do know is that the performance is not nearly as good as a cell-based chip. Cell-based use standard cells and let you lay out each gate.

    Moreover, there is the cost of validation and software. Software is self explanatory. Validation is to make sure the designers didn’t screw up. If you mess up, everything you’ve done is worthless.

    Here’s why I remain skeptical. If they managed to get enough investment to get the photomasks (supposing they have them), then why are they turning to Kickstarter to get funding? They’re a business venture; I personally wouldn’t feel comfortable “donating” to something like this, as opposed to a non-profit. On the other hand, if they don’t have the masks made yet, then 750k is nowhere near enough money. Even besides those problems, how are they going to get anyone to make this for them? Something which doesn’t require as high a quantity, like MOSIS, is very expensive per chip. Whereas if they were to approach TSMC they’d probably be laughed at.

    Now, if they were able to actually make the chip, I still don’t think it’s a good investment. As others have mentioned, you can get chips with much higher performance, for much less, with the added advantage of not being a first generation adopter. Another problem is that the array uses some sort of custom RISC. They plan to open source the tools, but this does not mean they will not be highly immature. The cynic in me also thinks that open sourcing the tools could very easily be an excuse for their lacking in quality and/or a way to harvest free labor.

    • adapteva says:

      Kevin,

      I appreciate the skepticism. Here’s some more info on the technology.

      -We use Global Foundries (AMD spinoff) as a manufacturer and have direct access to the fab (no middle man)

      -We have done 4 chips so far, 3 in 65nm and one in 28nm. The first two were prototypes and the last two are real products. The last two are working perfectly. The 65nm version has been in the field since last May and tested by partners like BittWare, Brown Deer Technology, Embecosm, Northeastern,Bristol, Halmstad as well as many big OEMs that we can’t mention. Those guys can all testify to our chip and our tool quality, and some of those testimonials are up on our website.

      The Parallella project is about reducing the manufacturing cost of our chips and creating a small PCB board around these chips so that we can get them in the right hands for $99. Not a trivial task, but certainly not as risky as developing a whole new architecture from scratch. That work is done.

      Let me know what you would like to see from us to convince you that this is for real?

      Andreas

  22. dave says:

    I can’t get over the part where there’s only 1 gig of ram. Isn’t that a bit weak to be supporting 64 cores with?

    • Greenaum says:

      Many graphics cards have had more processors with less RAM. It’s not like each chip runs an operating system, just small blocks of code to do some calculation or other. A few K each would be fine.

      As others have said, it’s the interconnect that’s the bottleneck, getting inputs and results into and out of the chips. Fabric architectures for things like this are still state-of-the-art, there’s established methods available, but plenty of research still to be done.

  23. utomo says:

    Competition is a great way to make better technology.
    I hope somebody funds this project and releases the products to the public.

    Don’t just limit it to small solutions;
    also plan for normal and big server versions, with standard Linux and also NGINX.

  24. adapteva says:

    I don’t want to butt in on the conversation, but there was just too much speculation for me to resist. I am here if anyone wants to ask me something directly. Would really appreciate to hear what else we need to show off to overcome the skepticism…

    Andreas Olofsson
    Founder @Adapteva

  25. sillyness says:

    As has been stated, I don’t see this as being a rival for CUDA GPU multiprocessing. In fact this is over half a decade out of date before release. If they could find a way to integrate a sizable amount of L1 cache into every one of those cores along with a healthy L2 cache just to handle the massively wasteful overhead of running many self contained cores.

    Look here:

    http://www.adapteva.com/products/silicon-devices/e64g401/

    It appears to be a great deal of jargon appendage without understanding the implications.

    “The CPU has an efficient general-purpose instruction set that excels at compute intensive applications while being efficiently programmable in C/C++ without any need to write code using assembly or processor specific intrinsics.”

    You can’t have it all. Otherwise there would be no need for all those OS design classes many of us have taken.

    “The Epiphany memory architecture is based on a flat memory map in which each compute node has a small amount of local memory as a unique addressable slice of the total 32-bit address space.”

    This is a really overhead intensive idea. It’s the CPU equivalent of storing every one of your OS files on a different cloud server and waiting for it to boot from the network.
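To make the quoted "flat memory map" description concrete, here is a small sketch of how a single 32-bit address could name both a core and a location in that core's local memory. The bit split used here (6-bit row, 6-bit column, 20-bit offset) is an assumption chosen for illustration; the quoted product text only says that each node owns a unique slice of the 32-bit space. The function names are likewise made up.

```python
ROW_BITS = 6      # assumption for illustration
COL_BITS = 6      # assumption for illustration
OFFSET_BITS = 20  # remaining bits of the 32-bit address


def global_address(row, col, offset):
    """Build the flat 32-bit address of `offset` in core (row, col)'s local memory."""
    if not (0 <= row < 1 << ROW_BITS and 0 <= col < 1 << COL_BITS):
        raise ValueError("core coordinates out of range")
    if not 0 <= offset < 1 << OFFSET_BITS:
        raise ValueError("offset does not fit in a core's slice")
    return (row << (COL_BITS + OFFSET_BITS)) | (col << OFFSET_BITS) | offset


def decode(addr):
    """Split a flat address back into (row, col, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    col = (addr >> OFFSET_BITS) & ((1 << COL_BITS) - 1)
    row = (addr >> (COL_BITS + OFFSET_BITS)) & ((1 << ROW_BITS) - 1)
    return row, col, offset


addr = global_address(1, 2, 0x10)
print(hex(addr), decode(addr))
```

Under a scheme like this, working out which core owns an address is a matter of slicing bits; whether reaching a remote core's slice is expensive then depends on the on-chip network rather than on the address map itself.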

    Finally the last piece of evidence, the damning one:

    “100 GFLOPS Sustained Performance”

    That is the equivalent of a chilled-water cooled i7 overclocked to 5 Ghz and consuming more energy than an average first world home. How can they say with such nonchalance that they can beat a top of the line consumer processor at a fraction of the cost and less than 1% of the power draw?

    Lies.

    That’s how.

    • twdarkflame says:

      why not post this directly to them (see the post above yours) and give them a chance to respond?

    • bob says:

      I don’t know if 100 gflops at 2 watts is believable but i7 is not the part to compare it to, since the i7 is mostly a massive OOO engine with a lot of cache, etc. The comparison with an nvidia gpu is more relevant since the gpu is mostly arithmetic. Also relevant would be comparison with current floating-point DSP’s, say from TI.

      Adapteva, how do you plan to be 2x as efficient per mm**2 than Nvidia? Those guys are not fools and I doubt they want to be wasting silicon, and they get an efficiency win (in the SIMD parallelizeable case) by not having all those separate instruction decoders, etc.

      Do you have any benchmarks or simulations of real workloads?

      I notice nobody has mentioned Tilera, which makes a 64-core MIPS array used in stuff like network routers. I think someone has mentioned the Xeon Phi (Knights Corner), Intel’s 64-core x86 announced recently.

      It’s not real clear that anyone wants a GPGPU in a phone. They want specific applications like video codecs, and the GPU hardware evolves around the requirements of those applications.

      For workstations and supercomputers, people want max performance without melting down. Instead of 100 gflops at 2 watts, they want 5000 gflops at 100 watts.
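
      The two design points mentioned above can be compared on energy efficiency with a line of arithmetic. This is only a sketch using the figures from the comment; the function name is hypothetical.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency as throughput divided by power draw."""
    return gflops / watts


embedded = gflops_per_watt(100.0, 2.0)       # 100 gflops at 2 watts
workstation = gflops_per_watt(5000.0, 100.0) # 5000 gflops at 100 watts
print(embedded, workstation)
```

      On these numbers both points work out to 50 GFLOPS per watt, so the difference between them is absolute throughput and power budget rather than efficiency.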

      The 64 core chip is maybe interesting for internet packet filtering, but again, no floating point is needed, and the board should have 10 gigabit ethernet instead of 1 gig.

      • adapteva says:

        Bob, I totally agree with your market analysis. Spot on. The Parallella project is not trying to address any of those markets; it’s just meant to be an open and cheap platform that can get people going easily with different parallel programming models.

        >Adapteva, how do you plan to be 2x as efficient per mm**2 than Nvidia? Those guys are not fools and I doubt they want to be wasting silicon, and they get an efficiency win (in the SIMD parallelizeable case) by not having all those separate instruction decoders, etc.

        We actually use a lot less silicon area than both Nvidia and TI in terms of raw GFLOPS/mm^2, and I think our story gets even more interesting when you look at achievable GFLOPS per mm^2 on real applications based on products that we are sampling (measured data on our web site). I can’t speak for Nvidia or TI engineers, but from my experience working at Analog Devices and TI for over 10 years I found that building products that have to cover multiple application spaces is a bad idea. GPGPUs have to keep their graphics customers AND HPC customers happy. OpenGL and high performance computing are very different beasts, so there are going to be a lot of wasted transistors trying to cover both. Sure, they gain some advantage in being SIMD (or SIMT), but not as much as you would think. With a small RISC engine, the overhead of the sequencing and decoding is really not that big. The false assumption that SIMD is much more efficient (as in 10x) than MIMD was one of the reasons I started Adapteva in 2008.

        >>Do you have any benchmarks or simulations of real workloads?
        We have code/benchmarks for filters, FFTs, BLAS, Coremark, face detection benchmarks/demos. Until we finished our OpenCL compiler a couple of months ago we were struggling with benchmarks b/c the programming took too long. We are moving faster now, but we are still very resource constrained (5 person company). Benchmarks have always been a pita for semi companies, with a lot of ugly tricks being used (and sources rarely shared). We want to put out our platform openly to let people measure our performance on their own (hopefully it won’t disappoint).

  26. Stu says:

    Yes, HaD please correct your article. Nowhere does it say you get 64 cores for $99. That’s a sensationalist title and you know it!
    I find it interesting that you have to dig a little bit deeper than the Kickstarter to find what you actually get for your $99. Obfuscation anybody?

  27. Pardon my ignorance on parallel computing, but I’m curious.

    Parallel computing seems mostly to be used for very specific calculations, weather modeling and huge databases requiring specialized applications, so what would the use of a consumer-level/priced parallel processor unit be?

    If the idea is to get the technology into the hands of eager engineers and programmers, that seems laudable (although it sounds as if it’s already available through GPGPU), but I’m mostly curious if it could be of any use in rendering two hours of HD video faster out of Premiere…

  28. David says:

    Wouldn’t it have been easier to use an existing instruction set so you didn’t have to start from scratch? Is it just that you didn’t want to pay for an ARM license?

    • adapteva says:

      It would definitely have been much easier to use the ARM instruction set, but then we wouldn’t have been able to tune the processor for multicore, which was a must. The size of our core is significantly smaller than the ARM core, which leads to much better energy efficiency.

  29. quert says:

    They should make a PCIe version

  30. willrandship says:

    Why would you use a multiple core CPU in one of these? Wouldn’t it be more efficient to have several single core cpus ganging together, and only have one layer of communication, rather than each cell having sub-cells?

  31. kyre says:

    There is a data propagation problem.
    Low bandwidth at best, with asymmetric latency.
    A data ring is way better and cheaper.

  32. rasz says:

    And they finally released documentation.
    SINGLE PRECISION only, which means my $10 Celeron is faster than the 16-core chip.

    • adapteva says:

      We are not trying to compare Intel’s costs to ours. Obviously they have a huge operational advantage. They have 100,000 employees and $50B in revenue. Adapteva has less than 5 employees and about $1M in revenue. I think if you look at different metrics for performance per mm^2 at equivalent technology nodes, you’ll see that the Epiphany architecture has a >10x advantage. (that’s what really matters).

  33. 21valy says:

    Hi

    adapteva, I’ve spent hours trying to understand what you do.

    Really seems great to me:
    – inexpensive (unlike Tilera64, I know it personally ; by the way GA144 needs a 450$ board + a 10-pack CPU * 20$…)
    – much more “green” than GPU’s way (a shame IMHO)
    – simple (C ! no FPGA/VHDL/Verilog !)
    – futuristic
    – David vs Goliath
    I really, really love that.
    Quite a pity to be so much criticized with such a great passion for what you do.
    I am impressed by the achievement and I say one word: respect.

    Now with my questions.

    1°) I use a quite expensive 8-core CPU on a daily basis. With (X)ubuntu 11.04. And gcc 4.7. Now I want your Epiphany board to be a host coprocessor. Will it be the case, plug&play-like ? with a simple USB cable or Ethernet or both ?
    Or should I wait for a PCI Express-based board ?

    2°) I’ve written some code for GPU. OpenCL, for fun. Interconnection between my CPU/GPU was really disappointing. Performance, too, with a simplistic kernel though.
    How well will your credit card-sized board communicate with the native CPU ? mine, not ARM’s.

    FPGA seems to rock, except you have to pay thousands of dollars to get boards with really good transfer rates between the FPGA and the DDR3 interface.
    So I really need to know if and how I can transfer data quickly from my computer’s memory/CPU to my Epiphany board.
    In my dreams I’d like to combine the best of the worlds: a general-purpose CPU with cheap and huge memory, together with a cheap and green multicore coprocessor – with boolean operations welcome.

    3°) is there an Epiphany developer forum somewhere ? with real people writing some real code for real mainstream computers ? like OpenCL or any other funny coding experience I need some people to show me the way.

    Best regards,
    21valy

    • adapteva says:

      21valy,
      Thanks for being supportive!

      In terms of questions:

      1°) The Parallella board is a full computer with a GigE connection, so if you can tolerate the latencies, maybe you could offload to the board through the network. Alternatively, there is a USB 2.0 connection on the board that can be plugged in to the host.

      2°) Getting high bandwidth with low power is always a challenge. The Parallella board has 1.4GB/s of bandwidth between the ARM host chip and the Epiphany coprocessor chip.

      3°) We’ll start a forum at parallella.org once the project is funded.

      Cheers,
      Andreas
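
      To put the links mentioned in this reply side by side, here is a back-of-the-envelope sketch of how long a bulk transfer would take over each one. The 1.4 GB/s figure is from the reply; the Gigabit Ethernet and USB 2.0 rates are the nominal signalling rates (1 Gbit/s and 480 Mbit/s), and ignoring protocol overhead and latency is an assumption, so real transfers will be slower. The dictionary keys and function name are hypothetical.

```python
# Peak link rates in bytes per second (nominal; overhead and latency ignored).
LINK_BYTES_PER_SEC = {
    "arm_to_epiphany": 1.4e9,      # 1.4 GB/s, from the reply above
    "gigabit_ethernet": 1e9 / 8,   # 1 Gbit/s nominal
    "usb_2_0": 480e6 / 8,          # 480 Mbit/s nominal
}


def transfer_seconds(num_bytes: float, link: str) -> float:
    """Best-case time to move num_bytes over the named link."""
    return num_bytes / LINK_BYTES_PER_SEC[link]


payload = 100e6  # a 100 MB buffer, chosen only as an example size
for name in LINK_BYTES_PER_SEC:
    print(f"{name}: {transfer_seconds(payload, name):.3f} s")
```

      Even in this best case, the path from an external host to the board is roughly an order of magnitude slower than the on-board ARM-to-Epiphany link, which is why the latency caveat in answer 1 matters.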

  34. Ravi C says:

    it’s all about the interconnect!

    as you scale up the number of cores/processors, with a constant workload the per core work decreases, i.e. it gets split up into smaller and smaller units. the limiting factor is then how fast can the work be pushed out to the nodes and their results collected at the end – the communication channels.

    if these cores are general purpose enough then this is really the ideal network/switch configuration (maybe not being pushed as much as it could be).
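
    The argument above can be illustrated with a toy cost model. Everything in it is an assumption made for illustration: compute time is taken to split perfectly across cores, and pushing work out and collecting results is taken to cost a fixed amount per core, handled one core at a time. The function names and numbers are hypothetical.

```python
def run_time(work: float, cores: int, comm_per_core: float) -> float:
    """Total time: perfectly divided compute plus serial distribute/collect cost."""
    return work / cores + comm_per_core * cores


def best_core_count(work: float, comm_per_core: float, max_cores: int) -> int:
    """Core count (1..max_cores) that minimises run_time under this model."""
    return min(range(1, max_cores + 1),
               key=lambda n: run_time(work, n, comm_per_core))


work, comm = 1600.0, 1.0
for n in (1, 16, 40, 64):
    print(n, run_time(work, n, comm))
print("best:", best_core_count(work, comm, 64))
```

    With these made-up numbers, the fixed workload runs fastest on 40 cores and gets slower again at 64, because the communication term overtakes the shrinking per-core work. A faster interconnect (a smaller comm_per_core) moves that turning point outward.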

  35. AJ says:

    Fully funded. Needed $750K currently at $811K with 7 Hours to go! Way to go!

  36. xcore fan says:

    http://www.adapteva.com/parallella-kickstarter/update-44-today-was-a-good-day/

    Update #44: Today was a good day…
    This entry was posted in Parallella Kickstarter Blog on December 12, 2013 by Andreas Olofsson.

    The initial batch of the “hopefully final” Parallella boards came back today and they work great! No smoke on power-up, everything that worked on the Gen1 board still works, and….

    I really wish Andreas would just pick up the phone, call David May of xcore, and insist they work together with his team of young execs on interconnected xcore slices to increase interconnects and data throughput. Above all else, look to the near future and for god’s sake put some actual current IO on each slice/board, such as the current USB 3.1 with real 10 Gbit/s IO, which you can both utilise on the cheap and use to get a generic virtual 10 Gbit Ethernet as the primary data-sharing link for a generic fire-and-forget, incremental, massively parallel SOHO P2P network device ASAP. And will some 3rd party PLEASE make a slice that has a 10GbE IP/chip as standard for cheap, so we end users can finally get away from the antiquated Ethernet vendors that refuse to provide even this old 10 gigabit interconnect for a reasonable price.

  37. AJ ONeal says:

    I just got mine, but I don’t have the time for another hobby project, so I’ve put her up for sale. Any takers?

    http://www.ebay.com/itm/Adapteva-Parallella-16-core-/321390051726?pt=LH_DefaultDomain_0&hash=item4ad457018e
