Hugging Face and ServiceNow partnered to develop StarCoder, an open-source language model for code. Under the BigCode initiative, the team created StarCoder as an improved version of the StarCoderBase model, fine-tuned on an additional 35 billion Python tokens. StarCoder is a free AI code-generation system that serves as an alternative to GitHub's Copilot, DeepMind's AlphaCode, and Amazon's CodeWhisperer.
StarCoder
StarCoder was trained on over 80 programming languages and on text from GitHub repositories, including documentation and Jupyter notebooks. It was trained on more than 1 trillion tokens, has a context window of 8,192 tokens, and has 15.5 billion parameters. On code benchmarks it outperformed larger models such as PaLM, LaMDA, and LLaMA, and proved on par with or better than closed models such as OpenAI's code-cushman-001.
Because it is open source, the community can help improve it and integrate custom models, notes Leandro von Werra, one of the co-leads on StarCoder. While StarCoder may not offer as many features as GitHub Copilot, community contributions can enhance its capabilities over time.
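Since the weights are openly published on the Hugging Face Hub, the model can be tried locally with the transformers library. The sketch below is a minimal example, assuming the bigcode/starcoder checkpoint (which requires accepting its license on the Hub) and enough GPU memory for a 15.5-billion-parameter model:

```python
# Minimal sketch: load StarCoder from the Hugging Face Hub and complete a snippet.
# Assumes the bigcode/starcoder checkpoint (gated; accept its license on the Hub
# and log in with `huggingface-cli login` first) and the accelerate package
# installed so device_map="auto" can place the weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs.input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```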
The StarCoder LLM is trained on code from GitHub, so it may not be the best fit for conversational requests such as writing a function that computes the square root. When prompted appropriately, however, the model can serve as a helpful technical assistant. The model's Fill-in-the-Middle capability uses special tokens to mark the prefix, suffix, and middle of the input, so it can complete code in the middle of a file rather than only at the end. Although the pretraining dataset includes only permissively licensed content, the model can still reproduce source code from that dataset word for word, so it is important to follow the code's license requirements regarding attribution and other guidelines.
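As an illustration of the Fill-in-the-Middle format, the sketch below reuses the tokenizer and model loaded above and assumes StarCoder's published sentinel tokens (<fim_prefix>, <fim_suffix>, <fim_middle>); the model is shown the code before and after a gap and generates the missing middle:

```python
# Sketch of a Fill-in-the-Middle prompt, assuming StarCoder's sentinel tokens
# <fim_prefix>, <fim_suffix>, and <fim_middle>. The model sees the code before
# and after the gap and is asked to generate the missing middle.
prefix = "def average(numbers):\n    "
suffix = "\n    return total / len(numbers)\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs.input_ids, max_new_tokens=32)
# Tokens generated after the prompt are the model's proposal for the middle.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:]))
```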
A new VS Code plugin complements the development workflow by letting developers interact with StarCoder directly from the editor.
Users can press CTRL+ESC to check if the current code was included in the pretraining dataset.
Like other LLMs, StarCoder has limitations and may generate incorrect or inappropriate content. It is released under the OpenRAIL-M license, which imposes legally binding restrictions on its use and modification. Researchers evaluated its coding capabilities and natural language understanding on English-only benchmarks; to broaden the applicability of such models, further research is needed to understand their effectiveness and limitations in other natural languages.
AI-powered coding tools can significantly reduce development costs while freeing developers to focus on more creative work. According to research from the University of Cambridge, engineers spend at least half of their time debugging rather than actively developing, at an estimated annual cost of $312 billion to the software industry.